tarsum: checksum utility for tar files

description

tarsum generates and validates checksums for files encapsulated within common archive file formats. It supports a format specification grammar based on the BSD stat(1) utility for specifying the checklist and reporting formats. This makes it easy to interoperate with different checksum utilities (e.g. GNU coreutils md5sum or OpenBSD sha256), which utilize different line formats. This also effectively provides a way to programmatically query file metadata from a shell without expanding the archive.
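As a rough illustration of the idea (not tarsum's actual implementation, which is written in C), the following Python sketch streams a tar archive and prints one line per regular file in the default "%C  %N" layout. The helper names `tar_digests` and `build_demo_archive`, and the demo file contents, are made up for this example.

```python
import hashlib
import io
import tarfile


def tar_digests(fileobj, algorithm="sha256"):
    """Yield (hex digest, member name) for each regular file in a tar stream."""
    # Mode "r|*" reads the archive strictly sequentially, so nothing is
    # unpacked to disk and non-seekable inputs work too.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            digest = hashlib.new(algorithm)
            data = archive.extractfile(member)
            for chunk in iter(lambda: data.read(65536), b""):
                digest.update(chunk)
            yield digest.hexdigest(), member.name


def build_demo_archive():
    """Build a small in-memory tar archive (hypothetical demo contents)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in (("docs/a.txt", b"hello\n"), ("docs/b.txt", b"")):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Same layout as the default format: digest, two spaces, file name.
    for checksum, name in tar_digests(build_demo_archive()):
        print(f"{checksum}  {name}")
```

The two-space layout is the line format GNU coreutils sha256sum uses, which is why the default format interoperates with it.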

Why? Because for various reasons I prefer tar for archiving data. But examining tar archives can be slow, especially on media like Blu-ray discs. In most cases what I'm attempting to do is compare archived content against manifest files of the source file trees generated by common checksum utilities, and to do so in a way that is easily scriptable.

news

2022-04-07

Change checklist report status word "MISSED" to "MISSING", matching the OpenBSD sha256(1) spelling as intended when checklist verification was originally committed two days ago. Other utilities (e.g. GNU coreutils sha256sum, Perl shasum) seem to report missing files as failures ("FAILED").

2022-04-06

Add -R verification report format specification option.

2022-04-05

Add -C checklist verification option.

2019-08-16

Project page.

todo

Support POSIX pax listopt format strings in addition to (or instead of) BSD stat(1) format strings. I didn't realize pax had such a feature until after I'd written the existing format parser. I have half a mind to simply submit patches to various pax implementations to support SHA-2 and SHA-3 checksums, though at least for BSD pax I'd also need to submit support for listopt itself, which is lacking.

Implement tarmap, a utility to index tar files and compressed tarballs for random access and file extraction. Compression formats are block based, so an index can simply map block boundary offsets to the position in the uncompressed stream that begins at that block. tarsum could either use a map directly or build one internally when generating manifests; offset mappings could be printed with new format specifiers. Ideally the generated mappings would be simple enough to be parsed by shell scripts and used to extract individual files using regular utilities like dd(1), gzip(1), xz(1), etc., without having to build or install tarsum or tarmap.
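To make the indexing idea concrete, here is a minimal Python sketch that covers only the simplest case, an uncompressed tar file. tarmap does not exist yet, so nothing here reflects its design; `index_tar`, `read_member` and the demo archive are hypothetical names and data, and compressed block mapping is not attempted.

```python
import io
import tarfile


def index_tar(fileobj):
    """Map each regular file name to (data offset, size) in an uncompressed tar."""
    index = {}
    with tarfile.open(fileobj=fileobj, mode="r:") as archive:
        for member in archive:
            if member.isfile():
                # offset_data is where the member's contents begin.
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(fileobj, index, name):
    """Fetch one file's bytes with a plain seek and read, no tar parsing."""
    offset, size = index[name]
    fileobj.seek(offset)
    return fileobj.read(size)


def build_demo_archive():
    """Build a small in-memory tar archive (hypothetical demo contents)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in (("docs/a.txt", b"hello\n"), ("docs/b.txt", b"world\n")):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = build_demo_archive()
    mapping = index_tar(demo)
    # One "name offset size" line per file, trivial to parse from a shell.
    for name, (offset, size) in mapping.items():
        print(name, offset, size)
    print(read_member(demo, mapping, "docs/b.txt"))
```

With such a mapping in hand, a shell script could do the equivalent of `read_member` with something like `dd if=archive.tar bs=1 skip=OFFSET count=SIZE`, where OFFSET and SIZE come from the printed lines.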

usage

Usage: tarsum [-0a:C:f:R:s:t:h] [TARFILE-PATH]
  -0          use NUL (\0) as default record separator
  -a DIGEST   digest algorithm (default: "SHA256")
  -C PATH     checklist for verification of archive contents
  -f FORMAT   format specification (default: "%C  %N%$")
  -R FORMAT   verification report format (default: "%N: %R%$" )
  -s SUBEXPR  path substitution expression (see BSD tar(1) -s)
  -t TIMEFMT  strftime format specification (default: "%a %b %e %X %Y")
  -h          print this usage message

FORMAT (see printf(1) and BSD stat(1))
  \NNN  octal escape sequence
  \xNN  hexadecimal escape sequence
  \L    C escape sequence (\\, \a, \b, \f, \n, \r, \t, \v)
  %%    percent literal
  %$    record separator (e.g. \n or \0)
  %A    digest name
  %C    file digest
  %N    file name (full path)
  %R    verification status (OK, FAILED, MISSING)
  %T    file type (ls -L suffix character; use %HT for long name, %MT for single letter)
  %g    GID or group name (%Sg)
  %m    last modification time (%Sm: strftime formatting)
  %o    file offset (%Ho: header record, %Lo: end of last file record)
  %u    UID or user name (%Su)
  %z    file size (%Hz: header record(s), %Lz: header and file records)

Report bugs to <william@25thandClement.com>
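The checklist verification described by -C and -R above can be pictured with a short Python sketch. This is only an approximation under stated assumptions: the checklist uses the default "%C  %N" layout, the report uses the default "%N: %R" layout, and the `verify` function and sample data are invented for illustration rather than taken from tarsum.c.

```python
import hashlib


def verify(checklist_lines, archive_digests):
    """Yield (name, status) using the status words OK, FAILED and MISSING."""
    for line in checklist_lines:
        # Assumed "%C  %N" layout: digest, two spaces, file name.
        expected, _, name = line.rstrip("\n").partition("  ")
        actual = archive_digests.get(name)
        if actual is None:
            status = "MISSING"  # listed in the checklist, absent from the archive
        elif actual == expected.lower():
            status = "OK"
        else:
            status = "FAILED"
        yield name, status


if __name__ == "__main__":
    # Hypothetical digests as they might be computed from an archive.
    archive = {
        "docs/a.txt": hashlib.sha256(b"hello\n").hexdigest(),
        "docs/b.txt": hashlib.sha256(b"changed\n").hexdigest(),
    }
    checklist = [
        hashlib.sha256(b"hello\n").hexdigest() + "  docs/a.txt\n",
        hashlib.sha256(b"original\n").hexdigest() + "  docs/b.txt\n",
        hashlib.sha256(b"").hexdigest() + "  docs/c.txt\n",
    ]
    for name, status in verify(checklist, archive):
        print(f"{name}: {status}")
```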

-s behaves identically to the BSD tar(1) -s option. See also GNU tar(1) --transform, which has the same semantics but requires the leading `s` operator that is implicit in the BSD option.

-t time format specifications can cause checklist verification headaches if string timestamps (e.g. %Sm) aren't fixed length. To load checklists, the format specification is translated to a regular expression. For timestamp components, tarsum attempts to determine the minimum and maximum length of the formatted time without grokking the internal elements of the format and emits a simple `.{M,N}` subexpression, which increases the potential for conflict with the wildcard subexpression that captures the file name. Fortunately, timestamps aren't included in any common checklist formats, and at least for the en_US locale the default time format is fixed length.
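The length problem is easier to see with a small experiment. The Python sketch below estimates length bounds by sampling timestamps across a year, which is merely one plausible way to avoid interpreting the format and is not necessarily how tarsum does it. `length_bounds` and `time_subexpression` are hypothetical names, and the results depend on the platform's strftime and the active locale.

```python
import re
from datetime import datetime, timedelta


def length_bounds(timefmt, year=2022):
    """Estimate (min, max) rendered length of a strftime format by sampling."""
    lengths = set()
    moment = datetime(year, 1, 1)
    while moment.year == year:
        lengths.add(len(moment.strftime(timefmt)))
        # A 7 hour step visits every day and a spread of hours.
        moment += timedelta(hours=7)
    return min(lengths), max(lengths)


def time_subexpression(timefmt):
    """Return a `.{M,N}` regular expression fragment for the format."""
    low, high = length_bounds(timefmt)
    return ".{%d,%d}" % (low, high)


if __name__ == "__main__":
    # A fixed-length format yields M == N, so the match is unambiguous.
    print(time_subexpression("%Y-%m-%d"))

    # A variable-length format (full month name) leaves room for the
    # timestamp capture to swallow the start of the file name.
    pattern = re.compile("(" + time_subexpression("%B") + ") (.+)")
    print(pattern.fullmatch("May 1 notes.txt").groups())
```

In the second case the intended split is the timestamp "May" followed by a file named "1 notes.txt", but a greedy match can just as well take "May 1" as the timestamp. This is the kind of conflict a fixed-length time format avoids.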

contributions

I originally threw together tarsum.c in a few hours. Some parts are organized as if they were to be built as a shared library; others assume a simple binary.

The format specification grammar should be straightforward (tastes notwithstanding) for someone with experience hacking on common Unix software providing similar facilities (e.g. libc). Most of that logic is kept encapsulated in the `print` routine. What's somewhat unique is that for parsing checklists the format specification is translated into a regular expression, a task performed by the `format2re` routine.
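For readers who want a feel for that translation before opening the source, here is a deliberately tiny Python sketch of the general technique. It is not a port of `format2re`: it handles only %A, %C, %N, %% and %$, the character classes in `SPECIFIERS` are my own guesses, and it assumes records have already been split on the record separator.

```python
import re

# Hypothetical sub-patterns for a small subset of the specifiers.
SPECIFIERS = {
    "A": r"(?P<algorithm>[A-Za-z0-9-]+)",
    "C": r"(?P<digest>[0-9A-Fa-f]+)",
    "N": r"(?P<name>.+)",
}


def format_to_regex(fmt):
    """Translate a format specification into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(fmt):
        char = fmt[i]
        if char == "%" and i + 1 < len(fmt):
            code = fmt[i + 1]
            i += 2
            if code == "%":
                parts.append(re.escape("%"))
            elif code == "$":
                continue  # record separator: input is assumed pre-split
            elif code in SPECIFIERS:
                parts.append(SPECIFIERS[code])
            else:
                raise ValueError(f"unsupported specifier %{code}")
        else:
            # Everything else must match literally.
            parts.append(re.escape(char))
            i += 1
    return re.compile("".join(parts))


if __name__ == "__main__":
    # GNU coreutils style line, i.e. the default format.
    gnu = format_to_regex("%C  %N%$")
    print(gnu.fullmatch("abc123  docs/a.txt").groupdict())

    # BSD style line, as printed by OpenBSD sha256(1).
    bsd = format_to_regex("%A (%N) = %C%$")
    print(bsd.fullmatch("SHA256 (docs/a.txt) = abc123").groupdict())
```

Named groups keep the parsing code independent of the order in which specifiers appear in the format, so a single format string can drive both printing and parsing.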

license

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

source

git clone https://25thandClement.com/~william/projects/tarsum.git

Or visit the GitHub mirror
