Rust script to hash many files, quickly.
There are 3 modes of operation.
Parallelises file discovery (in usage #1) and hashing. The default hasher is not cryptographically secure.
For each file hashed, a line of the form `"{path}"\t{hex_digest}` is printed to stdout.
This is reversed compared to most hashing utilities (`md5sum`, `sha1sum` etc.) with the intention of making it easier to sort deterministically by file name (pipe the output through `awk -F '\t' 'BEGIN {OFS = FS} {print $2,$1}'` to reverse it, keeping the tab intact).
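
As an illustration of the output layout, here is a minimal Rust sketch of the same two steps: building a `"{path}"\t{hex_digest}` line, and swapping the fields as the `awk` command does. The function names `format_line` and `digest_first` are hypothetical and not part of recursum. Lowercase hex and the absence of any escaping inside the quoted path are assumptions, not confirmed behaviour of the tool.

```rust
/// Render one output line in the `"{path}"\t{hex_digest}` layout.
/// Assumption: lowercase hex, and the path is quoted without escaping.
fn format_line(path: &str, digest: &[u8]) -> String {
    let hex: String = digest.iter().map(|b| format!("{:02x}", b)).collect();
    format!("\"{}\"\t{}", path, hex)
}

/// Swap the two tab-separated fields, giving `{hex_digest}\t"{path}"`.
/// Splits on the last tab, since a hex digest never contains one.
/// Returns `None` if the line has no tab at all.
fn digest_first(line: &str) -> Option<String> {
    let (path, digest) = line.rsplit_once('\t')?;
    Some(format!("{}\t{}", digest, path))
}
```

Splitting on the last tab rather than the first means a path that itself contains a tab is still handled, which the `awk` one-liner would not do.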
Ongoing progress information, and a final time and rate, are printed to stderr.
Contributions welcome.
With `cargo` installed (get it with `rustup`):
```sh
cargo install recursum
```
```
recursum
Hash lots of files fast, in parallel.

USAGE:
    recursum [FLAGS] [OPTIONS] ...

FLAGS:
    -h, --help       Prints help information
    -q, --quiet      Do not show progress information
    -V, --version    Prints version information

OPTIONS:
    -d, --digest

ARGS:
    ...    File name, directory name (every file recursively will be hashed, in depth first order), or '-' for getting list of files from stdin (order is conserved)
```
Example:
```sh
fd --threads 1 --type file | recursum --threads 10 --digest 64 - > my_checksums.txt
```
This should be more efficient, and have better logging, than using `--exec` or `| xargs`.
Broadly speaking, `recursum` uses one or more threads to populate a queue of files to hash, taken from file names given directly, from recursively walking directories, or from a list read from stdin.
Simultaneously, items are popped off this queue and executed using tokio's threaded scheduler. There should be no context switches within each task; the tasks are processed in the same order that they are received. The main thread fetches results (in the same order) and prints them to stdout.
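
The queue-and-ordered-results design described above can be sketched with the standard library alone. This is not recursum's actual code: it swaps tokio's scheduler for plain `std::thread` workers, uses `DefaultHasher` as a stand-in non-cryptographic hasher, and all names (`Job`, `hash_file`, `hash_in_order`) are hypothetical. Each job carries a one-shot reply channel, and the receivers are queued in submission order so the consumer can emit results in that order regardless of which worker finishes first.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::Hasher;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

/// One unit of work: a file to hash and the one-shot channel to reply on.
type Job = (PathBuf, Sender<io::Result<u64>>);

/// Hash a file's bytes with the standard library's non-cryptographic hasher.
fn hash_file(path: &Path) -> io::Result<u64> {
    let bytes = fs::read(path)?;
    let mut hasher = DefaultHasher::new();
    hasher.write(&bytes);
    Ok(hasher.finish())
}

/// Hash `paths` on `workers` threads, returning results in input order.
fn hash_in_order(paths: Vec<PathBuf>, workers: usize) -> Vec<(PathBuf, io::Result<u64>)> {
    let (job_tx, job_rx) = channel::<Job>();
    let job_rx = Arc::new(Mutex::new(job_rx));
    // Each entry pairs a path with the receiver for its eventual digest.
    let (pending_tx, pending_rx) = channel::<(PathBuf, Receiver<io::Result<u64>>)>();

    // Worker pool: each thread pops the next job and hashes it.
    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            thread::spawn(move || loop {
                // The lock is held only while taking a job, not while hashing.
                let job = job_rx.lock().unwrap().recv();
                match job {
                    Ok((path, reply)) => {
                        let _ = reply.send(hash_file(&path));
                    }
                    Err(_) => break, // queue closed and drained
                }
            })
        })
        .collect();

    // Producer: stands in for file discovery, enqueueing paths in order.
    let producer = thread::spawn(move || {
        for path in paths {
            let (reply_tx, reply_rx) = channel();
            pending_tx.send((path.clone(), reply_rx)).unwrap();
            job_tx.send((path, reply_tx)).unwrap();
        }
    });

    // Consumer (the calling thread): wait on each result in submission order.
    let mut results = Vec::new();
    for (path, reply_rx) in pending_rx {
        results.push((path, reply_rx.recv().unwrap()));
    }
    producer.join().unwrap();
    for handle in handles {
        handle.join().unwrap();
    }
    results
}
```

Calling `hash_in_order(paths, 10)` returns one `(path, result)` pair per input, in input order. A file that cannot be read yields an `Err` in its slot rather than aborting the whole run.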
`find` (or `fd`) with `-exec` (`--exec`), e.g.
```sh
find . -type f -exec md5sum {} \;
```
`find` is single-threaded. With the `\;` terminator, `-exec` runs the hashing utility once per file; with the `+` terminator, it flattens the list of found files, passing each as an additional argument to the hashing utility.
This can break if the number of files is large.
Additionally, many built-in hashing utilities are not multi-threaded; furthermore, the utility is not actually called until the file list has been populated.
You can also pipe a list of arguments to `xargs`, which can parallelise with `-P` and restrict the number of arguments given with `-n`:
```sh
find . -type f -print0 | xargs -0 -P 8 -n 1 md5sum
```
This spawns a new `md5sum` process for every file, which adds start-up overhead, and may not make as good use of the CPU, as there can be no communication between processes. However, these tools are far more mature than `recursum`, so they may work better for you.