rhuffle is a random shuffler for large file with many lines which can exceed available RAM.
rhuffle supports: - shuffling huge files which does not fit in memory - skipping head lines which should not include for shuffling (e.g. csv/tsv) - multiple file input and flexible input formats - rhuffle works very fast (see benchmark results.)
See lib.rs.
``` USAGE: rhuffle [OPTIONS]
FLAGS: --help Prints help information -V, --version Prints version information
OPTIONS:
-b, --buf
--dst <PATH>
Sets destination file path. If not set, destination sets to stdout. (default: None)
--feed <LF|LF_CRLF> Sets acceptable line feed as EOL (default: LF_CRLF).
-h, --head <NUMBER>
Sets first `n` lines without shuffling (default: 0). For multiple input sources, take README a look.
--log <off|error|warn|info|debug|trace> Sets log level. (default: off)
--src <[PATH]>
Sets source file paths (space separated). If not set, source sets to stdin. (default: None)
```
--head n
Optionn
lines in the first input source forwards to output source without shuffling.n
lines in the first input source are skipped.in1.txt
head1-1
head2-1
line1-1
line2-1
in2.txt
head1-2
head2-2
line1-2
line2-2
$ rhuffle --src in1.txt in2.txt --dst out.txt --head 2
out.txt
head1-1 // L1-L2: fixed
head2-1
line2-1 // L3-L6: shuffled globally
line1-2
line2-2
line1-1
--feed
OptionThe results shown below are focused on execution time in a limited memory space. Two datasets are used for testing.
Three softwares are used for performance comparison.
shuf {src} -o {dst}
terashuf < {src} > {dst}
rhuffle --src {src} --dst {dst}
Benchmarks are executed on MacBook Pro 2017, Core i7 3.1GHz, RAM 16GB.
Execution time is measured by time
.
5.3GB size, 55423856 lines
|Software|real|user|sys| |---|---|---|---| |GNU shuf|0m59s|0m34s|0m14s| |terashuf|5m06s|4m43s|0m14s| |rhuffle|1m56s|1m06s|0m40s|
9.0GB size, 21550072 lines
|Software|real|user|sys| |---|---|---|---| |GNU shuf|x|x|x| |terashuf|8m12s|7m16s|0m31s| |rhuffle|1m47s|0m39s|0m51s|
GNU shuf was impossible to measure (very slow).