rhuffle


rhuffle is a random line shuffler for large files whose contents can exceed available RAM.

rhuffle supports:

- shuffling huge files that do not fit in memory
- skipping header lines that should be excluded from shuffling (e.g. csv/tsv)
- multiple input files and flexible input formats
- fast execution (see the benchmark results)
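To illustrate how a file larger than memory can be shuffled at all, here is a minimal Python sketch of a generic two-pass approach: scatter lines into temporary bucket files at random, then shuffle each bucket in memory and concatenate the results. This is not taken from rhuffle's source, and rhuffle may work differently. The names `external_shuffle` and `n_buckets` are hypothetical, and the sketch assumes each bucket fits in memory on its own.

```python
import os
import random
import tempfile


def external_shuffle(src_path, dst_path, n_buckets=4, seed=None):
    """Shuffle the lines of src_path into dst_path via temporary bucket files.

    Hypothetical sketch: assumes every bucket fits in memory by itself.
    """
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket{i}") for i in range(n_buckets)]
        buckets = [open(p, "wb") for p in paths]
        try:
            # Pass 1: send every line to a uniformly chosen bucket.
            with open(src_path, "rb") as src:
                for line in src:
                    if not line.endswith(b"\n"):
                        line += b"\n"  # normalize a missing final newline
                    buckets[rng.randrange(n_buckets)].write(line)
        finally:
            for b in buckets:
                b.close()
        # Pass 2: shuffle each bucket in memory and append it to the output.
        with open(dst_path, "wb") as dst:
            for p in paths:
                with open(p, "rb") as f:
                    lines = f.readlines()
                rng.shuffle(lines)
                dst.writelines(lines)
```

Only one bucket is held in memory at a time, so a larger `n_buckets` lowers peak memory use at the cost of more temporary files.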


Installation

See lib.rs.

Usage

```
USAGE:
    rhuffle [OPTIONS]

FLAGS:
        --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -b, --buf
            Sets buffer size which is smaller than available RAM with bytes (default: 4294967296).

        --dst <PATH>
            Sets destination file path. If not set, destination sets to stdout. (default: None)

        --feed <LF|LF_CRLF>                        Sets acceptable line feed as EOL (default: LF_CRLF).
    -h, --head <NUMBER>
            Sets first `n` lines without shuffling (default: 0). For multiple input sources, take README a look.

        --log <off|error|warn|info|debug|trace>    Sets log level. (default: off)
        --src <[PATH]>
            Sets source file paths (space separated). If not set, source sets to stdin. (default: None)

```

--head n Option

in1.txt

    head1-1
    head2-1
    line1-1
    line2-1

in2.txt

    head1-2
    head2-2
    line1-2
    line2-2

Command:

    $ rhuffle --src in1.txt in2.txt --dst out.txt --head 2

out.txt

    head1-1  // L1-L2: fixed
    head2-1
    line2-1  // L3-L6: shuffled globally
    line1-2
    line2-2
    line1-1
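Judging from the example above, `--head n` with several sources appears to keep the first `n` lines of the first source at the top of the output, drop the first `n` lines of every other source, and shuffle all remaining lines together. The Python sketch below models that behavior in memory. It is inferred from the example only, not from rhuffle's code, and `shuffle_with_head` is a hypothetical name.

```python
import random


def shuffle_with_head(sources, head, seed=None):
    """Model of the --head behavior inferred from the example (an assumption).

    sources: one list of lines per input file, in command-line order.
    """
    rng = random.Random(seed)
    # Header lines are taken from the first source only and stay in place.
    header = sources[0][:head] if sources else []
    # The first `head` lines of every source are excluded from shuffling.
    body = [line for src in sources for line in src[head:]]
    rng.shuffle(body)
    return header + body


in1 = ["head1-1", "head2-1", "line1-1", "line2-1"]
in2 = ["head1-2", "head2-2", "line1-2", "line2-2"]
out = shuffle_with_head([in1, in2], head=2)
print(out[:2])          # ['head1-1', 'head2-1']
print(sorted(out[2:]))  # ['line1-1', 'line1-2', 'line2-1', 'line2-2']
```

The header lines of `in2.txt` do not appear in the output, which matches `out.txt` in the example.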

--feed Option

Benchmarks

The results below focus on execution time under limited memory. Two datasets are used for testing: a Kaggle competition dataset and a custom dataset.

Three programs are compared: GNU shuf, terashuf, and rhuffle.

Benchmarks were run on a 2017 MacBook Pro (Core i7 3.1 GHz, 16 GB RAM). Execution time was measured with `time`.

Kaggle competition dataset

5.3GB size, 55423856 lines

|Software|real|user|sys|
|---|---|---|---|
|GNU shuf|0m59s|0m34s|0m14s|
|terashuf|5m06s|4m43s|0m14s|
|rhuffle|1m56s|1m06s|0m40s|

Custom dataset

9.0GB size, 21550072 lines

|Software|real|user|sys|
|---|---|---|---|
|GNU shuf|x|x|x|
|terashuf|8m12s|7m16s|0m31s|
|rhuffle|1m47s|0m39s|0m51s|

GNU shuf could not be measured on this dataset (it was very slow).
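To make the comparison concrete, the short Python snippet below converts the `real` (wall-clock) times from the two tables above into seconds and prints each tool's time relative to rhuffle. The helper `to_seconds` is a hypothetical name, and the figures are copied from the tables, not measured again.

```python
import re


def to_seconds(text):
    """Convert a duration such as '1m56s' into seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m(\d+)s", text).groups()
    return int(minutes) * 60 + int(seconds)


# Wall-clock ("real") times copied from the benchmark tables.
real = {
    "Kaggle": {"GNU shuf": "0m59s", "terashuf": "5m06s", "rhuffle": "1m56s"},
    "Custom": {"terashuf": "8m12s", "rhuffle": "1m47s"},
}

for dataset, times in real.items():
    base = to_seconds(times["rhuffle"])
    for tool, t in times.items():
        if tool != "rhuffle":
            print(f"{dataset}: {tool} / rhuffle = {to_seconds(t) / base:.2f}")
# Kaggle: GNU shuf / rhuffle = 0.51
# Kaggle: terashuf / rhuffle = 2.64
# Custom: terashuf / rhuffle = 4.60
```

By these figures, terashuf takes about 2.6 and 4.6 times as long as rhuffle on the two datasets, while GNU shuf finishes the smaller Kaggle dataset in roughly half of rhuffle's time.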