cw - Count Words

A wc clone in Rust.

Synopsis

```
-% cw --help
cw 0.2.0
Thomas Hurst tom@hur.st
Count Words - word, line, character and byte count

USAGE:
    cw [FLAGS] [input]...

FLAGS:
    -c, --bytes              Count bytes
    -m, --chars              Count UTF-8 characters instead of bytes
    -h, --help               Prints help information
    -l, --lines              Count lines
    -L, --max-line-length    Count bytes (default) or characters (-m) of the longest line
    -V, --version            Prints version information
    -w, --words              Count words

ARGS:
    <input>...    Input files

-% cw Dickens_Charles_Pickwick_Papers.xml
3449440 51715840 341152640 Dickens_Charles_Pickwick_Papers.xml
```

Performance

Line counts are optimized using the bytecount crate:

```
Benchmark #1: wc -l Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):     439.7 ms ±   2.0 ms    [User: 354.9 ms, System: 84.5 ms]
  Range (min … max):   435.3 ms … 441.4 ms

Benchmark #2: gwc -l Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):     533.0 ms ±   1.7 ms    [User: 388.8 ms, System: 144.0 ms]
  Range (min … max):   530.9 ms … 535.1 ms

Benchmark #3: cw -l Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):     127.9 ms ±   1.5 ms    [User: 24.1 ms, System: 103.7 ms]
  Range (min … max):   125.1 ms … 131.3 ms

Summary
  'cw -l Dickens_Charles_Pickwick_Papers.xml' ran
    3.44 ± 0.04 times faster than 'wc -l Dickens_Charles_Pickwick_Papers.xml'
    4.17 ± 0.05 times faster than 'gwc -l Dickens_Charles_Pickwick_Papers.xml'
```
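As a rough illustration of what the line-count fast path computes, here is a minimal sketch that reads input in fixed-size chunks and counts newline bytes. It uses only the standard library; the function name `count_lines` and the 64 KiB buffer size are assumptions for illustration, not cw's actual code. With the bytecount crate, the inner count would be a call to `bytecount::count`.

```rust
use std::io::{self, Read};

/// Count newline bytes in a reader, one chunk at a time.
/// Hypothetical helper for illustration; not taken from cw's source.
fn count_lines<R: Read>(mut reader: R) -> io::Result<u64> {
    // Buffer size is an arbitrary choice for this sketch.
    let mut buf = [0u8; 64 * 1024];
    let mut lines = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            // Retry reads interrupted by a signal.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        // Scalar stand-in for `bytecount::count(&buf[..n], b'\n')`.
        lines += buf[..n].iter().filter(|&&b| b == b'\n').count() as u64;
    }
    Ok(lines)
}
```

Any `Read` implementation works as input, for example `count_lines(std::fs::File::open(path)?)` or `count_lines(std::io::stdin().lock())`.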

Line counts with line length are optimized using the memchr crate:

```
Benchmark #1: wc -lL Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):     441.6 ms ±   1.8 ms    [User: 354.7 ms, System: 86.5 ms]
  Range (min … max):   438.5 ms … 443.8 ms

Benchmark #2: gwc -lL Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      3.851 s ±  0.005 s    [User: 3.710 s, System: 0.141 s]
  Range (min … max):    3.847 s …  3.864 s

Benchmark #3: cw -lL Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):     255.6 ms ±   1.1 ms    [User: 154.6 ms, System: 100.9 ms]
  Range (min … max):   253.3 ms … 256.9 ms

Summary
  'cw -lL Dickens_Charles_Pickwick_Papers.xml' ran
    1.73 ± 0.01 times faster than 'wc -lL Dickens_Charles_Pickwick_Papers.xml'
   15.07 ± 0.07 times faster than 'gwc -lL Dickens_Charles_Pickwick_Papers.xml'
```
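The combined line count and longest-line measurement can be pictured as repeatedly searching for the next newline and recording how far away it was. The sketch below is an assumption about the general shape of such a loop, not cw's implementation; `lines_and_max_len` is a hypothetical name, and the standard library's `position` stands in for `memchr::memchr(b'\n', rest)`. Lengths are in bytes, with no special treatment of tabs or wide characters.

```rust
/// Return (number of newlines, length in bytes of the longest line).
/// Hypothetical helper for illustration; not taken from cw's source.
fn lines_and_max_len(data: &[u8]) -> (usize, usize) {
    let mut lines = 0;
    let mut longest = 0;
    let mut rest = data;
    // `position` plays the role of `memchr::memchr(b'\n', rest)` here.
    while let Some(pos) = rest.iter().position(|&b| b == b'\n') {
        lines += 1;
        // `pos` is the length of the line that ends at this newline.
        longest = longest.max(pos);
        rest = &rest[pos + 1..];
    }
    // A final line with no trailing newline still counts toward the length.
    longest = longest.max(rest.len());
    (lines, longest)
}
```

Searching for each newline directly, rather than inspecting every byte in a hand-written loop, is what lets an optimized search routine do most of the work.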

Note that without -m, cw operates only on bytes, and it never consults your locale.

```
Benchmark #1: wc Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      2.708 s ±  0.002 s    [User: 2.612 s, System: 0.095 s]
  Range (min … max):    2.706 s …  2.712 s

Benchmark #2: gwc Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      3.851 s ±  0.003 s    [User: 3.714 s, System: 0.136 s]
  Range (min … max):    3.847 s …  3.856 s

Benchmark #3: cw Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      2.026 s ±  0.001 s    [User: 1.939 s, System: 0.087 s]
  Range (min … max):    2.024 s …  2.028 s

Summary
  'cw Dickens_Charles_Pickwick_Papers.xml' ran
    1.34 ± 0.00 times faster than 'wc Dickens_Charles_Pickwick_Papers.xml'
    1.90 ± 0.00 times faster than 'gwc Dickens_Charles_Pickwick_Papers.xml'
```
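To make "operates only on bytes" concrete, here is a minimal sketch of locale-independent word counting. It treats a word as a run of bytes that are not ASCII whitespace. The function name `count_words` and the exact whitespace set (Rust's `is_ascii_whitespace`, which excludes vertical tab) are assumptions for illustration and may differ from what cw does.

```rust
/// Count words as runs of non-whitespace bytes, ignoring locale and encoding.
/// Hypothetical helper for illustration; not taken from cw's source.
fn count_words(data: &[u8]) -> usize {
    let mut words = 0;
    let mut in_word = false;
    for &b in data {
        if b.is_ascii_whitespace() {
            // Whitespace ends the current word, if any.
            in_word = false;
        } else if !in_word {
            // First non-whitespace byte after whitespace starts a new word.
            in_word = true;
            words += 1;
        }
    }
    words
}
```

Because every byte of a multi-byte UTF-8 sequence is outside the ASCII range, such sequences are simply treated as part of a word, and no decoding is needed.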

-m enables UTF-8 processing, and currently has no fast paths.

```
Benchmark #1: wc -mLlw Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      8.972 s ±  0.019 s    [User: 8.875 s, System: 0.096 s]
  Range (min … max):    8.958 s …  9.013 s

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark #2: gwc -mLlw Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      3.852 s ±  0.008 s    [User: 3.700 s, System: 0.151 s]
  Range (min … max):    3.846 s …  3.867 s

Benchmark #3: cw -mLlw Dickens_Charles_Pickwick_Papers.xml
  Time (mean ± σ):      3.721 s ±  0.003 s    [User: 3.598 s, System: 0.123 s]
  Range (min … max):    3.715 s …  3.726 s

Summary
  'cw -mLlw Dickens_Charles_Pickwick_Papers.xml' ran
    1.04 ± 0.00 times faster than 'gwc -mLlw Dickens_Charles_Pickwick_Papers.xml'
    2.41 ± 0.01 times faster than 'wc -mLlw Dickens_Charles_Pickwick_Papers.xml'
```
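For readers unfamiliar with what counting UTF-8 characters involves, one common technique is to count every byte that is not a continuation byte (one of the form `10xxxxxx`). The sketch below shows that technique under the assumption that the input is valid UTF-8. It is not a description of how cw handles -m, and `count_utf8_chars` is a hypothetical name.

```rust
/// Count UTF-8 code points by counting bytes that start a sequence.
/// Assumes valid UTF-8; hypothetical helper, not taken from cw's source.
fn count_utf8_chars(data: &[u8]) -> usize {
    // Continuation bytes have the bit pattern 10xxxxxx; every other byte
    // begins a new code point.
    data.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}
```

On valid input this agrees with `str::chars().count()`. On malformed input the result depends on how stray bytes are treated, which this sketch does not attempt to define.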

These benchmarks were run on FreeBSD 12 on a 2.1 GHz Westmere Xeon. gwc is from GNU coreutils 8.30.

For best results build with:

cargo build --release --features runtime-dispatch-simd

This enables SIMD optimizations for line counting. It has no effect when cw is asked to count anything else.

Future

See Also

[uwc]

[uwc] focuses on following Unicode rules as precisely as possible, taking into account less-common newlines, counting graphemes as well as codepoints, and following Unicode word-boundary rules precisely.

This precision currently costs a great deal of performance, with counts on my benchmark file taking over a minute.

[rwc]

cw was originally called [rwc] until I noticed this existed. It's quite old and doesn't appear to compile.

[linecount]

A little library that only does plain newline counting, along with a binary called lc. Version 0.2 will use the same algorithm as cw.