Like `wc`, but Unicode-aware, and with line mode.

`uwc` can count:

- lines
- words
- bytes
- grapheme clusters
- codepoints

Additionally, it can operate in line mode, which counts things within lines.

By default, `uwc` will count lines, words, and bytes. You can specify the counters you'd like, or ask for all counters with the `-a` flag:
```sh
$ uwc tests/fixtures/**/input
lines  words  bytes  filename
8      5      29     tests/fixtures/allnewlines/input
0      0      0      tests/fixtures/empty/input
0      0      0      tests/fixtures/emptylinemode/input
1      9      97     tests/fixtures/flagsbp/input
1      9      97     tests/fixtures/flagscl/input
1      9      97     tests/fixtures/flagsw/input
0      1      5      tests/fixtures/hello/input
1      9      97     tests/fixtures/icaneatglass/input
8      8      29     tests/fixtures/linemode/input
7      8      28     tests/fixtures/linemodenotrailingnewline/input
7      8      28     tests/fixtures/linemodenotrailingnewlinecountnewlines/input
34     66     507    total

$ uwc -a tests/fixtures/**/input
lines  words  bytes  graphemes  codepoints  filename
8      5      29     23         24          tests/fixtures/allnewlines/input
0      0      0      0          0           tests/fixtures/empty/input
0      0      0      0          0           tests/fixtures/emptylinemode/input
1      9      97     51         51          tests/fixtures/flagsbp/input
1      9      97     51         51          tests/fixtures/flagscl/input
1      9      97     51         51          tests/fixtures/flagsw/input
0      1      5      5          5           tests/fixtures/hello/input
1      9      97     51         51          tests/fixtures/icaneatglass/input
8      8      29     28         28          tests/fixtures/linemode/input
7      8      28     27         27          tests/fixtures/linemodenotrailingnewline/input
7      8      28     27         27          tests/fixtures/linemodenotrailingnewlinecountnewlines/input
34     66     507    314        315         total
```
You can also switch into line mode with the `--mode` flag:
```sh
$ uwc -a --mode line tests/fixtures/line_mode/input
lines  words  bytes  graphemes  codepoints  filename
0      1      1      1          1           tests/fixtures/line_mode/input:1
0      1      2      2          2           tests/fixtures/line_mode/input:2
0      1      3      3          3           tests/fixtures/line_mode/input:3
0      1      5      4          4           tests/fixtures/line_mode/input:4
0      1      1      1          1           tests/fixtures/line_mode/input:5
0      1      4      4          4           tests/fixtures/line_mode/input:6
0      1      2      2          2           tests/fixtures/line_mode/input:7
0      1      3      3      	3           tests/fixtures/line_mode/input:8
0      8      21     20         20          tests/fixtures/line_mode/input:total
```
The goal of this project is to apply Unicode rules correctly when counting things; in particular, grapheme clusters and word boundaries should follow the Unicode text segmentation rules.
It does not aim to implement these Unicode algorithms itself, however, so it makes use of the `unicode-segmentation` library for most of the heavy lifting. And since Unicode support in the Rust ecosystem is not quite mature yet, that has some consequences for this project. See the caveats below.
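For a flavor of how counting on top of that crate can look, here is a minimal sketch. It is not uwc's actual implementation, just an illustration; note that every counter is its own pass over the text, which matters for the performance caveat below.

```rust
// Minimal sketch of Unicode-aware counting with the unicode-segmentation
// crate. This is an illustration, not uwc's actual implementation.
use unicode_segmentation::UnicodeSegmentation;

fn counts(text: &str) -> (usize, usize, usize, usize, usize) {
    // Each counter is a separate pass over the text, because the crate
    // only exposes its segmentation through iterators.
    let lines = text.matches('\n').count(); // newline count, wc-style
    let words = text.unicode_words().count(); // UAX #29 word boundaries
    let bytes = text.len();
    let graphemes = text.graphemes(true).count(); // extended grapheme clusters
    let codepoints = text.chars().count();
    (lines, words, bytes, graphemes, codepoints)
}

fn main() {
    let (lines, words, bytes, graphemes, codepoints) =
        counts("I can eat glass and it doesn't hurt me.\n");
    println!("{lines} {words} {bytes} {graphemes} {codepoints}");
}
```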
It is primarily a fun side project for me, and an excuse to learn more about Rust and Unicode.
It only supports UTF-8 files. UTF-16 can go on my to-do list if there is demand. For now, you can use `iconv` to convert non-UTF-8 files first.
It is slower than `wc`. Much slower. On my laptop, I'm measuring about 10x slower. My analysis hasn't been extensive, but as far as I can tell, the main reason is that while the `unicode-segmentation` library is helpful, it is also quite limiting: it only exposes its functionality through iterators, which makes certain optimizations difficult, like counting everything in a single pass.

Rust, as yet, has no localization libraries, and this has some consequences. Some counts will just be wrong; for example, counting hyphenated words correctly is locale-specific and requires language dictionary lookups. Some languages also have no syntactic word separators, such as Japanese, so e.g. 私はガラスを食べられます。 should be 5 words, but without localization, we cannot determine that.
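To make that concrete, here is a hypothetical snippet (not part of uwc) that runs the default UAX #29 word segmentation from `unicode-segmentation` over that sentence; with no dictionary or language knowledge, it cannot recover the five words a human reader would count.

```rust
// Hypothetical illustration: default UAX #29 word segmentation knows
// nothing about Japanese, so it cannot find the 5 linguistic words.
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let sentence = "私はガラスを食べられます。";

    // Substrings containing alphanumeric characters, split on UAX #29
    // word boundaries; the count will not match a human reader's 5.
    let segments: Vec<&str> = sentence.unicode_words().collect();
    println!("{} segments: {:?}", segments.len(), segments);
}
```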