`hck` is a shortening of `hack`, a rougher form of `cut`.

A close-to-drop-in replacement for `cut` that can use a regex delimiter instead of a fixed string. Additionally, this tool allows for specification of the order of the output columns using the same column selection syntax as `cut` (see below for examples).
No single feature of `hck` on its own makes it stand out over `awk`, `cut`, `xsv`, or other such tools. Where `hck` excels is in making common things easy, such as reordering output fields or splitting records on a weird delimiter. It is meant to be simple and easy to use while exploring datasets.
Key features:

* Reordering of output columns: if you use `-f4,2,8` the output columns will appear in the order `4`, `2`, `8`.
* The delimiter is treated as a regex (with `-R`), i.e. you can split on multiple spaces without an extra pipe to `tr`!
* Columns can be selected by header with the `-F` option, or by regex by setting the `-r` flag.
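As a quick illustrative sketch (not from the original docs; the regex-delimiter usage matches the multi-character example further below), splitting on runs of spaces without reaching for `tr`:

```bash
# Illustrative: a regex delimiter collapses runs of spaces into a single
# field separator, so no `tr -s ' '` pipe is needed.
❯ printf 'a  b   c\n' | hck -d'\s+' -f2
b
```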
Non-goals: `hck` does not aim to be a complete CSV / TSV parser a la `xsv`, which will respect quoting rules. It acts similar to `cut` in that it will split on the delimiter no matter where in the line it is. `hck` will also always be a line-by-line tool where newlines are the standard `\n` or `\r\n`.
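For example (an illustrative sketch, not taken from the original docs), a comma inside a quoted CSV field splits the record like any other comma:

```bash
# `hck`, like `cut`, ignores CSV quoting rules: the embedded comma still
# acts as a delimiter, so field 2 is the dangling `"last`.
❯ printf 'name,"last, first",age\n' | hck -Ld, -f2
"last
```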
Via Homebrew (built with profile guided optimizations):

```bash
brew tap sstadick/hck
brew install hck
```
Via the Debian package (built with profile guided optimizations):

```bash
curl -LO https://github.com/sstadick/hck/releases/download/<latest>/hck-linux-amd64.deb
sudo dpkg -i hck-linux-amd64.deb
```
Via `cargo`:

```bash
export RUSTFLAGS='-C target-cpu=native'
cargo install hck
```
Alternatively, download a prebuilt binary from the releases page (the binaries have been built with profile guided optimizations).
Or, if you want the absolute fastest possible build that makes use of profile guided optimizations AND native cpu features:
```bash
rustup component add llvm-tools-preview
git clone https://github.com/sstadick/hck
cd hck
bash pgo_local.sh
cp ./target/release/hck ~/.cargo/bin/hck
```
Some examples follow. Splitting a file on a single space, treating the delimiter as a string literal (`-L`):

```bash
❯ hck -Ld' ' -f1-3,5- ./README.md | head -n4
```
Selecting ranges of fields (field 4, `%MEM`, is dropped):

```bash
❯ ps aux | hck -f1-3,5- | head -n4
USER PID %CPU VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 169452 13472 ? Ss Jun21 0:19 /sbin/init splash
root 2 0.0 0 0 ? S Jun21 0:00 [kthreadd]
root 3 0.0 0 0 ? I< Jun21 0:00 [rcu_gp]
```
Reordering output columns:

```bash
❯ ps aux | hck -f2,1,3- | head -n4
PID USER %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1 root 0.0 0.0 169452 13472 ? Ss Jun21 0:19 /sbin/init splash
2 root 0.0 0.0 0 0 ? S Jun21 0:00 [kthreadd]
3 root 0.0 0.0 0 0 ? I< Jun21 0:00 [rcu_gp]
```
Changing the output record separator:

```bash
❯ ps aux | hck -D'___' -f2,1,3 | head -n4
PID___USER___%CPU
1___root___0.0
2___root___0.0
3___root___0.0
```
Selecting columns by header regex (`-r` makes the `-F` patterns regexes):

```bash
❯ ps aux | hck -r -F '^ST.*' -F '^USER$' | head -n4
STAT START USER
Ss Jun21 root
S Jun21 root
I< Jun21 root
```
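The feature list above also describes matching headers as plain string literals when `-r` is omitted; a hedged sketch of what that usage would look like (the output shown is assumed, not captured from a real run):

```bash
# Assumption: without `-r`, each `-F` value is compared against headers as a
# string literal rather than a regex, selecting the USER and PID columns.
❯ ps aux | hck -F 'USER' -F 'PID' | head -n3
USER PID
root 1
root 2
```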
Automatic decompression of input files with `-z`:

```bash
❯ gzip ./README.md
❯ hck -Ld' ' -f1-3,5- -z ./README.md.gz | head -n4
```
Splitting on a multi-character string literal, and then on a regex:

```bash
❯ printf 'this$;$is$;$a$;$test\na$;$b$;$3$;$four\n' > test.txt
❯ hck -Ld'$;$' -f3,4 ./test.txt
a test
3 four

❯ printf 'this123_is456--a789-test\na129-b849-3109-four\n' > test.txt
❯ hck -d'\d{3}[-_]+' -f3,4 ./test.txt
a test
3 four
```
This set of benchmarks is simply meant to show that `hck` is in the same ballpark as other tools. These are meant to capture real world usage of the tools, so in the multi-space delimiter benchmark for `cut`, for example, we use `tr` to convert the space runs to a single space and then pipe to `cut`.

Note that this is not meant to be an authoritative set of benchmarks; it is just meant to give a relative sense of the performance of different ways of accomplishing the same tasks.
Hardware: Ubuntu 20, AMD Ryzen 9 3950X 16-core processor, 64 GB DDR4 memory, and a 1 TB NVMe drive.

The `all_train.csv` data is used. This is a CSV dataset with 7 million lines. We test it both using `,` as the delimiter, and then also using `\s\s\s` as the delimiter.
PRs are welcome for benchmarks with more tools, or improved (but still realistic) pipelines for commands.
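The result tables below look like hyperfine-style output (Mean ± σ plus a Relative column); assuming that harness (the README does not name one), a new benchmark row could be produced roughly like this:

```bash
# Assumption: hyperfine is the benchmark harness. Compares two of the
# pipelines from the tables below on the same 7-million-line input.
❯ hyperfine --warmup 2 --export-markdown results.md \
    'hck -Ld, -f1,8,19 ./hyper_data.txt > /dev/null' \
    'cut -d, -f1,8,19 ./hyper_data.txt > /dev/null'
```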
Tools compared:

- `cut`
  - https://www.gnu.org/software/coreutils/manual/html_node/The-cut-command.html
  - 8.30
- `mawk`
  - https://invisible-island.net/mawk/mawk.html
  - v1.3.4
- `xsv`
  - https://github.com/BurntSushi/xsv
  - v0.13.0 (compiled locally with optimizations)
- `tsv-utils`
  - https://github.com/eBay/tsv-utils
  - v2.2.0 (ldc2, compiled locally with optimizations)
- `choose`
  - https://github.com/theryangeary/choose
  - v1.3.1 (compiled locally with optimizations)
Single-character delimiter benchmark:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
| :--- | ---: | ---: | ---: | ---: |
| `hck -Ld, -f1,8,19 ./hyper_data.txt > /dev/null` | 1.494 ± 0.026 | 1.463 | 1.532 | 1.00 |
| `hck -Ld, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null` | 1.735 ± 0.004 | 1.729 | 1.740 | 1.16 ± 0.02 |
| `hck -d, -f1,8,19 ./hyper_data.txt > /dev/null` | 1.772 ± 0.009 | 1.760 | 1.782 | 1.19 ± 0.02 |
| `hck -d, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null` | 1.935 ± 0.041 | 1.862 | 1.958 | 1.30 ± 0.04 |
| `choose -f , -i ./hyper_data.txt 0 7 18 > /dev/null` | 4.597 ± 0.016 | 4.574 | 4.617 | 3.08 ± 0.05 |
| `tsv-select -d, -f 1,8,19 ./hyper_data.txt > /dev/null` | 1.788 ± 0.006 | 1.783 | 1.798 | 1.20 ± 0.02 |
| `xsv select -d, 1,8,19 ./hyper_data.txt > /dev/null` | 5.683 ± 0.017 | 5.660 | 5.706 | 3.80 ± 0.07 |
| `awk -F, '{print $1, $8, $19}' ./hyper_data.txt > /dev/null` | 5.021 ± 0.013 | 5.005 | 5.041 | 3.36 ± 0.06 |
| `cut -d, -f1,8,19 ./hyper_data.txt > /dev/null` | 7.045 ± 0.415 | 6.847 | 7.787 | 4.72 ± 0.29 |
Multi-character delimiter benchmark:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
| :--- | ---: | ---: | ---: | ---: |
| `hck -Ld' ' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 2.127 ± 0.004 | 2.122 | 2.133 | 1.00 |
| `hck -Ld' ' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null` | 2.467 ± 0.012 | 2.459 | 2.488 | 1.16 ± 0.01 |
| `hck -d'[[:space:]]+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 9.736 ± 0.069 | 9.630 | 9.786 | 4.58 ± 0.03 |
| `hck -d'[[:space:]]+' --no-mmap -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 9.840 ± 0.024 | 9.813 | 9.869 | 4.63 ± 0.01 |
| `hck -d'\s+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 10.446 ± 0.013 | 10.425 | 10.456 | 4.91 ± 0.01 |
| `hck -d'\s+' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null` | 10.498 ± 0.118 | 10.441 | 10.710 | 4.94 ± 0.06 |
| `choose -f ' ' -i ./hyper_data.txt 0 7 18 > /dev/null` | 3.266 ± 0.011 | 3.248 | 3.277 | 1.54 ± 0.01 |
| `choose -f '[[:space:]]+' -i ./hyper_data.txt 0 7 18 > /dev/null` | 18.020 ± 0.022 | 17.993 | 18.040 | 8.47 ± 0.02 |
| `choose -f '\s+' -i ./hyper_data.txt 0 7 18 > /dev/null` | 59.425 ± 0.457 | 58.900 | 59.893 | 27.94 ± 0.22 |
| `awk -F' ' '{print $1, $8 $19}' ./hyper_data_multichar.txt > /dev/null` | 6.824 ± 0.027 | 6.780 | 6.851 | 3.21 ± 0.01 |
| `awk -F' ' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null` | 6.072 ± 0.181 | 5.919 | 6.385 | 2.85 ± 0.09 |
| `awk -F'[:space:]+' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null` | 11.125 ± 0.066 | 11.012 | 11.177 | 5.23 ± 0.03 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| cut -d ' ' -f1,8,19 > /dev/null` | 7.508 ± 0.059 | 7.433 | 7.591 | 3.53 ± 0.03 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| tail -n+2 \| xsv select -d ' ' 1,8,19 --no-headers > /dev/null` | 6.719 ± 0.241 | 6.419 | 6.983 | 3.16 ± 0.11 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| hck -Ld' ' -f1,8,19 > /dev/null` | 6.351 ± 0.041 | 6.296 | 6.391 | 2.99 ± 0.02 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| tsv-select -d ' ' -f 1,8,19 > /dev/null` | 6.359 ± 0.056 | 6.311 | 6.453 | 2.99 ± 0.03 |
The following table indicates the file extension / binary pairs that are used to try to decompress a file when the `-z` option is specified:
| Extension | Binary                   | Type       |
| :-------- | :----------------------- | :--------- |
| `*.gz`    | `gzip -d -c`             | gzip       |
| `*.tgz`   | `gzip -d -c`             | gzip       |
| `*.bz2`   | `bzip2 -d -c`            | bzip2      |
| `*.tbz2`  | `bzip2 -d -c`            | bzip2      |
| `*.xz`    | `xz -d -c`               | xz         |
| `*.txz`   | `xz -d -c`               | xz         |
| `*.lz4`   | `lz4 -d -c`              | lz4        |
| `*.lzma`  | `xz --format=lzma -d -c` | lzma       |
| `*.br`    | `brotli -d -c`           | brotli     |
| `*.zst`   | `zstd -d -c`             | zstd       |
| `*.zstd`  | `zstd -q -d -c`          | zstd       |
| `*.Z`     | `uncompress -c`          | uncompress |
When a file with one of the extensions above is found, `hck` will open a subprocess running the decompression tool listed above and read from the output of that tool. If the binary can't be found, then `hck` will try to read the compressed file as is. See `grep_cli` for source code. The end goal is to add a preprocessor option similar to ripgrep's.
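In other words, for a `.gz` input the `-z` path behaves roughly like piping the decompressor's stdout into `hck` yourself (an illustrative equivalent of the decompression example above):

```bash
# Roughly equivalent to `hck -Ld' ' -f1-3,5- -z ./README.md.gz`:
# run the decompression tool as a subprocess and read from its output.
❯ gzip -d -c ./README.md.gz | hck -Ld' ' -f1-3,5- | head -n4
```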
See the `pgo*.sh` scripts for how to build `hck` with optimizations. You will need to install the llvm tools via `rustup component add llvm-tools-preview` for this to work. Building with PGO seems to improve performance anywhere from 5-30% depending on the platform and codepath; e.g. on macOS it seems to have a larger effect, and it also seems to have a greater effect on the regex codepath.
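For reference, a minimal sketch of the generate / train / merge / rebuild cycle that the `pgo*.sh` scripts automate (the paths and training workload here are illustrative, not taken from the scripts):

```bash
# 1. Build an instrumented binary that emits profile data when run.
RUSTFLAGS='-C profile-generate=/tmp/pgo-data' cargo build --release

# 2. Run a representative workload to collect profiles (training input assumed).
./target/release/hck -Ld, -f1,8,19 ./hyper_data.txt > /dev/null

# 3. Merge the raw profiles using llvm-profdata from llvm-tools-preview.
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data

# 4. Rebuild with the merged profile to guide optimization.
RUSTFLAGS='-C profile-use=/tmp/pgo-data/merged.profdata' cargo build --release
```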
TODO:

- Add a greedy / non-greedy option that ignores blank fields, along the lines of `split.filter(|s| !s.is_empty() || config.opt.non_greedy)`
- For more packages and builds, see https://github.com/sharkdp/bat/blob/master/.github/workflows/CICD.yml