Highly efficient "Masking tape" for Shell

TL;DR

Git Animation for Introduction

Edit 4th and 6th columns in the CSV file

bash $ cat file.csv | teip -d, -f 4,6 -- sed 's/./@/g'

Convert timestamps in /var/log/secure to UNIX time

bash $ cat /var/log/secure | teip -c 1-15 -- date -f- +%s

Percent-encode bare-minimum range of the file

bash $ cat file | teip -r '[^-a-zA-Z0-9@:%._\+~#=/]+' -- php -R 'echo urlencode($argn)."\n";'

Performance enhancement

teip allows a command to focus on its own task.

Here is a comparison of the processing time needed to replace approximately 761,000 IP addresses with dummy ones in a 100 MiB text file.

benchmark bar chart

See the details on Wiki > Benchmark.

Features

Allows any command to "ignore unwanted input", which most commands cannot do
- Executes the targeted command while masking part of the standard input
- Flexible methods for selecting a range
High performance
- The targeted command's standard input and output are intercepted asynchronously by multiple teip threads.
- If general UNIX commands in your environment can process files of a few hundred MB in a few seconds, teip can match or exceed that performance.

Installation (x86_64)

With Homebrew (for macOS users)

bash $ brew install greymd/tools/teip

With apt (For Ubuntu users)

```bash
$ wget https://git.io/teip-1.1.2.x86_64.deb
$ sudo dpkg -i ./teip*.deb
```

SHA256: 73c54c36c1c30e2137629f08993693a154d7f08c80655ae7fd485ca60b1eaae7

With dnf (For CentOS, RHEL users)

```bash
$ sudo dnf install https://git.io/teip-1.1.2.x86_64.rpm
```

SHA256: 15597b5ee5678d28058decd53ff8e75b3831d1287fceeedf5988ae363309b4f6

With yum (For CentOS7, RHEL7, Amazon Linux 2 users)

```bash
$ sudo yum install https://git.io/teip-1.1.2.x86_64.rpm
```

SHA256: 15597b5ee5678d28058decd53ff8e75b3831d1287fceeedf5988ae363309b4f6

For other architectures (i686, ARM, etc..)

See Wiki > Installation

For Windows

Unfortunately, teip does not work on Windows for technical reasons.

Usage

```
Usage:
  teip (-r <pattern> | -R <pattern>) [-svz] [--] [<command>...]
  teip -f <list> [-d <delimiter> | -D <pattern>] [-svz] [--] [<command>...]
  teip -c <list> [-svz] [--] [<command>...]
  teip --help | --version

Options:
  --help          Display this help and exit
  --version       Show version and exit
  -r <pattern>    Select strings matched by given regular expression <pattern>
  -R <pattern>    EXPERIMENTAL: Same as -r but use Oniguruma regular expressions
  -f <list>       Select only these white-space separated fields
  -d <delimiter>  Use <delimiter> for field delimiter of -f
  -D <pattern>    Use regular expression <pattern> for field delimiter of -f
  -c <list>       Select only these characters
  -s              Execute command for each selected part
  -v              Invert the sense of selecting
  -z              Line delimiter is NUL instead of newline
```

Getting Started

Try this first.

bash $ echo "100 200 300 400" | teip -f 3

The result is almost the same as the input, but "300" is highlighted and surrounded by [...] because -f 3 selects the 3rd field of the space-separated input.

bash 100 200 [300] 400

Next, put sed and its arguments at the end.

bash $ echo "100 200 300 400" | teip -f 3 sed 's/./@/g'

The result is as follows. The highlight and [...] are gone.

100 200 @@@ 400

As you can see, teip passes only the highlighted part to sed and replaces it with sed's output.

Of course, you can specify any command you like. In this article, it is called the targeted command.

Let's try cut as the targeted command to extract only the first character.

```bash
$ echo "100 200 300 400" | teip -f 3 cut -c 1
teip: Invalid arguments.
```

Oops! Why did it fail?

This is because cut is given the -c option, and teip provides an option of the same name, so the two are confused.

When passing a targeted command to teip, it is better to put it after --. teip then interprets everything after -- as the targeted command and its arguments.

```bash
$ echo "100 200 300 400" | teip -f 3 -- cut -c 1
100 200 3 400
```

Great, the first character 3 is extracted from 300!

Although -- is not always necessary, it is always safer to use it. So, -- is used in all the examples from here on.

Now let's double this number with awk. The command looks like the following (note that the variable to be doubled is not $3).

```bash
$ echo "100 200 300 400" | teip -f 3 -- awk '{print $1*2}'
100 200 600 400
```

OK, the result went from 300 to 600.

Now, let's change -f 3 to -f 3,4 and run it.

```bash
$ echo "100 200 300 400" | teip -f 3,4 -- awk '{print $1*2}'
100 200 600 800
```

The numbers in the 3rd and 4th fields were doubled!
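To see why $1 is the right variable, it may help to run awk by hand on what it plausibly receives. The sketch below does not call teip; it assumes, as a simplification, that each selected field reaches the targeted command on its own line.

```bash
# Assumption: with -f 3,4 the targeted command sees each selected field
# on a separate line, so every line has exactly one awk field ($1).
doubled=$(printf '300\n400\n' | awk '{print $1*2}')
printf '%s\n' "$doubled"
```

Under that assumption there is no third field on any line, which is why $3 would not work here.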

As some of you may have noticed, the argument of -f is compatible with the LIST of cut.

See cut --help for the LIST notation. Here are some examples.

```bash
$ echo "100 200 300 400" | teip -f -3 -- sed 's/./@/g'
@@@ @@@ @@@ 400

$ echo "100 200 300 400" | teip -f 2-4 -- sed 's/./@/g'
100 @@@ @@@ @@@

$ echo "100 200 300 400" | teip -f 1- -- sed 's/./@/g'
@@@ @@@ @@@ @@@
```
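For comparison, the same three LIST values can be handed to cut itself. This is only a side-by-side sketch of the notation using cut, with a single space as the delimiter; the variable names are illustrative.

```bash
# The LIST notation as interpreted by cut.
first_three=$(echo "100 200 300 400" | cut -d ' ' -f -3)   # fields 1 to 3
two_to_four=$(echo "100 200 300 400" | cut -d ' ' -f 2-4)  # fields 2 to 4
from_one=$(echo "100 200 300 400" | cut -d ' ' -f 1-)      # field 1 to the end
printf '%s\n' "$first_three" "$two_to_four" "$from_one"
```

The fields that cut prints are the ones teip masks with @ in the examples above.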

Select range by character

The -c option allows you to select a range by character position. The example below selects the 1st, 3rd, 5th, and 7th characters and applies the sed command to them.

```bash
$ echo ABCDEFG | teip -c 1,3,5,7
[A]B[C]D[E]F[G]

$ echo ABCDEFG | teip -c 1,3,5,7 -- sed 's/./@/'
@B@D@F@
```

As with -f, the argument of -c is compatible with cut's LIST.

Processing delimited text like CSV, TSV

By default, the -f option recognizes whitespace-delimited fields, like awk.

Consecutive whitespace characters (all forms of whitespace categorized by Unicode) are interpreted as a single delimiter.

bash $ printf "A B \t\t\t\ C \t D" | teip -f 3 -- sed s/./@@@@/ A B @@@@ C D

This behavior might be inconvenient when processing CSV and TSV.

However, the -d option can be used together with -f to specify a delimiter. Now you can process a CSV file like this.

```bash
$ echo "100,200,300,400" | teip -f 3 -d , -- sed 's/./@/g'
100,200,@@@,400
```

To process TSV, you need to type the TAB character. If you are using Bash, type $'\t', which is a form of ANSI-C Quoting.

bash $ printf "100\t200\t300\t400\n" | teip -f 3 -d $'\t' -- sed 's/./@/g' 100 200 @@@ 400
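If your shell does not support ANSI-C Quoting, a literal TAB can also be produced with printf. The sketch below uses cut rather than teip to show that the variable really holds a TAB; the variable names are illustrative, and passing "$tab" to teip's -d in the same way is an assumption, not something this document demonstrates.

```bash
# Portable way to obtain a literal TAB without $'\t'.
tab=$(printf '\t')
third=$(printf '100\t200\t300\t400\n' | cut -d "$tab" -f 3)
printf '%s\n' "$third"
```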

teip also provides the -D option to specify an extended regular expression as the delimiter. This is useful when you want to ignore consecutive delimiters, or when there are multiple types of delimiters.

```bash
$ echo 'A,,,,,B,,,,C' | teip -f 2 -D ',+'
A,,,,,[B],,,,C
```

```bash
$ echo "1970-01-02 03:04:05" | teip -f 2-5 -D '[-: ]'
1970-[01]-[02] [03]:[04]:05
```

The regular expression for the TAB character (\t) can also be specified with the -D option, but -d has slightly better performance. For the available regular expression notation, refer to Rust's regular expression syntax.
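The way a regular-expression delimiter numbers the fields can be checked with awk, which also accepts a regular expression as its field separator. This is only an analogy: awk uses POSIX extended regular expressions while teip uses Rust's, but for a simple bracket expression like this one the field numbering is the same as in the example above.

```bash
line='1970-01-02 03:04:05'
# Any of "-", ":" or " " separates fields, giving six fields in total.
count=$(printf '%s\n' "$line" | awk -F '[-: ]' '{print NF}')
# Fields 2 to 5 are the ones selected by -f 2-5 in the example above.
picked=$(printf '%s\n' "$line" | awk -F '[-: ]' '{print $2, $3, $4, $5}')
printf '%s fields, 2-5: %s\n' "$count" "$picked"
```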

Matching with Regular Expression

You can also specify the range by a regular expression with -r. Here is an example using \d, which matches digits.

```bash
$ echo ABC100EFG200 | teip -r '\d+'
ABC[100]EFG[200]

$ echo ABC100EFG200 | teip -r '\d+' -- sed 's/.*/@@@/g'
ABC@@@EFG@@@
```

This feature is quite versatile and is useful for handling files that have no fixed format, such as logs and Markdown documents.

However, use it with care.

The example below is almost the same as the one above, but \d+ is replaced with \d.

```bash
$ echo ABC100EFG200 | teip -r '\d' -- sed 's/.*/@@@/g'
ABC@@@@@@@@@EFG@@@@@@@@@
```

Although the selected characters are the same, the result is different.

To understand this behavior, you need to know how teip performs "tokenization".

Tokenization

teip divides the standard input into tokens. A token that does not match the pattern will be displayed on the standard output as it is. On the other hand, the matched token is passed to the standard input of a targeted command. After that, the matched token is replaced with the result of the targeted command.

In the next example, the standard input is divided into four tokens as follows.

```bash
$ echo ABC100EFG200 | teip -r '\d+' -- sed 's/.*/@@@/g'
```

```
ABC => Token(1)
100 => Token(2) -- Matched
EFG => Token(3)
200 => Token(4) -- Matched
```

By default, the matched tokens are combined by line breaks and used as the new standard input for the targeted command. Imagine that teip executes the following command in its process.

```bash
$ printf "100\n200\n" | sed 's/.*/@@@/g'
@@@  # => Result of Token(2)
@@@  # => Result of Token(4)
```

(This is not technically accurate, but it shows why $1 is used instead of $3 in one of the examples in "Getting Started".)

After that, each matched token is replaced with the corresponding line of the result.

```
ABC => Token(1)
@@@ => Token(2) -- Replaced
EFG => Token(3)
@@@ => Token(4) -- Replaced
```

Finally, all the tokens are concatenated and the following result is printed.

ABC@@@EFG@@@

In practice, the above process is performed asynchronously, and tokens are printed sequentially as they become available.

Returning to the earlier example, the reason so many @ characters are printed is that the input is broken up into many tokens.

```bash
$ echo ABC100EFG200 | teip -r '\d'
ABC[1][0][0]EFG[2][0][0]
```

teip recognizes input matched with the entire regular expression as a single token. \d matches a single digit, and it results in many tokens.

```
ABC => Token(1)
1   => Token(2) -- Matched
0   => Token(3) -- Matched
0   => Token(4) -- Matched
EFG => Token(5)
2   => Token(6) -- Matched
0   => Token(7) -- Matched
0   => Token(8) -- Matched
```

Therefore, sed receives many newline-separated lines.

```bash
$ printf "1\n0\n0\n2\n0\n0\n" | sed 's/.*/@@@/g'
@@@  # => Result of Token(2)
@@@  # => Result of Token(3)
@@@  # => Result of Token(4)
@@@  # => Result of Token(6)
@@@  # => Result of Token(7)
@@@  # => Result of Token(8)
```

The final tokens are as follows.

```
ABC => Token(1)
@@@ => Token(2) -- Replaced
@@@ => Token(3) -- Replaced
@@@ => Token(4) -- Replaced
EFG => Token(5)
@@@ => Token(6) -- Replaced
@@@ => Token(7) -- Replaced
@@@ => Token(8) -- Replaced
```

And, here is the final result.

ABC@@@@@@@@@EFG@@@@@@@@@
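The walkthrough above can be approximated with standard tools. The following is a rough sketch, not how teip is implemented: `emulate` is a hypothetical helper, the pattern is a POSIX extended regular expression (so `[0-9]` stands in for `\d`), everything runs sequentially, and `grep -o` is assumed to be available (it is in GNU and BSD grep).

```bash
# emulate PATTERN INPUT: rough stand-in for
#   echo INPUT | teip -r PATTERN -- sed 's/.*/@@@/g'
emulate() {
  printf '%s\n' "$2" |
    grep -oE "$1" |          # 1. matched tokens, one per line
    sed 's/.*/@@@/g' |       # 2. the targeted command, run once
    awk -v s="$2" -v re="$1" '
      # 3. replace the next match in s with the current output line
      match(s, re) { out = out substr(s, 1, RSTART - 1) $0
                     s = substr(s, RSTART + RLENGTH) }
      END { print out s }'
}

whole_numbers=$(emulate '[0-9]+' ABC100EFG200)
single_digits=$(emulate '[0-9]' ABC100EFG200)
printf '%s\n' "$whole_numbers" "$single_digits"
```

With `[0-9]+` the targeted command receives two lines, and with `[0-9]` it receives six, which reproduces the two different results shown above.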

The concept of tokenization is also used for other options. For example, if you use -f to specify a range of A-B, each field will be a separate token. Also, the field delimiter is always an unmatched token.

```bash
$ echo "AA,BB,CC" | teip -f 2-3 -d,
AA,[BB],[CC]
```

With the -c option, adjacent characters are treated as the same token even if they are separated by ,.

```bash
$ echo "ABCDEFGHI" | teip -c1,2,3,7-9
[ABC]DEF[GHI]
```

What command can be used?

As explained, teip replaces tokens on a line-by-line basis. Therefore, a targeted command must follow the rule below.

A targeted command must print a single line of result for each line of input.

As the simplest example, cat always succeeds because it prints the same number of lines as its input.

```bash
$ echo ABCDEF | teip -r . -- cat
ABCDEF
```

If the above rule is not satisfied, the result will be inconsistent. For example, grep may fail. Here is an example.

```bash
$ echo ABCDEF | teip -r .
[A][B][C][D][E][F]

$ echo ABCDEF | teip -r . -- grep '[ABC]'
ABC
teip: Output of given command is exhausted

$ echo $?
1
```

teip could not get the results corresponding to the tokens D, E, and F. That is why the above example fails.

If an inconsistency occurs, teip exits with an error message and an exit status of 1.
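One quick way to judge whether a command satisfies the rule is to compare the number of lines it prints with the number of lines it reads. The sketch below does this without teip; `count_out` is a hypothetical helper that feeds six one-character lines (the tokens of the example above) to a command.

```bash
# count_out CMD...: number of lines CMD prints for six one-character lines
count_out() {
  printf 'A\nB\nC\nD\nE\nF\n' | "$@" | wc -l | tr -d ' '
}

cat_lines=$(count_out cat)             # same number of lines as the input
grep_lines=$(count_out grep '[ABC]')   # fewer lines than the input
printf 'cat: %s lines, grep: %s lines\n' "$cat_lines" "$grep_lines"
```

A command that prints fewer lines than it reads, like grep here, leaves some tokens without a result.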

Advanced usage

Solid mode

If you want to use a command that does not satisfy the condition, "A targeted command must print a single line of result for each line of input", enable "Solid mode" which is available with the -s option.

Solid mode spawns the targeted command for each matched token and executes it each time.

bash $ echo ABCDEF | teip -s -r . -- grep '[ABC]'

In the above example, teip runs grep '[ABC]' once for each matched token (A, B, C, D, E, and F), each time with that single token as the standard input.

An empty result is replaced with an empty string. Therefore, the D, E, and F tokens are replaced with empty strings, as expected.
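The per-token execution can be pictured as a shell loop. This is a simplified sketch of the idea rather than teip's actual procedure; the token list is written out by hand and the variable names are illustrative.

```bash
result=''
for token in A B C D E F; do
  # One grep process per token. A non-matching token yields empty output,
  # and "|| true" keeps grep's exit status 1 from aborting the script.
  out=$(printf '%s\n' "$token" | grep '[ABC]' || true)
  result="$result$out"
done
printf '%s\n' "$result"
```

Six separate processes are started for six tokens, which also hints at why this mode is slower on large inputs.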

```bash
$ echo ABCDEF | teip -s -r . -- grep '[ABC]'
ABC

$ echo $?
0
```

However, this option is not suitable for processing large files: because the targeted command is spawned for every matched token, performance may degrade significantly.

Overlay `teip`s

Any command can be used with teip, even, surprisingly, teip itself.

```bash
$ echo "AAA@@@@@AAA@@@@@AAA" | teip -r '@.*@'
AAA[@@@@@AAA@@@@@]AAA

$ echo "AAA@@@@@AAA@@@@@AAA" | teip -r '@.*@' -- teip -r 'A+'
AAA@@@@@[AAA]@@@@@AAA

$ echo "AAA@@@@@AAA@@@@@AAA" | teip -r '@.*@' -- teip -r 'A+' -- tr A _
AAA@@@@@___@@@@@AAA
```

In other words, you can combine multiple teip features with AND conditions for more complex range selection. Furthermore, the nested teip commands run asynchronously as multiple processes, similar to a shell pipeline. Performance will hardly degrade unless the machine reaches the limits of its parallelism.

Empty token

If a blank field exists when the -f option is used, the blank is not ignored but treated as an empty token.

```bash
$ echo ',,,' | teip -d , -f 1-
[],[],[],[]
```

Therefore, the following command can work (Note that * matches empty as well).

```bash
$ echo ',,,' | teip -f 1- -d, -- sed 's/.*/@@@/'
@@@,@@@,@@@,@@@
```

In the above example, sed receives four empty lines and prints @@@ four times.
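The same effect can be reproduced by hand. The sketch below feeds four empty lines to sed, standing in for the four empty tokens, and joins the results with commas using paste; it illustrates the idea and is not teip's internal procedure.

```bash
# Four empty lines in, four "@@@" lines out, joined back with commas.
joined=$(printf '\n\n\n\n' | sed 's/.*/@@@/' | paste -sd, -)
printf '%s\n' "$joined"
```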

Invert match

The -v option allows you to invert the selected range. When the -f or -c option is used, the complement of the selected field is selected instead.

```bash
$ echo 1 2 3 4 5 | teip -v -f 1,3,5 -- sed 's/./_/'
1 _ 3 _ 5
```

Of course, it can also be used with the -r option.

```bash
$ printf 'AAA\n123\nBBB\n' | teip -vr '\d+' -- sed 's/./@/g'
@@@
123
@@@
```

NUL as line delimiter

If you want to process the data in a more flexible way, the -z option may be useful. This option makes teip use the NUL character as the line delimiter instead of the newline character. It behaves like the -z option provided by GNU sed or GNU grep, or the -0 option provided by xargs.

```bash
$ printf '111,\n222,33\n3\0\n444,55\n5,666\n' | teip -z -f3 -d,
111,
222,[33
3]
444,55
5,[666]
```

With this option, the standard input is split at each NUL character rather than at each newline character. Also note that, in teip's procedure, matched tokens are joined with the NUL character instead of a newline character.

In other words, if you use a targeted command that cannot handle NUL characters (and cannot print NUL-separated results), the final result can be unintended.

```bash
$ printf '111,\n222,33\n3\0\n444,55\n5,666\n' | teip -z -f3 -d, -- sed -z 's/.*/@@@/g'
111,
222,@@@
444,55
5,@@@

$ printf '111,\n222,33\n3\0\n444,55\n5,666\n' | teip -z -f3 -d, -- sed 's/.*/@@@/g'
111,
222,@@@
@@@
444,55
5,teip: Output of given command is exhausted
```
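To see how differently the sample input is divided under the two delimiters, you can count the delimiter characters with tr and wc. This sketch only inspects the input and does not involve teip; the variable names are illustrative.

```bash
# The sample input, kept as a printf format string.
data='111,\n222,33\n3\0\n444,55\n5,666\n'
# Keep only NUL (or only newline) characters and count them.
nul_count=$(printf "$data" | tr -cd '\0' | wc -c | tr -d ' ')
newline_count=$(printf "$data" | tr -cd '\n' | wc -c | tr -d ' ')
printf 'NUL: %s, newline: %s\n' "$nul_count" "$newline_count"
```

The single NUL splits the input into two chunks, whereas the five newlines would split it into five lines. A command that works per newline therefore returns a different number of results than the NUL-delimited mode expects.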

Selecting a range that spans from one line to another is a typical use case for this option.

```bash $ cat test.html | teip -z -r '.*' AAA [

AAA

BBB

CCC

]

$ cat test.html | teip -z -r '.*' -- grep -a BBB AAA

BBB

```

Environment variables

teip refers to the following environment variables. To change them, add the corresponding statements to your default shell's startup file (e.g. .bashrc or .zshrc).

`TEIP_HIGHLIGHT`

DEFAULT VALUE: \x1b[36m[\x1b[0m\x1b[01;31m{}\x1b[0m\x1b[36m]\x1b[0m

The default format for highlighting matched tokens. It must include at least one {} as a placeholder.

Example:
```
$ export TEIP_HIGHLIGHT="<<<{}>>>"
$ echo ABAB | teip -r A
<<<A>>>B<<<A>>>B

$ export TEIP_HIGHLIGHT=$'\x1b[01;31m{}\x1b[0m'
$ echo ABAB | teip -r A
ABAB  ### Same color as grep
```

ANSI Escape Sequences and ANSI-C Quoting are helpful to customize this value.

Background

Why made it?

See this post.

Why "teip"?

It comes from the Irish verb "teip", which means "fail" and can also mean "blank out" or "cut off".
It also sounds similar to masking "tape".

License

Modules imported from other repositories

Thank you so much for helpful modules!

./src/list/ranges.rs
- One of the modules used in the cut command of uutils/coreutils
- The original source code is distributed under the MIT license
- The license file is in the same directory

Source code

The scripts are available as open source under the terms of the MIT License.

Logo

The logo of teip is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.