# gb
A command that does a SQL-like "group by" on delimited files or arbitrary lines of text.

gb takes delimited data (like csv files) or lines of text (like a log file) and emulates a SQL select-group-by on that data. It is a utility partially inspired by xsv and by the desire to stop writing the same perl one-liners to analyze massive log files.
It does this very fast by "slicing" blocks of data on line boundary points and forwarding those line-even blocks to multiple parser threads.
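gb's source is not shown here, so purely as an illustration of the idea, here is a minimal Rust sketch (the buffer size, names, and stand-in "work" are invented for the example): read a fixed-size chunk, cut it back to the last newline so each forwarded block contains only whole lines, and carry the remainder into the next read.

```rust
use std::io::{self, Read};

fn main() -> io::Result<()> {
    const BLOCK: usize = 64 * 1024; // arbitrary chunk size for the sketch
    let mut input = io::stdin();
    let mut carry: Vec<u8> = Vec::new(); // partial line left over from the last chunk
    let mut buf = vec![0u8; BLOCK];
    loop {
        let n = input.read(&mut buf)?;
        if n == 0 {
            break;
        }
        carry.extend_from_slice(&buf[..n]);
        // Everything up to and including the last newline is "line-even"
        // and safe to hand off; the tail stays in `carry`.
        if let Some(pos) = carry.iter().rposition(|&b| b == b'\n') {
            let block: Vec<u8> = carry.drain(..=pos).collect();
            // gb would forward this block to one of several parser threads;
            // the sketch just processes it inline.
            process_block(&block);
        }
    }
    if !carry.is_empty() {
        process_block(&carry); // final line with no trailing newline
    }
    Ok(())
}

// Stand-in for a parser thread's work: count the lines in a block.
fn process_block(block: &[u8]) {
    let lines = block.iter().filter(|&&b| b == b'\n').count().max(1);
    eprintln!("block: {} bytes, {} lines", block.len(), lines);
}
```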
You identify fields by column number. Each field is either part of the "key" (the group-by), an aggregate (avg or sum), or a count-distinct. You may use anywhere from none to many fields of each kind, as summarized below.
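Inferred from the examples in this readme, the field options map to SQL roughly like this (a count(*) per group appears to be included automatically):

```
-k <cols>   group by <cols>        (comma-separated for a composite key, e.g. -k 2,9)
-s <col>    sum(<col>)
-a <col>    avg(<col>)
-u <col>    count(distinct <col>)
```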
The examples below use some airline flight data taken from http://stat-computing.org/dataexpo/2009/2008.csv.bz2.
Note that this data is truncated a bit here and reformatted from the csv to make it readable.

```
1     2      3           4          5        6           7        8           9
Year  Month  DayofMonth  DayOfWeek  DepTime  CRSDepTime  ArrTime  CRSArrTime  UniqueCarrier ...
2008  1      3           4          2003     1955        2211     2225        WN ...
2008  1      3           4          754      735         1002     1000        WN ...
2008  1      3           4          628      620         804      750         WN ...
2008  1      3           4          926      930         1054     1100        WN ...
2008  1      3           4          1829     1755        1959     1925        WN ...
2008  1      3           4          1940     1915        2121     2110        WN ...
2008  1      3           4          1937     1830        2037     1940        WN ...
2008  1      3           4          1039     1040        1132     1150        WN ...
2008  1      3           4          617      615         652      650         WN ...
....
```

Running the command:

```
gb -f 2008.csv -k 2,9 -s 14 --skip_header -c | head -10
```
How this command corresponds to the SQL:

```
select Month, UniqueCarrier, count(*), sum(AirTime) from csv group by Month, UniqueCarrier
         ^          ^                      ^
         |          |                      |
         +----------+                      |
            -k 2,9                       -s 14
```
Here's a partial output:

```
k:2 | k:9 | count | s:14 | a:14
-----+-----+--------+---------+--------------------
1 | 9E | 22840 | 1539652 | 71.22412915760744
1 | AA | 52390 | 7317245 | 144.5524496246543
1 | AQ | 4026 | 247200 | 61.830915457728864
1 | AS | 12724 | 1587637 | 129.23378103378104
1 | B6 | 16441 | 2477670 | 152.93315227455096
1 | CO | 25168 | 3878167 | 155.51858683883387
1 | DL | 38241 | 4812768 | 130.22967853663818
1 | EV | 23106 | 1584233 | 72.21739526826822
....
```
Another example determines the number of airplanes used and the time spent in the air by each carrier.
```
select Carrier, count(*), sum(AirTime), count(distinct TailNum), sum(AirTime), avg(AirTime) from csv
group by Carrier
```

The following command emulates this:

```
gb -f ~/dl/2008.csv -k 9 -s 14 -u 11 -s 14 -a 14 --skip_header
```
Output:

Note that the output order of the columns does not correspond to the order of the field options. It is fixed to keys, count, sums, avgs, and then uniques.

```
 k:9 | count   | s:14      | s:14      | a:14               | u:11
-----|---------|-----------|-----------|--------------------|------
 9E  | 262109  | 18080077  | 18080077  | 71.11840692300127  | 162
 AA  | 604655  | 82989937  | 82989937  | 141.8001178963196  | 656
 AQ  | 7797    | 479581    | 479581    | 61.889405084527034 | 21
 AS  | 151045  | 19569900  | 19569900  | 131.83977040764768 | 126
 B6  | 196018  | 28849406  | 28849406  | 150.22524356777978 | 154
 CO  | 298342  | 45784515  | 45784515  | 155.86589297447088 | 378
...
```
Alternatively, you can use a regular expression (see the -r option) against each line, where the captured subgroups become the fields and the field indices correspond to the subgroup numbers. This is useful for data that is not as organized as a csv, such as log files.
This example peels the date off of myapp log file names and summarizes ERROR lines based on the 5 to 40 characters that follow the ERROR string. It is also an example of using the file path as part of the reporting.

```
gb --walk /some/log/directory -p 'myapp.*(2019-\d\d-\d\d).log' -r '.*ERROR(.{5,40}).*' -k 1,2
```
Here subgroups 1 and 2 are used to create a composite key from the date in the log file name and the bit of text after the ERROR string in the log line.
If you want to test how a line of text and your regular expression interact, use the options -R "some regular expression" and -L "line of text" to see the subgroups gb will find.
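For example, to see what the ERROR pattern above captures from a made-up log line (both arguments here are hypothetical):

```
gb -R '.*ERROR(.{5,40}).*' -L '2019-03-04 12:00:01 ERROR disk full on /var/data'
```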
gb --help

```
csv-groupby ver: 0.8.0 rev: b5aad23 date: 2020-08-17
Execute a sql-like group-by on arbitrary text or csv files. Field indices start at 1.

Note that -l, -f, and -i define where data comes from. If none of these options is
given then it defaults to reading stdin.

USAGE:
    gb [FLAGS] [OPTIONS]

FLAGS:
    -c, --csvoutput               Write delimited output
    -v                            Verbosity - use more than one v for greater detail
        --skip_header             Skip the first (header) line
        --norecordcount           Do not write record counts
        --noopproc                Do no real work - no parsing - for testing
    -i                            Read a list of files to parse from stdin
        --stats                   Write stats after processing
        --nooutput                Do not write summary output
        --recycleioblocksdisable  Disable reusing memory io blocks
        --disablekeysort          Disable the key sort
    -E, --printexamples           Prints example usage scenarios - extra help
    -h, --help                    Prints help information
    -V, --version                 Prints version information

OPTIONS:
    -R, --testre    Test a regular expression against a line of text (see -L)
    -e, --escape    csv quote character
    ....
```
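Since gb falls back to stdin when none of -l, -f, or -i is given, it also works at the end of a pipeline. For example, using only flags shown above:

```
bzcat 2008.csv.bz2 | gb -k 2,9 -s 14 --skip_header -c
```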
TODO/ideas:

- Fix -i to also support the -p option
- More better readme - sometimes more is not better