csv-groupby

gb: a command that performs a SQL-like "group by" on delimited files or arbitrary lines of text.

gb takes delimited data (like csv files) or lines of text (like a log file) and emulates a SQL select-group-by on that data. It is a utility partially inspired by xsv and by the desire to stop writing the same perl one-liners to analyze massive log files.

It does this very fast by "slicing" blocks of data at line boundaries and forwarding those line-aligned blocks to multiple parser threads.
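The parser and IO thread counts are tunable with the -t and -I options (see the help output below). A quick sketch, where big.csv and the field numbers are just placeholders:

```
# Group on field 1 and sum field 3, using 24 parser threads and 8 IO threads.
gb -f big.csv -k 1 -s 3 -t 24 -I 8
```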

HOW-TO:

You identify fields by column number (field indices start at 1). Each field can be part of the "key" (the group-by), an aggregate (avg or sum), or a count distinct. You may use anywhere from zero to many fields of each kind, as sketched below.
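For instance, a minimal sketch combining all three kinds of fields (the file sales.csv and its field numbers are hypothetical):

```
# Group by field 1; count rows per group, sum and average field 3,
# and count the distinct values of field 2 within each group.
gb -f sales.csv -k 1 -s 3 -a 3 -u 2
```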

Summaries on a csv file:

Using some airline flight data taken from: http://stat-computing.org/dataexpo/2009/2008.csv.bz2

Note that this data is truncated a bit here and reformatted from the original csv to make it readable.

```
1     2      3           4          5        6           7        8           9
Year  Month  DayofMonth  DayOfWeek  DepTime  CRSDepTime  ArrTime  CRSArrTime  UniqueCarrier ...
2008  1      3           4          2003     1955        2211     2225        WN ...
2008  1      3           4          754      735         1002     1000        WN ...
2008  1      3           4          628      620         804      750         WN ...
2008  1      3           4          926      930         1054     1100        WN ...
2008  1      3           4          1829     1755        1959     1925        WN ...
2008  1      3           4          1940     1915        2121     2110        WN ...
2008  1      3           4          1937     1830        2037     1940        WN ...
2008  1      3           4          1039     1040        1132     1150        WN ...
2008  1      3           4          617      615         652      650         WN ...
....
```

Running the command:

```
gb -f 2008.csv -k 2,9 -s 14 --skip_header -c | head -10
```

How this command corresponds to the SQL:

```
select Month, UniqueCarrier, count(*), sum(AirTime) from csv group by Month, UniqueCarrier
         ^         ^                       ^
         |         |                       |
            -k 2,9                        -s 14
```

Here's a partial output:

```
 k:2 | k:9 | count  | s:14    | a:14
-----+-----+--------+---------+--------------------
 1   | 9E  | 22840  | 1539652 | 71.22412915760744
 1   | AA  | 52390  | 7317245 | 144.5524496246543
 1   | AQ  | 4026   | 247200  | 61.830915457728864
 1   | AS  | 12724  | 1587637 | 129.23378103378104
 1   | B6  | 16441  | 2477670 | 152.93315227455096
 1   | CO  | 25168  | 3878167 | 155.51858683883387
 1   | DL  | 38241  | 4812768 | 130.22967853663818
 1   | EV  | 23106  | 1584233 | 72.21739526826822
....
```

Another example determines the number of airplanes (distinct tail numbers) each carrier used and the total time that carrier spent in the air.

The SQL:

```
select Carrier, count(*), sum(AirTime), count(distinct TailNum), sum(AirTime), avg(AirTime) from csv group by Carrier
```

The following command emulates this:

```
gb -f ~/dl/2008.csv -k 9 -s 14 -u 11 -s 14 -a 14 --skip_header
```

Output:

Note that the output order of the columns does not correspond to the order of the field options. It is fixed to keys, count, sums, avgs, and then uniques.

```
 k:9 | count   | s:14      | s:14      | a:14               | u:11
-----|---------|-----------|-----------|--------------------|------
 9E  | 262109  | 18080077  | 18080077  | 71.11840692300127  | 162
 AA  | 604655  | 82989937  | 82989937  | 141.8001178963196  | 656
 AQ  | 7797    | 479581    | 479581    | 61.889405084527034 | 21
 AS  | 151045  | 19569900  | 19569900  | 131.83977040764768 | 126
 B6  | 196018  | 28849406  | 28849406  | 150.22524356777978 | 154
 CO  | 298342  | 45784515  | 45784515  | 155.86589297447088 | 378
...
```

Summaries on arbitrary text

Using regular expression sub groups as data fields.

Alternatively, you can apply a regular expression (see the -r option) to each line; the captured sub-groups become fields, and the field indices correspond to the sub-group numbers. This is useful for data that is not as organized as a csv, such as log files.

This example will peel the date off of myapp log file names and summarize the ERROR lines based on the 5 to 40 characters that follow the ERROR string.

This is an example of using the file path as part of the reporting:

```
gb --walk /some/log/directory -p 'myapp.*(2019-\d\d-\d\d).log' -r '.*ERROR(.{5,40}).*' -k 1,2
```

Here sub-groups 1 and 2 are used to create a composite key: the date from the log file name and the bit of text after the ERROR string in the log line.

Test RE Mode

If you want to test how a line of text and your regular expression interact, use the options -R "some regular expression" and -L "line of text" to see the sub-groups gb will find.
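For example, to check the ERROR pattern from above against a made-up log line (the line of text here is purely illustrative):

```
gb -R '.*ERROR(.{5,40}).*' -L '2019-04-01 10:22:31 ERROR connection timed out to host db01'
```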

Help

```
gb --help
```

```
csv-groupby ver: 0.8.0 rev: b5aad23 date: 2020-08-17
Execute a sql-like group-by on arbitrary text or csv files. Field indices start at 1.

Note that -l, -f, and -i define where data comes from. If none of these options
is given then it defaults to reading stdin.

USAGE:
    gb [FLAGS] [OPTIONS]

FLAGS:
    -c, --csvoutput               Write delimited output
    -v                            Verbosity - use more than one v for greater detail
        --skipheader              Skip the first (header) line
        --norecordcount           Do not write records
        --noopproc                Do no real work - no parsing - for testing
    -i                            Read a list of files to parse from stdin
        --stats                   Write stats after processing
        --nooutput                Do not write summary output
        --recycleioblocksdisable  Disable reusing memory io blocks
        --disablekeysort          Disable the key sort
    -E, --printexamples           Prints example usage scenarios - extra help
    -h, --help                    Prints help information
    -V, --version                 Prints version information

OPTIONS:
    -R, --testre                  One-off test of a regex
    -L, --testline ...            Line(s) of text to test with -R option instead of stdin
    -k, --keyfields ...           Fields that will act as group by keys
    -u, --uniquevalues ...        Fields to count distinct
    -D, --writedistros ...        Write unique value distro with -u option
        --writedistrosupper       Number highest value x count [default: 5]
        --writedistrosbottom      Number lowest value x count [default: 2]
    -s, --sumvalues ...           Sum fields as float64s
    -a, --avgvalues ...           Average fields
    -x, --maxnums ...             Max fields as float64s
    -n, --minnums ...             Min fields as float64s
    -X, --maxstrings ...          Max fields as string
    -N, --minstrings ...          Min fields as string
    -r, --regex ...               Regex mode regular expression to parse fields
    -p, --pathre                  Match path on files and get fields from sub groups
        --relinecontains          Grep lines that must contain a string
    -d, --inputdelimiter          Delimiter if in csv mode [default: ,]
    -q, --quote                   csv quote character
    -e, --escape                  csv escape character
    -C, --comment                 csv mode comment character
    -o, --outputdelimiter         Output delimiter for written summaries [default: ,]
        --emptystring             Empty string substitution [default: ]
    -t, --parsethreads            Number of parser threads [default: 12]
    -I, --iothreads               Number of IO threads [default: 6]
        --queuesize               Queue length of blocks between threads [default: 48]
        --pathqsize               Queue length of paths to IO slicer threads [default: 0]
        --ioblocksize             IO block size - 0 use default [default: 0]
        --qblocksize              Block size between IO thread and worker [default: 256K]
    -l                            A file containing a list of input files
    -f ...                        List of input files
    -w, --walk                    Recursively walk a tree of files to parse
        --nullwrite               String to use for NULL fields [default: NULL]
```
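Because gb falls back to reading stdin when none of -l, -f, or -i is given, it composes with other tools in a pipeline. A sketch reusing the flight data from above (the exact pipeline is just an illustration):

```
# Decompress and summarize in one step: group by carrier (field 9)
# and average AirTime (field 14), reading records from stdin.
bzcat 2008.csv.bz2 | gb -k 9 -a 14 --skip_header
```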

TODO/ideas: