qsv is a command line program for indexing, slicing, analyzing, splitting and joining CSV files. Commands should be simple, fast and composable:

  1. Simple tasks should be easy.
  2. Performance trade-offs should be exposed in the CLI.
  3. Composition should not come at the expense of performance.

This README contains information on how to install qsv, in addition to a quick tour of several commands.


Dual-licensed under MIT or the UNLICENSE.

:warning: NOTE: qsv is a fork of the popular xsv utility, merging several pending PRs since xsv 0.13.0's release, along with additional features & commands for data-wrangling (NEW/EXTENDED commands are marked accordingly).


Available commands

| Command | Description |
| --- | --- |
| apply | Apply a series of string transformations to a CSV column. (NEW) |
| behead | Drop headers from a CSV file. (NEW) |
| cat | Concatenate CSV files by row or by column. |
| count | Count the rows in a CSV file. (Instantaneous with an index.) |
| dedup | Remove redundant rows. (NEW) |
| enum | Add a new column enumerating rows with incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value. (NEW) |
| explode | Explode rows into multiple ones by splitting a column value based on the given separator. (NEW) |
| fill | Fill empty values. (NEW) |
| fixlengths | Force a CSV file to have same-length records by either padding or truncating them. |
| flatten | A flattened view of CSV records. Useful for viewing one record at a time. e.g., `qsv slice -i 5 data.csv \| qsv flatten`. |
| fmt | Reformat CSV data with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.) (EXTENDED) |
| foreach | Loop over a CSV file to execute bash commands. (*nix only) (NEW) |
| frequency | Build frequency tables of each column in CSV data. (Uses parallelism to go faster if an index is present.) |
| headers | Show the headers of CSV data. Or show the intersection of all headers between many CSV files. |
| index | Create an index for a CSV file. This is very quick and provides constant time indexing into the CSV file. |
| input | Read CSV data with exotic quoting/escaping rules. |
| join | Inner, outer and cross joins. Uses a simple hash index to make it fast. |
| jsonl | Convert newline-delimited JSON to CSV. (NEW) |
| lua | Execute a Lua script over CSV lines to transform, aggregate or filter them. (NEW) |
| partition | Partition CSV data based on a column value. |
| pseudo | Pseudonymise the values of the given column by replacing them with an incremental identifier. (NEW) |
| py | Evaluate a Python expression over CSV lines to transform, aggregate or filter them. (NEW) |
| rename | Rename the columns of CSV data efficiently. (NEW) |
| replace | Replace CSV data using a regex. (NEW) |
| reverse | Reverse the order of rows in CSV data. (NEW) |
| sample | Randomly draw rows from CSV data using reservoir sampling (i.e., use memory proportional to the size of the sample). (EXTENDED) |
| search | Run a regex over CSV data. Applies the regex to each field individually and shows only matching rows. |
| select | Select or re-order columns from CSV data. (EXTENDED) |
| slice | Slice rows from any part of a CSV file. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice). |
| sort | Sort CSV data. |
| split | Split one CSV file into many CSV files of N chunks. |
| stats | Show basic types and statistics of each column in the CSV file (i.e., mean, standard deviation, variance, median, min/max, nullcount, etc.). (EXTENDED) |
| table | Show aligned output of any CSV data using elastic tabstops. (EXTENDED) |
| transpose | Transpose rows/columns of CSV data. (NEW) |
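As a quick, hedged illustration of that composability (the file names here are hypothetical), most commands read CSV on stdin and write CSV on stdout, so they chain with ordinary Unix pipes:

```bash
# concatenate two CSV files row-wise, drop duplicate rows,
# then count what remains
$ qsv cat rows a.csv b.csv | qsv dedup | qsv count
```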


A whirlwind tour

Let's say you're playing with some of the data from the Data Science Toolkit, which contains several CSV files. Maybe you're interested in the population counts of each city in the world. So grab the data and start examining it:

```bash
$ curl -LO https://burntsushi.net/stuff/worldcitiespop.csv
$ qsv headers worldcitiespop.csv
1   Country
2   City
3   AccentCity
4   Region
5   Population
6   Latitude
7   Longitude
```

The next thing you might want to do is get an overview of the kind of data that appears in each column. The stats command will do this for you:

```bash
$ qsv stats worldcitiespop.csv --everything | qsv table
field       type     sum                 min           max            min_length  max_length  mean                stddev              variance            median      mode         cardinality  nullcount
Country     Unicode                      ad            zw             2           2                                                                       cn           234          0
City        Unicode                      bab el ahmar  Þykkvibaer     1           91                                                                      san jose     2351892      0
AccentCity  Unicode                      Bâb el Ahmar  ïn Bou Chella  1           91                                                                      San Antonio  2375760      0
Region      Unicode                      00            Z9             0           2                                                           13          04           397          8
Population  Integer  2289584999          7             31480498       0           8           47719.570633597126  302885.5592040396   91739661974.34377   10779                    28754        3125978
Latitude    Float    86294096.37312101   -54.933333    82.483333      1           12          27.188165808468785  21.95261384912504   481.91725480879654  32.4972221  51.15        1038349      0
Longitude   Float    117718483.57958724  -179.9833333  180            1           14          37.08885989656418   63.223010459241635  3997.1490515293776  35.28       23.8         1167162      0
```

The qsv table command takes any CSV data and formats it into aligned columns using elastic tabstops. You'll notice that it even gets alignment right with respect to Unicode characters.
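Here's a minimal sketch of table in action (the inline data is made up, and the exact padding may differ):

```bash
$ printf 'name,population\nSão Paulo,12325232\nOslo,697010\n' | qsv table
name       population
São Paulo  12325232
Oslo       697010
```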

So, this command takes about 12 seconds to run on my machine, but we can speed it up by creating an index and re-running the command:

```bash
$ qsv index worldcitiespop.csv
$ qsv stats worldcitiespop.csv --everything | qsv table
...
```

Which cuts it down to about 8 seconds on my machine. (And creating the index takes less than 2 seconds.)

Notably, the same type of "statistics" command in another CSV command line toolkit takes about 2 minutes to produce similar statistics on the same data set.

Creating an index gives us more than just faster statistics gathering. It also makes slice operations extremely fast because only the sliced portion has to be parsed. For example, let's say you wanted to grab the last 10 records:

```bash
$ qsv count worldcitiespop.csv
3173958
$ qsv slice worldcitiespop.csv -s 3173948 | qsv table
Country  City               AccentCity         Region  Population  Latitude     Longitude
zw       zibalonkwe         Zibalonkwe         06                  -19.8333333  27.4666667
zw       zibunkululu        Zibunkululu        06                  -19.6666667  27.6166667
zw       ziga               Ziga               06                  -19.2166667  27.4833333
zw       zikamanas village  Zikamanas Village  00                  -18.2166667  27.95
zw       zimbabwe           Zimbabwe           07                  -20.2666667  30.9166667
zw       zimre park         Zimre Park         04                  -17.8661111  31.2136111
zw       ziyakamanas        Ziyakamanas        00                  -18.2166667  27.95
zw       zizalisari         Zizalisari         04                  -17.7588889  31.0105556
zw       zuzumba            Zuzumba            06                  -20.0333333  27.9333333
zw       zvishavane         Zvishavane         07      79876       -20.3333333  30.0333333
```

These commands are instantaneous because they run in time and memory proportional to the size of the slice (which means they will scale to arbitrarily large CSV data).
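For instance, grabbing a few rows from the middle of the file is just as cheap once the index exists (the row offset here is arbitrary; see `qsv slice --help` for the flags):

```bash
# start at record 1,000,000 and take 3 records
$ qsv slice worldcitiespop.csv -s 1000000 -l 3 | qsv table
```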

Switching gears a little bit, you might not always want to see every column in the CSV data. In this case, maybe we only care about the country, city and population. So let's take a look at 10 "random" rows. We use the --seed parameter so we get a reproducible random sample:

```bash
$ qsv select Country,AccentCity,Population worldcitiespop.csv \
  | qsv sample --seed 42 10 \
  | qsv table
Country  AccentCity    Population
vn       Khánh Tàn
no       Kvalvåg
ir       Bala Dashteh
af       Kam Papin
cn       Peipiao
mz       Chiquefane
ug       Bulukatoni
us       Gourdsville
kr       Hwahungni
fr       Pimouget
```

Whoops! The rows we sampled don't have population counts. How pervasive is that?

```bash
$ qsv frequency worldcitiespop.csv --limit 5
field,value,count
Country,cn,238985
Country,ru,215938
Country,id,176546
Country,us,141989
Country,ir,123872
City,san jose,328
City,san antonio,320
City,santa rosa,296
City,santa cruz,282
City,san juan,255
AccentCity,San Antonio,317
AccentCity,Santa Rosa,296
AccentCity,Santa Cruz,281
AccentCity,San Juan,254
AccentCity,San Miguel,254
Region,04,159916
Region,02,142158
Region,07,126867
Region,03,122161
Region,05,118441
Population,(NULL),3125978
Population,2310,12
Population,3097,11
Population,983,11
Population,2684,11
Latitude,51.15,777
Latitude,51.083333,772
Latitude,50.933333,769
Latitude,51.116667,769
Latitude,51.133333,767
Longitude,23.8,484
Longitude,23.2,477
Longitude,23.05,476
Longitude,25.3,474
Longitude,23.1,459
```

(The qsv frequency command builds a frequency table for each column in the CSV data. This one only took 5 seconds.)
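If you only care about a few columns, frequency also accepts a column selection, which makes it faster still. A hedged sketch, assuming the -s/--select option shown in `qsv frequency --help`:

```bash
# frequency table for just the Population column, top 3 values
$ qsv frequency worldcitiespop.csv --limit 3 -s Population
```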

So it seems that most cities do not have a population count associated with them at all (3,125,978 to be exact). No matter — we can adjust our previous command so that it only shows rows with a population count:

```bash
$ qsv search -s Population '[0-9]' worldcitiespop.csv \
  | qsv select Country,AccentCity,Population \
  | qsv sample --seed 42 10 \
  | tee sample.csv \
  | qsv table
Country  AccentCity          Population
fr       Boissy-Saint-Léger  15451
us       Iselin              17019
ru       Ali-Yurt            7593
ro       Panaci              2308
lu       Baschleiden         185
us       Mayaguez            76503
ch       Vernier             29767
es       Salobreña           10725
ch       Aigle               7897
yt       Ouangani            7273
```

:warning: NOTE: The tee command reads from standard input and writes to both standard output and one or more files at the same time. We do this so we can create the sample.csv file we need for the next step, and pipe the same data to the qsv table command.

Erk. Which country is yt? What continent? No clue, but DataHub.io has a CSV file of country and continent codes. Let's grab it and do a join so we can see which countries and continents these are:

```bash
$ curl -L https://datahub.io/JohnSnowLabs/country-and-continent-codes-list/r/0.csv > country_continent.csv
$ qsv headers country_continent.csv
1   Continent_Name
2   Continent_Code
3   Country_Name
4   Two_Letter_Country_Code
5   Three_Letter_Country_Code
6   Country_Number
$ qsv join --no-case Country sample.csv Two_Letter_Country_Code country_continent.csv | qsv table
Country  AccentCity          Population  Continent_Name  Continent_Code  Country_Name                      Two_Letter_Country_Code  Three_Letter_Country_Code  Country_Number
fr       Boissy-Saint-Léger  15451       Europe          EU              France, French Republic           FR                       FRA                        250
us       Iselin              17019       North America   NA              United States of America          US                       USA                        840
ru       Ali-Yurt            7593        Europe          EU              Russian Federation                RU                       RUS                        643
ru       Ali-Yurt            7593        Asia            AS              Russian Federation                RU                       RUS                        643
ro       Panaci              2308        Europe          EU              Romania                           RO                       ROU                        642
lu       Baschleiden         185         Europe          EU              Luxembourg, Grand Duchy of        LU                       LUX                        442
us       Mayaguez            76503       North America   NA              United States of America          US                       USA                        840
ch       Vernier             29767       Europe          EU              Switzerland, Swiss Confederation  CH                       CHE                        756
es       Salobreña           10725       Europe          EU              Spain, Kingdom of                 ES                       ESP                        724
ch       Aigle               7897        Europe          EU              Switzerland, Swiss Confederation  CH                       CHE                        756
yt       Ouangani            7273        Africa          AF              Mayotte                           YT                       MYT                        175
```

Whoops, now we have the data we want, but there are several unneeded columns, and the columns we do need have overly long names. Also, there are two records for Ali-Yurt, one in Europe and another in Asia, because Russia spans both continents.
We're primarily interested in unique cities per country for the purposes of this tour, so we need to filter the duplicates out.

Also, apart from renaming the columns, I want to reorder them to "Country,Continent,City,Population".

No worries. Let's use the select (so we only get the columns we need, in the order we want), dedup (so we only get unique Country/City combinations) and rename (to give the columns shorter names) commands:

```bash
$ qsv join --no-case Country sample.csv Two_Letter_Country_Code country_continent.csv \
  | qsv select 'Country_Name,Continent_Name,AccentCity,Population' \
  | qsv dedup -s 'Country_Name,AccentCity' \
  | qsv rename Country,Continent,City,Population \
  | qsv table
Country                           Continent      City                Population
France, French Republic           Europe         Boissy-Saint-Léger  15451
Luxembourg, Grand Duchy of        Europe         Baschleiden         185
Mayotte                           Africa         Ouangani            7273
Romania                           Europe         Panaci              2308
Russian Federation                Asia           Ali-Yurt            7593
Spain, Kingdom of                 Europe         Salobreña           10725
Switzerland, Swiss Confederation  Europe         Aigle               7897
Switzerland, Swiss Confederation  Europe         Vernier             29767
United States of America          North America  Iselin              17019
United States of America          North America  Mayaguez            76503
```

Nice! Notice the data is now sorted too! That's because dedup first sorts the CSV records (by internally calling the qsv sort command) to find duplicates.
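You can also call sort directly, of course. A hedged sketch using the sample.csv we saved earlier (assuming the -s/--select and -N/--numeric flags shown in `qsv sort --help`):

```bash
# sort our sample numerically by the Population column
$ qsv sort -N -s Population sample.csv | qsv table
```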

Perhaps we can do this with the original CSV data? All 3.2 million rows in a 145MB file?!

Indeed we can, because qsv is designed for speed: it's written in Rust with amortized memory allocations and uses the performance-focused mimalloc allocator.

```bash
$ qsv join --no-case Country worldcitiespop.csv Two_Letter_Country_Code country_continent.csv \
  | qsv select 'Country_Name,Continent_Name,AccentCity,Population,Latitude,Longitude' \
  | qsv dedup -s 'Country_Name,AccentCity,Latitude,Longitude' --dupes-output dupe-countrycities.csv \
  | qsv rename Country,Continent,City,Population,Latitude,Longitude \
  > worldcitiespop_countrycontinent.csv
$ qsv sample 10 --seed 1729 worldcitiespop_countrycontinent.csv | qsv table
Country                            Continent      City                     Population  Latitude    Longitude
Syrian Arab Republic               Asia           Cheikh Sayad                         36.25       37.2666667
Colombia, Republic of              South America  Tetillo                              3.160288    -76.324643
Egypt, Arab Republic of            Africa         El-Kôm el-Ahmar                      27.0        31.4166667
Bulgaria, Republic of              Europe         Oresha                               42.95       24.1
Poland, Republic of                Europe         Wielka Wies                          54.568121   17.361634
Iran, Islamic Republic of          Asia           Kolah Jub-e Chagalvandi              33.65       48.583333
Congo, Republic of the             Africa         Ebou                                 -1.2833333  15.5869444
Congo, Democratic Republic of the  Africa         Yambo-Engunda                        1           20.666667
Australia, Commonwealth of         Oceania        Cessnock                 16394       -32.832111  151.356232
Brazil, Federative Republic of     South America  Pirapora                             -8.448056   -72.821389
```

We fine-tuned dedup by adding Latitude and Longitude, since there may be multiple cities with the same name in a country. We also specified the --dupes-output option so we get a separate CSV of the duplicate records that were removed.

This whole thing takes about 8 seconds on my machine. The performance of join, in particular, comes from constructing a very simple hash index of one of the CSV files. The join command does an inner join by default, but it also supports left, right and full outer joins.
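A hedged sketch of a left outer join (assuming the --left flag shown in `qsv join --help`): every row of the left-hand file is kept, and rows without a match get empty right-hand columns.

```bash
$ qsv join --left --no-case Country sample.csv Two_Letter_Country_Code country_continent.csv \
  | qsv table
```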

Finally, can we create a CSV file for each country of all its cities? Yes we can, with the partition command:

```bash
$ qsv partition Country bycountry worldcitiespop_countrycontinent.csv
$ cd bycountry
$ ls -1shS
total 191M
 16M ChinaPeoplesRepublicof.csv
 12M RussianFederation.csv
 11M UnitedStatesofAmerica.csv
 11M IndonesiaRepublicof.csv
7.5M IranIslamicRepublicof.csv
...
4.0K CocosKeelingIslands.csv
4.0K PitcairnIslands.csv
4.0K Tokelau.csv
4.0K NorfolkIsland.csv
```
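If the default {}.csv output names don't suit you, partition takes a filename template. A hedged sketch, assuming the --filename option shown in `qsv partition --help` (bycountry2 is a hypothetical output directory):

```bash
# name each output file cities-<Country>.csv instead of <Country>.csv
$ qsv partition --filename 'cities-{}.csv' Country bycountry2 worldcitiespop_countrycontinent.csv
```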

Installation

Binaries for Windows, Linux and macOS are available from GitHub.

Alternatively, you can compile from source by installing Cargo (Rust's package manager) and installing qsv using Cargo:

```bash
cargo install qsv
```

Compiling from this repository works similarly:

```bash
git clone git://github.com/jqnatividad/qsv
cd qsv
cargo build --release
```

Compilation will probably take a few minutes depending on your machine. The binary will end up in ./target/release/qsv.

If you want to squeeze more performance from your build, set this environment variable before compiling:

```bash
export CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'
```

Do note, though, that the resulting binary will only run on machines with the same CPU features as the machine it was compiled on. To find out your CPU architecture and other valid values for target-cpu:

```bash
rustc --print target-cpus
```

Benchmarks

I've compiled some very rough benchmarks of various qsv commands.