qsv is a command line program for indexing, slicing, analyzing, splitting and joining CSV files. Commands should be simple, fast and composable.

This README contains information on how to install qsv, in addition to a quick tour of several commands.
Dual-licensed under MIT or the UNLICENSE.
:warning: NOTE: qsv is a fork of the popular xsv utility, merging several pending PRs since xsv 0.13.0's release, along with additional features & commands for data-wrangling (NEW/EXTENDED commands are marked accordingly).
| Command | Description |
| --- | --- |
| apply | Apply series of string transformations to a CSV column. (NEW) |
| behead | Drop headers from CSV file. (NEW) |
| cat | Concatenate CSV files by row or by column. |
| count | Count the rows in a CSV file. (Instantaneous with an index.) |
| dedup | Remove redundant rows. (NEW) |
| enum | Add a new column enumerating rows with incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value. (NEW) |
| explode | Explode rows into multiple ones by splitting a column value based on the given separator. (NEW) |
| fill | Fill empty values. (NEW) |
| fixlengths | Force a CSV file to have same-length records by either padding or truncating them. |
| flatten | A flattened view of CSV records. Useful for viewing one record at a time. e.g., `qsv slice -i 5 data.csv \| qsv flatten`. |
| fmt | Reformat CSV data with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.) (EXTENDED) |
| foreach | Loop over a CSV file to execute bash commands. (*nix only) (NEW) |
| frequency | Build frequency tables of each column in CSV data. (Uses parallelism to go faster if an index is present.) |
| headers | Show the headers of CSV data. Or show the intersection of all headers between many CSV files. |
| index | Create an index for a CSV file. This is very quick and provides constant time indexing into the CSV file. |
| input | Read CSV data with exotic quoting/escaping rules. |
| join | Inner, outer and cross joins. Uses a simple hash index to make it fast. |
| jsonl | Convert newline-delimited JSON to CSV. (NEW) |
| lua | Execute a Lua script over CSV lines to transform, aggregate or filter them. (NEW) |
| partition | Partition CSV data based on a column value. |
| pseudo | Pseudonymise the values of the given column by replacing them with incremental identifiers. (NEW) |
| py | Evaluate a Python expression over CSV lines to transform, aggregate or filter them. (NEW) |
| rename | Rename the columns of CSV data efficiently. (NEW) |
| replace | Replace CSV data using a regex. (NEW) |
| reverse | Reverse order of rows in CSV data. (NEW) |
| sample | Randomly draw rows from CSV data using reservoir sampling (i.e., use memory proportional to the size of the sample). (EXTENDED) |
| search | Run a regex over CSV data. Applies the regex to each field individually and shows only matching rows. |
| select | Select or re-order columns from CSV data. (EXTENDED) |
| slice | Slice rows from any part of a CSV file. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice). |
| sort | Sort CSV data. |
| split | Split one CSV file into many CSV files, in chunks of N rows each. |
| stats | Show basic types and statistics of each column in the CSV file. (i.e., mean, standard deviation, variance, median, min/max, nullcount, etc.) (EXTENDED) |
| table | Show aligned output of any CSV data using elastic tabstops. (EXTENDED) |
| transpose | Transpose rows/columns of CSV data. (NEW) |
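All of these commands compose over standard Unix pipes. Here's a quick illustrative sketch (the file `data.csv` and the pipeline itself are hypothetical; the flags are the ones shown in each command's usage text):

```bash
# Count the data rows (instantaneous if data.csv has been indexed with `qsv index`).
$ qsv count data.csv

# A typical pipeline: keep two columns, drop rows where Population has no digits,
# and render the survivors as an aligned table.
$ qsv select City,Population data.csv \
    | qsv search -s Population '[0-9]' \
    | qsv table
```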
Let's say you're playing with some of the data from the Data Science Toolkit, which contains several CSV files. Maybe you're interested in the population counts of each city in the world. So grab the data and start examining it:
```bash
$ curl -LO https://burntsushi.net/stuff/worldcitiespop.csv
$ qsv headers worldcitiespop.csv
1 Country
2 City
3 AccentCity
4 Region
5 Population
6 Latitude
7 Longitude
```
The next thing you might want to do is get an overview of the kind of data that appears in each column. The `stats` command will do this for you:
```bash
$ qsv stats worldcitiespop.csv --everything | qsv table
field       type     sum                 min           max            min_length  max_length  mean                stddev              variance            median      mode         cardinality  nullcount
Country     Unicode                      ad            zw             2           2                                                                       cn           234          0
City        Unicode                      bab el ahmar  Þykkvibaer     1           91                                                                      san jose     2351892      0
AccentCity  Unicode                      Bâb el Ahmar  ïn Bou Chella  1           91                                                                      San Antonio  2375760      0
Region      Unicode                      00            Z9             0           2                                                           13          04           397          8
Population  Integer  2289584999          7             31480498       0           8           47719.570633597126  302885.5592040396   91739661974.34377   10779                    28754        3125978
Latitude    Float    86294096.37312101   -54.933333    82.483333      1           12          27.188165808468785  21.95261384912504   481.91725480879654  32.4972221  51.15        1038349      0
Longitude   Float    117718483.57958724  -179.9833333  180            1           14          37.08885989656418   63.223010459241635  3997.1490515293776  35.28       23.8         1167162      0
```
The `qsv table` command takes any CSV data and formats it into aligned columns using elastic tabstops. You'll notice that it even gets alignment right with respect to Unicode characters.
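If some fields are very long, `table` can also truncate them for display. A minimal sketch, assuming the `--condense` flag carried over from xsv:

```bash
# Show the first 3 data rows, truncating each field to 7 characters for display.
$ qsv slice --len 3 worldcitiespop.csv | qsv table --condense 7
```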
So, this command takes about 12 seconds to run on my machine, but we can speed it up by creating an index and re-running the command:
```bash
$ qsv index worldcitiespop.csv
$ qsv stats worldcitiespop.csv --everything | qsv table
...
```
Which cuts it down to about 8 seconds on my machine. (And creating the index takes less than 2 seconds.)
Notably, the same type of "statistics" command in another CSV command line toolkit takes about 2 minutes to produce similar statistics on the same data set.
Creating an index gives us more than just faster statistics gathering. It also makes slice operations extremely fast because only the sliced portion has to be parsed. For example, let's say you wanted to grab the last 10 records:
```bash
$ qsv count worldcitiespop.csv
3173958
$ qsv slice worldcitiespop.csv -s 3173948 | qsv table
Country  City               AccentCity         Region  Population  Latitude     Longitude
zw       zibalonkwe         Zibalonkwe         06                  -19.8333333  27.4666667
zw       zibunkululu        Zibunkululu        06                  -19.6666667  27.6166667
zw       ziga               Ziga               06                  -19.2166667  27.4833333
zw       zikamanas village  Zikamanas Village  00                  -18.2166667  27.95
zw       zimbabwe           Zimbabwe           07                  -20.2666667  30.9166667
zw       zimre park         Zimre Park         04                  -17.8661111  31.2136111
zw       ziyakamanas        Ziyakamanas        00                  -18.2166667  27.95
zw       zizalisari         Zizalisari         04                  -17.7588889  31.0105556
zw       zuzumba            Zuzumba            06                  -20.0333333  27.9333333
zw       zvishavane         Zvishavane         07      79876       -20.3333333  30.0333333
```
These commands are instantaneous because they run in time and memory proportional to the size of the slice (which means they will scale to arbitrarily large CSV data).
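Slices can be addressed from either end, or a single record can be picked out. A sketch using the long forms of the flags from `qsv slice --help`:

```bash
# The first 10 data rows (equivalent to --start 0 --end 10).
$ qsv slice --len 10 worldcitiespop.csv | qsv table

# Only the 101st data row (indexes are zero-based), one field per line.
$ qsv slice --index 100 worldcitiespop.csv | qsv flatten
```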
Switching gears a little bit, you might not always want to see every column in the CSV data. In this case, maybe we only care about the country, city and population. So let's take a look at 10 "random" rows. We use the `--seed` parameter so we get a reproducible random sample:
```bash
$ qsv select Country,AccentCity,Population worldcitiespop.csv \
    | qsv sample --seed 42 10 \
    | qsv table
Country  AccentCity    Population
vn       Khánh Tàn
no       Kvalvåg
ir       Bala Dashteh
af       Kam Papin
cn       Peipiao
mz       Chiquefane
ug       Bulukatoni
us       Gourdsville
kr       Hwahungni
fr       Pimouget
```
Whoops! The rows we sampled don't have population counts. How pervasive is that?
```bash
$ qsv frequency worldcitiespop.csv --limit 5
field,value,count
Country,cn,238985
Country,ru,215938
Country,id,176546
Country,us,141989
Country,ir,123872
City,san jose,328
City,san antonio,320
City,santa rosa,296
City,santa cruz,282
City,san juan,255
AccentCity,San Antonio,317
AccentCity,Santa Rosa,296
AccentCity,Santa Cruz,281
AccentCity,San Juan,254
AccentCity,San Miguel,254
Region,04,159916
Region,02,142158
Region,07,126867
Region,03,122161
Region,05,118441
Population,(NULL),3125978
Population,2310,12
Population,3097,11
Population,983,11
Population,2684,11
Latitude,51.15,777
Latitude,51.083333,772
Latitude,50.933333,769
Latitude,51.116667,769
Latitude,51.133333,767
Longitude,23.8,484
Longitude,23.2,477
Longitude,23.05,476
Longitude,25.3,474
Longitude,23.1,459
```
(The `qsv frequency` command builds a frequency table for each column in the CSV data. This one only took 5 seconds.)
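If you only care about a subset of columns, `frequency` accepts the same `--select` flag as the other commands. A minimal sketch, limited to the Population column:

```bash
# Frequency table for just the Population column, top 3 values.
$ qsv frequency --select Population --limit 3 worldcitiespop.csv
```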
So it seems that most cities do not have a population count associated with them at all (3,125,978 to be exact). No matter — we can adjust our previous command so that it only shows rows with a population count:
```bash
$ qsv search -s Population '[0-9]' worldcitiespop.csv \
    | qsv select Country,AccentCity,Population \
    | qsv sample --seed 42 10 \
    | tee sample.csv \
    | qsv table
Country  AccentCity          Population
fr       Boissy-Saint-Léger  15451
us       Iselin              17019
ru       Ali-Yurt            7593
ro       Panaci              2308
lu       Baschleiden         185
us       Mayaguez            76503
ch       Vernier             29767
es       Salobreña           10725
ch       Aigle               7897
yt       Ouangani            7273
```
:warning: NOTE: The `tee` command reads from standard input and writes to both standard output and one or more files at the same time. We use it here so we can create the `sample.csv` file we need for the next step, while still piping the same data to the `qsv table` command.
Erk. Which country is `yt`? What continent? No clue, but DataHub.io has a CSV file of country and continent codes. Let's grab it (saving it as `country_continent.csv`) and do a join so we can see which countries and continents these are:
```bash
$ curl -L https://datahub.io/JohnSnowLabs/country-and-continent-codes-list/r/0.csv > country_continent.csv
$ qsv headers country_continent.csv
1 Continent_Name
2 Continent_Code
3 Country_Name
4 Two_Letter_Country_Code
5 Three_Letter_Country_Code
6 Country_Number
$ qsv join --no-case Country sample.csv Two_Letter_Country_Code country_continent.csv | qsv table
Country  AccentCity          Population  Continent_Name  Continent_Code  Country_Name                      Two_Letter_Country_Code  Three_Letter_Country_Code  Country_Number
fr       Boissy-Saint-Léger  15451       Europe          EU              France, French Republic           FR                       FRA                        250
us       Iselin              17019       North America   NA              United States of America          US                       USA                        840
ru       Ali-Yurt            7593        Europe          EU              Russian Federation                RU                       RUS                        643
ru       Ali-Yurt            7593        Asia            AS              Russian Federation                RU                       RUS                        643
ro       Panaci              2308        Europe          EU              Romania                           RO                       ROU                        642
lu       Baschleiden         185         Europe          EU              Luxembourg, Grand Duchy of        LU                       LUX                        442
us       Mayaguez            76503       North America   NA              United States of America          US                       USA                        840
ch       Vernier             29767       Europe          EU              Switzerland, Swiss Confederation  CH                       CHE                        756
es       Salobreña           10725       Europe          EU              Spain, Kingdom of                 ES                       ESP                        724
ch       Aigle               7897        Europe          EU              Switzerland, Swiss Confederation  CH                       CHE                        756
yt       Ouangani            7273        Africa          AF              Mayotte                           YT                       MYT                        175
```
Whoops, now we have the data, but we also have several unneeded columns, and the columns we do need have overly long names. There are also two records for Ali-Yurt, one in Europe and another in Asia. This is because Russia spans both continents. We're primarily interested in unique cities per country for the purposes of this tour, so we need to filter these out. And apart from renaming the columns, I want to reorder them to "Country,Continent,City,Population".
No worries. Let's use the `select` (so we only get the columns we need, in the order we want), `dedup` (so we only get unique Country/City combinations) and `rename` (rename the columns with shorter names) commands:
```bash
$ qsv join --no-case Country sample.csv Two_Letter_Country_Code country_continent.csv \
    | qsv select 'Country_Name,Continent_Name,AccentCity,Population' \
    | qsv dedup -s 'Country_Name,AccentCity' \
    | qsv rename Country,Continent,City,Population \
    | qsv table
Country                           Continent      City                Population
France, French Republic           Europe         Boissy-Saint-Léger  15451
Luxembourg, Grand Duchy of        Europe         Baschleiden         185
Mayotte                           Africa         Ouangani            7273
Romania                           Europe         Panaci              2308
Russian Federation                Asia           Ali-Yurt            7593
Spain, Kingdom of                 Europe         Salobreña           10725
Switzerland, Swiss Confederation  Europe         Aigle               7897
Switzerland, Swiss Confederation  Europe         Vernier             29767
United States of America          North America  Iselin              17019
United States of America          North America  Mayaguez            76503
```
Nice! Notice the data is now sorted too! That's because `dedup` first sorts the CSV records (by internally calling the `qsv sort` command) to find duplicates.
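If you want the sorted order without dropping anything, you can call `qsv sort` directly. A sketch, assuming sort's `--select` and `--numeric` flags behave as they do in xsv:

```bash
# Sort our sample by Population, comparing numerically rather than lexicographically.
$ qsv sort --numeric --select Population sample.csv | qsv table
```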
Perhaps we can do this with the original CSV data? All 3.2 million rows in a 145MB file?!
Indeed we can, because `qsv` is designed for speed: it's written in Rust with amortized memory allocations, using the performance-focused mimalloc allocator.
```bash
$ qsv join --no-case Country worldcitiespop.csv Two_Letter_Country_Code country_continent.csv \
    | qsv select 'Country_Name,Continent_Name,AccentCity,Population,Latitude,Longitude' \
    | qsv dedup -s 'Country_Name,AccentCity,Latitude,Longitude' --dupes-output dupe-countrycities.csv \
    | qsv rename Country,Continent,City,Population,Latitude,Longitude \
    > worldcitiespop_countrycontinent.csv
$ qsv sample 10 --seed 1729 worldcitiespop_countrycontinent.csv | qsv table
Country                            Continent      City                     Population  Latitude    Longitude
Syrian Arab Republic               Asia           Cheikh Sayad                         36.25       37.2666667
Colombia, Republic of              South America  Tetillo                              3.160288    -76.324643
Egypt, Arab Republic of            Africa         El-Kôm el-Ahmar                      27.0        31.4166667
Bulgaria, Republic of              Europe         Oresha                               42.95       24.1
Poland, Republic of                Europe         Wielka Wies                          54.568121   17.361634
Iran, Islamic Republic of          Asia           Kolah Jub-e Chagalvandi              33.65       48.583333
Congo, Republic of the             Africa         Ebou                                 -1.2833333  15.5869444
Congo, Democratic Republic of the  Africa         Yambo-Engunda                        1           20.666667
Australia, Commonwealth of         Oceania        Cessnock                 16394       -32.832111  151.356232
Brazil, Federative Republic of     South America  Pirapora                             -8.448056   -72.821389
```
We fine-tuned `dedup` by adding `Latitude` and `Longitude`, as there may be multiple cities with the same name in a country. We also specified the `--dupes-output` option so we have a separate CSV of the duplicate records it removed.
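Since the removed duplicates land in their own file, they can be audited with the same tools, e.g.:

```bash
# How many duplicate records did dedup drop?
$ qsv count dupe-countrycities.csv
```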
This whole thing takes about 8 seconds on my machine. The performance of `join`, in particular, comes from constructing a very simple hash index of one of the CSV files. The `join` command does an inner join by default, but it also supports left, right and full outer joins.
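For example, to keep cities whose country code has no match in the lookup file (instead of silently dropping them, as the inner join does), a sketch using the `--left` flag:

```bash
# Left outer join: every row of sample.csv survives; unmatched rows get
# empty fields for the country_continent.csv columns.
$ qsv join --left --no-case Country sample.csv Two_Letter_Country_Code country_continent.csv \
    | qsv table
```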
Finally, can we create a CSV file for each country of all its cities? Yes we can, with the `partition` command:
```bash
$ qsv partition Country bycountry worldcitiespop_countrycontinent.csv
$ cd bycountry
$ ls -1shS
total 191M
 16M ChinaPeoplesRepublicof.csv
 12M RussianFederation.csv
 11M UnitedStatesofAmerica.csv
 11M IndonesiaRepublicof.csv
7.5M IranIslamicRepublicof.csv
...
4.0K CocosKeelingIslands.csv
4.0K PitcairnIslands.csv
4.0K Tokelau.csv
4.0K NorfolkIsland.csv
```
Binaries for Windows, Linux and macOS are available from GitHub.
Alternatively, you can compile from source by installing Cargo (Rust's package manager) and installing `qsv` using Cargo:
```bash
cargo install qsv
```
Compiling from this repository also works similarly:
```bash
git clone https://github.com/jqnatividad/qsv
cd qsv
cargo build --release
```
Compilation will probably take a few minutes depending on your machine. The binary will end up in `./target/release/qsv`.
If you want to squeeze more performance from your build, set this environment variable before compiling:
```bash
export CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'
```
Do note though that the resulting binary will only run on machines with the same architecture as the machine you compiled from. To find out your CPU architecture and other valid values for `target-cpu`:
```bash
rustc --print target-cpus
```
I've compiled some very rough benchmarks of various `qsv` commands.