ℹ️ NOTE: qsv is a fork of the popular xsv utility, merging several pending PRs since xsv 0.13.0's May 2018 release. On top of xsv's 20 commands, it adds numerous new features, 27 additional commands, 6 `apply` subcommands & 31 `apply` operations (for a total of 84). See FAQ for more details.
| Command | Description |
| --- | --- |
| apply[^1] | Apply a series of string, date, math, currency & geocoding transformations to a CSV column. It also has some basic NLP functions (similarity, sentiment analysis, profanity, eudex & language detection). |
| behead | Drop headers from a CSV. |
| cat | Concatenate CSV files by row or by column. |
| count[^2] | Count the rows in a CSV file. (Instantaneous with an index.) |
| dedup[^3] | Remove duplicate rows (see also the `extsort` & `sortcheck` commands). |
| enum | Add a new column enumerating rows with incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value. |
| excel | Exports a specified Excel/ODS sheet to a CSV file. |
| exclude[^2] | Removes a set of CSV data from another set based on the specified columns. |
| explode | Explode rows into multiple ones by splitting a column value based on the given separator. |
| extsort[^5] | Sort an arbitrarily large CSV/text file using a multithreaded external merge sort algorithm. |
| fetch | Fetches data from web services for every row using HTTP GET. Comes with jql JSON query language support, dynamic throttling (RateLimit) & caching with optional Redis support for persistent caching. |
| fetchpost | Similar to `fetch`, but uses HTTP POST. (HTTP GET vs POST methods) |
| fill | Fill empty values. |
| fixlengths | Force a CSV to have same-length records by either padding or truncating them. |
| flatten | A flattened view of CSV records. Useful for viewing one record at a time, e.g. `qsv slice -i 5 data.csv \| qsv flatten`. |
| fmt | Reformat a CSV with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.) |
| foreach[^1] | Loop over a CSV to execute bash commands. (not available on Windows) |
| frequency[^2] | Build frequency tables of each column. (Uses multithreading to go faster if an index is present.) |
| generate[^1] | Generate test data by profiling a CSV using Markov decision process machine learning. |
| headers | Show the headers of a CSV. Or show the intersection of all headers between many CSV files. |
| index | Create an index for a CSV. This is very quick & provides constant time indexing into the CSV file. Also enables multithreading for the `frequency`, `split`, `stats` & `schema` commands. |
| input[^2] | Read CSV data with special quoting, trimming, line-skipping & UTF-8 transcoding rules. Typically used to "normalize" a CSV for further processing with other qsv commands. |
| join[^2] | Inner, outer, cross, anti & semi joins. Uses a simple hash index to make it fast. |
| jsonl | Convert newline-delimited JSON (JSONL/NDJSON) to CSV. See the `tojsonl` command to convert CSV to JSONL. |
| luajit[^1] | Execute a LuaJIT (a Just-In-Time compiler for Lua, with partial compatibility with Lua 5.2) script over CSV lines to transform, filter or aggregate them. |
| luau[^1] | Execute a Luau script over CSV lines to transform, filter or aggregate them. |
| partition | Partition a CSV based on a column value. |
| pseudo | Pseudonymise the values of the given column by replacing them with incremental identifiers. |
| py[^1] | Evaluate a Python expression over CSV lines to transform or filter them. Python's f-strings are particularly useful for extended formatting, with the ability to evaluate Python expressions as well. |
| rename | Rename the columns of a CSV efficiently. |
| replace | Replace CSV data using a regex. |
| reverse[^3] | Reverse order of rows in a CSV. Unlike the `sort --reverse` command, it preserves the order of rows with the same key. |
| sample[^2] | Randomly draw rows (with optional seed) from a CSV using reservoir sampling (i.e., use memory proportional to the size of the sample). |
| schema[^4] | Infer schema from CSV data, replete with data type & domain/range validation & output in JSON Schema format. Uses multithreading to go faster if an index is present. See the `validate` command to use the generated JSON Schema to validate if similar CSVs comply with the schema. |
| search | Run a regex over a CSV. Applies the regex to each field individually & shows only matching rows. |
| searchset | Run multiple regexes over a CSV in a single pass. Applies the regexes to each field individually & shows only matching rows. |
| select | Select, re-order, duplicate or drop columns. |
| slice[^2] | Slice rows from any part of a CSV. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice). |
| sniff[^2] | Quickly sniff CSV metadata (delimiter, header row, preamble rows, quote character, flexible, is_utf8, number of records, number of fields, field names & data types). |
| sort[^5] | Sorts CSV data in alphabetical, numerical, reverse or random (with optional seed) order (see also the `extsort` & `sortcheck` commands). |
| sortcheck[^2] | Check if a CSV is sorted. With the `--json` option, also retrieve record count, sort breaks & duplicate count. |
| split[^2] | Split one CSV file into many CSV files of N chunks. (Uses multithreading to go faster if an index is present.) |
| stats[^2][^4] | Infer data type (Null, String, Float, Integer, Date, DateTime) & compute descriptive statistics for each column in a CSV (sum, min/max, min/max length, mean, stddev, variance, nullcount, quartiles, IQR, lower/upper fences, skewness, median, mode & cardinality). Uses multithreading to go faster if an index is present. |
| table[^3] | Show aligned output of a CSV using elastic tabstops. |
| tojsonl[^4] | Smartly converts CSV to newline-delimited JSON (JSONL/NDJSON). By scanning the CSV first, it infers the appropriate JSON data type for each column. See the `jsonl` command to convert JSONL to CSV. |
| transpose[^3] | Transpose rows/columns of a CSV. |
| validate[^2] | Validate CSV data with JSON Schema (see the `schema` command) & put invalid records into a separate file & a validation error report file. If no JSON Schema file is provided, validates if a CSV conforms to the RFC 4180 standard. |
For macOS and Linux (64-bit), you can quickly install qsv with Homebrew. However, only the `apply` feature is enabled.

```bash
brew install qsv
```
Prebuilt binary variants with more enabled features for Windows, Linux & macOS are also available for download, including binaries compiled with Rust Nightly/Unstable (more info).
There are three variants of qsv:

* `qsv` - feature-capable, with the prebuilt binaries enabling all features except Python[^7]
* `qsvlite` - all features disabled (~half the size of `qsv`)
* `qsvdp` - optimized for use with DataPusher+, with only DataPusher+ relevant commands & the self-update engine removed (~a sixth of the size of `qsv`).
Alternatively, you can install from source by installing Rust and then installing `qsv` using Rust's cargo command[^6]. Note that building from source requires a C compiler. For example, if you use Ubuntu, you can install the `build-essential` package. On macOS, you can get a C compiler by running `xcode-select --install`. For Windows, this means installing Visual Studio 2022; when prompted for workloads, include "Desktop Development with C++", the Windows 10 or 11 SDK & the English language pack, along with any other language packs you require.

```bash
cargo install qsv --locked --features all_full
```

The binary will be installed in `~/.cargo/bin`.
Compiling from source also works similarly:

```bash
git clone git@github.com:jqnatividad/qsv.git
cd qsv
cargo build --release --locked --features all_full
```

The compiled binary will end up in `./target/release/`.
To enable optional features, use `cargo --features` (see Feature Flags for more info):

```bash
cargo install qsv --locked --features apply,generate,luau,fetch,foreach,python,self_update,full
cargo install qsv --locked --features all_full
cargo install qsv --locked --features lite
cargo install qsv --locked --features datapusher_plus

cargo build --release --locked --features apply,generate,luau,fetch,foreach,python,self_update,full
cargo build --release --locked --features all_full
cargo build --release --locked --features lite
cargo build --release --locked --features datapusher_plus
```
qsv's MSRV policy is to require Rust stable - currently version 1.65.
qsv's command-line options are quite extensive. Thankfully, since it uses docopt for CLI processing, we can take advantage of docopt.rs' tab completion support to make it easier to use qsv at the command-line (currently, only bash shell is supported):
```bash
cargo install docopt
echo "DOCOPT_WORDLIST_BIN=\"$(which docopt-wordlist)\"" >> $HOME/.bash_completion
echo "source \"$(pwd)/scripts/docopt-wordlist.bash\"" >> $HOME/.bash_completion
echo "complete -F _docopt_wordlist_commands qsv" >> $HOME/.bash_completion
```
qsv recognizes UTF-8/ASCII encoded CSV (`.csv`) & TSV (`.tsv` & `.tab`) files. CSV files are assumed to have "," (comma) as the delimiter, and TSV files, "\t" (tab). The delimiter is a single ascii character that can be set either with the `--delimiter` command-line option or the `QSV_DEFAULT_DELIMITER` environment variable, or automatically detected when `QSV_SNIFF_DELIMITER` is set.
When using the `--output` option, note that qsv will UTF-8 encode the file & automatically change the delimiter used in the generated file based on the file extension - i.e. comma for `.csv`, tab for `.tsv` & `.tab` files.
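
For example, a minimal sketch of both behaviors (file names are placeholders):

```bash
# the output delimiter follows the .tsv extension, so this writes a tab-delimited copy
qsv input data.csv --output data.tsv

# semicolons.csv uses ";" as its separator, so override the default comma via the environment
QSV_DEFAULT_DELIMITER=";" qsv count semicolons.csv
```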
JSONL/NDJSON files are also recognized & converted from/to CSV with the `jsonl` and `tojsonl` commands respectively.
The `fetch` & `fetchpost` commands also produce JSONL files when invoked without the `--new-column` option, & TSV files with the `--report` option.
The `sniff`, `sortcheck` & `validate` commands produce JSON files with their `--json` & `--pretty-json` options.

The `schema` command produces a JSON Schema Validation (Draft 7) file with the ".schema.json" file extension, which can be used with the `validate` command.
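
As an illustration, a possible schema-then-validate workflow (the schema file name below assumes the input is data.csv; use whatever file name `schema` actually emits for your input):

```bash
# infer a JSON Schema (Draft 7) from the CSV
qsv schema data.csv

# validate the same (or a similar) CSV against the generated schema
qsv validate data.csv data.csv.schema.json
```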
The `excel` command recognizes Excel & Open Document Spreadsheet (ODS) files (`.xls`, `.xlsx`, `.xlsm`, `.xlsb` & `.ods`).
qsv validates against the RFC 4180 CSV standard. However IRL, CSV formats vary significantly & qsv is actually not strictly compliant with the specification so it can process "real-world" CSV files. qsv leverages the awesome Rust CSV crate to read/write CSV files.
Click here to find out more about how qsv conforms to the standard using this crate.
The following commands require UTF-8 encoded input (of which ASCII is a subset): `dedup`, `exclude`, `fetch`, `fetchpost`, `frequency`, `join`, `schema`, `sort`, `stats` & `validate`.
For these commands, qsv checks if the input is UTF-8 encoded by scanning the first 8k & will abort if it's not, unless `QSV_SKIPUTF8_CHECK` is set. On Linux & macOS, UTF-8 encoding is the default.

This was done to increase the performance of these commands, as they make extensive use of `from_utf8_unchecked` so as not to pay the repetitive UTF-8 validation penalty, no matter how small, even for already UTF-8 encoded files.
Should you need to re-encode CSV/TSV files, you can use the `input` command to transcode to UTF-8. It will replace all invalid UTF-8 sequences with `�`. Alternatively, there are several utilities you can use to do so on Linux/macOS & Windows.
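
A minimal sketch of such a transcode (file names are placeholders):

```bash
# normalize a file with stray non-UTF-8 bytes; invalid sequences are replaced with �
qsv input legacy.csv --output legacy-utf8.csv
```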
Unlike other modern operating systems, Microsoft Windows' default encoding is UTF16-LE. This will cause problems when redirecting qsv's output to a CSV file & trying to open it with Excel (which ignores the comma delimiter, with everything in the first column):

```
qsv stats wcp.csv > wcpstats.csv
```

Which is weird, since you would think Microsoft's own Excel would properly recognize UTF16-LE encoded CSV files. Regardless, to create a properly UTF-8 encoded file, use the `--output` option instead:

```
qsv stats wcp.csv --output wcpstats.csv
```
The `python` feature is NOT enabled by default on the prebuilt binaries, as doing so requires statically linking to Python, which presents distribution issues, as various operating systems have differing bundled Python versions.

If you wish to enable the `python` feature, you'll have to install/compile from source, making sure you have the development libraries for the desired Python version (Python 3.6 & up are supported) installed when doing so.

Note that if you plan to distribute your manually built `qsv` with the `python` feature, `qsv` will look for the Python shared libraries (libpython* on Linux/macOS, python*.dll on Windows) against which it was compiled, starting with the current directory, & abort with an error if not found, detailing the Python library it was looking for. Note that this will happen as soon as the qsv binary is invoked, even if you're not running the `py` command.
PyO3, the underlying crate that enables the `python` feature, uses a build script to determine the Python version & set the correct linker arguments. By default it uses the `python3` executable. You can override the Python interpreter by setting `PYO3_PYTHON` (e.g., `PYO3_PYTHON=python3.6`) before installing/compiling qsv. See the PyO3 User Guide for more information.
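
For example, a sketch of building from source with the `python` feature against a specific interpreter (the interpreter value is illustrative; use whichever Python 3.6+ you have development libraries for):

```bash
# point PyO3 at the desired interpreter, then install qsv with the python feature enabled
PYO3_PYTHON=python3.6 cargo install qsv --locked --features apply,generate,luau,fetch,foreach,python,self_update,full
```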
Consider using the `luau`/`luajit` commands instead of the `py` command if the mapping/filtering operation you're trying to do can be done with `luau`/`luajit`. Lua/Luau is much faster than Python, & LuaJIT is faster still. In addition, Luau/LuaJIT is embedded into qsv, can do aggregations with its `--prologue` & `--epilogue` options & has no external dependencies, unlike Python.
The `py` command cannot do aggregations because PyO3's GIL-bound memory limitations will quickly consume a lot of memory (see issue 449 for details). To prevent this, the `py` command processes CSVs in batches (default: 30,000 records), with a GIL pool for each batch, so no globals are available across batches.
Note, however, that `luau` & `luajit` are mutually exclusive features. Choose `luau` if you want to take advantage of its sandboxing, type-checking, additional operators & various other improvements over Lua. Choose `luajit` if its increased performance over `luau` is a priority and you don't need `luau`'s additional features.

`luau` is the default for the prebuilt binaries, with `luajit` as the fallback if `luau` cannot be built for the target architecture.
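
As a rough, hypothetical sketch of the kind of row-wise mapping the `py` & `luau` commands support (the column name `rate` and the expressions are made up; see each command's `--help` for the exact script syntax):

```bash
# hypothetical: derive a new column with a Python expression (column values are exposed to the expression)
qsv py map rate_pct "float(rate) * 100" data.csv

# hypothetical: the same derivation using the embedded Luau interpreter
qsv luau map rate_pct "rate * 100" data.csv
```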
| Variable | Description |
| --- | --- |
| `QSV_DEFAULT_DELIMITER` | single ascii character to use as delimiter. Overrides the `--delimiter` option. Defaults to "," (comma) for CSV files & "\t" (tab) for TSV files when not set. Note that this will also set the delimiter for qsv's output to stdout. However, using the `--output` option, regardless of this environment variable, will automatically change the delimiter used in the generated file based on the file extension - i.e. comma for `.csv`, tab for `.tsv` & `.tab` files. |
| `QSV_SNIFF_DELIMITER` | if set, the delimiter is automatically detected. Overrides `QSV_DEFAULT_DELIMITER` & the `--delimiter` option. Note that this does not work with stdin. |
| `QSV_NO_HEADERS` | if set, the first row will NOT be interpreted as headers. Supersedes `QSV_TOGGLE_HEADERS`. |
| `QSV_TOGGLE_HEADERS` | if set to `1`, toggles the header setting - i.e. inverts qsv's header behavior, with no headers being the default, & setting `--no-headers` will actually mean headers will not be ignored. |
| `QSV_AUTOINDEX` | if set, automatically create an index when none is detected. Also automatically updates stale indices. |
| `QSV_COMMENT_CHAR` | set to an ascii character. If set, any lines (including the header) that start with this character are ignored. |
| `QSV_MAX_JOBS` | number of jobs to use for multithreaded commands (currently `apply`, `dedup`, `extsort`, `frequency`, `schema`, `sort`, `split`, `stats`, `tojsonl` & `validate`). If not set, max_jobs is set to the detected number of logical processors. See Multithreading for more info. |
| `QSV_NO_UPDATE` | if set, prohibit the self-update version check for the latest qsv release published on GitHub. |
| `QSV_PREFER_DMY` | if set, date parsing will use DMY format. Otherwise, use MDY format (used with the `apply datefmt`, `schema`, `sniff` & `stats` commands). |
| `QSV_REGEX_UNICODE` | if set, makes the `search`, `searchset` & `replace` commands unicode-aware. For increased performance, these commands are not unicode-aware by default & will ignore unicode values when matching & will abort when unicode characters are used in the regex. Note that the `apply` command's `regex_replace` operation is always unicode-aware. |
| `QSV_SKIPUTF8_CHECK` | if set, skip the UTF-8 encoding check. Otherwise, for several commands that require UTF-8 encoded input (see UTF-8 Encoding), qsv scans the first 8k. |
| `QSV_RDR_BUFFER_CAPACITY` | reader buffer size (default (bytes): 16384) |
| `QSV_WTR_BUFFER_CAPACITY` | writer buffer size (default (bytes): 65536) |
| `QSV_LOG_LEVEL` | desired log level (default: off; `error`, `warn`, `info`, `trace`, `debug`). |
| `QSV_LOG_DIR` | when logging is enabled, the directory where the log files will be stored. If the specified directory does not exist, qsv will attempt to create it. If not set, the log files are created in the directory where qsv was started. See Logging for more info. |
| `QSV_PROGRESSBAR` | if set, enable the `--progressbar` option on the `apply`, `fetch`, `fetchpost`, `foreach`, `lua`, `py`, `replace`, `search`, `searchset`, `sortcheck` & `validate` commands. |
| `QSV_REDIS_CONNSTR` | the `fetch` command can use Redis to cache responses. Set to connect to the desired Redis instance (default: `redis:127.0.0.1:6379/1`). For more info on valid Redis connection string formats, see https://docs.rs/redis/latest/redis/#connection-parameters. |
| `QSV_FP_REDIS_CONNSTR` | the `fetchpost` command can also use Redis to cache responses (default: `redis:127.0.0.1:6379/2`). Note that `fetchpost` connects to database 2, as opposed to `fetch`, which connects to database 1. |
| `QSV_REDIS_MAX_POOL_SIZE` | the maximum Redis connection pool size (default: 20). |
| `QSV_REDIS_TTL_SECONDS` | set time-to-live of Redis cached values (default (seconds): 2419200 (28 days)). |
| `QSV_REDIS_TTL_REFRESH` | if set, enables cache hits to refresh the TTL of cached values. |
Several dependencies also have environment variables that influence qsv's performance & behavior:

* the `fetch` & `validate` commands & the `--update` function will honor proxy settings set through the `HTTP_PROXY`, `HTTPS_PROXY` & `NO_PROXY` environment variables.

ℹ️ NOTE: To get a list of all active qsv-relevant environment variables, run `qsv --envlist`. Relevant env vars are defined as anything that starts with `QSV_` & `MIMALLOC_`, & the proxy variables listed above.
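
For example, a quick sketch of tuning a couple of these variables for a session and then confirming what qsv sees (the values are arbitrary):

```bash
# cap multithreaded commands at 4 jobs & turn on progress bars
export QSV_MAX_JOBS=4
export QSV_PROGRESSBAR=1

# list all active qsv-relevant environment variables
qsv --envlist
```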
`qsv` has several features:

* `mimalloc` (default) - use the mimalloc allocator (see Memory Allocator for more info).
* `apply` - enable the `apply` command. This swiss-army knife of CSV transformations is very powerful, but it has a lot of dependencies that increase both compile time and binary size.
* `fetch` - enables the `fetch` & `fetchpost` commands.
* `foreach` - enable the `foreach` command (not valid for Windows).
* `generate` - enable the `generate` command.
* `luau` - enable the `luau` command. Embeds a Luau interpreter into qsv. Luau has type-checking, sandboxing, additional language operators, increased performance & other improvements over Lua.
* `luajit` - enable the `luajit` command. Embeds a LuaJIT interpreter into qsv. LuaJIT is a Just-In-Time compiler for the Lua 5.2 language & is thus faster than Luau. Note that the `luau` & `luajit` interpreters are mutually exclusive features.
* `python` - enable the `py` command. Note that qsv will look for the shared library for the Python version (Python 3.6 & above supported) it was compiled against & will abort if the library is not found, even if you're not using the `py` command. Check the Python section for more info.
* `self_update` - enable the self-update engine, checking GitHub for the latest release. Note that if you manually built qsv, `self-update` will only check for new releases. It will NOT offer the choice to update itself to the prebuilt binaries published on GitHub. You need not worry that your manually built qsv will be overwritten by a self-update.
* `full` - enable to build the qsv binary variant which is feature-capable.
* `all_full` - enable to build the qsv binary variant with all features enabled (apply,fetch,foreach,generate,luau,python,self_update).
* `lite` - enable to build the qsvlite binary variant with all features disabled.
* `datapusher_plus` - enable to build the qsvdp binary variant - the DataPusher+ optimized qsv binary.
* `nightly` - enable to turn on nightly/unstable features in the `rand`, `regex`, `hashbrown`, `parking_lot` & `pyo3` crates when building with Rust nightly/unstable.

ℹ️ NOTE: `qsvlite`, as the name implies, always has non-default features disabled. `qsv` can be built with any combination of the above features using the cargo `--features` & `--no-default-features` flags. The prebuilt `qsv` binaries have all applicable features valid for the target platform[^7].
Dual-licensed under MIT or the UNLICENSE.
This project is unrelated to Intel's Quick Sync Video.