CSV to Parquet

Convert CSV files to Apache Parquet. You may also be interested in json2parquet, csv2arrow, or json2arrow.
Installation
Download prebuilt binaries
You can get the latest releases from https://github.com/domoritz/csv2parquet/releases/.
With Cargo
cargo install csv2parquet
Usage
```
Usage: csv2parquet [OPTIONS] <CSV> <PARQUET>

Arguments:
  <CSV>      Input CSV file
  <PARQUET>  Output file

Options:
  -s, --schema-file
          File with Arrow schema in JSON format
      --max-read-records
          The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed
      --header
          Set whether the CSV file has headers [possible values: true, false]
  -d, --delimiter
          Set the CSV file's column delimiter as a byte character [default: ,]
  -c, --compression
          Set the compression [possible values: uncompressed, snappy, gzip, lzo, brotli, lz4, zstd]
  -e, --encoding
          Sets encoding for any column [possible values: plain, rle, bit-packed, delta-binary-packed, delta-length-byte-array, delta-byte-array, rle-dictionary]
      --data-pagesize-limit
          Sets data page size limit
      --dictionary-pagesize-limit
          Sets dictionary page size limit
      --write-batch-size
          Sets write batch size
      --max-row-group-size
          Sets max size for a row group
      --created-by
          Sets "created by" property
      --dictionary
          Sets flag to enable/disable dictionary encoding for any column
      --statistics
          Sets flag to enable/disable statistics for any column [possible values: none, chunk, page]
      --max-statistics-size
          Sets max statistics size for any column. Applicable only if statistics are enabled
  -p, --print-schema
          Print the schema to stderr
  -n, --dry
          Only print the schema
  -h, --help
          Print help information
  -V, --version
          Print version information
```
The --schema-file option expects the same file format that --dry and --print-schema output.
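When csv2parquet is driven from a script, the options above can be assembled into an argument list. The following is a minimal sketch, not part of csv2parquet itself: the helper names `build_command` and `convert` are hypothetical, and it assumes each option takes its value as a separate argument, as `--header false` and `--schema-file schema.json` do in the examples in this README.

```python
import shlex
import subprocess


def build_command(csv_path, parquet_path, header=None, delimiter=None,
                  compression=None, schema_file=None):
    """Build an argv list for csv2parquet (hypothetical helper).

    Assumes every option takes its value as a separate argument.
    """
    cmd = ["csv2parquet"]
    if header is not None:
        # --header accepts the literal values "true" or "false"
        cmd += ["--header", "true" if header else "false"]
    if delimiter is not None:
        cmd += ["--delimiter", delimiter]
    if compression is not None:
        cmd += ["--compression", compression]
    if schema_file is not None:
        cmd += ["--schema-file", schema_file]
    # Positional arguments: input CSV first, then the output file
    cmd += [csv_path, parquet_path]
    return cmd


def convert(csv_path, parquet_path, **options):
    """Run csv2parquet; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(csv_path, parquet_path, **options), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch also works
    # on machines where csv2parquet is not installed.
    cmd = build_command("data.csv", "data.parquet",
                        header=True, delimiter=";", compression="zstd")
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with delimiters such as `;` or `|`.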
Examples
Convert a CSV to Parquet
```bash
csv2parquet data.csv data.parquet
```
Convert a CSV with no header to Parquet
```bash
csv2parquet --header false <CSV> <PARQUET>
```
Get the schema from a CSV with header
```bash
csv2parquet --header true --dry <CSV> <PARQUET>
```
Convert a CSV to Parquet using a schema file
Below is an example of the schema file's content:
```json
{
  "fields": [
    {
      "name": "col1",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "col2",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}
```
Then pass the schema file (here schema.json) to the --schema-file option:
csv2parquet --header false --schema-file schema.json <CSV> <PARQUET>
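For CSV files with many columns, writing the schema file by hand gets tedious. The sketch below generates a file with the same shape as the example above. The helper name `make_schema` is hypothetical, and the field keys are copied from that example. Only `Utf8` appears in this README, so whether other `data_type` values or additional keys are accepted depends on the installed csv2parquet version; treat those as assumptions to verify.

```python
import json


def make_schema(column_names, data_type="Utf8", nullable=False):
    """Return a schema dict shaped like the schema-file example (hypothetical helper)."""
    return {
        "fields": [
            {
                "name": name,
                "data_type": data_type,
                "nullable": nullable,
                "dict_id": 0,
                "dict_is_ordered": False,
            }
            for name in column_names
        ]
    }


if __name__ == "__main__":
    # Print the schema as JSON; redirect the output to schema.json
    # to use it with --schema-file.
    print(json.dumps(make_schema(["col1", "col2"]), indent=2))
```

Comparing the generated file against the output of `--dry` for the same CSV is a quick way to confirm that the format matches what the installed version expects.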