CSV to Parquet

Convert CSV files to Apache Parquet. You may also be interested in json2parquet, csv2arrow, or json2arrow.

Installation

Download prebuilt binaries

You can get the latest releases from https://github.com/domoritz/csv2parquet/releases/.

With Cargo

cargo install csv2parquet

Usage

```
Usage: csv2parquet [OPTIONS] <CSV> <PARQUET>

Arguments:
  <CSV>      Input CSV file
  <PARQUET>  Output file

Options:
  -s, --schema-file
          File with Arrow schema in JSON format
      --max-read-records
          The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed
      --header
          Set whether the CSV file has headers [possible values: true, false]
  -d, --delimiter
          Set the CSV file's column delimiter as a byte character [default: ,]
  -c, --compression
          Set the compression [possible values: uncompressed, snappy, gzip, lzo, brotli, lz4, zstd]
  -e, --encoding
          Sets encoding for any column [possible values: plain, rle, bit-packed, delta-binary-packed, delta-length-byte-array, delta-byte-array, rle-dictionary]
      --data-pagesize-limit
          Sets data page size limit
      --dictionary-pagesize-limit
          Sets dictionary page size limit
      --write-batch-size
          Sets write batch size
      --max-row-group-size
          Sets max size for a row group
      --created-by
          Sets "created by" property
      --dictionary
          Sets flag to enable/disable dictionary encoding for any column
      --statistics
          Sets flag to enable/disable statistics for any column [possible values: none, chunk, page]
      --max-statistics-size
          Sets max statistics size for any column. Applicable only if statistics are enabled
  -p, --print-schema
          Print the schema to stderr
  -n, --dry
          Only print the schema
  -h, --help
          Print help information
  -V, --version
          Print version information
```

The `--schema-file` option expects the same JSON format that `--dry` and `--print-schema` print.

Examples

Convert a CSV to Parquet

```bash
csv2parquet data.csv data.parquet
```
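Options from the usage text can be combined in a single call. The sketch below assumes a semicolon-delimited input and picks `zstd` compression; `sample.csv`, its contents, and the output name are made up for illustration, and the conversion only runs if `csv2parquet` is found on `PATH`.

```bash
# Write a small semicolon-delimited sample file (hypothetical data).
cat > sample.csv <<'EOF'
id;name;score
1;alice;9.5
2;bob;7.25
EOF

# Convert only if csv2parquet is installed. The flags come from the usage
# text; quoting ';' keeps the shell from treating it as a command separator.
if command -v csv2parquet >/dev/null 2>&1; then
  csv2parquet --header true --delimiter ';' --compression zstd sample.csv sample.parquet
else
  echo "csv2parquet not found on PATH; skipping conversion" >&2
fi
```

Any other value listed under `--compression` can be substituted for `zstd` in the same position.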

Convert a CSV with no `header` to Parquet

```bash
csv2parquet --header false <CSV> <PARQUET>
```

Get the `schema` from a CSV with header

```bash
csv2parquet --header true --dry <CSV> <PARQUET>
```
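According to the usage text, `--max-read-records` controls how many rows are used for schema inference, and a value of zero turns inference off so that every column is typed as a string. A minimal sketch comparing the two with `--dry` follows; `mixed.csv` and its contents are hypothetical, and the commands are skipped when `csv2parquet` is not installed.

```bash
# Hypothetical sample with numeric and text columns.
cat > mixed.csv <<'EOF'
id,city,temperature
1,Oslo,3.5
2,Lima,19.0
EOF

if command -v csv2parquet >/dev/null 2>&1; then
  # Infer the schema from at most the first 100 records.
  csv2parquet --header true --max-read-records 100 --dry mixed.csv mixed.parquet
  # Skip inference entirely; per the usage text, all columns become strings.
  csv2parquet --header true --max-read-records 0 --dry mixed.csv mixed.parquet
else
  echo "csv2parquet not found on PATH; skipping schema preview" >&2
fi
```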

Convert a CSV using `schema-file` to Parquet

Below is an example of the schema-file content:

```json
{
  "fields": [
    {
      "name": "col1",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": " col2",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}
```

Then pass the schema file `schema.json` to the command with `--schema-file`:

```
csv2parquet --header false --schema-file schema.json <CSV> <PARQUET>
```
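For files with many columns, writing the schema by hand gets tedious. The sketch below generates a schema file in the JSON layout of the example, typing every column as `Utf8`. The `make_schema` function, the column names, and `headerless.csv` are hypothetical and not part of csv2parquet; the conversion only runs if the binary is found on `PATH`.

```bash
# make_schema NAME...: print a schema in the JSON layout of the example,
# typing every column as Utf8. Hypothetical helper, not part of csv2parquet;
# column names must not contain double quotes or backslashes.
make_schema() {
  printf '{ "fields": [\n'
  sep=''
  for name in "$@"; do
    # %b expands the "\n" in $sep, so each field lands on its own line.
    printf '%b  { "name": "%s", "data_type": "Utf8", "nullable": false, "dict_id": 0, "dict_is_ordered": false }' "$sep" "$name"
    sep=',\n'
  done
  printf '\n] }\n'
}

# Hypothetical headerless input with three columns.
cat > headerless.csv <<'EOF'
1,alice,9.5
2,bob,7.25
EOF

make_schema id name score > schema.json

if command -v csv2parquet >/dev/null 2>&1; then
  csv2parquet --header false --schema-file schema.json headerless.csv headerless.parquet
else
  echo "csv2parquet not found on PATH; skipping conversion" >&2
fi
```

Other Arrow data types would need a different `data_type` value per field; comparing against the output of `--dry` on a sample file is a reasonable way to check the expected spelling.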