CSV to Parquet

Convert CSV files to Apache Parquet. This package is part of Arrow CLI tools.

Installation

Download prebuilt binaries

You can get the latest releases from https://github.com/domoritz/arrow-tools/releases.

With Cargo

cargo install csv2parquet

With Cargo B(inary)Install

To avoid re-compilation and speed up installation, you can install this tool with cargo binstall:

cargo binstall csv2parquet

Usage

``` Usage: csv2parquet [OPTIONS]

Arguments: Input CSV file, stdin if not present Output file

Options: -s, --schema-file File with Arrow schema in JSON format --max-read-records The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed --header

Set whether the CSV file has headers [possible values: true, false] -d, --delimiter Set the CSV file's column delimiter as a byte character [default: ,] -c, --compression Set the compression [possible values: uncompressed, snappy, gzip, lzo, brotli, lz4, zstd, lz4-raw] -e, --encoding Sets encoding for any column [possible values: plain, rle, bit-packed, delta-binary-packed, delta-length-byte-array, delta-byte-array, rle-dictionary] --data-page-size-limit Sets data page size limit --dictionary-page-size-limit Sets dictionary page size limit --write-batch-size Sets write batch size --max-row-group-size Sets max size for a row group --created-by Sets "created by" property --dictionary Sets flag to enable/disable dictionary encoding for any column --statistics Sets flag to enable/disable statistics for any column [possible values: none, chunk, page] --max-statistics-size Sets max statistics size for any column. Applicable only if statistics are enabled -p, --print-schema Print the schema to stderr -n, --dry Only print the schema -h, --help Print help -V, --version Print version ```

The --schema-file option uses the same file format as --dry and --print-schema.

Examples

Convert a CSV to Parquet

bash csv2parquet data.csv data.parquet

Convert a CSV with no `header` to Parquet

bash csv2parquet --header false <CSV> <PARQUET>

Get the `schema` from a CSV with header

bash csv2parquet --header true --dry <CSV> <PARQUET>

Convert a CSV using `schema-file` to Parquet

Below is an example of the schema-file content:

json { "fields": [ { "name": "col1", "data_type": "Utf8", "nullable": false, "dict_id": 0, "dict_is_ordered": false, "metadata": {} }, { "name": " col2", "data_type": "Utf8", "nullable": false, "dict_id": 0, "dict_is_ordered": false, "metadata": {} } ], " metadata": {} }

Then add the schema-file schema.json in the command:

csv2parquet --header false --schema-file schema.json <CSV> <PARQUET>