CSV to Parquet

Convert CSV files to Apache Parquet. You may also be interested in json2parquet, csv2arrow, or json2arrow.

Installation

Download prebuilt binaries

You can get the latest releases from https://github.com/domoritz/csv2parquet/releases/.

With Cargo

cargo install csv2parquet

Usage

```
USAGE:
    csv2parquet [OPTIONS] <CSV> <PARQUET>

ARGS:
    <CSV>        Input CSV file
    <PARQUET>    Output file

OPTIONS:
-c, --compression <COMPRESSION>
        Set the compression [possible values: uncompressed, snappy, gzip, lzo, brotli, lz4, zstd]

    --created-by <CREATED_BY>
        Sets "created by" property

-d, --delimiter <DELIMITER>
        Set the CSV file's column delimiter as a byte character [default: ,]

    --data-pagesize-limit <DATA_PAGESIZE_LIMIT>
        Sets data page size limit

    --dictionary
        Sets flag to enable/disable dictionary encoding for any column

    --dictionary-pagesize-limit <DICTIONARY_PAGESIZE_LIMIT>
        Sets dictionary page size limit

-e, --encoding <ENCODING>
        Sets encoding for any column [possible values: plain, rle, bit-packed,
        delta-binary-packed, delta-length-byte-array, delta-byte-array, rle-dictionary]

-h, --header <HEADER>
        Set whether the CSV file has headers

    --help
        Print help information

    --max-read-records <MAX_READ_RECORDS>
        The number of records to infer the schema from. All rows if not present. Setting
        max-read-records to zero will stop schema inference and all columns will be string typed

    --max-row-group-size <MAX_ROW_GROUP_SIZE>
        Sets max size for a row group

    --max-statistics-size <MAX_STATISTICS_SIZE>
        Sets max statistics size for any column. Applicable only if statistics are enabled

-n, --dry
        Only print the schema

-p, --print-schema
        Print the schema to stderr

-s, --schema-file <SCHEMA_FILE>
        File with Arrow schema in JSON format

    --statistics <STATISTICS>
        Sets flag to enable/disable statistics for any column [possible values: none, chunk,
        page]

-V, --version
        Print version information

    --write-batch-size <WRITE_BATCH_SIZE>
        Sets write batch size

```
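
As a rough illustration of the options above, the following sketch converts a small semicolon-delimited file with zstd compression. The file names and sample data are invented for this example, and it assumes `csv2parquet` is on your `PATH` and accepts `;` as a delimiter value; the conversion is skipped otherwise.

```shell
# Sample input, invented for this illustration.
printf 'id;name\n1;alice\n2;bob\n' > sample.csv

# Run the conversion only if the binary is installed.
if command -v csv2parquet >/dev/null 2>&1; then
    # Positional arguments: input CSV first, output Parquet file second.
    csv2parquet --delimiter ';' --compression zstd sample.csv sample.parquet
else
    echo "csv2parquet not found on PATH; skipping conversion" >&2
fi
```

Adding `--print-schema` to the same command should also print the inferred schema to stderr, which can help when checking how columns were typed.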

The --schema-file option expects the same JSON format that --dry and --print-schema print.
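
One possible workflow built on this is to dump the inferred schema, adjust it by hand, and feed it back in. This is a sketch under assumptions: the file names and sample data are hypothetical, and it assumes `--dry` writes only the schema JSON to standard output. If your version prints it elsewhere, adapt the redirection.

```shell
# Sample input, invented for this illustration.
printf 'id,price\n1,9.50\n2,12.00\n' > prices.csv

if command -v csv2parquet >/dev/null 2>&1; then
    # Step 1: capture the inferred schema without writing a Parquet file.
    # Assumption: --dry prints the schema JSON to stdout.
    csv2parquet --dry prices.csv prices.parquet > prices.schema.json

    # Step 2: edit prices.schema.json by hand if a column type was inferred incorrectly.

    # Step 3: convert using the explicit schema instead of inference.
    csv2parquet --schema-file prices.schema.json prices.csv prices.parquet
else
    echo "csv2parquet not found on PATH; skipping" >&2
fi
```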