JSON to Parquet

Convert JSON files to Apache Parquet. This package is part of Arrow CLI tools.

Installation

Download prebuilt binaries

You can get the latest releases from https://github.com/domoritz/arrow-tools/releases.

With Cargo

cargo install json2parquet

With Cargo B(inary)Install

To avoid re-compilation and speed up installation, you can install this tool with cargo binstall:

cargo binstall json2parquet

Usage

``` Usage: json2parquet [OPTIONS]

Arguments: Input JSON file, stdin if not present Output file

Options: -s, --schema-file File with Arrow schema in JSON format --max-read-records The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed -c, --compression Set the compression [possible values: uncompressed, snappy, gzip, lzo, brotli, lz4, zstd, lz4-raw] -e, --encoding Sets encoding for any column [possible values: plain, rle, bit-packed, delta-binary-packed, delta-length-byte-array, delta-byte-array, rle-dictionary] --data-page-size-limit Sets data page size limit --dictionary-page-size-limit Sets dictionary page size limit --write-batch-size Sets write batch size --max-row-group-size Sets max size for a row group --created-by Sets "created by" property --dictionary Sets flag to enable/disable dictionary encoding for any column --statistics Sets flag to enable/disable statistics for any column [possible values: none, chunk, page] --max-statistics-size Sets max statistics size for any column. Applicable only if statistics are enabled -p, --print-schema Print the schema to stderr -n, --dry Only print the schema -h, --help Print help -V, --version Print version ```

The --schema-file option uses the same file format as --dry and --print-schema.

Limitations

Since we use the Arrow JSON loader, we are limited to what it supports. Right now, it supports JSON line-delimited files.

json { "a": 42, "b": true } { "a": 12, "b": false } { "a": 7, "b": true }