PostgreSQL -> Parquet

A simple tool for exporting PostgreSQL tables into Parquet files, with support for more esoteric Postgres features than just `int` and `text`.

Installation

For gamblers

Download a random binary from ~~the internet~~ GitHub Actions artifacts (click on the latest run, scroll to the bottom, and choose your system).

For nix flakes enjoyers

```
nix shell github:exyi/pg2parquet
```

Then use `pg2parquet` in the new shell. Note that you might need to add the `--extra-experimental-features 'nix-command flakes'` argument to the `nix` invocation.

For Arch Linux fans

From Sources

Install Rust and Cargo. Clone the repo.

```bash
cd cli
env RUSTFLAGS="-C target-cpu=native" cargo build --release
```

It should finish in a few minutes (~10 CPU minutes). Take the `target/release/pg2parquet` file and delete the rest of the `target` directory (it takes up quite a bit of disk space). You can optionally strip the binary, but you'll get poor stack traces if it crashes.
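The cleanup step above can be sketched as follows (a guarded sketch: it only does anything if the release build actually exists):

```shell
# After `cargo build --release` in cli/: keep the binary, drop the
# intermediate build artifacts (they take up most of the disk space).
if [ -f target/release/pg2parquet ]; then
  cp target/release/pg2parquet ./pg2parquet
  rm -rf target
  # strip ./pg2parquet   # optional: smaller binary, worse stack traces
fi
```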

Basic usage

```
pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -t the_table_to_export
```

Alternatively, you can export the result of a SQL query:

```
pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -q 'select column_a, column_b::text from another_table'
```

You can also use the environment variables `$PGPASSWORD` and `$PGUSER`.
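For example (the user name, password, host, and table below are made-up placeholders, and the export only runs if `pg2parquet` is on your `PATH`):

```shell
# Credentials taken from the environment stay out of the process list,
# unlike the --password flag. All concrete values here are hypothetical.
export PGUSER=analytics_reader
export PGPASSWORD='s3cret'

if command -v pg2parquet >/dev/null 2>&1; then
  pg2parquet export --host localhost.for.example --dbname my_database \
    --output-file output.parquet -t the_table_to_export
fi
```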

Supported types

Known Limitations (and workarounds)

Options

> pg2parquet export --help

```
Exports a PostgreSQL table or query to a Parquet file

Usage: pg2parquet export [OPTIONS] --output-file <OUTPUT_FILE> --host <HOST> --dbname <DBNAME>

Options:
  -o, --output-file <OUTPUT_FILE>
          Path to the output file. If the file exists, it will be overwritten

  -q, --query <QUERY>
          SQL query to execute. Exclusive with --table

  -t, --table <TABLE>
          Which table should be exported. Exclusive with --query

      --compression <COMPRESSION>
          Compression applied on the output file. Default: zstd, change to Snappy or None if it's too slow

          [possible values: none, snappy, gzip, lzo, brotli, lz4, zstd]

  -H, --host <HOST>
          Database server host

  -U, --user <USER>
          Database user name. If not specified, the PGUSER environment variable is used

  -d, --dbname <DBNAME>

  -p, --port <PORT>

      --password <PASSWORD>
          Password to use for the connection. It is recommended to use the PGPASSWORD environment variable instead, since process arguments are visible to other users on the system

      --macaddr-handling <MACADDR_HANDLING>
          How to handle `macaddr` columns

          [default: text]

          Possible values:
          - text:       MAC address is converted to a string
          - byte-array: MAC is stored as a fixed byte array of length 6
          - int64:      MAC is stored in Int64 (lowest 6 bytes)

      --json-handling <JSON_HANDLING>
          How to handle `json` and `jsonb` columns

          [default: text]

          Possible values:
          - text-marked-as-json:
            JSON is stored as a Parquet JSON type. This is essentially the same as text, but with a different ConvertedType, so it may not be supported in all tools
          - text:
            JSON is stored as UTF8 text

      --enum-handling <ENUM_HANDLING>
          How to handle enum (Enumerated Type) columns

          [default: text]

          Possible values:
          - text:
            Enum is stored as the postgres enum name
          - int:
            Enum is stored as a 32-bit integer (zero-based index of the value in the enum definition)

      --decimal-scale <DECIMAL_SCALE>
          How many decimal digits after the decimal point are stored in the Parquet file

          [default: 18]

      --decimal-precision <DECIMAL_PRECISION>
          How many decimal digits are allowed in a numeric/DECIMAL column. By default 38, the largest value which fits in 128 bits

          [default: 38]

  -h, --help
          Print help information (use -h for a summary)
```
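As an illustration, several of these options can be combined in a single invocation (the host, database, and table names below are hypothetical, and the export only runs if `pg2parquet` is installed):

```shell
# Hypothetical invocation combining options from the help text above.
compression=snappy    # faster than the default zstd
enum_handling=int     # store enums as zero-based integer indices

if command -v pg2parquet >/dev/null 2>&1; then
  pg2parquet export --host localhost.for.example --dbname my_database \
    --output-file orders.parquet -t orders \
    --compression "$compression" --enum-handling "$enum_handling"
fi
```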