PostgreSQL -> Parquet

Simple tool for exporting PostgreSQL tables to Parquet files, with support for more esoteric Postgres features than just `int` and `text`.

Installation

Download Binary from Github

Download the binary from Github Actions artifacts (click on the latest run, scroll to the bottom, choose your system).

Using Nix flakes

If you use Nix, this command will install the latest pg2parquet version. It compiles from source, so the installation will take some time.

nix shell github:exyi/pg2parquet

Then use pg2parquet in the new shell. Note that you might need to add the --extra-experimental-features 'nix-command flakes' argument to the nix invocation.
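If flakes are not enabled in your Nix configuration, the full one-off invocation looks roughly like this (a sketch combining the two commands above):

```shell
# Enable the flakes feature just for this command and open a shell
# with pg2parquet on the PATH
nix --extra-experimental-features 'nix-command flakes' shell github:exyi/pg2parquet
```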

Using Cargo

cargo install pg2parquet

From Sources

Install Rust and Cargo. Clone the repo.

```bash
cd cli
env RUSTFLAGS="-C target-cpu=native" cargo build --release
```

It should finish in a few minutes (~10 CPU minutes). Take the target/release/pg2parquet file and delete the rest of the target directory (it takes up quite a bit of disk space). You can optionally strip the binary, but you'll get poor stack traces if it crashes.
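Put together, the build and cleanup might look like this (a sketch; the install path is an assumption, adjust to taste):

```shell
# Build an optimized binary for the local CPU
cd cli
env RUSTFLAGS="-C target-cpu=native" cargo build --release

# Keep just the binary, then reclaim the disk space used by build artifacts
cp target/release/pg2parquet ~/.local/bin/pg2parquet
cargo clean

# Optional: shrink the binary further, at the cost of poor stack traces on crash
# strip ~/.local/bin/pg2parquet
```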

Basic usage

pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -t the_table_to_export

Alternatively, you can export the result of a SQL query:

pg2parquet export --host localhost.for.example --dbname my_database --output-file output.parquet -q 'select column_a, column_b::text from another_table'

You can also use the environment variables $PGPASSWORD and $PGUSER.
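For example, to keep the password out of the command line (user name and password here are placeholders):

```shell
# Credentials are read from the environment instead of process arguments,
# which other users on the system could otherwise see via `ps`
export PGUSER=my_user
export PGPASSWORD='my secret password'
pg2parquet export --host localhost.for.example --dbname my_database \
    --output-file output.parquet -t the_table_to_export
```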

Supported types

Known Limitations (and workarounds)

Options

> pg2parquet export --help

```
Exports a PostgreSQL table or query to a Parquet file

Usage: pg2parquet export [OPTIONS] --output-file <OUTPUT_FILE> --host <HOST> --dbname <DBNAME>

Options:
  -o, --output-file <OUTPUT_FILE>
          Path to the output file. If the file exists, it will be overwritten

  -q, --query <QUERY>
          SQL query to execute. Exclusive with --table

  -t, --table <TABLE>
          Which table should be exported. Exclusive with --query

      --compression <COMPRESSION>
          Compression applied on the output file. Default: zstd, change to Snappy or None if it's too slow

          [possible values: none, snappy, gzip, lzo, brotli, lz4, zstd]

  -H, --host <HOST>
          Database server host

  -U, --user <USER>
          Database user name. If not specified, PGUSER environment variable is used

  -d, --dbname <DBNAME>

  -p, --port <PORT>

      --password <PASSWORD>
          Password to use for the connection. It is recommended to use the PGPASSWORD environment variable instead, since process arguments are visible to other users on the system

      --macaddr-handling <MACADDR_HANDLING>
          How to handle `macaddr` columns

          [default: text]

          Possible values:
          - text:       MAC address is converted to a string
          - byte-array: MAC is stored as fixed byte array of length 6
          - int64:      MAC is stored in Int64 (lowest 6 bytes)

      --json-handling <JSON_HANDLING>
          How to handle `json` and `jsonb` columns

          [default: text]

          Possible values:
          - text-marked-as-json:
            JSON is stored as a Parquet JSON type. This is essentially the same as text, but with a different ConvertedType, so it may not be supported in all tools
          - text:
            JSON is stored as a UTF8 text

      --enum-handling <ENUM_HANDLING>
          How to handle enum (Enumerated Type) columns

          [default: text]

          Possible values:
          - text: Enum is stored as the postgres enum name
          - int:  Enum is stored as a 32-bit integer (zero-based index of the value in the enum definition)

      --decimal-scale <DECIMAL_SCALE>
          How many decimal digits after the decimal point are stored in the Parquet file

          [default: 18]

      --decimal-precision <DECIMAL_PRECISION>
          How many decimal digits are allowed in numeric/DECIMAL column. By default 38, the largest value which fits in 128 bits

          [default: 38]

  -h, --help
          Print help information (use `-h` for a summary)
```
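As a sketch of how these options combine (host, database, and table names are placeholders from the earlier examples):

```shell
# Export with faster (but weaker) Snappy compression, enums stored as
# zero-based integer indices, and 4 digits after the decimal point
# for numeric columns
pg2parquet export --host localhost.for.example --dbname my_database \
    --output-file output.parquet -t the_table_to_export \
    --compression snappy --enum-handling int --decimal-scale 4
```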