warc-parquet

🗄️ A simple tool for converting WARC files to Parquet files.

📦 Install

The binary may be installed via cargo:

    $ cargo install warc-parquet

🤸 Usage

Once installed, pass a WARC file to the program along with the output path that the Parquet file will be written to:

    $ wget --warc-file example 'https://example.com'
    $ warc-parquet --gzipped example.warc.gz example.snappy.parquet

⚠️ Note that any existing file at the Parquet path WILL be overwritten.
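
Because the output path is overwritten without confirmation, a small wrapper can refuse to run when the target already exists. The following is a minimal sketch and not part of warc-parquet: the function names `parquet_path` and `convert_if_absent` and the `.snappy.parquet` naming convention are assumptions made for illustration. Only the `--gzipped` invocation is taken from the usage shown above.

```shell
# Hypothetical helpers; not part of warc-parquet itself.

# Derive an output path from a gzipped WARC path:
# example.warc.gz -> example.snappy.parquet
parquet_path() {
  printf '%s.snappy.parquet\n' "${1%.warc.gz}"
}

# Convert one gzipped WARC file, refusing to overwrite existing output.
# The body runs in a subshell, so `out` does not leak into the caller.
convert_if_absent() (
  out="$(parquet_path "$1")"
  if [ -e "$out" ]; then
    echo "skip: $out already exists" >&2
    exit 1
  fi
  warc-parquet --gzipped "$1" "$out"
)
```

A loop such as `for f in *.warc.gz; do convert_if_absent "$f"; done` would then convert only the files that have no Parquet output yet.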

There are many ways to consume Parquet once you have it, though DuckDB might be a natural fit:

    $ duckdb
    v0.3.3 fe9ba8003
    Enter ".help" for usage hints.
    Connected to a transient in-memory database.
    Use ".open FILENAME" to reopen on a persistent database.
    D select type, id from 'example.snappy.parquet';
    ┌──────────┬─────────────────────────────────────────────────┐
    │ type     │ id                                              │
    ├──────────┼─────────────────────────────────────────────────┤
    │ warcinfo │ <urn:uuid:A8063499-7675-4D8D-A736-A1D7DAE84C84> │
    │ request  │ <urn:uuid:3EB20966-D74F-4949-AACB-23DB3A0733A7> │
    │ response │ <urn:uuid:8B92CADC-F770-45BE-8B72-E13A61CD6D1C> │
    │ metadata │ <urn:uuid:4C0E9E17-E21B-49E0-859A-D1016FBDE636> │
    │ resource │ <urn:uuid:14F502A5-3BDE-4D0B-8A43-95F4BB8398C6> │
    │ resource │ <urn:uuid:6B6D6ADD-52FF-4760-AA00-FB9E739CABBE> │
    └──────────┴─────────────────────────────────────────────────┘
    D describe select * from 'example.snappy.parquet';
    ┌─────────────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
    │ column_name             │ column_type │ null │ key │ default │ extra │
    ├─────────────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
    │ id                      │ VARCHAR     │ YES  │     │         │       │
    │ content_length          │ UINTEGER    │ YES  │     │         │       │
    │ date                    │ TIMESTAMP   │ YES  │     │         │       │
    │ type                    │ VARCHAR     │ YES  │     │         │       │
    │ content_type            │ VARCHAR     │ YES  │     │         │       │
    │ concurrent_to           │ VARCHAR     │ YES  │     │         │       │
    │ block_digest            │ VARCHAR     │ YES  │     │         │       │
    │ payload_digest          │ VARCHAR     │ YES  │     │         │       │
    │ ip_address              │ VARCHAR     │ YES  │     │         │       │
    │ refers_to               │ VARCHAR     │ YES  │     │         │       │
    │ target_uri              │ VARCHAR     │ YES  │     │         │       │
    │ truncated               │ VARCHAR     │ YES  │     │         │       │
    │ warc_info_id            │ VARCHAR     │ YES  │     │         │       │
    │ filename                │ VARCHAR     │ YES  │     │         │       │
    │ profile                 │ VARCHAR     │ YES  │     │         │       │
    │ identified_payload_type │ VARCHAR     │ YES  │     │         │       │
    │ segment_number          │ UINTEGER    │ YES  │     │         │       │
    │ segment_origin_id       │ VARCHAR     │ YES  │     │         │       │
    │ segment_total_length    │ UINTEGER    │ YES  │     │         │       │
    │ body                    │ BLOB        │ YES  │     │         │       │
    └─────────────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
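
For scripted use, a query like the ones above can also be run non-interactively. The sketch below assumes the `duckdb` CLI is on the `PATH` and accepts the `-c` option to run a single command and exit. The function name `summarize_records` is hypothetical, and the `type` and `content_length` columns are the ones listed in the `describe` output.

```shell
# Hypothetical helper: count records and sum content_length per WARC record type.
# Assumes the path contains no single quotes, since it is spliced into the SQL text.
summarize_records() {
  duckdb -c "select type, count(*) as records, sum(content_length) as total_bytes
             from '$1' group by type order by records desc;"
}
```

Calling `summarize_records example.snappy.parquet` would then print one row per record type found in the file.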