tantivy_warc_indexer

tantivy_warc_indexer builds a tantivy index from Common Crawl warc.wet files.

Build

Install Rust (e.g. via rustup), then run `make`.

Usage

```
./target/release/tantivy_warc_indexer --help
WARC Indexer

Usage:
  warcparser [-t ] [--from ] [--to ]
  warcparser (-h | --help)

Options:
  -h --help  Show this help
  -t         number of threads to use, default 4
  --from     skip files until from
  --to       skip files after to
```

Run

The first argument is the directory of an empty index you created (e.g. with tantivy-cli); the second is the path to the directory containing the Common Crawl warc.wet or warc.wet.gz files. Depending on your system, indexing might take a few days or weeks.

    ./target/release/tantivy_warc_indexer ../common_crawl_tantivy_index ../wet

To create an index:

    mkdir ../common_crawl_tantivy_index
    cp template/meta.json ../common_crawl_tantivy_index/
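The preparation steps above can be wrapped in a small shell sketch. The helper names `prepare_index` and `count_wet_files` are hypothetical and not part of this repository. The sketch also assumes the input files sit directly inside the given directory, which may not match how the indexer actually scans it.

```shell
# Hypothetical helpers, not shipped with this repository.

# Create the index directory and seed it with the template meta.json.
#   $1: path to the template meta.json   $2: index directory to create
prepare_index() {
    mkdir -p "$2" && cp "$1" "$2/"
}

# Print the number of .warc.wet and .warc.wet.gz files directly inside $1.
# Assumption: files in subdirectories are not counted.
count_wet_files() {
    n=0
    for f in "$1"/*.warc.wet "$1"/*.warc.wet.gz; do
        # An unmatched glob stays literal and fails the -f test.
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example usage with the paths from this README:
#   prepare_index template/meta.json ../common_crawl_tantivy_index
#   if [ "$(count_wet_files ../wet)" -gt 0 ]; then
#       ./target/release/tantivy_warc_indexer ../common_crawl_tantivy_index ../wet
#   fi
```

Checking the file count first is cheap and avoids starting a run that could take days against an empty or mistyped input directory.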

Best, Andreas