Corpus Preprocessor

CLI and HTTP API to preprocess corpora for word embeddings and possibly other NLP tasks. The main goal is to convert many HTML or plain text files into a single normalized plain text corpus.

Features

Usage

Command Line Interface (CLI)

```console

# Install
$ cargo install corpus-preproc

# Run CLI help
$ corpus-preproc clean -h
Preprocess a file or directory

USAGE:
    corpus-preproc clean [OPTIONS]

ARGS:

OPTIONS:
    -c
            Clean HTML tags

        --content-selector <CONTENT_SELECTOR>
            CSS selector for main content

        --delete-selector <DELETE_SELECTOR>
            CSS selector for tag removal [default: "script, style, pre, svg, math, noscript, ref,
            table, tr, td, ol, ul, li, time, [aria-hidden], img, figure"]

    -h, --help
            Print help information

    -l
            Perform case-folding

    -m
            Keep modifiers and marks on normalization

    -n
            Perform NFKC and whitespace normalization

        --nl-append-selector <NL_APPEND_SELECTOR>
            CSS selector to append newline [default: "div, p, hr, br, h1, h2, h3, h4, h5, h6"]

    -p
            Trim punctuation surrounding words

    -t <THREADS>
            Number of threads to use [default: 4]

```
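
To make the text options above more concrete, here is a rough Python sketch of what NFKC normalization (`-n`), case-folding (`-l`), mark removal (the default when `-m` is not given) and punctuation trimming (`-p`) might amount to. It is an illustration built on the standard library only, not the tool's actual implementation: the function name `normalize` and its parameters are made up for this example, "marks" is assumed to mean Unicode combining marks (categories `M*`), modifier characters are not handled, and only ASCII punctuation is trimmed.

```python
import string
import unicodedata


def normalize(text, lowercase=True, keep_marks=False, trim_punct=True):
    """Hypothetical approximation of the -n, -l, -m and -p options."""
    # NFKC folds compatibility characters (ligatures, full-width letters, ...).
    text = unicodedata.normalize("NFKC", text)
    if not keep_marks:
        # Decompose, drop combining marks (categories Mn, Mc, Me), recompose.
        decomposed = unicodedata.normalize("NFKD", text)
        stripped = "".join(
            ch for ch in decomposed
            if not unicodedata.category(ch).startswith("M")
        )
        text = unicodedata.normalize("NFKC", stripped)
    if lowercase:
        # casefold() is the Unicode-aware variant of lower().
        text = text.casefold()
    # Splitting on any whitespace also collapses runs of spaces and newlines.
    words = text.split()
    if trim_punct:
        # Assumption: only ASCII punctuation around each word is trimmed.
        words = [w.strip(string.punctuation) for w in words]
        words = [w for w in words if w]
    return " ".join(words)


print(normalize("HELLo,   WORLD!!!"))  # hello world
```

The real tool may differ in details such as which characters count as punctuation or modifiers, so treat the output of this sketch as indicative only.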

HTTP API

Startup

```console
$ corpus-preproc serve 127.0.0.1:8000
```

Python Example

The `requests` Python library needs to be installed.

```python
import json

import requests

DEFAULT_CONFIG = {
    "htmlClean": {
        "enabled": True,
        "contentSelector": None,
        "deleteSelector": "script, style, pre, svg, math, noscript, ref, table, tr, td, ol, ul, li, time, [aria-hidden], img, figure",
        "nlAppendSelector": "div, p, hr, br, h1, h2, h3, h4, h5, h6",
    },
    "charNormalization": {
        "enabled": True,
        "keepModifiersAndMarks": False,
        "lowercase": True,
    },
    "wordNormalization": {
        "enabled": True,
        "replacePii": True,
    },
}


def clean_text(text):
    # Both parts are sent as multipart form fields; "config" is optional.
    files = {
        "config": (None, json.dumps(DEFAULT_CONFIG), "application/json"),
        "data": (None, text, "text/plain"),
    }
    # Use the same address that was passed to `corpus-preproc serve`.
    response = requests.post("http://127.0.0.1:8000/preproc", files=files)
    return response.text


clean = clean_text("HELLo, WORLD!!!").rstrip()
assert clean == "hello world", f"unexpected output: {clean!r}"
```
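
If only one or two settings need to change, it can be convenient to derive a configuration from the defaults instead of rewriting the whole dictionary. The following is a minimal sketch of that idea. The helper `override_config` and the name `BASE_CONFIG` are hypothetical and not part of corpus-preproc, and `BASE_CONFIG` repeats only a subset of the default configuration so that the snippet runs on its own. Whether the server accepts a partial configuration is not documented here, so in practice it is probably safer to pass the full default dictionary as the base.

```python
import copy
import json


def override_config(base, overrides):
    """Return a deep copy of `base` with the nested keys in `overrides` replaced."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so that sibling keys in the nested section are preserved.
            merged[key] = override_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Subset of the default configuration, repeated here for self-containment.
BASE_CONFIG = {
    "charNormalization": {
        "enabled": True,
        "keepModifiersAndMarks": False,
        "lowercase": True,
    },
    "wordNormalization": {"enabled": True, "replacePii": True},
}

# Assumption: setting "lowercase" to False leaves the original casing intact.
config = override_config(BASE_CONFIG, {"charNormalization": {"lowercase": False}})
payload = json.dumps(config)  # value for the multipart "config" field
```

The base dictionary is copied rather than mutated, so several variants can be derived from the same defaults within one script.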

TODO

Speed