CLI and HTTP API to preprocess corpora for word embeddings and possibly other NLP tasks. The main goal is to convert many HTML or plain text files into a single normalized plain text corpus.
Words are replaced with the `<unk>` placeholder if they meet any of the following criteria:

- the word contains `@`
- the word starts with `http://`
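For illustration, the two criteria above amount to a simple per-word check. The sketch below only mirrors the documented conditions and is not the crate's actual implementation:

```python
# Rough sketch of the documented <unk> replacement criteria.
# This mirrors only the two conditions listed above; the actual
# matching logic in corpus-preproc may differ.
def replace_with_unk(word: str) -> str:
    if "@" in word or word.startswith("http://"):
        return "<unk>"
    return word

assert replace_with_unk("user@example.com") == "<unk>"
assert replace_with_unk("http://example.com") == "<unk>"
assert replace_with_unk("corpus") == "corpus"
```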
```console
$ cargo install corpus-preproc
$ corpus-preproc clean -h
Preprocess a file or directory

USAGE:
    corpus-preproc clean [OPTIONS]

ARGS:

OPTIONS:
    -c
            Clean HTML tags
        --content-selector <CONTENT_SELECTOR>
            CSS selector for main content
        --delete-selector <DELETE_SELECTOR>
            CSS selector for tag removal [default: "script, style, pre, svg, math, noscript, ref,
            table, tr, td, ol, ul, li, time, [aria-hidden], img, figure"]
    -h, --help
            Print help information
    -l
            Perform case-folding
    -m
            Keep modifiers and marks on normalization
    -n
            Perform NFKC and whitespace normalization
        --nl-append-selector <NL_APPEND_SELECTOR>
            CSS selector to append newline [default: "div, p, hr, br, h1, h2, h3, h4, h5, h6"]
    -p
            Trim punctuation surrounding words
    -t <THREADS>
            Number of threads to use [default: 4]
```
```console
$ corpus-preproc serve 127.0.0.1:8000
```
The `requests` Python library needs to be installed.
```python
import requests
import json
DEFAULT_CONFIG = {
    "htmlClean": {
        "enabled": True,
        "contentSelector": None,
        "deleteSelector": "script, style, pre, svg, math, noscript, ref, table, tr, td, ol, ul, li, time, [aria-hidden], img, figure",
        "nlAppendSelector": "div, p, hr, br, h1, h2, h3, h4, h5, h6",
    },
    "charNormalization": {
        "enabled": True,
        "keepModifiersAndMarks": False,
        "lowercase": True,
    },
    "wordNormalization": {
        "enabled": True,
        "replacePii": True,
    },
}

def clean_text(text):
    files = {
        'config': (None, json.dumps(DEFAULT_CONFIG), 'application/json'),  # optional
        'data': (None, text, 'text/plain'),
    }
    response = requests.post('http://127.0.0.1:8000/preproc', files=files)
    return response.text

clean = clean_text("HELLo, WORLD!!!").rstrip()
assert (clean == "hello world"), "OK"
```
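The `config` part is marked optional in the example above. Assuming the server falls back to its default settings when it is omitted, a minimal request could send only the text (an untested sketch):

```python
import requests

# Minimal variant: omit the optional 'config' part and rely on the
# server's defaults (assumption based on the "# optional" comment in
# the example above).
def clean_text_minimal(text):
    files = {
        'data': (None, text, 'text/plain'),
    }
    return requests.post('http://127.0.0.1:8000/preproc', files=files).text
```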
TODO:

- [ ] Replace `indicatif` with `linya`
- [ ] Integrate `tokenizers`
- [ ] Use `ropey` or `tendril`
- [ ] Determine feasibility to process text as a stream instead of loading entire file buffer into memory (`lol-html` and `html5ever` issue #149)
- [ ] Implement quality control (minimum and maximum sentence length)
- [ ] Extract text from PDF and DOCX files (`pdf-extract`, and `dotext` or `docx`)
- [ ] Stemming with `rust-stemmers`
- [ ] Language detection with `fasttext-rs` and a language identification model
- [ ] Automatically concatenate common MWEs with MITIE (Rust bindings missing) or `phrase`
- [ ] Python bindings