CLI and HTTP API to preprocess corpora for word embeddings and possibly other NLP tasks. The main goal is to convert many HTML or plain text files into a single normalized plain text corpus.
Words are replaced with the `<unk>` placeholder if they meet any of the following criteria:

- the word contains `@`
- the word starts with `http://`
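For illustration, the two criteria above amount to a simple per-word check. The sketch below only mirrors the documented conditions and is not the crate's actual implementation:

```python
# Rough sketch of the documented <unk> replacement criteria.
# This mirrors only the two conditions listed above; the actual
# matching logic in corpus-preproc may differ.
def replace_with_unk(word: str) -> str:
    if "@" in word or word.startswith("http://"):
        return "<unk>"
    return word

assert replace_with_unk("user@example.com") == "<unk>"
assert replace_with_unk("http://example.com") == "<unk>"
assert replace_with_unk("corpus") == "corpus"
```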
```console
$ cargo install corpus-preproc
$ corpus-preproc clean -h
Preprocess a file or directory

USAGE:
    corpus-preproc clean [OPTIONS]

ARGS:

OPTIONS:
    -c
            Clean HTML tags
        --content-selector <CONTENT_SELECTOR>
            CSS selector for main content
        --delete-selector <DELETE_SELECTOR>
            CSS selector for tag removal [default: "script, style, pre, svg, math, noscript, ref,
            table, tr, td, ol, ul, li, time, [aria-hidden], img, figure"]
    -h, --help
            Print help information
    -l
            Perform case-folding
    -m
            Keep modifiers and marks on normalization
    -n
            Perform NFKC and whitespace normalization
        --nl-append-selector <NL_APPEND_SELECTOR>
            CSS selector to append newline [default: "div, p, hr, br, h1, h2, h3, h4, h5, h6"]
    -p
            Trim punctuation surrounding words
    -t <THREADS>
            Number of threads to use [default: 4]
```
```console
$ corpus-preproc serve 127.0.0.1:8000
```
The `requests` Python library needs to be installed.
```python
import requests
import json
DEFAULT_CONFIG = {
    "htmlClean": {
        "enabled": True,
        "contentSelector": None,
        "deleteSelector": "script, style, pre, svg, math, noscript, ref, table, tr, td, ol, ul, li, time, [aria-hidden], img, figure",
        "nlAppendSelector": "div, p, hr, br, h1, h2, h3, h4, h5, h6",
    },
    "charNormalization": {
        "enabled": True,
        "keepModifiersAndMarks": False,
        "lowercase": True,
    },
    "wordNormalization": {
        "enabled": True,
        "replacePii": True,
    },
}

def clean_text(text):
    files = {
        'config': (None, json.dumps(DEFAULT_CONFIG), 'application/json'),  # optional
        'data': (None, text, 'text/plain'),
    }
    response = requests.post('http://127.0.0.1:8000/preproc', files=files)
    return response.text

clean = clean_text("HELLo, WORLD!!!").rstrip()
assert (clean == "hello world"), "OK"
```
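The `config` part is marked optional in the example above. Assuming the server falls back to its default settings when it is omitted, a minimal request could send only the text (an untested sketch):

```python
import requests

# Minimal variant: omit the optional 'config' part and rely on the
# server's defaults (assumption based on the "# optional" comment in
# the example above).
def clean_text_minimal(text):
    files = {
        'data': (None, text, 'text/plain'),
    }
    return requests.post('http://127.0.0.1:8000/preproc', files=files).text
```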
TODO:

- [ ] Replace `indicatif` with `linya`
- [ ] Integrate `tokenizers`
- [ ] Use `ropey` or `tendril`
- [ ] Determine feasibility to process text as a stream instead of loading entire file buffer into memory (`lol-html` and `html5ever` issue #149)
- [ ] Implement quality control (minimum and maximum sentence length)
- [ ] Extract text from PDF and DOCX files (`pdf-extract`, and `dotext` or `docx`)
- [ ] Stemming with `rust-stemmers`
- [ ] Language detection with `fasttext-rs` and a language identification model
- [ ] Automatically concatenate common MWEs with MITIE (Rust bindings missing) or `phrase`
- [ ] Python bindings