NLP in Rust with Python bindings
This package aims to provide a high-performance toolkit for ingesting textual data for machine learning applications.
The API is currently unstable.
In particular, it provides text vectorizers analogous to `CountVectorizer` and `HashingVectorizer` in scikit-learn.

To use it, add the following to `Cargo.toml`:
```toml
[dependencies]
vtext = { git = "https://github.com/rth/vtext" }
```
A simple example can be found below:

```rust
extern crate vtext;
use vtext::CountVectorizer;

let documents = vec![
    String::from("Some text input"),
    String::from("Another line"),
];
let mut vect = CountVectorizer::new();
let X = vect.fit_transform(&documents);
```
where `X` is a `CSRArray` struct with the following attributes: `X.indptr`, `X.indices`, `X.values`.
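These attributes follow the standard Compressed Sparse Row (CSR) layout: row `i` (document `i`) owns the slice `indptr[i]..indptr[i + 1]` of `indices` (term ids) and `values` (counts). The sketch below is a self-contained illustration of that layout; the `print_csr` function and the concrete field types are hypothetical, not part of the vtext API.

```rust
// Hypothetical illustration of the CSR layout of a document-term matrix;
// the actual CSRArray field types may differ.
fn print_csr(indptr: &[usize], indices: &[usize], values: &[i64]) {
    for doc in 0..indptr.len() - 1 {
        // Row `doc` spans the half-open range indptr[doc]..indptr[doc + 1]
        for k in indptr[doc]..indptr[doc + 1] {
            // indices[k] is the term (column) id, values[k] its count
            println!("doc {}: term {} -> count {}", doc, indices[k], values[k]);
        }
    }
}

fn main() {
    // A 2 x 4 document-term matrix:
    //   doc 0: term 0 appears once, term 2 twice
    //   doc 1: terms 1 and 3 appear once each
    let indptr = vec![0, 2, 4];
    let indices = vec![0, 2, 1, 3];
    let values = vec![1, 2, 1, 1];
    print_csr(&indptr, &indices, &values);
}
```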
The API aims to be compatible with scikit-learn's `CountVectorizer` and `HashingVectorizer`, though only a subset of their features will be implemented.
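For instance, assuming `HashingVectorizer` mirrors the `CountVectorizer` interface shown above (an assumption based on that example, not documented behavior), usage could look like:

```rust
extern crate vtext;
use vtext::HashingVectorizer;

let documents = vec![
    String::from("Some text input"),
    String::from("Another line"),
];
// Assumed to mirror CountVectorizer: tokens are hashed into a fixed-size
// feature space, so no vocabulary needs to be stored.
let mut vect = HashingVectorizer::new();
let X = vect.fit_transform(&documents);
```

The appeal of the hashing approach is that memory use stays bounded regardless of vocabulary size, which is what makes the speedups in the vectorization benchmarks below possible on large corpora.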
The following benchmarks illustrate the tokenization accuracy (F1 score) on UD treebanks:
| lang | dataset | regexp | spacy 2.1 | vtext |
|-------|-----------|----------|-----------|----------|
| en | EWT | 0.812 | 0.972 | 0.966 |
| en | GUM | 0.881 | 0.989 | 0.996 |
| de | GSD | 0.896 | 0.944 | 0.964 |
| fr | Sequoia | 0.844 | 0.968 | 0.971 |
and the English tokenization speed in million words per second (MWPS):

|       | regexp   | spacy 2.1 | vtext    |
|-------|----------|-----------|----------|
| Speed | 3.1 MWPS | 0.14 MWPS | 2.1 MWPS |
Below are benchmarks for converting textual data to a sparse document-term matrix using the 20 newsgroups dataset:

|                   | scikit-learn 0.20.1 | vtext 0.1.0a1 |
|-------------------|---------------------|---------------|
| CountVectorizer   | 14 MB/s             | 35 MB/s       |
| HashingVectorizer | 19 MB/s             | 68 MB/s       |
See benchmarks/README.md for more details.
vtext is released under the BSD 3-clause license.