SloWord2Vec

This is a naive implementation of Word2Vec, written in Rust.

The goal is to learn the basic principles and formulas behind Word2Vec. BTW, it's slow ;)

Getting it

It is available both as a library and as a binary.
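To get the binary, build it with Cargo from a checkout of this repo (or, assuming the crate is published to crates.io under the same name, install it directly):

```
# From a local checkout of this repo
cargo install --path .

# Or, assuming it is published on crates.io as "sloword2vec"
cargo install sloword2vec
```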

Binary

```
A naive Word2Vec implementation

USAGE:
    sloword2vec [SUBCOMMAND]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    add-subtract    Given a number of words to add and to subtract, returns a list of words in that area.
    help            Prints this message or the help of the given subcommand(s)
    similar         Given a path to a saved Word2Vec model and a target word, finds words in the model's vocab that are similar.
    train           Given a corpus and a path to save a trained model, trains Word2Vec encodings for the vocabulary in the corpus and saves it.
```
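Each subcommand prints its own help, e.g.:

```
sloword2vec help train
```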

Training

```
Given a corpus and a path to save a trained model, trains Word2Vec encodings for the vocabulary in the corpus and saves it.

USAGE:
    sloword2vec train [OPTIONS] --corpus <corpus> --path <path>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -A, --acceptable-error <acceptable-error>
            Acceptable error threshold under which training will end. [default: 0.1]
    -R, --context-radius <context-radius>
            The context radius (how many words surrounding a centre word to take into account per training sample). [default: 5]
    -C, --corpus <corpus>
            Where the corpus file is.
    -D, --dimensions <dimensions>
            Number of dimensions to use for encoding a word as a vector. [default: 100]
    -I, --iterations <iterations>
            Max number of training iterations. [default: 500]
    -L, --learning-rate <learning-rate>
            Learning rate. [default: 0.001]
    -M, --min-error-improvement <min-error-improvement>
            Minimum improvement in average error magnitude in a single training iteration (over all words) to keep on training. [default: 0.0001]
    -O, --min-word-occurences <min-word-occurences>
            Minimum number of occurrences in the corpus a word needs to have in order to be included in the trained vocabulary. [default: 2]
    -P, --path <path>
            Where to store the model when training is done.
```
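As an example, a typical training run might look like this (the corpus and model paths are made up for illustration; the flags are the ones listed above):

```
sloword2vec train \
  --corpus ./corpus.txt \
  --path ./model.w2v \
  --dimensions 100 \
  --context-radius 5 \
  --iterations 500
```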

Similarity

```
Given a path to a saved Word2Vec model and a target word, finds words in the model's vocab that are similar.

USAGE:
    sloword2vec similar --limit <limit> --path <path> --word <word>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -L, --limit <limit>    Max number of similar entries to show. [default: 20]
    -P, --path <path>      Where the saved model is stored.
    -W, --word <word>      Word to find similar terms for.
```
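For example, assuming a model was saved to ./model.w2v as above:

```
sloword2vec similar --path ./model.w2v --word king --limit 10
```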

Add subtract

The classic Word2Vec demo: add some word encodings together, subtract others, and see which words land nearby (e.g. "king" - "man" + "woman" ≈ "queen").

```
Given a number of words to add and to subtract, returns a list of words in that area.

USAGE:
    sloword2vec add-subtract --add <add>... --limit <limit> --path <path> --subtract <subtract>...

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -A, --add <add>...              Words to add encodings for
    -L, --limit <limit>             Max number of similar entries to show. [default: 20]
    -P, --path <path>               Where the saved model is stored.
    -S, --subtract <subtract>...    Words to subtract encodings for
```
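For example, the canonical query, repeating the multi-value flags once per word (model path again made up):

```
sloword2vec add-subtract --path ./model.w2v --add king --add woman --subtract man --limit 5
```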

Details

This is pretty much the most naive implementation of Word2Vec possible; the only noteworthy optimisation is the use of matrix/vector maths to speed things up.

The linear algebra library behind this lib is ndarray, with OpenBLAS enabled (Fortran and transparent multithreading FTW!).
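To illustrate the idea (a minimal sketch, not this crate's actual API; the function and variable names are made up for the example): scoring a target word against the whole vocabulary can be done as a single matrix-vector product over the embedding matrix, rather than a per-word loop.

```rust
use ndarray::{arr1, arr2, Array1, Array2, Axis};

/// Cosine similarity of `target` against every row of `embeddings`.
fn similarities(embeddings: &Array2<f32>, target: &Array1<f32>) -> Array1<f32> {
    // One matrix-vector product scores the whole vocabulary at once
    // (dispatched to BLAS gemv when ndarray's blas feature is enabled).
    let dots = embeddings.dot(target);
    // Per-row L2 norms of the embedding matrix.
    let row_norms = embeddings.map_axis(Axis(1), |row| row.dot(&row).sqrt());
    let target_norm = target.dot(target).sqrt();
    dots / (row_norms * target_norm)
}

fn main() {
    // Toy vocabulary: three words encoded in two dimensions.
    let embeddings: Array2<f32> = arr2(&[[1.0, 0.0], [0.707, 0.707], [0.0, 1.0]]);
    let target: Array1<f32> = arr1(&[1.0, 0.0]);
    println!("{}", similarities(&embeddings, &target));
}
```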

Reference material

  1. word2vec Parameter Learning Explained, by Xin Rong (paper)
  2. Word2Vec skip-gram model tutorial (article)