A data embedding tool with related data analysis and clustering

The crate is provided mainly in the form of a library (see the documentation of the binary embed for a small executable that embeds data stored in csv files):

  1. Some variations on data embedding tools, from t-SNE (2008) to UMAP (2018).

    Our implementation is a mix of the various embedding algorithms mentioned in the References.

Building

The crate provides three features to choose between openblas-static, intel-mkl-static, or openblas-system, as defined in the ndarray-linalg crate.

Compile with:
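For example, a release build selecting one backend could look like this (a sketch: openblas-static is assumed here, substitute the feature matching your platform):

```shell
# build the crate with the statically linked OpenBLAS backend
cargo build --release --features="openblas-static"
```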

Alternatively, define the default feature in Cargo.toml.
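A sketch of such a default (assuming the openblas-static backend) in the `[features]` section of Cargo.toml:

```toml
[features]
# make the statically linked OpenBLAS backend the default
default = ["openblas-static"]
```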

Results

Timings are given for a 24-core (32-thread) i9 laptop with 64 GB of memory.

Embedder examples

Sources of the examples are in the corresponding directory.

  1. MNIST digits database, cf. mnist-digits

    It consists of 70,000 images of handwritten digits, each of 784 pixels.

    It took 11 s (system time) to run (270 s of cpu time), of which 3 s were spent in the ann construction.

    [figure: mnist digits embedding]

    [figure: mnist digits embedding]

    It took 11 s to run (334 s of cpu time), of which 3 s were spent in the ann construction.

  2. MNIST fashion database, cf. mnist-fashion

    It consists of 70,000 images of clothes.

    [figure: mnist fashion embedding]

    system time: 14 s, cpu time: 428 s

    [figure: mnist fashion embedding]

    system time: 15 s, cpu time: 466 s

  3. Higgs boson data, cf. Higgs-data

    It consists of 11 million float vectors of dimension 28. First we run on the first 21 columns, leaving out the last 7 variables, which were constructed by physicists to help discrimination in machine learning tasks; then we run on all 28 variables.

    In both cases we use hierarchical initialization. In the first pass we run 200 batches using the layers from layer 1 (included) up to the upper layer, so the first batches run on about 460,000 nodes. Then 40 batches are run on the full 11 million points.

    Run times are around 2 hours in both cases (45 min for the Hnsw construction and 75 min for the entropy iterations).

    [figure: higgs-21]

    28 variables case:

    [figure: higgs-28]

    [figure: higgs-28-subs0.15]

    Density of points obtained by transforming the distance to the first neighbour (see visu.jl):

    [figure: higgs-28-density]

    [figure: higgs_dmap]

Usage

```rust
// allocation of a Hnsw structure to store data
let ef_c = 50;
let max_nb_connection = 70;
let nbimages = images_as_v.len();
let nb_layer = 16.min((nbimages as f32).ln().trunc() as usize);
let hnsw = Hnsw::<f32, DistL2>::new(max_nb_connection, nbimages, nb_layer, ef_c, DistL2{});
let data_with_id: Vec<(&Vec<f32>, usize)> = images_as_v.iter().zip(0..images_as_v.len()).collect();
// data insertion in the hnsw structure
hnsw.parallel_insert(&data_with_id);
// choice of embedding parameters
let mut embed_params = EmbedderParams::new();
embed_params.nb_grad_batch = 15;
embed_params.scale_rho = 1.;
embed_params.beta = 1.;
embed_params.grad_step = 1.;
embed_params.nb_sampling_by_edge = 10;
embed_params.dmap_init = true;
// conversion of the hnsw structure to a graph structure
let knbn = 8;
let kgraph = kgraph_from_hnsw_all(&hnsw, knbn).unwrap();
// allocation of the embedder and embedding
let mut embedder = Embedder::new(&kgraph, embed_params);
let embed_res = embedder.embed();
```

Randomized SVD

The randomized SVD is based on the Halko-Tropp paper. The implementation covers dense matrices and matrices in compressed sparse row storage as provided by the sprs crate.

Two algorithms for range approximation used in the approximate SVD are:
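Whichever variant is used, the core randomized range finder idea from Halko et al. can be sketched in plain Rust: project A onto a few random test vectors, then orthonormalize the sample Y = A * Omega to get a basis Q approximating range(A). The matrix type, the simple LCG generator, and the function names below are illustrative only, not the crate's API:

```rust
type Mat = Vec<Vec<f64>>; // row-major dense matrix, for illustration only

// cheap deterministic pseudo-random numbers in [-1, 1) (illustration, not a quality RNG)
fn lcg(state: &mut u64) -> f64 {
    *state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    ((*state >> 11) as f64 / (1u64 << 53) as f64) * 2.0 - 1.0
}

// sample the range of a: compute Y = A * Omega with Omega an (ncols x rank) random test matrix
fn sample_range(a: &Mat, rank: usize, seed: u64) -> Mat {
    let (m, n) = (a.len(), a[0].len());
    let mut state = seed;
    let omega: Mat = (0..n)
        .map(|_| (0..rank).map(|_| lcg(&mut state)).collect())
        .collect();
    (0..m)
        .map(|i| {
            (0..rank)
                .map(|j| (0..n).map(|k| a[i][k] * omega[k][j]).sum())
                .collect()
        })
        .collect()
}

// orthonormalize the columns of y in place (modified Gram-Schmidt),
// so its columns form an orthonormal basis Q of the sampled range
fn orthonormalize(y: &mut Mat) {
    let (m, r) = (y.len(), y[0].len());
    for j in 0..r {
        for p in 0..j {
            let dot: f64 = (0..m).map(|i| y[i][p] * y[i][j]).sum();
            for i in 0..m {
                y[i][j] -= dot * y[i][p];
            }
        }
        let norm: f64 = (0..m).map(|i| y[i][j] * y[i][j]).sum::<f64>().sqrt();
        for i in 0..m {
            y[i][j] /= norm;
        }
    }
}
```

With Q in hand, the small factorization of Qᵀ·A then yields the approximate SVD of A.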

References

License

Licensed under either of

  1. Apache License, Version 2.0, LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0

  2. MIT license LICENSE-MIT or http://opensource.org/licenses/MIT

at your option.