A data embedding tool and related data analysis or clustering

The crate provides, in the form of a library:

  1. Some variations on data embedding tools from t-Sne (2008) to Umap(2018). Our implementation is a mix of the various embedding algorithms mentioned in References.

    1. Some by-products :

Future work

The crate will provide a single-linkage hierarchical clustering function and an implementation of the Mapper algorithm using the C++ **Ripser** module from U. Bauer.

Building

The crate provides 2 features to choose between openblas-static, intel-mkl-static as defined in the **ndarray-linalg** crate. So **--features "openblas-static"** or **--features "intel-mkl-static"** must be passed to cargo to compile. Alternatively define the default in Cargo.toml.

Results

These are preliminary results. Timings are given for a 8-core i7 @2.3 Ghz laptop.

Embedder examples

Sources of examples are in corresponding directory.
  1. MNIST digits database Cf mnist-digits
It consists in 70000 images of handwritten digits of 784 pixels

mnist

mnist

It took 22s of which 9s were spent in the ann construction.

  1. MNIST fashion database Cf mnist-fashion

It conssits in 70000 images of clothes.

rust // allocation of a Hnsw structure to store data let ef_c = 50; let max_nb_connection = 70; let nbimages = images_as_v.len(); let nb_layer = 16.min((nbimages as f32).ln().trunc() as usize); let hnsw = Hnsw::<f32, DistL2>::new(max_nb_connection, nbimages, nb_layer, ef_c, DistL2{}); let data_with_id : Vec<(&Vec<f32>, usize)>= images_as_v.iter().zip(0..images_as_v.len()).collect(); // data insertion in the hnsw structure hnsw.parallel_insert(&data_with_id); // choice of embedding parameters let mut embed_params = EmbedderParams::new(); embed_params.nb_grad_batch = 15; embed_params.scale_rho = 1.; embed_params.beta = 1.; embed_params.grad_step = 3.; embed_params.nb_sampling_by_edge = 10; embed_params.dmap_init = true; // conversion of the hnsw to a graph structure let knbn = 8; let kgraph = kgraph_from_hnsw_all(&hnsw, knbn).unwrap(); // allocation of the embedder and embedding embedder = Embedder::new(&kgraph, embed_params); let embed_res = embedder.embed();

Randomized SVD

The randomized SVD is based on the paper of Halko-Tropp. The implementation covers dense matrices or matrices in compressed row storage as provided in the sprs crate.

Two algorithms for range approximation used in approximated SVD are: - subspace_iteration_csr , corresponds to algo 4.4 in Tropp paper. It uses QR stabilization.
- adaptative_range_finder_matrep correponds to algo 4.2 in Tropp paper. The algorithm is less precise than subspace_iteration_csr but can work on larger matrices for example on sparse matrices with a million rows.

Installation

compile with :

References

License

Licensed under either of

  1. Apache License, Version 2.0, LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0

  2. MIT license LICENSE-MIT or http://opensource.org/licenses/MIT

at your option.

This software was written on my own while working at CEA