The crate provides, in the form of a library:
Some variations on data embedding tools, from t-SNE (2008) to UMAP (2018). Our implementation is a mix of the various embedding algorithms mentioned in the References.
The graph is initialized by the Hnsw nearest-neighbour algorithm as implemented in the hnsw_rs crate.
This provides, for free, sub-sampling of the data to embed by considering only the less densely occupied layers (the upper layers). This generally corresponds to a sub-sampling of 2%-4%, but comes with a guarantee, as the distance between each point left out of the sampling and its nearest sampled neighbour is known. The hnsw structure thus also enables an iterative/hierarchical initialization of the embedding by taking into account an increasing number of layers.
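As a back-of-envelope check of the quoted sub-sampling rate, here is a sketch assuming the standard Hnsw level assignment, level = floor(-ln(u)/ln(M)) for u uniform in (0,1), where M is the maximum number of connections (this formula is an assumption about hnsw_rs internals, not taken from the crate's documentation):

```rust
// Assumption: standard Hnsw level assignment, level = floor(-ln(u) / ln(M)).
// Then P(level >= 1) = P(u <= 1/M) = 1/M: the fraction of points kept when
// embedding only the upper (less densely occupied) layers.
fn upper_layer_fraction(max_nb_connection: usize) -> f64 {
    1.0 / max_nb_connection as f64
}

fn main() {
    // typical values of max_nb_connection between 25 and 50
    // give the 2%-4% sub-sampling mentioned above
    assert!((upper_layer_fraction(25) - 0.04).abs() < 1e-12);
    assert!((upper_layer_fraction(50) - 0.02).abs() < 1e-12);
}
```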
The preliminary graph built for the embedding uses an exponential function of distances to neighbour nodes (as in UMAP), but keeps a probability normalization constraint with respect to neighbours (as in t-SNE). It is possible to modulate the initial edge weight by :
We also use a cross-entropy optimization of this initial layout, but take into account the initial local density estimate of each point when computing the Cauchy weight of an embedded edge.
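The Cauchy (Student t with one degree of freedom) edge weight can be sketched as follows. This is an illustrative formula, not the crate's internal API; in particular, the exact way the local density estimate `rho_i` enters the kernel scale is an assumption:

```rust
// Illustrative sketch (not the crate's API): the weight of an embedded edge
// between points y_i and y_j uses a Cauchy kernel, with a scale modulated by
// a per-point local density estimate rho_i (how rho_i enters is an assumption).
fn cauchy_edge_weight(y_i: &[f64], y_j: &[f64], rho_i: f64) -> f64 {
    let d2: f64 = y_i.iter().zip(y_j).map(|(a, b)| (a - b) * (a - b)).sum();
    // a denser neighbourhood (small rho_i) sharpens the kernel
    1.0 / (1.0 + d2 / rho_i)
}

fn main() {
    // coincident points get weight 1 whatever the scale
    assert!((cauchy_edge_weight(&[0.0; 2], &[0.0; 2], 2.0) - 1.0).abs() < 1e-12);
    // d2 = 25 with scale 1 gives weight 1/26: weight decays with distance
    assert!((cauchy_edge_weight(&[0.0, 0.0], &[3.0, 4.0], 1.0) - 1.0 / 26.0).abs() < 1e-12);
}
```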
an implementation of range approximation and approximated SVD for dense and/or row-compressed matrices, as described in the svdapprox module and in the Halko-Tropp paper (cf. Tsvd).
An estimation of the data intrinsic dimension as described in:
Levina E. and Bickel P.J., NIPS 2004. See paper.
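The Levina-Bickel maximum-likelihood estimator can be sketched as follows (an illustrative single-point implementation, not the crate's API): given the sorted distances T_1 <= ... <= T_k from a point to its k nearest neighbours, the local dimension estimate is (k-1) / sum over j < k of ln(T_k / T_j).

```rust
// Hedged sketch of the Levina-Bickel (NIPS 2004) maximum-likelihood
// intrinsic-dimension estimator at one point, given the sorted distances
// t[0] <= t[1] <= ... <= t[k-1] to its k nearest neighbours.
fn levina_bickel_dim(sorted_dists: &[f64]) -> f64 {
    let k = sorted_dists.len();
    let t_k = sorted_dists[k - 1];
    let s: f64 = sorted_dists[..k - 1].iter().map(|&t_j| (t_k / t_j).ln()).sum();
    (k as f64 - 1.0) / s
}

fn main() {
    // on a d-dimensional manifold T_j grows like j^(1/d), so neighbour
    // distances growing linearly with rank look roughly one-dimensional
    let d = levina_bickel_dim(&[1.0, 2.0, 3.0, 4.0, 5.0]);
    assert!((d - 1.227).abs() < 1e-2);
}
```

The crate averages such local estimates over points and ranks (hence the reported mean and standard deviation); this sketch only shows the per-point formula.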
a Diffusion Maps implementation.
It took 22s, of which 9s were spent in the ANN construction.
It consists of 70000 images of clothes.
initialized by an approximated SVD.
time: 29s
hierarchical initialization (this is only useful for large data embeddings, where we initialize the embedding of one layer of the Hnsw structure with the embedding of the layer just above, to speed up the process).
time: 34s
The estimated intrinsic dimension of the data is 21.9, with a point-dependent standard deviation of 12.2, taking into account the sorted neighbours around each point between the 9th and 20th ranks.
```rust
// allocation of a Hnsw structure to store data
let ef_c = 50;
let max_nb_connection = 70;
let nbimages = images_as_v.len();
let nb_layer = 16.min((nbimages as f32).ln().trunc() as usize);
let hnsw = Hnsw::<f32, DistL2>::new(max_nb_connection, nbimages, nb_layer, ef_c, DistL2{});
let data_with_id: Vec<(&Vec<f32>, usize)> = images_as_v.iter().zip(0..images_as_v.len()).collect();
// data insertion in the hnsw structure
hnsw.parallel_insert(&data_with_id);
// choice of embedding parameters
let mut embed_params = EmbedderParams::new();
embed_params.nb_grad_batch = 15;
embed_params.scale_rho = 1.;
embed_params.beta = 1.;
embed_params.grad_step = 3.;
embed_params.nb_sampling_by_edge = 10;
embed_params.dmap_init = true;
// conversion of the hnsw structure to a graph structure
let knbn = 8;
let kgraph = kgraph_from_hnsw_all(&hnsw, knbn).unwrap();
// allocation of the embedder and embedding
let mut embedder = Embedder::new(&kgraph, embed_params);
let embed_res = embedder.embed();
```
The randomized SVD is based on the Halko-Tropp paper. The implementation covers dense matrices and matrices in compressed-row storage as provided by the sprs crate.
The two algorithms for range approximation used in the approximated SVD are:
- subspace_iteration_csr, which corresponds to algorithm 4.4 in the Halko-Tropp paper and uses QR stabilization.
- adaptative_range_finder_matrep, which corresponds to algorithm 4.2 in the Halko-Tropp paper. It is less precise than subspace_iteration_csr but can work on larger matrices, for example sparse matrices with a million rows.
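For intuition, the basic randomized range finder (algorithm 4.1 of Halko-Tropp) can be sketched in a few lines. This is a toy dense version, not the crate's subspace_iteration_csr or adaptative_range_finder_matrep, and the tiny xorshift generator merely stands in for a proper Gaussian RNG:

```rust
// Toy sketch of the basic randomized range finder: sample Y = A * Omega
// with a random test matrix Omega, then orthonormalize the columns of Y
// with modified Gram-Schmidt to get Q, whose columns approximate range(A).
// A crude xorshift generator stands in for a Gaussian RNG (assumption).
fn xorshift(state: &mut u64) -> f64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state % 10_000) as f64 / 10_000.0 - 0.5
}

/// Returns an (m x l) matrix with orthonormal columns approximating
/// the range of the m x n matrix `a`.
fn range_finder(a: &[Vec<f64>], l: usize) -> Vec<Vec<f64>> {
    let m = a.len();
    let n = a[0].len();
    let mut state = 0xdead_beef_u64;
    // Y = A * Omega, with Omega an n x l random matrix
    let mut y = vec![vec![0.0; l]; m];
    for j in 0..l {
        let omega_col: Vec<f64> = (0..n).map(|_| xorshift(&mut state)).collect();
        for i in 0..m {
            y[i][j] = a[i].iter().zip(&omega_col).map(|(x, w)| x * w).sum();
        }
    }
    // modified Gram-Schmidt on the columns of Y
    for j in 0..l {
        for p in 0..j {
            let dot: f64 = (0..m).map(|i| y[i][j] * y[i][p]).sum();
            for i in 0..m {
                let proj = dot * y[i][p];
                y[i][j] -= proj;
            }
        }
        let norm: f64 = (0..m).map(|i| y[i][j] * y[i][j]).sum::<f64>().sqrt();
        for i in 0..m {
            y[i][j] /= norm;
        }
    }
    y
}

fn main() {
    let a = vec![
        vec![1.0, 0.0, 0.0],
        vec![0.0, 2.0, 0.0],
        vec![0.0, 0.0, 3.0],
    ];
    let q = range_finder(&a, 2);
    // the columns of q are orthonormal
    let norm0: f64 = (0..3).map(|i| q[i][0] * q[i][0]).sum::<f64>().sqrt();
    let dot01: f64 = (0..3).map(|i| q[i][0] * q[i][1]).sum();
    assert!((norm0 - 1.0).abs() < 1e-9);
    assert!(dot01.abs() < 1e-9);
}
```

The adaptive variant (algorithm 4.2) grows the basis column by column until the residual norm falls below a tolerance, which is what makes it usable on very large sparse matrices.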
Compile with:
- `cargo build --release --features "openblas-static"` to link statically with the openblas library downloaded by the build,
- `cargo build --release --features "intel-mkl-static"` to link with Intel's MKL library (Intel MKL is downloaded automatically; see the README.md of the ndarray-linalg crate).
Visualizing Data using t-SNE. L. van der Maaten and G. Hinton, 2008.
Visualizing Large-Scale and High-Dimensional Data. J. Tang, J. Liu et al., WWW 2016 (LargeVis).
PHATE: Visualizing Structure and Transitions for Biological Data Exploration. K.R. Moon et al., 2017.
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. L. McInnes, J. Healy and J. Melville, 2018.
Licensed under either of
Apache License, Version 2.0, LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0
MIT license LICENSE-MIT or http://opensource.org/licenses/MIT
at your option.
This software was written on my own while working at CEA.