Graphembed

The purpose of this crate is to provide embedding of directed or undirected graphs with positively weighted edges and possibly discrete labels attached to nodes.

Methods

The embedding algorithms used in this crate are based on the following papers

NodeSketch : Highly-Efficient Graph Embeddings via Recursive Sketching KDD 2019. [https://dl.acm.org/doi/10.1145/3292500.3330951]
D. Yang,P. Rosso,Bin-Li, P. Cudre-Mauroux.

It is based on multi hop neighbourhood identification via sensitive hashing based on the recent algorithm probminhash. See arxiv or ieee-2022.

The algorithm associates a probability distribution on neighbours of each point depending on edge weights and distance to the point. Then this distribution is hashed to build a (discrete) embedding vector consisting in nodes identifiers.
The distance between embedded vectors is the Jaccard distance so we get a real distance on the embedding space for the symetric embedding.

An extension of the paper is also implemented to get asymetric embedding for directed graph. The similarity is also based on the hash of sets (nodes going to or from) a given node but then the dissimilarity is no more a distance (no symetry and some discrepancy with the triangular inequality).

The orkut graph with 3 millions nodes and 100 millions of edge is embedded in less than 10' with a 8 core laptop with this algorithm.

Asymetric Transitivity Preserving Graph Embedding 2016
M. Ou, P Cui, J. Pei, Z. Zhang and W. Zhu.

The objective is to provide an asymetric graph embedding and get estimate of the precision of the embedding in function of its dimension.

We use the Adamic-Adar matricial representation of the graph. (It must be noted that the ponderation of a node by the number of couples joined by it is called Resource Allocation in the Graph Kernel litterature). The asymetric embedding is obtained from the left and right singular eigenvectors of the Adamic-Adar representation of the graph. Source node are related to left singular vectors and target nodes to the right ones.
The similarity measure is the dot product, so it is not a norm.
The svd is approximated by randomization as described in Halko-Tropp 2011 as implemented in the annembed crate.

The core decomposition algorithms

Large Scale decomposition via convex programming 2017
M.Danisch T.H Hubert Chan and M.Sozio

The decomposition of the graph in maximally dense groups of nodes is implemented and used to assess the quality of the embeddings in a structural way. See module validation and the comments on the embedding of the Orkut graph where we can use the community data provided with the graph to analyze the behaviour of embedded edge lengths.
In particular it is shown that embedding of edges internal to a community are consistently smaller than embedded edges crossing a block frontier, see results in orkut.md and examples directory together with a small Rust notebook in Notebooks

Some data sets

Without labels

Small datasets are given in the Data subdirectory (with 7z compression) to run tests.
Larger datasets can be downloaded from the SNAP data collections https://snap.stanford.edu/data

Some small test graphs are provided in a Data subdirectory

Some larger data tests for user to download

These graphs were used in results see below.

Beware of the possible need to convert from Windows to Linux End Of Line, see the dos2unix utility.
Possibly some data can need to be converted from Tsv format to Csv, before being read by the program.

Some results

results for the atp and nodesketch modules

Embedding and link prediction evaluation for the above data sets are given in file resultats.md A more global analysis of the embedding with the nodesketch module is done for the orkut graph in file orkut.md

Some qualitative comments

Generalized Svd

An implementation of Generalized Svd comes as a by-product in module gsvd.

Installation and Usage

Installation

The crate provides three features, required by the annembed dependency, to specify which version of lapack you want to use.
For example compilation is done by : cargo build --release --features="openblas-system" to use a dynamic link with openblas. The choice of one feature is mandatory to provide required linear algebra library.

Usage