Suffix Collections


Fast implementations of a suffix array and a suffix tree for substring search, together with the longest common prefix array (LCP array).

Unicode

The current implementation builds the suffix structures over raw bytes and does not decode the string as Unicode before or during construction. If the Unicode string is normalized before both construction and search, the structures support Unicode, because every byte sequence then decodes unambiguously. Note that search and lcp return indexes into the byte array, not into the decoded Unicode string.
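Because results are byte offsets, a caller who needs character positions has to convert them. The sketch below uses only the Rust standard library; `byte_to_char_index` is a hypothetical helper written for illustration and is not part of this crate's API.

```rust
/// Hypothetical helper (not part of this crate): converts a byte offset
/// into `text` to a character (Unicode scalar value) index.
/// Returns `None` if the offset is out of range or falls inside a
/// multi-byte character.
fn byte_to_char_index(text: &str, byte_offset: usize) -> Option<usize> {
    // `is_char_boundary` is false for offsets past the end of the string
    // and for offsets in the middle of a UTF-8 sequence.
    if !text.is_char_boundary(byte_offset) {
        return None;
    }
    // Count the characters that precede the byte offset.
    Some(text[..byte_offset].chars().count())
}
```

For example, in "na\u{ef}ve caf\u{e9}" the substring "caf\u{e9}" starts at byte offset 7 but at character index 6, since U+00EF occupies two bytes in UTF-8. Normalization matters for the same reason: "\u{e9}" and "e\u{301}" render identically but are different byte sequences, so a byte-level search for one will not find the other.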

Example

SuffixArray and LCP array for the word "mississippi":

LCP   index   suffix
0     11      (empty suffix)
0     10      i
1     7       ippi
1     4       issippi
4     1       ississippi
0     0       mississippi
0     9       pi
1     8       ppi
0     6       sippi
2     3       sissippi
1     5       ssippi
3     2       ssissippi
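The table can be reproduced with a deliberately naive sketch that sorts the suffixes directly. This is only an illustration of what the two arrays contain: it is not the SA-IS algorithm, it is far slower than linear time, and the function names `naive_suffix_array` and `naive_lcp` are hypothetical rather than part of this crate's API.

```rust
/// Naive suffix array: sorts every suffix start offset, including the
/// empty suffix at `text.len()`, by comparing the byte slices directly.
/// For illustration only; this is not SA-IS and is not linear time.
fn naive_suffix_array(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..=text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

/// `lcp[i]` is the length of the longest common prefix of the suffixes
/// starting at `sa[i - 1]` and `sa[i]`; `lcp[0]` is 0 by convention.
fn naive_lcp(text: &[u8], sa: &[usize]) -> Vec<usize> {
    let mut lcp = vec![0; sa.len()];
    for i in 1..sa.len() {
        let (a, b) = (&text[sa[i - 1]..], &text[sa[i]..]);
        // Count matching leading bytes of the two adjacent suffixes.
        lcp[i] = a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count();
    }
    lcp
}
```

Calling `naive_suffix_array(b"mississippi")` yields the index column of the table, and passing that result to `naive_lcp` yields the LCP column.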

Construction and search both run in O(n) time. The suffix tree is built with Ukkonen's algorithm, and the suffix array with the SA-IS algorithm.
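To show how a suffix array supports substring search at all, here is a minimal sketch using binary search over the sorted suffixes. It is an assumption-laden illustration, not this crate's search routine: the construction is a naive sort rather than SA-IS, the lookup costs O(m log n) for a pattern of length m, and the names `build_sa` and `find_with_sa` are hypothetical.

```rust
/// Builds a suffix array by sorting suffix start offsets.
/// Naive stand-in for illustration; this is not SA-IS.
fn build_sa(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

/// Returns the byte offset of one occurrence of `pattern` in `text`,
/// or `None` if the pattern does not occur.
fn find_with_sa(text: &[u8], sa: &[usize], pattern: &[u8]) -> Option<usize> {
    // Binary search for the first suffix that is >= `pattern` in
    // lexicographic order. If the pattern occurs anywhere, that suffix
    // must start with it.
    let pos = sa.partition_point(|&start| &text[start..] < pattern);
    sa.get(pos)
        .copied()
        .filter(|&start| text[start..].starts_with(pattern))
}
```

With the suffix array of "mississippi", searching for "ssip" returns byte offset 5, and searching for "issi" returns 4, the start of the lexicographically smallest suffix that begins with the pattern.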