feruca is a basic implementation of the Unicode Collation Algorithm in Rust. It's current with Unicode version 14.0. The name of the library is a portmanteau of Ferris 🦀 and UCA.
One unsafe function is called directly in this library: `char::from_u32_unchecked`. But this is done only after input is UTF-8-validated. It may soon be possible to remove these calls (despite their innocuousness in context).
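
As a rough illustration of why the unchecked call is harmless once input has been validated, here is a minimal sketch using only the standard library. The helper names `checked` and `roundtrip` are hypothetical and are not part of feruca; the point is only that values taken from a valid `&str` are always Unicode scalar values, so the check that `char::from_u32` performs is redundant for them.

```rust
/// Safe conversion: returns None for surrogates and out-of-range values.
/// (Hypothetical helper, for illustration only.)
fn checked(cp: u32) -> Option<char> {
    char::from_u32(cp)
}

/// Rebuilds a string from the code points of an already-validated `&str`.
/// (Hypothetical helper, for illustration only.)
fn roundtrip(text: &str) -> String {
    text.chars()
        .map(|c| c as u32)
        // SAFETY: every value here came from a `char`, so it is a valid
        // Unicode scalar value.
        .map(|cp| unsafe { char::from_u32_unchecked(cp) })
        .collect()
}

fn main() {
    // A surrogate code point is rejected by the checked conversion.
    println!("{:?}", checked(0xD800));
    // A valid scalar value is accepted.
    println!("{:?}", checked(0x00C9));
    // Validated input survives the unchecked conversion unchanged.
    println!("{}", roundtrip("\u{D6}tzi"));
}
```

Replacing the unchecked call with `char::from_u32` plus an `expect` would trade a redundant check for the absence of `unsafe`, which is presumably the kind of removal the paragraph above has in mind.
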
More importantly, feruca is designed to be tolerant of problematic input. The main function accepts either `&str` or `&[u8]`, and it relies on the excellent bstr library to generate a validated list of Unicode scalar values, which can then be processed for collation. This approach seems more useful than maintaining the illusion of safety by assuming that all input will be clean.
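
To make the idea of a "validated list of Unicode scalar values" concrete, here is a minimal sketch that uses the standard library's `String::from_utf8_lossy` as a stand-in. This is an assumption for illustration: feruca itself uses bstr, and the function name `scalar_values` is hypothetical. The sketch replaces each invalid byte sequence with U+FFFD rather than rejecting the input.

```rust
/// Turns arbitrary bytes into a list of Unicode scalar values.
/// Invalid UTF-8 sequences become U+FFFD (the replacement character).
/// (Hypothetical helper using std; not feruca's actual implementation.)
fn scalar_values(input: &[u8]) -> Vec<u32> {
    String::from_utf8_lossy(input)
        .chars()
        .map(|c| c as u32)
        .collect()
}

fn main() {
    // Clean input: a `&str` viewed as bytes.
    let clean = scalar_values("\u{C9}loi".as_bytes());
    println!("{:X?}", clean); // [C9, 6C, 6F, 69]

    // Problematic input: 0xFF can never appear in valid UTF-8.
    let dirty = scalar_values(b"El\xFFrond");
    println!("{:X?}", dirty); // [45, 6C, FFFD, 72, 6F, 6E, 64]
}
```

Either way, the collation step only ever sees valid scalar values, whatever the caller passed in.
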
In describing feruca as a "basic implementation," I have a few things in mind.
First, I don't expect that it's highly performant. My rough attempts at benchmarking suggest that this is on the order of 10–20x slower than `ucol` from icu4c. But my initial priority was to pass the official conformance tests. feruca also passes the conformance tests for the CLDR root collation order (more on this below).
Second, there is not yet support for tailoring, beyond being able to choose between the Default Unicode Collation Element Table (DUCET) and the default variation from CLDR. (You can additionally choose between the "non-ignorable" and "shifted" strategies for handling variable-weight characters.) Adding further support for tailoring is a near-term priority.
Third, the library has effectively[0] just one public function: `collate`, which accepts two string references or byte slices (plus a `CollationOptions` struct), and returns an `Ordering`. That is, you can call `collate` in the comparison closure passed to the standard library function `sort_by` (see "Example usage").
For many people and use cases, UCA sorting will not work properly without being able to specify a certain locale. That being said, the CLDR root collation order is already quite useful. When calling the `collate` function, you can pass default options (see below), which specify the use of the CLDR table with the "shifted" strategy. I think this is a good starting point.
[0]: There is also a variant form, `collate_no_tiebreak`, which will return `Ordering::Equal` for any two strings that produce the same UCA sort key. (The normal version will fall back on byte-value comparison in such cases.)
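
The difference between the two variants can be sketched as follows. This is not feruca's code: the function names are hypothetical, the sort keys are passed in precomputed, and the weights are placeholders rather than real UCA values. It only shows the shape of "compare sort keys, then optionally fall back on bytes."

```rust
use std::cmp::Ordering;

/// Compares on sort keys only; identical keys yield `Ordering::Equal`.
/// (Hypothetical helper, for illustration only.)
fn compare_no_tiebreak(key_a: &[u16], key_b: &[u16]) -> Ordering {
    key_a.cmp(key_b)
}

/// Compares on sort keys, then falls back on the raw bytes of the inputs.
/// (Hypothetical helper, for illustration only.)
fn compare_with_tiebreak(a: &[u8], key_a: &[u16], b: &[u8], key_b: &[u16]) -> Ordering {
    key_a.cmp(key_b).then_with(|| a.cmp(b))
}

fn main() {
    // Two canonically equivalent spellings of the same letter:
    // precomposed U+00C9 versus "E" followed by U+0301 (combining acute).
    let precomposed = "\u{C9}".as_bytes();
    let decomposed = "E\u{301}".as_bytes();

    // Placeholder weights, NOT real UCA data. What matters is that
    // canonically equivalent strings share a single sort key.
    let shared_key: [u16; 3] = [0x1000, 0x0020, 0x0002];

    // Without the tiebreak, the two spellings are indistinguishable.
    println!("{:?}", compare_no_tiebreak(&shared_key, &shared_key)); // Equal

    // With the tiebreak, the byte values decide: 0x45 sorts before 0xC3.
    println!(
        "{:?}",
        compare_with_tiebreak(decomposed, &shared_key, precomposed, &shared_key)
    ); // Less
}
```

The tiebreak makes the result deterministic for inputs that differ only in ways the sort key does not capture, at the cost of an ordering among them that has no linguistic meaning.
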
```rust
use feruca::{collate, CollationOptions};

fn main() {
    let mut uca = [
        "چنگیز",
        "Éloi",
        "Ötzi",
        "Melissa",
        "صدام",
        "Mélissa",
        "Overton",
        "Elrond",
    ];

    let mut naive = uca;

    uca.sort_by(|a, b| collate(a, b, CollationOptions::default()));
    naive.sort();

    for item in uca {
        println!("{}", item);
    }
    // Éloi
    // Elrond
    // Melissa
    // Mélissa
    // Ötzi
    // Overton
    // چنگیز
    // صدام

    // Add a line of space (in case you run this verbatim)
    println!();

    for item in naive {
        println!("{}", item);
    }
    // Elrond
    // Melissa
    // Mélissa
    // Overton
    // Éloi
    // Ötzi
    // صدام
    // چنگیز
}
```