Two strategies are available: - Standard Knuth–Liang hyphenation, with dictionaries built from the TeX UTF-8 patterns. - Extended (“non-standard”) hyphenation based on László Németh's Automatic non-standard hyphenation in OpenOffice.org, with dictionaries built from Libre/OpenOffice patterns.
The dictionaries can be built with:
shell
cargo build -vv --features build_dictionaries
The resulting dictionaries are saved in the dictionaries
directory.
You can then load and use a dictionary with: ```rust use kl_hyphenate::{Standard, Hyphenator, Language, Load};
let pathtodict = "dictionaries/en-us.standard.bincode"; let enus = Standard::frompath(Language::EnglishUS, pathtodict) ?;
// Identify valid breaks in the given word. let hyphenated = en_us.hyphenate("hyphenation");
// Word breaks are represented as byte indices into the string. let breakindices = &hyphenated.breaks; asserteq!(break_indices, &[2, 6, 7]);
// The segments of a hyphenated word can be iterated over. let segments = hyphenated.intoiter().segments(); let collected : Vec<_> = segments.collect(); asserteq!(collected, vec!["hy", "phen", "a", "tion"]);
// hyphenate()
is case-insensitive.
let uppercase : Vec<_> = enus.hyphenate("CAPITAL").intoiter().collect();
assert_eq!(uppercase, vec!["CAP-", "I-", "TAL"]);
```
Dictionaries can be used in conjunction with text segmentation to hyphenate words within a text run. This short example uses the unicode-segmentation
crate for untailored Unicode segmentation.
```rust use unicode_segmentation::UnicodeSegmentation;
let hyphenatetext = |text : &str| -> String { // Split the text on word boundaries— text.splitwordbounds() // —and hyphenate each word individually. .flatmap(|word| enus.hyphenate(word).intoiter()) .collect() };
let excerpt = "I know noble accents / And lucid, inescapable rhythms; […]"; asserteq!("I know no-ble ac-cents / And lu-cid, in-escapable rhythms; […]" , hyphenatetext(excerpt)); ```
Hyphenation patterns for languages affected by normalization occasionally cover multiple forms, at the discretion of their authors, but most often they don’t. If you require kl-hyphenate
to operate strictly on strings in a known normalization form, as described by the Unicode Standard Annex #15 and provided by the unicode-normalization
crate, you may specify it in your Cargo manifest, like so:
toml
[dependencies.kl-hyphenate]
version = "…"
features = ["nfc"]
The features
field may contain exactly one of the following normalization options:
"nfc"
, for canonical composition;"nfd"
, for canonical decomposition;"nfkc"
, for compatibility composition;"nfkd"
, for compatibility decomposition.It is recommended to build kl-hyphenate
in release mode if normalization is enabled, since the bundled hyphenation patterns will need to be reprocessed into dictionaries.
Dual-licensed under the terms of either: - the Apache License, Version 2.0 - the MIT license
texhyphen
and other hyphenation patterns © their respective owners; see patterns/*.lic.txt
files for licensing information.