kl-hyphenate version 0.7.3

Introduction

Two strategies are available:

- Standard Knuth–Liang hyphenation, with dictionaries built from the TeX UTF-8 patterns.
- Extended (“non-standard”) hyphenation based on László Németh's Automatic non-standard hyphenation in OpenOffice.org, with dictionaries built from Libre/OpenOffice patterns.

Usage

Quickstart

The dictionaries can be built with:

    cargo build -vv --features build_dictionaries

The resulting dictionaries are saved in the dictionaries directory.

You can then load and use a dictionary with:

```rust
use kl_hyphenate::{Standard, Hyphenator, Language, Load};

// `?` assumes an enclosing function that returns a `Result`.
let path_to_dict = "dictionaries/en-us.standard.bincode";
let en_us = Standard::from_path(Language::EnglishUS, path_to_dict)?;

// Identify valid breaks in the given word.
let hyphenated = en_us.hyphenate("hyphenation");

// Word breaks are represented as byte indices into the string.
let break_indices = &hyphenated.breaks;
assert_eq!(break_indices, &[2, 6, 7]);

// The segments of a hyphenated word can be iterated over.
let segments = hyphenated.into_iter().segments();
let collected: Vec<_> = segments.collect();
assert_eq!(collected, vec!["hy", "phen", "a", "tion"]);

// hyphenate() is case-insensitive.
let uppercase: Vec<_> = en_us.hyphenate("CAPITAL").into_iter().collect();
assert_eq!(uppercase, vec!["CAP-", "I-", "TAL"]);
```
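
Since breaks are plain byte indices, they can also be applied by hand, for instance to insert a soft hyphen instead of "-". The following is a minimal sketch using only the standard library; `split_at_breaks` is a hypothetical helper written for illustration, not part of kl-hyphenate, and it assumes the indices are sorted in ascending order and fall on character boundaries.

```rust
/// Splits `word` at the given byte indices, appending `mark` to every
/// segment except the last. Hypothetical helper, not a kl-hyphenate API.
///
/// Assumes `breaks` is sorted ascending and that each index lies on a
/// UTF-8 character boundary inside `word`; slicing panics otherwise.
fn split_at_breaks(word: &str, breaks: &[usize], mark: &str) -> Vec<String> {
    let mut segments = Vec::with_capacity(breaks.len() + 1);
    let mut start = 0;
    for &end in breaks {
        // Copy the bytes between the previous break and this one,
        // followed by the hyphenation mark.
        segments.push(format!("{}{}", &word[start..end], mark));
        start = end;
    }
    // The final segment carries no mark.
    segments.push(word[start..].to_string());
    segments
}
```

Called as `split_at_breaks("hyphenation", &[2, 6, 7], "-")`, this yields the segments "hy-", "phen-", "a-" and "tion"; passing "\u{00AD}" as the mark would produce soft hyphens instead.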

Segmentation

Dictionaries can be used in conjunction with text segmentation to hyphenate words within a text run. This short example uses the unicode-segmentation crate for untailored Unicode segmentation.

```rust
use unicode_segmentation::UnicodeSegmentation;

// `en_us` is the dictionary loaded in the Quickstart.
let hyphenate_text = |text: &str| -> String {
    // Split the text on word boundaries,
    text.split_word_bounds()
        // then hyphenate each word individually.
        .flat_map(|word| en_us.hyphenate(word).into_iter())
        .collect()
};

let excerpt = "I know noble accents / And lucid, inescapable rhythms; […]";
assert_eq!("I know no-ble ac-cents / And lu-cid, in-escapable rhythms; […]",
           hyphenate_text(excerpt));
```
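
The underlying idea is to transform each word and copy everything else through unchanged. This can be sketched without any dependency, as below. The sketch is a rough stand-in, not a replacement for unicode-segmentation: `map_words` is a hypothetical helper, and it treats any run of alphabetic characters as a word, so apostrophes and digits split words in ways that proper Unicode word segmentation would not.

```rust
/// Applies `hyphenate_word` to every run of alphabetic characters in `text`
/// and copies all other characters (spaces, punctuation) through unchanged.
/// Hypothetical helper; the word test is a simplification, not UAX #29.
fn map_words<F>(text: &str, mut hyphenate_word: F) -> String
where
    F: FnMut(&str) -> String,
{
    let mut out = String::with_capacity(text.len());
    let mut word_start: Option<usize> = None;

    for (i, ch) in text.char_indices() {
        match (ch.is_alphabetic(), word_start) {
            // A word begins here.
            (true, None) => word_start = Some(i),
            // A word just ended: transform it, then copy this character.
            (false, Some(start)) => {
                out.push_str(&hyphenate_word(&text[start..i]));
                out.push(ch);
                word_start = None;
            }
            // Outside any word: copy the character as-is.
            (false, None) => out.push(ch),
            // Still inside a word: nothing to emit yet.
            (true, Some(_)) => {}
        }
    }
    // Flush a word that runs to the end of the text.
    if let Some(start) = word_start {
        out.push_str(&hyphenate_word(&text[start..]));
    }
    out
}
```

In practice the closure would wrap a dictionary lookup such as the `hyphenate` call shown above; any function from `&str` to `String` works for experimenting.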

Normalization

Hyphenation patterns for languages affected by normalization occasionally cover multiple forms, at the discretion of their authors, but most often they don’t. If you require kl-hyphenate to operate strictly on strings in a known normalization form, as described by the Unicode Standard Annex #15 and provided by the unicode-normalization crate, you may specify it in your Cargo manifest, like so:

    [dependencies.kl-hyphenate]
    version = "…"
    features = ["nfc"]

The features field may contain exactly one of the following normalization options:

"nfc", for canonical composition;
"nfd", for canonical decomposition;
"nfkc", for compatibility composition;
"nfkd", for compatibility decomposition.

It is recommended to build kl-hyphenate in release mode if normalization is enabled, since the bundled hyphenation patterns will need to be reprocessed into dictionaries.
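
To see why the normalization form matters, note that the same visible text can be encoded as different byte sequences. A pattern written for one form will then not match a word in the other, and break byte indices shift accordingly. The following standard-library sketch illustrates this; the constants and the `shape` helper are illustrative names, not part of kl-hyphenate.

```rust
/// Returns (byte length, number of Unicode scalar values) for `s`.
fn shape(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// "café" with "é" as the single precomposed scalar U+00E9 (NFC form).
const COMPOSED: &str = "caf\u{00E9}";

/// "café" with "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT (NFD form).
const DECOMPOSED: &str = "cafe\u{0301}";
```

The two constants render identically but compare unequal: `shape(COMPOSED)` is `(5, 4)`, while `shape(DECOMPOSED)` is `(6, 5)`.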

License

Dual-licensed under the terms of either:

- the Apache License, Version 2.0
- the MIT license

patterns/hyph-hu.ext.txt (extended Hungarian hyphenation patterns) is licensed under:

- MPL 1.1 (refer to patterns/hyph-hu.ext.lic.txt)

patterns/hyph-ca.ext.txt (extended Catalan hyphenation patterns) is licensed under:

- LGPL v.3.0 or higher (refer to patterns/hyph-ca.ext.lic.txt)