Decompound

Decompose a compound word into its constituent parts. Works in any language, as you provide the rules around what constitutes a (single) word. The algorithm is Unicode-aware.

Useful for culling down existing dictionaries at build time.

The docs are best viewed via docs.rs.


Usage

Usage is very straightforward. There is only one (free) function of interest, [decompound]. Its party piece is a closure argument, deciding whether a single word is valid. As this can be highly dynamic and language-specific, this decision is left to the user.

```rust
use decompound::{decompound, DecompositionOptions};

let is_valid_single_word = |w: &str| ["bed", "room"].contains(&w);

assert_eq!(
    decompound(
        "bedroom",
        &is_valid_single_word,
        DecompositionOptions::empty(),
    ).unwrap(),
    vec!["bed", "room"]
);
```

Candidates for validity checks are simple dictionary lookups (for example, using [std::collections::HashSet], phf, Finite State Transducers, binary search, ...), or any elaborate algorithm of your choice.
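As a minimal sketch of two of the lookup strategies named above, the following builds validity checks with the same `&str -> bool` shape as the closure in the usage example. It uses only the standard library; the helper names `hash_set_check` and `sorted_vec_check` are hypothetical and not part of this crate.

```rust
use std::collections::HashSet;

/// Hypothetical helper: a validity check backed by a `HashSet`.
fn hash_set_check(words: &[&'static str]) -> impl Fn(&str) -> bool {
    let set: HashSet<&'static str> = words.iter().copied().collect();
    move |w: &str| set.contains(w)
}

/// Hypothetical helper: a validity check backed by binary search.
/// The words are sorted once up front, as binary search requires.
fn sorted_vec_check(mut words: Vec<&'static str>) -> impl Fn(&str) -> bool {
    words.sort_unstable();
    move |w: &str| words.binary_search_by(|probe| (*probe).cmp(w)).is_ok()
}

fn main() {
    let by_hash = hash_set_check(&["bed", "room"]);
    let by_search = sorted_vec_check(vec!["room", "bed"]);

    // Both checks accept the root words and reject the compound.
    assert!(by_hash("bed") && by_hash("room"));
    assert!(!by_hash("bedroom"));
    assert!(by_search("bed") && by_search("room"));
    assert!(!by_search("bedroom"));
}
```

Which backing structure fits best depends on dictionary size and on whether the word list is known at compile time.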

Configuration

Configuration is exposed as a bit field via [DecompositionOptions]. It affords more complex use cases, freely combinable. Usefulness largely depends on the natural language at hand. Some, for example German, might require:

```rust
use decompound::{decompound, DecompositionError, DecompositionOptions};

let is_valid_single_word = |w: &str| ["Rüben", "Knollen", "Küche"].contains(&w);

assert_eq!(
    decompound(
        "Rübenknollen-Küche",
        &is_valid_single_word,
        // Wouldn't find anything without titlecasing `knollen` to `Knollen`,
        // and splitting on hyphens.
        DecompositionOptions::SPLIT_HYPHENATED
        | DecompositionOptions::TRY_TITLECASE_SUFFIX
    ).unwrap(),
    vec!["Rüben", "Knollen", "Küche"]
);
```

This covers all currently available options already:

```rust
use decompound::DecompositionOptions;

assert!(
    (
        // This is doc-tested so new options are not forgotten.
        DecompositionOptions::SPLIT_HYPHENATED
        | DecompositionOptions::TRY_TITLECASE_SUFFIX
    ).is_all()
);
```

Failure modes

If the word cannot be decomposed, a [DecompositionError] is returned.

```rust
use decompound::{decompound, DecompositionError, DecompositionOptions};

let is_valid_single_word = |w: &str| ["water", "melon"].contains(&w);

assert_eq!(
    decompound(
        "snowball",
        &is_valid_single_word,
        DecompositionOptions::empty(),
    ).unwrap_err(),
    DecompositionError::NothingValid
);
```

Overeager validity checks

Nothing prevents you from providing a closure which itself accepts compound words. A lookup dictionary that includes compound words (like railroad) instead of only their root words (rail and road) is an example of such a "pathological" case. Accommodating compound words yourself is precisely what this crate is supposed to alleviate. If you already have that capability and do not want to or cannot drop it, this crate might be obsolete for your case (hence "overeager checks").

Although [decompound] prefers splits if possible, such as

```rust
use decompound::{decompound, DecompositionError, DecompositionOptions};

// Contains a compound word and its root words.
let is_valid_single_word = |w: &str| ["blueberry", "blue", "berry"].contains(&w);

assert_eq!(
    decompound(
        "blueberry",
        &is_valid_single_word,
        DecompositionOptions::empty(),
    ).unwrap(),
    vec!["blue", "berry"]
);
```

if root words are missing but the compound itself is present, decomposition technically fails. This case is considered an error, and marked as such. That is more ergonomic than being returned a [Vec] of constituents of length 1, requiring more awkward error handling at the call site.

```rust
use decompound::{decompound, DecompositionError, DecompositionOptions};

// Only contains a compound word, not its root words.
let is_valid_single_word = |w: &str| ["firefly"].contains(&w);

assert_eq!(
    decompound(
        "firefly",
        &is_valid_single_word,
        DecompositionOptions::empty(),
    ).unwrap_err(),
    DecompositionError::SingleWord("firefly".to_string())
);
```

Match on this variant if this case is not an error in your domain (this crate itself does so internally, too).

Motivation

The crate implementation is simple and nothing you wouldn't be able to write yourself.

There is a catch though. As mentioned, this crate can help you move checks for compound words from static (a fixed dictionary) to runtime ([decompound]). For some languages, this is strictly required, as the set of compound words might be immense, or (effectively, not mathematically) unbounded, meaning root words may be combined to arbitrary lengths. German is such a case. No dictionary exists to cover all possible German words. However, existing ones are almost guaranteed to themselves contain some compound words (which is generally helpful). When using such dictionaries and this crate to cover all remaining, arbitrary compound words, duplication arises, and the dictionary is no longer minimal. Most, perhaps all, compound words in the dictionary could be detected at runtime instead (providing a single source of truth along the way).

Culling the dictionary might lead to significant, perhaps necessary savings in size (memory and executable), so a build script is needed. But now, both the actual code and the build script depend on that same detection algorithm. If what you cull the dictionary with gets out of sync with what's done at runtime, bugs arise. The build script cannot depend on what it's building. Currently (2023-08-19), there is no place for the compound check to live except another crate, external to both the build script and actual code. That's this crate. It affords a non-cyclic build graph and a single source of truth for the compound check, and it allows the use of any dictionary, with no out-of-band preprocessing necessary (the original dictionary can be kept).
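To make the culling step concrete, here is a rough, standard-library-only sketch of what a build script might do. The function `is_compound` is a naive stand-in written for this illustration and is not this crate's algorithm; in a real build script, that check would presumably be a call to [decompound], so that build time and runtime share one implementation. The names `is_compound` and `cull` are hypothetical.

```rust
use std::collections::HashSet;

/// Naive stand-in for the compound check (NOT this crate's algorithm):
/// true if `word` can be split into two or more entries of `dict`.
fn is_compound(word: &str, dict: &HashSet<&str>) -> bool {
    // Try every split point that falls on a character boundary.
    word.char_indices().skip(1).any(|(i, _)| {
        let (head, tail) = word.split_at(i);
        dict.contains(head) && (dict.contains(tail) || is_compound(tail, dict))
    })
}

/// Keeps only the words that cannot be rebuilt from other dictionary entries.
fn cull<'a>(words: &[&'a str]) -> Vec<&'a str> {
    let dict: HashSet<&str> = words.iter().copied().collect();
    words
        .iter()
        .copied()
        .filter(|w| !is_compound(w, &dict))
        .collect()
}

fn main() {
    // "railroad" is dropped, as "rail" and "road" are present.
    // "firefly" is kept, as its root words are missing.
    let culled = cull(&["rail", "road", "railroad", "firefly"]);
    assert_eq!(culled, vec!["rail", "road", "firefly"]);
}
```

The words removed this way are exactly those the runtime check can reconstruct, which is why both sides need to agree on the same detection logic.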