sesdiff: Shortest Edit Script Diff

Description

This is a small and fast command-line tool that reads two-column, tab-separated input from standard input and computes the shortest edit script (Myers' diff algorithm) to go from the string in column A to the string in column B. It also computes the edit distance (aka Levenshtein distance).

It was written to build lemmatisers.

Installation

Install it using Rust's package manager:

cargo install sesdiff

No cargo/rust on your system yet? Run sudo apt install cargo on Debian/Ubuntu-based systems or brew install rust on macOS, or use rustup.

This tool builds upon Dissimilar, which provides the actual diff algorithm (it is downloaded and compiled in automatically).

Usage

$ sesdiff < input.tsv

Example input and output (reformatted for legibility; the first two columns correspond to the input). Output is in a four-column, tab-separated format: the two input strings, the edit script, and the edit distance.

hablaron    hablar      =[hablar]-[on]                  2
contaron    contar      =[contar]-[on]                  2
pidieron    pedir       =[p]-[i]+[e]=[di]-[eron]+[r]    6
говорим     говорить    =[говори]-[м]+[ть]              3
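For downstream processing, each output line can be split on tabs into its four columns. The following is a minimal sketch, not part of sesdiff itself: the Row struct and the parse_row function are hypothetical names introduced here for illustration, and the sketch assumes the columns themselves never contain a tab.

```rust
/// One row of sesdiff output (hypothetical struct, for illustration only).
#[derive(Debug, PartialEq)]
struct Row {
    source: String,   // column 1: the string in column A of the input
    target: String,   // column 2: the string in column B of the input
    script: String,   // column 3: the edit script
    distance: usize,  // column 4: the edit distance
}

/// Parse a single tab-separated output line.
/// Returns None if the line does not have exactly four columns
/// or if the fourth column is not a non-negative integer.
fn parse_row(line: &str) -> Option<Row> {
    // Drop a trailing newline or carriage return, if any.
    let line = line.trim_end_matches(|c| c == '\r' || c == '\n');
    let cols: Vec<&str> = line.split('\t').collect();
    if cols.len() != 4 {
        return None;
    }
    Some(Row {
        source: cols[0].to_string(),
        target: cols[1].to_string(),
        script: cols[2].to_string(),
        distance: cols[3].parse().ok()?,
    })
}
```

Calling parse_row on the first example line (with real tabs between the columns) would yield a Row whose source is hablaron, whose target is hablar, and whose distance is 2.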

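By default the full edit script is given in a simple language: =[text] marks text that is kept unchanged, -[text] marks text that is deleted, and +[text] marks text that is inserted.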
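To make the script language concrete, here is a minimal sketch that applies a full edit script to the source string and reproduces the target. It reads the example scripts as: =[x] keeps x, -[x] deletes x, +[x] inserts x, processed from left to right. That reading is inferred from the example output rather than taken from the sesdiff source. The names Op, parse_script and apply_script are hypothetical, and the sketch assumes the bracketed text never contains a ] character.

```rust
/// A single operation in an edit script (hypothetical type).
#[derive(Debug, PartialEq)]
enum Op {
    Keep(String),
    Delete(String),
    Insert(String),
}

/// Parse a script such as "=[hablar]-[on]" into a list of operations.
/// Assumption: the text inside the brackets never contains ']'.
fn parse_script(script: &str) -> Option<Vec<Op>> {
    let mut ops = Vec::new();
    let mut rest = script;
    while !rest.is_empty() {
        let mut chars = rest.chars();
        let kind = chars.next()?;
        if chars.next()? != '[' {
            return None;
        }
        let body = chars.as_str();
        let end = body.find(']')?;
        let text = body[..end].to_string();
        ops.push(match kind {
            '=' => Op::Keep(text),
            '-' => Op::Delete(text),
            '+' => Op::Insert(text),
            _ => return None,
        });
        rest = &body[end + 1..];
    }
    Some(ops)
}

/// Apply a full (left-to-right) edit script to `source`.
/// Returns None if the script is malformed or does not match `source`.
fn apply_script(source: &str, script: &str) -> Option<String> {
    let mut rest = source;
    let mut out = String::new();
    for op in parse_script(script)? {
        match op {
            Op::Keep(t) => {
                rest = rest.strip_prefix(t.as_str())?;
                out.push_str(&t);
            }
            Op::Delete(t) => {
                rest = rest.strip_prefix(t.as_str())?;
            }
            Op::Insert(t) => out.push_str(&t),
        }
    }
    // A full script must consume the whole source string.
    if rest.is_empty() { Some(out) } else { None }
}
```

Under these assumptions, applying =[p]-[i]+[e]=[di]-[eron]+[r] to pidieron gives pedir, matching the second column of the example.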

For lemmatisation purposes, it makes sense for many languages to look at suffixes (from right to left) and strip common prefixes. Pass the --suffix option for that behaviour; the output then becomes:

$ sesdiff --suffix < input.tsv
hablaron    hablar      -[on]                       2
contaron    contar      -[on]                       2
pidieron    pedir       -[eron]+[r]=[di]-[i]+[e]    6
говорим     говорить    -[м]+[ть]                   3
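A suffix script is what a lemmatiser would apply to a word form. The sketch below assumes, based only on the example output above, that the operations are read against the end of the word moving leftwards, and that whatever part of the word the script does not mention is kept as-is. The names tokens and apply_suffix_script are hypothetical, and the bracketed text is assumed never to contain a ] character.

```rust
/// Split a script such as "-[on]+[r]" into (operator, text) pairs.
/// Assumption: the text inside the brackets never contains ']'.
fn tokens(script: &str) -> Option<Vec<(char, &str)>> {
    let mut out = Vec::new();
    let mut rest = script;
    while !rest.is_empty() {
        let op = rest.chars().next()?;
        let body = rest[op.len_utf8()..].strip_prefix('[')?;
        let end = body.find(']')?;
        out.push((op, &body[..end]));
        rest = &body[end + 1..];
    }
    Some(out)
}

/// Apply a suffix-mode script to `word`, working from the right.
/// Returns None if the script is malformed or does not match the word.
fn apply_suffix_script(word: &str, script: &str) -> Option<String> {
    let mut head = word;          // part of the word not yet consumed
    let mut tail = String::new(); // rewritten ending, built right to left
    for (op, text) in tokens(script)? {
        match op {
            '=' => {
                head = head.strip_suffix(text)?;
                tail.insert_str(0, text);
            }
            '-' => {
                head = head.strip_suffix(text)?;
            }
            '+' => tail.insert_str(0, text),
            _ => return None,
        }
    }
    // The untouched beginning of the word is kept unchanged.
    Some(format!("{}{}", head, tail))
}
```

With this reading, the script -[on] learned from hablaron also maps a word that was not in the input, such as cantaron, to cantar, which is the kind of generalisation a suffix-based lemmatiser relies on.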

There is also a --prefix option that strips common suffixes.

License

GNU General Public License v3