A repetition detector written in Rust.
I don't think this is really the case in English, but in French (and possibly other languages), it is considered poor style to repeat a word too often in a text, particularly a literary one. The purpose of this tool is to help a writer detect those repetitions.
A text is composed of words, themselves composed of characters, which in French are called caractères. In French, good is bon, so Caribon is essentially good characters.
Alright, this doesn't make much sense; I'll admit I just found the name funny.
Internally, Caribon uses a stemming library (https://github.com/lady-segfault/stemmer-rs, the Rust bindings for the Snowball C implementation: http://snowball.tartarus.org/) to reduce words to their stems, which makes it possible, for example, to treat a singular and a plural as the "same" word. After that, it just counts the repetitions and outputs HTML.
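To illustrate the idea, here is a minimal sketch of stem-based counting. It is not Caribon's code: `toy_stem` and `count_stems` are hypothetical names, and `toy_stem` only lowercases a word and strips a trailing "s", whereas Snowball applies real, language-specific rules.

```rust
use std::collections::HashMap;

/// Toy stand-in for a real stemmer (hypothetical): lowercases the word and
/// strips a trailing "s" when at least three characters remain.
fn toy_stem(word: &str) -> String {
    let lower = word.to_lowercase();
    match lower.strip_suffix('s') {
        Some(stripped) if stripped.len() > 2 => stripped.to_string(),
        _ => lower,
    }
}

/// Counts how many times each stem occurs in `text`.
fn count_stems(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
    {
        *counts.entry(toy_stem(word)).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = count_stems("The cat sat near the cats. Cats sleep.");
    // Keep only the stems that occur more than once, sorted by stem.
    let mut repeated: Vec<_> = counts.iter().filter(|&(_, &n)| n > 1).collect();
    repeated.sort();
    for (stem, n) in repeated {
        println!("{}: {}", stem, n);
    }
}
```

With this toy stemmer, "cat", "cats" and "Cats" all end up under the stem "cat", which is the effect the real stemming library achieves in a much more robust way.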
You'll need Rust and Cargo, see install instructions. Then
$ cargo build
should do the job (it works with Rust 1.1). You can then run caribon either with:
$ cargo run
or by directly executing the binary (in `target/debug` or `target/release`).
If you plan to use `cargo run`, note that command-line arguments must be prefixed by `--` so that cargo passes them to the binary:
$ cargo run -- --input=some_text.txt --output=output.html
You can also install the `caribon` binary somewhere in your path (e.g. `/usr/local/bin`), but there is currently no install/uninstall option, so you'll have to do it manually.
Once you have generated an HTML file, just open it with your favorite browser to see your repetitions. Note that the default binary is currently configured for French; if you want to use another language, you'll have to pass an option (see below). Also note that although a variety of input languages are supported thanks to the Snowball stemming library, at this time only French has an (incomplete) list of common words to ignore.
Here is an example of Caribon used on a (previous) version of this README, using the following command:
cargo run -- --language=english --input=README.html --output=example.html
```
Caribon, version 0.4.0 by Élisabeth Henry liz.henry@ouvaton.org
Detects the repetitions in a text and renders a HTML document highlighting them
Options:
--help: displays this message
--version: displays program version
--listlanguages: lists the implemented languages
--language=[language]: sets the language of the text (default: french)
--input=[filename]: sets input file (default: stdin)
--output=[filename]: sets output file (default: stdout)
--ignore=[string]: a string containing custom ignored words, separated by spaces or comma (default: use a builtin list that depends on the language)
--algo=[global|local|leak]: sets the detection algoritm (default: local)
--leak=[value]: sets leak value (only used by leak algorithm) (default: 0.95)
--maxdistance=[value]: sets max distance (only used by local algorithm) (default: 50)
--globalcount=[relative|absolute]: sets repetitions count as absolute or relative ratio of words (only used by global algorithm) (default: absolute)
--threshold=[value]: sets threshold value for underlining repetitions (default: 1.9)
--html=[true|false]: enables/disable HTML input (default: true)
--ignoreproper=[true|false]: if true, try to detect proper nouns and don't count them (default: false)
```
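The `--algo=local` and `--maxdistance` options suggest counting repetitions within a window of neighbouring words. The sketch below is one plausible reading of that idea, not Caribon's actual implementation: `local_counts` is a hypothetical name, and the real algorithm works on stems and may weight occurrences differently.

```rust
/// For each word, counts the occurrences of the same word (itself included)
/// found at most `max_distance` positions before or after it.
/// Hypothetical helper, not part of Caribon.
fn local_counts(words: &[&str], max_distance: usize) -> Vec<usize> {
    (0..words.len())
        .map(|i| {
            // Clamp the window to the bounds of the text.
            let start = i.saturating_sub(max_distance);
            let end = (i + max_distance + 1).min(words.len());
            words[start..end].iter().filter(|w| **w == words[i]).count()
        })
        .collect()
}

fn main() {
    let words = ["a", "b", "a", "c", "d", "e", "a"];
    // With a window of 2, the last "a" is too far from the others to count.
    println!("{:?}", local_counts(&words, 2)); // prints [2, 1, 2, 1, 1, 1, 1]
}
```

In this reading, a small window only flags repetitions that are close together, while a window larger than the text degenerates into a plain global count.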
It is possible to use Caribon as a library. The documentation is available here; to get the latest version, you can also generate it with `cargo doc`.
Basically, it's pretty easy:
You create a new parser with `Parser::new("language")` (the only trick is that it returns an `Option`, since not all languages are implemented; see `Parser::list_languages()` to get a vector of those that are implemented by the stemming library).
You can then set some parameters for the parser, e.g.:
```rust
let parser = Parser::new("french")
    .unwrap()
    .with_html(true) // enable HTML in input (this is the default, so the call is redundant)
    .with_ignore_proper(true); // don't count repetitions for proper nouns
```
The first step is to "tokenize" the string you want to parse:
```rust
let words = parser.tokenize("Some string which may or may not contain repetitions");
```
The second step is to detect the repetitions, using one of the three algorithms:
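```rust
// Pick one of the three algorithms; the other two are shown commented out.
let detected_words = parser.detect_local(words);
// let detected_words = parser.detect_global(words, false); // second argument is is_relative
// let detected_words = parser.detect_leak(words);
```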
The final step is to display this vector of words. The parser provides a method that generates an HTML file, which also takes as argument a threshold above which words are underlined, and a boolean to tell whether it must be a standalone file or not:
```rust
// Threshold of 1.5, rendered as a standalone HTML file.
println!("{}", parser.words_to_html(&detected_words, 1.5, true));
```
(A note on this threshold: its choice depends on the detection algorithm you use (and possibly on your taste and the language you write in, of course). Generally, it should be a bit above 1.0, except for `detect_global`, in which case it depends on whether you set `is_relative` to true or false.)
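As a rough illustration of how such a threshold can drive the rendering, the sketch below underlines every word whose score exceeds it. The `(word, score)` pairs, the function name `underline_above` and the `<u>` markup are assumptions made for this example; they are not what `words_to_html` actually produces.

```rust
/// Wraps every word whose score is strictly above `threshold` in a <u> tag
/// and joins the words with spaces. Hypothetical helper: it does no HTML
/// escaping and ignores the original whitespace and punctuation.
fn underline_above(scored: &[(&str, f32)], threshold: f32) -> String {
    scored
        .iter()
        .map(|&(word, score)| {
            if score > threshold {
                format!("<u>{}</u>", word)
            } else {
                word.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let scored: [(&str, f32); 4] = [("the", 1.0), ("cat", 2.0), ("saw", 1.0), ("cats", 2.0)];
    // prints: the <u>cat</u> saw <u>cats</u>
    println!("{}", underline_above(&scored, 1.5));
}
```

A threshold slightly above 1.0 means that a word seen once is left alone, while a word counted twice or more gets highlighted.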
`<html>`, `<body>` and so on) but it works fine if you use e.g. `pandoc -o file.html file.md`.
Caribon is licensed under the GNU General Public License, version 2.0 or (at your option) any later version.
Élisabeth Henry
This software uses (Rust bindings to) the C Stemming library written by Dr Martin Porter, licensed under the BSD License.
caribon-server, a work-in-progress project that runs Caribon as a web server.