Ferret: Copy-Detection in Text and Code

Ferret is a copy-detection tool that locates duplicate text or code across multiple text documents or source files. It is designed to detect copying (collusion) within a given set of files.

As a library, Ferret can be used to break program code or natural-language text into trigrams, and to compare pairs of documents for similarity.
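To illustrate the idea of trigram-based comparison, here is a minimal, self-contained sketch that does not use Ferret itself. It assumes that tokens are lowercase alphanumeric words and that similarity is the Jaccard coefficient over the two trigram sets. The functions `tokenize`, `trigrams` and `similarity` are hypothetical helpers written for this example, and Ferret's own tokenisation and scoring may differ.

``` rust
use std::collections::HashSet;

/// Split text into lowercase alphanumeric tokens.
/// Assumption for this sketch only; Ferret's tokeniser may differ.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Collect the set of distinct three-token sequences in a text.
fn trigrams(text: &str) -> HashSet<(String, String, String)> {
    let tokens = tokenize(text);
    tokens
        .windows(3)
        .map(|w| (w[0].clone(), w[1].clone(), w[2].clone()))
        .collect()
}

/// Jaccard coefficient: shared trigrams divided by all distinct trigrams.
/// Returns 0.0 when neither text contains a trigram.
fn similarity(a: &str, b: &str) -> f64 {
    let (ta, tb) = (trigrams(a), trigrams(b));
    let union = ta.union(&tb).count();
    if union == 0 {
        return 0.0;
    }
    ta.intersection(&tb).count() as f64 / union as f64
}

fn main() {
    let a = "the quick brown fox jumps over the lazy dog";
    let b = "the quick brown fox leaps over the lazy dog";
    // Changing one word alters three of the seven trigrams in each text,
    // leaving 4 shared trigrams out of 10 distinct ones.
    println!("similarity = {:.3}", similarity(a, b));
}
```

Working with sets of trigrams rather than whole lines means a single changed word only affects the three trigrams that contain it, so lightly edited copies still score highly.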

Features:

Command line use

    $ ferret --help
    Usage: ferret [-ghluvx] filename [filenames...]
      -g, --group            Use subdirectory names to group files
      -h, --help             Show help information
      -l, --list-trigrams    Output list of trigrams found
      -u, --unique-counts    Output counts of unique trigrams
      -v, --version          Version number
      -x, --xml-report filename1 filename2 outfile : Create XML report
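The `--group` option groups files by subdirectory name. As a rough sketch of what such grouping could look like, the following standalone example buckets paths by the name of their immediate parent directory. The function `group_by_subdirectory` is hypothetical and not part of Ferret, and how Ferret actually uses the groups when comparing files is not specified here.

``` rust
use std::collections::BTreeMap;
use std::path::Path;

/// Group file paths by the name of their immediate parent directory.
/// Paths without a parent directory name go into the "" group.
/// Hypothetical helper for illustration; not Ferret's implementation.
fn group_by_subdirectory(files: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for file in files {
        let group = Path::new(file)
            .parent()
            .and_then(|p| p.file_name())
            .map(|n| n.to_string_lossy().into_owned())
            .unwrap_or_default();
        groups.entry(group).or_default().push(file.to_string());
    }
    groups
}

fn main() {
    let files = ["alice/main.rs", "alice/util.rs", "bob/main.rs"];
    // Print each group name followed by the files it contains.
    for (group, members) in group_by_subdirectory(&files) {
        println!("{}: {:?}", group, members);
    }
}
```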

Library use

Take some files and find the two most similar:

``` rust
use ferret::documents::Documents;

fn main() {
    // Paths of the documents to compare.
    let files = ["txt1.txt".to_string(), "txt2.txt".to_string(), "txt3.txt".to_string()];
    let docs = Documents::new(&files[..]);
    // The first entry of the sorted results is the most similar pair.
    let results = docs.sorted_results(false);
    println!("Most similar pair: {}", results[0]);
}
```

Take a file, and read it trigram-by-trigram:

``` rust
use ferret::trigram_reader::TrigramReader;
use std::path::PathBuf;

fn main() {
    let path = PathBuf::from(r"test.rb");
    let mut reader = TrigramReader::new(&path);

    // Print each trigram until the reader reports there are no more.
    while reader.read_trigram() {
        println!("Trigram {}", reader.last_trigram());
    }
}
```