Online plagiarism detection tools usually come with a few constraints: the tool might be a paid-only service, the number of characters to check might be artificially limited, and so on. This tool aims to fill a gap where:

- Plagiarism cases are usually simple copy-paste jobs of a few phrases with minor edits,
- Paying for an online tool is unpalatable,
- The source texts that might be copied from can be put together manually by the user into a few files (i.e. the Internet is not automatically searched by the tool) and/or the only concern is people copying from each other, and
- Running a command-line tool is simple enough for the user.

`trusted` folder.

- The `equal` metric is quite fast at detecting copy-paste plagiarism of a few words.
- The `lev` metric is too slow for large datasets, but promises more fine-grained control over how different two phrases can be.

Download a binary from the Releases page. Currently only `x86_64-unknown-linux-gnu` targets are supported.

Alternatively, build from source:

1. Install the `rust` language toolchain (https://www.rust-lang.org/tools/install).
1. `git clone` this repository to a folder of your choice.
1. Run `cargo build --release` in that folder.
1. The `target/release` folder will contain the `plagiarism-basic` executable to be used.

Some setup is required:

1. Two folders need to be created anywhere, a "trusted" folder and an "untrusted" folder.
1. The "trusted" folder may contain any number of files in its top-level directory. Each file will be treated as a separate trusted source of text. This is where you might put the text of the top 10 Google search results, for example.
1. The "untrusted" folder may contain any number of files in its top-level directory. Each file will be treated as a separate untrusted source of text. This is where you would put each separate "submission" from a student, for example.
1. The files in both folders must contain only text that can be interpreted as UTF-8. The name of each file is used in the output of the program, so naming the files appropriately is a good idea.
1. After these steps are done, the `plagiarism-basic` executable can be run, with the paths to these folders specified in its arguments.

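To make the folder layout concrete, here is a minimal Rust sketch of how the top-level files of one folder could be read into named sources. It is not taken from the tool's source: the function name `load_sources` is hypothetical, and skipping subdirectories and sorting by file name are assumptions made for the illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads every regular file in the top level of `dir` and returns
/// `(file name, contents)` pairs, sorted by file name.
/// Hypothetical helper: subdirectories are skipped (an assumption).
fn load_sources(dir: &Path) -> io::Result<Vec<(String, String)>> {
    let mut sources = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        if !path.is_file() {
            continue; // only top-level files are treated as sources
        }
        let name = entry.file_name().to_string_lossy().into_owned();
        // `read_to_string` fails with `InvalidData` if the file is not valid UTF-8.
        let text = fs::read_to_string(&path)?;
        sources.push((name, text));
    }
    sources.sort(); // deterministic order for reporting
    Ok(sources)
}
```

Calling `load_sources` once for the "trusted" folder and once for the "untrusted" folder would give two lists of named texts to compare. A file that is not valid UTF-8 makes the whole call return an error, which matches the requirement on file contents stated above.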
```
$ ./plagiarism-basic -h
Basic Plagiarism Checker v0.1
Sriram Sami (@frizensami on GitHub)
Checks for plagiarism using very basic metrics between different text files

USAGE:
    plagiarism-basic -m

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -m
```

Informally, two strings with the same number of words are considered to be plagiarised if they are long enough and "similar enough" by a chosen metric.

Formally:
- Two separate strings (`s1` and `s2`) consisting of words (a word is a sequence of characters without a space) are considered plagiarised if:
  - Both have `l` words
    - Where `l` >= some user-chosen sensitivity value `n`
  - A metric `M` and similarity value `s` produce `M(s1, s2) <= s`
  - Subject to pre-processing of
    - Removing CR + LF
    - Removing extra spaces (only one space between words)
    - Converting all letters to lowercase
    - Removing all non-alphanumeric characters
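
The pre-processing steps above can be sketched in Rust as follows. This is an illustration under stated assumptions, not the tool's actual code: the function name `normalize` is hypothetical, line breaks are treated as word separators rather than deleted outright, and "alphanumeric" is taken in the Unicode sense used by `char::is_alphanumeric`.

```rust
/// Applies the listed pre-processing steps and returns the resulting words.
/// Hypothetical helper; CR and LF are treated as word separators (an assumption).
fn normalize(text: &str) -> Vec<String> {
    text.split_whitespace() // drops CR, LF and runs of extra spaces
        .map(|word| {
            word.chars()
                .filter(|c| c.is_alphanumeric()) // remove non-alphanumeric characters
                .flat_map(|c| c.to_lowercase()) // convert letters to lowercase
                .collect::<String>()
        })
        .filter(|word| !word.is_empty()) // a token such as "--" becomes empty
        .collect()
}
```

For example, `normalize("Hello,  World!\r\nIt's FINE.")` yields the words `hello`, `world`, `its` and `fine`. Returning the words as a list makes it easy to check whether a string has at least `n` words before it is compared.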
- `n` is a user-chosen value to indicate how many words a string needs to have before being considered for plagiarism
- `s` is a user-chosen value to indicate how similar the strings have to be before being considered for plagiarism
- `M` is the metric used to evaluate the strings for similarity. It can be one of the following:
  - `equal`: checks if the strings are equal, ignores the `s` value. Uses hashed set intersections, very fast.
  - `lev`: uses the Levenshtein distance between the words, uses the `s` value. Compares between all combinations of string fragments, very slow at the moment.
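
The difference in cost between the two metrics can be seen in a rough Rust sketch. It is not the tool's implementation: the names `fragments`, `equal_matches`, `levenshtein` and `lev_matches` are hypothetical, the inputs are assumed to be already pre-processed words, fragments are assumed to be runs of exactly `n` consecutive words, and the Levenshtein distance is assumed to be counted in characters over the joined fragment.

```rust
use std::collections::HashSet;

/// Every run of `n` consecutive words, joined by single spaces.
fn fragments(words: &[&str], n: usize) -> HashSet<String> {
    if n == 0 {
        return HashSet::new(); // `windows(0)` would panic
    }
    words.windows(n).map(|w| w.join(" ")).collect()
}

/// `equal` sketch: fragments present in both texts, found by set intersection.
fn equal_matches(a: &[&str], b: &[&str], n: usize) -> HashSet<String> {
    let fa = fragments(a, n);
    let fb = fragments(b, n);
    fa.intersection(&fb).cloned().collect()
}

/// Levenshtein distance between two strings, counted in characters.
fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j]
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// `lev` sketch: every pair of fragments whose distance is at most `s`.
fn lev_matches(a: &[&str], b: &[&str], n: usize, s: usize) -> Vec<(String, String)> {
    let fa = fragments(a, n);
    let fb = fragments(b, n);
    let mut hits = Vec::new();
    for x in &fa {
        for y in &fb {
            if levenshtein(x, y) <= s {
                hits.push((x.clone(), y.clone()));
            }
        }
    }
    hits.sort(); // deterministic order for reporting
    hits
}
```

In this sketch `equal_matches` does one hash lookup per fragment, while `lev_matches` runs a quadratic-time distance computation for every pair of fragments, which is consistent with `equal` being very fast and `lev` being very slow. With `s` set to 0, `lev_matches` reports the same fragments that `equal_matches` finds.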