NLPRule is a library for rule-based grammatical error correction written in pure Rust with bindings for Python. Rules are sourced from LanguageTool.
```python
from nlprule import Tokenizer, Rules, SplitOn

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer, SplitOn([".", "?", "!"]))

rules.correct("He wants that you send him an email.")
# returns: 'He wants you to send him an email.'

rules.correct("Thanks for your’s and Lucy’s help.")
rules.correct("I can due his homework.")
# each call returns the corrected text as a string

suggestions = rules.suggest("She was not been here since Monday.")
for s in suggestions:
    print(s.start, s.end, s.text, s.source, s.message)
```
My goal with this library was to create a fast, lightweight engine to run natural language rules without relying on the JVM (and its speed and memory implications) and without the extra functionality LanguageTool provides, such as spellchecking and n-gram based error detection.
NLPRule currently supports English and German.
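Loading the German models works the same way as the English ones shown above; a minimal sketch:

```python
from nlprule import Tokenizer, Rules, SplitOn

# The same API works for German via the "de" language code;
# the models are downloaded on first use, then cached.
tokenizer = Tokenizer.load("de")
rules = Rules.load("de", tokenizer, SplitOn([".", "?", "!"]))
```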
|         | \|Disambiguation rules\| | \|Grammar rules\| | LT version |
|---------|--------------------------|-------------------|------------|
| English | 843 (100%)               | 3725 (~ 85%)      | 5.2        |
| German  | 486 (100%)               | 2970 (~ 90%)      | 5.2        |
NLPRule is focused on speed.
```python
In [1]: from nlprule import Tokenizer, Rules, SplitOn
   ...:
   ...: tokenizer = Tokenizer.load("en")
   ...: rules = Rules.load("en", tokenizer, SplitOn([".", "?", "!"]))

In [2]: %timeit rules.correct("He wants that you send him an email.")
783 µs ± 6.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Measured on an Intel(R) Core(TM) i5-8600K CPU @ 3.60GHz.
1. Install NLPRule via pip:

```bash
pip install nlprule
```
2. Create a `tokenizer` and `rules` object
```python
from nlprule import Tokenizer, Rules

tokenizer = Tokenizer.load("en")  # or 'de'
rules = Rules.load("en", tokenizer)  # or 'de'
```

The objects will be downloaded the first time, then cached.
3a. Correct your text
```python
rules.correct_sentence("He wants that you send him an email.")
# returns: 'He wants you to send him an email.'
```

`correct_sentence` expects a single sentence as input. If you want to correct an arbitrary text, pass a `sentence_splitter` at initialization. A sentence splitter can be any function that takes a list of texts as input and returns a list of lists of sentences. A splitter that splits on fixed characters is included in NLPRule for convenience:

```python
from nlprule import SplitOn

rules = Rules.load("en", tokenizer, SplitOn([".", "?", "!"]))
```

Pro tip: You can use [NNSplit](https://github.com/bminixhofer/nnsplit) for more robust sentence segmentation:

```python
from nnsplit import NNSplit

splitter = NNSplit.load("en")
rules = Rules.load(
    "en",
    tokenizer,
    lambda texts: [[str(s) for s in text] for text in splitter.split(texts)],
)
```

If a sentence splitter is set, you can call `.correct`:

```python
rules.correct("He wants that you send him an email. She was not been here since Monday.")
# returns: 'He wants you to send him an email. She was not here since Monday.'
```
3b. Get suggestions
```python
suggestions = rules.suggest_sentence("She was not been here since Monday.")
for s in suggestions:
    print(s.start, s.end, s.text, s.source, s.message)
# prints:
# 4 16 ['was not', 'has not been'] WAS_BEEN.1 Did you mean was not or has not been?
```

`.suggest_sentence` also has a multi-sentence counterpart in `.suggest`.
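Since a suggestion carries its span and candidate replacements, you can also apply it yourself. Below is a minimal sketch using only the `start`, `end` and `text` attributes shown above; `apply_first_suggestion` is a hypothetical helper, and picking the first candidate is an arbitrary choice for illustration:

```python
def apply_first_suggestion(text, suggestion):
    # Hypothetical helper: replace the span [start, end) with the first
    # candidate in `suggestion.text`. Offsets are valid for one suggestion;
    # applying several in sequence would shift them.
    return text[: suggestion.start] + suggestion.text[0] + text[suggestion.end :]

text = "She was not been here since Monday."
for s in rules.suggest_sentence(text):
    print(apply_first_suggestion(text, s))
# prints: She was not here since Monday.
```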
Bonus: Analyze text with the `tokenizer`
NLPRule does rule- and dictionary-based part-of-speech tagging and lemmatization, as well as chunking with a model ported from [OpenNLP](https://opennlp.apache.org/). It's not as fancy as spaCy, but it could be faster, and it had to be implemented anyway to apply the rules, so I thought I might as well add a public API:

```python
tokens = tokenizer.tokenize_sentence("She was not been here since Monday.")

for token in tokens:
    print(token.text, token.span, token.tags, token.lemmas, token.chunks)
# prints:
#  (0, 0) ['SENT_START'] [] []
# She (0, 3) ['PRP'] ['She', 'she'] ['B-NP-singular', 'E-NP-singular']
# was (4, 7) ['VBD'] ['be', 'was'] ['B-VP']
# not (8, 11) ['RB'] ['not'] ['I-VP']
# been (12, 16) ['VBN'] ['be', 'been'] ['I-VP']
# here (17, 21) ['RB'] ['here'] ['B-ADVP']
# since (22, 27) ['CC', 'IN', 'RB'] ['since'] ['B-PP']
# Monday (28, 34) ['NNP'] ['Monday'] ['B-NP-singular', 'E-NP-singular']
# . (34, 35) ['.', 'PCT', 'SENT_END'] ['.'] ['O']
```
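As a quick example of working with this output, the sketch below collects verb tokens via the `tags` attribute; the Penn Treebank style `VB` prefix check is an assumption based on the tags shown above:

```python
tokens = tokenizer.tokenize_sentence("She was not been here since Monday.")

# Keep tokens whose tag list contains a verb tag; the tags are Penn
# Treebank style, so a "VB" prefix matches VB, VBD, VBN and so on.
verbs = [t.text for t in tokens if any(tag.startswith("VB") for tag in t.tags)]
print(verbs)
# prints: ['was', 'been']
```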
NLPRule is approximately 1.7x - 2.8x faster than LanguageTool. See the benchmark issue for details.
|         | NLPRule time | LanguageTool time |
|---------|--------------|-------------------|
| English | 1            | 1.7 - 2.0         |
| German  | 1            | 2.4 - 2.8         |
NLPRule is currently pretty bare-bones in terms of API and documentation. I will definitely fix bugs, but adding new functionality (especially new languages) and improving the API / docs will depend on interest from the community.
Fixing discrepancies between NLPRule and LanguageTool behaviour will have high priority if any are found.
All credit for the rule content goes to LanguageTool, who have made a Herculean effort to create high-quality grammar correction rules. This library is just a parser and reimplementation of the rule logic.
NLPRule is licensed under the MIT license.