Rust-tokenizer is a drop-in replacement for the tokenization methods from the Transformers library.

Rust-tokenizer requires a Rust nightly build in order to use the Python API. Building from source involves running

```shell
python setup.py install
```

in the repository. This will compile the Rust library and install the Python API. The `/tests` folder includes benchmark and integration tests. The library is fully unit tested at the Rust level.
```python
import torch

from rust_transformers import PyBertTokenizer
from transformers.modeling_bert import BertForSequenceClassification

rust_tokenizer = PyBertTokenizer('bert-base-uncased-vocab.txt')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=False).cuda()
model = model.eval()

sentence = '''For instance, on the planet Earth, man had always assumed that he was more intelligent than dolphins because he had achieved so much—the wheel, New York, wars and so on—whilst all the dolphins had ever done was muck about in the water having a good time. But conversely, the dolphins had always believed that they were far more intelligent than man—for precisely the same reasons.'''

features = rust_tokenizer.encode(sentence, max_len=128, truncation_strategy='only_first', stride=0)
input_ids = torch.tensor([f.token_ids for f in features], dtype=torch.long).cuda()

with torch.no_grad():
    output = model(input_ids)[0].cpu().numpy()
```
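The `output` array holds the model's raw classification logits. As a small, hedged sketch (plain NumPy; the function name and example values are illustrative, not part of the library), such logits can be turned into probabilities and a predicted label:

```python
import numpy as np

def logits_to_prediction(logits):
    """Softmax over the last axis, then argmax.

    `logits` is an array of shape (batch, num_labels), e.g. the
    `output` produced by the snippet above.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    return probs, probs.argmax(axis=-1)

# Illustrative logits for a two-label classifier
probs, labels = logits_to_prediction(np.array([[2.0, 0.5]]))
```

Here `labels` picks the highest-probability class per batch element; the softmax is only needed if calibrated probabilities are wanted, since argmax over logits gives the same label.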