opus_tools: Miscellaneous tools for working with the OPUS parallel corpus


These are small utilities for working with the [OPUS][] parallel corpus, which is commonly used for machine translation research. To install:

```sh
curl https://sh.rustup.rs -sSf | sh
cargo install opus_tools
```

opusraw2txt: Extract raw text from a raw, monolingual file

Download the file ca.raw.tar.gz from the right-hand column of the subtitle page and run:

```sh
opusraw2txt ca.raw.tar.gz
```

This prints a very large number of sentences to standard output as UTF-8 text, ready for further processing.

If you want to process an entire directory of files, you can install GNU parallel and szip, and run:

```sh
ls *.raw.tar.gz | sed 's/\.raw\.tar\.gz$//' | parallel --joblog out.log 'opusraw2txt {}.raw.tar.gz | szip > {}.sz'
```
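If GNU parallel is not available, the same conversion can be sketched as a plain sequential loop. This is only an illustration, not part of opus_tools: the function names `corpus_stem` and `convert_all` are hypothetical, and it assumes `opusraw2txt` and `szip` are already on your `PATH`.

```sh
#!/bin/sh
# Sequential sketch of the parallel pipeline above.
# Assumption: opusraw2txt and szip are installed and on PATH.

# Strip the ".raw.tar.gz" suffix, like the sed expression does.
# (corpus_stem is a hypothetical helper name.)
corpus_stem() {
    printf '%s\n' "${1%.raw.tar.gz}"
}

# Convert every *.raw.tar.gz archive in the current directory, one at a time.
# (convert_all is a hypothetical helper name.)
convert_all() {
    for archive in *.raw.tar.gz; do
        # Skip the unexpanded glob pattern when no archives are present.
        [ -e "$archive" ] || continue
        stem=$(corpus_stem "$archive")
        opusraw2txt "$archive" | szip > "$stem.sz"
    done
}
```

After sourcing the file, running `convert_all` in the directory that holds the downloads writes one `.sz` file per archive. It will be slower than the parallel version because only one archive is processed at a time, and it does not produce a job log.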

Contributing

Your feedback and contributions are welcome! For more information, see the subtitles-rs project.