opus_tools
: Miscellaneous tools for working with the OPUS parallel corpus

These are small utilities for working with the [OPUS][] parallel corpus, which is commonly used for machine translation research. To install:
```sh
curl https://sh.rustup.rs -sSf | sh
cargo install opus_tools
```
opusraw2txt
: Extract raw text from a raw, monolingual file

Download the file ca.raw.tar.gz from the right-hand column of the subtitle page and run:
```sh
opusraw2txt ca.raw.tar.gz
```
This prints a very large number of sentences to standard output, encoded as UTF-8, for further processing.
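As a rough sketch of what "further processing" might look like, the snippet below defines a small filter that drops blank lines and exact duplicates. The function name `clean_sentences` is hypothetical and not part of opus_tools, and the filter assumes the output has one sentence per line, which this README does not state.

```sh
# Hypothetical helper, not part of opus_tools.
# Assumes one sentence per line on standard input.
clean_sentences() {
    # NF skips empty or whitespace-only lines;
    # !seen[$0]++ keeps only the first occurrence of each line, in order.
    awk 'NF && !seen[$0]++'
}

# Possible usage (left commented out so the snippet runs without the tool):
# opusraw2txt ca.raw.tar.gz | clean_sentences > ca.txt
```

Because `awk` keeps every distinct line in memory, this approach may need a different strategy (for example `sort -u`, which does not preserve order) on very large files.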
Your feedback and contributions are welcome! For more information, see the subtitles-rs project.