opus_tools
: Miscellaneous tools for working with the OPUS parallel corpus

These are small utilities for working with the [OPUS][] parallel corpus, which is commonly used for machine translation research. To install:

    curl https://sh.rustup.rs -sSf | sh
    cargo install opus_tools

opusraw2txt
: Extract raw text from a raw, monolingual file

Download the file ca.raw.tar.gz from the right-hand column of the subtitle page and run:

    opusraw2txt ca.raw.tar.gz

This prints a large number of sentences to standard output as UTF-8 text, ready for further processing.
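
As one example of further processing, the output can be piped into ordinary text tools. The sketch below counts non-empty lines. It assumes one sentence per output line, which is not stated above, and `count_sentences` is a hypothetical helper name used only for illustration.

```sh
# Hypothetical helper: count the non-empty lines read from standard input.
# Assumption: each sentence arrives on its own line.
count_sentences() {
    grep -c -v '^$'
}

# Usage (requires opusraw2txt and the downloaded archive):
#   opusraw2txt ca.raw.tar.gz | count_sentences
```

The same pattern works with any filter that reads standard input, such as `head` to preview the first few sentences.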
If you want to process an entire directory of files, you can install GNU parallel and szip, then run:

    ls *.raw.tar.gz |
      sed 's/\.raw\.tar\.gz$//' |
      parallel --joblog out.log 'opusraw2txt {}.raw.tar.gz | szip > {}.sz'

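
If GNU parallel is not available, a plain loop can do the same work one file at a time. This is a minimal sketch rather than part of the tool: `sz_name` and `convert_all` are hypothetical helper names, and it assumes `opusraw2txt` and `szip` are both on your `PATH`.

```sh
# Hypothetical helper: map "ca.raw.tar.gz" to "ca.sz".
sz_name() {
    printf '%s\n' "${1%.raw.tar.gz}.sz"
}

# Hypothetical helper: convert every *.raw.tar.gz file in the current directory.
convert_all() {
    for f in *.raw.tar.gz; do
        # Skip the unexpanded pattern when no files match.
        [ -e "$f" ] || continue
        opusraw2txt "$f" | szip > "$(sz_name "$f")"
    done
}
```

Call `convert_all` from the directory that holds the archives. It runs sequentially and writes no job log, so the parallel version is likely to be faster on large collections.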
Your feedback and contributions are welcome! For more information, see the subtitles-rs project.