rga is a line-oriented search tool that allows you to look for a regex in a multitude of file types. It is a wrapper around the awesome [ripgrep] that enables it to search in pdf, docx, pptx, movie subtitles (mkv, mp4), sqlite, etc.
Say you have a large folder of papers or lecture slides, and you can't remember which one of them mentioned `LSTM`s. With rga, you can just run this:
rga "LSTM|GRU" collection/
[results]
and it will recursively search PDFs and pptx slides for the regex, even if some of them are zipped up.
You can do mostly the same thing with `pdfgrep -r`, but it will be much slower and you will miss content in other file types.
barchart
title: Searching in 20 pdfs with 100 slides each
subtitle: lower is better
data:
- pdfgrep: 123s
- rga (first run): 10.3s
- rga (subsequent runs): 0.1s
On the first run, rga is faster mostly because of multithreading. On subsequent runs (on the same files, but with any query), rga reuses the cached text extraction, so it skips the slow PDF parsing.
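The caching idea can be illustrated with a small shell sketch. This is not rga's implementation: `cached_extract`, `slow_extract`, the `CACHE_DIR` location, and the checksum-based cache key are all assumptions made for the example. It only shows why a second run with a different query can skip extraction entirely.

```bash
# Hypothetical cache directory for this sketch (not rga's real cache layout).
CACHE_DIR=${CACHE_DIR:-"${TMPDIR:-/tmp}/extract-cache-demo"}

# Stand-in for a slow extractor such as PDF parsing; here it just copies the file.
slow_extract() {
    cat "$1"
}

# Print the extracted text of a file, reusing a cached copy when one exists.
cached_extract() {
    file=$1
    mkdir -p "$CACHE_DIR"
    # Key the cache on the file contents (CRC and size), so the query never matters.
    key=$(cksum < "$file" | tr ' ' '-')
    if [ ! -f "$CACHE_DIR/$key" ]; then
        slow_extract "$file" > "$CACHE_DIR/$key"
    fi
    cat "$CACHE_DIR/$key"
}
```

With these definitions, `cached_extract notes.txt | grep -E 'LSTM|GRU'` pays the extraction cost once, and any later search over the same file reads only the cached text.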
rga should compile with stable Rust. To install it, simply run (your OS's equivalent of)
```bash
apt install build-essential pandoc poppler-utils
cargo install ripgrep_all
rga --help
```
`rga` simply runs ripgrep (`rg`) with some options set, especially `--pre=rga-preproc` and `--pre-glob`.
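To make the relationship concrete, here is a rough sketch that prints (without running) an `rg` command of the kind described above. The helper name `print_rg_equivalent` is hypothetical, and the glob is an assumption chosen for illustration; the glob rga actually passes depends on its adapter list and is not reproduced here.

```bash
# Hypothetical helper for illustration: print (do not run) an rg command
# roughly corresponding to `rga PATTERN DIR`.
print_rg_equivalent() {
    pattern=$1
    dir=$2
    # Assumed glob; rga's real --pre-glob value covers its own adapter list.
    glob='*.{pdf,docx,pptx}'
    printf "rg --pre=rga-preproc --pre-glob '%s' '%s' '%s'\n" "$glob" "$pattern" "$dir"
}

print_rg_equivalent 'LSTM|GRU' collection/
```

In ripgrep, `--pre` names a preprocessor command that is run on each file before searching, and `--pre-glob` restricts that preprocessor to matching files, so plain text files are still searched directly.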
Some rga adapters run external binaries
To enable debug logging:
export RUST_LOG=debug
export RUST_BACKTRACE=1
Also remember to disable caching with `--rga-no-cache` or clear the cache in `~/.cache/rga` to debug the adapters.
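For a one-off debugging run, the settings above can be combined into a single invocation. The wrapper name `rga_debug` is hypothetical; the environment variables and the `--rga-no-cache` flag are the ones mentioned above.

```bash
# Hypothetical wrapper: run one rga search with debug logging and backtraces
# enabled and the cache disabled, without exporting anything into the shell.
rga_debug() {
    RUST_LOG=debug RUST_BACKTRACE=1 rga --rga-no-cache "$@"
}
```

For example, `rga_debug 'LSTM|GRU' collection/` runs the earlier search with logging on, and the variables apply only to that one command.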