lrtc: Low-resource text classification

This crate is a Rust implementation of the low-resource text classification method introduced in Jiang et al. (2023). Unlike Jiang et al., this implementation currently relies on zstd, rather than gzip, as a compression utility.

rust let training = vec!["some normal sentence".to_string(), "godzilla ate mars in June".into(),]; let training_labels = vec!["normal".to_string(), "godzilla".into(),]; let queries = vec!["another normal sentence".to_string(), "godzilla eats marshes in August".into(),]; // Using a compression level of 3, and 1 nearest neighbor: println!("{:?}", classify(training, training_labels, queries, 3i32, 1usize));

This method seems to perform decently well for relatively sparse training sets, and does not require the same amount of tuning as neural net methods.

```rust use csv::Reader; use lrtc::classify; use std::fs::File; use std::vec::Vec; let imdb = File::open("./data/imdb.csv").unwrap(); let mut reader = Reader::from_reader(imdb);

let mut content = Vec::withcapacity(50000); let mut label = Vec::withcapacity(50000); for record in reader.records() { content.push(record.asref().unwrap()[0].tostring()); label.push(record.unwrap()[1].to_string()); }

let predictions = classify( content[0..10000].tovec(), label[0..10000].tovec(), content[40000..50000].tovec(), 3i32, 5usize, ); let correct = predictions .iter() .zip(label[40000..50000].tovec().iter()) .filter(|(a, b)| a == b) .count();

println!("{}", correct as f64 / 10000f64) // 0.701 ```

References

Zhiying Jiang, Matthew Yang, Mikhail Tsirlin, Raphael Tang, Yiqin Dai, and Jimmy Lin. 2023. “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6810–6828, Toronto, Canada. Association for Computational Linguistics. https://aclanthology.org/2023.findings-acl.426