axe splits the JSON data set available from PushShift into smaller JSON files.
At this time, the data can be split by the following keys:
When the data is split, a JSON file is created for each unique value of the chosen key, so splitting on subreddit creates one JSON file per subreddit.
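The grouping idea can be sketched in a few lines of Rust. This is an illustrative sketch only, not the actual implementation in axe: it assumes the input is newline-delimited JSON (one object per line), and the helpers `extract_string_field` and `group_by_key` are hypothetical names introduced here. The field extraction is deliberately naive; a real tool would use a proper JSON parser.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: naively pulls the string value of `key` out of one
/// line of JSON. It only handles `"key":"value"` pairs whose value contains
/// no escaped quotes, and it does not distinguish nested objects.
fn extract_string_field(line: &str, key: &str) -> Option<String> {
    let needle = format!("\"{}\":", key);
    let start = line.find(&needle)? + needle.len();
    // Skip optional whitespace after the colon, then require an opening quote.
    let rest = line[start..].trim_start().strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(rest[..end].to_string())
}

/// Hypothetical helper: groups input lines by the value found under `key`.
/// Each map entry corresponds to what would become one output file.
/// Lines that lack the key are skipped in this sketch.
fn group_by_key<'a>(lines: &[&'a str], key: &str) -> BTreeMap<String, Vec<&'a str>> {
    let mut groups: BTreeMap<String, Vec<&'a str>> = BTreeMap::new();
    for &line in lines {
        if let Some(value) = extract_string_field(line, key) {
            groups.entry(value).or_default().push(line);
        }
    }
    groups
}
```

Calling `group_by_key(&lines, "subreddit")` on three records, two of which belong to the same subreddit, yields a map with two entries. Holding every group in memory is a simplification that would not scale to a full dump; streaming each record to its output file is the more likely approach for large inputs.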
Build the code
```shell script
~/dev/rust/axe (master)
abhijat $ cargo build --release
```
Run the code
```shell script
~/dev/rust/axe (master)
abhijat $ cargo run -- --input-path ~/Downloads/R --output-prefix ~/tmp/data-by-sub --split-on subreddit
Finished dev [unoptimized + debuginfo] target(s) in 0.02s
Running `target/debug/axe --input-path /home/abhijat/Downloads/R --output-prefix /home/abhijat/tmp/data-by-sub --split-on subreddit`
...
```
The files will be present in `~/tmp/data-by-sub` after the above run is complete.
```shell script
~/dev/rust/axe (master)
abhijat $ cargo run -- --help
Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Running `target/debug/axe --help`
axe 0.1.0
A utility to split a reddit dataset into individual JSON files

USAGE:
    axe --input-path

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -i, --input-path
```