Create dictionaries by scraping webpages.
Similar tools (some features inspired by them):

- CeWL
- CeWLeR
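```bash
# build with nix, then run the result
nix build .#
./result/bin/wdict --help

# run directly with nix (local checkout or straight from GitHub)
nix run .# -- --help
nix run github:pyqlsa/wdict -- --help

# install with cargo
cargo install wdict

# build from source inside the nix dev shell
nix develop .#
cargo build
./target/debug/wdict --help

# release build
cargo build --release
./target/release/wdict --help
```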
```bash
Create dictionaries by scraping webpages.

Usage: wdict [OPTIONS] <--url

Options:
  -u, --url

      --theme <THEME>
          Pre-canned theme URLs to start crawling from (for fun, demoing features, and sparking new ideas)

          Possible values:
          - star-wars: Star Wars themed URL <https://www.starwars.com/databank>
          - tolkien:   Tolkien themed URL <https://www.quicksilver899.com/Tolkien/Tolkien_Dictionary.html>
          - witcher:   Witcher themed URL <https://witcher.fandom.com/wiki/Elder_Speech>

  -d, --depth
          [default: 1]

  -m, --min-word-length
          [default: 3]

  -r, --req-per-sec
          [default: 20]

  -f, --file
          [default: wdict.txt]

      --filters <FILTERS>...
          Filter strategy for words; multiple can be specified

          [default: none]

          Possible values:
          - deunicode:   Transform unicode according to <https://github.com/kornelski/deunicode>
          - decancer:    Transform unicode according to <https://github.com/null8626/decancer>
          - all-numbers: Ignore words that consist of all numbers
          - any-numbers: Ignore words that contain any number
          - none:        Leave the word as-is

      --site <SITE>
          Site policy for discovered links

          [default: same]

          Possible values:
          - same:      Allow crawling links, only if the domain exactly matches
          - subdomain: Allow crawling links if they are the same domain or subdomains
          - all:       Allow crawling all links, regardless of domain

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
```
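To make the `--site`, `--min-word-length`, and number-related `--filters` options in the help output more concrete, here is a minimal Rust sketch of how such checks might behave. It is not wdict's implementation or its library API: `SitePolicy`, `WordFilter`, `host_allowed`, and `keep_word` are hypothetical names chosen for illustration. It also assumes bare host names rather than full URLs, and that word length is counted in characters.

```rust
/// Hypothetical mirror of the `--site` values; not wdict's real type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SitePolicy {
    Same,
    Subdomain,
    All,
}

/// Decide whether a discovered link's host may be crawled, given the host
/// of the starting URL. Hosts are compared case-insensitively.
pub fn host_allowed(policy: SitePolicy, start_host: &str, link_host: &str) -> bool {
    let start = start_host.to_ascii_lowercase();
    let link = link_host.to_ascii_lowercase();
    match policy {
        // Only the exact same host.
        SitePolicy::Same => link == start,
        // The same host, or any host ending in ".<start>".
        SitePolicy::Subdomain => link == start || link.ends_with(&format!(".{}", start)),
        // Anything goes.
        SitePolicy::All => true,
    }
}

/// Hypothetical mirror of the number-related `--filters` values.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WordFilter {
    AllNumbers,
    AnyNumbers,
    None,
}

/// Return true if `word` should end up in the dictionary.
/// Assumption: length is measured in characters, not bytes.
pub fn keep_word(word: &str, min_len: usize, filters: &[WordFilter]) -> bool {
    if word.chars().count() < min_len {
        return false;
    }
    // A word is kept only if every requested filter accepts it.
    filters.iter().all(|f| match f {
        WordFilter::AllNumbers => !word.chars().all(|c| c.is_numeric()),
        WordFilter::AnyNumbers => !word.chars().any(|c| c.is_numeric()),
        WordFilter::None => true,
    })
}
```

Under these assumptions, `host_allowed(SitePolicy::Subdomain, "example.com", "blog.example.com")` accepts the link while `SitePolicy::Same` rejects it, and `keep_word("r2d2", 3, &[WordFilter::AllNumbers])` keeps the word while `WordFilter::AnyNumbers` drops it. The leading dot in the subdomain check is what keeps a look-alike host such as `notexample.com` from matching `example.com`.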
This crate exposes a library, but for the time being, the interfaces should be considered unstable.
A list of ideas for future work:

- archive mode to crawl and save pages locally
- build dictionaries from local (archived) pages
- support different mime types
- smarter/togglable parsing of html tags (e.g. to ignore js and css)
- more word filtering options
- better async
Licensed under either of

- Apache License, Version 2.0
- MIT license

at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.