Create dictionaries by scraping webpages.
Similar tools (some features inspired by them):
- CeWL
- CeWLeR
```bash
# Build with nix and run the result.
nix build .#
./result/bin/wdict --help

# Run directly with nix.
nix run .# -- --help

# Run without cloning the repository.
nix run github:pyqlsa/wdict -- --help

# Install from crates.io.
cargo install wdict

# Develop in a nix shell and build with cargo.
nix develop .#
cargo build
./target/debug/wdict --help

# Build a release binary.
cargo build --release
./target/release/wdict --help
```
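Assuming `wdict` is on your `PATH` (for example after `cargo install wdict`), a minimal quick-start run uses only the defaults listed in the help output below:

```bash
# Crawl the default start URL one level deep and write the
# resulting word list to ./wdict.txt.
wdict

# Inspect the first few extracted words.
head wdict.txt
```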
```bash
Create dictionaries by scraping webpages.

Usage: wdict [OPTIONS]

Options:
  -u, --url <URL>
          [default: https://www.quicksilver899.com/Tolkien/Tolkien_Dictionary.html]

  -d, --depth <DEPTH>
          [default: 1]

  -m, --min-word-length <MIN_WORD_LENGTH>
          [default: 3]

  -f, --file <FILE>
          [default: wdict.txt]

      --filter <FILTER>
          Filter strategy for words

          [default: none]

          Possible values:
          - deunicode: Transform unicode according to https://github.com/kornelski/deunicode
          - decancer:  Transform unicode according to https://github.com/null8626/decancer
          - none:      Leave the string as-is

      --site <SITE>
          Site policy for discovered links

          [default: same]

          Possible values:
          - same:      Allow crawling links only if the domain exactly matches
          - subdomain: Allow crawling links if they are the same domain or subdomains
          - all:       Allow crawling all links, regardless of domain

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
```
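For a more tailored run, the flags above compose as you'd expect; this sketch uses `https://example.com` purely as a placeholder start URL:

```bash
# Crawl two levels deep, follow links into subdomains, normalize
# unicode with the deunicode filter, keep only words of at least
# 5 characters, and write the dictionary to words.txt.
wdict --url https://example.com \
      --depth 2 \
      --site subdomain \
      --filter deunicode \
      --min-word-length 5 \
      --file words.txt
```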
A list of ideas for future work:
- archive mode to crawl and save pages locally
- build dictionaries from local (archived) pages
- support different mime types
- add a collection of pre-canned 'themed' urls
- smarter/togglable parsing of html tags (e.g. to ignore js and css)
- more word filtering options
Licensed under either of

- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.