wdict

Create dictionaries by scraping webpages.

Similar tools (some features inspired by them):
- CeWL
- CeWLeR

Take it for a spin

```bash
# build with nix and run the result
nix build .#
./result/bin/wdict --help

# just run it directly
nix run .# -- --help

# run it without cloning
nix run github:pyqlsa/wdict -- --help

# install from crates.io
# (NixOS users may need to do this within a dev shell)
cargo install wdict

# using a dev shell
nix develop .#
cargo build
./target/debug/wdict --help

# ...or a release version
cargo build --release
./target/release/wdict --help
```

Usage

```bash
Create dictionaries by scraping webpages.

Usage: wdict [OPTIONS] <--url <URL>|--theme <THEME>>

Options:
-u, --url <URL>
      URL to start crawling from

  --theme <THEME>
      Pre-canned theme URLs to start crawling from (for fun, demoing features, and sparking new ideas)

      Possible values:
      - star-wars: Star Wars themed URL <https://www.starwars.com/databank>
      - tolkien:   Tolkien themed URL <https://www.quicksilver899.com/Tolkien/Tolkien_Dictionary.html>
      - witcher:   Witcher themed URL <https://witcher.fandom.com/wiki/Elder_Speech>

-d, --depth <DEPTH>
      Limit the depth of crawling links

      [default: 1]

-m, --min-word-length <MIN_WORD_LENGTH>
      Only save words greater than or equal to this value

      [default: 3]

-r, --req-per-sec <REQ_PER_SEC>
      Number of requests to make per second

      [default: 20]

-f, --file <FILE>
      File to write dictionary to (will be overwritten if it already exists)

      [default: wdict.txt]

  --filters <FILTERS>...
      Filter strategy for words; multiple can be specified

      [default: none]

      Possible values:
      - deunicode:   Transform unicode according to <https://github.com/kornelski/deunicode>
      - decancer:    Transform unicode according to <https://github.com/null8626/decancer>
      - all-numbers: Ignore words that consist of all numbers
      - any-numbers: Ignore words that contain any number
      - none:        Leave the word as-is

  --site <SITE>
      Site policy for discovered links

      [default: same]

      Possible values:
      - same:      Allow crawling links, only if the domain exactly matches
      - subdomain: Allow crawling links if they are the same domain or subdomains
      - all:       Allow crawling all links, regardless of domain

-h, --help
      Print help (see a summary with '-h')

-V, --version
      Print version

```
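
To make the `--min-word-length`, `--filters`, and `--site` descriptions above more concrete, here is a minimal Rust sketch of the behavior the help text describes. It is not the crate's actual code: the names `Filter`, `SitePolicy`, `keep_word`, and `allow_link` are hypothetical, the digit check assumes ASCII digits, and the `deunicode`/`decancer` transforms are left out entirely.

```rust
/// Hypothetical stand-ins for the `--filters` values that drop words.
#[derive(Clone, Copy)]
enum Filter {
    AllNumbers, // ignore words that consist of all numbers
    AnyNumbers, // ignore words that contain any number
    None,       // leave the word as-is
}

/// Hypothetical stand-ins for the `--site` values.
#[derive(Clone, Copy)]
enum SitePolicy {
    Same,      // host must match exactly
    Subdomain, // same host or any of its subdomains
    All,       // any host
}

/// Returns true if `word` meets the minimum length and passes every filter.
/// Assumption: "number" means an ASCII digit.
fn keep_word(word: &str, min_len: usize, filters: &[Filter]) -> bool {
    if word.chars().count() < min_len {
        return false;
    }
    filters.iter().all(|f| match f {
        Filter::AllNumbers => !word.chars().all(|c| c.is_ascii_digit()),
        Filter::AnyNumbers => !word.chars().any(|c| c.is_ascii_digit()),
        Filter::None => true,
    })
}

/// Returns true if a link on `link_host` may be followed when the crawl
/// started on `start_host`.
fn allow_link(start_host: &str, link_host: &str, policy: SitePolicy) -> bool {
    match policy {
        SitePolicy::Same => link_host == start_host,
        SitePolicy::Subdomain => {
            let suffix = format!(".{start_host}");
            link_host == start_host || link_host.ends_with(suffix.as_str())
        }
        SitePolicy::All => true,
    }
}

fn main() {
    let words = ["r2", "2024", "droid", "c3po"];

    // Minimum length 3 with the all-numbers filter.
    let kept: Vec<&str> = words
        .iter()
        .copied()
        .filter(|w| keep_word(w, 3, &[Filter::AllNumbers]))
        .collect();
    println!("{kept:?}"); // prints ["droid", "c3po"]

    // The `none` filter only leaves the length check in effect.
    println!("{}", keep_word("2024", 3, &[Filter::None])); // prints true

    // Site policies, using placeholder hosts.
    println!("{}", allow_link("example.com", "docs.example.com", SitePolicy::Same)); // prints false
    println!("{}", allow_link("example.com", "docs.example.com", SitePolicy::Subdomain)); // prints true
    println!("{}", allow_link("example.com", "other.org", SitePolicy::All)); // prints true
}
```

In this sketch the filters combine with a logical AND, so specifying both `all-numbers` and `any-numbers` behaves the same as `any-numbers` alone; whether wdict combines multiple filters this way is an assumption.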

Lib

This crate exposes a library, but for the time being, the interfaces should be considered unstable.

TODO

A list of ideas for future work:
- archive mode to crawl and save pages locally
- build dictionaries from local (archived) pages
- support different mime types
- smarter/togglable parsing of html tags (e.g. to ignore js and css)
- more word filtering options
- better async

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.