wdict

Create dictionaries by scraping webpages.

Similar tools (some features inspired by them):

- CeWL
- CeWLeR

Take it for a spin

```bash
# build with nix and run the result
nix build .#
./result/bin/wdict --help

# just run it directly
nix run .# -- --help

# run it without cloning
nix run github:pyqlsa/wdict -- --help

# install from crates.io
# (NixOS users may need to do this within a dev shell)
cargo install wdict

# using a dev shell
nix develop .#
cargo build
./target/debug/wdict --help

# ...or a release version
cargo build --release
./target/release/wdict --help
```

Usage

```bash
Create dictionaries by scraping webpages.

Usage: wdict [OPTIONS] <--url <URL>|--theme <THEME>>

Options:
  -u, --url <URL>
          URL to start crawling from

      --theme <THEME>
          Pre-canned theme URLs to start crawling from (for fun, demoing features, and sparking new ideas)

          Possible values:
          - star-wars: Star Wars themed URL <https://www.starwars.com/databank>
          - tolkien:   Tolkien themed URL <https://www.quicksilver899.com/Tolkien/Tolkien_Dictionary.html>
          - witcher:   Witcher themed URL <https://witcher.fandom.com/wiki/Elder_Speech>

  -d, --depth <DEPTH>
          Limit the depth of crawling links

          [default: 1]

  -m, --min-word-length <MIN_WORD_LENGTH>
          Only save words greater than or equal to this value

          [default: 3]

  -r, --req-per-sec <REQ_PER_SEC>
          Number of requests to make per second

          [default: 20]

  -f, --file <FILE>
          File to write dictionary to (will be overwritten if it already exists)

          [default: wdict.txt]

      --filters <FILTERS>...
          Filter strategy for words; multiple can be specified

          [default: none]

          Possible values:
          - deunicode:   Transform unicode according to <https://github.com/kornelski/deunicode>
          - decancer:    Transform unicode according to <https://github.com/null8626/decancer>
          - all-numbers: Ignore words that consist of all numbers
          - any-numbers: Ignore words that contain any number
          - none:        Leave the word as-is

      --site <SITE>
          Site policy for discovered links

          [default: same]

          Possible values:
          - same:      Allow crawling links, only if the domain exactly matches
          - subdomain: Allow crawling links if they are the same domain or subdomains
          - all:       Allow crawling all links, regardless of domain

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
```
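
To make the `--site`, `--min-word-length`, and numeric `--filters` descriptions above more concrete, here is a rough sketch of how those rules could be expressed. It is not taken from the wdict source: `SitePolicy`, `host_allowed`, and `keep_word` are hypothetical names, and it assumes that "numbers" means ASCII digits and that word length is counted in characters.

```rust
// Illustrative only: these types and functions are hypothetical and are
// not part of the wdict crate's (unstable) library interface.

/// Mirrors the three documented `--site` values.
#[derive(Clone, Copy, Debug)]
enum SitePolicy {
    Same,
    Subdomain,
    All,
}

/// Decide whether a discovered link may be crawled, given the host of the
/// start URL and the host of the link.
fn host_allowed(policy: SitePolicy, start_host: &str, link_host: &str) -> bool {
    match policy {
        // Only an exact host match is allowed.
        SitePolicy::Same => link_host == start_host,
        // The same host, or any host ending in ".<start_host>".
        SitePolicy::Subdomain => {
            link_host == start_host || link_host.ends_with(&format!(".{start_host}"))
        }
        // Every link is allowed, regardless of host.
        SitePolicy::All => true,
    }
}

/// Decide whether a word is kept, combining the minimum length check with
/// the `all-numbers` and `any-numbers` filters.
/// Assumption: "numbers" are ASCII digits; length is counted in characters.
fn keep_word(word: &str, min_len: usize, drop_all_numbers: bool, drop_any_numbers: bool) -> bool {
    if word.chars().count() < min_len {
        return false;
    }
    if drop_all_numbers && word.chars().all(|c| c.is_ascii_digit()) {
        return false;
    }
    if drop_any_numbers && word.chars().any(|c| c.is_ascii_digit()) {
        return false;
    }
    true
}
```

Under these assumptions, `host_allowed(SitePolicy::Subdomain, "example.com", "docs.example.com")` is true while the same call with `SitePolicy::Same` is false, and `keep_word("r2d2", 3, true, false)` keeps the word because it is not made up entirely of digits.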

Lib

This crate exposes a library, but for the time being, the interfaces should be considered unstable.

TODO

A list of ideas for future work:

- archive mode to crawl and save pages locally
- build dictionaries from local (archived) pages
- support different MIME types
- smarter/togglable parsing of HTML tags (e.g. to ignore JS and CSS)
- more word filtering options
- better async

License

Licensed under either of

- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.