# wdict

Create dictionaries by scraping webpages.

Similar tools (some features inspired by them):
- CeWL
- CeWLeR

## Take it for a spin

```bash
# build with nix and run the result
nix build .#
./result/bin/wdict --help

# just run it directly
nix run .# -- --help

# run it without cloning
nix run github:pyqlsa/wdict -- --help

# install from crates.io
# (nixOS users may need to do this within a dev shell)
cargo install wdict

# using a dev shell
nix develop .#
cargo build
./target/debug/wdict --help

# ...or a release version
cargo build --release
./target/release/wdict --help
```

## Usage

```bash
Create dictionaries by scraping webpages.

Usage: wdict [OPTIONS]

Options:
  -u, --url <URL>
      URL to start crawling from

      [default: https://www.quicksilver899.com/Tolkien/Tolkien_Dictionary.html]

  -d, --depth <DEPTH>
      Limit the depth of crawling links

      [default: 1]

  -m, --min-word-length <MIN_WORD_LENGTH>
      Only save words greater than or equal to this value

      [default: 3]

  -r, --req-per-sec <REQ_PER_SEC>
      Number of requests to make per second

      [default: 20]

  -f, --file <FILE>
      File to write dictionary to (will be overwritten if it already exists)

      [default: wdict.txt]

  --filter <FILTER>
      Filter strategy for words

      [default: none]

      Possible values:
      - deunicode: Transform unicode according to <https://github.com/kornelski/deunicode>
      - decancer:  Transform unicode according to <https://github.com/null8626/decancer>
      - none:      Leave the string as-is

  --site <SITE>
      Site policy for discovered links

      [default: same]

      Possible values:
      - same:      Allow crawling links, only if the domain exactly matches
      - subdomain: Allow crawling links if they are the same domain or subdomains
      - all:       Allow crawling all links, regardless of domain

  -h, --help
      Print help (see a summary with '-h')

  -V, --version
      Print version

```
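Putting the options together, a typical invocation might look like the following. The starting URL here is a hypothetical placeholder; every flag used is documented in the help output above.

```shell
# crawl two levels deep from a (hypothetical) starting page, fold unicode
# with the deunicode filter, follow links into subdomains of the start
# domain, and write the resulting word list to words.txt
wdict -u https://example.com -d 2 --filter deunicode --site subdomain -f words.txt
```

Omitted options keep their defaults, so this run still drops words shorter than 3 characters and throttles itself to 20 requests per second.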

## Lib

This crate also exposes a library, but for the time being its interfaces should be considered unstable.

## TODO

A list of ideas for future work:
- archive mode to crawl and save pages locally
- build dictionaries from local (archived) pages
- support different mime types
- add a collection of pre-canned 'themed' urls
- smarter/togglable parsing of html tags (e.g. to ignore js and css)
- more word filtering options

## License

Licensed under either of

- Apache License, Version 2.0
- MIT license

at your option.

## Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.