Also check out other xwde projects here.
The implementation of the robots.txt (or URL exclusion) protocol in the Rust programming language, with support for the `crawl-delay`, `sitemap`, and universal `*` match extensions (according to the RFC specification).
- `builder` to enable `robotxt::{RobotsBuilder, GroupBuilder}`. Enabled by default.
- `parser` to enable `robotxt::{Robots}`. Enabled by default.

Parse the most specific `user-agent` in the provided `robots.txt` file:

```rust
use robotxt::Robots;
fn main() { let txt = r#" User-Agent: foobot Disallow: * Allow: /example/ Disallow: /example/nope.txt "#.as_bytes();
    let r = Robots::from_bytes(txt, "foobot");
    assert!(r.is_allowed("/example/yeah.txt"));
    assert!(!r.is_allowed("/example/nope.txt"));
    assert!(!r.is_allowed("/invalid/path.txt"));
}
```
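
The universal `*` match extension also applies inside rule paths, not only as a bare `Disallow: *`. Below is a minimal sketch of that wildcard behaviour using the same parser API as above; the `anybot` name and the paths are illustrative, not part of the crate:

```rust
use robotxt::Robots;

fn main() {
    // A group for every crawler; `*` inside the path matches any characters.
    let txt = r#"
      User-Agent: *
      Disallow: /private/*/drafts
    "#
    .as_bytes();

    // `anybot` has no dedicated group, so the `*` group applies to it.
    let r = Robots::from_bytes(txt, "anybot");
    assert!(!r.is_allowed("/private/2024/drafts"));
    assert!(r.is_allowed("/private/2024/published"));
}
```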
Build a new `robots.txt` file from the provided directives:

```rust
use robotxt::RobotsBuilder;
fn main() {
    let txt = RobotsBuilder::default()
        .header("Robots.txt: Start")
        .group(["foobot"], |u| {
            u.crawl_delay(5)
                .header("Rules for Foobot: Start")
                .allow("/example/yeah.txt")
                .disallow("/example/nope.txt")
                .footer("Rules for Foobot: End")
        })
        .group(["barbot", "nombot"], |u| {
            u.crawl_delay(2)
                .disallow("/example/yeah.txt")
                .disallow("/example/nope.txt")
        })
        .sitemap("https://example.com/sitemap1.xml".try_into().unwrap())
        .sitemap("https://example.com/sitemap1.xml".try_into().unwrap())
        .footer("Robots.txt: End");
    println!("{}", txt.to_string());
}
```
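
Since the builder's output is plain `robots.txt` text, it can be fed straight back into the parser. Here is a minimal round-trip sketch, assuming both default features (`builder` and `parser`) are enabled; the group rules are illustrative:

```rust
use robotxt::{Robots, RobotsBuilder};

fn main() {
    // Build a small robots.txt with a single group for `foobot`.
    let txt = RobotsBuilder::default()
        .group(["foobot"], |u| {
            u.allow("/example/yeah.txt").disallow("/example/nope.txt")
        })
        .to_string();

    // Parse the generated text back and check that the rules survived.
    let r = Robots::from_bytes(txt.as_bytes(), "foobot");
    assert!(r.is_allowed("/example/yeah.txt"));
    assert!(!r.is_allowed("/example/nope.txt"));
}
```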
The `Host` directive is not supported.