robots_txt

robots_txt is a lightweight robots.txt parser and generator written in Rust.

Nothing extra.

Unstable

The implementation is a work in progress.

Installation

robots_txt is available on crates.io and can be included in your Cargo-enabled project like this:

Cargo.toml:

```toml
[dependencies]
robots_txt = "0.5"
```

Parsing & matching paths against rules

main.rs:

```rust
extern crate robots_txt;

use robots_txt::Robots;
use robots_txt::matcher::SimpleMatcher; // assumed module path for SimpleMatcher

static ROBOTS: &'static str = r#"
# robots.txt for http://www.site.com

User-Agent: *
Disallow: /cyberworld/map/ # this is an infinite virtual URL space

# Cybermapper knows where to go
User-Agent: cybermapper
Disallow:
"#;

fn main() {
    let robots = Robots::from_str(ROBOTS);

    // An unknown bot falls back to the catch-all section, which disallows /cyberworld/map/.
    let matcher = SimpleMatcher::new(&robots.choose_section("NoName Bot").rules);
    assert!(matcher.check_path("/some/page"));
    assert!(matcher.check_path("/cyberworld/welcome.html"));
    assert!(!matcher.check_path("/cyberworld/map/object.html"));

    // The cybermapper section disallows nothing, so every path is allowed.
    let matcher = SimpleMatcher::new(&robots.choose_section("Mozilla/5.0; CyberMapper v. 3.14").rules);
    assert!(matcher.check_path("/some/page"));
    assert!(matcher.check_path("/cyberworld/welcome.html"));
    assert!(matcher.check_path("/cyberworld/map/object.html"));
}
```
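The same matching flow can be applied to a robots.txt loaded at runtime. Below is a minimal sketch, assuming `Robots::from_str` accepts any `&str` (the example above only shows it with a `&'static str`) and that `SimpleMatcher` is exported from the `matcher` module; the `check_crawl` helper and the local `robots.txt` path are hypothetical:

```rust
extern crate robots_txt;

use std::fs;

use robots_txt::Robots;
use robots_txt::matcher::SimpleMatcher; // assumed module path for SimpleMatcher

// Hypothetical helper: returns true if `user_agent` may fetch `path`
// according to a robots.txt file stored on disk.
fn check_crawl(robots_path: &str, user_agent: &str, path: &str) -> bool {
    // Read the robots.txt body; an unreadable file is treated as "allow all" here.
    let body = match fs::read_to_string(robots_path) {
        Ok(body) => body,
        Err(_) => return true,
    };

    let robots = Robots::from_str(&body);
    let matcher = SimpleMatcher::new(&robots.choose_section(user_agent).rules);
    matcher.check_path(path)
}

fn main() {
    // "robots.txt" is just an example local file name.
    if check_crawl("robots.txt", "NoName Bot", "/cyberworld/map/object.html") {
        println!("allowed");
    } else {
        println!("disallowed");
    }
}
```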

Building & rendering

main.rs:

```rust
extern crate robots_txt;

use robots_txt::Robots;

fn main() {
    let robots1 = Robots::start_build()
        .start_section_for("cybermapper")
            .disallow("")
            .end_section()
        .start_section_for("*")
            .disallow("/cyberworld/map/")
            .end_section()
        .finalize();

    let robots2 = Robots::start_build()
        .host("example.com")
        .start_section_for("*")
            .disallow("/private")
            .disallow("")
            .crawl_delay(4.5)
            .request_rate(9, 20)
            .sitemap("http://example.com/sitemap.xml".parse().unwrap())
            .end_section()
        .finalize();

    println!("# robots.txt for http://cyber.example.com/\n\n{}", robots1);
    println!("# robots.txt for http://example.com/\n\n{}", robots2);
}
```

As a result we get

```
# robots.txt for http://cyber.example.com/

User-agent: cybermapper
Disallow:

User-agent: *
Disallow: /cyberworld/map/

# robots.txt for http://example.com/

User-agent: *
Disallow: /private
Disallow:
Crawl-delay: 4.5
Request-rate: 9/20
Sitemap: http://example.com/sitemap.xml

Host: example.com
```
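Because the built value is rendered with `{}` above, the generated robots.txt can also be captured as a string and written to disk. A minimal sketch using only the builder calls shown above; the output file name is just an example:

```rust
extern crate robots_txt;

use std::fs;

use robots_txt::Robots;

fn main() -> std::io::Result<()> {
    let robots = Robots::start_build()
        .start_section_for("*")
            .disallow("/private")
            .end_section()
        .finalize();

    // The builder result is printed with `{}` in the examples above,
    // so format! yields the final robots.txt text.
    let body = format!("# robots.txt for http://example.com/\n\n{}", robots);

    // "robots.txt" here is just an example output path.
    fs::write("robots.txt", body)?;
    Ok(())
}
```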

Alternatives

License

Licensed under either of

* Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.