A configurable parallel web crawler, designed to crawl a website for content.
```rust
extern crate url_crawler;
use url_crawler::*;

/// Function for filtering content in the crawler before a HEAD request.
///
/// Only allow directory entries, and files that have the `deb` extension.
fn apt_filter(url: &Url) -> bool {
    let url = url.as_str();
    url.ends_with("/") || url.ends_with(".deb")
}

pub fn main() {
    // Create a crawler designed to crawl the given website.
    let crawler = Crawler::new("http://apt.pop-os.org/".into())
        // Use four threads for fetching.
        .threads(4)
        // Check if a URL matches this filter before performing a HEAD request on it.
        .pre_fetch(apt_filter)
        // Initialize the crawler and begin crawling. This returns immediately.
        .crawl();

    // Process URL entries as they become available.
    for file in crawler {
        println!("{:#?}", file);
    }
}
```
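To see which URLs a filter like the one above would let through, the predicate can be exercised on its own. The following is a minimal sketch, not part of the crate: `apt_filter_str` and `retain_allowed` are hypothetical helper names, and the predicate takes a `&str` instead of a `&Url` so that it compiles with the standard library alone.

```rust
/// Hypothetical stand-in for the filter above: takes `&str` rather than `&Url`
/// (an assumption made so the sketch needs no external crates).
/// Accepts directory entries (trailing slash) and files ending in `.deb`.
fn apt_filter_str(url: &str) -> bool {
    url.ends_with('/') || url.ends_with(".deb")
}

/// Hypothetical helper: returns only the candidate URLs the filter accepts,
/// preserving their original order.
fn retain_allowed<'a>(candidates: &[&'a str]) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|url| apt_filter_str(*url))
        .collect()
}
```

Under these assumptions, a directory URL ending in `/` or a package URL ending in `.deb` passes, while any other path is rejected before a HEAD request would be issued for it.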
The following are two snippets from the crawler's combined output:
...
Html {
url: "http://apt.pop-os.org/proprietary/pool/bionic/main/source/s/system76-cudnn-9.2/"
}
Html {
url: "http://apt.pop-os.org/proprietary/pool/bionic/main/source/t/tensorflow-1.9-cuda-9.2/"
}
Html {
url: "http://apt.pop-os.org/proprietary/pool/bionic/main/source/t/tensorflow-1.9-cpu/"
}
...
File {
url: "http://apt.pop-os.org/proprietary/pool/bionic/main/binary-amd64/a/atom/atom_1.30.0_amd64.deb",
length: 87689398,
modified: Some(
2018-09-25T17:54:39+00:00
)
}
File {
url: "http://apt.pop-os.org/proprietary/pool/bionic/main/binary-amd64/a/atom/atom_1.31.1_amd64.deb",
length: 90108020,
modified: Some(
2018-10-03T22:29:15+00:00
)
}
...
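Entries like these can be aggregated as they arrive. The sketch below is illustrative only: `Entry` is a hypothetical stand-in that mirrors the two shapes printed above (the `modified` field is omitted), and `summarize` is a hypothetical helper. With the real crate, the entry type yielded by the crawler iterator would be matched instead.

```rust
/// Hypothetical stand-in for the crawler's entry type, mirroring the
/// `Html { url }` and `File { url, length, .. }` shapes shown in the output.
#[allow(dead_code)] // `url` is kept to mirror the output; this sketch does not read it.
#[derive(Debug)]
enum Entry {
    Html { url: String },
    File { url: String, length: u64 },
}

/// Counts HTML pages and files, and sums the reported file lengths in bytes.
/// Returns `(pages, files, total_bytes)`.
fn summarize<I>(entries: I) -> (usize, usize, u64)
where
    I: IntoIterator<Item = Entry>,
{
    let mut pages = 0usize;
    let mut files = 0usize;
    let mut total_bytes = 0u64;

    for entry in entries {
        match entry {
            Entry::Html { .. } => pages += 1,
            Entry::File { length, .. } => {
                files += 1;
                total_bytes += length;
            }
        }
    }

    (pages, files, total_bytes)
}
```

Because `summarize` accepts any `IntoIterator`, the same loop shape would work directly on a streaming iterator, processing entries as they become available rather than collecting them first.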