A web crawler for Rust.

This is a work in progress.
The wax worker is the crawler. It is being built out to generate, or "press", different document types such as `HtmlDocument`. The wax worker presses the documents; each specific document type implements methods to parse itself.
Crawling, and blind crawling in particular, is a slow process, but the last thing anyone wants with a crawler is to not be able to crawl at all.
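As a rough illustration of that split (a conceptual sketch only, not the crate's actual types or signatures), the worker's job is to fetch and "press" raw HTML into a typed document, and the document then exposes its own parsing methods:

```rust
// Conceptual sketch only: not waxy's actual types or method names.
// The worker fetches raw HTML and "presses" it into a typed document;
// the document is the thing that knows how to parse itself.
struct HtmlDocument {
    raw_html: String,
}

impl HtmlDocument {
    // "Pressing" here just means wrapping the fetched body in a typed document.
    fn press(raw_html: String) -> Self {
        HtmlDocument { raw_html }
    }

    // The document parses itself, e.g. pulling out the <title> text.
    fn title(&self) -> Option<&str> {
        let start = self.raw_html.find("<title>")? + "<title>".len();
        let end = start + self.raw_html[start..].find("</title>")?;
        Some(self.raw_html[start..end].trim())
    }
}

fn main() {
    let doc = HtmlDocument::press("<html><title>Example</title></html>".to_string());
    println!("{:?}", doc.title());
}
```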
01/02/22 - I am looking at swapping out the select.rs dependency for scraper (https://docs.rs/scraper/latest/scraper/). I am also looking at creating my own parser, which would be included in another crate.
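For a sense of what that swap could look like, here is a minimal sketch using scraper's public API (`titles_from_html` is an illustrative helper, not part of waxy):

```rust
// Minimal sketch of HTML parsing with the scraper crate, assuming it
// replaced select.rs. The helper below is illustrative only.
use scraper::{Html, Selector};

fn titles_from_html(raw_html: &str) -> Vec<String> {
    let document = Html::parse_document(raw_html);
    // Selector::parse only fails on an invalid CSS selector.
    let selector = Selector::parse("title").unwrap();
    document
        .select(&selector)
        .map(|element| element.text().collect::<String>())
        .collect()
}

fn main() {
    println!("{:?}", titles_from_html("<html><title>Example</title></html>"));
}
```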
How to use:
```toml
[dependencies]
waxy = "0.1.0"
tokio = { version = "1", features = ["full"] }
```
```rust
use waxy::workers::WaxWorker;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Wax worker

    /*
    Create a single document from a url.
    */
    match WaxWorker::press_document("https://example.com").await {
        Ok(res) => {
            println!("{:?}", res);
        },
        Err(..) => {
            println!("went bad")
        }
    }

    println!();
    println!("----------------------");
    println!();

    /*
    Crawl a vector of urls for a vector of documents.
    */
    match WaxWorker::press_documents(vec!["https://example.com"]).await {
        Ok(res) => {
            println!("{:?}", res.len());
        },
        Err(..) => {
            println!("went bad")
        }
    }

    println!();
    println!("----------------------");
    println!();

    /*
    Crawl a domain; the "1" is the limit of pages you are willing to crawl.
    */
    match WaxWorker::press_documents_blind("https://example.com", 1).await {
        Ok(res) => {
            println!("{:?}", res.len());
        },
        Err(..) => {
            println!("went bad")
        }
    }

    /*
    Blind crawl a domain for links.
    Inputs:
        url to site
        link limit, the number of links you are willing to grab
        page limit, the number of pages to crawl for links
    */
    match WaxWorker::press_urls("https://example.com", 1, 1).await {
        Ok(res) => {
            println!("{:?}", res.len());
        },
        Err(..) => {
            println!("went bad")
        }
    }

    println!();
    println!("----------------------");
    println!();

    /*
    Blind crawl a domain for links that match a pattern.
    Inputs:
        url to site
        pattern the url should match
        link limit, the number of links you are willing to grab
        page limit, the number of pages to crawl for links
    */
    match WaxWorker::press_curated_urls("https://example.com", ".", 1, 1).await {
        Ok(res) => {
            println!("{:?}", res);
        },
        Err(..) => {
            println!("went bad")
        }
    }

    println!();
    println!("----------------------");
    println!();

    /*
    Blind crawl a domain for documents whose urls match a pattern.
    Inputs:
        url to site
        pattern the url should match
        page limit, the number of pages to crawl for links
    */
    match WaxWorker::press_curated_documents("https://example.com", ".", 1).await {
        Ok(res) => {
            println!("{:?}", res);
        },
        Err(..) => {
            println!("went bad")
        }
    }

    println!();
    println!("----------------------");
    println!();

    // Get a document.
    let document = WaxWorker::press_document("https://example.com").await.unwrap();

    // Get anchors.
    println!("{:?}", document.anchors());
    println!();
    println!("{:?}", document.anchors_curate("."));
    println!();
    println!("{:?}", document.domain_anchors());
    println!();

    // Call headers.
    println!("{:?}", document.headers);
    println!();

    // Call meta data.
    println!("{:?}", document.meta_data());
    println!();

    // Tag html and text.
    println!("{:?}", document.tag_html("title"));
    println!();
    println!("{:?}", document.tag_text("title"));
    println!();

    Ok(())
}
```