A gRPC web indexer turbocharged for performance.
This project is capable of handling millions of pages per second efficiently.
Make sure you have Rust or Docker installed.
This project requires that you start up another gRPC server on port 50051
following the proto spec.
The user agent is spoofed on each crawl to a random agent, and the indexer extends spider as the base.
```sh
cargo run
```

or

```sh
docker compose up
```
The crate is available to set up a gRPC server within Rust projects. You can install it easily with the following:
```sh
cargo install website_crawler
```
You can also use the Docker image at `a11ywatch/crawler`. Set the `CRAWLER_IMAGE` env var to `darwin-arm64` to get the native M1 Mac image.
```yml
crawler:
  container_name: crawler
  image: "a11ywatch/crawler:${CRAWLER_IMAGE:-latest}"
  ports:
    - 50055
```
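
For example, to start the crawler service with the native M1 image (a minimal sketch, assuming the service above is defined in your docker-compose.yml):

```sh
CRAWLER_IMAGE=darwin-arm64 docker compose up crawler
```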
We also release the package to npm as @a11ywatch/crawler.
```sh
npm i @a11ywatch/crawler
```
Afterwards, import it at the top of your project to start the gRPC server, or run node directly against the module.
```ts
import "@a11ywatch/crawler";
```
This is a basic example crawling a web page. Add `website_crawler` to your `Cargo.toml`:

```toml
[dependencies]
website_crawler = "0.7.38"
```
And then the code:
```rust,no_run
extern crate website_crawler;

use website_crawler::tokio;
use website_crawler::website::Website;

#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com";
    let mut website: Website = Website::new(&url);
    website.crawl().await;

    for page in website.get_pages() {
        println!("- {}", page.get_url());
    }
}
```
You can use the `Configuration` object to configure your crawler:
```rust
// ..
let mut website: Website = Website::new("https://choosealicense.com");
website.configuration.blacklist_url.push("https://choosealicense.com/licenses/".to_string());
website.configuration.respect_robots_txt = true;
website.configuration.subdomains = true;
website.configuration.tld = false;
website.configuration.sitemap = false;
website.configuration.proxy = "http://username:password@localhost:1234";
website.configuration.delay = 0; // Defaults to 250 ms
website.configuration.user_agent = "myapp/version".to_string(); // Defaults to spider/x.y.z, where x.y.z is the library version
website.crawl().await;
```
A basic example can also be run with the bundled examples.

In one terminal, run the server:

```sh
cargo run --example server --release
```

In another terminal, run the client:

```sh
cargo run --example client --release
```
https://user-images.githubusercontent.com/8095978/221221122-cfed83aa-6ca1-4122-a1db-0d9948e9f9d9.mov
npm is required to install the protocol buffers.

In order to build `crawler` locally (>= 0.5.0), you need the `protoc` Protocol Buffers compiler, along with Protocol Buffers resource files. The proto compiler needs to be at v3 in order to compile; Ubuntu 18+ installs a compatible version automatically.
```bash
sudo apt update && sudo apt upgrade -y
sudo apt install -y protobuf-compiler libprotobuf-dev
```
On Alpine Linux:

```sh
sudo apk add protoc protobuf-dev
```
On macOS, assuming Homebrew is already installed. (If not, see instructions for installing Homebrew on the Homebrew website.)

```zsh
brew install protobuf
```
- `jemalloc` - use the jemalloc memory allocator (disabled by default).
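
For example, to opt into the allocator in your `Cargo.toml` (a minimal sketch, assuming `jemalloc` is exposed as a crate feature of `website_crawler`):

```toml
[dependencies]
website_crawler = { version = "0.7.38", features = ["jemalloc"] }
```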
This crawler is optimized for reduced latency and uses isolation-based concurrency; it can handle over 10,000 pages within seconds.
In order to receive the links found by the crawler, you need to add the `website.proto` to your server. This is required since every request spawns a thread; isolating the context drastically improves performance (preventing shared resources/communication).
If you need help implementing the gRPC server to receive the pages or links when found, check out the gRPC node example for a starting point.
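
As an illustration only, a minimal Node/TypeScript server using `@grpc/grpc-js` and `@grpc/proto-loader` might look like the sketch below; the package, service, and method names are hypothetical placeholders and must be matched to the actual `website.proto` spec:

```ts
import * as grpc from "@grpc/grpc-js";
import * as protoLoader from "@grpc/proto-loader";

// Load the proto definition shipped with the crawler (the path is an assumption).
const packageDefinition = protoLoader.loadSync("website.proto");
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

const server = new grpc.Server();

// "website.WebsiteService" and "scan" are hypothetical names -- replace them
// with the service and RPC actually defined in website.proto.
server.addService(proto.website.WebsiteService.service, {
  scan: (
    call: grpc.ServerUnaryCall<any, any>,
    callback: grpc.sendUnaryData<any>
  ) => {
    // Each call delivers a page or link found by the crawler.
    console.log("received", call.request);
    callback(null, {});
  },
});

// The crawler expects this server to listen on port 50051.
server.bindAsync(
  "0.0.0.0:50051",
  grpc.ServerCredentials.createInsecure(),
  () => server.start()
);
```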
Building the project is a bit expensive; we welcome all contributions, including help with the cost of deploying the best Docker images for everyone to utilize. If you want to help, send an email to A11yWatch support. At A11yWatch we use a custom variation of the project, and we do not need to target other platforms since we deploy on similar architecture. Giving back to the community requires a bit more time and effort for some of the solutions that we released.
Check the license file in the root of the project.