crawler

A gRPC web indexer turbo charged for performance.

Getting Started

Make sure to have Rust installed or Docker.

This project requires that you start up another gRPC server on port 50051 following the proto spec.

The user agent is spoofed on each crawl to a random agent and the indexer extends spider as the base.

Make sure to have npm installed in order to build the proto defs from @a11ywatch/protos.

cargo run or docker compose up

Installation

You can install easily with the following:

Cargo

The crate is available to setup a gRPC server within rust projects.

sh cargo install website_crawler

Docker

You can use also use the docker image at a11ywatch/crawler.

Set the CRAWLER_IMAGE env var to darwin-arm64 to get the native m1 mac image.

yml crawler: container_name: crawler image: "a11ywatch/crawler:${CRAWLER_IMAGE:-latest}" ports: - 50055

Node / Bun

We also release the package to npm @a11ywatch/crawler.

sh npm i @a11ywatch/crawler

After import at the top of your project to start the gRPC server or run node directly against the module.

ts import "@a11ywatch/crawler";

Dependencies

npm is required to install the protcol buffers.

In order to build crawler locally >= 0.5.0, you need the protoc Protocol Buffers compiler, along with Protocol Buffers resource files.

Ubuntu

bash sudo apt update && sudo apt upgrade -y sudo apt install -y protobuf-compiler libprotobuf-dev

Alpine Linux

sh sudo apk add protoc protobuf-dev

macOS

Assuming Homebrew is already installed. (If not, see instructions for installing Homebrew on the Homebrew website.)

zsh brew install protobuf

About

This crawler is optimized for reduced latency and performance as it can handle over 10,000 pages within seconds. In order to receive the links found for the crawler you need to add the website.proto to your server. This is required since every request spawns a thread. Isolating the context drastically improves performance (preventing shared resources / communication ).

Help

If you need help implementing the gRPC server to receive the pages or links when found check out the gRPC node example for a starting point .

TODO

Allow gRPC server port setting or change to direct url:port combo.

LICENSE

Check the license file in the root of the project.