Broad web crawling is the process of traversing the practically boundless web, starting from a set of locations (URLs) and following outgoing links
It presents a unique set of challenges that must be overcome to get a stable and scalable system. Crusty
is an attempt to tackle some of those challenges and see what's out there
This particular implementation can be used to quickly fetch a subset of the observable internet in a scalable and stable manner
Built on top of crusty-core, which handles all low-level aspects of web crawling
Configurability && extensibility
See a typical config file with some explanations
Blazing fast single node performance
Crusty is written in Rust on top of green threads running on tokio, so it can achieve quite impressive single-node performance even on a moderate PC
Additional optimizations are possible to improve this further (channels, better HTML parsing, depending on task requirements)
Additionally, Crusty has a small and predictable memory footprint and is usually CPU/network bound
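To give an idea of the kind of concurrency tokio's green threads allow, here is a minimal sketch, not Crusty's code: it spawns thousands of cheap asynchronous tasks over a handful of OS threads. fetch_page and the URLs are hypothetical stand-ins for real fetching.

```rust
use std::time::Duration;

// Hypothetical stand-in for a page fetch; a real crawler would do an HTTP
// request and HTML parsing here.
async fn fetch_page(url: String) -> usize {
    tokio::time::sleep(Duration::from_millis(50)).await; // simulated network I/O
    url.len()
}

// Requires the tokio crate (e.g. with the "full" feature set).
#[tokio::main]
async fn main() {
    // Spawn thousands of green threads; each task is only a small heap
    // allocation, which is what keeps memory usage small and predictable.
    let handles: Vec<_> = (0..10_000)
        .map(|i| tokio::spawn(fetch_page(format!("https://example.com/page/{i}"))))
        .collect();

    let mut total = 0usize;
    for h in handles {
        total += h.await.unwrap_or(0);
    }
    println!("fetched {total} bytes (simulated)");
}
```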
Scalability
Each Crusty node is essentially an independent crawling unit, and we can run hundreds of them in parallel. The tricky part is job delegation and domain discovery, which is solved by a high-performance sharded queue-like structure built on top of ClickHouse.
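The actual queue lives in ClickHouse, but the core idea of delegating each discovered domain to exactly one crawling node can be sketched as hash-based shard assignment; shard_for_domain, the shard count, and the domain list below are purely illustrative assumptions, not Crusty's implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a domain to one of `num_shards` shards so that every node only reads
/// jobs from the shards it owns. Hypothetical illustration only.
fn shard_for_domain(domain: &str, num_shards: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    domain.hash(&mut hasher);
    hasher.finish() % num_shards
}

fn main() {
    let num_shards = 16; // assumed shard count, purely for the example
    for domain in ["example.com", "tumblr.com", "rust-lang.org"] {
        println!("{domain} -> shard {}", shard_for_domain(domain, num_shards));
    }
}
```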
Basic politeness
While we can crawl thousands of domains in parallel, we should absolutely limit concurrency at the per-domain level
to avoid putting any stress on the crawled sites; see job_reader.default_crawler_settings.concurrency.
It's also good practice to introduce delays between visiting pages; see job_reader.default_crawler_settings.delay.
Additionally, there are massive sites with millions of sub-domains (usually blogs), such as tumblr.com.
Special care should be taken when crawling them, so we implement a sub-domain concurrency limitation as well; see the job_reader.domain_tail_top_n
setting, which defaults to 3: no more than 3 sub-domains can be crawled concurrently
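A rough sketch of what per-domain limiting and delays look like in practice, using standard tokio primitives; this is an illustration rather than Crusty's implementation, and the concurrency/delay values and fetch_page helper are made up to mirror the settings above.

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

// Hypothetical stand-in for fetching a single page of a domain.
async fn fetch_page(domain: &str, page: usize) {
    println!("fetching https://{domain}/page/{page}");
    tokio::time::sleep(Duration::from_millis(100)).await; // simulated network I/O
}

#[tokio::main]
async fn main() {
    // Rough analogues of job_reader.default_crawler_settings.{concurrency, delay};
    // the values are made up for the example.
    let per_domain_concurrency = 2;
    let delay = Duration::from_millis(500);

    let limiter = Arc::new(Semaphore::new(per_domain_concurrency));
    let mut tasks = Vec::new();

    for page in 0..10 {
        let limiter = Arc::clone(&limiter);
        tasks.push(tokio::spawn(async move {
            // At most `per_domain_concurrency` pages of this domain in flight at once.
            let _permit = limiter.acquire().await.expect("semaphore closed");
            fetch_page("example.com", page).await;
            // Politeness delay before the permit is released and the next page starts.
            tokio::time::sleep(delay).await;
        }));
    }

    for t in tasks {
        t.await.expect("task panicked");
    }
}
```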
Currently, there's no robots.txt
support, but this can be added easily (and will be)
Observability
Crusty uses tracing and stores multiple metrics in ClickHouse, which we can observe with Grafana, giving real-time insight into crawling performance
There is a Dockerfile for easier building and distribution: docker build -f ./infra/Dockerfile -t crusty .
(supports incremental builds)
For now, see those notes; docker compose support is coming a bit later
To create / clean the DB, use this script
The Grafana dashboard is exported as a JSON model
I'm open to discussions and contributions; use GitHub issues, pull requests are welcome ;)