Crusty - polite && scalable broad web crawler

Introduction

Broad web crawling is the activity of traversing the practically boundless web by starting from a set of locations (URLs) and following outgoing links.
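The idea can be sketched as a breadth-first frontier loop: seed URLs go into a queue, each visited page yields outgoing links, and already-seen URLs are skipped. This is a minimal illustration only, not Crusty's actual implementation; the link extractor here is a stub over a static map (a real crawler would fetch and parse each page).

```rust
use std::collections::{HashSet, VecDeque};

// Hypothetical link extractor: stands in for fetching a page and parsing
// its outgoing links. The URLs below are made up for illustration.
fn extract_links(url: &str) -> Vec<String> {
    match url {
        "https://a.example" => vec!["https://b.example".into(), "https://c.example".into()],
        "https://b.example" => vec!["https://a.example".into(), "https://d.example".into()],
        _ => vec![],
    }
}

// Breadth-first crawl frontier: start from seed URLs, follow outgoing
// links, never visit the same URL twice, stop after `limit` pages.
fn crawl(seeds: &[&str], limit: usize) -> Vec<String> {
    let mut frontier: VecDeque<String> = seeds.iter().map(|s| s.to_string()).collect();
    let mut seen: HashSet<String> = frontier.iter().cloned().collect();
    let mut visited = Vec::new();

    while let Some(url) = frontier.pop_front() {
        if visited.len() >= limit {
            break;
        }
        visited.push(url.clone());
        for link in extract_links(&url) {
            // HashSet::insert returns true only for previously unseen URLs.
            if seen.insert(link.clone()) {
                frontier.push_back(link);
            }
        }
    }
    visited
}

fn main() {
    let order = crawl(&["https://a.example"], 10);
    println!("{:?}", order);
    // → ["https://a.example", "https://b.example", "https://c.example", "https://d.example"]
}
```

The `seen` set is what keeps a broad crawl from looping forever on the web's cyclic link graph; everything else is bookkeeping.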

It presents a unique set of challenges one must overcome to get a stable and scalable system; Crusty is an attempt to tackle some of those challenges and see what's out there.
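One such challenge is politeness: a broad crawler must not hammer any single domain with back-to-back requests. A minimal sketch of per-domain rate limiting follows; the type and its fields are hypothetical illustrations, not Crusty's actual machinery.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Illustrative per-domain politeness tracker: before fetching a URL, the
// caller asks how long to wait so that at least `delay` passes between
// consecutive requests to the same domain.
struct Politeness {
    delay: Duration,
    last_hit: HashMap<String, Instant>,
}

impl Politeness {
    fn new(delay: Duration) -> Self {
        Self { delay, last_hit: HashMap::new() }
    }

    // Returns how long the caller should sleep before hitting `domain`,
    // and records the (post-sleep) request time.
    fn wait_time(&mut self, domain: &str) -> Duration {
        let now = Instant::now();
        let wait = match self.last_hit.get(domain) {
            Some(&prev) => self.delay.checked_sub(now - prev).unwrap_or(Duration::ZERO),
            None => Duration::ZERO,
        };
        // Assume the caller actually sleeps for `wait` before requesting.
        self.last_hit.insert(domain.to_string(), now + wait);
        wait
    }
}

fn main() {
    let mut p = Politeness::new(Duration::from_millis(100));
    println!("first wait:  {:?}", p.wait_time("example.com")); // zero: never seen
    println!("second wait: {:?}", p.wait_time("example.com")); // close to 100ms
}
```

Scaling this up (many domains, many concurrent fetches, persistent state) is exactly where the hard engineering lives, and is what projects like Crusty have to solve.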

This particular implementation can be used to quickly fetch a subset of the observable internet in a scalable and stable manner.

Built on top of crusty-core, which handles all low-level aspects of web crawling.

Key features

Example

Getting started

There is a Dockerfile for easier building and distribution: docker build -f ./infra/Dockerfile -t crusty . (supports incremental builds)

For now, see those notes; docker-compose support is coming a bit later.

To create / clean the DB, use this script.

The Grafana dashboard is exported as a JSON model.

Contributing

I'm open to discussions and contributions; use GitHub issues. Pull requests are welcome ;)