
Crusty - polite && scalable broad web crawler

Introduction

Broad web crawling is the activity of traversing the practically boundless web by starting from a set of locations (URLs) and following outgoing links. Usually it doesn't matter where you start, as long as the starting pages have outgoing links to external domains.
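To make the idea concrete, here is a minimal sketch of the "start from seeds, follow outgoing links" loop. It is not how Crusty or crusty-core is implemented: the `LinkGraph` type and the `crawl` function are hypothetical names made up for this illustration, and an in-memory map stands in for actual network fetches.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical stand-in for the network: maps a URL to the links found on that page.
type LinkGraph = HashMap<&'static str, Vec<&'static str>>;

/// Visit pages breadth-first from `seeds`, following outgoing links,
/// and stop after `max_pages` pages (the real web is effectively unbounded).
fn crawl(graph: &LinkGraph, seeds: &[&'static str], max_pages: usize) -> Vec<&'static str> {
    let mut frontier: VecDeque<&'static str> = seeds.iter().copied().collect();
    let mut seen: HashSet<&'static str> = seeds.iter().copied().collect();
    let mut visited = Vec::new();

    while let Some(url) = frontier.pop_front() {
        if visited.len() >= max_pages {
            break;
        }
        visited.push(url);
        // "Fetch" the page: here it is just a map lookup; unknown pages have no links.
        if let Some(links) = graph.get(url) {
            for &link in links {
                // Enqueue each URL only once.
                if seen.insert(link) {
                    frontier.push_back(link);
                }
            }
        }
    }
    visited
}
```

The `seen` set is what keeps the loop from revisiting pages, and the `max_pages` cap is the simplest possible stop condition; a real broad crawler would need many more constraints (politeness, per-domain limits, and so on).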

It presents a unique set of challenges one must overcome to get a stable and scalable system. Crusty is an attempt to tackle some of those challenges and see what's out there, while having fun with Rust ;)

This particular implementation could be used to quickly fetch a subset of the observable internet and, for example, discover the most popular domains/links.
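As a rough illustration of the "most popular domains" use case, the sketch below counts how often each domain shows up in a list of discovered links. It is an assumption-laden toy, not Crusty's actual pipeline: `domain_of` and `top_domains` are hypothetical helpers, and the host extraction is deliberately naive (it keeps ports and userinfo; a real crawler would use a proper URL parser).

```rust
use std::collections::HashMap;

/// Naive host extraction (an assumption for this sketch): drop the scheme,
/// then keep everything up to the first '/', '?' or '#'.
fn domain_of(url: &str) -> Option<&str> {
    let rest = url.split_once("://")?.1;
    let host = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// Count how often each domain appears among the discovered links and return
/// the top `n` (domain, count) pairs, sorted by count descending, then by name.
fn top_domains<'a>(links: &[&'a str], n: usize) -> Vec<(&'a str, usize)> {
    let mut counts: HashMap<&'a str, usize> = HashMap::new();
    for &link in links {
        // Links without a scheme or host are silently skipped.
        if let Some(domain) = domain_of(link) {
            *counts.entry(domain).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<_> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    ranked.truncate(n);
    ranked
}
```

Calling `top_domains(&links, 10)` on a slice of link strings would give the ten most frequently linked domains in that slice.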

Built on top of crusty-core, which handles all low-level aspects of web crawling.

Key features

example

Getting started

Install docker && docker-compose by following the instructions at:

https://docs.docker.com/get-docker/

https://docs.docker.com/compose/install/

git clone https://github.com/let4be/crusty && cd crusty

additionally


If you decide to build manually via cargo build, remember that a release build is a lot faster (and the default is debug).

In a real-world usage scenario on a high-bandwidth channel, docker might become a bit too expensive, so it might be a good idea either to run directly or at least in network_mode = host.

External service dependencies - clickhouse and grafana

just use docker-compose, it's the recommended way to play with Crusty

however...

To create / clean the db, use this sql (it must be fed to the clickhouse client in the context of the clickhouse docker container).

The grafana dashboard is exported as a json model.

Development

Contributing

I'm open to discussions/contributions - use github issues.

Pull requests are welcome ;)