sqlite-robotstxt

A SQLite extension for parsing robots.txt files. Based on sqlite-loadable-rs and the robotstxt crate.

Usage

Check whether a given User-Agent may access a given path, based on the rules of a robots.txt file.

```sql
select robotstxt_matches(
  readfile('robots.txt'),
  'My-Agent',
  '/path'
); -- 0 or 1
```

Find all User-Agents listed in a robots.txt file.

```sql
select *
from robotstxt_user_agents(
  readfile('robots.txt')
);
/*
┌─────────────────────┬────────┬───────┐
│ name                │ source │ rules │
├─────────────────────┼────────┼───────┤
│ *                   │ 1      │       │
│ AdsBot-Google       │ 280    │       │
│ Twitterbot          │ 288    │       │
│ facebookexternalhit │ 295    │       │
└─────────────────────┴────────┴───────┘
*/
```

Use with sqlite-http to request robots.txt files on the fly.

```sql
select *
from robotstxt_user_agents(
  http_get_body('https://en.wikipedia.org/robots.txt')
)
limit 10;
/*
┌────────────────────────────┬────────┬───────┐
│ name                       │ source │ rules │
├────────────────────────────┼────────┼───────┤
│ MJ12bot                    │ 11     │       │
│ Mediapartners-Google*      │ 15     │       │
│ IsraBot                    │ 19     │       │
│ Orthogaffe                 │ 22     │       │
│ UbiCrawler                 │ 27     │       │
│ DOC                        │ 30     │       │
│ Zao                        │ 33     │       │
│ sitecheck.internetseer.com │ 38     │       │
│ Zealbot                    │ 41     │       │
│ MSIECrawler                │ 44     │       │
└────────────────────────────┴────────┴───────┘
*/
```
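The same combination should also work with robotstxt_matches, to check several paths against a fetched robots.txt. This is a minimal sketch, assuming both sqlite-robotstxt and sqlite-http are loaded. The `robots` and `paths` CTE names, the example paths, and the `My-Agent` string are illustrative only and are not part of either extension.

```sql
-- Sketch only: CTE names, paths, and the agent string are made up for illustration.
with robots(body) as (
  -- Fetch the robots.txt body with sqlite-http.
  select http_get_body('https://en.wikipedia.org/robots.txt')
),
paths(path) as (
  -- Hypothetical paths to test.
  values ('/'), ('/wiki/Main_Page'), ('/w/index.php')
)
select
  paths.path,
  -- Same argument order as above: robots.txt contents, User-Agent, path.
  robotstxt_matches(robots.body, 'My-Agent', paths.path) as allowed
from robots, paths;
```

Each row pairs a path with the 0 or 1 result for that agent.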

TODO