Skyscraper - HTML scraping with XPath

License: MIT

A Rust library for scraping HTML documents with XPath expressions.

HTML Parsing

Skyscraper has its own HTML parser implementation. The parser outputs a tree structure that can be traversed manually with parent/child relationships.

Example: Simple HTML Parsing

```rust
use skyscraper::html::{self, parse::ParseError};

let html_text = r##"
<html>
    <body>
        <div>Hello world</div>
    </body>
</html>"##;

// Parse the HTML text into a document tree
let document = html::parse(html_text)?;
```

Example: Traversing Parent/Child Relationships

```rust
use skyscraper::html::{self, DocumentNode};

// Parse the HTML text into a document
let text = r#"<parent><child/><child/></parent>"#;
let document = html::parse(text)?;

// Get the children of the root node
let parent_node: DocumentNode = document.root_node;
let children: Vec<DocumentNode> = parent_node.children(&document).collect();
assert_eq!(2, children.len());

// Get the parent of both child nodes
let parent_of_child0: DocumentNode = children[0].parent(&document).expect("parent of child 0 missing");
let parent_of_child1: DocumentNode = children[1].parent(&document).expect("parent of child 1 missing");

assert_eq!(parent_node, parent_of_child0);
assert_eq!(parent_node, parent_of_child1);
```
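
The traversal calls above take `&document` as an argument, which suggests that nodes are lightweight handles and the document owns the tree data. The sketch below shows one common way to build such a tree, an index-based arena, using only the standard library. It is an illustration of the pattern, not Skyscraper's actual implementation; `NodeId`, `Node`, `Document`, `add_child`, and `build_example` are hypothetical names.

```rust
/// Handle to a node: just an index into the document's node list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NodeId(usize);

/// Hypothetical node record; not Skyscraper's actual type.
struct Node {
    tag: String,
    parent: Option<NodeId>,
    children: Vec<NodeId>,
}

/// Hypothetical document that owns every node in a single arena.
struct Document {
    nodes: Vec<Node>,
}

impl Document {
    /// Create a document containing only a root node and return both.
    fn new(root_tag: &str) -> (Document, NodeId) {
        let root = Node { tag: root_tag.to_string(), parent: None, children: Vec::new() };
        (Document { nodes: vec![root] }, NodeId(0))
    }

    /// Append a child under `parent` and return the new node's handle.
    fn add_child(&mut self, parent: NodeId, tag: &str) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(Node { tag: tag.to_string(), parent: Some(parent), children: Vec::new() });
        self.nodes[parent.0].children.push(id);
        id
    }
}

impl NodeId {
    /// Children are looked up through the document, hence the `&Document` argument.
    fn children<'a>(self, doc: &'a Document) -> impl Iterator<Item = NodeId> + 'a {
        doc.nodes[self.0].children.iter().copied()
    }

    /// The root has no parent, so this returns `None` for it.
    fn parent(self, doc: &Document) -> Option<NodeId> {
        doc.nodes[self.0].parent
    }

    fn tag(self, doc: &Document) -> &str {
        &doc.nodes[self.0].tag
    }
}

/// Build the equivalent of `<parent><child/><child/></parent>` by hand.
fn build_example() -> (Document, NodeId) {
    let (mut doc, root) = Document::new("parent");
    doc.add_child(root, "child");
    doc.add_child(root, "child");
    (doc, root)
}
```

Because a handle is only an index, it is cheap to copy and compare, and the borrow checker never has to reason about parent and child references pointing at each other.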

XPath Expressions

Skyscraper is capable of parsing XPath strings and applying them to HTML documents.

```rust
use skyscraper::{html, xpath};

// Parse the HTML text into a document.
let html_text = r##"
<div>
    <div class="foo">
        <span>yes</span>
    </div>
    <div class="bar">
        <span>no</span>
    </div>
</div>
"##;
let document = html::parse(html_text)?;

// Parse and apply the XPath expression.
let expr = xpath::parse("//div[@class='foo']/span")?;
let results = expr.apply(&document)?;
assert_eq!(1, results.len());

// Get the text from the matching node.
let text = results[0].get_text(&document).expect("text missing");
assert_eq!("yes", text);
```
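
To make the selection logic concrete, here is a minimal standard-library sketch of what an expression like `//div[@class='foo']/span` asks for: first find every `div` with class `foo` at any depth, then take the direct `span` children of each match. It is a hand-written approximation on a toy tree, not how Skyscraper evaluates XPath. The `Element` type and all function names are hypothetical, and the sample tree (a `foo` div whose span holds "yes" and a `bar` div whose span holds "no") is an assumption chosen for illustration.

```rust
/// Toy element node; a stand-in, not Skyscraper's data model.
struct Element {
    tag: &'static str,
    class: Option<&'static str>,
    text: Option<&'static str>,
    children: Vec<Element>,
}

/// `//tag[@class='class']`: collect `node` and every descendant that matches,
/// in document (pre-order) order.
fn collect_matching<'a>(node: &'a Element, tag: &str, class: &str, out: &mut Vec<&'a Element>) {
    if node.tag == tag && node.class == Some(class) {
        out.push(node);
    }
    for child in &node.children {
        collect_matching(child, tag, class, out);
    }
}

/// `/tag`: select only the direct children with the given tag name.
fn children_with_tag<'a>(node: &'a Element, tag: &str) -> Vec<&'a Element> {
    node.children.iter().filter(|c| c.tag == tag).collect()
}

/// Evaluate the equivalent of `//div[@class='foo']/span` against the toy tree.
fn select_foo_spans(root: &Element) -> Vec<&Element> {
    let mut divs = Vec::new();
    collect_matching(root, "div", "foo", &mut divs);
    divs.into_iter()
        .flat_map(|d| children_with_tag(d, "span"))
        .collect()
}

/// Helper that builds a childless element carrying text.
fn leaf(tag: &'static str, text: &'static str) -> Element {
    Element { tag, class: None, text: Some(text), children: Vec::new() }
}

/// Assumed sample tree: an outer div holding a `foo` div and a `bar` div.
fn sample_document() -> Element {
    Element {
        tag: "div",
        class: None,
        text: None,
        children: vec![
            Element { tag: "div", class: Some("foo"), text: None, children: vec![leaf("span", "yes")] },
            Element { tag: "div", class: Some("bar"), text: None, children: vec![leaf("span", "no")] },
        ],
    }
}
```

Calling `select_foo_spans(&sample_document())` walks the whole tree for the first step but inspects only direct children for the second, which is the difference between the `//` and `/` separators.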