tl

tl is a fast HTML parser written in pure Rust.

Usage

Add tl to your dependencies.

```toml
[dependencies]
tl = "0.7.4"
# or, with explicit SIMD support
# (requires a nightly compiler!)
tl = { version = "0.7.4", features = ["simd"] }
```

The main function is tl::parse(). It accepts an HTML source string and parses it. Note that tl currently silently ignores invalid tags, much like browsers do. As a result, large chunks of the HTML document can be missing from the resulting AST. This will likely become customizable in the future, in case you need explicit error checking.

```rust
let dom = tl::parse(r#"<p id="text">Hello</p>"#, tl::ParserOptions::default()).unwrap();
let parser = dom.parser();
let element = dom.get_element_by_id("text")
    .expect("Failed to find element")
    .get(parser)
    .unwrap();

assert_eq!(element.inner_text(parser), "Hello");
```

Examples

Finding a tag using the query selector API

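```rust
// Sample markup: any document containing an <img> tag with a src attribute works here.
let dom = tl::parse(r#"<div><img src="cool-image.png" /></div>"#, tl::ParserOptions::default()).unwrap();
let img = dom.query_selector("img[src]").unwrap().next();

assert!(img.is_some());
```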

Iterating over the subnodes of an HTML document

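```rust
// Sample markup: any document containing an <img> tag works here.
let dom = tl::parse(r#"<div><img src="cool-image.png" /></div>"#, tl::ParserOptions::default()).unwrap();
let img = dom.nodes()
    .iter()
    .find(|node| {
        node.as_tag().map_or(false, |tag| tag.name() == "img")
    });

assert!(img.is_some());
```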

Mutating the href attribute of an anchor tag:

In a real-world scenario, you would want to handle errors properly instead of unwrapping.

```rust
// Sample markup: an anchor with a relative href that is rewritten below.
let input = r#"<div><a href="/about">About</a></div>"#;
let mut dom = tl::parse(input, tl::ParserOptions::default())
    .expect("HTML string too long");

let anchor = dom.query_selector("a[href]")
    .expect("Failed to parse query selector")
    .next()
    .expect("Failed to find anchor tag");

let parser_mut = dom.parser_mut();

let anchor = anchor.get_mut(parser_mut)
    .expect("Failed to resolve node")
    .as_tag_mut()
    .expect("Failed to cast Node to HTMLTag");

let attributes = anchor.attributes_mut();

attributes.get_mut("href")
    .flatten()
    .expect("Attribute not found or malformed")
    .set("http://localhost/about");

assert_eq!(attributes.get("href").flatten(), Some(&"http://localhost/about".into()));
```

SIMD-accelerated parsing

This crate has utility functions used by the parser which make use of SIMD (e.g. finding a specific byte by looking at the next 16 bytes at once, instead of going through the string byte by byte). They are disabled by default and must be enabled explicitly with the simd feature flag, because they depend on the unstable portable_simd feature. This requires a nightly compiler!

If the simd feature is not enabled, the parser falls back to stable alternatives that do not explicitly use SIMD intrinsics but are still reasonably well optimized. They use techniques such as manual loop unrolling, which cuts boundary checks and other branches by a factor of 16 and also helps LLVM optimize the code further, potentially generating SIMD instructions on its own.
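To make the fallback idea concrete, here is a minimal sketch of a byte search that handles 16 bytes per loop iteration on stable Rust. It is an illustration only and is not taken from tl's source: the function name `find_unrolled` and the constant `LANES` are hypothetical, and tl's real helpers may be structured differently.

```rust
/// Hypothetical helper (not part of tl's API): returns the index of the first
/// occurrence of `needle` in `haystack`, scanning 16 bytes per iteration.
pub fn find_unrolled(haystack: &[u8], needle: u8) -> Option<usize> {
    const LANES: usize = 16;

    let mut chunks = haystack.chunks_exact(LANES);
    let mut offset = 0;

    for chunk in &mut chunks {
        // Every chunk is exactly LANES bytes long, so this inner loop has a
        // fixed trip count and no early exit. The compiler can unroll it and
        // may turn it into vector comparisons.
        let mut hit = false;
        for &byte in chunk {
            hit |= byte == needle;
        }

        // One branch per 16 bytes instead of one branch per byte.
        if hit {
            return chunk
                .iter()
                .position(|&byte| byte == needle)
                .map(|index| offset + index);
        }
        offset += LANES;
    }

    // Fewer than LANES bytes are left over; scan them one at a time.
    chunks
        .remainder()
        .iter()
        .position(|&byte| byte == needle)
        .map(|index| offset + index)
}
```

Calling `find_unrolled(b"hello <world>", b'<')` yields `Some(6)`. Whether the inner loop actually compiles to SIMD instructions depends on the target and optimization level, so treat the sketch as a description of the technique rather than a performance guarantee.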

Benchmarks

Results for parsing a ~320KB HTML document. Benchmarked using criterion on codespaces hardware.

| parser | time | throughput |
|---|---|---|
| tl + simd | 628.23 us | 497.87 MiB/s |
| htmlstream | 2.2786 ms | 137.48 MiB/s |
| rusthtml | 3.3881 ms | 92.317 MiB/s |
| html5ever | 5.7900 ms | 54.021 MiB/s |
| rphtml | 6.0154 ms | 51.997 MiB/s |
| htmlparser | 17.764 ms | 17.608 MiB/s |
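The time and throughput figures are two views of the same measurement: throughput is the input size divided by the elapsed time. The sketch below shows that arithmetic, assuming 1 MiB = 1024 × 1024 bytes. The function names are hypothetical and are not part of tl or criterion.

```rust
const BYTES_PER_MIB: f64 = 1024.0 * 1024.0;

/// Throughput in MiB/s for `bytes` of input processed in `seconds`.
pub fn throughput_mib_per_s(bytes: f64, seconds: f64) -> f64 {
    bytes / BYTES_PER_MIB / seconds
}

/// Input size in KiB implied by a reported time and throughput.
pub fn implied_size_kib(seconds: f64, mib_per_s: f64) -> f64 {
    mib_per_s * BYTES_PER_MIB * seconds / 1024.0
}
```

Plugging in the tl + simd figures (628.23 us at 497.87 MiB/s) gives an implied input size of roughly 320 KiB, which matches the stated document size.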

Source