html5gum


html5gum is a WHATWG-compliant HTML tokenizer.

```rust
use std::fmt::Write;
use html5gum::{Tokenizer, Token};

let html = "<title>hello world</title>";
let mut new_html = String::new();

// `infallible()` unwraps the per-token results, which is fine for in-memory input.
for token in Tokenizer::new(html).infallible() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", String::from_utf8_lossy(&tag.name)).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", String::from_utf8_lossy(&hello_world)).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", String::from_utf8_lossy(&tag.name)).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
```

What a tokenizer does and what it does not do

html5gum fully implements section 13.2.5 of the WHATWG HTML specification, i.e. it is able to tokenize HTML documents, and it passes html5lib's tokenizer test suite. Since it is just a tokenizer, this means:

With those caveats in mind, html5gum can pretty much ~parse~ tokenize anything that browsers can.

The Emitter trait

A distinguishing feature of html5gum is that you can bring your own token data structure and hook into token creation by implementing the Emitter trait. This allows you to:

See the custom_emitter example for what this looks like in practice.
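To make the idea of hooking into token creation more concrete, here is a minimal, self-contained sketch of the pattern using only the standard library. Everything in it (`ToyEmitter`, `TagCounter`, `toy_tokenize`) is hypothetical and invented for illustration. It is not html5gum's actual Emitter trait, which has a different and larger set of methods, and the toy tokenizer ignores nearly all of the HTML spec. The point is only that the tokenizer calls into a caller-supplied type, so the caller decides what (if anything) gets stored per token.

```rust
// Hypothetical stand-in for the emitter idea; NOT html5gum's real API.

/// Callbacks the toy tokenizer invokes as it recognizes pieces of input.
trait ToyEmitter {
    /// Called once per start tag with its name, e.g. `title` for `<title>`.
    fn start_tag(&mut self, name: &str);
    /// Called for each run of text between tags.
    fn text(&mut self, text: &str);
}

/// A custom "token data structure" that only keeps counters and never
/// allocates storage for individual tokens.
#[derive(Default)]
struct TagCounter {
    start_tags: usize,
    text_bytes: usize,
}

impl ToyEmitter for TagCounter {
    fn start_tag(&mut self, _name: &str) {
        self.start_tags += 1;
    }

    fn text(&mut self, text: &str) {
        self.text_bytes += text.len();
    }
}

/// A deliberately naive tokenizer: it splits on `<` and `>` and knows nothing
/// about attributes, comments, entities, or any other rule of the real spec.
fn toy_tokenize<E: ToyEmitter>(input: &str, emitter: &mut E) {
    let mut rest = input;
    while !rest.is_empty() {
        match rest.find('<') {
            // Input starts with a tag: read up to the closing `>`.
            Some(0) => {
                let end = match rest.find('>') {
                    Some(i) => i,
                    None => {
                        // Unterminated tag: treat the remainder as text.
                        emitter.text(rest);
                        return;
                    }
                };
                let inner = &rest[1..end];
                // End tags (`</...>`) are skipped in this sketch.
                if !inner.starts_with('/') {
                    let name = inner.split_whitespace().next().unwrap_or("");
                    emitter.start_tag(name);
                }
                rest = &rest[end + 1..];
            }
            // Text up to the next tag.
            Some(i) => {
                emitter.text(&rest[..i]);
                rest = &rest[i..];
            }
            // No more tags: the remainder is text.
            None => {
                emitter.text(rest);
                rest = "";
            }
        }
    }
}
```

Calling `toy_tokenize("<title>hello world</title>", &mut counter)` on a default `TagCounter` leaves `start_tags` at 1 and `text_bytes` at 11. Because the emitter is a generic parameter, swapping in a different implementation changes what is collected without touching the tokenizing loop.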

Other features

Alternative HTML parsers

html5gum was created out of a need to parse HTML tag soup efficiently. Previous options were to:

Etymology

Why is this library called html5gum?

License

Licensed under the MIT license, see ./LICENSE.