html5gum


html5gum is a WHATWG-compliant HTML tokenizer.

```rust
use std::fmt::Write;
use html5gum::{Tokenizer, Token};

// Input with stray whitespace inside the start tag.
let html = "<title   >hello world</title>";
let mut new_html = String::new();

// Re-serialize each token; the tokenizer normalizes the tag name.
for token in Tokenizer::new(html) {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", hello_world).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
```

It fully implements 13.2 of the WHATWG HTML spec and passes html5lib's tokenizer test suite, except that:

A distinguishing feature of html5gum is that you can bring your own token data structure and hook into token creation by implementing the `Emitter` trait. This allows you to:
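
The emitter idea can be illustrated without html5gum itself. The sketch below is not html5gum's API: `SketchEmitter`, `scan`, and `LinkCounter` are hypothetical names, the trait is a heavily simplified stand-in for the real `Emitter` trait, and `scan` is a toy splitter rather than a spec-compliant tokenizer. It only shows the shape of the pattern: the scanner calls hooks, and the emitter decides what, if anything, to store.

```rust
/// Hypothetical, simplified emitter interface (NOT html5gum's `Emitter`).
/// The scanner reports events through these hooks instead of building tokens.
trait SketchEmitter {
    type Output;
    fn start_tag(&mut self, name: &str);
    fn end_tag(&mut self, name: &str);
    fn text(&mut self, text: &str);
    fn finish(self) -> Self::Output;
}

/// Toy scanner for illustration only: splits on `<` and `>` and ignores
/// comments, entities, attributes, and every other rule of the HTML spec.
fn scan<E: SketchEmitter>(input: &str, mut emitter: E) -> E::Output {
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            match after_lt.find('>') {
                Some(end) => {
                    let inner = after_lt[..end].trim();
                    match inner.strip_prefix('/') {
                        Some(name) => emitter.end_tag(name.trim()),
                        // Keep only the tag name, dropping any attributes.
                        None => emitter.start_tag(inner.split_whitespace().next().unwrap_or("")),
                    }
                    rest = &after_lt[end + 1..];
                }
                None => {
                    // Unterminated tag: hand the remainder over as text.
                    emitter.text(rest);
                    rest = "";
                }
            }
        } else {
            let end = rest.find('<').unwrap_or(rest.len());
            emitter.text(&rest[..end]);
            rest = &rest[end..];
        }
    }
    emitter.finish()
}

/// A custom "token data structure" that is just a counter: it never
/// allocates a token, it only counts `<a>` start tags.
#[derive(Default)]
struct LinkCounter {
    links: usize,
}

impl SketchEmitter for LinkCounter {
    type Output = usize;
    fn start_tag(&mut self, name: &str) {
        if name.eq_ignore_ascii_case("a") {
            self.links += 1;
        }
    }
    fn end_tag(&mut self, _name: &str) {}
    fn text(&mut self, _text: &str) {}
    fn finish(self) -> usize {
        self.links
    }
}
```

With these definitions, `scan("<a href=x>link</a>", LinkCounter::default())` returns the number of `<a>` start tags seen, without ever materializing a token value. The real `Emitter` trait has more hooks than this sketch, so consult the crate documentation for its actual methods.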

html5gum was created out of a need to parse HTML tag soup efficiently. Previous options were to:

Etymology

Why is this library called html5gum?

License

Licensed under the MIT license, see ./LICENSE.