html5tokenizer


Spec-compliant HTML parsing requires both tokenization and tree construction. While this crate implements a spec-compliant HTML tokenizer, it does not implement any tree construction. Instead it provides a NaiveParser that can be used as follows:

```
use std::fmt::Write;
use html5tokenizer::{NaiveParser, Token};

let html = "<title>hello world</title>";
let mut new_html = String::new();

for token in NaiveParser::new(html).flatten() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", hello_world).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
```

Compared to html5gum

html5tokenizer was forked from [html5gum] 0.2.1.

html5gum has since switched its parsing to operate on bytes, which html5tokenizer doesn't yet support. html5tokenizer also does not implement [charset detection]: it requires all input to be Rust strings and therefore valid UTF-8.
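Because the tokenizer only accepts `&str`, byte input (e.g. a raw HTTP response body) has to be decoded before tokenizing. A minimal sketch using only the standard library (the byte literal here is an illustrative assumption, not from this crate's docs):

```rust
fn main() {
    // Raw bytes as they might arrive off the wire; 0xFF is not valid UTF-8.
    let raw: &[u8] = b"<title>caf\xC3\xA9</title> \xFF";

    // html5tokenizer takes string input, so decode first.
    // from_utf8_lossy replaces invalid sequences with U+FFFD.
    let html = String::from_utf8_lossy(raw);
    assert_eq!(html, "<title>caf\u{E9}</title> \u{FFFD}");

    // `&html` can now be handed to NaiveParser::new.
}
```

For strictly-validated input, `std::str::from_utf8` returns an error instead of substituting replacement characters.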

Both crates pass the [html5lib tokenizer test suite].

Both crates have an Emitter trait that lets you bring your own token data structure and hook into token creation, so you can control what gets allocated per token.
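The idea behind such an emitter can be pictured with a deliberately simplified, hypothetical trait (the method and type names below are illustrative only and do not match the crate's actual Emitter trait):

```rust
// Hypothetical sketch: the tokenizer drives callbacks on an emitter,
// and the implementor decides what (if anything) to allocate per token.
trait MiniEmitter {
    type Token;
    fn emit_string(&mut self, s: &str);
    fn pop_token(&mut self) -> Option<Self::Token>;
}

// An emitter that only counts characters and never builds token values.
struct CountingEmitter {
    chars: usize,
}

impl MiniEmitter for CountingEmitter {
    type Token = ();
    fn emit_string(&mut self, s: &str) {
        self.chars += s.chars().count();
    }
    fn pop_token(&mut self) -> Option<()> {
        None
    }
}

fn main() {
    let mut emitter = CountingEmitter { chars: 0 };
    emitter.emit_string("hello");
    emitter.emit_string(" world");
    assert_eq!(emitter.chars, 11);
}
```

The real trait has more hooks (tags, attributes, comments, doctypes); see the crate's API documentation for the exact signatures.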

License

Licensed under the MIT license, see [the LICENSE file].