html5tokenizer

docs.rs crates.io

Spec-compliant HTML parsing requires both tokenization and tree-construction. While this crate implements a spec-compliant HTML tokenizer it does not implement any tree-construction. Instead it just provides a NaiveParser that may be used as follows:

```rust use std::fmt::Write; use html5tokenizer::{NaiveParser, Token};

let html = ""; let mut new_html = String::new();</p> <p>for token in NaiveParser::new(html).flatten() { match token { Token::StartTag(tag) => { write!(new<em>html, "<{}>", tag.name).unwrap(); } Token::String(hello</em>world) => { write!(new<em>html, "{}", hello</em>world).unwrap(); } Token::EndTag(tag) => { write!(new_html, "</{}>", tag.name).unwrap(); } _ => panic!("unexpected input"), } }</p> <p>assert<em>eq!(new</em>html, "<title>hello world"); ```

This library can provide source spans. For an example, see [examples/spans.rs], which produces the following output:

output id=spans note: ┌─ file.html:1:2 │ 1 │ <img src=example.jpg alt="some description"> │ ^^^ ^^^ ^^^^^^^^^^^ ^^^ ^^^^^^^^^^^^^^^^ attr value │ │ │ │ │ │ │ │ │ attr name │ │ │ attr value │ │ attr name │ tag name

Limitations

Compliance & testing

The tokenizer passes the [html5lib tokenizer test suite]. The library is not yet fuzz tested.

Credits

html5tokenizer was forked from [html5gum] 0.2.1, which was created by Markus Unterwaditzer who deserves major props for implementing all 80 (!) tokenizer states.

For details please refer to the [changelog].

License

Licensed under the MIT license, see [the LICENSE file].