Spec-compliant HTML parsing requires both tokenization and tree construction.
While this crate implements a spec-compliant HTML tokenizer, it does not implement any
tree construction. Instead, it just provides a `NaiveParser`
that may be used as follows:
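```rust
use std::fmt::Write;
use html5tokenizer::{NaiveParser, Token};

// Sample input; note the extra whitespace inside the start tag.
let html = "<title   >hello world</title>";
let mut new_html = String::new();

for token in NaiveParser::new(html).flatten() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::Char(c) => {
            write!(new_html, "{c}").unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        Token::EndOfFile => {},
        _ => panic!("unexpected input"),
    }
}

// Only the tag name is written back, so the extra whitespace is gone.
assert_eq!(new_html, "<title>hello world</title>");
```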
This library can provide source spans. For an example, see
[examples/spans.rs], which produces the following output:
```output id=spans
note:
  ┌─ file.html:1:2
  │
1 │ <img src=example.jpg alt="some description">
  │  ^^^ ^^^ ^^^^^^^^^^^ ^^^  ^^^^^^^^^^^^^^^^ attr value
  │  │   │   │           │
  │  │   │   │           attr name
  │  │   │   attr value
  │  │   attr name
  │  tag name
```
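
To make the idea of a source span concrete: a span can be thought of as a byte range into the input. The following is a minimal, self-contained sketch that does not use this crate's span API. The `underline` helper is hypothetical, and the byte ranges are written by hand to match the tag name and the first attribute in the output above.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of html5tokenizer): renders a line of
/// carets underneath the given byte ranges of a single-line ASCII source.
fn underline(source: &str, spans: &[Range<usize>]) -> String {
    let mut line = vec![b' '; source.len()];
    for span in spans {
        for byte in &mut line[span.clone()] {
            *byte = b'^';
        }
    }
    // Only ASCII spaces and carets were written, so this is valid UTF-8.
    String::from_utf8(line).unwrap().trim_end().to_string()
}

fn main() {
    let html = r#"<img src=example.jpg alt="some description">"#;

    // Hand-written byte ranges for illustration; a span-aware tokenizer
    // would report such ranges itself.
    let tag_name = 1..4;
    let attr_name = 5..8;
    let attr_value = 9..20;

    // A span is only useful if it slices back to the text it describes.
    assert_eq!(&html[tag_name.clone()], "img");
    assert_eq!(&html[attr_name.clone()], "src");
    assert_eq!(&html[attr_value.clone()], "example.jpg");

    println!("{html}");
    println!("{}", underline(html, &[tag_name, attr_name, attr_value]));
}
```

Byte ranges keep the sketch simple because they can index a `&str` directly. Whether the crate represents spans this way is not stated here, so treat the representation as an assumption.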
This crate does not yet implement tree construction
(which is necessary for spec-compliant HTML parsing).
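
To illustrate what a flat token stream lacks compared to tree construction, here is a minimal, self-contained sketch. The `Token` enum is a simplified stand-in defined only for this example (it is not this crate's `Token` type), and `max_depth` is a hypothetical helper. It tracks nesting with a plain stack, which is not the spec's tree-construction algorithm: it knows nothing about void elements, implied end tags, or error recovery.

```rust
/// Simplified stand-in for a tokenizer's output; not html5tokenizer's `Token`.
#[derive(Debug)]
enum Token {
    StartTag(String),
    EndTag(String),
    Char(char),
}

/// Naively pushes start tags and pops on end tags.
/// Returns the maximum nesting depth, or an error describing the first
/// mismatched end tag or a tag left open at the end of the input.
fn max_depth(tokens: &[Token]) -> Result<usize, String> {
    let mut open: Vec<&str> = Vec::new();
    let mut deepest = 0;
    for token in tokens {
        match token {
            Token::StartTag(name) => {
                open.push(name.as_str());
                deepest = deepest.max(open.len());
            }
            Token::EndTag(name) => match open.pop() {
                Some(top) if top == name.as_str() => {}
                _ => return Err(format!("unexpected </{name}>")),
            },
            Token::Char(_) => {}
        }
    }
    match open.last() {
        Some(name) => Err(format!("unclosed <{name}>")),
        None => Ok(deepest),
    }
}

fn main() {
    let tokens = vec![
        Token::StartTag("p".into()),
        Token::StartTag("b".into()),
        Token::Char('x'),
        Token::EndTag("b".into()),
        Token::EndTag("p".into()),
    ];
    assert_eq!(max_depth(&tokens), Ok(2));

    // In `<p>one<p>two` the spec's tree construction closes the first
    // paragraph implicitly. The naive stack cannot know that and reports
    // an error instead.
    let implied = vec![Token::StartTag("p".into()), Token::StartTag("p".into())];
    assert!(max_depth(&implied).is_err());
}
```

The second case is the interesting one: the token stream itself is fine, and only the rules of tree construction can say how those tokens nest.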
This crate does not yet implement [character encoding detection].
The tokenizer passes the [html5lib tokenizer test suite]. The library is not yet fuzz tested.
html5tokenizer was forked from [html5gum] 0.2.1, which was created by Markus Unterwaditzer, who deserves major props for implementing all 80 (!) tokenizer states.
For details on what has changed since the fork, please refer to the [changelog].
Licensed under the MIT license; see [the LICENSE file].