Spec-compliant HTML parsing requires both tokenization and tree-construction.
While this crate implements a spec-compliant HTML tokenizer, it does not implement any
tree-construction. Instead it just provides a `NaiveParser`
that may be used as follows:
```
use std::fmt::Write;
use html5tokenizer::{NaiveParser, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

for token in NaiveParser::new(html).flatten() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", hello_world).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
```
html5tokenizer was forked from [html5gum] 0.2.1.
html5gum has since switched its parsing to operate on bytes,
which html5tokenizer doesn't yet support.
html5tokenizer does not implement [charset detection]:
it requires all input to be Rust strings and therefore valid UTF-8.
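Since the tokenizer only accepts `&str`, raw byte input has to be decoded before tokenizing. A minimal sketch using only the standard library (`String::from_utf8_lossy` replaces invalid sequences with U+FFFD):

```
fn main() {
    // b"\xC3\xA9" is the UTF-8 encoding of 'é'; the trailing \xFF is invalid.
    let bytes: &[u8] = b"<p>caf\xC3\xA9</p>\xFF";
    // Invalid bytes become U+FFFD, the replacement character.
    let html = String::from_utf8_lossy(bytes);
    assert_eq!(html, "<p>café</p>\u{FFFD}");
}
```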
Both crates pass the [html5lib tokenizer test suite].
Both crates let you bring your own token data structure
and hook into token creation by implementing the `Emitter` trait.
This allows you to:
* Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.
* Efficiently filter out uninteresting categories of data without ever allocating for them. For example, if any plaintext between tokens is not of interest to you, you can implement the respective trait methods as no-ops and therefore avoid the overhead of creating plaintext tokens.
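The plaintext-filtering idea above can be sketched without the crate at all. The trait and type names below are hypothetical stand-ins, not html5tokenizer's actual `Emitter` API:

```
// A minimal sketch of the emitter pattern. `TextSink`, `Collecting`,
// and `Ignoring` are illustrative names, NOT html5tokenizer's real API.
trait TextSink {
    // Called once for each run of plaintext between tags.
    fn emit_text(&mut self, text: &str);
}

// A sink that keeps the text, paying for the allocation.
struct Collecting(String);

impl TextSink for Collecting {
    fn emit_text(&mut self, text: &str) {
        self.0.push_str(text);
    }
}

// A sink that discards text: the no-op body means no buffer is ever
// grown for plaintext the caller does not care about.
struct Ignoring;

impl TextSink for Ignoring {
    fn emit_text(&mut self, _text: &str) {}
}

fn main() {
    let mut keep = Collecting(String::new());
    keep.emit_text("hello ");
    keep.emit_text("world");
    assert_eq!(keep.0, "hello world");

    let mut skip = Ignoring;
    skip.emit_text("hello "); // discarded without any allocation
}
```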
Licensed under the MIT license, see [the LICENSE file].