`html5tokenizer` is a WHATWG-compliant HTML tokenizer (forked from html5gum with added code span support).
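
```rust
use std::fmt::Write;
use html5tokenizer::{Tokenizer, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

// Re-serialize every token; anything other than a start tag, text or an
// end tag is unexpected for this input.
for token in Tokenizer::new(html).infallible() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", hello_world).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");
```
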
`html5tokenizer` fully implements section 13.2.5 of the WHATWG HTML spec, i.e. it is able to tokenize HTML documents and passes html5lib's tokenizer test suite. Since it is just a tokenizer, this means:

* `html5tokenizer` does not implement charset detection. This implementation requires all input to be Rust strings and therefore valid UTF-8.
* `html5tokenizer` does not correct mis-nested tags.
* `html5tokenizer` does not recognize implicitly self-closing elements like `<img>`. As a tokenizer, it simply emits a start tag. It does, however, emit a self-closing tag for `<img .. />`.
* `html5tokenizer` does not generally qualify as a browser-grade HTML parser as per the WHATWG spec. This can change in the future.

With those caveats in mind, `html5tokenizer` can pretty much ~parse~ tokenize anything that browsers can.
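
Because charset detection is out of scope, bytes read from a file or socket have to become a Rust string before they reach the tokenizer. The following is a minimal sketch using only the standard library. The helper names `decode_strict` and `decode_lossy` are hypothetical and not part of `html5tokenizer`.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Hypothetical helper: accept the input only if it is already valid UTF-8.
/// No bytes are copied; the returned `&str` borrows from `bytes`.
fn decode_strict(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Hypothetical helper: replace every invalid byte sequence with U+FFFD.
/// Borrows when the input is already valid UTF-8 and allocates otherwise.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

Either result yields a string slice that can be tokenized as in the example at the top. A document in an encoding other than UTF-8 would need to be transcoded first, which neither helper attempts.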

## `Emitter` trait

A distinguishing feature of `html5tokenizer` is that you can bring your own token data structure and hook into token creation by implementing the `Emitter` trait. This allows you to:

* Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.
* Efficiently filter out uninteresting categories of data without ever allocating for them. For example, if the plaintext between tags is of no interest to you, you can implement the respective trait methods as no-ops and thereby avoid the overhead of creating plaintext tokens.
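
The filtering idea can be illustrated without the crate itself. The sketch below uses a deliberately simplified, hypothetical `MiniEmitter` trait and a toy `drive` function standing in for the tokenizer. The real `Emitter` trait has a different and larger set of methods, so treat this only as an illustration of how a no-op method means no token is ever built for that category.

```rust
/// Hypothetical, heavily simplified stand-in for an emitter trait.
/// It is NOT the `Emitter` trait of `html5tokenizer`.
trait MiniEmitter {
    fn start_tag(&mut self, name: &str);
    fn end_tag(&mut self, name: &str);
    /// Plaintext between tags. The default implementation is a no-op.
    fn text(&mut self, _text: &str) {}
}

/// Counts start tags and ignores everything else. No `String` is ever
/// allocated because the callbacks only borrow from the input.
#[derive(Default)]
struct TagCounter {
    start_tags: usize,
}

impl MiniEmitter for TagCounter {
    fn start_tag(&mut self, _name: &str) {
        self.start_tags += 1;
    }
    fn end_tag(&mut self, _name: &str) {}
    // `text` keeps its default no-op body.
}

/// Toy driver standing in for a tokenizer: handles only `<name>`, `</name>`
/// and plaintext, with no attributes, comments or error recovery.
fn drive<E: MiniEmitter>(input: &str, emitter: &mut E) {
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            // Everything up to the next '>' is the tag; stop if it is unterminated.
            let end = match after_lt.find('>') {
                Some(index) => index,
                None => break,
            };
            let tag = &after_lt[..end];
            match tag.strip_prefix('/') {
                Some(name) => emitter.end_tag(name),
                None => emitter.start_tag(tag),
            }
            rest = &after_lt[end + 1..];
        } else {
            // Plaintext runs until the next '<' or the end of the input.
            let end = rest.find('<').unwrap_or(rest.len());
            emitter.text(&rest[..end]);
            rest = &rest[end..];
        }
    }
}
```

Calling `drive` with a `TagCounter` walks the input once and only increments a counter. A consumer that does care about text would override `text` and decide there whether to copy the borrowed slice.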

Licensed under the MIT license, see `./LICENSE`.