sgmlish
This is a library for handling SGML. It's not intended to be a full-featured implementation of the SGML spec; rather, it's meant to successfully parse common uses of SGML, and then apply a number of normalization passes, such as inserting implied end tags, to make the result suitable for deserialization.
In particular, DTDs are not supported. That means any desired validation or normalization must be performed either manually or through the built-in transforms.
Parsing HTML. Even though the HTML 4 spec was defined as an SGML DTD, browsers of that era were never close to conformant to all the intricacies of SGML, and websites were built with nearly zero regard for that anyway.
Attempting to use an SGML parser to understand real-world HTML is a losing battle; the [HTML5 spec] was thus built with that in mind, describing how to handle all the ways web pages can be malformed in the best possible manner, based on how old browsers understood it.
If you need to parse HTML, even old HTML, please use something like [html5ever].
Parsing XML. This space is well-served by existing libraries like [xml-rs]. [serde-xml-rs] offers a very similar deserialization experience to this library.
The following SGML features are hard to properly implement without full doctype awareness during parsing, and are therefore beyond the scope of this library:
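
- Null end tags (NET): `<FOO/example/`
- Nested marked sections: `<![INCLUDE[ outer <![IGNORE[ inner ]]> ]]>`
- Character classes set in the SGML declaration, such as `SEPCHAR` or `LCNMSTRT`
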
This is a quick guide on deriving deserialization of data structures with [Serde].
First, add `sgmlish` and `serde` to your dependencies:

```toml
[dependencies]
serde = { version = "1.0", features = ["derive"] }
sgmlish = "0.1"
```

Defining your data structures is similar to using any other Serde library:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct Example {
    name: String,
    version: Option<String>,
}
```

Usage deviates a bit from other deserializers. The process is usually split into three phases:

```rust
let input = r##"
    <CRATE>
        <NAME>sgmlish</NAME>
        <VERSION>0.1</VERSION>
    </CRATE>
"##;
let sgml =
    // Phase 1: tokenization
    sgmlish::parse(input)?
    // Phase 2: normalization
    .trim_spaces()
    .lowercase_identifiers();
// Phase 3: deserialization
let example = sgmlish::from_fragment::<Example>(sgml)?;
```

Tokenization: `sgmlish::parse()` is invoked on an input string, producing a fragment, which is a series of events.
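
To make "a series of events" concrete, here is a minimal tokenizer sketch that uses only the standard library. The `Event` enum and the `tokenize` function are hypothetical names chosen for illustration; they are not the types sgmlish exposes, and the sketch deliberately ignores attributes, comments, and marked sections.

```rust
/// Hypothetical event type for illustration; not the library's own.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Open(String),
    Close(String),
    Text(String),
}

/// Splits `input` into start tags, end tags, and text runs.
/// Assumption: every `<` starts a tag, and tags carry no attributes.
fn tokenize(input: &str) -> Result<Vec<Event>, String> {
    let mut events = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            // Everything up to the next `>` is the tag body.
            let end = after_lt
                .find('>')
                .ok_or_else(|| String::from("unterminated tag"))?;
            let tag = &after_lt[..end];
            events.push(match tag.strip_prefix('/') {
                Some(name) => Event::Close(name.to_string()),
                None => Event::Open(tag.to_string()),
            });
            rest = &after_lt[end + 1..];
        } else {
            // Everything up to the next `<` (or end of input) is text.
            let end = rest.find('<').unwrap_or(rest.len());
            events.push(Event::Text(rest[..end].to_string()));
            rest = &rest[end..];
        }
    }
    Ok(events)
}
```

In a sketch like this, whitespace between tags shows up as its own text event, which is one reason a normalization phase follows tokenization.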
Normalization: because SGML is so flexible, you'll almost certainly want to apply a few normalization passes to the data before deserializing.

Some passes of interest:

- `trim_spaces`: removes whitespace surrounding tags.
- `lowercase_identifiers`: most SGML is case-insensitive; this will normalize all tag and attribute names to lowercase.
- `normalize_end_tags`: inserts omitted end tags, assuming they are omitted only when the element cannot contain child elements. This algorithm is good enough for many SGML applications, like [OFX].
- `expand_entities`: allows you to support `&entities;` in text content. No entities are supported by default, only character references (such as `&#32;`).
- `expand_marked_sections`: processes marked sections, like `<![IGNORE[x]]>`. Only simple `CDATA` and `RCDATA` sections are processed by default.

A very important rule: before proceeding with deserialization, all start tags must have a matching end tag with identical case, in a consistent hierarchy.
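
The end-tag assumption and the matching-tags rule above can be sketched with the standard library alone. This is not the library's implementation: `Event`, `insert_end_tags`, and `is_balanced` are hypothetical names, and the two-pass approach is just one way to realize the stated assumption that an element with an omitted end tag holds only text.

```rust
/// Hypothetical event type for illustration; not the library's own.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Open(String),
    Close(String),
    Text(String),
}

/// Inserts end tags for start tags that have none, assuming such
/// elements cannot contain child elements (only text).
fn insert_end_tags(events: &[Event]) -> Vec<Event> {
    // Pass 1: mark every start tag that has an explicit end tag.
    let mut has_end = vec![false; events.len()];
    let mut stack: Vec<usize> = Vec::new(); // indices of open start tags
    for (i, event) in events.iter().enumerate() {
        match event {
            Event::Open(_) => stack.push(i),
            Event::Close(name) => {
                // Match against the nearest open start tag of the same name.
                let found = stack
                    .iter()
                    .rposition(|&j| matches!(&events[j], Event::Open(n) if n == name));
                if let Some(pos) = found {
                    has_end[stack[pos]] = true;
                    stack.truncate(pos);
                }
            }
            Event::Text(_) => {}
        }
    }

    // Pass 2: close each unterminated element right before the next tag.
    let mut out = Vec::with_capacity(events.len());
    let mut pending: Option<String> = None;
    for (i, event) in events.iter().enumerate() {
        if !matches!(event, Event::Text(_)) {
            if let Some(name) = pending.take() {
                out.push(Event::Close(name));
            }
        }
        out.push(event.clone());
        if let Event::Open(name) = event {
            if !has_end[i] {
                pending = Some(name.clone());
            }
        }
    }
    if let Some(name) = pending {
        out.push(Event::Close(name));
    }
    out
}

/// Checks the rule above: every start tag has a matching end tag
/// with identical case, in a consistent hierarchy.
fn is_balanced(events: &[Event]) -> bool {
    let mut stack: Vec<&str> = Vec::new();
    for event in events {
        match event {
            Event::Open(name) => stack.push(name.as_str()),
            Event::Close(name) => {
                if stack.pop() != Some(name.as_str()) {
                    return false;
                }
            }
            Event::Text(_) => {}
        }
    }
    stack.is_empty()
}
```

For an event stream equivalent to `<A><B>x<C>y</A>`, this sketch yields the equivalent of `<A><B>x</B><C>y</C></A>`, which then passes `is_balanced`.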
Deserialization: once the event stream is normalized, pass on to Serde and let it do its magic.
Primitives and strings: values can be either an attribute directly on the container element, or a child element with text content.
The following are equivalent to the deserializer:

```xml
<example foo="bar"></example>
<example><foo>bar</foo></example>
```

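
One way to picture this equivalence is a lookup that consults attributes and child elements alike. The sketch below uses a hypothetical, simplified `Element` model and a hypothetical `field_value` helper; neither is part of sgmlish, and the order in which attributes and children are consulted is an assumption.

```rust
/// Hypothetical, simplified element model for illustration only.
struct Element {
    name: String,
    attributes: Vec<(String, String)>,
    children: Vec<Element>,
    text: String,
}

/// Resolves a field by name: attributes first (an assumption),
/// then the text content of a child element with that name.
fn field_value<'a>(element: &'a Element, field: &str) -> Option<&'a str> {
    element
        .attributes
        .iter()
        .find(|(key, _)| key == field)
        .map(|(_, value)| value.as_str())
        .or_else(|| {
            element
                .children
                .iter()
                .find(|child| child.name == field)
                .map(|child| child.text.as_str())
        })
}
```

With this model, both forms shown above resolve `foo` to `"bar"`.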
Booleans: the strings `true`, `false`, `1` and `0` are accepted, both as attribute values and as text content. In the case of attributes, HTML-style flags are also accepted: an empty value (explicit or implicit) and a value equal to the attribute name (case insensitive) are treated as `true`.

The following all set `checked` to `true`:

```xml
<example checked></example>
<example checked=""></example>
<example checked="1"></example>
<example checked="checked"></example>
<example checked="true"></example>
<example><checked>true</checked></example>
```

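
The boolean rules above can be written down as two small functions. This is a sketch of the rules as described, not the library's code: `attribute_as_bool` and `text_as_bool` are hypothetical names, and treating `true`, `false`, `1` and `0` as exact, case-sensitive matches is an assumption.

```rust
/// Interprets an attribute as a boolean.
/// `value` is `None` when the attribute has no explicit value,
/// as in `<example checked>`.
fn attribute_as_bool(name: &str, value: Option<&str>) -> Option<bool> {
    match value {
        // Empty value, implicit or explicit: HTML-style flag.
        None | Some("") => Some(true),
        Some("true") | Some("1") => Some(true),
        Some("false") | Some("0") => Some(false),
        // Value equal to the attribute name, ignoring ASCII case.
        Some(v) if v.eq_ignore_ascii_case(name) => Some(true),
        Some(_) => None,
    }
}

/// Interprets element text content as a boolean.
fn text_as_bool(text: &str) -> Option<bool> {
    match text {
        "true" | "1" => Some(true),
        "false" | "0" => Some(false),
        _ => None,
    }
}
```

For example, `attribute_as_bool("checked", Some("CHECKED"))` returns `Some(true)`, while an unrecognized value such as `"yes"` returns `None`.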
Structs: the tag name comes from the parent struct's field, not from the value type!

```rust
struct Root {
    // Expects a …
}
```

If you want to capture the text content of an element, you can make use of the special name `$value`:

```rust
#[derive(Deserialize)]
struct Example {
    foo: String,
    #[serde(rename = "$value")]
    content: String,
}
```

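
As a rough mental model, `$value` can be thought of as a reserved field name that points at the element's own text instead of at an attribute. The sketch below is an assumption-laden illustration, with a hypothetical `Element` type and `field_value` helper that are not part of sgmlish.

```rust
/// Hypothetical, simplified element model for illustration only.
struct Element {
    attributes: Vec<(String, String)>,
    text: String,
}

/// Resolves a field by name; the reserved name `$value` maps to
/// the element's own text content.
fn field_value<'a>(element: &'a Element, field: &str) -> Option<&'a str> {
    if field == "$value" {
        Some(element.text.as_str())
    } else {
        element
            .attributes
            .iter()
            .find(|(key, _)| key == field)
            .map(|(_, value)| value.as_str())
    }
}
```

Under this model, an element like `<example foo="bar">hello</example>` would give `foo` the value `"bar"` and `$value` the value `"hello"`.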
Sequences: sequences are read from a contiguous series of elements with the same name. Similarly to structs, the tag name comes from the parent struct's field.

```rust
#[derive(Deserialize)]
struct Example {
    // Expects a series of <hosts> elements
    hosts: Vec<String>,
}
```

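
The "contiguous series" rule can be illustrated with a small helper over `(tag name, text)` pairs. The `contiguous_run` function is a hypothetical name, not a library API. It simply stops at the end of the first run; how the library treats a later, non-adjacent element of the same name is not covered here.

```rust
/// Collects the text of the first contiguous run of children named `name`.
/// Each child is modeled as a `(tag name, text content)` pair.
fn contiguous_run<'a>(children: &[(&'a str, &'a str)], name: &str) -> Vec<&'a str> {
    children
        .iter()
        // Skip everything before the first matching element.
        .skip_while(|(child_name, _)| *child_name != name)
        // Keep going only while the name stays the same.
        .take_while(|(child_name, _)| *child_name == name)
        .map(|(_, text)| *text)
        .collect()
}
```

Given children named `name`, `hosts`, `hosts`, `port`, `hosts`, the helper returns only the two adjacent `hosts` values.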

The `deserialize` crate feature includes support for [Serde] deserialization. Since this is the main use case for this library, this feature is enabled by default.