A recursive Rust parser for the Dx format.
Dx is a simple, concise, human-readable and writable textual format for configuration, serialization and hand-coding of structures and markup. It is a meta-format, like XML, which means that only the syntax is defined, and the semantic validity of a Dx document is determined by an external source, such as a schema or a program. Dx syntax natively supports many structures common to programming languages and other formats.
```
title: Aluminium;
type: chemical-element;
tags: [metal; reactive; common];
key: aluminium;
element-symbol: Al;
atomic-number: 13;
references: {
    wikipedia: "https://en.wikipedia.org/wiki/Aluminium";
    snl: "https://snl.no/aluminium";
};
content: {
    (h1) (title!) # (title!) is a macro that substitutes in the article title.
    (span class:article-name) {Aluminium} is a (@ chemical-element) {chemical element} # The (@) macro denotes an article link.
    with (@ element-symbol) {symbol} (span class:chemical-symbol-text) {Al}
    and (@ atomic-number) {atomic number} (span class:atomic-number) {13}. (\) # (\) indicates a line break.
    In pure form, it is a highly reactive metal. (\)
    It constitutes 8.2% of the earth's crust.
};
```
More example documents can be found in the examples directory.
A Dx document consists of expressions and arguments. The root node may either be an open expression, an open sequence or an open dictionary. Open means that it is not enclosed in brackets.
An expression is a sequence of arguments delimited by whitespace. Ex: `arg1 arg2 arg3 ...`.
An argument is an element of an expression. There are 6 argument variants.
A symbol is a sequence of characters. Symbols are delimited by whitespace. Ex: `Symbol`, `These are four symbols`.
Chevrons/angular brackets `⟨`, `⟩` (not the less-than `<` and greater-than `>` signs) can be used to insert an escaped symbol. Any character within the chevrons, even a reserved character, is part of the symbol. Ex: `⟨This: is a symbol⟩`.
A quote is a string enclosed in quotes `"`. Ex: `"This is a quote argument"`. Any character within the quotes, even a reserved character, is part of the quote.
A grouping is an expression enclosed in curly brackets `{`, `}`. Ex: `{ expr }`. `{}` is not a grouping, but an empty dictionary. An empty grouping is inserted as the single standalone character `_`.
A sequence is a sequence of expressions delimited by semicolons `;` enclosed in square brackets `[`, `]`. A trailing semicolon is allowed. Ex: `[expr1; expr2; expr3; ...]`. `[]` is an empty sequence.
A dictionary is a sequence of key-value entries delimited by semicolons `;` enclosed in curly brackets `{`, `}`. A key and its value are separated by a colon `:`. A key is a string given by either a symbol or a quote argument. A value is an expression. A trailing semicolon is allowed. Ex: `{k1: v1; "k2": v2; k3: v3; ...}`. `{}` is an empty dictionary.
A function is a function expression enclosed in parentheses `(`, `)`. Ex: `( fexpr )`.
A function expression is a sequence of positional arguments, options and flags delimited by whitespace. Positional arguments are regular arguments. Options are key-value pairs, where the key is a string given as a symbol or a quote, the value is a single argument, and the key and value are separated by a colon `:`. A flag is a string given as a symbol or a quote terminated by a semicolon `;`.
Ex: `arg1 f1; k1:v1 "f2"; arg2 "k2":v2` is a function expression with 2 positional arguments, 2 options and 2 flags.
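The six argument variants described above map naturally onto a recursive data type. The following is only a sketch of how such a model could look. The names `Argument`, `Expression`, `FunctionExpression` and `example_function` are hypothetical and are not taken from this crate's API.

```rust
// Hypothetical data model for the six argument variants; not this crate's API.
#[derive(Debug, Clone, PartialEq)]
enum Argument {
    Symbol(String),
    Quote(String),
    Grouping(Expression),
    Sequence(Vec<Expression>),
    Dictionary(Vec<(String, Expression)>),
    Function(FunctionExpression),
}

// An expression is a sequence of arguments.
#[derive(Debug, Clone, PartialEq, Default)]
struct Expression(Vec<Argument>);

// A function expression holds positional arguments, options and flags.
#[derive(Debug, Clone, PartialEq, Default)]
struct FunctionExpression {
    positional: Vec<Argument>,
    options: Vec<(String, Argument)>,
    flags: Vec<String>,
}

// Builds the model of the function expression `arg1 f1; k1:v1 "f2"; arg2 "k2":v2`.
fn example_function() -> FunctionExpression {
    FunctionExpression {
        positional: vec![
            Argument::Symbol("arg1".to_string()),
            Argument::Symbol("arg2".to_string()),
        ],
        options: vec![
            ("k1".to_string(), Argument::Symbol("v1".to_string())),
            ("k2".to_string(), Argument::Symbol("v2".to_string())),
        ],
        flags: vec!["f1".to_string(), "f2".to_string()],
    }
}
```

In this sketch, keys and flags are stored as plain strings, since the syntax allows them to be written either as a symbol or as a quote.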
Backslash `\` is the escape character, and will insert the next character into the current argument no matter if it is reserved or not.
Brackets `(`, `)`, `[`, `]`, `{`, `}`, `⟨`, `⟩`, quotes `"`, colons `:` and semicolons `;` are reserved characters. They cannot be used in symbols unless they are escaped.
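As an illustration of the escaping rule, a writer could backslash-escape reserved characters before emitting a symbol. This is a minimal sketch. The `RESERVED` constant and the `escape_symbol` function are hypothetical names, and escaping the backslash itself is an assumption. Whitespace and the standalone underscore are not handled here.

```rust
// The reserved characters listed above.
const RESERVED: [char; 11] = ['(', ')', '[', ']', '{', '}', '⟨', '⟩', '"', ':', ';'];

// Hypothetical helper: prefixes every reserved character with a backslash.
// Assumption: a literal backslash is escaped as well.
fn escape_symbol(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        if RESERVED.contains(&c) || c == '\\' {
            out.push('\\');
        }
        out.push(c);
    }
    out
}
```

For example, under these assumptions `escape_symbol("a:b")` yields the text `a\:b`, which reads back as the single symbol `a:b`.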
All whitespace is equivalent to a single space character, unless it is escaped. All non-symbol arguments and the beginnings and ends of expressions are considered to have implicit whitespace surrounding them. Ex: `symbol{grouping}` is equivalent to `symbol { grouping }`.
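The whitespace rules can be illustrated with a toy tokenizer. This is a simplified sketch and not the crate's lexer: the `tokenize` function is a hypothetical name, and quotes, chevrons, escapes and comments are deliberately ignored.

```rust
// Toy tokenizer: runs of whitespace collapse into token boundaries, and
// brackets, colons and semicolons become standalone tokens.
// Simplification: quotes, chevrons, escapes and comments are not handled.
fn tokenize(input: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in input.chars() {
        if c.is_whitespace() {
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
        } else if "()[]{}:;".contains(c) {
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            tokens.push(c.to_string());
        } else {
            current.push(c);
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

With this sketch, `symbol{grouping}` and `symbol { grouping }` produce the same token list, as do inputs that differ only in the amount or kind of whitespace.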
A number sign `#` opens a comment which extends to the next newline. `#` must be followed by either whitespace `# ...` or another number sign `##...`, and it must follow whitespace `... #`, otherwise it will be treated as part of a symbol. Ex: `# This is a comment` and `#### Part II` start comments, while `#2`, `#0FA60F` and `elements#` do not.
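The comment rule can be checked character by character. The sketch below works on a single line and uses a hypothetical function name, `comment_start`. It assumes that the start and end of the line count as whitespace, and it ignores quotes, chevrons and escapes.

```rust
// Returns the byte index of the first `#` that opens a comment in `line`.
// Assumptions: the start and end of the line count as whitespace; quotes,
// chevrons and escapes are not taken into account.
fn comment_start(line: &str) -> Option<usize> {
    let mut previous: Option<char> = None;
    let mut chars = line.char_indices().peekable();
    while let Some((index, c)) = chars.next() {
        if c == '#' {
            // The number sign must follow whitespace.
            let follows_whitespace = previous.map_or(true, |p| p.is_whitespace());
            // It must be followed by whitespace or another number sign.
            let next = chars.peek().map(|&(_, n)| n);
            let valid_follower = next.map_or(true, |n| n == '#' || n.is_whitespace());
            if follows_whitespace && valid_follower {
                return Some(index);
            }
        }
        previous = Some(c);
    }
    None
}
```

Applied to the examples above, `# This is a comment` and `#### Part II` are detected as comments at index 0, while `#2`, `#0FA60F` and `elements#` are not.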
`{}` indicates an empty dictionary, not an empty grouping. A single standalone underscore `_` is treated as an empty grouping instead. To insert a standalone underscore as a symbol, it must be escaped: `\_`.
Dx defines the syntax of expressions and arguments, but it does not define their semantics or dictate how structures are encoded. The semantics must be defined by a user of the format. This is similar to how XML defines the syntax and requires a schema to define valid tags and values.
Data structures can be encoded in Dx in many arbitrary ways. Thus, a user must define an encoding for each of them. A user must also define whether the document root is an open expression, an open sequence or an open dictionary. This can be done by writing documentation, using a schema, or preferably by implementing serialization and deserialization procedures in a program. Once this is done, one has a format with well-defined syntax and semantics.
Although there are no definite rules about how a structure should be encoded, there are some best practices when it comes to what expressions and arguments represent. Following these practices when defining a structure encoding will make Dx documents more uniform, which makes them more easily understood. Below, the best practices for usage of expressions and arguments are described.
An expression represents an encoded data structure. A program evaluates an expression to produce the structure.
Ex: The `Text` expression `This is text` is evaluated to the string "This is text". The `Point3D` expression `40 -10 9` is evaluated to the struct `Point3D {x: 40, y: -10, z: 9}`.
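Evaluating the `Point3D` expression above could look roughly like this. The sketch reads the expression directly from text by splitting on whitespace instead of going through a parser, and `point3d_from_expression` is a hypothetical helper, not part of this crate.

```rust
#[derive(Debug, PartialEq)]
struct Point3D {
    x: i32,
    y: i32,
    z: i32,
}

// Hypothetical evaluator: expects exactly three integer symbols.
fn point3d_from_expression(expression: &str) -> Option<Point3D> {
    let mut symbols = expression.split_whitespace();
    let x: i32 = symbols.next()?.parse().ok()?;
    let y: i32 = symbols.next()?.parse().ok()?;
    let z: i32 = symbols.next()?.parse().ok()?;
    // Reject expressions with extra arguments.
    if symbols.next().is_some() {
        return None;
    }
    Some(Point3D { x, y, z })
}
```

The evaluator, not the document, decides that the three symbols are integers, which matches the idea that semantics are defined outside the format.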
Data structures vary in complexity. Many simple structures correspond to a single argument, such as strings, numbers, sequences and dictionaries. Some more complex data structures may require several arguments to be properly encoded. In such cases, one must split the data structure into multiple parts, and encode each part using the argument variant that best fits.
An argument on its own can also represent a simple encoded data structure. Such an argument can be part of a greater expression. More complex structures cannot be encoded as a single argument, so if one wishes to make it part of a greater expression, one should encode such a structure as an expression, and then wrap it as a grouping argument.
A symbol is the most general type of argument. Its meaning is highly dependent on the type of expression. It can represent, for example, `Matrix` and `,` in `Matrix [1, 0, 0; 0, 1, 0; 0, 0, 1]` or `Binomial` in `Binomial 20 10%`.

A quote usually represents text.
In markup, a grouping should be used to group text. For example, a grouping can delimit the content of an XML tag or a TeX macro argument.
For structured data, a grouping could be used to insert one structure into another. In this way, nested data structures are encoded.
In general, a grouping can be used to delimit an input argument of an expression.
A sequence trivially represents a collection of multiple values, such as arrays, ordered lists, unordered lists and tuples.
A dictionary trivially represents a collection of named values or mappings.
A function represents something that may
For example:
Options and flags should be used to add metadata to a function. For example, XML tag attributes should be represented by function options.
It should be obvious how to encode most data structures. Here are some examples and suggestions.
Text is primarily given as a sequence of symbol arguments `This is text` or as a quote argument `"This is text."`.
A number is primarily given as a symbol. Ex: `400`, `2.45`, `True` or `50%`.
Trivially, a sequence or a dictionary can be encoded as a sequence argument or a dictionary argument.
A sequence could alternatively be encoded as a varargs expression. Ex: `1 3 5 7`.
Structs can optionally encode their type name in the first argument.
| Variant           | Examples                                                                 |
|-------------------|--------------------------------------------------------------------------|
| Named fields      | `Point3D { x: 10; y: 30; z: 5 }` or `{ x: 10; y: 30; z: 5 }` |
| Positional fields | `Point3D 10 30 5`, `Point3D [10; 30; 5]`, `10 30 5` or `[10; 30; 5]` |
Enums encode their variant in the first argument.
| Variant           | Example                             |
|-------------------|-------------------------------------|
| Named fields      | `Binomial { n: 50; p: 10% }` |
| Positional fields | `Uniform 0 10` or `Uniform [0; 10]` |
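Serializing an enum with these encodings could be sketched as follows. The `Distribution` type and the `to_dx` function are hypothetical and only illustrate how the variant name is written as the first argument. The percentage is stored as an integer purely to keep the example short.

```rust
// Hypothetical enum with one named-field and one positional variant.
enum Distribution {
    Binomial { n: u32, p_percent: u32 },
    Uniform(i64, i64),
}

// Writes the variant name first, followed by the fields.
fn to_dx(distribution: &Distribution) -> String {
    match distribution {
        // Named fields are written as a dictionary argument.
        Distribution::Binomial { n, p_percent } => {
            format!("Binomial {{ n: {}; p: {}% }}", n, p_percent)
        }
        // Positional fields are written as plain arguments.
        Distribution::Uniform(low, high) => format!("Uniform {} {}", low, high),
    }
}
```

A deserializer would do the reverse: read the first symbol to select the variant, then interpret the remaining arguments according to that variant.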
Here it is assumed that the structure can consist of text and tags with attributes and content.
Symbols and quotes encode text and functions encode tags with attributes. Groupings are used to encode the content of a tag.
Ex: `(p class:front-paragraph) {Hello world!} Text.` encodes the HTML `<p class="front-paragraph">Hello world!</p>Text.`.
This encoding is expanded upon in the HTML preprocessor subcrate.
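To make the mapping concrete, here is a minimal sketch of how a function with options and a following grouping could be rendered as an HTML tag. The `render_tag` function is hypothetical and is not the API of the preprocessor subcrate. It performs no HTML escaping.

```rust
// Hypothetical renderer: the function name becomes the tag name, the options
// become attributes, and the grouping becomes the tag content.
// Simplification: attribute values and content are not HTML-escaped.
fn render_tag(name: &str, attributes: &[(&str, &str)], content: &str) -> String {
    let mut html = format!("<{}", name);
    for (key, value) in attributes {
        html.push_str(&format!(" {}=\"{}\"", key, value));
    }
    html.push('>');
    html.push_str(content);
    html.push_str(&format!("</{}>", name));
    html
}
```

Called with the name `p`, the option `class:front-paragraph` and the content `Hello world!`, this produces the paragraph tag from the example above.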
Here it is assumed that the structure can consist of text, macro commands and groupings.
Symbols and quotes encode text, groupings trivially encode groupings, and functions encode macros.
Ex: `This is text. (italic) {This is a grouping}` encodes the TeX `This is text.\italic{This is a grouping}`.
This encoding is expanded upon in the TeX preprocessor subcrate.
The goal is to design a textual format that satisfies the requirements below. The table also shows how well existing formats satisfy these requirements. The most important requirements are 1, 2, 7 and 10, while 8 and 9 are of lesser importance. The reason for designing this new format is the lack of a format satisfying requirements 7 and 10. Keep in mind that some requirements may be subjective.
Goal | JSON | XML&HTML | YAML | TOML |
---|---|---|---|---|
1 The format is human-readable. Assuming that best formatting practices are followed, the format should be easy to read and understand. | ✔️ | ✔️ | ✔️ | ✔️ |
2 The format is human-writable. Here, ease of writing or convenience is not taken into account. | ✔️️ | ✔️ | ✔️ | ✔️ |
3 The format is simple. There are few special cases. An advantage of a simpler format is that it is easier to parse. | ✔️ | ✔️ There is sometimes minor confusion about whether to encode data as tags or as attributes. | ❌ YAML is complex. There are many special cases and values may yield surprising results. | ✔️ |
4 The format is concise and contains minimal syntax noise. | ➖ JSON is concise, but does not minimize syntax noise. It requires quotes around keys even when there is no ambiguity. | ❌ XML does not minimize syntax noise. It is extremely verbose. | ✔️ | ✔️ |
5 The format has comments.️ | ❌ | ✔️ | ✔️ | ✔️ |
6 The format fully defines syntax, but not semantics. Semantics (such as data types of expressions) are defined externally. | ❌ JSON fully defines syntax and data types. | ✔️ | ❌ YAML fully defines syntax and data types. | ❌ TOML fully defines syntax and data types. |
7 The format can natively express both structured and unstructured data, such as numbers, text, structs, enums, dictionaries, sequences and markup. | ❌ JSON does not support markup, and it is not entirely clear how to represent sum types. | ➖ XML can represent these structures thanks to its flexibility, but it has no native support for sequences and dictionaries. Yet, it is obvious how to model them. | ❌ YAML does not support markup, and it is not entirely clear how to represent sum types. | ❌ TOML does not support markup, and it is not entirely clear how to represent sum types. |
8 The format is suitable for configuration. | ➖️ JSON can be used for configuration, but it lacks comments, which is a big downside. | ➖️ XML can be used for configuration, but its verbosity makes it inconvenient as a universal configuration format. | ✔️ | ✔️ |
9 The format is suitable for serialization. | ✔️ | ✔️ | ➖️ YAML can be used for serialization, but is not optimal. | ❌ TOML is not intended for serialization. |
10 The format is suitable for hand-coding. It lends itself well as a source format. It can conveniently encode structured data and markup. | ➖️️ JSON can be hand-coded easily, but its lack of comments makes it impractical as a source format. | ❌️ XML is not suitable as a source format because of its verbosity. | ✔️ YAML is easy to hand-code in most cases, but when YAML documents get large or complex, they may get hard to manage, especially given the whitespace indentation. | ✔️ |
Here are some of the decisions made during the design process. The reasons behind these decisions may be subjective.
Dx comes from "data expressions". This makes sense since expressions represent encoded data structures.
Arguments are primarily delimited by whitespace, because arguments are frequent, and whitespace creates the least possible amount of syntax noise.
Whitespace equivalence gives users the flexibility to format a document however they like. For simple expressions, this flexibility is not needed, but for complex expressions that span multiple lines, it is appreciated.
Whitespace indentation is simple and works well when expressions span one line. In most whitespace-indented formats and languages, this is the case most of the time. However, when an expression has to span multiple lines, whitespace indentation requires complex rules that feel like special cases to the user. Keeping track of whitespace and indentation level also adds complexity to the parser. Thus, it was decided to stick with bracket-delimited scopes.
Looking at modern programming languages and ubiquitous formats such as JSON, XML&HTML and TeX, the following structures are commonly used: numbers, text, structs/product types, enums/sum types, dictionaries, sequences, markup with text and tags with attributes and content (such as XML) and markup with text, groupings and macro commands (such as TeX).
The described argument variants are able to natively support these structures with concise and convenient syntax.
The XML approach is taken, where the semantics (such as the types of the contained expressions) of a document must be defined externally. A user must define semantics by using a schema, writing documentation or implementing serialization/deserialization procedures in a program.
This approach is taken because normally a document is not read blindly. A user or a program already has expectations about the types of the encoded expressions. This also makes the format more flexible and extensible, like XML.