Neodyn Exchange Format Specification

Neodyn Exchange on crates.io Neodyn Exchange on docs.rs Minimum Supported Rust Version (MSRV) Downloads License Lines of Code

Neodyn Exchange is a serialization format originally derived from the Serde data model. It is designed to be an easy-to-use, portable, and efficient means of data exchange between systems. Although the reference implementation is in Rust, implementations in other languages are expected and encouraged.

Usage Quick Start

Like many Serde-compatible serialization formats, Neodyn Exchange is used primarily by calling top-level functions for different formats. The crate root re-exports the following functions:

``rust // For serialization: fn to_value() // serializes to an in-memory, loosely-typedValue` fn tostring() // serializes to a string, in the text representation fn tobytes() // serializes to a Vec, in the binary format fn towriter() // serializes to an io::Write, in the binary format fn towriterbuffered() // same as towriter(), but uses buffering

// For deserialization fn fromvalue() // deserializes from an in-memory, loosely-typed Value fn fromvalueref() // same as above, but takes a borrowed &Value instead fn fromstr() // deserializes from the text representation fn frombytes() // deserializes from a byte slice in the binary format fn fromreader() // deserializes from an io::Read in the binary format fn fromreaderbuffered() // same as above, but uses buffering for efficiency ```

There are more functions in the conventionally-named ser and de modules, which, however, are less common (e.g. writing the text format as raw bytes).

The Value enum is also exported at the top level; it represents all possible values the Neodyn Exchange format can work with. Additionally, Value and &Value implement the Deserializer trait, as well as Serialize and Deserialize, so you can use them as a last-resort escape hatch if some types or other formats aren't flexible enough in their implementation of (de)serialization.

Furthermore, the types implementing Serializer and Deserializer are also public (however, they are not re-exported). However, some caveats apply to them (e.g. some of them must be .finalize()d manually), so please refer to their respective documentation for more information.

Representations

The Neodyn Exchange data model has three equivalent representations:

  1. An abstract, structured, in-memory value tree, realized by the Value type in the Rust reference implementation;
  2. A human-readable, pretty-printable, text-based format;
  3. And a machine-readable, compact binary format.

Each of these representations are capable of storing the same set of values. How exactly one interprets a serialized value may differ between applications. For example, due to the varying nature of the means of parsing each representation, the Serde reference implementation may produce different kinds of errors at different points in time when deserializing the structured, the human-readable, and the binary format, respectively.

The Type System

Neodyn Exchange is a self-describing format, and as such, every serialized value carries its type. This also means that as long as one is satisfied with a loosely-typed value tree as the output of the parsers, no schema is needed for decoding a serialized value.

The possible types are the following:

  1. null, denoting the absence of an optional value.
  2. opt, denoting the presence of a (wrapped) optional value.
  3. bool, just a regular Boolean, either true or false.
  4. int, 64-bit signed integer.
  5. uint, 64-bit unsigned integer.
  6. float, 64-bit IEEE-754 floating-point number, which may not be NaN.
  7. string, UTF-8 encoded, human readable text.
  8. blob, an arbitrary, raw sequence of bytes.
  9. array, an ordered collection of arbitrary nested values.
  10. map, an unordered collection of key-value pairs, where both the keys and the values are arbitrary nested Neodyn Exchange values.

The abstract value tree representation pretty much reflects this system one-to-one.

Remark: The signed integer, unsigned integer, and floating-point types are collectively called numbers. (Note in particular that bool is not considered a number.)

The Text Representation

The text representation is defined to be a UTF-8 encoded string. The following is an informal (read: not rigorous), BNF-style grammar for this format.

Whitespace (as defined by Unicode) before the first token, after the last token, and between consecutive tokens is allowed and ignored. The exception is strings, which are treated as a single token, and whitespace between characters in a string must not be ignored. (This is just the usual semantics, because of course we want to preserve whitespace within a string.)

Details of the notation below:

``` VALUE := NULL | OPT | BOOL | INT | UINT | FLOAT | STRING | BLOB | ARRAY | MAP

NULL := 'null' WB OPT := '?' VALUE

BOOL := ('true' | 'false') WB

INT := ('+' | '-') [0-9]+ WB UINT := [0-9]+ WB FLOAT := ('+' | '-')? (([0-9]* '.' [0-9]+) | ([0-9]+ '.' [0-9]*) | 'inf') WB

STRING := '"' ( UNESCAPED | ESCAPED )* '"' UNESCAPED := ESCAPED := '\n' | '\r' | '\t' | '\' | '\'' | '\"' | '\u{[0-9a-fA-F]+}'

BLOB := '#' ([0-9a-fA-F]{2})* '#'

ARRAY := '[' ( VALUE ( ',' VALUE )* ','? )? ']' MAP := '{' ( PAIR ( ',' PAIR )* ','? )? '}' PAIR := VALUE ':' VALUE

WB := ```

Notes:

Example for the text representation:

[ { +39: -.354, -1.: true, +3.142: -6.283, 0: null, 1: ?"an optional string", 2: ??"two levels of optionals; even an optional null is allowed, e.g.:", null: ?null, "as you can see": "null is allowed to be a key as well", "escaped\nnewline": "unescaped newline", ["arrays","and","maps"]:{"can":"be","keys":"too"}, "this is a map": "with a trailing comma", }, { "optional array": ?[ "first", "second", ], "empty map": {}, "array without a trailing comma": [1, 2, 3], "this is a map": "also without a trailing comma" }, ]

The Binary Representation

The binary format tries to pack values into as little space as possible. It is, however, a strictly byte-oriented format, i.e. it operates on units of at least 8 bits and assumes 8-bit bytes. It employs techniques such as variable-width integer encoding and string interning in order to achieve small serialized object sizes. Consequently, although the conceptual type system describes 64-bit integers and floating-point numbers, values of sufficiently low magnitude and/or sufficiently few significant digits may actually use less space when encoded.

Type tags and Variable-Width Encoding

Every value starts with a one-byte tag containing at least type information, but often other data such as length or an inline small value.

The types are organized hierarchically. There are three levels of types:

  1. Every value has a major type.
  2. For some of the major types, there are one or more minor types.
  3. For certain combinations of major and minor types, there is a value tag.

The type annotations and these additional pieces of information occupy well-defined places within a byte:

To sum up, the possible formats for type/value tags are:

Anatomy of a Serialized Value. The Symbol Table.

A serialized Neodyn Exchange value in the binary format consists of two parts:

  1. The header or "symbol table"
  2. The body

Strings and binary blobs are stored in a compressed manner: only one copy of each identical byte array is ever written. These uniqued byte arrays are arranged into a "symbol table" at the beginning of the serialized data. If there are no interned strings or blobs in a value, this header may be omitted (and it is omitted by the reference implementation).

Note that if a string and a blob have the exact same data payload, they are uniqued to the same slot in the symbol table. Also note that empty strings and blobs are never interned and they are denoted by special inline "empty string" and "empty blob" markers, respectively, in the value body.

Furthermore, Each symbol table entry declares whether the corresponding data payload can (and will) be used as a string too, or only as a blob. This helps the deserializer avoid trying to parse arbitrary binary data as UTF-8.

If a symbol table entry is referenced more than once, the number of references, the so-called "use count" is also stored along with the payload. This makes it possible for the deserializer to hand over the owned buffer to the consumer upon the last reference to the symbol, sparing a copy and an allocation. (This optimization only matters when the deserializer works with owned buffers.)

After the symbol table, all other data are stored in the value body. It contains atomic values (e.g. booleans, numbers, etc.) as well as arrays and maps, inline. Strings and blobs are referenced by their index into the symbol table.

The only case when more than one value is present in the body is when there is a complex type (an array or a map), recursively containing other values. The values in an array follow each other sequentially. The keys and values of a map are interleaved: the first entry is the first key followed by the first value, then the second key and the second value, and so on.

Type Tags in the Header and the Body

Numerically equal type tags can have different interpretations depending on whether they are found in the header / symbol table or in the actual value body.

The following lists all of the possible major and minor types and value tags. As previously, NN denotes log-length bits, and VVV VV stands for inline small integer bits.

First, type tags shared between the header and the body:

| Bits | Meaning | |-------------:|:------------------------------------------| | 000 000 NN | Symbol Table Start |

Second, type tags for the header / symbol table:

| Bits | Meaning | |-------------:|:------------------------------------------| | 010 VVV VV | Short Blob, Single-use | | 011 VVV VV | Short Blob, Multiple uses | | 100 VVV VV | Short String, Single-use | | 101 VVV VV | Short String, Multiple uses | | 111 010 NN | Long Blob, Single-use | | 111 011 NN | Long Blob, Multiple uses | | 111 100 NN | Long String, Single-use | | 111 101 NN | Long String, Multiple uses |

Third, type tags for the value body:

| Bits | Meaning | |-------------:|:------------------------------------------| | 000 001 00 | null | | 000 001 01 | Optional Present | | 000 001 10 | false | | 000 001 11 | true | | 000 010 00 | Empty string (inline) | | 000 010 01 | Empty blob (inline) | | 001 VVV VV | Small signed integer | | 010 VVV VV | Small unsigned integer | | 011 VVV VV | Low-index string | | 100 VVV VV | Low-index blob | | 101 VVV VV | Small array | | 110 VVV VV | Small map | | 111 001 NN | Signed integer | | 111 010 NN | Unsigned integer | | 111 011 NN | String | | 111 100 NN | Blob | | 111 101 NN | Array | | 111 110 NN | Map | | 111 111 NN | Floating-point number |

Bit-level Details of the Binary Format

First some general notes about the variable-width number encoding:

Next, the format of the symbol table is elaborated.

Symbol Table Header Format

If the symbol table is present, the value must start with a marker byte with the major type "simple" and the minor type "symtab", i.e. 000 000 NN. If the value does not start with such a byte, then no symbol table is read.

The last two bits, NN, define the base-2 logarithm of the number of bytes needed for encoding the number of entries in the symbol table. I.e.,

| Symtab start byte | Width of integer encoding symbol count | |------------------:|:--------------------------------------:| | 000 000 00 | 1 | | 000 000 01 | 2 | | 000 000 10 | 4 | | 000 000 11 | 8 |

Note that this means that the symbol table count doesn't have a compact, inline (1-byte) format for storing the count; the total size is always 2, 3, 5, or 9 bytes, as the "start byte" is present unconditionally.

So, the symbol count follows, as a little-endian unsigned integer. For example, a symbol table with 512 entries might begin like this:

+---------------+-------------+-------------+ | 000 000 01 | 0000 0000 | 0000 0010 | +---------------+-------------+-------------+ | Symtab start, | count, | count, | | 2^1 = 2 bytes | low byte | high byte | | of length | | | | follow | | | +---------------+-------------+-------------+

Then, each entry follows in sequence, without any additional padding in between.

Symbol Table Entry Format

Symbol table entries consist of, in this order:

  1. Type tag and length (1...9 bytes), in the usual variable-width encoding;
  2. An optional use count (1...9 bytes), as an unsigned integer;
  3. The raw payload data.

The first 1...9 bytes thus contain information about the following:

The bit pattern for each of the type tags can be found in the table of all type tags in the previous section.

So, for example, a 64-byte-long string that is only referred to once in the value body (and thus has an implicit use count) might be encoded as follows:

+--------------+-----------+------------------------+ | 111 100 00 | 0100 0000 | 64 bytes of UTF-8 data | +--------------+-----------+------------------------+ | long string, | length, | actual string payload | | single-use, | 64 bytes | | | 2^0 = 1 byte | | | | of length | | | | follow | | | +--------------+-----------+------------------------+

Meanwhile a 17-byte-long blob that is used 130 times in the body can be encoded in the following manner:

+--------------+------------+-----------+---------------------+ | 011 100 01 | 111 010 00 | 1000 0010 | 17 arbitrary bytes | +--------------+------------+-----------+---------------------+ | short blob, | use count: | 130 | actual blob payload | | multi-use, | unsigned | | | | 17 bytes | integer, | | | | long | 2^0 = 1 | | | | | byte wide | | | +--------------+------------+-----------+---------------------+

The Value Body

Non-interned data, i.e. values of every type other than strings and blobs, are stored inline in the body, which directly follows the header.

The exact format of each type of value is reproduced below.

| Type Tag | Value | Following additional data | |-------------:|:--------------:|:------------------------------------------| | 000 001 00 | null | None (tag-only) | | 000 001 01 | Optional | Wrapped value | | 000 001 10 | false | None (tag-only) | | 000 001 11 | true | None (tag-only) | | 000 010 00 | Empty String | None (tag-only) | | 000 010 01 | Empty Blob | None (tag-only) | | 001 VVV VV | Small Int | None (5-bit value inline) | | 010 VVV VV | Small Uint | None (5-bit value inline) | | 011 NNN NN | Small String | None (5-bit symtab index inline) | | 100 NNN NN | Small Blob | None (5-bit symtab index inline) | | 101 NNN NN | Small Array | Array items (5-bit count inline) | | 110 NNN NN | Small Map | Key-value pairs (5-bit count inline) | | 111 001 NN | Big Int | 2NN bytes of 2's complement | | 111 010 NN | Big Uint | 2NN bytes of unsigned int | | 111 011 NN | Big String | 2NN bytes of symtab index | | 111 100 NN | Big Blob | 2NN bytes of symtab index | | 111 101 NN | Big Array | 2NN bytes of count, then items | | 111 110 NN | Big Map | 2NN bytes of count, then pairs | | 111 111 NN | Floating-point | 2NN (4 or 8) bytes of IEEE-754 |

Example

Consider the MsgPack front page example in the human-readable format of Neodyn Exchange:

{"compact": true, "schema": 0}

In the binary format this is represented as:

| Address | Byte | Explanation | |--------:|:------------:|:------------------------------------------------| | 00 | 000 000 00 | Symtab start, 200 = 1 byte of length | | 01 | 000 000 10 | 2 symbols in symtab | | 02 | 100 001 11 | Short string, single-use, 7 bytes | | 03 | 011 000 11 | 'c' | | 04 | 011 011 11 | 'o' | | 05 | 011 011 01 | 'm' | | 06 | 011 100 00 | 'p' | | 07 | 011 000 01 | 'a' | | 08 | 011 000 11 | 'c' | | 09 | 011 101 00 | 't' | | 10 | 100 001 10 | Short string, single-use, 6 bytes | | 11 | 011 100 11 | 's' | | 12 | 011 000 11 | 'c' | | 13 | 011 010 00 | 'h' | | 14 | 011 001 01 | 'e' | | 15 | 011 011 01 | 'm' | | 16 | 011 000 01 | 'a' | | 17 | 110 000 10 | 2-element map | | 18 | 011 000 00 | String #0 | | 19 | 000 001 11 | true | | 20 | 011 000 01 | String #1 | | 21 | 010 000 00 | Unsigned integer 0 |