BDV
Unambiguous Delimited Values. Similar to CSV, but consistent, unambiguous, and
predictable.
Description
BDV uses leading delimiters and simple character escapes to allow simple and
unambiguous introduction of units and records, unambiguous header declaration,
unambiguous concatenation of documents, the ability to distinguish zero fields
from one blank field, and the ability to carry arbitrary binary data.
The EBNF is as follows, where each all-caps value is a configurable single-byte
delimiter:
```ebnf
stream = { garbage }, { message, { garbage } }, [ ENDSTREAM ];
garbage = ? any byte ? - ( STARTMESSAGE | STARTHEADER | ENDSTREAM );
message = [ header ], STARTMESSAGE, { record }, ENDMESSAGE;
header = STARTHEADER, units;
record = STARTRECORD, units;
units = { STARTUNIT, unit };
unit = { ( ? any byte ? - control ) | ( ESCAPE, control ) };
control = ENDSTREAM | STARTHEADER | STARTMESSAGE | ENDMESSAGE | STARTRECORD | STARTUNIT | ESCAPE;
```
The default delimiters:
```ebnf
STARTHEADER = "#";
STARTMESSAGE = ">";
ENDMESSAGE = "<";
STARTRECORD = ? ASCII newline ?;
STARTUNIT = ",";
ESCAPE = "\";
ENDSTREAM = "!";
```
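
The grammar is small enough to parse with a single pass over the bytes. The following is a minimal sketch in Python, not a reference implementation. The names `parse_stream` and `read_units` are hypothetical, the default delimiters are hard-coded, and two leniencies are assumptions of this sketch: an ESCAPE is accepted in front of any byte (the grammar only allows it in front of a control byte), and everything after ENDSTREAM is ignored.

```python
# Default single-byte delimiters, as byte values.
START_HEADER = ord("#")
START_MESSAGE = ord(">")
END_MESSAGE = ord("<")
START_RECORD = ord("\n")
START_UNIT = ord(",")
ESCAPE = ord("\\")
END_STREAM = ord("!")
CONTROL = frozenset([START_HEADER, START_MESSAGE, END_MESSAGE,
                     START_RECORD, START_UNIT, ESCAPE, END_STREAM])


def parse_stream(data: bytes):
    """Return a list of (header, records) tuples.

    header is None when the message declared no header, otherwise a list of
    units (bytes). records is a list of records, each a list of units.
    """
    n = len(data)

    def read_units(i):
        # units = { STARTUNIT, unit }
        units = []
        while i < n and data[i] == START_UNIT:
            i += 1
            unit = bytearray()
            while i < n:
                b = data[i]
                if b == ESCAPE:
                    if i + 1 >= n:
                        raise ValueError("dangling ESCAPE at end of input")
                    unit.append(data[i + 1])  # escaped byte is taken literally
                    i += 2
                elif b in CONTROL:
                    break  # an unescaped control byte ends the unit
                else:
                    unit.append(b)
                    i += 1
            units.append(bytes(unit))
        return units, i

    messages = []
    i = 0
    while i < n:
        b = data[i]
        if b == END_STREAM:
            break
        if b != START_HEADER and b != START_MESSAGE:
            i += 1  # garbage between messages is skipped
            continue
        header = None
        if b == START_HEADER:
            header, i = read_units(i + 1)
            if i >= n or data[i] != START_MESSAGE:
                raise ValueError("expected STARTMESSAGE after header")
        i += 1  # consume STARTMESSAGE
        records = []
        while i < n and data[i] == START_RECORD:
            record, i = read_units(i + 1)
            records.append(record)
        if i >= n or data[i] != END_MESSAGE:
            raise ValueError("expected ENDMESSAGE")
        i += 1  # consume ENDMESSAGE
        messages.append((header, records))
    return messages
```

For instance, `parse_stream(b">\n,<")` yields one message with no header and one record holding a single empty unit, while `parse_stream(b">\n<")` yields one record holding no units at all.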
Examples, using default delimiters
Single message with a header and two records
```
#,id,name,value>
,1,taylor,developer
,2,namewith\,comma,valuewith\
newline<
```
Single message with no header and two records
```
>
,1,taylor,developer
,2,namewith\,comma,valuewith\
newline<
```
Single message with a header and no records
```
#,id,name,value><
```
Single message with a header and one empty record
```
#,id,name,value>
<
```
Single message with a header with an empty unit, and a record of all empty units
```
#,id,name,,value>
,,,,<
```
The shortest valid message
```
><
```
Single message with no header and one record with one empty unit
```
>
,<
```
Single message with no header and one record with zero empty units, one record with one empty unit, and one record with two empty units
```
>

,
,,<
```
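
The way zero records, a record with no units, and a record with one empty unit stay distinct can be seen from the writing side. The sketch below is a hypothetical encoder (`encode_message` and `escape_unit` are illustrative names, and the default delimiters are assumed); every record and every unit contributes exactly one leading delimiter, so each shape produces different bytes.

```python
# Default delimiters as single-byte strings.
START_HEADER, START_MESSAGE, END_MESSAGE = b"#", b">", b"<"
START_RECORD, START_UNIT, ESCAPE = b"\n", b",", b"\\"
CONTROL = b"#><\n,\\!"  # every control byte, including ENDSTREAM ("!")


def escape_unit(unit: bytes) -> bytes:
    """Prefix each control byte inside a unit with ESCAPE."""
    out = bytearray()
    for b in unit:
        if b in CONTROL:
            out += ESCAPE
        out.append(b)
    return bytes(out)


def encode_message(records, header=None) -> bytes:
    """Serialize one message; header=None means no header is declared."""
    out = bytearray()
    if header is not None:
        out += START_HEADER
        for unit in header:
            out += START_UNIT + escape_unit(unit)
    out += START_MESSAGE
    for record in records:
        out += START_RECORD  # one leading delimiter per record
        for unit in record:
            out += START_UNIT + escape_unit(unit)  # and one per unit
    out += END_MESSAGE
    return bytes(out)


no_records = encode_message([])           # b"><"
one_empty_record = encode_message([[]])   # b">\n<"
one_empty_unit = encode_message([[b""]])  # b">\n,<"
```

Because delimiters lead rather than separate, an empty unit still costs one STARTUNIT byte and an empty record still costs one STARTRECORD byte, which is what makes the three cases distinguishable.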
All the previous examples concatenated as a stream of messages, with an ENDSTREAM character to delimit the end
This relies on the rule that any amount of garbage data may appear before any
STARTMESSAGE, STARTHEADER, or ENDSTREAM character, so the trailing newline after
each message is skipped rather than causing an error.
```
#,id,name,value>
,1,taylor,developer
,2,namewith\,comma,valuewith\
newline<
>
,1,taylor,developer
,2,namewith\,comma,valuewith\
newline<
#,id,name,value><
#,id,name,value>
<
#,id,name,,value>
,,,,<
><
>
,<
>

,
,,<
!
```
The shortest valid stream
```
!
```
Advantages over CSV
- Headers are explicitly delimited, so there is never any guessing about whether
the first row constitutes a header.
- Parsing is extremely simple and unambiguous. There is no equivalent of CSV's
problems of determining whether a newline is part of a value, whether quotes
around a value are part of the value, or how to escape commas and quotes inside
a value.
- It is possible to differentiate between a message with no records, a message
with a record that has no units, and a message with a record that has one
empty unit.
- Multiple documents with or without headers can be concatenated in the same
stream and parsed without any loss. Because of the rules for garbage data
around messages, it is usually possible to naively concatenate message files,
as long as none of them ends with an ENDSTREAM.
- The optional ENDSTREAM to end the stream allows a BDV stream to be embedded in
the middle of more data and parsed without issues using just a pointer to its
start byte.
- Being byte-oriented means that you can even embed binary data without issues.
You can use the C0 control codes for a self-describing binary stream of
messages as well, using the following rules:
```ebnf
STARTHEADER = SOH;
STARTMESSAGE = STX;
ENDMESSAGE = ETX;
STARTRECORD = RS;
STARTUNIT = US;
ESCAPE = ESC;
ENDSTREAM = EOT;
```
If you have a stream of mostly string messages, these rules can help serialize
into a compact stream with as little escaping as possible.
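
To illustrate the point about escaping, the sketch below encodes the same record once with the default delimiters and once with the C0 rules. The helper names (`escape_unit`, `encode_record`) and the sample units are made up for the illustration; only the delimiter assignments come from the rules above.

```python
# C0 control codes named in the rules above (ASCII values).
SOH, STX, ETX, EOT = 0x01, 0x02, 0x03, 0x04
ESC, RS, US = 0x1B, 0x1E, 0x1F

DEFAULT_CONTROL = b"#><\n,\\!"
C0_CONTROL = bytes([SOH, STX, ETX, EOT, ESC, RS, US])


def escape_unit(unit: bytes, control: bytes, escape: int) -> bytes:
    """Prefix every byte of `unit` that is a control byte with `escape`."""
    out = bytearray()
    for b in unit:
        if b in control:
            out.append(escape)
        out.append(b)
    return bytes(out)


def encode_record(units, control, start_record, start_unit, escape) -> bytes:
    """Encode one record: STARTRECORD, then STARTUNIT + escaped unit for each unit."""
    out = bytearray([start_record])
    for unit in units:
        out.append(start_unit)
        out += escape_unit(unit, control, escape)
    return bytes(out)


# Ordinary text full of punctuation that collides with the default delimiters.
units = [b"Hello, world!", b"<b>#1</b>\n"]

default_bytes = encode_record(units, DEFAULT_CONTROL, ord("\n"), ord(","), ord("\\"))
c0_bytes = encode_record(units, C0_CONTROL, RS, US, ESC)

# Number of escape bytes each configuration had to insert.
payload = sum(len(u) for u in units)
default_escapes = len(default_bytes) - payload - 3  # 3 = one RS-like + two US-like delimiters
c0_escapes = len(c0_bytes) - payload - 3
```

With these sample units the default delimiters need eight escape bytes and the C0 delimiters need none, since printable text rarely contains C0 control codes.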
Disadvantages
- The concessions needed to make the format unambiguous can make it somewhat
unwieldy to read and write by hand. Manual modification is possible, but this
format is intended primarily to be a machine-generated and machine-parsed
format.
- Because strings are not length-prefixed, units have to be parsed
byte-by-byte, and escape sequences can interfere with zero-copy reading. This
is a disadvantage compared to other binary formats, not compared to CSV or
other text-oriented formats.
- Because the format is byte-oriented, it can be unwieldy at best to use with
multibyte encodings. All values are treated as arbitrary binary data and all
delimiters are required to be single bytes, so an escaped message is likely not
to be valid in the original encoding, although round trips should still work
regardless. Even UTF-8 can have its code points split by escape bytes if a
delimiter is configured as a byte of 0x80 or above; ASCII delimiters such as
the defaults never occur inside a multibyte UTF-8 sequence. A later version
might relax the byte-oriented nature of the format and allow arbitrary
encodings.
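
The interaction with UTF-8 can be checked with a small sketch. It assumes a hypothetical configuration in which STARTUNIT is the byte 0xA9, which is also the second byte of "é" in UTF-8; the function names are illustrative only. The escaped bytes stop being valid UTF-8 in that configuration, yet unescaping still restores the original bytes.

```python
def escape_unit(unit: bytes, control: bytes, escape: int) -> bytes:
    """Prefix every control byte in `unit` with the escape byte."""
    out = bytearray()
    for b in unit:
        if b in control:
            out.append(escape)
        out.append(b)
    return bytes(out)


def unescape_unit(data: bytes, escape: int) -> bytes:
    """Inverse of escape_unit: drop each escape byte, keep the byte after it."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == escape:
            i += 1  # skip the escape; the next byte is literal
            if i == len(data):
                raise ValueError("dangling escape byte")
        out.append(data[i])
        i += 1
    return bytes(out)


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ESCAPE = ord("\\")
text = "café, naïve".encode("utf-8")  # "é" encodes as the two bytes C3 A9

# Default ASCII delimiters: escapes only ever land in front of ASCII bytes.
ascii_escaped = escape_unit(text, b"#><\n,\\!", ESCAPE)

# Hypothetical configuration (an assumption of this sketch): STARTUNIT = 0xA9.
high_escaped = escape_unit(text, bytes([0xA9, ESCAPE]), ESCAPE)

ascii_ok = is_valid_utf8(ascii_escaped)  # escaped form is still UTF-8
high_ok = is_valid_utf8(high_escaped)    # escape byte splits C3 A9
round_trips = (unescape_unit(ascii_escaped, ESCAPE) == text
               and unescape_unit(high_escaped, ESCAPE) == text)
```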