Lozizol 0.5

Extensible and efficient serialization protocol.

Aiming at both wire and storage, multiplexed streams, concatenation, event sourcing...

Event sourcing? History is immutable and only its interpretation may change between application versions. On the down side, the application needs to know how to interpret the records of all previous versions. On the up side, there is no need for migrations or data updates on application upgrade.

Definitions

These entry types are implied and provided by Lozizol by default:

* Lozizol header (111)
* Type assignment (1)
* Deleted entry (0)

Other entry types are defined by applications with a URI and assigned in a Lozizol sequence using the type assignment entry. Types can be assigned, reassigned, or removed even in the middle of a Lozizol sequence. A type must be assigned or implied before it can be used in a Lozizol sequence.

Higher-level implementations must work with the type URIs rather than the type IDs. As a result, one can assign the same type under multiple type IDs and they will all be treated equally.
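
As an illustration of that rule (a hypothetical Python sketch, not part of Lozizol; the class and method names are invented), a type table can treat the URI as canonical while allowing several IDs to map to it:

```python
# Hypothetical sketch: type IDs are local aliases; the URI is the identity.
class TypeTable:
    def __init__(self):
        self.by_id = {}               # type ID -> type URI

    def assign(self, type_id, uri):
        self.by_id[type_id] = uri     # assign or reassign mid-sequence

    def remove(self, type_id):
        self.by_id.pop(type_id, None)

    def uri_of(self, type_id):
        return self.by_id[type_id]    # higher levels compare URIs, not IDs

table = TypeTable()
table.assign(63, "urn:my-awesome-type")
table.assign(64, "urn:my-awesome-type")   # same type under a second ID
assert table.uri_of(63) == table.uri_of(64)
```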

Binary definitions

The Vuint binary representation choice
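
The rest of this document relies on two vuint properties: the high bit of each byte marks continuation, and 63 serializes as the single byte 0x3F. A minimal Python sketch consistent with both, assuming an LEB128-style layout (least significant 7 bits first); the authoritative layout is whatever this section's binary definition specifies:

```python
def encode_vuint(n):
    """Assumed LEB128-style vuint: 7 bits per byte, low bits first,
    high bit set on every byte except the last."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def decode_vuint(buf, pos):
    """Read a vuint at pos; returns (value, position after the vuint)."""
    value = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:          # first (high) bit clear: last byte
            return value, pos
        shift += 7

assert encode_vuint(63) == b"\x3f"        # matches the 0x3F ('?') example
assert decode_vuint(encode_vuint(300), 0) == (300, 2)
```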

Lozizol header (111)

Since the Lozizol header would typically be the first record stored in a file, it allows for magic bytes based file type detection. It is deliberately human readable so that unsuspecting explorers can learn about the nature of the stored and/or transferred data.

Without the Lozizol header the implementation must either:

* assume the file is not a Lozizol sequence - consider it corrupt, or
* assure that it knows precisely which sequence (and state) it belongs to:
  * sequence (potentially nested)
  * starting position in the sequence
  * type definitions and context

Try

```bash
try_lozizol () {
	data="zizol 0.5 039343bb-9019-4d6e-825e-631aa578ff7f lozizol-cli 0.0.1-preview and what not ....................."
	lozizol serialize entry 111 "$data" | xxd
	# lozizol magic:
	lozizol serialize entry 111 "$data" | dd bs=1 count=109 status=none | cut -d ' ' -f1
	# lozizol version:
	lozizol serialize entry 111 "$data" | dd bs=1 count=109 status=none | cut -d ' ' -f2
	# sequence ID:
	lozizol serialize entry 111 "$data" | dd bs=1 count=109 status=none | cut -d ' ' -f3
	# diagnostic info:
	lozizol serialize entry 111 "$data" | dd bs=1 count=109 status=none | cut -d ' ' -f4-
}
try_lozizol
```

A Lozizol sequence is initialized with the header. The header sets the sequence ID and protocol version. It resets all type assignments in the given sequence to the default (unassigns all and assigns only the ones implied by Lozizol).

Note: The Lozizol header type id 111 can be reassigned once the sequence is initialized.

Type assignment (1)

Try

```bash
lozizol serialize vuint 63 | xxd
# find that 63 serializes as 0x3F ('?')
lozizol serialize entry 1 "?urn:my-awesome-type" | xxd
```

or a specific shortcut

```bash
lozizol serialize type 1 63 urn:my-awesome-type | xxd
```

Note: The type assignment type id 1 can be reassigned.

Deleted entry (0)

Warning: The deletion type id 0 MUST NOT be reassigned as it would break the padding semantics.

Operations

The protocol anticipates that aggressive forward-only readers following a forward-only writer will be the most common scenario. This fits both file operations and network streaming. Fast seeking forward is supported by prefixing entries with their length.

Multiplexing or random write access can be accommodated too with some caution applied by the application. Lazy cleanup and indexing writes can take place without affecting integrity and reads. Indexing can be built on top of Lozizol to enable random read access.

Blocking

Forward only reads and writes

Use case: all the data required to serialize the entry is available. In other words, the writer will not block excessively waiting for serialized data to materialize (IO or intense serialization).

Write

Examples:

* Recording a TCP stream: the writer only persists the immediately available buffer as is, without much overhead for serialization.
* Recording a user action: all the data is available in memory and serialization is trivial.

The writer:

1. writes the entry type record if it hasn't been written to this Lozizol sequence yet,
2. writes the entry record: size, type, data.

A record is considered committed when all data according to its size has been written.
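
The framing above can be sketched in Python (a hypothetical sketch, not the reference implementation; `write_entry` is an invented name, and the vuint layout is assumed to be LEB128-style, consistent with 63 serializing as 0x3F):

```python
import io

def encode_vuint(n):
    """Assumed LEB128-style vuint: high bit marks continuation."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def write_entry(stream, type_id, data):
    """Append one entry: size vuint, then type vuint, then data.
    The size covers the type vuint plus the data."""
    type_bytes = encode_vuint(type_id)
    stream.write(encode_vuint(len(type_bytes) + len(data)))
    stream.write(type_bytes)
    stream.write(data)

buf = io.BytesIO()
write_entry(buf, 63, b"hi")          # size 3 = 1 (type) + 2 (data)
assert buf.getvalue() == b"\x03?hi"  # 0x03, 0x3F ('?'), "hi"
```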

Read

The reader simply eagerly reads all available data. When it reaches the end of the record it produces the entry.

The reader:

1. reads a vuint as long as the first bit is set, learning the size of the record,
2. reads the entry type as long as the first bit is set, learning and using this information,
3. reads the remaining data (size of the record minus the type vuint length) into a buffer,
4. produces the entry.
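
The four steps above can be sketched as follows (hypothetical Python, invented function names, vuint layout assumed LEB128-style):

```python
def decode_vuint(buf, pos):
    """Step 1/2 helper: read bytes while the first (high) bit is set."""
    value = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, pos
        shift += 7

def read_entry(buf, pos):
    size, pos = decode_vuint(buf, pos)        # step 1: record size
    body_start = pos
    type_id, pos = decode_vuint(buf, pos)     # step 2: entry type
    data = buf[pos:body_start + size]         # step 3: size minus type vuint length
    return type_id, data, body_start + size   # step 4: produce the entry

# One entry: size 3, type 63 (0x3F, '?'), data "hi".
assert read_entry(b"\x03?hi", 0) == (63, b"hi", 4)
```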

A parallel approach is for the reader to skip through the records by their length; actual record reading is delegated to asynchronous record readers by giving them the byte range to munch on.
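
A sketch of that dispatching scan (hypothetical Python; only the size prefix is decoded, under the same assumed LEB128-style vuint layout):

```python
def entry_ranges(buf):
    """Skip through records by their size prefix alone, yielding the
    (start, end) byte range of each record body (type + data) so that
    asynchronous record readers can be handed the range to munch on."""
    pos = 0
    while pos < len(buf):
        size = shift = 0
        while True:                    # inline vuint decode of the size
            b = buf[pos]
            pos += 1
            size |= (b & 0x7F) << shift
            if not b & 0x80:
                break
            shift += 7
        yield pos, pos + size          # delegate this range, then skip it
        pos += size

# Two entries: (size 3, type '?', "hi") and (size 4, type '?', "abc").
assert list(entry_ranges(b"\x03?hi\x04?abc")) == [(1, 4), (5, 9)]
```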

Deletions

Use case: the application needs to mark an already written record as deleted.

The application should have its own semantics for labeling data structures redundant, such as adding an entry that an entity has been deleted, in the spirit of event sourcing. Still, for archiving/cleanup purposes, the application should be able to mark a record as deleted. See also Wiping and Secure erasure.

The deleted entry type (0) is intended to be applied to already written records to mark them as deleted. As long as the single-byte write operation can be considered atomic, deletion can be combined with reads and writes and is safe for readers: a reader is either before the type ID and will see the record as deleted, or behind the type ID and will see the original record.

The data does not change, but with the entry type reset, it is no longer possible to interpret its meaning reliably. Readers should skip deleted records unless they are in the business of reclaiming unused space for compression or executing data erasure.

Deletion is a single-byte write operation. To mark a record as deleted, put a null (0x00) byte at the position immediately following the record length. If this operation cannot be executed in an atomic fashion, deletions must block readers and random access writers on the given record to ensure consistency.
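
A sketch of that single write (hypothetical Python over an in-memory buffer; the size vuint is skipped under the assumed LEB128-style layout where the high bit marks continuation):

```python
def mark_deleted(buf, record_start):
    """Mark a record deleted with one byte write: zero the first type
    byte, which sits immediately after the record's size vuint."""
    pos = record_start
    while buf[pos] & 0x80:   # skip size vuint continuation bytes
        pos += 1
    pos += 1                 # now at the first byte of the type vuint
    buf[pos] = 0x00          # the one (ideally atomic) write: type becomes 0

record = bytearray(b"\x03?hi")    # size 3, type 63, data "hi"
mark_deleted(record, 0)
assert record == bytearray(b"\x03\x00hi")   # size and data untouched
```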

Wiping

A Lozizol sequence can be compressed as a whole, like any other stream. A fitting optimization is to wipe all deleted records with null (0x00) bytes beforehand.

This can be done as a background job on a stored Lozizol sequence or by passing it through a wiping process.

The background wiping process:

1. asserts that random access reads and writes have been stopped where applicable (append-only writes can continue),
2. locates the next deleted record,
3. writes null (0x00) bytes in place of the type,
4. writes null (0x00) bytes in place of the data,
5. writes null (0x00) bytes in place of the size.

This will effectively create padding. It should be safe to fail at any moment.
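The wiping steps can be sketched like this (hypothetical Python over an in-memory buffer, assumed LEB128-style vuint); zeroing the size vuint last is what makes failure at any moment safe, since an interrupted wipe leaves either a still-skippable deleted record or pure padding:

```python
def wipe_deleted(buf, record_start):
    """Zero a deleted record: type and data first, the size vuint last."""
    size = shift = 0
    pos = record_start
    while True:                          # decode the size vuint
        b = buf[pos]
        pos += 1
        size |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    for i in range(pos, pos + size):     # steps 3-4: wipe type and data
        buf[i] = 0x00
    for i in range(record_start, pos):   # step 5: wipe the size vuint
        buf[i] = 0x00

record = bytearray(b"\x03\x00hi")   # an already-deleted record
wipe_deleted(record, 0)
assert record == bytearray(4)       # pure null padding remains
```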

The background wiping process can optionally optimize subsequent reads by creating deleted entry records to span the continuous blocks of padding.

The pass-through (streaming) wiping process:

1. simply replaces all deleted records with null (0x00) bytes.

Secure erasure

Sensitive applications may require secure erasure of stored and deleted records. This can be accomplished similarly to the background wiping process, except that before filling the data with null (0x00) bytes, the secure erasure process runs adequate measures to eliminate residual data traces from the storage media.

Random access writes

Before even considering random access asynchronous writes, think about your model. Can these writes be expressed as smaller entries written forward only? For example, rather than writing down the whole e-mail as it comes from the wire, which would beg an asynchronous write optimization, write each header block and each body block as a record tagged with the common mail ID and close with a terminating record (a.k.a. event sourcing).

Alternatively, simple parallelism can be achieved by partitioning the Lozizol sequences into multiple streams.

Still, there may be situations where asynchronous random access writes are desirable. For instance in case of UDP transfers, or intense serialization.

Use case: I really want to write asynchronously

There should be one authority over each Lozizol sequence that allocates records to be written. It will only write down the record length and type and hand over the byte range to an asynchronous writer. Immediately after that it can skip to the next available position and allocate the next record write request.

Since the integrity of the record cannot be asserted by reaching the end of data in this case, there must be a commit mechanism built into the application: either a hash as part of the record that invalidates dirty reads, or a separate commit record. Writing the commit record must block the authority. One simple commit mechanism is to use one type for dirty records and another for fully written records. After writing, flushing, and syncing, the writer updates the record type. This is tricky because the type vuint length must not change: IDs of the same vuint length must be chosen.
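
That type-flip commit can be sketched as follows (hypothetical Python; the `DIRTY` and `COMMITTED` IDs are invented, chosen so both fit in one vuint byte, and the size vuint is skipped under the assumed LEB128-style layout):

```python
DIRTY, COMMITTED = 100, 101   # hypothetical type IDs of equal vuint length

def commit(buf, record_start):
    """After writing, flushing, and syncing, flip the record's type from
    DIRTY to COMMITTED; both encode to the same vuint length, so only
    one byte changes (a single, ideally atomic, write)."""
    pos = record_start
    while buf[pos] & 0x80:    # skip the size vuint
        pos += 1
    pos += 1                  # first byte of the type vuint
    assert buf[pos] == DIRTY  # sanity check: record was allocated as dirty
    buf[pos] = COMMITTED

record = bytearray(b"\x03\x64hi")   # size 3, type 100 (dirty), data "hi"
commit(record, 0)
assert record == bytearray(b"\x03\x65hi")   # type is now 101 (committed)
```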

For now it is left to the application to define suitable entry types and the rules for handling asynchronous writes with integrity guarantees.