# ArunaReadWriter

Short guidance for using the ArunaReadWriter custom component. For the formal file specification, see the section "The ARUNA file format" below.

This is the first working generic version of a customisable data transformer component for the Aruna Object Storage (AOS). The idea is simple: you implement these two base traits with your custom data transformation logic:

```rust
pub trait AddTransformer<'a> {
    fn add_transformer(&mut self, t: Box<dyn Transformer + Send + 'a>);
}

#[async_trait::async_trait]
pub trait Transformer {
    async fn process_bytes(&mut self, buf: &mut bytes::Bytes, finished: bool) -> Result<bool>;
    async fn notify(&mut self, notes: &mut Vec<Notifications>) -> Result<()>;
}
```

Afterwards, structs implementing `Transformer + AddTransformer` can be registered in the `ArunaReadWriter` and get plugged between the Read and Write parts of the ReadWriter.

Example:

```rust
    let file = b"This is a very very important test".to_vec();
    let mut file2 = Vec::new();

    ArunaReadWriter::new_with_writer(file.as_ref(), &mut file2)
        .add_transformer(ZstdDec::new()) // Double decompression because we can
        .add_transformer(ZstdDec::new()) // Double decompression because we can
        .add_transformer(
            ChaCha20Dec::new(b"wvwj3485nxgyq5ub9zd3e7jsrq7a92ea".to_vec()).unwrap(),
        )
        .add_transformer(
            ChaCha20Dec::new(b"99wj3485nxgyq5ub9zd3e7jsrq7a92ea".to_vec()).unwrap(),
        )
        .add_transformer(
            ChaCha20Enc::new(false, b"99wj3485nxgyq5ub9zd3e7jsrq7a92ea".to_vec()).unwrap(),
        )
        .add_transformer(
            ChaCha20Enc::new(false, b"wvwj3485nxgyq5ub9zd3e7jsrq7a92ea".to_vec()).unwrap(),
        ) // Double encryption because we can
        .add_transformer(ZstdEnc::new(2, false)) // Double compression because we can
        .add_transformer(ZstdEnc::new(1, false))
        .process()
        .await
        .unwrap();
    assert_eq!(file, file2)

```

This example creates a `Vec<u8>` from a byte array (which, as a slice, implements AsyncRead) and sinks the data into another `Vec<u8>` (which implements AsyncWrite). In between, custom data transformations can take place. Please note: the order of execution is reversed compared to the `add_transformer` calls, so you have to start with the "last" step and end with the "first" one.

The example first compresses the data twice with a custom padded Zstandard compression component and afterwards encrypts the result twice with ChaCha20-Poly1305. All of these steps are then reversed, resulting in the original data.

## Notes for own implementations

The `AddTransformer` trait is used to register a transformer and chain it to the next one via dynamic dispatch over multiple `Transformer`s. For this, your struct should contain an `Option<Box<dyn Transformer + Send + 'a>>` field that is set via `add_transformer`.
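
A minimal sketch of what such a struct could look like (the struct name `PassThrough` is purely illustrative and not part of the crate):

```rust
// Illustrative only: a transformer that does nothing except hold the next
// element of the chain. `Transformer` and `AddTransformer` are the traits
// shown above.
pub struct PassThrough<'a> {
    next: Option<Box<dyn Transformer + Send + 'a>>,
}

impl<'a> AddTransformer<'a> for PassThrough<'a> {
    fn add_transformer(&mut self, t: Box<dyn Transformer + Send + 'a>) {
        // Store the next transformer; it will receive this transformer's output.
        self.next = Some(t);
    }
}
```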

The rest of the main logic is built around the `process_bytes` function.

```rust
async fn process_bytes(&mut self, buf: &mut bytes::Bytes, finished: bool) -> Result<bool>;
```

To work properly, a few basic rules should be followed. The idea is that your `Transformer` receives a mutable buffer with bytes that you can transform. Once you have transformed the data (either all of it or a chunk from an internal buffer), it is handed on to the next transformer's `process_bytes` method.
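
As a rough illustration of this forwarding pattern, the trait implementation for the hypothetical `PassThrough` struct from above might look like this (a sketch only; the exact semantics of the returned `bool` and the `Notifications` type are assumptions based on the signatures shown earlier):

```rust
#[async_trait::async_trait]
impl<'a> Transformer for PassThrough<'a> {
    async fn process_bytes(&mut self, buf: &mut bytes::Bytes, finished: bool) -> Result<bool> {
        // A real transformer would modify `buf` here (or stage data in an
        // internal buffer); this sketch forwards it unchanged.
        if let Some(next) = &mut self.next {
            next.process_bytes(buf, finished).await
        } else {
            // End of the chain: report whether the stream is finished
            // (assumed meaning of the returned bool).
            Ok(finished)
        }
    }

    async fn notify(&mut self, notes: &mut Vec<Notifications>) -> Result<()> {
        // Pass notifications further down the chain if there is one.
        if let Some(next) = &mut self.next {
            next.notify(notes).await?;
        }
        Ok(())
    }
}
```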

# The ARUNA file format

This section contains the formal description of the aruna (`.aruna`, equivalent to `.zst.c4gh`) file format, a file format that enables compression and encryption while still maintaining a reasonably performant indexing solution for large multi-gigabyte files. It is optimized for use with object storage solutions like S3.

## Specification

The core of the aruna file format is the combination of GA4GH's crypt4gh encryption format with the zstandard compression algorithm (RFC8878). This is extended by an optional custom footer block containing positional information for decrypting and decompressing blocks within larger files.

### Structure

Aruna files consist of three distinct parts: a header section, followed by blocks of compressed and encrypted data, and an optional footer section containing indirect index information and block sizes.

#### Data structure

For compression, the data SHOULD first be split into raw data chunks of exactly 5 MiB (except the last one). These chunks MUST be compressed using the zstandard algorithm with a compression level of choice and MAY optionally end with a MAC. Each compressed frame MUST be followed by a skippable frame as defined in RFC8878 if the resulting compressed size is not a multiple of 65536 bytes (64 KiB) and the raw file size was more than 5 MiB. The skippable frame SHOULD use 0x184D2A50 as Magic_Number and SHOULD avoid 0x184D2A51 and 0x184D2A52 to prevent confusion with the custom footer section. A skippable frame MUST be used to align the total compressed size to a multiple of the encryption block size of 65536 bytes (except for the last block) if more than one chunk exists. Because skippable frames have a minimum size of 8 bytes, they extend the data by at worst 65536 + 7 = 65543 bytes. Raw files that are smaller than 5 MiB SHOULD NOT contain any skippable frames and SHOULD omit any indexing for performance reasons.
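
To make the alignment rule concrete, here is a small sketch of the padding calculation (non-normative; the constant and function names are made up for this example, and the 8 byte minimum corresponds to the skippable frame's Magic_Number plus Frame_Size fields):

```rust
const BLOCK_SIZE: u64 = 65_536; // encryption block size in bytes
const MIN_SKIPPABLE: u64 = 8;   // 4 byte Magic_Number + 4 byte Frame_Size

/// Total size of the skippable frame needed to pad a compressed chunk up to
/// the next multiple of BLOCK_SIZE (0 if it is already aligned).
fn padding_size(compressed_len: u64) -> u64 {
    let rest = compressed_len % BLOCK_SIZE;
    if rest == 0 {
        0
    } else if BLOCK_SIZE - rest >= MIN_SKIPPABLE {
        BLOCK_SIZE - rest
    } else {
        // Less than 8 bytes left in the current block: the frame has to spill
        // into the next block, giving the worst case of 65_536 + 7 = 65_543.
        2 * BLOCK_SIZE - rest
    }
}
```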

The resulting blocks of compressed data MUST be encrypted as ChaCha20-Poly1305_ietf encrypted blocks as specified in RFC7539, with 65536 bytes size, using a securely generated random encryption secret. All blocks SHOULD be preceded by a per-block randomly generated 12 byte nonce and end with a 16 byte message authentication code (MAC). This results in a total block size of 12 + 65536 + 16 = 65564 bytes. The last encrypted block of the file CAN be smaller than this if the file has an uncompressed size of less than 5 MiB.
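
For illustration, the on-disk layout of a single encrypted block can be pictured as follows (a sketch; the type and field names are not part of any crate):

```rust
/// Illustrative layout of one encrypted block (not an actual crate type).
struct EncryptedBlock {
    nonce: [u8; 12],      // per-block random nonce
    payload: [u8; 65536], // one encrypted 64 KiB block of compressed data
    mac: [u8; 16],        // Poly1305 message authentication code
} // 12 + 65_536 + 16 = 65_564 bytes in total
```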

If the file is larger than 5 MiB, the number of blocks that make up 5 MiB of raw data SHOULD be summed up, resulting in a 1 byte unsigned integer between 1 and 81 (up to +2 = 83 for the last chunk). 81 is the maximum because 5 MiB corresponds to exactly 80 chunks of 65536 bytes, and in the worst case with no compression the skippable frame could extend this by at most one block. These index numbers are stored, in order, in the last one or two encrypted blocks of the file as skippable frames and serve as an index for fast access to the data.
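
As a small illustration (the function name is hypothetical), one such index entry could be derived from the compressed and padded size of a 5 MiB raw chunk like this:

```rust
/// Number of 65_536 byte encryption blocks that one 5 MiB chunk of raw data
/// occupies after compression and padding; stored as a 1 byte unsigned
/// integer in the footer (at most 81, or 83 for the last chunk).
fn index_entry(compressed_and_padded_len: u64) -> u8 {
    // div_ceil also covers the last chunk, which may not be block-aligned.
    compressed_and_padded_len.div_ceil(65_536) as u8
}
```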

#### Header

The primary header is identical to the header specified by the crypt4gh standard and contains the block and encryption information for a specific recipient. This header is generated ad-hoc and NOT stored with the data itself to avoid re-encrypting the first section multiple times.

#### Footer

The footer consists of one or two encrypted 65536 byte sized blocks of skippable frames that contain 1 byte unsigned integers with index information about each block of 5 MiB of raw uncompressed data, in order. These blocks have the following structure in little-endian format.

If the footer contains two blocks (indicated by the Magic_Number 0x184D2A52), both blocks should repeat the Header / Frame_Size / Block_number sections with the same information.

## Practical guidance

This section contains practical recommendations for building encryption logic that complies with this format.

### Compression and Encryption

### Decryption and Decompression

This procedure has two options: a simple single-threaded one and a more parallelizable multi-threaded one. Multi-threading only gives a significant advantage for files larger than 10-20 MiB.

#### Option A (single-threaded):

#### Option C (specific Range):

If you want to read only a specific range from the file, the footer index described above can be used to locate and fetch just the relevant encrypted blocks.
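
A rough, non-normative sketch of how this lookup could work (all names are illustrative; the 65,564 byte encrypted block size follows from the block layout described earlier):

```rust
const RAW_CHUNK_SIZE: u64 = 5 * 1024 * 1024; // 5 MiB of raw data per chunk
const ENC_BLOCK_SIZE: u64 = 65_564;          // 12 byte nonce + 65_536 bytes data + 16 byte MAC

/// Given the footer index (encrypted blocks per 5 MiB raw chunk) and a raw
/// byte range, return the byte range (relative to the first data block) of
/// the encrypted blocks that must be fetched and decrypted.
fn encrypted_range(index: &[u8], raw_start: u64, raw_end: u64) -> (u64, u64) {
    let first_chunk = (raw_start / RAW_CHUNK_SIZE) as usize;
    let last_chunk = (raw_end.saturating_sub(1) / RAW_CHUNK_SIZE) as usize;

    // Encrypted blocks located before the first chunk we are interested in.
    let blocks_before: u64 = index[..first_chunk].iter().map(|&b| b as u64).sum();
    // Encrypted blocks covering the requested chunks.
    let blocks_within: u64 = index[first_chunk..=last_chunk].iter().map(|&b| b as u64).sum();

    let start = blocks_before * ENC_BLOCK_SIZE;
    (start, start + blocks_within * ENC_BLOCK_SIZE)
}
```

The fetched blocks are then decrypted and decompressed as described above, and the requested range is sliced out of the resulting raw data.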

## Discussion

The Aruna file format considers multiple aspects like compression ratio, access speed, etc. and tries to strike a balanced middle ground that is well suited for a wide range of file types. By utilizing existing standard algorithms and procedures, the resulting file is readable by existing tools and does not need specific software to be handled. However, the full potential of this file format can only be realized with customized software that uses the additional information stored in the skippable frames.