Avrow is a pure Rust implementation of the Avro specification, a row-based data serialization system. The Avro data serialization format is widely used in big data streaming systems such as Kafka and Spark. Within avro's context, an avro encoded file or byte stream is called a "data file". To write data in avro encoded format, one needs a schema, which is provided in json format. Here's an example of an avro schema represented in json:
```json
{
  "type": "record",
  "name": "LongList",
  "aliases": ["LinkedLongs"],
  "fields" : [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
```
The above schema is of type record with fields and represents a linked list of 64-bit integers. In most implementations, this schema is then fed to a `Writer` instance along with a buffer to write encoded data to. One can then call one of the `write` methods on the writer to write data. One distinguishing aspect of avro is that the schema for the encoded data is written in the header of the data file. This means that for reading data you don't need to provide a schema to a `Reader` instance. The spec also allows providing a reader schema to filter data when reading.
The Avro specification provides two kinds of encoding:

* Binary encoding - Efficient and takes less space on disk.
* JSON encoding - When you want a readable version of avro encoded data. Also used for debugging purposes.

This crate implements only the binary encoding, as that is the format used in practice for performance and storage reasons.
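
As a rough illustration of what the binary encoding does at the byte level, the sketch below encodes and decodes an Avro `long` using a zigzag mapping followed by a variable-length (base-128) encoding, which is how the Avro specification describes `int` and `long` values. It uses only the standard library and is independent of avrow. The function names (`zigzag`, `unzigzag`, `encode_long`, `decode_long`) are made up for this example and are not part of avrow's API.

```rust
/// Zigzag-map a signed integer so that small magnitudes become small unsigned values.
fn zigzag(n: i64) -> u64 {
    ((n << 1) ^ (n >> 63)) as u64
}

/// Inverse of `zigzag`.
fn unzigzag(z: u64) -> i64 {
    ((z >> 1) as i64) ^ -((z & 1) as i64)
}

/// Encode a `long`: zigzag first, then 7 bits per byte, least significant group first.
/// The high bit of each byte says whether more bytes follow.
fn encode_long(n: i64) -> Vec<u8> {
    let mut z = zigzag(n);
    let mut out = Vec::new();
    while z >= 0x80 {
        out.push(((z as u8) & 0x7f) | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
    out
}

/// Decode one `long` from the front of `buf`.
/// Returns the value and the number of bytes consumed, or `None` if the input is malformed.
fn decode_long(buf: &[u8]) -> Option<(i64, usize)> {
    let mut z: u64 = 0;
    for (i, &b) in buf.iter().enumerate().take(10) {
        z |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((unzigzag(z), i + 1));
        }
    }
    None
}

fn main() {
    for &n in &[0i64, -1, 1, -64, 64, 1_000_000] {
        let bytes = encode_long(n);
        let (back, used) = decode_long(&bytes).expect("valid encoding");
        println!("{} -> {:02x?} -> {} ({} byte(s))", n, bytes, back, used);
    }
}
```

Small values, positive or negative, fit in a single byte, which is part of why the binary encoding is compact.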
* All compression codecs (`deflate`, `bzip2`, `snappy`, `xz`, `zstd`) supported as per spec.
* Built around `Read` and `Write` types, avrow tries to mimic the same APIs as Rust's standard library APIs for minimal learning overhead. Writing avro values is simply calling `write` or `serialize` (with serde) and reading avro values is simply using iterators.
* Optional functionality is feature-gated and enabled with the respective flag (e.g. `--features zstd`).
* Users can configure a `Reader` with a reader schema and only read data relevant to their use case.
* Schema fingerprinting (`rabin64`, `sha256`, `md5`) support.

Note: This is not a complete spec implementation and remaining features being implemented are listed under Todo section.
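
The `rabin64` fingerprint in the list above refers to the 64-bit Rabin fingerprint (CRC-64-AVRO) that the Avro specification defines over a schema's canonical form. A minimal standard-library sketch of that algorithm, transcribed from the reference pseudo-code in the spec, might look like this. The names `fp_table` and `fingerprint64` are chosen for this example, and avrow's own fingerprinting API may look different.

```rust
/// Fingerprint of the empty input, as given in the Avro specification.
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

/// Build the 256-entry lookup table used by the byte-at-a-time loop.
fn fp_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            // `(fp & 1).wrapping_neg()` is all ones when the low bit is set, else zero.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        *slot = fp;
    }
    table
}

/// Compute the 64-bit fingerprint of `data`.
/// For schemas, `data` should be the UTF-8 bytes of the canonical form.
fn fingerprint64(data: &[u8]) -> u64 {
    // Rebuilt on every call to keep the sketch short; real code would cache it.
    let table = fp_table();
    let mut fp = EMPTY;
    for &b in data {
        fp = (fp >> 8) ^ table[((fp ^ u64::from(b)) & 0xff) as usize];
    }
    fp
}

fn main() {
    // For a primitive schema such as "null", the canonical form is just the quoted type name.
    let canonical = "\"null\"";
    println!("{:016x}", fingerprint64(canonical.as_bytes()));
}
```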
Add avrow as a dependency to `Cargo.toml`:

```toml
[dependencies]
avrow = "0.1"
```
```rust
use anyhow::Error;
use avrow::{Schema, Writer};
use std::str::FromStr;

fn main() -> Result<(), Error> {
    // Create schema from json
    let schema = Schema::from_str(r##"{"type":"string"}"##)?;
    // or from a path
    let schema2 = Schema::from_path("./stringschema.avsc")?;
    // Create an output stream
    let stream = Vec::new();
    // Create a writer (it must be mutable to write to it)
    let mut writer = Writer::new(&schema, stream)?;
    // Write your data!
    let res = writer.write("Hey")?;
    // or using serialize method for serde derived types.
    let res = writer.serialize("there!")?;

    Ok(())
}
```
For simple and native Rust types, avrow provides a `From` impl for Avro value types. For compound or user defined types (structs, enums), one can use the `serialize` method which relies on serde. Alternatively, one can construct `avrow::Value` instances, which is a more verbose way to write avro values and should be a last resort.
```rust
use anyhow::Error;
use avrow::{Reader, Schema};
use std::str::FromStr;

fn main() -> Result<(), Error> {
    // Reader schema. The header embedded in `data` declares {"type":"bytes"}.
    let schema = Schema::from_str(r##""bytes""##)?;
    // An avro data file: header followed by one deflate-compressed block.
    let data = vec![
        79, 98, 106, 1, 4, 22, 97, 118, 114, 111, 46, 115, 99, 104, 101, 109, 97, 32, 123, 34,
        116, 121, 112, 101, 34, 58, 34, 98, 121, 116, 101, 115, 34, 125, 20, 97, 118, 114, 111,
        46, 99, 111, 100, 101, 99, 14, 100, 101, 102, 108, 97, 116, 101, 0, 145, 85, 112, 15, 87,
        201, 208, 26, 183, 148, 48, 236, 212, 250, 38, 208, 2, 18, 227, 97, 96, 100, 98, 102, 97,
        5, 0, 145, 85, 112, 15, 87, 201, 208, 26, 183, 148, 48, 236, 212, 250, 38, 208,
    ];
    // Create a Reader
    let reader = Reader::with_schema(data.as_slice(), schema)?;
    for i in reader {
        dbg!(&i);
    }

    Ok(())
}
```
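
To make the layout of a data file more concrete, the following sketch decodes the header of the byte vector from the reading example above by hand, using only the standard library. It assumes the object container file layout from the Avro specification: the magic bytes `Obj\x01`, a metadata map, and a 16-byte sync marker, followed by data blocks. `read_long`, `read_bytes`, `parse_header`, and `Header` are hypothetical helpers written for this illustration and are not avrow APIs.

```rust
use std::collections::BTreeMap;
use std::convert::TryFrom;

// The data file from the reading example above.
const DATA: &[u8] = &[
    79, 98, 106, 1, 4, 22, 97, 118, 114, 111, 46, 115, 99, 104, 101, 109, 97, 32, 123, 34,
    116, 121, 112, 101, 34, 58, 34, 98, 121, 116, 101, 115, 34, 125, 20, 97, 118, 114, 111,
    46, 99, 111, 100, 101, 99, 14, 100, 101, 102, 108, 97, 116, 101, 0, 145, 85, 112, 15, 87,
    201, 208, 26, 183, 148, 48, 236, 212, 250, 38, 208, 2, 18, 227, 97, 96, 100, 98, 102, 97,
    5, 0, 145, 85, 112, 15, 87, 201, 208, 26, 183, 148, 48, 236, 212, 250, 38, 208,
];

/// Read one zigzag varint `long` at `pos`, advancing `pos` past it.
fn read_long(buf: &[u8], pos: &mut usize) -> Option<i64> {
    let mut z: u64 = 0;
    let mut shift = 0;
    loop {
        let b = *buf.get(*pos)?;
        *pos += 1;
        if shift >= 64 {
            return None; // too many continuation bytes
        }
        z |= u64::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    Some(((z >> 1) as i64) ^ -((z & 1) as i64))
}

/// Read a length-prefixed byte string (the encoding of Avro `bytes` and `string`).
fn read_bytes<'a>(buf: &'a [u8], pos: &mut usize) -> Option<&'a [u8]> {
    let len = usize::try_from(read_long(buf, pos)?).ok()?;
    let end = pos.checked_add(len)?;
    let out = buf.get(*pos..end)?;
    *pos = end;
    Some(out)
}

struct Header {
    meta: BTreeMap<String, Vec<u8>>,
    sync: [u8; 16],
}

/// Parse the file header. Returns the header and the offset of the first data block.
fn parse_header(buf: &[u8]) -> Option<(Header, usize)> {
    if !buf.starts_with(b"Obj\x01") {
        return None;
    }
    let mut pos = 4;
    let mut meta = BTreeMap::new();
    loop {
        // Maps are written as blocks of key/value pairs, ended by a zero count.
        let mut count = read_long(buf, &mut pos)?;
        if count == 0 {
            break;
        }
        if count < 0 {
            // A negative count is followed by the block's size in bytes.
            read_long(buf, &mut pos)?;
            count = count.checked_neg()?;
        }
        for _ in 0..count {
            let key = String::from_utf8(read_bytes(buf, &mut pos)?.to_vec()).ok()?;
            let value = read_bytes(buf, &mut pos)?.to_vec();
            meta.insert(key, value);
        }
    }
    let end = pos.checked_add(16)?;
    let sync = <[u8; 16]>::try_from(buf.get(pos..end)?).ok()?;
    Some((Header { meta: meta, sync: sync }, end))
}

fn main() {
    let (header, mut pos) = parse_header(DATA).expect("not an Avro data file");
    for (key, value) in &header.meta {
        println!("{} = {}", key, String::from_utf8_lossy(value));
    }
    // Each data block starts with an object count and a byte size.
    let count = read_long(DATA, &mut pos).expect("truncated block");
    let size = read_long(DATA, &mut pos).expect("truncated block");
    println!("first block: {} object(s), {} byte(s)", count, size);
    println!("sync marker: {:?}", header.sync);
}
```

The metadata printed by this sketch is what lets a reader work without being handed the writer's schema separately.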
A more involved self-referential recursive schema example:
```rust
use anyhow::Error;
use avrow::{from_value, Codec, Reader, Schema, Writer};
use serde::{Deserialize, Serialize};
use std::str::FromStr;

#[derive(Debug, Serialize, Deserialize)]
struct LongList {
    value: i64,
    next: Option<Box<LongList>>,
}

fn main() -> Result<(), Error> {
    let schema = r##"
        {
            "type": "record",
            "name": "LongList",
            "aliases": ["LinkedLongs"],
            "fields" : [
                {"name": "value", "type": "long"},
                {"name": "next", "type": ["null", "LongList"]}
            ]
        }
    "##;

    let schema = Schema::from_str(schema)?;
    let mut writer = Writer::with_codec(&schema, vec![], Codec::Null)?;

    let value = LongList {
        value: 1i64,
        next: Some(Box::new(LongList {
            value: 2i64,
            next: Some(Box::new(LongList {
                value: 3i64,
                next: Some(Box::new(LongList {
                    value: 4i64,
                    next: Some(Box::new(LongList {
                        value: 5i64,
                        next: None,
                    })),
                })),
            })),
        })),
    };

    writer.serialize(value)?;

    // Calling into_inner performs flush internally. Alternatively, one can call flush explicitly.
    let buf = writer.into_inner()?;

    // read
    let reader = Reader::with_schema(buf.as_slice(), schema)?;
    for i in reader {
        let a: LongList = from_value(&i)?;
        dbg!(a);
    }

    Ok(())
}
```
An example of writing a json object with a conforming schema. The json object maps to an `avrow::Record` type.
```rust
use anyhow::Error;
use avrow::{from_value, Reader, Record, Schema, Writer};
use serde::{Deserialize, Serialize};
use std::str::FromStr;

#[derive(Debug, Serialize, Deserialize)]
struct Mentees {
    id: i32,
    username: String,
}

#[derive(Debug, Serialize, Deserialize)]
struct RustMentors {
    name: String,
    github_handle: String,
    active: bool,
    mentees: Mentees,
}

fn main() -> Result<(), Error> {
    let schema = Schema::from_str(
        r##"
        {
            "name": "rustmentors",
            "type": "record",
            "fields": [
                { "name": "name", "type": "string" },
                { "name": "github_handle", "type": "string" },
                { "name": "active", "type": "boolean" },
                {
                    "name": "mentees",
                    "type": {
                        "name": "mentees",
                        "type": "record",
                        "fields": [
                            {"name": "id", "type": "int"},
                            {"name": "username", "type": "string"}
                        ]
                    }
                }
            ]
        }
        "##,
    )?;

    let json_data = serde_json::from_str(
        r##"
        { "name": "bob",
          "github_handle": "ghbob",
          "active": true,
          "mentees": {"id": 1, "username": "alice"} }"##,
    )?;
    let rec = Record::from_json(json_data, &schema)?;
    let mut writer = Writer::new(&schema, vec![])?;
    writer.write(rec)?;

    let avro_data = writer.into_inner()?;
    let reader = Reader::from(avro_data.as_slice())?;
    for value in reader {
        let mentors: RustMentors = from_value(&value)?;
        dbg!(mentors);
    }

    Ok(())
}
```
If you want to have more control over the parameters of `Writer`, consider using `WriterBuilder` as shown below:
```rust
use anyhow::Error;
use avrow::{Codec, Reader, Schema, WriterBuilder};
use std::str::FromStr;

fn main() -> Result<(), Error> {
    let schema = Schema::from_str(r##""null""##)?;
    let v = vec![];
    let mut writer = WriterBuilder::new()
        .set_codec(Codec::Null)
        .set_schema(&schema)
        .set_datafile(v)
        // set any custom metadata in the header
        .set_metadata("hello", "world")
        // set after how many bytes, the writer should flush
        .set_flush_interval(128_000)
        .build()
        .unwrap();
    writer.serialize(())?;
    let v = writer.into_inner()?;

    let reader = Reader::with_schema(v.as_slice(), schema)?;
    for i in reader {
        dbg!(i?);
    }

    Ok(())
}
```
Refer to examples for more code examples.
In order to facilitate efficient encoding, the avro spec also defines compression codecs to use when serializing data.
Avrow supports all compression codecs as per spec: `deflate`, `bzip2`, `snappy`, `xz` and `zstd`.
These are feature-gated behind their respective flags. Check the `features` section of `Cargo.toml` for more details.
Quite often you will need a quick way to examine an avro file for debugging purposes.
For that, this repository also comes with the `avrow-cli` tool (`av`),
with which one can examine avro data files from the command line.

See avrow-cli repository for more details.

Installing avrow-cli:

```bash
cd avrow-cli
cargo install avrow-cli
```

Using avrow-cli (binary name is `av`):

```bash
av read -d data.avro
```

The `read` subcommand will print all rows in `data.avro` to standard out in debug format.
| Rust native types (primitive types) | Avro (`Value`) |
| ----------------------------------- | -------------- |
| `()`, `Option::None`                | `null`         |
| `bool`                              | `boolean`      |
| `i8`, `u8`, `i16`, `u16`, `i32`, `u32` | `int`       |
| `i64`, `u64`                        | `long`         |
| `f32`                               | `float`        |
| `f64`                               | `double`       |
| `&[u8]`, `Vec<u8>`                  | `bytes`        |
| `&str`, `String`                    | `string`       |

Complex

| Rust native types (complex types)                      | Avro     |
| ---------------------------------------------------- | -------- |
| `struct Foo {..}`                                      | `record` |
| `enum Foo {A,B}` (variants cannot have data in them)   | `enum`   |
| `Vec<T> where T: Into<Value>`                          | `array`  |
| `HashMap<String, T> where T: Into<Value>`              | `map`    |
| `T where T: Into<Value>`                               | `union`  |
| `Vec<u8>`: Length equal to size defined in schema      | `fixed`  |
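
To illustrate how a mapping like the one in these tables can be expressed with `From`/`Into` conversions, here is a small standard-library sketch. The `Value` enum below is a hypothetical stand-in written for this example only. It is not avrow's actual `Value` definition, it covers only part of the tables, and avrow's real conversions may differ in detail.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Avro value type (NOT avrow's `Value`).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Boolean(bool),
    Int(i32),
    Long(i64),
    Double(f64),
    Bytes(Vec<u8>),
    Str(String),
    Array(Vec<Value>),
    Map(HashMap<String, Value>),
}

impl From<()> for Value {
    fn from(_: ()) -> Self { Value::Null }
}
impl From<bool> for Value {
    fn from(b: bool) -> Self { Value::Boolean(b) }
}
impl From<i32> for Value {
    fn from(n: i32) -> Self { Value::Int(n) }
}
impl From<i64> for Value {
    fn from(n: i64) -> Self { Value::Long(n) }
}
impl From<f64> for Value {
    fn from(x: f64) -> Self { Value::Double(x) }
}
impl<'a> From<&'a [u8]> for Value {
    fn from(b: &'a [u8]) -> Self { Value::Bytes(b.to_vec()) }
}
impl<'a> From<&'a str> for Value {
    fn from(s: &'a str) -> Self { Value::Str(s.to_owned()) }
}
/// `None` maps to null; `Some(v)` maps to whatever `v` maps to.
impl<T: Into<Value>> From<Option<T>> for Value {
    fn from(o: Option<T>) -> Self {
        match o {
            Some(v) => v.into(),
            None => Value::Null,
        }
    }
}
/// Any vector of convertible items becomes an array.
impl<T: Into<Value>> From<Vec<T>> for Value {
    fn from(items: Vec<T>) -> Self {
        Value::Array(items.into_iter().map(Into::into).collect())
    }
}
/// Any string-keyed map of convertible items becomes a map.
impl<T: Into<Value>> From<HashMap<String, T>> for Value {
    fn from(m: HashMap<String, T>) -> Self {
        Value::Map(m.into_iter().map(|(k, v)| (k, v.into())).collect())
    }
}

fn main() {
    let mut scores = HashMap::new();
    scores.insert("alice".to_string(), 7i64);

    let values: Vec<Value> = vec![
        ().into(),
        true.into(),
        42i32.into(),
        1.5f64.into(),
        "hello".into(),
        (&b"raw"[..]).into(),
        vec![1i64, 2, 3].into(),
        scores.into(),
    ];
    for v in &values {
        println!("{:?}", v);
    }
}
```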
Please see the CHANGELOG for a release history.
All kinds of contributions are welcome.
Head over to CONTRIBUTING.md for contribution guidelines.
Avrow works on stable Rust, version 1.37 and later. It does not use any nightly features.
Dual licensed under either of Apache License, Version 2.0 or MIT license at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.