This crate provides convenience methods on top of arrow2 to facilitate conversion between nested Rust types and the Arrow memory format.
The following features are supported:
- `u8`, `u16`, `u32`, `u64`, `i8`, `i16`, `i32`, `i64`, `f32`, `f64`
- `bool`, `String`, `Binary`
- `chrono::NaiveDate`, `chrono::NaiveDateTime`
- `ArrowField` macro or by `Vec`

The following are not yet supported.
Note: This is not an exhaustive list. Please see the repo issues for current work in progress, and add proposals for features that would be useful for your project.
The API is inspired by serde with the necessary modifications to generate a compile-time Arrow schema and to integrate with arrow2 data structures.
Types (currently only structures) can be annotated with the `ArrowField` procedural macro to derive the following:

- `arrow2::array::MutableArray` for serialization
- `ArrowField`, `ArrowSerialize`, and `ArrowDeserialize` traits.

Serialization can be performed by using the `try_into_arrow` method from the `TryIntoArrow` trait or by manually pushing elements to the generated `arrow2::array::MutableArray`.
Deserialization can be performed by using the `try_into_collection` method from the `TryIntoCollection` trait or by iterating through the iterator provided by `arrow_array_deserialize_iterator`.
Both serialization and deserialization perform memory copies for the elements converted. For example, iterating through the deserialize iterator copies memory from the arrow2 array into the structure that the iterator returns. Deserialization could be made more efficient by supporting structs with references.
Below is a bare-bones example that does a round trip conversion of a struct with a single field.
Please see the complex_example.rs for usage of the full functionality.
```rust
// Simple example

use arrow2::array::Array;
use arrow2_convert::{deserialize::TryIntoCollection, serialize::TryIntoArrow, ArrowField};

// `ArrowField` generates the conversion traits; `Debug` and `PartialEq`
// are needed by the `assert_eq!` at the end of the test.
#[derive(Debug, Clone, PartialEq, ArrowField)]
pub struct Foo {
    name: String,
}

#[test]
fn test_simple_roundtrip() {
    // an item
    let original_array = [
        Foo { name: "hello".to_string() },
        Foo { name: "one more".to_string() },
        Foo { name: "good bye".to_string() },
    ];

    // serialize to an arrow array. try_into_arrow() is enabled by the TryIntoArrow trait
    let arrow_array: Box<dyn Array> = original_array.try_into_arrow().unwrap();

    // which can be cast to an Arrow StructArray and be used for all kinds of IPC, FFI, etc.
    // supported by `arrow2`
    let struct_array = arrow_array
        .as_any()
        .downcast_ref::<arrow2::array::StructArray>()
        .unwrap();
    assert_eq!(struct_array.len(), 3);

    // deserialize back to our original vector via TryIntoCollection trait.
    let round_trip_array: Vec<Foo> = arrow_array.try_into_collection().unwrap();
    assert_eq!(round_trip_array, original_array);
}
```
The goal is to allow the Arrow memory model to be used by an existing Rust type tree and to facilitate type conversions when needed. Ideally, for performance, if the Arrow memory model, or specifically the API provided by the arrow2 crate, exactly matches the custom type tree, then no conversions should be performed.
To achieve this, the following approach is used:
- Introduce three traits, `ArrowField`, `ArrowSerialize`, and `ArrowDeserialize`, that can be implemented by types that can be represented in Arrow. Implementations are provided for the built-in `arrow2` types, and custom implementations can be provided for other types.
- Blanket implementations are provided for types that recursively contain types that implement the above traits, e.g. `Option<T>`, `Vec<T>`, `Vec<Option<T>>`, `Vec<Vec<Option<T>>>`, etc. The blanket implementation needs to be enabled by the `arrow_enable_vec_for_type` macro on the primitive type. This explicit enabling is needed since Vec
- Supporting traits to expose functionality not exposed via arrow2's traits.
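The blanket-implementation idea above can be sketched with plain Rust traits. This is a minimal illustration only: `DataType`, `Field`, `VecEnabled`, and `enable_vec_for_type!` are hypothetical stand-ins written for this example and are not the definitions used by arrow2 or by this crate. It assumes that opting in is modelled by a marker trait.

```rust
// Hypothetical, simplified stand-ins; NOT the arrow2 or arrow2_convert definitions.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    List(Box<DataType>),
}

/// Describes how a Rust type maps to an Arrow-like data type.
trait Field {
    fn data_type() -> DataType;
    fn is_nullable() -> bool {
        false
    }
}

/// Marker trait: only types that opt in may be used as `Vec` elements.
trait VecEnabled {}

/// Opt-in macro, playing the role the text gives to `arrow_enable_vec_for_type`.
macro_rules! enable_vec_for_type {
    ($t:ty) => {
        impl VecEnabled for $t {}
    };
}

impl Field for i32 {
    fn data_type() -> DataType {
        DataType::Int32
    }
}
impl Field for String {
    fn data_type() -> DataType {
        DataType::Utf8
    }
}
enable_vec_for_type!(i32);
enable_vec_for_type!(String);

// Blanket implementations: containers defer to their element type.
impl<T: Field> Field for Option<T> {
    fn data_type() -> DataType {
        T::data_type()
    }
    fn is_nullable() -> bool {
        true
    }
}
impl<T: Field + VecEnabled> Field for Vec<T> {
    fn data_type() -> DataType {
        DataType::List(Box::new(T::data_type()))
    }
}
// Containers are themselves valid `Vec` elements, which allows arbitrary nesting.
impl<T: Field> VecEnabled for Option<T> {}
impl<T: Field + VecEnabled> VecEnabled for Vec<T> {}

fn main() {
    // The nested type resolves entirely through the blanket implementations.
    println!("{:?}", <Vec<Vec<Option<i32>>> as Field>::data_type());
}
```

In this sketch, a primitive that has not been passed to the macro cannot appear inside a `Vec`, so the compiler rejects the nesting instead of silently picking a representation.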
Ideally, for code reusability, we wouldn’t have to reimplement `ArrowSerialize` and `ArrowDeserialize` for large and fixed offset types since the primitive types are the same. However, this requires the trait functions to take a generic bounded mutable array as an argument instead of a single array type. This in turn requires the `ArrowSerialize` and `ArrowDeserialize` implementations to be able to specify the bounds as part of the associated type, which is not possible without generic associated types.
As a result, we’re forced to sacrifice code reusability and introduce a little bit of complexity by providing separate `ArrowSerialize` and `ArrowDeserialize` implementations for large types via placeholder structures. This also requires introducing the `Type` attribute to `ArrowField` so that the large types can be used via a field attribute without affecting the structure field types.
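To make the placeholder approach concrete, here is a rough, std-only sketch. All names (`Utf8Array`, `Field`, `Serialize`, `LargeString`, `serialize_all`) are hypothetical and invented for this illustration; they only mimic the shape of the design described above. The placeholder `LargeString` is never instantiated: it just selects 64-bit offsets while the user's data stays a plain `String`.

```rust
use std::convert::TryFrom;

// Hypothetical, simplified stand-ins; NOT the arrow2 or arrow2_convert definitions.
#[derive(Debug, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
}

/// Toy string array: concatenated bytes plus end offsets of width `O`.
#[derive(Debug)]
struct Utf8Array<O> {
    offsets: Vec<O>,
    values: Vec<u8>,
}

impl<O: TryFrom<usize>> Utf8Array<O> {
    fn to_offset(n: usize) -> O {
        O::try_from(n).ok().expect("offset does not fit in the offset type")
    }
    fn new() -> Self {
        Utf8Array { offsets: vec![Self::to_offset(0)], values: Vec::new() }
    }
    fn push(&mut self, s: &str) {
        self.values.extend_from_slice(s.as_bytes());
        self.offsets.push(Self::to_offset(self.values.len()));
    }
}

trait Field {
    /// The Rust type that appears in user structs.
    type Native;
    fn data_type() -> DataType;
}

trait Serialize: Field {
    /// One concrete array type per implementation.
    type MutableArray;
    fn new_array() -> Self::MutableArray;
    fn push(value: &Self::Native, array: &mut Self::MutableArray);
}

impl Field for String {
    type Native = String;
    fn data_type() -> DataType { DataType::Utf8 }
}
impl Serialize for String {
    type MutableArray = Utf8Array<i32>;
    fn new_array() -> Self::MutableArray { Utf8Array::new() }
    fn push(value: &Self::Native, array: &mut Self::MutableArray) { array.push(value) }
}

/// Placeholder: selects the 64-bit offset layout for `String` values.
struct LargeString;
impl Field for LargeString {
    type Native = String;
    fn data_type() -> DataType { DataType::LargeUtf8 }
}
impl Serialize for LargeString {
    // Same body as the `String` impl; only the array type differs.
    type MutableArray = Utf8Array<i64>;
    fn new_array() -> Self::MutableArray { Utf8Array::new() }
    fn push(value: &Self::Native, array: &mut Self::MutableArray) { array.push(value) }
}

/// Serializes native values through whichever implementation `F` names.
fn serialize_all<F: Serialize>(values: &[F::Native]) -> F::MutableArray {
    let mut array = F::new_array();
    for v in values {
        F::push(v, &mut array);
    }
    array
}

fn main() {
    let names = vec!["ab".to_string(), "cde".to_string()];
    let small = serialize_all::<String>(&names);
    let large = serialize_all::<LargeString>(&names);
    println!("{:?}: {:?}", String::data_type(), small.offsets);
    println!("{:?}: {:?}", LargeString::data_type(), large.offsets);
}
```

The duplicated `push` bodies are the reusability cost the text mentions; a field-level attribute would then only need to name `LargeString` in place of `String` when choosing the implementation.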
While serde is the de-facto serialization framework in Rust, it introduces a layer of indirection. Specifically, arrow2 uses Apache Arrow's in-memory columnar format, while serde is row based. Using serde requires a Serializer/Deserializer implementation around each arrow2 MutableArray and Array, leading to a heavyweight wrapper around simple array manipulations.
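The row-versus-column distinction can be illustrated without any Arrow dependency. The sketch below uses hypothetical types (`Reading`, `ReadingColumns`) invented for this example: rows are pushed field by field into one vector per column, which is the kind of direct array manipulation that does not need a Serializer/Deserializer layer in between.

```rust
/// Row-based representation: one struct per record.
#[derive(Debug, Clone, PartialEq)]
struct Reading {
    sensor: String,
    value: f64,
}

/// Columnar representation: one contiguous vector per field.
#[derive(Debug, Default, PartialEq)]
struct ReadingColumns {
    sensor: Vec<String>,
    value: Vec<f64>,
}

impl ReadingColumns {
    /// Appends one row by pushing each field onto its own column (copies the data).
    fn push(&mut self, row: &Reading) {
        self.sensor.push(row.sensor.clone());
        self.value.push(row.value);
    }

    fn from_rows(rows: &[Reading]) -> Self {
        let mut columns = Self::default();
        for row in rows {
            columns.push(row);
        }
        columns
    }

    /// Rebuilds rows by walking all columns in lockstep (copies the data again).
    fn iter_rows(&self) -> impl Iterator<Item = Reading> + '_ {
        self.sensor
            .iter()
            .zip(self.value.iter())
            .map(|(sensor, value)| Reading { sensor: sensor.clone(), value: *value })
    }
}

fn main() {
    let rows = vec![
        Reading { sensor: "a".to_string(), value: 1.5 },
        Reading { sensor: "b".to_string(), value: 2.5 },
    ];
    let columns = ReadingColumns::from_rows(&rows);
    let back: Vec<Reading> = columns.iter_rows().collect();
    println!("{:?}\n{:?}", columns, back);
}
```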
Arrow's in-memory format can be serialized/deserialized to a wide variety of formats including Apache Parquet, JSON, Apache Avro, Arrow IPC, and the Arrow FFI specification.
One of the objectives of this crate is to derive a compile-time Arrow schema for Rust structs, which we achieve via the `ArrowField` trait.
Other crates that integrate serde, for example serde_arrow, either need an explicit schema or need to infer the schema at runtime.
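As a rough illustration of what a compile-time schema means here, the sketch below derives a struct's schema purely from its field types, with no instance of the struct and no sample data. The `DataType` enum, the `Field` trait, and the `User` struct are hypothetical and written by hand for this example; a derive macro could emit the `impl Field for User` block instead.

```rust
// Hypothetical, simplified stand-ins; NOT the arrow2 or arrow2_convert definitions.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
    /// (field name, field type, nullable)
    Struct(Vec<(&'static str, DataType, bool)>),
}

trait Field {
    fn data_type() -> DataType;
    fn is_nullable() -> bool {
        false
    }
}

impl Field for i64 {
    fn data_type() -> DataType {
        DataType::Int64
    }
}
impl Field for String {
    fn data_type() -> DataType {
        DataType::Utf8
    }
}
impl<T: Field> Field for Option<T> {
    fn data_type() -> DataType {
        T::data_type()
    }
    fn is_nullable() -> bool {
        true
    }
}

#[allow(dead_code)]
struct User {
    id: i64,
    nickname: Option<String>,
}

// Written by hand here; this is the kind of code a derive macro could generate.
impl Field for User {
    fn data_type() -> DataType {
        DataType::Struct(vec![
            ("id", <i64 as Field>::data_type(), <i64 as Field>::is_nullable()),
            (
                "nickname",
                <Option<String> as Field>::data_type(),
                <Option<String> as Field>::is_nullable(),
            ),
        ])
    }
}

fn main() {
    // No `User` value is ever constructed: the schema comes from the types alone.
    println!("{:?}", <User as Field>::data_type());
}
```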
Lastly, the serde ecosystem comes with its own default representations for temporal types that differ from the default representations of arrow2. It seemed best to avoid any conflicts by introducing a new set of traits.
The biggest disadvantage of not using serde falls on types in codebases that already implement the serde traits: they will additionally need to implement the traits required by this crate.
Licensed under either of
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.