This crate contains the official native Rust implementation of the Apache Arrow in-memory format, governed by the Apache Software Foundation. Additional details can be found on crates.io, docs.rs and examples.
This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions.
The arrow crate follows the SemVer standard defined by Cargo and works well within the Rust crate ecosystem.
However, for historical reasons, this crate uses major version numbers greater than 0 (e.g. 18.0.0), unlike many other crates in the Rust ecosystem, which spend extended time releasing 0.x versions to signal planned ongoing API changes. Minor arrow releases contain only compatible changes, while major releases may contain breaking API changes.
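To make the compatibility rule concrete, here is a minimal sketch of how Cargo's default (caret) version requirements treat major, minor and patch numbers. The helper `caret_compatible` and the `Version` alias are hypothetical names for illustration, not part of arrow or Cargo, and the version numbers other than 18.0.0 are made up for the example.

```rust
/// A (major, minor, patch) version triple. Hypothetical alias for this sketch.
type Version = (u64, u64, u64);

/// Returns true if `candidate` satisfies a default (caret) requirement on
/// `required`, following Cargo's rule that the leftmost non-zero component
/// must not change.
fn caret_compatible(required: Version, candidate: Version) -> bool {
    // Older versions never satisfy the requirement (tuples compare lexicographically).
    if candidate < required {
        return false;
    }
    match required {
        // Major >= 1: any later release with the same major version is compatible.
        (major, _, _) if major > 0 => candidate.0 == major,
        // 0.x.y: only releases with the same minor version are compatible.
        (0, minor, _) if minor > 0 => candidate.0 == 0 && candidate.1 == minor,
        // 0.0.z: only the exact same version is compatible.
        _ => candidate == required,
    }
}

fn main() {
    let required = (18, 0, 0);
    // A later minor release within the same major version is accepted.
    println!("18.1.0 compatible: {}", caret_compatible(required, (18, 1, 0)));
    // A new major release may contain breaking changes and is rejected.
    println!("19.0.0 compatible: {}", caret_compatible(required, (19, 0, 0)));
}
```

Because arrow uses major versions greater than 0, a dependency declared on one major version will pick up later minor releases automatically but never a new major release.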
The arrow crate provides the following features which may be enabled:
- csv (default) - support for reading and writing Arrow arrays to/from CSV files
- ipc (default) - support for the arrow-flight IPC and wire format
- prettyprint - support for formatting record batches as textual columns
- js - support for building arrow for WebAssembly / JavaScript
- simd - (Requires Nightly Rust) alternate optimized implementations of some compute kernels using explicit SIMD instructions available through packed_simd_2
- chrono-tz - support for parsing timezones using chrono-tz

Arrow seeks to uphold the Rust Soundness Pledge as articulated eloquently here. Specifically:
The intent of this crate is to be free of soundness bugs. The developers will do their best to avoid them, and welcome help in analyzing and fixing them
Where soundness in turn is defined as:
Code is unable to trigger undefined behaviour using safe APIs
One way to ensure this would be to not use unsafe. However, as described in the opening chapter of the Rustonomicon, this is not a requirement, and flexibility in this regard is actually one of Rust's great strengths.
In particular, there are a number of scenarios where unsafe is largely unavoidable.
Additionally, this crate exposes a number of unsafe APIs, allowing downstream crates to explicitly opt out of potentially expensive invariant checking where appropriate.
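The general pattern of pairing a validating constructor with an unsafe, unchecked one can be sketched using only the standard library. The `Utf8Values` type below is hypothetical and is not an arrow type; it only illustrates how a safe API enforces an invariant while an unsafe API lets the caller skip the check and take responsibility for it.

```rust
use std::str::Utf8Error;

/// Hypothetical wrapper over bytes that are guaranteed to be valid UTF-8.
pub struct Utf8Values {
    bytes: Vec<u8>,
}

impl Utf8Values {
    /// Safe constructor: validates the invariant, at O(n) cost.
    pub fn try_new(bytes: Vec<u8>) -> Result<Self, Utf8Error> {
        std::str::from_utf8(&bytes)?;
        Ok(Self { bytes })
    }

    /// Unchecked constructor: skips validation entirely.
    ///
    /// # Safety
    /// The caller must guarantee that `bytes` is valid UTF-8.
    pub unsafe fn new_unchecked(bytes: Vec<u8>) -> Self {
        Self { bytes }
    }

    /// Safe accessor that relies on the invariant established at construction.
    pub fn as_str(&self) -> &str {
        // SAFETY: every constructor either checked the bytes or required
        // the caller to promise they are valid UTF-8.
        unsafe { std::str::from_utf8_unchecked(&self.bytes) }
    }
}

fn main() {
    // Safe path: invalid input is rejected instead of causing undefined behaviour.
    let checked = Utf8Values::try_new(b"arrow".to_vec()).expect("valid UTF-8");
    println!("{}", checked.as_str());
    assert!(Utf8Values::try_new(vec![0xff, 0xfe]).is_err());

    // Unsafe path: the caller opts out of the check.
    // SAFETY: the byte literal is known to be valid UTF-8.
    let unchecked = unsafe { Utf8Values::new_unchecked(b"arrow".to_vec()) };
    println!("{}", unchecked.as_str());
}
```

With this split, code that only uses the safe functions cannot violate the invariant, and every call site that skips validation is marked with an explicit `unsafe` block.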
We have a number of strategies to help reduce this risk:
- Array and ArrayBuilder APIs to safely and efficiently interact with arrays
- Validation when constructing ArrayData from untrusted sources
- A force_validate feature that enables additional validation checks for use in test/debug builds

Arrow can compile to WebAssembly using the wasm32-unknown-unknown and wasm32-wasi targets.
In order to compile Arrow for wasm32-unknown-unknown you will need to disable default features, then include the desired features, but exclude test dependencies (the test_utils feature). For example, use this snippet in your Cargo.toml:

```toml
[dependencies]
arrow = { version = "5.0", default-features = false, features = ["csv", "ipc", "simd"] }
```
The examples folder shows how to construct some different types of Arrow arrays, including dynamic arrays.
Examples can be run using the cargo run --example command. For example:

```bash
cargo run --example builders
cargo run --example dynamic_types
cargo run --example read_csv
```
Most of the compute kernels benefit a lot from being optimized for a specific CPU target.
This is especially so on x86-64 since without specifying a target the compiler can only assume support for SSE2 vector instructions.
One of the following values as -Ctarget-cpu=value in RUSTFLAGS can therefore improve performance significantly:
- native: Target the exact features of the CPU that the build is running on. This should give the best performance when building and running locally, but should be used carefully, for example when building in a CI pipeline or when shipping pre-compiled software.
- x86-64-v3: Includes AVX2 support and is close to the Intel haswell architecture released in 2013. It should be supported by any recent Intel or AMD CPU.
- x86-64-v4: Includes AVX512 support, available on Intel skylake server and icelake/tigerlake/rocketlake laptop and desktop processors.

These flags should be used in addition to the simd feature, since they will also affect the code generated by the simd library.
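When deciding which target-cpu value is safe to ship, it can help to compare what the binary was compiled for with what the host CPU actually supports. The following is a minimal standard-library sketch, not part of arrow; `avx2_status` is a hypothetical helper name, and AVX2 is used only because it is the feature the x86-64-v3 level adds.

```rust
/// Returns (compiled_with_avx2, host_supports_avx2).
/// Hypothetical helper for illustration only.
fn avx2_status() -> (bool, bool) {
    // True only if the compiler was allowed to assume AVX2,
    // e.g. via -Ctarget-cpu=x86-64-v3 or -Ctarget-cpu=native on a capable machine.
    let compiled = cfg!(target_feature = "avx2");

    // Runtime check of the CPU the program is currently running on.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    let detected = std::arch::is_x86_feature_detected!("avx2");
    // On other architectures AVX2 does not exist.
    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
    let detected = false;

    (compiled, detected)
}

fn main() {
    let (compiled, detected) = avx2_status();
    println!("compiled with AVX2 enabled: {compiled}");
    println!("host CPU supports AVX2:     {detected}");
    if detected && !compiled {
        println!("hint: a build with a more specific -Ctarget-cpu may be faster on this machine");
    }
}
```

A binary built with AVX2 enabled must not be run on a CPU without it, which is why native should be used with care in CI pipelines and for pre-compiled software.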