** SIMD for Humans
Easy, powerful, absurdly fast numerical calculations. Chaining, type punning, static dispatch (with inlining) based on your platform and vector types, zero-allocation iteration, vectorized loading/storing, and support for uneven collections.
It looks something like this:
    let lots_of_3s = (&[-123.456f32; 128][..]).simd_iter()
        .map(|v| { f32s::splat(9.0) * v.abs().sqrt().rsqrt().ceil().sqrt() -
                   f32s::splat(4.0) - f32s::splat(2.0) })
        .scalar_collect::<Vec<f32>>();
Which is analogous to this scalar code:
    let lots_of_3s = (&[-123.456f32; 128][..]).iter()
        .map(|v| { 9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() -
                   4.0 - 2.0 })
        .collect::<Vec<f32>>();
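To see why the result is "lots of 3s", the scalar pipeline can be traced one value at a time. The following is a minimal, standard-library-only sketch; the function name `pipeline` is an assumption made for illustration and does not come from faster.

```rust
/// Scalar version of the pipeline, applied to a single value.
/// `pipeline` is a hypothetical helper name used only for this sketch.
fn pipeline(v: f32) -> f32 {
    // abs   : |-123.456| = 123.456
    // sqrt  : ≈ 11.111
    // sqrt  : ≈ 3.333   (sqrt followed by recip is the scalar spelling of rsqrt)
    // recip : ≈ 0.300
    // ceil  : 1.0
    // sqrt  : 1.0
    // 9.0 * 1.0 - 4.0 - 2.0 = 3.0
    9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() - 4.0 - 2.0
}

fn main() {
    let lots_of_3s: Vec<f32> = [-123.456f32; 128]
        .iter()
        .map(|&v| pipeline(v))
        .collect();

    // Every element ends up as exactly 3.0.
    assert!(lots_of_3s.iter().all(|&x| x == 3.0));
    println!("{:?}", &lots_of_3s[..4]); // prints [3.0, 3.0, 3.0, 3.0]
}
```

Because `ceil` rounds the intermediate value up to exactly 1.0, the final result is exact despite the approximate intermediate steps.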
The vector size is determined entirely by the target you're compiling for: faster attempts to use the largest vector size that target supports, and the same code works on any platform or architecture (see below for details).
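The idea of choosing a vector width at compile time, and still handling collections whose length is not a multiple of that width, can be sketched in plain Rust. This is not faster's implementation: `f32_lanes`, `LANES`, and `abs_sqrt_in_place` are hypothetical names, the sketch relies on the optimizer rather than explicit intrinsics, and it only distinguishes AVX, SSE, and a scalar fallback.

```rust
/// Number of f32 lanes in the widest vector register the compilation target
/// is known to have. Hypothetical helper, not part of faster's API.
const fn f32_lanes() -> usize {
    if cfg!(target_feature = "avx") {
        8 // 256-bit registers hold eight f32 values
    } else if cfg!(target_feature = "sse") {
        4 // 128-bit registers hold four f32 values
    } else {
        1 // scalar fallback for every other target
    }
}

/// Resolved once, at compile time, from the target's feature set.
const LANES: usize = f32_lanes();

/// Applies `abs().sqrt()` to every element, one LANES-wide chunk at a time,
/// then finishes the leftover elements one by one.
fn abs_sqrt_in_place(data: &mut [f32]) {
    let mut chunks = data.chunks_exact_mut(LANES);
    for chunk in &mut chunks {
        // A fixed-width inner loop is a shape the optimizer may be able to
        // turn into vector instructions.
        for x in chunk.iter_mut() {
            *x = (*x).abs().sqrt();
        }
    }
    // Uneven collections: whatever did not fill a whole chunk.
    for x in chunks.into_remainder() {
        *x = (*x).abs().sqrt();
    }
}

fn main() {
    // Ten elements is deliberately not a multiple of 4 or 8.
    let mut data = [-16.0f32; 10];
    abs_sqrt_in_place(&mut data);
    assert!(data.iter().all(|&x| x == 4.0));
    println!("lanes = {}, data = {:?}", LANES, data);
}
```

Building the same source with different `-C target-cpu` or `-C target-feature` flags changes `LANES` without touching the calling code, which is the property the paragraph above describes.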
Compare this to traditional explicit SIMD:
    use std::mem::transmute;
    use stdsimd::{f32x4, f32x8};

    let lots_of_3s = &mut [-123.456f32; 128][..];

    if cfg!(all(not(target_feature = "avx"), target_feature = "sse")) {
        for ch in lots_of_3s.chunks_mut(4) {
            let v = f32x4::load(ch, 0);
            let scalar_abs_mask = unsafe { transmute::
Even with all of that boilerplate, this still only supports x86-64 machines with SSE or AVX, and you have to look up each intrinsic to ensure it's usable for your compilation target.

** Upcoming Features
More intrinsic traits are coming; feel free to open an issue or pull request if you have one you'd like to see.
Swizzling, automated testing, and documentation are also in the pipeline.

** Compatibility
Faster currently supports 32- and 64-bit x86 machines with SSE and above, although AVX-512 support isn't working in rustc yet. Support for non-x86 architectures is currently blocked by stdsimd and rustc.
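Of course, once those issues are resolved, adding support for ARM, MIPS, or any other intrinsics and vector lengths will be trivial.

** Performance
Here are some extremely unscientific benchmarks which, at least, show that this isn't any worse than scalar iterators. Even on ancient CPUs, a lot of performance can be extracted out of SIMD: in the runs below, the SIMD version of the work benchmark is roughly 7.8x faster than the scalar one when targeting Ivy Bridge, and roughly 3.6x faster when targeting Pentium III.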
    $ RUSTFLAGS="-C target-cpu=ivybridge" cargo bench # host is ivybridge
    test tests::bench_nop_scalar  ... bench:          29 ns/iter (+/- 2)
    test tests::bench_nop_simd    ... bench:          28 ns/iter (+/- 1)
    test tests::bench_work_scalar ... bench:       1,042 ns/iter (+/- 93)
    test tests::bench_work_simd   ... bench:         133 ns/iter (+/- 1)

    $ RUSTFLAGS="-C target-cpu=pentium3" cargo bench # host is ivybridge
    test tests::bench_nop_scalar  ... bench:          18 ns/iter (+/- 0)
    test tests::bench_nop_simd    ... bench:          21 ns/iter (+/- 1)
    test tests::bench_work_scalar ... bench:       1,013 ns/iter (+/- 72)
    test tests::bench_work_simd   ... bench:         281 ns/iter (+/- 18)