** SIMD for Humans
Easy, powerful, portable, absurdly fast numerical calculations. Includes static dispatch with inlining based on your platform and vector types, zero-allocation iteration, vectorized loading/storing, and support for uneven collections.
It looks something like this:
let lots_of_3s = (&[-123.456f32; 128][..]).simd_iter()
    .simd_map(|v| { f32s::splat(9.0) * v.abs().sqrt().rsqrt().ceil().sqrt() -
                    f32s::splat(4.0) - f32s::splat(2.0) })
    .scalar_collect();
Which is analogous to this scalar code:
let lots_of_3s = (&[-123.456f32; 128][..]).iter()
    .map(|v| { 9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() -
               4.0 - 2.0 })
    .collect::<Vec<f32>>();
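To see why the variable is named for 3s, the scalar expression can be evaluated on a single element. The following is a minimal sketch using only the standard library; the helper name `lots_of_3s_element` is made up for illustration and is not part of faster. Note that `rsqrt` in the SIMD version is a reciprocal square root, which is why the scalar version spells it as `sqrt().recip()`.

```rust
/// Hypothetical helper: the scalar expression from the example, applied to one element.
fn lots_of_3s_element(v: f32) -> f32 {
    9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() - 4.0 - 2.0
}

fn main() {
    // |-123.456| = 123.456; two square roots give roughly 3.33;
    // the reciprocal is roughly 0.30, which ceil() rounds up to 1.0;
    // sqrt(1.0) = 1.0, so the result is 9.0 - 4.0 - 2.0 = 3.0.
    let out: Vec<f32> = [-123.456f32; 128]
        .iter()
        .map(|&v| lots_of_3s_element(v))
        .collect();

    // Every one of the 128 elements ends up as exactly 3.0.
    assert!(out.iter().all(|&x| x == 3.0));
    println!("{:?}", &out[..4]);
}
```

The `ceil()` in the middle makes the result insensitive to small rounding differences in the preceding steps, so an approximate reciprocal square root would still land on 3.0.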
The vector size is entirely determined by the machine you're compiling for - it attempts to use the largest vector size supported by your machine, and works on any platform or architecture (see below for details).
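As a rough illustration of how a vector width can be fixed at compile time, here is a minimal sketch using only the standard `cfg!` macro. The function `f32_lanes` is a hypothetical helper, not part of faster, and the sketch only distinguishes x86 feature levels; faster's actual dispatch may work differently.

```rust
/// Hypothetical helper (not part of faster): number of f32 lanes in the widest
/// x86 vector register enabled for the current compilation target.
/// Targets without any of these features fall back to 1 (scalar).
const fn f32_lanes() -> usize {
    if cfg!(target_feature = "avx512f") {
        16 // 512-bit registers
    } else if cfg!(target_feature = "avx") {
        8 // 256-bit registers
    } else if cfg!(target_feature = "sse") {
        4 // 128-bit registers
    } else {
        1 // no known vector extension enabled
    }
}

fn main() {
    // The value is a compile-time constant, so it can size arrays directly.
    const LANES: usize = f32_lanes();
    let chunk = [0.0f32; LANES];
    println!("compiled for {} f32 lanes", chunk.len());
}
```

Because `cfg!` expands to a boolean literal during compilation, the unused branches are trivially removed and the result changes only when the target features change, for example via `RUSTFLAGS="-C target-cpu=..."`.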
Compare this to traditional explicit SIMD:
use std::mem::transmute;
use stdsimd::{f32x4, f32x8};

let lots_of_3s = &mut [-123.456f32; 128][..];

if cfg!(all(not(target_feature = "avx"), target_feature = "sse")) {
    for ch in lots_of_3s.chunks_mut(4) {
        let v = f32x4::load(ch, 0);
        let scalar_abs_mask = unsafe { transmute::
Even with all of that boilerplate, this still only supports x86-64 machines with SSE or AVX - and you have to look up each intrinsic to ensure it's usable for your compilation target.

* Upcoming Features
Gathers and scatters are next in the pipeline. This should let you one-line matrix determinants, cross products, and many vector calculus primitives.

* Compatibility
Faster currently supports x86 back to the first Pentium, although AVX-512 support isn't working in rustc yet. It builds on many architectures, although I'm not sure whether the tests pass.

** Performance
Here are some extremely unscientific benchmarks which, at least, prove that this isn't any worse than scalar iterators. Even on ancient CPUs, a lot of performance can be extracted out of SIMD. Surprisingly, using SIMD iterators performs better than scalar iterators even on the SSE-less Pentium.
$ RUSTFLAGS="-C target-cpu=ivybridge" cargo bench # host is ivybridge; target has AVX
test tests::bench_map_scalar         ... bench: 6,969 ns/iter (+/- 170)
test tests::bench_map_simd           ... bench:   900 ns/iter (+/- 17)
test tests::bench_map_uneven_simd    ... bench:   905 ns/iter (+/- 23)
test tests::bench_nop_scalar         ... bench:    37 ns/iter (+/- 0)
test tests::bench_nop_simd           ... bench:    35 ns/iter (+/- 1)
test tests::bench_reduce_scalar      ... bench: 6,908 ns/iter (+/- 62)
test tests::bench_reduce_simd        ... bench:   875 ns/iter (+/- 17)
test tests::bench_reduce_uneven_simd ... bench:   905 ns/iter (+/- 14)

$ RUSTFLAGS="-C target-cpu=x86-64" cargo bench # host is ivybridge; target has SSE2
test tests::bench_map_scalar         ... bench: 7,229 ns/iter (+/- 100)
test tests::bench_map_simd           ... bench: 1,880 ns/iter (+/- 38)
test tests::bench_map_uneven_simd    ... bench: 1,887 ns/iter (+/- 42)
test tests::bench_nop_scalar         ... bench:    43 ns/iter (+/- 1)
test tests::bench_nop_simd           ... bench:    35 ns/iter (+/- 0)
test tests::bench_reduce_scalar      ... bench: 7,021 ns/iter (+/- 127)
test tests::bench_reduce_simd        ... bench: 1,874 ns/iter (+/- 37)
test tests::bench_reduce_uneven_simd ... bench: 1,946 ns/iter (+/- 28)

$ RUSTFLAGS="-C target-cpu=pentium" cargo bench # host is ivybridge; this only runs the polyfills!
test tests::bench_map_scalar         ... bench: 7,193 ns/iter (+/- 48)
test tests::bench_map_simd           ... bench: 6,277 ns/iter (+/- 40)
test tests::bench_map_uneven_simd    ... bench: 6,287 ns/iter (+/- 20)
test tests::bench_nop_scalar         ... bench:    46 ns/iter (+/- 0)
test tests::bench_nop_simd           ... bench:    70 ns/iter (+/- 0)
test tests::bench_reduce_scalar      ... bench: 7,005 ns/iter (+/- 30)
test tests::bench_reduce_simd        ... bench: 6,076 ns/iter (+/- 19)
test tests::bench_reduce_uneven_simd ... bench: 6,110 ns/iter (+/- 16)
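
The "uneven" benchmarks cover collections whose length is not a multiple of the vector width. How faster handles the leftover elements is not shown here; one common approach, sketched below with only the standard library, is to process full-width chunks first and then handle the tail separately. The constant `LANES` and the function `chunked_sum` are hypothetical names chosen for this example.

```rust
/// Hypothetical lane count standing in for the target's vector width.
const LANES: usize = 8;

/// Sums a slice by accumulating full LANES-wide chunks lane by lane,
/// then adding the leftover tail elements one at a time.
fn chunked_sum(data: &[f32]) -> f32 {
    let chunks = data.chunks_exact(LANES);
    // Elements that do not fill a whole chunk (fewer than LANES of them).
    let tail = chunks.remainder();

    // One accumulator per lane; this inner loop is the part a SIMD
    // implementation would perform with a single vector add.
    let mut acc = [0.0f32; LANES];
    for chunk in chunks {
        for (a, x) in acc.iter_mut().zip(chunk) {
            *a += *x;
        }
    }

    acc.iter().sum::<f32>() + tail.iter().sum::<f32>()
}

fn main() {
    // 131 elements: 16 full chunks of 8, plus 3 leftover elements.
    let data = [1.0f32; 131];
    assert_eq!(chunked_sum(&data), 131.0);
    println!("sum = {}", chunked_sum(&data));
}
```

Splitting the work this way keeps the hot loop free of bounds checks on partial chunks, which is consistent with the uneven benchmarks above running at nearly the same speed as the even ones.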