It looks something like this:

#+BEGIN_SRC rust
let lots_of_3s = (&[-123.456f32; 128][..]).simd_iter()
    .map(|v| {
        f32s::splat(9.0) * v.abs().sqrt().rsqrt().ceil().sqrt() -
            f32s::splat(4.0) - f32s::splat(2.0)
    })
    .scalar_collect::<Vec<f32>>();
#+END_SRC

Which is analogous to this scalar code:

#+BEGIN_SRC rust
let lots_of_3s = (&[-123.456f32; 128][..]).iter()
    .map(|v| {
        9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() -
            4.0 - 2.0
    })
    .collect::<Vec<f32>>();
#+END_SRC
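In both versions the arithmetic collapses to the same constant: the fourth root of |-123.456| is about 3.3, its reciprocal (about 0.3) rounds up to 1.0 under ~ceil~, the square root of 1.0 is 1.0, and 9.0 * 1.0 - 4.0 - 2.0 = 3.0 for every element, hence the name ~lots_of_3s~.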

The vector size is determined entirely by the machine you're compiling for: faster attempts to use the largest vector size the target supports, and the same code works on any platform or architecture (see below for details).
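On an AVX target, for example, ~f32s~ corresponds to an eight-lane 256-bit vector, while an SSE-only target gets four lanes. A minimal sketch of that compile-time width selection, using only ~cfg~ attributes (the constant ~F32_LANES~ is hypothetical and only illustrates the idea; it is not part of faster's API):

#+BEGIN_SRC rust
// Hypothetical illustration of picking a vector width from the target
// features enabled at compile time; this constant is not faster's API.
#[cfg(target_feature = "avx")]
const F32_LANES: usize = 8; // 256-bit AVX registers hold eight f32s

#[cfg(all(target_feature = "sse", not(target_feature = "avx")))]
const F32_LANES: usize = 4; // 128-bit SSE registers hold four f32s

#[cfg(not(target_feature = "sse"))]
const F32_LANES: usize = 1; // no SIMD support: fall back to scalar

fn main() {
    println!("f32 vector width on this target: {}", F32_LANES);
}
#+END_SRC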

Compare this to traditional explicit SIMD:

#+BEGIN_SRC rust
use std::mem::transmute;
use stdsimd::{f32x4, f32x8};

let lots_of_3s = &mut [-123.456f32; 128][..];

if cfg!(all(not(target_feature = "avx"), target_feature = "sse")) {
    for ch in lots_of_3s.chunks_mut(4) {
        let mut v = f32x4::load(ch, 0);
        let scalar_abs_mask = unsafe { transmute::<u32, f32>(0x7fffffff) };
        let abs_mask = f32x4::splat(scalar_abs_mask);
        // There isn't actually an absolute value intrinsic for floats - you
        // have to look at the IEEE 754 spec and do some bit flipping
        v = unsafe { _mm_and_ps(v, abs_mask) };
        v = unsafe { _mm_sqrt_ps(v) };
        v = unsafe { _mm_rsqrt_ps(v) };
        v = unsafe { _mm_ceil_ps(v) };
        v = unsafe { _mm_sqrt_ps(v) };
        v = unsafe { _mm_mul_ps(v, f32x4::splat(9.0)) };
        v = unsafe { _mm_sub_ps(v, f32x4::splat(4.0)) };
        v = unsafe { _mm_sub_ps(v, f32x4::splat(2.0)) };
        v.store(ch, 0);
    }
} else if cfg!(all(not(target_feature = "avx512"), target_feature = "avx")) {
    for ch in lots_of_3s.chunks_mut(8) {
        let mut v = f32x8::load(ch, 0);
        let scalar_abs_mask = unsafe { transmute::<u32, f32>(0x7fffffff) };
        let abs_mask = f32x8::splat(scalar_abs_mask);
        // There isn't actually an absolute value intrinsic for floats - you
        // have to look at the IEEE 754 spec and do some bit flipping
        v = unsafe { _mm256_and_ps(v, abs_mask) };
        v = unsafe { _mm256_sqrt_ps(v) };
        v = unsafe { _mm256_rsqrt_ps(v) };
        v = unsafe { _mm256_ceil_ps(v) };
        v = unsafe { _mm256_sqrt_ps(v) };
        v = unsafe { _mm256_mul_ps(v, f32x8::splat(9.0)) };
        v = unsafe { _mm256_sub_ps(v, f32x8::splat(4.0)) };
        v = unsafe { _mm256_sub_ps(v, f32x8::splat(2.0)) };
        v.store(ch, 0);
    }
}
#+END_SRC
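The bit flipping that comment alludes to is just clearing the IEEE 754 sign bit, which is exactly what ANDing with ~0x7fffffff~ does. Here is the same trick on a single scalar, as a standalone sketch (the function name is made up for illustration and is not part of any library):

#+BEGIN_SRC rust
/// Clear the sign bit of an IEEE 754 single-precision float, yielding |x|.
/// This is the scalar equivalent of ANDing each vector lane with 0x7fffffff.
fn abs_via_bits(x: f32) -> f32 {
    f32::from_bits(x.to_bits() & 0x7fff_ffff)
}

fn main() {
    assert_eq!(abs_via_bits(-123.456), 123.456);
    println!("abs(-123.456) = {}", abs_via_bits(-123.456));
}
#+END_SRC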

Even with all of that boilerplate, this still only supports x86-64 machines with SSE or AVX.

** Upcoming Features

Zero-overhead support for uneven collections which don't fit entirely into a vector is coming, as are zero-allocation collects, to be called ~fill~.

By 0.2.0, this code will compile:

#+BEGIN_SRC rust
let mut some_u8s = [0u8; 100];
let filled_u8s = (&[0u8; 100][..]).simd_iter()
    .uneven_map(|vector| vector * splat(2), |scalar| scalar * 2)
    .scalar_fill(&mut some_u8s);
#+END_SRC
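Conceptually, ~uneven_map~ applies the first closure to each full vector's worth of elements and the second closure to the scalar remainder that doesn't fill a vector. In plain Rust, that shape looks roughly like the following sketch (a hypothetical helper illustrating the idea, not faster's implementation):

#+BEGIN_SRC rust
// Hypothetical illustration: double every element, handling full "vector"
// chunks in one pass and the leftover tail element by element.
fn double_all(input: &[u8], output: &mut [u8]) {
    const WIDTH: usize = 8; // stand-in for the hardware vector width
    let full = input.len() / WIDTH * WIDTH;
    // "Vector" portion: full WIDTH-element chunks, the part the vector
    // closure would see.
    for (inch, outch) in input[..full].chunks(WIDTH).zip(output[..full].chunks_mut(WIDTH)) {
        for (o, i) in outch.iter_mut().zip(inch) {
            *o = *i * 2;
        }
    }
    // Scalar remainder: the tail that doesn't fill a vector, the part the
    // scalar closure would see.
    for (o, i) in output[full..].iter_mut().zip(&input[full..]) {
        *o = *i * 2;
    }
}

fn main() {
    let input = [1u8; 100];
    let mut output = [0u8; 100];
    double_all(&input, &mut output);
    assert!(output.iter().all(|&x| x == 2));
}
#+END_SRC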

More intrinsic traits are also coming; feel free to open an issue or pull request if you have one you'd like to see.

** Compatibility

Faster currently supports x86 machines with SSE and above, although AVX-512 support isn't working in rustc yet. Support for non-x86 architectures is currently blocked by stdsimd and rustc.

Of course, once those issues are resolved, adding support for ARM, MIPS, or any other intrinsics and vector lengths will be trivial.

** Performance

Here are some extremely unscientific benchmarks which, at least, show that this isn't any worse than scalar iterators.

#+BEGIN_SRC shell
running 4 tests
test tests::bench_nop_scalar  ... bench:          51 ns/iter (+/- 1)
test tests::bench_nop_simd    ... bench:          51 ns/iter (+/- 1)
test tests::bench_work_scalar ... bench:       1,276 ns/iter (+/- 39)
test tests::bench_work_simd   ... bench:         251 ns/iter (+/- 0)
#+END_SRC
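For context, the ~work~ pair of benchmarks could be written with Rust's nightly ~test~ harness along these lines (a sketch only: the glob import of faster's items is assumed, the ~nop~ pair is omitted, and this is not the project's actual benchmark source):

#+BEGIN_SRC rust
#![feature(test)]
extern crate test;
extern crate faster;

#[cfg(test)]
mod tests {
    use test::{black_box, Bencher};
    use faster::*; // assumed glob import for simd_iter, f32s, etc.

    // Scalar baseline: the iterator version from above.
    #[bench]
    fn bench_work_scalar(b: &mut Bencher) {
        b.iter(|| {
            black_box((&[-123.456f32; 128][..]).iter()
                .map(|v| 9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() - 4.0 - 2.0)
                .collect::<Vec<f32>>())
        });
    }

    // SIMD version: the same computation through faster's iterator.
    #[bench]
    fn bench_work_simd(b: &mut Bencher) {
        b.iter(|| {
            black_box((&[-123.456f32; 128][..]).simd_iter()
                .map(|v| {
                    f32s::splat(9.0) * v.abs().sqrt().rsqrt().ceil().sqrt() -
                        f32s::splat(4.0) - f32s::splat(2.0)
                })
                .scalar_collect::<Vec<f32>>())
        });
    }
}
#+END_SRC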