General matrix multiplication for f32, f64, and complex matrices. Operates on matrices with general layout (they can use arbitrary row and column stride).
Please read the `API documentation here`__

__ https://docs.rs/matrixmultiply/
We presently provide a few good microkernels, portable and for x86-64 and AArch64 NEON, and only one operation: the general matrix-matrix multiplication (“gemm”).
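To make the "general layout" wording concrete, here is a minimal reference sketch of what a gemm computes, ``C <- alpha * A * B + beta * C``, when every matrix is addressed through a row stride and a column stride. This is not the crate's API or implementation: the name ``gemm_ref``, the slice-based arguments and the unsigned strides are assumptions made for illustration. See the API documentation for the real function signatures.

```rust
/// Naive reference gemm: C <- alpha * A * B + beta * C.
/// A is m x k, B is k x n, C is m x n.
/// `rs*` / `cs*` are row and column strides counted in elements
/// (hypothetical parameter names, chosen for this sketch).
fn gemm_ref(
    m: usize, k: usize, n: usize,
    alpha: f64,
    a: &[f64], rsa: usize, csa: usize,
    b: &[f64], rsb: usize, csb: usize,
    beta: f64,
    c: &mut [f64], rsc: usize, csc: usize,
) {
    for i in 0..m {
        for j in 0..n {
            // Dot product of row i of A with column j of B.
            let mut acc = 0.0;
            for p in 0..k {
                acc += a[i * rsa + p * csa] * b[p * rsb + j * csb];
            }
            let idx = i * rsc + j * csc;
            c[idx] = alpha * acc + beta * c[idx];
        }
    }
}

fn main() {
    // A is 2 x 3 in row-major order: strides (3, 1).
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    // B is 3 x 2 in column-major order: strides (1, 3).
    let b = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0];
    // C is 2 x 2 in row-major order: strides (2, 1).
    let mut c = [0.0; 4];
    gemm_ref(2, 3, 2, 1.0, &a, 3, 1, &b, 1, 3, 0.0, &mut c, 2, 1);
    println!("{:?}", c);
}
```

Row-major and column-major inputs are just two choices of strides here, so the same routine handles mixed layouts without copying or transposing first.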
This crate was inspired by the macro/microkernel approach to matrix multiplication that is used by the BLIS_ project.
.. _BLIS: https://github.com/flame/blis
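The macro/microkernel split can be sketched roughly as follows: an outer "macro" loop packs panels of A and B into contiguous buffers, and a small fixed-size "micro" kernel multiplies one packed panel pair into an ``MR x NR`` tile of C. This is a simplified illustration only, not the crate's actual kernels. The tile sizes ``MR = NR = 4``, the function names and the row-major-only inputs are assumptions, and the cache blocking that a real implementation adds is left out.

```rust
const MR: usize = 4; // rows of C per microkernel call (assumed tile size)
const NR: usize = 4; // columns of C per microkernel call (assumed tile size)

/// Pack rows i0..i0+MR of row-major A (m x k) into a k x MR panel,
/// zero-padding rows that fall outside the matrix.
fn pack_a(a: &[f32], m: usize, k: usize, i0: usize) -> Vec<f32> {
    let mut panel = vec![0.0; k * MR];
    for p in 0..k {
        for i in 0..MR.min(m - i0) {
            panel[p * MR + i] = a[(i0 + i) * k + p];
        }
    }
    panel
}

/// Pack columns j0..j0+NR of row-major B (k x n) into a k x NR panel,
/// zero-padding columns that fall outside the matrix.
fn pack_b(b: &[f32], k: usize, n: usize, j0: usize) -> Vec<f32> {
    let mut panel = vec![0.0; k * NR];
    for p in 0..k {
        for j in 0..NR.min(n - j0) {
            panel[p * NR + j] = b[p * n + j0 + j];
        }
    }
    panel
}

/// Microkernel: accumulate k rank-1 updates into an MR x NR tile.
fn microkernel(k: usize, a_panel: &[f32], b_panel: &[f32]) -> [[f32; NR]; MR] {
    let mut acc = [[0.0f32; NR]; MR];
    for p in 0..k {
        let a = &a_panel[p * MR..(p + 1) * MR];
        let b = &b_panel[p * NR..(p + 1) * NR];
        for i in 0..MR {
            for j in 0..NR {
                acc[i][j] += a[i] * b[j];
            }
        }
    }
    acc
}

/// Macro level: C = A * B for row-major A (m x k) and B (k x n).
fn gemm_tiled(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0; m * n];
    for j0 in (0..n).step_by(NR) {
        let b_panel = pack_b(b, k, n, j0);
        for i0 in (0..m).step_by(MR) {
            let a_panel = pack_a(a, m, k, i0);
            let acc = microkernel(k, &a_panel, &b_panel);
            // Copy only the part of the tile that lies inside C.
            for i in 0..MR.min(m - i0) {
                for j in 0..NR.min(n - j0) {
                    c[(i0 + i) * n + j0 + j] = acc[i][j];
                }
            }
        }
    }
    c
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; // 2 x 3, row major
    let b = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0]; // 3 x 2, row major
    let c = gemm_tiled(&a, &b, 2, 3, 2);
    println!("{:?}", c);
}
```

Because the microkernel always sees contiguous, fixed-size panels, it is the only part that needs a target-specific variant. Packing and the outer loops stay portable.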
|crates|_
.. |crates| image:: https://img.shields.io/crates/v/matrixmultiply.svg
.. _crates: https://crates.io/crates/matrixmultiply
``cargo bench`` is useful for special cases and small matrices.

``examples/benchmarks.rs`` supports custom sizes, some configuration, and csv output. Use the script ``benches/benchloop.py`` to run benchmarks over parameter ranges.

`gemm: a rabbit hole`__

__ https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole/
0.3.7
0.3.6
0.3.5
Significant improvements to complex matrix packing and kernels (#75)
Use a specialized AVX2 matrix packing function for sgemm, dgemm when this feature is detected on x86-64
0.3.4
Matrixmultiply now uses autocfg to detect the Rust version and enables the AArch64 NEON kernels when AArch64 intrinsics are available (from Rust 1.61).
0.3.3
Attempt to fix macos bug #55 again (manifesting as a debug assertion, only in debug builds).
Updated comments for x86 kernels by @Tastaturtaste
Updates to MIRI/CI by @jturner314
Silenced Send/Sync future compatibility warnings for a raw pointer wrapper
0.3.2
Add optional feature ``cgemm`` for complex matmult functions ``cgemm`` and ``zgemm``
Add optional feature ``constconf`` for compile-time configuration of matrix kernel parameters for chunking. Improved scripts for benchmarking over ranges of different settings. With thanks to @DutchGhost for the const-time parsing functions.
Improved benchmarking and testing.
Threading is now slightly more eager to use threads (depending on matrix element count).
0.3.1
Attempt to fix bug #55 where the mask buffer in TLS did not seem to get its requested alignment on macos. The mask buffer pointer is now aligned manually (again, like it was in 0.2.x).
Fix a minor issue where we were passing a buffer pointer as ``&T`` when it should have been ``&[T]``.
0.3.0
``threading`` (and configure number of threads with the variable ``MATMUL_NUM_THREADS``). Initial support is for up to 4 threads - will be updated with more experience in coming versions.
Added a better benchmarking program for arbitrary size and layout, see ``examples/benchmark.rs`` for this; it supports csv output for better recording of measurements
Minimum supported rust version is 1.41.1 and the version update policy has been updated.
Updated to Rust 2018 edition
Moved CI to github actions (so long travis and thanks for all the fish).
0.2.4
Support no-std mode by @vadixidav and @jturner314. New (default) feature flag "std"; use ``default-features = false`` to disable and use no-std. Note that runtime CPU feature detection requires std.
Fix tests so that they build correctly on non-x86 platforms (#49), and manage the release by @bluss
0.2.3
``-Ctarget-cpu=native`` use (not recommended - use automatic runtime feature detection).
0.2.2
Benchmark improvements: Using fma instructions reduces execution time on dgemm benchmarks by 25-35% compared with the avx kernel, see issue `#35`_
Using the avx dgemm kernel reduces execution time on dgemm benchmarks by 5-7% compared with the previous version's autovectorized kernel.
Benchmark improvement: Using fma instructions reduces execution time on sgemm benchmarks by 10-15% compared with the avx kernel, see issue `#35`_
Benchmark improvement: Reduces execution time on various benchmarks by 1-2% in the avx kernels, see `#37`_.

.. _#35: https://github.com/bluss/matrixmultiply/issues/35
.. _#37: https://github.com/bluss/matrixmultiply/issues/37
0.2.1
Benchmark improvement: execution time for 64×64 problem where inputs are either both row major or both column major changed by -5% for sgemm and -1% for dgemm. (#26)
Benchmark improvement: execution time for 32×32 problem where output is column major changed by -11%. (#27)
0.2.0
This means no special compiler flags are needed to enable native instruction performance!
Implement a specialized 8×8 sgemm (f32) AVX microkernel; this speeds up matrix multiplication by another 25%.
Use ``std::alloc`` for allocation of aligned packing buffers
We now require Rust 1.28 as the minimal version
0.1.15
0.1.14
0.1.13
``rawpointer``, a µcrate with raw pointer methods taken from this project.
0.1.12
0.1.11
0.1.10
0.1.9
0.1.8
0.1.7
0.1.6
0.1.5
0.1.4
0.1.3
0.1.2
0.1.1