General matrix multiplication for f32, f64 matrices. Operates on matrices with general layout (they can use arbitrary row and column stride).
Please read the API documentation `here`__.

__ https://docs.rs/matrixmultiply/
This crate uses the same macro/microkernel approach to matrix multiplication as the BLIS_ project.
We presently provide a few good microkernels, portable and for x86-64, and only one operation: the general matrix-matrix multiplication (“gemm”).
.. _BLIS: https://github.com/flame/blis
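“General layout” means element *(i, j)* of an *m*×*n* matrix lives at offset ``i * rs + j * cs`` for an arbitrary row stride ``rs`` and column stride ``cs`` (row major, column major, and transposed views are all special cases). The sketch below illustrates only that indexing convention with a naive strided multiply in plain Rust; it is not this crate's optimized ``sgemm``, which operates on raw pointers, takes signed strides, and applies alpha/beta scaling. The name ``naive_gemm`` is made up for this example.

```rust
// Naive strided matrix multiply: C = A * B, where element (i, j) of a
// matrix with row stride rs and column stride cs is data[i * rs + j * cs].
// Illustrative only; it performs none of the packing/microkernel work
// that makes the real crate fast.
fn naive_gemm(
    m: usize, k: usize, n: usize,
    a: &[f32], rsa: usize, csa: usize,
    b: &[f32], rsb: usize, csb: usize,
    c: &mut [f32], rsc: usize, csc: usize,
) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0;
            for l in 0..k {
                acc += a[i * rsa + l * csa] * b[l * rsb + j * csb];
            }
            c[i * rsc + j * csc] = acc;
        }
    }
}

fn main() {
    // A is 2×2 row major (rs = 2, cs = 1): [[1, 2], [3, 4]].
    let a = [1.0, 2.0, 3.0, 4.0];
    // B is 2×2 column major (rs = 1, cs = 2): [[5, 6], [7, 8]].
    let b = [5.0, 7.0, 6.0, 8.0];
    let mut c = [0.0f32; 4]; // C written row major.
    naive_gemm(2, 2, 2, &a, 2, 1, &b, 1, 2, &mut c, 2, 1);
    assert_eq!(c, [19.0, 22.0, 43.0, 50.0]);
    println!("{:?}", c);
}
```

Mixing a row-major input with a column-major input, as above, needs no transposition step: the strides encode the layout.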
|buildstatus|_ |crates|_

.. |buildstatus| image:: https://travis-ci.org/bluss/matrixmultiply.svg?branch=master
.. _buildstatus: https://travis-ci.org/bluss/matrixmultiply

.. |crates| image:: https://meritbadge.herokuapp.com/matrixmultiply
.. _crates: https://crates.io/crates/matrixmultiply
Blog post: `gemm: a rabbit hole`__

__ https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole/
Recent changes
--------------

0.2.4
~~~~~
- Support no-std mode, by @vadixidav and @jturner314. New (default) feature flag "std"; use ``default-features = false`` to disable it and build for no-std. Note that runtime CPU feature detection requires std.
- Fix tests so that they build correctly on non-x86 platforms (#49), and manage the release, by @bluss.
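The no-std setup described above can be expressed in ``Cargo.toml`` as follows (the version number shown is taken from this changelog entry; pin whatever version you actually use):

```toml
# Disable the default "std" feature to build matrixmultiply for no-std.
# Caveat from the changelog: runtime CPU feature detection needs std.
[dependencies]
matrixmultiply = { version = "0.2.4", default-features = false }
```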
0.2.3
~~~~~

- ``-Ctarget-cpu=native`` use is not recommended - the crate uses automatic runtime feature detection instead.

0.2.2
~~~~~
- Benchmark improvement: using fma instructions reduces execution time on dgemm benchmarks by 25-35% compared with the avx kernel, see `#35`_.
- Using the avx dgemm kernel reduces execution time on dgemm benchmarks by 5-7% compared with the previous version's autovectorized kernel.
- Benchmark improvement: using fma instructions reduces execution time on sgemm benchmarks by 10-15% compared with the avx kernel, see `#35`_.
- Benchmark improvement: reduces execution time on various benchmarks by 1-2% in the avx kernels, see `#37`_.

.. _#35: https://github.com/bluss/matrixmultiply/issues/35
.. _#37: https://github.com/bluss/matrixmultiply/issues/37
0.2.1
~~~~~

- Benchmark improvement: execution time for the 64×64 problem where the inputs are either both row major or both column major changed by -5% for sgemm and -1% for dgemm. (#26)
- Benchmark improvement: execution time for the 32×32 problem where the output is column major changed by -11%. (#27)
0.2.0
~~~~~

- Select kernels using automatic runtime feature detection. This means no special compiler flags are needed to enable native instruction performance!
- Implement a specialized 8×8 sgemm (f32) AVX microkernel; this speeds up matrix multiplication by another 25%.
- Use ``std::alloc`` for allocation of aligned packing buffers.
- We now require Rust 1.28 as the minimal version.
0.1.15
~~~~~~

0.1.14
~~~~~~

0.1.13
~~~~~~

- Depend on ``rawpointer``, a µcrate with raw pointer methods taken from this project.

0.1.12
~~~~~~

0.1.11
~~~~~~

0.1.10
~~~~~~

0.1.9
~~~~~

0.1.8
~~~~~

0.1.7
~~~~~

0.1.6
~~~~~

0.1.5
~~~~~

0.1.4
~~~~~

0.1.3
~~~~~

0.1.2
~~~~~

0.1.1
~~~~~