General matrix multiplication for f32, f64 matrices. Operates on matrices with general layout (they can use arbitrary row and column stride).
Please read the `API documentation here`__

__ https://docs.rs/matrixmultiply/
We presently provide a few good microkernels, portable and for x86-64, and only one operation: the general matrix-matrix multiplication (“gemm”).
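To make the "general layout" idea concrete, here is a minimal sketch (not this crate's optimized kernel, and not its actual API, which is an ``unsafe`` raw-pointer interface): each matrix is described by a buffer plus a row stride and a column stride, so element (i, j) lives at offset ``i * rs + j * cs``. The function name and the example strides below are illustrative only, and this naive version assumes the computed offsets are nonnegative.

```rust
// Naive strided gemm sketch: C = alpha * A * B + beta * C,
// where each matrix has an independent row stride (rs*) and column
// stride (cs*), so row-major, column-major, and other layouts all work.
fn naive_gemm(
    m: usize, k: usize, n: usize,
    alpha: f32,
    a: &[f32], rsa: isize, csa: isize,
    b: &[f32], rsb: isize, csb: isize,
    beta: f32,
    c: &mut [f32], rsc: isize, csc: isize,
) {
    // Map a logical (i, j) index to a buffer offset using the strides.
    let idx = |i: usize, j: usize, rs: isize, cs: isize| -> usize {
        (i as isize * rs + j as isize * cs) as usize
    };
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0;
            for p in 0..k {
                acc += a[idx(i, p, rsa, csa)] * b[idx(p, j, rsb, csb)];
            }
            let cij = &mut c[idx(i, j, rsc, csc)];
            *cij = alpha * acc + beta * *cij;
        }
    }
}

fn main() {
    // 2x2 example: A is row-major (rs=2, cs=1), B is the identity
    // stored column-major (rs=1, cs=2), C is row-major.
    let a = [1.0, 2.0, 3.0, 4.0]; // [[1, 2], [3, 4]]
    let b = [1.0, 0.0, 0.0, 1.0]; // identity, column-major
    let mut c = [0.0f32; 4];
    naive_gemm(2, 2, 2, 1.0, &a, 2, 1, &b, 1, 2, 0.0, &mut c, 2, 1);
    println!("{:?}", c); // prints [1.0, 2.0, 3.0, 4.0], i.e. C = A
}
```

The stride-pair parameters mirror the shape of the crate's ``sgemm``/``dgemm`` entry points, which take strides for A, B, and C in the same spirit.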
This crate was inspired by the macro/microkernel approach to matrix multiplication that is used by the BLIS_ project.
.. _BLIS: https://github.com/flame/blis
|crates|_
.. |crates| image:: https://meritbadge.herokuapp.com/matrixmultiply
.. _crates: https://crates.io/crates/matrixmultiply
Blog post about this crate:

- `gemm: a rabbit hole`__

__ https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole/
0.3.1
-----
- Attempt to fix bug #55, where the mask buffer in TLS did not seem to get its
  requested alignment on macOS. The mask buffer pointer is now aligned
  manually (again, like it was in 0.2.x).
- Fix a minor issue where we were passing a buffer pointer as ``&T`` when it
  should have been ``&[T]``.
0.3.0
-----
- Support threading (and configure the number of threads with the environment
  variable ``MATMUL_NUM_THREADS``). Initial support is for up to 4 threads;
  this will be updated with more experience in coming versions.
- Added a better benchmarking program for arbitrary size and layout; see
  ``examples/benchmark.rs``. It supports csv output for better recording of
  measurements.
- Minimum supported Rust version is 1.41.1, and the version update policy has
  been updated.
- Updated to Rust 2018 edition.
- Moved CI to GitHub Actions (so long Travis, and thanks for all the fish).
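Assuming the environment variable behaves as described above, a run of the bundled benchmark program with a thread cap might look like this (a sketch; it assumes a checkout of the repository so that ``examples/benchmark.rs`` is available):

```shell
# Cap matrixmultiply at 2 worker threads for this benchmark run.
# The variable is read by the library at runtime.
MATMUL_NUM_THREADS=2 cargo run --release --example benchmark
```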
0.2.4
-----
- Support no-std mode, by @vadixidav and @jturner314. New (default) feature
  flag ``std``; use ``default-features = false`` to disable it and use
  no-std. Note that runtime CPU feature detection requires std.
- Fix tests so that they build correctly on non-x86 platforms (#49), and
  manage the release, by @bluss.
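The no-std setup described above amounts to a one-line manifest change (a sketch; the version number is simply the release this entry describes):

```toml
# Disable the default "std" feature to build matrixmultiply in no-std mode.
# Note: runtime CPU feature detection requires std.
[dependencies]
matrixmultiply = { version = "0.2.4", default-features = false }
```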
0.2.3
-----
- Support ``-Ctarget-cpu=native`` use (not recommended -
  use automatic runtime feature detection instead).

0.2.2
-----
- Benchmark improvements: Using fma instructions reduces execution time on
  dgemm benchmarks by 25-35% compared with the avx kernel, see issue `#35`_.
  Using the avx dgemm kernel reduces execution time on dgemm benchmarks by
  5-7% compared with the previous version's autovectorized kernel.
- Benchmark improvement: Using fma instructions reduces execution time on
  sgemm benchmarks by 10-15% compared with the avx kernel, see issue `#35`_.
- Benchmark improvement: Reduces execution time on various benchmarks
  by 1-2% in the avx kernels, see `#37`_.

.. _#35: https://github.com/bluss/matrixmultiply/issues/35
.. _#37: https://github.com/bluss/matrixmultiply/issues/37
0.2.1
-----
- Benchmark improvement: execution time for a 64×64 problem where the inputs
  are either both row major or both column major changed by -5% for sgemm and
  -1% for dgemm. (#26)
- Benchmark improvement: execution time for a 32×32 problem where the output
  is column major changed by -11%. (#27)
0.2.0
-----
- Runtime CPU feature detection selects the best available kernel; this means
  no special compiler flags are needed to enable native instruction
  performance!
- Implement a specialized 8×8 sgemm (f32) AVX microkernel; this speeds up
  matrix multiplication by another 25%.
- Use ``std::alloc`` for allocation of aligned packing buffers.
- We now require Rust 1.28 as the minimal version.
0.1.15
------

0.1.14
------

0.1.13
------
- Use ``rawpointer``, a µcrate with raw pointer methods taken from this
  project.

0.1.12
------
0.1.11
------

0.1.10
------

0.1.9
-----

0.1.8
-----

0.1.7
-----

0.1.6
-----

0.1.5
-----

0.1.4
-----

0.1.3
-----

0.1.2
-----

0.1.1
-----