Iai-Callgrind is a benchmarking framework and harness that uses Callgrind to provide extremely accurate and consistent measurements of Rust code, making it perfectly suited to run in environments like a CI.
This crate started as a fork of the great Iai crate, rewritten to use Valgrind's Callgrind instead of Cachegrind, and it also adds many other improvements and features.
In order to use Iai-Callgrind, you must have Valgrind installed. This means that Iai-Callgrind cannot be used on platforms that are not supported by Valgrind.
To start with Iai-Callgrind, add the following to your `Cargo.toml` file:
toml
[dev-dependencies]
iai-callgrind = "0.3.0"
To be able to run the benchmarks you'll also need the `iai-callgrind-runner` binary installed somewhere in your `$PATH`, for example with
shell
cargo install --version 0.3.0 iai-callgrind-runner
When updating the `iai-callgrind` library, you'll also need to update `iai-callgrind-runner` and vice versa, or else the benchmark runner will exit with an error.
Add
toml
[[bench]]
name = "my_benchmark"
harness = false
to your `Cargo.toml` file and then create a file with the same name in `benches/my_benchmark.rs` with the following content:
```rust
use iai_callgrind::{black_box, main};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

// Don't forget the `#[inline(never)]`
#[inline(never)]
fn iai_benchmark_short() -> u64 {
    fibonacci(black_box(10))
}

#[inline(never)]
fn iai_benchmark_long() -> u64 {
    fibonacci(black_box(30))
}

main!(iai_benchmark_short, iai_benchmark_long);
```
Note that it is important to annotate the benchmark functions with `#[inline(never)]`, or else the Rust compiler will most likely try to optimize this function and inline it. Callgrind is function (name) based, and the collection of counter events starts when entering this function and ends when leaving it. Not inlining this function also reduces the influence of the surrounding code on the benchmark function.
Now you can run this benchmark with `cargo bench --bench my_benchmark` in your project root and you should see something like this:
text
my_benchmark::bench_fibonacci_short
Instructions: 1727
L1 Data Hits: 621
L2 Hits: 0
RAM Hits: 1
Total read+write: 2349
Estimated Cycles: 2383
my_benchmark::bench_fibonacci_long
Instructions: 26214727
L1 Data Hits: 9423880
L2 Hits: 0
RAM Hits: 2
Total read+write: 35638609
Estimated Cycles: 35638677
In addition, you'll find the Callgrind output in `target/iai/my_benchmark` if you want to investigate further with a tool like `callgrind_annotate`. Now, if you run the same benchmark again, the output will report the differences between the current and the previous run. Say you've made a change to the `fibonacci` function, then you might see something like this:
text
my_benchmark::bench_fibonacci_short
Instructions: 2798 (+62.01506%)
L1 Data Hits: 1006 (+61.99678%)
L2 Hits: 0 (No Change)
RAM Hits: 1 (No Change)
Total read+write: 3805 (+61.98382%)
Estimated Cycles: 3839 (+61.09945%)
my_benchmark::bench_fibonacci_long
Instructions: 16201590 (-38.19661%)
L1 Data Hits: 5824277 (-38.19661%)
L2 Hits: 0 (No Change)
RAM Hits: 2 (No Change)
Total read+write: 22025869 (-38.19661%)
Estimated Cycles: 22025937 (-38.19654%)
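The percentages in the comparison above are consistent with a plain relative change, `(new - old) / old * 100`. The following is a minimal sketch of that arithmetic, not the runner's actual code: the function names `percent_change` and `format_change` are hypothetical, and the five-decimal formatting and the `(No Change)` label are assumptions taken from the sample output.

```rust
/// Relative change of `new` against `old` in percent.
/// Returns `None` when `old` is zero, because the ratio is undefined then.
fn percent_change(new: u64, old: u64) -> Option<f64> {
    if old == 0 {
        return None;
    }
    let (new, old) = (new as f64, old as f64);
    Some((new - old) / old * 100.0)
}

/// Renders one metric similar to the sample output (assumed formatting).
fn format_change(new: u64, old: u64) -> String {
    if new == old {
        return format!("{} (No Change)", new);
    }
    match percent_change(new, old) {
        // Always print the sign and keep five decimal places
        Some(p) => format!("{} ({:+.5}%)", new, p),
        // Hypothetical fallback: the sample output shows no such case
        None => format!("{} (previous value was 0)", new),
    }
}
```

With the `Instructions` counts of the short benchmark (1727 before, 2798 after), `format_change(2798, 1727)` yields `2798 (+62.01506%)`.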
For examples see also the benches folder.
Usually, all setup code in the benchmark function itself is attributed to the event counts. It's possible to pass additional arguments to Callgrind and something like below will eliminate the setup code from the final metrics:
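```rust
use iai_callgrind::{black_box, main};
use my_library;

// The export name gives Callgrind a stable symbol to match in `toggle-collect`
#[export_name = "some_special_id::expensive_setup"]
#[inline(never)]
fn expensive_setup() -> Vec<u64> {
    // Some expensive setup code producing the input (placeholder body and element type)
    Vec::new()
}

#[inline(never)]
fn test() {
    my_library::call_to_function(black_box(expensive_setup()));
}

main!(
    callgrind_args = "toggle-collect=some_special_id::expensive_setup";
    functions = test
);
```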
and then run the benchmark for example with
shell
cargo bench --bench my_bench
See also Skip setup code example for an in-depth explanation.
This crate is built on the same idea as the original Iai, but many improvements have been applied over time. The biggest difference is that it uses Callgrind under the hood instead of Cachegrind.
Iai-Callgrind has even more precise and stable metrics across different systems. It achieves this by
- counting events only from entering the benchmark function until leaving it (see `bench_empty` below). This behavior virtually encapsulates the benchmark function and (almost) completely separates the benchmark from the surrounding code.
- separating the library from the runner. This is why `iai-callgrind-runner` has to be installed separately, but before this separation even small changes in the iai library had effects on the benchmarks under test.
Below is a run of one of the benchmarks of this library on my local computer:
shell
$ cd iai-callgrind
$ cargo bench --bench test_regular_bench
test_regular_bench::bench_empty
Instructions: 0
L1 Data Hits: 0
L2 Hits: 0
RAM Hits: 0
Total read+write: 0
Estimated Cycles: 0
test_regular_bench::bench_fibonacci
Instructions: 1727
L1 Data Hits: 621
L2 Hits: 0
RAM Hits: 1
Total read+write: 2349
Estimated Cycles: 2383
test_regular_bench::bench_fibonacci_long
Instructions: 26214727
L1 Data Hits: 9423880
L2 Hits: 0
RAM Hits: 2
Total read+write: 35638609
Estimated Cycles: 35638677
For comparison, here is the output of the same benchmark in the GitHub CI:
text
test_regular_bench::bench_empty
Instructions: 0
L1 Data Hits: 0
L2 Hits: 0
RAM Hits: 0
Total read+write: 0
Estimated Cycles: 0
test_regular_bench::bench_fibonacci
Instructions: 1727
L1 Data Hits: 621
L2 Hits: 0
RAM Hits: 1
Total read+write: 2349
Estimated Cycles: 2383
test_regular_bench::bench_fibonacci_long
Instructions: 26214727
L1 Data Hits: 9423880
L2 Hits: 0
RAM Hits: 2
Total read+write: 35638609
Estimated Cycles: 35638677
There's no difference (in this example), which makes benchmark runs and performance improvements of the benchmarked code even more comparable across systems. However, the above benchmarks are pretty clean and you'll most likely see some very small differences in your own benchmarks.
The now obsolete calibration run needed with Iai only fixed the summary output of Iai itself, but the output of `cg_annotate` was still cluttered by the setup functions and metrics. The `callgrind_annotate` output produced by Iai-Callgrind is far cleaner and centered on the actual function under test.
The statistics of the benchmarks are mostly not compatible with the original Iai anymore although still related. They now also include some additional information:
text
test_regular_bench::bench_fibonacci_long
Instructions: 26214727
L1 Data Hits: 9423880
L2 Hits: 0
RAM Hits: 2
Total read+write: 35638609
Estimated Cycles: 35638677
There is an additional line `Total read+write` which summarizes all event counters above it, and the `L1 Accesses` line changed to `L1 Data Hits`. So, the (L1) `Instructions` (reads) and `L1 Data Hits` are now listed separately.
In detail:
`Total read+write = Instructions + L1 Data Hits + L2 Hits + RAM Hits`.
The formula for the `Estimated Cycles` hasn't changed and uses Itamar Turner-Trauring's formula from https://pythonspeed.com/articles/consistent-benchmarking-in-ci/:
Estimated Cycles = (Instructions + L1 Data Hits) + 5 × (L2 Hits) + 35 × (RAM Hits)
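The two formulas above can be checked against the sample output with a few lines of Rust. This is only an illustrative sketch: the `Metrics` struct and its method names are hypothetical and not part of the Iai-Callgrind API.

```rust
/// Event counts as shown in the summary output (illustrative type, not the crate's API).
#[derive(Debug, Clone, Copy)]
struct Metrics {
    instructions: u64,
    l1_data_hits: u64,
    l2_hits: u64,
    ram_hits: u64,
}

impl Metrics {
    /// Total read+write = Instructions + L1 Data Hits + L2 Hits + RAM Hits
    fn total_read_write(&self) -> u64 {
        self.instructions + self.l1_data_hits + self.l2_hits + self.ram_hits
    }

    /// Estimated Cycles = (Instructions + L1 Data Hits) + 5 * L2 Hits + 35 * RAM Hits
    fn estimated_cycles(&self) -> u64 {
        (self.instructions + self.l1_data_hits) + 5 * self.l2_hits + 35 * self.ram_hits
    }
}
```

For the `bench_fibonacci` run shown earlier (1727 instructions, 621 L1 data hits, 0 L2 hits, 1 RAM hit), `total_read_write()` gives 2349 and `estimated_cycles()` gives 2383, matching the reported lines.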
For further details about how the caches are simulated and more, see the documentation of Callgrind.
The metrics output is colored by default but follows the value of the `CARGO_TERM_COLOR` environment variable. To disable colors, set this environment variable to `CARGO_TERM_COLOR=never`.
This library uses `env_logger` with the default logging level `WARN`. Currently, `env_logger` is only used to print some warnings and debug output. To change the logging level, set the environment variable `RUST_LOG`, for example to `RUST_LOG=DEBUG`. The logging output is colored by default but follows the setting of `CARGO_TERM_COLOR`. See also the documentation of `env_logger`.
It's now possible to pass additional arguments to Callgrind, separated by `--` (`cargo bench -- CALLGRIND_ARGS`), or to overwrite the defaults, which are:
- `--I1=32768,8,64`
- `--D1=32768,8,64`
- `--LL=8388608,16,64`
- `--cache-sim=yes` (can't be changed)
- `--toggle-collect=*BENCHMARK_FILE::BENCHMARK_FUNCTION`
- `--collect-atstart=no`
- `--compress-pos=no`
- `--compress-strings=no`
Note that `toggle-collect` won't be overwritten by any additional `toggle-collect` argument but instead will be passed to Callgrind in addition to the default value. See the Skipping setup code section for an example of how to make use of this.
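To read the cache defaults listed above: Valgrind documents the value of `--I1`, `--D1` and `--LL` as `size,associativity,line size`, with sizes in bytes. The sketch below parses such a value and derives the number of cache sets. The `CacheConfig` type is hypothetical and only serves to illustrate the numbers; it is not part of Iai-Callgrind or Valgrind.

```rust
/// One simulated cache level, described by `size,associativity,line size`
/// (illustrative type, not part of any library).
#[derive(Debug, PartialEq, Eq)]
struct CacheConfig {
    size_bytes: u64,
    associativity: u64,
    line_size_bytes: u64,
}

impl CacheConfig {
    /// Parses the value part of an option such as `--D1=32768,8,64`.
    /// Returns `None` for malformed input or zero-sized fields.
    fn parse(value: &str) -> Option<Self> {
        let mut parts = value.split(',').map(|p| p.trim().parse::<u64>());
        let size_bytes = parts.next()?.ok()?;
        let associativity = parts.next()?.ok()?;
        let line_size_bytes = parts.next()?.ok()?;
        // Reject trailing fields and values that would divide by zero in `sets`
        if parts.next().is_some() || associativity == 0 || line_size_bytes == 0 {
            return None;
        }
        Some(Self { size_bytes, associativity, line_size_bytes })
    }

    /// Number of sets: size / (associativity * line size).
    fn sets(&self) -> u64 {
        self.size_bytes / (self.associativity * self.line_size_bytes)
    }
}
```

Under that reading, the default `32768,8,64` describes a 32 KiB, 8-way cache with 64-byte lines (64 sets), and `8388608,16,64` describes an 8 MiB, 16-way last-level cache (8192 sets).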
It's also possible to pass arguments to Callgrind on a benchmark file level with the alternative form of the main macro:
rust
main!(
    callgrind_args = "--arg-with-flags=yes", "arg-without-flags=is_ok_too";
    functions = func1, func2
);
See also Callgrind Command-line Options.
The Callgrind output files are located in a subdirectory of `target/iai` to avoid overwriting them in case of multiple benchmark files.
Iai-Callgrind does not completely remove the influences of setup changes (like an additional benchmark function in the same file). However, these effects should no longer be as large.
Iai-Callgrind is forked from https://github.com/bheisler/iai and was originally written by Brook Heisler (@bheisler).
Iai-Callgrind is, like Iai, dual-licensed under the Apache 2.0 license and the MIT license.