We measure the latency of sending a message from one CPU core to another via the cache coherence protocol.
By pinning two threads to two different CPU cores, we can have them perform compare-exchange operations in a ping-pong fashion and measure the resulting latency.
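The ping-pong idea can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: it skips thread pinning for brevity (the real measurement pins each thread to a specific core, e.g. with an affinity API) and does no statistics, but it shows the compare-exchange round trip that bounces one cache line between two threads.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

/// Ping-pong a shared flag between two threads and return the estimated
/// one-way latency in nanoseconds (round-trip time divided by 2).
fn measure_one_way_ns(round_trips: u32) -> f64 {
    let flag = Arc::new(AtomicBool::new(false));
    let f = flag.clone();

    // "Pong" thread: waits for the flag to become true, then flips it back.
    let pong = thread::spawn(move || {
        for _ in 0..round_trips {
            while f
                .compare_exchange(true, false, Ordering::AcqRel, Ordering::Acquire)
                .is_err()
            {
                std::hint::spin_loop();
            }
        }
    });

    // "Ping" side: waits for false, sets it to true. Each pair of successful
    // flips is one full round trip through the cache coherence protocol.
    let start = Instant::now();
    for _ in 0..round_trips {
        while flag
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }
    pong.join().unwrap();

    start.elapsed().as_nanos() as f64 / round_trips as f64 / 2.0
}

fn main() {
    println!("~{:.0} ns one-way", measure_one_way_ns(100_000));
}
```

Without pinning, the two threads may land on the same core or migrate mid-run, so the number printed here is only indicative; pinning is what makes the per-core-pair matrix below meaningful.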
How to run:

```
$ cargo install core-to-core-latency
$ core-to-core-latency
```
| CPU | Release Date | Median Latency |
|-----|--------------|----------------|
| Intel Core i9-12900K @ 8+8 Cores (Alder Lake, 12th gen) | 2021-Q4 | 35ns, 44ns, 50ns |
| Intel Xeon Platinum 8375C @ 2.90GHz 32 Cores (Ice Lake, 3rd gen) | 2021-Q2 | 51ns |
| Intel Xeon Platinum 8275CL @ 3.00GHz 24 Cores (Cascade Lake, 2nd gen) | 2019-Q2 | 47ns |
| Intel Core i7-1165G7 @ 2.80GHz 4 Cores (Tiger Lake, 11th gen) | 2020-Q3 | 27ns |
| Intel Core i9-9900K @ 3.60GHz 8 Cores (Coffee Lake, 9th gen) | 2018-Q4 | 21ns |
| Intel Xeon E5-2695 v4 @ 2.10GHz 18 Cores (Broadwell, 5th gen) | 2016-Q1 | 44ns |
| Intel Core i5-4590 @ 3.30GHz 4 Cores (Haswell, 4th gen) | 2014-Q2 | 21ns |
| AMD EPYC 7R13 @ 48 Cores (Milan, 3rd gen) | 2021-Q1 | 23ns and 107ns |
| AMD Ryzen 9 5950X @ 3.40GHz 16 Cores (Zen3, 4th gen) | 2020-Q4 | 17ns and 85ns |
| AMD Ryzen 7 2700X @ 3.70GHz 8 Cores (Zen+, 2nd gen) | 2018-Q3 | 24ns and 92ns |
| AWS Graviton3 @ 64 Cores (Arm Neoverse, 3rd gen) | 2021-Q4 | 46ns |
| AWS Graviton2 @ 64 Cores (Arm Neoverse, 2nd gen) | 2020-Q1 | 47ns |
| Apple M1 | 2020-Q4 | 39ns |
| Apple M1 Max | 2021-Q4 | 39ns |
See the notebook results/results.ipynb for additional CPU graphs, including hyperthreading and dual-socket configurations.
Data provided by bizude.
This CPU has 8 performance cores and 2 groups of 4 efficiency cores. CPU 8 shows fast access to all other cores.
From an AWS c6i.metal machine.
From an AWS c5.metal machine.
Data provided by Jonas Wunderlich
My gaming machine; it's about twice as fast as the server-oriented CPUs.
From a machine provided by GTHost
Data provided by Felipe Lube de Bragança
From an AWS c6a.metal machine.
We can see cores arranged in 6 groups of 8, with excellent latency within a group (23ns). When data crosses groups, the latency jumps to around 110ns. Note that the last 3 groups have better cross-group latency (~90ns) than the first 3.
Data provided by John Schoenick.
We can see 2 groups of 8 cores with latencies of 17ns intra-group, and 85ns inter-group.
Data provided by David Hoppenbrouwers.
We can see 2 groups of 4 cores with latencies of 24ns intra-group, and 92ns inter-group.
From an AWS c7g.16xlarge machine.
From an AWS c6gd.metal machine.
First install Rust and gcc on Linux, then:
```
$ cargo install core-to-core-latency
$ core-to-core-latency

Num cores: 10
Using RDTSC to measure time: false
Num round trips per samples: 1000
Num samples: 300
Showing latency=round-trip-time/2 in nanoseconds:

       0       1       2       3       4       5       6       7       8       9
  0
  1   52±6
  2   38±6    39±4
  3   39±5    39±6    38±6
  4   34±6    38±4    37±6    36±5
  5   38±5    38±6    38±6    38±6    37±6
  6   38±5    37±6    39±6    36±4    49±6    38±6
  7   36±6    39±5    39±6    37±6    35±6    36±6    38±6
  8   37±5    38±6    35±5    39±5    38±6    38±5    37±6    37±6
  9   48±6    39±6    36±6    39±6    38±6    36±6    41±6    38±6    39±6

Min  latency: 34.5ns ±6.1 cores: (4,0)
Max  latency: 52.1ns ±9.4 cores: (1,0)
Mean latency: 38.4ns
```
Use `core-to-core-latency 5000 --csv > output.csv` to instruct the program to use 5000 iterations per sample to reduce the noise, and save the results. The output can then be used in the Jupyter notebook results/results.ipynb for rendering graphs.
Create a GitHub issue with the generated output.csv file and I'll add your results.
This software is licensed under the MIT license.