Fast new algorithm(s) for finding 1D medians, implemented in Rust.
Finding the medians is a common task in statistics and data analysis. At least it should be, if only it did not take so long.
We argue in rstats that using the Geometric Median is the most stable way to characterise multidimensional data.
That leaves the one dimensional case, where the medians are not used nearly enough either, due to being much slower to calculate than the arithmetic mean.
Floyd-Rivest with the 'Median of Medians' approximation is currently considered to be the best algorithm. Here we explore some alternatives:
naive_median
is relatively slow but is a useful baseline for comparisons. It gives reliable and exact results. The median can be found simply by sorting the list of data and then picking the midpoint. Here using the standard Rust sort_unstable_by
sort. The only problem with this approach is that, even when using a good quality sort with guaranteed performance, the complexity is O(n log n).
The quest for faster algorithms, with complexity O(n) is motivated by the simple observation that not all items need to be fully sorted.
w_median
is a specialisation of n dimensional gmedian
from rstats to one dimensional case. It is iterative and on average about twice as fast as the naive_median
.
hash_median
uses hashsort
from indxvec
crate. It is about 30% faster than the naive median on longer vecs, simply due to hashsort
. However, for short vecs, up to about 100 items, it is slower.
There is at least one more algorithm in the pipeline.