Medians

by Libor Spacek

Fast algorithm for finding 1d medians, implemented in Rust.

Usage

rust use medians::{MStats,Medianf64,Median};

Introduction

Finding medians is a common task in statistics and data analysis. At least it should be, because Median is more stable measure of central tendency than mean. Similarly, mad (median of absolute differences from median) is a more stable measure of data spread than is standard deviation. Median and mad are not used nearly enough simply for historical reasons; being more difficult to compute. The fast algorithms presented here provide a practical remedy for this situation.

We argued in rstats, that using the Geometric Median is the most stable way to characterise multidimensional data. The one dimensional case is addressed here.

See tests.rs for examples of usage. Their automatically generated output can also be found by clicking the 'test' icon at the top of this document and then examining the latest log.

Algorithms Analysis

Median can be found naively by sorting the list of data and then picking the midpoint. When using the best known sort algorithm(s), the complexity is O(nlog(n)). Faster median algorithms, with complexity O(n) are possible. They are based on the observation that not all data items need to be fully sorted, only partitioned and counted off. Therefore the naive median can not compete. It has been deleted as of version 2.0.0.

Currently considered to be the 'state of the art' algorithm is Floyd-Rivest (1975) Median of Medians. This divides the data into groups of five items, finds a median of each group and then recursively finds medians of five of these medians, and so on, until only one is left. This is then used as a pivot for the partitioning of the original data. This pivot is guaranteed to produce 'pretty good' partitioning, though not necessarily perfect.

However, to find the best pivot is not the main objective. Rather, it is to count off eccentric data items as fast as possible. Therefore, the expense of choosing the pivot must be considered. It is possible to allow less optimal pivots, as we do here and yet, on average, to find the median faster.

Let our average ratio of items remaining after one partitioning be RS and the Floyd-Rivest be RF. Where 1/2 <= RF <= RS < 1. RF is more optimal, being nearer to the perfect ratio 1/2. However, suppose that we can perform two partitions in the time it takes Floyd-Rivest to do one (because of their expensive pivot selection process). Then it is enough for better performance that RS^2 < RF, which is entirely possible and seems to be confirmed in practice. For example, RF=0.65 (nearly optimal), RS=0.8 (deeply suboptimal), yet RS^2 < RF.

The main features of our median algorithm are:

Linear complexity.
Fast (in place) iterative partitioning, minimising data movements and memory management.
Simple pivot selection strategy (median of a sample of three). This is enough to guarantee convergence, even in the worst case.

Trait `Medianf64`

rust /// Fast 1D f64 medians and associated tasks pub trait Medianf64 { /// Finds the median, fast. fn median(self) -> Result<f64, Me>; /// finds and subtracts the median from data. fn zeromedian(self) -> Result<Vec<f64>, Me>; /// Median correlation = cosine of an angle between two zero median vecs fn mediancorr(self,v: &[f64]) -> Result<f64, Me>; /// Median of absolute differences (MAD). fn mad(self, med: f64) -> Result<f64, Me>; /// Median and MAD. fn medstats(self) -> Result<MStats, Me>; }

Trait `Median`

Is the generic version of Medianf64. Most of its methods take an extra argument, a quantify closure, which evaluates T to f64. This extends their applicability to all quantifiable generic types T. The closures facilitate not just standard as and .into() conversions but also any number of custom ways of quantifying more complex data types. The conversion incurs some extra cost but it is only done once.

Five of the methods fulfil the same roles as those in trait Medianf64 above, so instead of renaming them, we use auto referencing.

generic_odd, generic_even

are provided for types for which the quantification is not possible, only direct comparison. They return references to the central item(s) but otherwise use the same algorithm. The only additional cost is an extra layer of referencing. These methods are particularly suitable for large data types, as the data is not moved around.

When the data items are unquantifiable, we can not simply average the two midpoints of even length data, as we did before. So we return them both as a pair tuple, the smaller one first. Therefore these two methods return results of different types: &T and (&T,&T) respectively. The user has to deal with them appropriately.

rust /// Fast 1D generic medians and associated information and tasks. /// Using auto referencing to disambiguate conflicts /// with five more specific Medianf64 methods with the same names. /// To invoke specifically these generic versions, add a reference: /// `(&v[..]).method` or `v.as_slice().method` pub trait Median<T> { /// Finds the median of `&[T]`, fast. fn median(&self, quantify: impl Fn(&T) -> f64) -> Result<f64, Me>; /// Odd median for any PartialOrd type T fn generic_odd(&self) -> Result<&T, Me>; /// Even median for any PartialOrd type T fn generic_even(&self) -> Result<(&T,&T), Me>; /// Zero median data produced by finding and subtracting the median. fn zeromedian(&self, quantify: impl Copy + Fn(&T) -> f64) -> Result<Vec<f64>, Me>; /// Median correlation = cosine of an angle between two zero median vecs fn mediancorr(&self,v: &[T],quantify: impl Copy + Fn(&T) -> f64) -> Result<f64, Me>; /// Median of absolute differences (MAD). fn mad(&self, med: f64, quantify: impl Fn(&T) -> f64) -> Result<f64, Me>; /// Median and MAD. fn medstats(&self, quantify: impl Copy + Fn(&T) -> f64) -> Result<MStats, Me>; }

`Struct MStats`

Holds the sample parameters: centre (here the median), and the spread measure, (here MAD = median of absolute differences from the median). MAD is the most stable measure of data spread. Alternatively, MStats can hold the mean and the standard deviation, as computed in crate RStats.

Release Notes

Version 2.2.6 - Improved README.md. No changes to the code.

Version 2.2.5 - Upped dependency on indxvec to version 1.8.

Version 2.2.3 - Slight further improvement to efficiency of part.

Version 2.2.2 - Corrected some comment and readme typos. No change in functionality.

Version 2.2.1 - Some code pruning and streamlining. No change in functionality.

Version 2.2.0 - Major new version with much improved speed and generality and some breaking changes (renaming).