Order Statistics Computation
Explains what order statistics, selection algorithms, and percentiles are and various ways of computing them. Also benchmarks in Python and Rust.
Definitions
There are several different related definitions here that are useful to know.
Order statistics
From mathematical statistics, we have this definition:
Definition: order statistic Given random variables $X_1, X_2, …, X_n$, arrange them in ascending order and write the reordered variables as
$$Y_1 \leq Y_2 \leq \cdots \leq Y_n.$$
Each $Y_i$ is then called the $i$th order statistic ($i \in \{1, 2, \ldots, n\}$).
A definition similar to this is used by the mathematics books Order statistics by David and Nagaraja (2004)1 and Order Statistics & Inference by Balakrishnan and Cohen (2014)2. Wikipedia references the first book and therefore defines it the same way.
Note that when values are repeated, any valid ascending ordering gives the same result. Due to the statistical origin of this definition, it’s stated in terms of random variables, which also explains the name.
Some books prefer to focus on order statistics of sets of unique numbers only. CLRS3 goes down this path, but mentions that it’s easy to extend this to allow repeated numbers. This book also avoids tying the definition to random variables, which makes sense in an algorithm book.
Wolfram MathWorld4 only covers the case of unique values, which is sufficient for continuous random variables where the probability of duplicate values is zero.
Selection algorithm
There’s a closely related non-statistical definition in computer science:
Definition: selection algorithm A selection algorithm finds the $k$th smallest element (i.e. the $k$th order statistic) of a list of $n$ elements, for a given $k$.
Algorithms for this are described in both CLRS3 and The Algorithm Design Manual5, as well as on Wikipedia, and this will be the topic for the rest of the blog post, after looking at one more related definition.
Note that the definition of a selection algorithm makes no reference to probability or random variables, but the input can still originate from random variables.
Percentile
There are different ways to define percentiles and they all give similar results for large samples. Here’s a simple definition that is sufficient here:
Definition: percentile The $p$th percentile, $0 \leq p \leq 100$, of a list of $N$ elements is the $k$th smallest element, where
$$k = \left \lceil\frac{p}{100} \cdot N \right \rceil.$$
This means that at least $p\%$ of the elements will be less than or equal to the $p$th percentile, and no smaller number has that property. It’s also closely related to the selection algorithm and order statistics above.
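As a small illustration, the definition translates almost directly into Python. This is a minimal sketch: the function name is mine, and the max(k, 1) clamp is my own choice for $p = 0$, where the definition would give the undefined $k = 0$.

import math

def percentile(values, p):
    # Sketch of the definition above: the k-th smallest element with
    # k = ceil(p / 100 * N). Floating-point rounding issues are ignored here.
    n = len(values)
    k = math.ceil(p / 100 * n)
    k = max(k, 1)  # my choice: map p = 0 to the smallest element
    return sorted(values)[k - 1]

For example, percentile([3, 1, 4, 1, 5], 50) gives 3, since $k = 3$ and the third smallest of the five elements is 3.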
Selection algorithms
In the following, we’ll go through a few different selection algorithms and look at special cases.
Sorting
As is often the case, there’s a very simple way to correctly find the $k$th smallest element: directly implement the mathematical definition. In this case, we’ll turn to the order statistic definition above for guidance.
We can see that if we sort the list of numbers, a selection algorithm can simply index the sorted list after that. This translates easily into Python:
sorted(x)[k-1]
That code correctly raises an exception when the order statistic is not defined and assumes that $k$ starts at $1$ (rather than being zero-based).
In Rust, we can also implement this easily (this time using in-place sorting):
fn kth_element_sort(l: &mut [i64], k: usize) -> i64 {
    l.sort_unstable();
    l[k - 1]
}
This also uses one-based indexing and will cause a panic for the undefined zero-th smallest element. Better error handling and making full use of generic typing would be useful additions in real code. Zero-based indexing is also more natural in most programming languages.
After sorting the list, we can find any order statistic in constant time, so this algorithm finds all the order statistics at once for us and the sorted list can be re-used (after refactoring) to speed up other computations if we want. By finding all the order statistics, the algorithm also computes more than is needed if we only need to find the $k$th smallest element for one $k$. If the list is already sorted, on the other hand, it’s easy to benefit from that by not re-sorting it.
The time complexity for this algorithm is dominated by the sorting and is therefore at best $O(n \log n)$. If the list is pre-sorted, this changes to $O(1)$ (constant time). The latter is clearly the fastest solution when applicable.
This will be relatively slow for large lists. On the other hand, these are complete solutions that are easy to review, adjust, and re-use.
Simple linear scan
For some special cases, such as the minimum, maximum, or second-largest element, it’s easy and fast to scan through the unsorted list once to achieve selection in linear time, $O(n)$.
We’ll look at one such special case: the second-largest element. It can be found in linear time in Python using the following function:
def second_largest(l):
    x0, x1 = max(l[0], l[1]), min(l[0], l[1])
    for i in range(2, len(l)):
        if l[i] > x1:
            x1 = l[i]
            if x1 > x0:
                x0, x1 = x1, x0
    return x1
In Rust we can implement this similarly and much faster:
fn second_largest_loop(l: &[i64]) -> i64 {
    let mut x0 = std::i64::MIN;
    let mut x1 = std::i64::MIN;
    for x in l {
        let x = *x;
        if x > x1 {
            if x > x0 {
                x1 = x0;
                x0 = x;
            } else {
                x1 = x;
            }
        }
    }
    x1
}
These functions are not pretty, but they’re also not terribly hard to understand, even without any comments, as is the case here. Error handling for lists that are too short could be improved.
Both functions are much faster than the respective sort-based implementations for large lists of data. In Python, this is not the case for shorter lists, but in Rust this implementation is always faster. This is expected due to Python being slower.
We pay a price for the added performance. The implementation is more complicated, but more importantly it only supports one specific order statistic (the second-largest here), which is much less flexible. There’s also nothing left to re-use to accelerate other similar computations, which may or may not matter for overall performance. Finally, this solution can’t take advantage of an already sorted list. On the other hand, besides the speed, another advantage is that this algorithm doesn’t need to change the list in any way, which is neat.
In the right situation, this will be the fastest solution!
Partial sorting (expected linear time)
If we want to compute for example the median (i.e. the 50th percentile or middle order statistic) in linear time, we can’t use the simple type of algorithm from the previous section. Sorting still works, but not in linear time.
There’s a family of algorithms that are built on modified sorting algorithms where the idea is that a complete sorting is unnecessary and we can save computation by only partially sorting. This typically means that the $k$th element will have the same value as in the fully sorted list and all other elements will be somewhere on the correct side of this one. In other words, we partition the list around this element with larger elements on one side and smaller on the other (and equal elements on possibly both sides).
One of the simplest such algorithms is quickselect, which is an adaptation of quicksort. Similar to quicksort, the quickselect algorithm has good average performance. The average (or expected) time complexity for quickselect is linear: $O(n)$. The worst-case is $O(n^2)$, which is worse than many sorting algorithms. As always, the worst-case performance may or may not be a practical problem. The algorithm is a relatively simple way to achieve good performance most of the time.
The problem with both quickselect and quicksort is the selection of pivot elements. Bad pivot choices trigger the quadratic-time worst-case performance. Choosing the pivot at random makes this unlikely, but it remains possible, and it needs to be addressed to achieve better worst-case performance.
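To make this concrete, here is a minimal quickselect sketch in Python with a random pivot. It builds new lists at every step for clarity; an in-place partition, as real implementations use, avoids the extra allocations.

import random

def quickselect(l, k):
    # Sketch: k-th smallest element (1-based), expected linear time.
    pivot = random.choice(l)
    smaller = [x for x in l if x < pivot]
    equal = [x for x in l if x == pivot]
    larger = [x for x in l if x > pivot]
    if k <= len(smaller):
        return quickselect(smaller, k)
    if k <= len(smaller) + len(equal):
        return pivot
    return quickselect(larger, k - len(smaller) - len(equal))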
There are faster variants of the basic quickselect algorithm such as the Floyd–Rivest algorithm. It still has the same basic properties, but is often faster in practice by selecting a pivot based on a small sample of elements. This doesn’t guarantee a good pivot element selection, so it doesn’t improve the worst-case performance.
The Floyd–Rivest algorithm seems like a good choice in many practical circumstances where performance matters but theoretical worst-case performance isn’t important. An alternative is an adaptive algorithm such as introselect (described below), which switches to another strategy if needed to avoid the worst case without incurring much overhead otherwise.
As a practical example, C++ provides a function std::nth_element() that runs in linear time on average. Rust similarly provides select_nth_unstable(), based on quickselect. This category of algorithms seems to be the most useful one for generic practical implementations. Other alternatives can be better in specific situations.
In Rust, this is as easy to use as the built-in sort function:
pub fn kth_element_stdlib(l: &mut [i64], k: usize) -> i64 {
    l.select_nth_unstable(k - 1);
    l[k - 1]
}
Worst-case linear time
There’s a relatively simple algorithm that achieves worst-case linear time complexity, but at the cost of more expensive computations. This will typically be slower on average than the Floyd-Rivest algorithm, but is asymptotically faster in the worst case.
The CLRS3 algorithm book describes this algorithm. The Algorithm Design Manual5 only mentions its existence and points out that expected linear time algorithms are likely better in practice.
The linear-time algorithm is basically a variant of quickselect, where the pivot element is chosen as the approximate median using the median of medians algorithm (an approximate selection algorithm for the median).
When median of medians is combined with quickselect in the right way, the worst-case time complexity can be proved to be linear. This is because the median of medians based pivot guarantees that close to 30% of the elements can be discarded in each recursion of quickselect. To achieve this, additional recursive calls are needed to compute the pivot element.
See the Wikipedia article for median of medians or CLRS3 for a complete description.
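As a rough illustration of how the pieces fit together, here is a compact Python sketch (the function names are mine; nothing is optimised or done in place). The pivot is the true median of the group medians, found by calling the selection function itself on the much smaller list of medians.

def median_of_medians_pivot(l):
    # Sketch: median of the medians of groups of five elements.
    medians = [sorted(l[i:i + 5])[len(l[i:i + 5]) // 2]
               for i in range(0, len(l), 5)]
    return select_linear(medians, (len(medians) + 1) // 2)

def select_linear(l, k):
    # Sketch: k-th smallest element (1-based) in worst-case linear time.
    if len(l) <= 5:
        return sorted(l)[k - 1]
    pivot = median_of_medians_pivot(l)
    smaller = [x for x in l if x < pivot]
    equal = [x for x in l if x == pivot]
    larger = [x for x in l if x > pivot]
    if k <= len(smaller):
        return select_linear(smaller, k)
    if k <= len(smaller) + len(equal):
        return pivot
    return select_linear(larger, k - len(smaller) - len(equal))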
Adaptive algorithms and introselect
It’s clear from the above that no single basic algorithm is always the fastest choice. One way around this is to combine several and switch between them adaptively. For example, it’s possible to use a different algorithm for small lists.
More complicated adaptive algorithms are also possible. Introselect is one such algorithm. It uses quickselect at first to achieve high performance, but switches to another algorithm when quickselect approaches its worst-case behaviour. This way, it’s possible to keep the overhead low most of the time and still improve the worst-case performance.
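A rough sketch of the idea in Python, reusing the select_linear() sketch from the previous section as the fallback. The depth budget of roughly $2 \log_2 n$ is my own illustrative choice, not taken from any particular implementation.

import math
import random

def introselect(l, k, depth_budget=None):
    # Sketch: quickselect with a recursion-depth budget; fall back to the
    # worst-case linear select_linear() sketch when the budget runs out.
    if depth_budget is None:
        depth_budget = 2 * (int(math.log2(len(l))) + 1)
    if depth_budget == 0:
        return select_linear(l, k)
    pivot = random.choice(l)
    smaller = [x for x in l if x < pivot]
    equal = [x for x in l if x == pivot]
    larger = [x for x in l if x > pivot]
    if k <= len(smaller):
        return introselect(smaller, k, depth_budget - 1)
    if k <= len(smaller) + len(equal):
        return pivot
    return introselect(larger, k - len(smaller) - len(equal), depth_budget - 1)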
NumPy has an implementation of introselect with worst-case linear performance as part of numpy.partition().
This is almost as easy to use in Python as sorting (l is the list or NumPy array):
numpy.partition(l, k-1)[k-1]
Order statistic tree
So far, all algorithms have operated on a plain array (and possibly some temporary storage). The only acceleration data structure of any kind so far has been the special case of receiving a pre-sorted array above.
It’s also possible to use data structures to create a fast selection algorithm. If we need to compute many (or all) order statistics repeatedly, sorting the whole list is the easiest way to accelerate this. However, this will be slow if the list needs to be updated. We can do better in that case!
An order statistic tree allows us to select the $k$th smallest element in worst-case $O(\log n)$ time. This is worse than $O(1)$ for a sorted list. The order statistic tree also allows us to insert and delete elements in $O(\log n)$ worst-case time, which is much better than the $O(n)$ time it takes for a sorted array.
Such an order statistic tree can be built on top of a red-black tree, where each node additionally stores the number of elements in its subtree.
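Here is a toy Python sketch of the selection step using subtree sizes. Balancing is left out entirely, so this plain BST only achieves $O(\log n)$ behaviour when it happens to stay balanced; a real order statistic tree would use red-black balancing and also update the sizes during rotations.

class Node:
    # Sketch: BST node augmented with the size of its subtree.
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.size = 1

def insert(node, value):
    # Plain BST insert (no rebalancing) that keeps subtree sizes updated.
    if node is None:
        return Node(value)
    node.size += 1
    if value < node.value:
        node.left = insert(node.left, value)
    else:
        node.right = insert(node.right, value)
    return node

def select(node, k):
    # k-th smallest value (1-based) in the subtree rooted at node.
    left_size = node.left.size if node.left else 0
    if k <= left_size:
        return select(node.left, k)
    if k == left_size + 1:
        return node.value
    return select(node.right, k - left_size - 1)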
This clearly beats all above algorithms for a set of numbers that changes over time and where we need to compute a few order statistics between changes. If we use this to find all the order statistics, we end up with the same time complexity as sorting, $O(n \log n)$, and sorting another way would be faster in practice.
Approximation
All the above algorithms give exact results. That’s not always necessary and using approximation algorithms is often the best way to improve performance, when full accuracy is not needed. The maximum possible error can still be bounded.
Calculating order statistics or (equivalently) percentiles approximately is easy using histograms. This is sometimes called bucketing instead, since that’s how it works. By creating a set of non-overlapping buckets that each represent a range of possible values and together cover the range of observed values, we can keep track of the frequency (i.e. count) of values that fall within each bucket.
This reduces the data to a fixed array of frequencies for each bucket, regardless of how many elements we have in the original list. For each original element, we lose information about exactly where it was located inside its bucket. We still know exactly how big this error can be at most. Adding the original elements to the histogram only takes linear time. If this is still too slow, we can add only a sample from the original data to approximate even more.
To compute an order statistic, we simply scan from the beginning until we find the bucket in which the $k$th element is located (i.e. the sum of the frequencies so far adds up to at least $k$). We then know that the order statistic is somewhere inside the bucket and the bucket’s size determines the uncertainty of the exact value.
If we want to accelerate repeated order statistics calculations based on a list of bucket frequencies, we can calculate the prefix sum and use binary search. Either way, the performance depends only on the number of buckets and not the number of original elements.
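A minimal sketch of the lookup described above (the function name and the convention that bucket $i$ covers the half-open range from bucket_edges[i] to bucket_edges[i + 1] are my own):

def approximate_kth(bucket_edges, counts, k):
    # Sketch: return the (lower, upper) bounds of the bucket containing the
    # k-th smallest element; the bucket width is the uncertainty.
    total = 0
    for i, count in enumerate(counts):
        total += count
        if total >= k:
            return bucket_edges[i], bucket_edges[i + 1]
    raise ValueError("k is larger than the number of recorded elements")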
How do we decide on the buckets? One way is to decide on the necessary precision and make each bucket this large. Then, it’s just a matter of adding more buckets as we insert values to achieve full coverage. This doesn’t put an upper bound on the number of buckets, however. Another option is to decide on a fixed number of buckets.
With a fixed number of buckets, we can scan all elements to find the maximum and minimum value and distribute the buckets uniformly between these. We can also perform this in a streaming fashion without scanning the whole list ahead of time. This way, we can simply distribute all buckets evenly between the first two unique values we find.6 After that, we can enlarge the buckets an integer factor whenever needed and simply merge two or more buckets into a larger one. This will free up buckets to increase the overall range.
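For the simpler fixed-bucket-count option with a min/max pre-scan, building the histogram could look something like the following sketch (the streaming variant with bucket merging is more involved and not shown). The result plugs straight into the lookup sketch above.

def build_histogram(values, num_buckets):
    # Sketch: num_buckets equal-width buckets spanning min..max of the data.
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    if width == 0:
        width = 1  # all values equal; this trivial case is handled crudely here
    edges = [lo + i * width for i in range(num_buckets + 1)]
    counts = [0] * num_buckets
    for v in values:
        i = min(int((v - lo) / width), num_buckets - 1)  # clamp the maximum
        counts[i] += 1
    return edges, counts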
Bucketing is often used, in my experience, when calculating percentiles in diagrams such as metric graphs for system monitoring dashboards.
Lower bound
It’s clear that an exact selection algorithm, in general, can be no faster than linear time, i.e. $O(n)$. The reason is that every element in the list can potentially affect the result. Changing a single element that is smaller than the $k$th element to a larger value will typically change the result (unless the elements around the $k$th element are equal).
Order statistic trees don’t contradict this conclusion, since they still require at least linear time if we start from a list of numbers and first add them all to the tree. If we have more than just an unordered list, the lower bound can be different, as seen above. This is because some of the work has already been done.
This means that we have an asymptotically optimal algorithm above.
Other implementation concerns
There are more implementation concerns than the time complexity discussed above.
First of all, it’s well established that more complicated and longer code is more bug-prone than simpler and more readable code. What counts as readable, simple, or complicated is highly subjective, however, and depends on personal experience and skills. An experienced C++ programmer typically finds low-level C++ source code much simpler than a programmer who doesn’t know C++ does, and so on. Familiarity with mathematical notation varies even more among programmers. Let’s instead look at more objective implementation concerns.
The most obvious is storage. Many of the above algorithms can be done in-place with only constant extra storage, so storage complexity is not a concern in that way. What does matter is the size of the data being processed. The algorithms above generally assume that the data fits in RAM and is processed on a typical CPU. If the data doesn’t fit in RAM, we need an external memory algorithm that can work with data either on a directly attached disk or stored across a network. If the data is processed on a GPU, much will be different (which is out of scope here).
For external memory, disk-based sorting can be used. The easiest and most common approach is probably merge sort, although it requires a complete sort. Quicksort can also be adapted for external storage, but that gets more complicated. It should be possible to build an order statistic tree on top of a B-tree. I haven’t found any examples of partial external sorting similar to quickselect and haven’t analysed it. See sortbenchmark.org for external sorting benchmarks.
Another variation of external memory is a typical cloud setting where the data is stored in a distributed data storage system. This changes all considerations considerably and there will likely be many parameters of the system and expected usage pattern to take into consideration. It’s fairly easy to create a distributed batch job to compute a histogram of a very large amount of data stored across thousands of machines and I believe this is a common approach. Sorting is the other obvious solution, which can likely use existing infrastructure.
This brings us to another aspect: parallelisation. This is possible both in the cloud and on a single multi-core machine. Sorting is much easier to parallelise than the selection algorithms discussed here. In theory, every scan through the data to partition it could be split across $m$ CPU cores or even machines, but this adds synchronisation overhead and it’s not clear that it would be any faster. Across a network cluster, the synchronisation overhead is even more extreme. For a sufficiently large amount of data, though, it’s going to be necessary regardless.
Benchmarking
It’s time-consuming to benchmark, so this section only offers benchmarks for some special cases. The focus is on computing the second-largest element. This is less computationally expensive than computing the median, but also offers an additional simple solution that was benchmarked.
As always, measure the performance of your specific implementation in a realistic way to be sure! The benchmarks here are a bit noisy and I haven’t set up a perfect and well-documented environment to make them repeatable. Take them for what they are. The overall results match what I would expect and seem reliable.
I used Criterion in Rust and timeit in Python for the benchmarking. The graphs are created using Matplotlib for both languages.
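For the Python side, the measurements were of roughly this shape. This is a reconstruction rather than the exact benchmark script, and the size and repeat count are only illustrative.

import random
import timeit
import numpy as np

n = 10_000
data = [random.randint(0, 1_000_000) for _ in range(n)]
arr = np.array(data)

# Second-largest element: full sort, the linear scan from earlier, and NumPy.
t_sort = timeit.timeit(lambda: sorted(data)[-2], number=100)
t_scan = timeit.timeit(lambda: second_largest(data), number=100)
t_np = timeit.timeit(lambda: np.partition(arr, n - 2)[n - 2], number=100)
print(f"sort: {t_sort:.4f} s  scan: {t_scan:.4f} s  numpy: {t_np:.4f} s")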
Benchmarking in Python 3
This section benchmarks the computation of the second-largest element.
Time complexity analysis doesn’t tell us the whole story. There’s still a constant factor to the bound that we don’t know and it tells us nothing about small problem sizes or constant overhead.
Since Python is a fairly slow language, I would expect built-in functions such as sorted() to outperform explicit looping here on smaller sizes. For larger problem sizes, the better time complexity will eventually win. Let’s see if that’s the case.
The benchmark results do show something along those lines:
Up to around 400 elements, sorting is faster than a linear scan (in Python code). The difference is tiny overall for arrays of up to around 500 elements. NumPy is faster than both, except for very small lists (under 100 elements), due to some constant overhead.
For bigger arrays, the difference becomes bigger. Both linear-time implementations are faster, but they have different slopes, and NumPy’s introselect implementation is much faster for large data sets.
Benchmarking in Rust
As expected, the Rust implementations run much faster:
The difference here is that Rust has low runtime overhead and the for loop implementation therefore runs with a lower constant overhead per iteration.
This is very quick regardless of algorithm if only run once, of course. For larger sizes, the times grow as expected:
The difference in performance between the built-in select_nth_unstable() quickselect implementation and the custom for loop is small. There’s some noise in the benchmarking data, but it seems like quickselect is slightly slower for bigger slices of data too, which makes sense.
The select_nth_unstable() function is much more flexible, but it requires a mutable slice and adds a little overhead when computing e.g. the maximum, minimum, or second-largest element compared to a simple linear scan through the data. The difference is likely to be bigger for even larger data sets: this data still fits into the CPU cache, which benefits quickselect more. I simply didn’t have the patience to measure huge arrays, so caching effects will have to wait for a future blog post.
David, H. and Nagaraja, H. (2004) Order Statistics. 3rd edn. Wiley. Available at: https://www.perlego.com/book/2772409/order-statistics-pdf (Accessed: 25 September 2021). ↩︎
Balakrishnan, N. and Cohen, C. (2014) Order Statistics & Inference. Elsevier Science. Available at: https://www.perlego.com/book/1897382/order-statistics-inference-pdf (Accessed: 25 September 2021). ↩︎
Cormen, T., Leiserson, C., Rivest, R. and Stein, C. (2009) Introduction to Algorithms. 3rd edn. The MIT Press. Also known as CLRS. ↩︎ ↩︎ ↩︎ ↩︎
Wolfram MathWorld. Order Statistics. Available at: https://mathworld.wolfram.com/OrderStatistic.html (Accessed: 25 September 2021). ↩︎
Skiena, S. S. (2011) The Algorithm Design Manual. 2nd edn. Springer. ↩︎ ↩︎
The special case where all elements have the same value is trivial to handle separately. ↩︎