Improve statistics.median() complexity #135157

Open
ryanrudes opened this issue Jun 5, 2025 · 9 comments
Labels
performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@ryanrudes

Median can be computed in time O(n log n) without sorting using the select-k algorithm.

https://github.com/python/cpython/blob/169cdfefce83fabcea48d0ba24ca4dba210f41d0/Lib/statistics.py#L327C1-L348C43
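
For readers unfamiliar with the term: "select-k" refers to a selection algorithm, i.e. finding the k-th smallest element without fully sorting the data. As a rough illustration of the idea (not the reporter's code; heapq.nsmallest is used here for brevity, and it is O(n log k) rather than linear):

# select_sketch.py (illustrative only, not from the report)
from heapq import nsmallest

def median_via_selection(data):
    xs = list(data)
    n = len(xs)
    if n == 0:
        raise ValueError("no median for empty data")
    k = n // 2
    # The k+1 smallest values, in sorted order; xs itself is never sorted.
    smallest = nsmallest(k + 1, xs)
    if n % 2 == 1:
        return smallest[k]
    return (smallest[k - 1] + smallest[k]) / 2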

@skirpichev skirpichev added the stdlib (Python modules in the Lib dir) and pending (The issue will be closed if no feedback is provided) labels Jun 5, 2025
@skirpichev
Contributor

But sorting in the stdlib is already O(n*log(n)) in the worst case, isn't it?

Could you propose an implementation that actually beats the present algorithm?

@rhettinger
Contributor

Is this a joke?

An abusive tone is not welcome here.

@vstinner vstinner changed the title from "Is this a joke?" to "statistics.median() complexity" Jun 5, 2025
@vstinner
Member

vstinner commented Jun 5, 2025

I changed the issue title.

Median can be computed in time O(n log n) without sorting using the select-k algorithm.

Is it a feature request? You should elaborate.

@skirpichev
Contributor

But sorting in the stdlib is already O(n*log(n)) in the worst case, isn't it?

Indeed, I forgot to add a link: https://en.wikipedia.org/wiki/Timsort. It's already mentioned in the sorting HOWTO: https://docs.python.org/3/howto/sorting.html

@hongweipeng
Contributor

But sorting in the stdlib is already O(n*log(n)) in the worst case, isn't it?

Could you propose an implementation that actually beats the present algorithm?

See https://en.wikipedia.org/wiki/Median_of_medians. Its worst-case time complexity is O(n).

@skirpichev skirpichev added the type-feature (A feature request or enhancement) label and removed the pending (The issue will be closed if no feedback is provided) label Jun 6, 2025
@skirpichev
Contributor

See https://en.wikipedia.org/wiki/Median_of_medians. Its worst-case time complexity is O(n).

Now this does make sense as a feature request to me.

I'll try to benchmark this, though I would expect that a pure-Python version of this algorithm might be worse than the current version on small lists.

@skirpichev skirpichev self-assigned this Jun 6, 2025
@tim-one tim-one added the performance (Performance or resource usage) label Jun 6, 2025
@tim-one
Member

tim-one commented Jun 6, 2025

Yes, median can be computed in worst-case linear time. But the overhead is high. If you did so, and used it to pick the pivot for a quicksort, then quicksort's worst case would be O(n log n) instead of quadratic. But nobody does that, which should be a clue about how high the overhead is 😉. Instead, when modern quicksorts get into trouble, they switch to using heapsort (a slower-on-average method, but easily coded and worst case n log n).
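
For concreteness, the quicksort-with-heapsort-fallback scheme ("introsort") described here looks roughly like this toy sketch (not CPython or numpy code; the 2*log2(n) depth budget is just the conventional choice):

# introsort_sketch.py (toy illustration)
import heapq
import random

def introsort(data):
    def heap_sorted(xs):
        # Worst-case O(n log n) fallback: heapsort via heapq.
        heapq.heapify(xs)
        return [heapq.heappop(xs) for _ in range(len(xs))]

    def sort(xs, depth):
        if len(xs) <= 1:
            return xs
        if depth == 0:
            # Too many bad pivots; quicksort is heading for O(n**2).
            return heap_sorted(xs)
        pivot = random.choice(xs)
        less = [x for x in xs if x < pivot]
        equal = [x for x in xs if x == pivot]
        greater = [x for x in xs if x > pivot]
        return sort(less, depth - 1) + equal + sort(greater, depth - 1)

    return sort(list(data), 2 * max(1, len(data)).bit_length())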

It is used in "quick select" implementations, though, because they have no realistic alternative: there are no other linear-worst-case methods to fall back on if partitioning isn't making sufficient progress.

It's fun 😉 to code, but it's doubtful whether core Python should make heroic efforts to supply worst-case linear-time order statistics.

This is more in numpy's natural domain. But they don't implement worst-case linear-time either. They use a "quick select" based on partitioning, which is expected case linear time, but fall back to a worst case n log n method if partitioning isn't making good progress (Python's sort is suitable for that).
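
numpy exposes that partition step as numpy.partition (whose default kind is "introselect"), so a median along those lines can be sketched as follows; this is an illustration, not numpy's actual source:

# np_partition_median.py (sketch; use numpy.median for real work)
import numpy as np

def partition_median(a):
    a = np.asarray(a, dtype=float)
    n = a.size
    k = n // 2
    if n % 2 == 1:
        # np.partition places the k-th smallest value at index k,
        # in expected linear time.
        return np.partition(a, k)[k]
    # For even n, partition on both middle positions in one pass.
    p = np.partition(a, [k - 1, k])
    return (p[k - 1] + p[k]) / 2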

But partitioning in pure Python is pretty slow compared to what can be done in C, compounded by the fact that Python's sort gets major speed advantages from its elaborate efforts to analyze the list for type homogeneity and to specialize to a cheap type-specific comparison function when possible.

The details here don't really matter: the point is that sorting to do this, in CPython, is faster than you might guess, for several reasons, and is worst case n log n. For practical list sizes, it will probably be hard to beat by anything coded in Python.

@skirpichev
Contributor

But partitioning in pure Python is pretty slow compared to what can be done in C.

That's something I expected.

I did quick tests on a list of random() floats (which I think should be "good enough" to exercise average-case sorting performance) with my mindless translation of the wiki pseudocode (it passes the CI):

$ ./python a.py 10000000  # current sort-based median
15.595426321029663
$ ./python a.py 10000000  # new algorithm
92.77318906784058

If I increase the length by an order of magnitude, I suspect that my system will start swapping :-)

For small sizes it's much worse:

Benchmark       ref      patch
--------------  -------  ----------------------
10              3.26 us  53.5 us: 16.43x slower
100             13.5 us  526 us: 39.10x slower
1000            172 us   5.83 ms: 33.90x slower
100000          52.9 ms  784 ms: 14.82x slower
1000000         940 ms   8.49 sec: 9.03x slower
Geometric mean  (ref)    19.63x slower
# a.py - time statistics.median() on a list of n random floats
from statistics import median
from random import random, seed
from sys import argv
from time import time

seed(1)
n = int(argv[1])
xs = [random() for _ in range(n)]
begin = time()
median(xs)
end = time()
print(end - begin)
# bench.py - pyperf benchmarks of statistics.median() across list sizes
import pyperf
from statistics import median
from random import random, seed

runner = pyperf.Runner()

seed(1)
for xs in ([random() for _ in range(10)],
           [random() for _ in range(100)],
           [random() for _ in range(1000)],
           [random() for _ in range(100000)],
           [random() for _ in range(1000000)],
           ):
    n = len(xs)
    runner.bench_func(str(n), median, xs)
diff --git a/Lib/statistics.py b/Lib/statistics.py
index 3d805cb073..2a91745476 100644
--- a/Lib/statistics.py
+++ b/Lib/statistics.py
@@ -337,15 +337,88 @@ def median(data):
     4.0
 
     """
-    data = sorted(data)
+    data = list(data)
     n = len(data)
     if n == 0:
         raise StatisticsError("no median for empty data")
-    if n % 2 == 1:
-        return data[n // 2]
+    i, r = divmod(n, 2)
+    if r == 1:
+        return nthSmallest(data, i)
     else:
-        i = n // 2
-        return (data[i - 1] + data[i]) / 2
+        return (nthSmallest(data, i - 1) + nthSmallest(data, i))/2
+
+
+def nthSmallest(xs, n):
+    idx = select(xs, 0, len(xs) - 1, n)
+    return xs[idx]
+
+def select(xs, left, right, n):
+    while True:
+        if left == right:
+            return left
+        pivotIndex = pivot(xs, left, right)
+        pivotIndex = partition(xs, left, right, pivotIndex, n)
+        if n == pivotIndex:
+            return n
+        elif n < pivotIndex:
+            right = pivotIndex - 1
+        else:
+            left = pivotIndex + 1
+
+
+def pivot(xs, left, right):
+    # for 5 or less elements just get median
+    if right - left < 5:
+        return partition5(xs, left, right)
+    # otherwise move the medians of five-element subgroups to the first n/5 positions
+    for i in range(left, right + 1, 5):
+        # get the median position of the i'th five-element subgroup
+        subRight = i + 4
+        if subRight > right:
+            subRight = right
+        median5 = partition5(xs, i, subRight)
+        j = left + ((i - left) // 5)
+        xs[median5], xs[j] = xs[j], xs[median5]
+
+    # compute the median of the n/5 medians-of-five
+    mid = ((right - left) // 10) + left + 1
+    return select(xs, left, left + ((right - left) // 5), mid)
+
+
+def partition(xs, left, right, pivotIndex, n):
+    pivotValue = xs[pivotIndex]
+    xs[pivotIndex], xs[right] = xs[right], xs[pivotIndex]
+    storeIndex = left
+    # Move all elements smaller than the pivot to the left of the pivot
+    for i in range(left, right):
+        if xs[i] < pivotValue:
+            xs[storeIndex], xs[i] = xs[i], xs[storeIndex]
+            storeIndex += 1
+    # Move all elements equal to the pivot right after
+    # the smaller elements
+    storeIndexEq = storeIndex
+    for i in range(storeIndex, right):
+        if xs[i] == pivotValue:
+            xs[storeIndexEq], xs[i] = xs[i], xs[storeIndexEq]
+            storeIndexEq += 1
+    xs[right], xs[storeIndexEq] = xs[storeIndexEq], xs[right]
+    # Return location of pivot considering the desired location n
+    if n < storeIndex:
+        return storeIndex  # n is in the group of smaller elements
+    if n <= storeIndexEq:
+        return n  # n is in the group equal to pivot
+    return storeIndexEq  # n is in the group of larger elements
+
+
+def partition5(xs, left, right):
+    i = left + 1
+    while i <= right:
+        j = i
+        while j > left and xs[j-1] > xs[j]:
+            xs[j-1], xs[j] = xs[j], xs[j-1]
+            j -= 1
+        i += 1
+    return left + (right - left) // 2
 
 
 def median_low(data):
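
One quick way to sanity-check a patch like the one above (hypothetical test code; it assumes the patched module's nthSmallest helper is importable) is to compare it against the sorted-based answer on small random inputs with many duplicates, which exercises the equal-to-pivot branch of partition():

# check_select.py (hypothetical sanity check)
import random
from statistics import nthSmallest  # helper added by the patch above

def check(trials=1000):
    for _ in range(trials):
        n = random.randint(1, 50)
        xs = [random.randint(0, 10) for _ in range(n)]
        expected = sorted(xs)
        for k in range(n):
            # nthSmallest reorders its input, so pass a fresh copy.
            assert nthSmallest(list(xs), k) == expected[k]

check()
print("ok")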

@skirpichev skirpichev removed their assignment Jun 7, 2025
@tim-one
Member

tim-one commented Jun 8, 2025

Ya, worst-case linear is a major pain. If you're inclined to do anything (I'm not), look at numpy's implementation. That's expected linear time, but with a method whose worst case is quadratic time, which is very unlikely to occur. Even so, numpy guards against it, falling back to a worst-case n log n method (for which Python's current sort-based code could be used directly, as-is).
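
The guarded strategy described here can be sketched in pure Python (hypothetical code: random-pivot quickselect with a round budget, falling back to sorting the remaining window when partitioning degenerates):

# introselect_sketch.py (illustrative only)
import random

def guarded_kth(data, k):
    """Return the k-th smallest element (0-based), in expected O(n) time."""
    xs = list(data)
    left, right = 0, len(xs) - 1
    rounds = 2 * max(1, len(xs)).bit_length()  # partitioning budget
    while left < right:
        if rounds == 0:
            # Partitioning is not making progress; fall back to the
            # worst-case O(n log n) sort-based path for the window
            # that is known to contain the k-th smallest element.
            return sorted(xs[left:right + 1])[k - left]
        rounds -= 1
        pivot = xs[random.randint(left, right)]
        i, j = left, right
        while i <= j:  # Hoare-style partition around the pivot value
            while xs[i] < pivot:
                i += 1
            while xs[j] > pivot:
                j -= 1
            if i <= j:
                xs[i], xs[j] = xs[j], xs[i]
                i += 1
                j -= 1
        if k <= j:
            right = j
        elif k >= i:
            left = i
        else:
            return xs[k]  # k falls in the equal-to-pivot middle block
    return xs[left]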

@picnixz picnixz changed the title from "statistics.median() complexity" to "Improve statistics.median() complexity" Jun 8, 2025