Improve statistics.median() complexity #135157

Open
ryanrudes opened this issue Jun 5, 2025 · 9 comments
Labels
performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@ryanrudes

Median can be computed in time O(n log n) without sorting using the select-k algorithm.

https://github.com/python/cpython/blob/169cdfefce83fabcea48d0ba24ca4dba210f41d0/Lib/statistics.py#L327C1-L348C43
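
For readers unfamiliar with the term: "select-k" refers to a selection algorithm, i.e. finding the k-th smallest element without fully sorting the data. As a rough illustration of the idea (not the reporter's code; heapq.nsmallest is used here for brevity, and it is O(n log k) rather than linear):

# select_sketch.py (illustrative only, not from the report)
from heapq import nsmallest

def median_via_selection(data):
    xs = list(data)
    n = len(xs)
    if n == 0:
        raise ValueError("no median for empty data")
    k = n // 2
    # The k+1 smallest values, in sorted order; xs itself is never sorted.
    smallest = nsmallest(k + 1, xs)
    if n % 2 == 1:
        return smallest[k]
    return (smallest[k - 1] + smallest[k]) / 2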

@skirpichev skirpichev added the stdlib (Python modules in the Lib dir) and pending (The issue will be closed if no feedback is provided) labels Jun 5, 2025
@skirpichev
Contributor

But sorting in the stdlib is already O(n*log(n)) in the worst case, isn't it?

Could you propose an implementation that actually beats the present algorithm?

@rhettinger
Contributor

Is this a joke?

An abusive tone is not welcome here.

@vstinner vstinner changed the title from "Is this a joke?" to "statistics.median() complexity" Jun 5, 2025
@vstinner
Member

vstinner commented Jun 5, 2025

I changed the issue title.

Median can be computed in time O(n log n) without sorting using the select-k algorithm.

Is it a feature request? You should elaborate.

@skirpichev
Contributor

But sorting in the stdlib is already O(n*log(n)) in the worst case, isn't it?

Indeed, I forgot to add a link: https://en.wikipedia.org/wiki/Timsort. It's already mentioned in the sorting HOWTO: https://docs.python.org/3/howto/sorting.html

@hongweipeng
Contributor

But sorting in the stdlib is already O(n*log(n)) in the worst case, isn't it?

Could you propose an implementation that actually beats the present algorithm?

See https://en.wikipedia.org/wiki/Median_of_medians. Its worst-case time complexity is O(n).

@skirpichev skirpichev added the type-feature (A feature request or enhancement) label and removed the pending (The issue will be closed if no feedback is provided) label Jun 6, 2025
@skirpichev
Contributor

See https://en.wikipedia.org/wiki/Median_of_medians. Its worst-case time complexity is O(n).

Now this does make sense as a feature request to me.

I'll try to benchmark this, though I would expect that a pure-Python version of this algorithm might be worse than the current version on small lists.

@skirpichev skirpichev self-assigned this Jun 6, 2025
@tim-one tim-one added the performance (Performance or resource usage) label Jun 6, 2025
@tim-one
Member

tim-one commented Jun 6, 2025

Yes, median can be computed in worst-case linear time. But the overhead is high. If you did so, and used it to pick the pivot for a quicksort, then quicksort's worst case would be O(n log n) instead of quadratic. But nobody does that, which should be a clue about how high the overhead is 😉. Instead, when modern quicksorts get into trouble, they switch to using heapsort (a slower-on-average method, but easily coded and worst case n log n).
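
For concreteness, the quicksort-with-heapsort-fallback scheme ("introsort") described here looks roughly like this toy sketch (not CPython or numpy code; the 2*log2(n) depth budget is just the conventional choice):

# introsort_sketch.py (toy illustration)
import heapq
import random

def introsort(data):
    def heap_sorted(xs):
        # Worst-case O(n log n) fallback: heapsort via heapq.
        heapq.heapify(xs)
        return [heapq.heappop(xs) for _ in range(len(xs))]

    def sort(xs, depth):
        if len(xs) <= 1:
            return xs
        if depth == 0:
            # Too many bad pivots; quicksort is heading for O(n**2).
            return heap_sorted(xs)
        pivot = random.choice(xs)
        less = [x for x in xs if x < pivot]
        equal = [x for x in xs if x == pivot]
        greater = [x for x in xs if x > pivot]
        return sort(less, depth - 1) + equal + sort(greater, depth - 1)

    return sort(list(data), 2 * max(1, len(data)).bit_length())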

It is used in "quick select" implementations, though, because they have no realistic alternative: there are no other linear-worst-case methods to fall back on if partitioning isn't making sufficient progress.

It's fun 😉 to code, but it's doubtful whether core Python should make heroic efforts to supply worst-case linear-time order statistics.

This is more in numpy's natural domain. But they don't implement worst-case linear-time either. They use a "quick select" based on partitioning, which is expected case linear time, but fall back to a worst case n log n method if partitioning isn't making good progress (Python's sort is suitable for that).
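
numpy exposes that partition step as numpy.partition (whose default kind is "introselect"), so a median along those lines can be sketched as follows; this is an illustration, not numpy's actual source:

# np_partition_median.py (sketch; use numpy.median for real work)
import numpy as np

def partition_median(a):
    a = np.asarray(a, dtype=float)
    n = a.size
    k = n // 2
    if n % 2 == 1:
        # np.partition places the k-th smallest value at index k,
        # in expected linear time.
        return np.partition(a, k)[k]
    # For even n, partition on both middle positions in one pass.
    p = np.partition(a, [k - 1, k])
    return (p[k - 1] + p[k]) / 2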

But partitioning in pure Python is pretty slow compared to what can be done in C, compounded by the fact that Python's sort gets major speed advantages from its elaborate efforts to analyze the list for type homogeneity and to specialize to a cheap type-specific comparison function when possible.

The details here don't really matter: the point is that sorting to do this, in CPython, is faster than you might guess, for several reasons, and is worst case n log n. For practical list sizes, it will probably be hard to beat by anything coded in Python.

@skirpichev
Contributor

But partitioning in pure Python is pretty slow compared to what can be done in C.

That's something I expected.

I did quick tests on a list of random() floats (which I think should be "good enough" to exercise average-case sorting performance) with my mindless translation of the wiki pseudocode (it passes the CI):

$ ./python a.py 10000000  # current sort-based median
15.595426321029663
$ ./python a.py 10000000  # new algorithm
92.77318906784058

If I increase the length by an order of magnitude, I suspect that my system will start swapping :-)

For small sizes it's much worse:

Benchmark       ref      patch
--------------  -------  ----------------------
10              3.26 us  53.5 us: 16.43x slower
100             13.5 us  526 us: 39.10x slower
1000            172 us   5.83 ms: 33.90x slower
100000          52.9 ms  784 ms: 14.82x slower
1000000         940 ms   8.49 sec: 9.03x slower
Geometric mean  (ref)    19.63x slower
# a.py - time statistics.median() on a list of n random floats
from statistics import median
from random import random, seed
from sys import argv
from time import time

seed(1)
n = int(argv[1])
xs = [random() for _ in range(n)]
begin = time()
median(xs)
end = time()
print(end - begin)
# bench.py - pyperf benchmarks of statistics.median() across list sizes
import pyperf
from statistics import median
from random import random, seed

runner = pyperf.Runner()

seed(1)
for xs in ([random() for _ in range(10)],
           [random() for _ in range(100)],
           [random() for _ in range(1000)],
           [random() for _ in range(100000)],
           [random() for _ in range(1000000)],
           ):
    n = len(xs)
    runner.bench_func(str(n), median, xs)
diff --git a/Lib/statistics.py b/Lib/statistics.py
index 3d805cb073..2a91745476 100644
--- a/Lib/statistics.py
+++ b/Lib/statistics.py
@@ -337,15 +337,88 @@ def median(data):
     4.0
 
     """
-    data = sorted(data)
+    data = list(data)
     n = len(data)
     if n == 0:
         raise StatisticsError("no median for empty data")
-    if n % 2 == 1:
-        return data[n // 2]
+    i, r = divmod(n, 2)
+    if r == 1:
+        return nthSmallest(data, i)
     else:
-        i = n // 2
-        return (data[i - 1] + data[i]) / 2
+        return (nthSmallest(data, i - 1) + nthSmallest(data, i))/2
+
+
+def nthSmallest(xs, n):
+    idx = select(xs, 0, len(xs) - 1, n)
+    return xs[idx]
+
+def select(xs, left, right, n):
+    while True:
+        if left == right:
+            return left
+        pivotIndex = pivot(xs, left, right)
+        pivotIndex = partition(xs, left, right, pivotIndex, n)
+        if n == pivotIndex:
+            return n
+        elif n < pivotIndex:
+            right = pivotIndex - 1
+        else:
+            left = pivotIndex + 1
+
+
+def pivot(xs, left, right):
+    # for 5 or less elements just get median
+    if right - left < 5:
+        return partition5(xs, left, right)
+    # otherwise move the medians of five-element subgroups to the first n/5 positions
+    for i in range(left, right + 1, 5):
+        # get the median position of the i'th five-element subgroup
+        subRight = i + 4
+        if subRight > right:
+            subRight = right
+        median5 = partition5(xs, i, subRight)
+        j = left + ((i - left) // 5)
+        xs[median5], xs[j] = xs[j], xs[median5]
+
+    # compute the median of the n/5 medians-of-five
+    mid = ((right - left) // 10) + left + 1
+    return select(xs, left, left + ((right - left) // 5), mid)
+
+
+def partition(xs, left, right, pivotIndex, n):
+    pivotValue = xs[pivotIndex]
+    xs[pivotIndex], xs[right] = xs[right], xs[pivotIndex]
+    storeIndex = left
+    # Move all elements smaller than the pivot to the left of the pivot
+    for i in range(left, right):
+        if xs[i] < pivotValue:
+            xs[storeIndex], xs[i] = xs[i], xs[storeIndex]
+            storeIndex += 1
+    # Move all elements equal to the pivot right after
+    # the smaller elements
+    storeIndexEq = storeIndex
+    for i in range(storeIndex, right):
+        if xs[i] == pivotValue:
+            xs[storeIndexEq], xs[i] = xs[i], xs[storeIndexEq]
+            storeIndexEq += 1
+    xs[right], xs[storeIndexEq] = xs[storeIndexEq], xs[right]
+    # Return location of pivot considering the desired location n
+    if n < storeIndex:
+        return storeIndex  # n is in the group of smaller elements
+    if n <= storeIndexEq:
+        return n  # n is in the group equal to pivot
+    return storeIndexEq  # n is in the group of larger elements
+
+
+def partition5(xs, left, right):
+    i = left + 1
+    while i <= right:
+        j = i
+        while j > left and xs[j-1] > xs[j]:
+            xs[j-1], xs[j] = xs[j], xs[j-1]
+            j -= 1
+        i += 1
+    return left + (right - left) // 2
 
 
 def median_low(data):
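
One quick way to sanity-check a patch like the one above (hypothetical test code; it assumes the patched module's nthSmallest helper is importable) is to compare it against the sorted-based answer on small random inputs with many duplicates, which exercises the equal-to-pivot branch of partition():

# check_select.py (hypothetical sanity check)
import random
from statistics import nthSmallest  # helper added by the patch above

def check(trials=1000):
    for _ in range(trials):
        n = random.randint(1, 50)
        xs = [random.randint(0, 10) for _ in range(n)]
        expected = sorted(xs)
        for k in range(n):
            # nthSmallest reorders its input, so pass a fresh copy.
            assert nthSmallest(list(xs), k) == expected[k]

check()
print("ok")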

@skirpichev skirpichev removed their assignment Jun 7, 2025
@tim-one
Member

tim-one commented Jun 8, 2025

Ya, worst-case linear is a major pain. If you're inclined to do anything (I'm not), look at numpy's implementation. That's expected linear time, but with a method whose worst case is quadratic time, which is very unlikely to occur. Even so, numpy guards against it, falling back to a worst-case n log n method (for which Python's current sort-based code could be used directly, as-is).
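
The guarded strategy described here can be sketched in pure Python (hypothetical code: random-pivot quickselect with a round budget, falling back to sorting the remaining window when partitioning degenerates):

# introselect_sketch.py (illustrative only)
import random

def guarded_kth(data, k):
    """Return the k-th smallest element (0-based), in expected O(n) time."""
    xs = list(data)
    left, right = 0, len(xs) - 1
    rounds = 2 * max(1, len(xs)).bit_length()  # partitioning budget
    while left < right:
        if rounds == 0:
            # Partitioning is not making progress; fall back to the
            # worst-case O(n log n) sort-based path for the window
            # that is known to contain the k-th smallest element.
            return sorted(xs[left:right + 1])[k - left]
        rounds -= 1
        pivot = xs[random.randint(left, right)]
        i, j = left, right
        while i <= j:  # Hoare-style partition around the pivot value
            while xs[i] < pivot:
                i += 1
            while xs[j] > pivot:
                j -= 1
            if i <= j:
                xs[i], xs[j] = xs[j], xs[i]
                i += 1
                j -= 1
        if k <= j:
            right = j
        elif k >= i:
            left = i
        else:
            return xs[k]  # k falls in the equal-to-pivot middle block
    return xs[left]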

@picnixz picnixz changed the title from "statistics.median() complexity" to "Improve statistics.median() complexity" Jun 8, 2025