Improve statistics.median() complexity #135157
But sorting in the stdlib is already O(n log n). Could you propose an implementation that actually beats the present algorithm? |
An abusive tone is not welcome here. |
I changed the issue title.
Is it a feature request? You should elaborate. |
Indeed, I forgot to add link: https://en.wikipedia.org/wiki/Timsort. And it's already mentioned in the sorting howto: https://docs.python.org/3/howto/sorting.html |
https://en.wikipedia.org/wiki/Median_of_medians Its worst-case time complexity is still linear, O(n). |
Now this does make sense as a feature request to me. I'll try to benchmark this. Though, I would expect a pure-Python version of this algorithm to be worse than the current version on small lists. |
Yes, the median can be computed in worst-case linear time. But the overhead is high. If you did so, and used it to pick the pivot for a quicksort, then quicksort's worst case would be O(n log n). It is used in "quick select" implementations, though, because they have no other realistic choice to fall back on if their partitioning isn't making sufficient progress. There are no other linear-worst-case methods for them to fall back on.

It's fun 😉 to code, but it's doubtful whether core Python should make heroic efforts to supply worst-case linear-time order statistics. This is more in the nature of a fun exercise than a stdlib candidate. And partitioning in pure Python is pretty slow compared to what can be done in C, compounded by the fact that Python's sort gets major speed advantages from its elaborate efforts to analyze the list for type homogeneity and to specialize to a cheap type-specific comparison function when possible.

The details here don't really matter: the point is that sorting to do this, in CPython, is faster than you might guess, for several reasons, and is worst case O(n log n). |
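The "quick select" approach mentioned above can be sketched in a few lines. This is an illustrative, hypothetical helper (not CPython code): it uses a random pivot, so the O(n^2) worst case is merely improbable rather than impossible; an introselect-style variant would fall back to median-of-medians when partitioning stops making progress.

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element of xs (0-based).

    Average O(n) time; worst case O(n^2), which an introselect-style
    implementation would avoid via a median-of-medians fallback.
    """
    xs = list(xs)
    while True:
        if len(xs) == 1:
            return xs[0]
        pivot = random.choice(xs)
        lows = [x for x in xs if x < pivot]      # strictly below the pivot
        pivots = [x for x in xs if x == pivot]   # equal to the pivot
        if k < len(lows):
            xs = lows                            # answer is among the lows
        elif k < len(lows) + len(pivots):
            return pivot                         # answer is the pivot itself
        else:
            k -= len(lows) + len(pivots)         # answer is among the highs
            xs = [x for x in xs if x > pivot]
```

Note the list comprehensions allocate fresh lists on every pass; an in-place partition (as in the diff below in this thread) trades that memory churn for slower pure-Python index juggling.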
That's something I expected. I did quick tests for a list of random() floats (I think that should be "good enough" to get sorted with average performance) with my mindless translation of the wiki pseudocode (passes CI):
If I increase the length by an order of magnitude, I suspect my system will swap out :-) For low values it's much worse:
# a.py
from statistics import median
from random import random, seed
from sys import argv
from time import time
seed(1)
n = int(argv[1])
xs = [random() for _ in range(n)]
begin = time()
median(xs)
end = time()
print(end - begin)

# bench.py
import pyperf
from statistics import median
from random import random, seed
runner = pyperf.Runner()
seed(1)
for xs in ([random() for _ in range(10)],
[random() for _ in range(100)],
[random() for _ in range(1000)],
[random() for _ in range(100000)],
[random() for _ in range(1000000)],
):
n = len(xs)
runner.bench_func(str(n), median, xs)

diff --git a/Lib/statistics.py b/Lib/statistics.py
index 3d805cb073..2a91745476 100644
--- a/Lib/statistics.py
+++ b/Lib/statistics.py
@@ -337,15 +337,88 @@ def median(data):
4.0
"""
- data = sorted(data)
+ data = list(data)
n = len(data)
if n == 0:
raise StatisticsError("no median for empty data")
- if n % 2 == 1:
- return data[n // 2]
+ i, r = divmod(n, 2)
+ if r == 1:
+ return nthSmallest(data, i)
else:
- i = n // 2
- return (data[i - 1] + data[i]) / 2
+ return (nthSmallest(data, i - 1) + nthSmallest(data, i))/2
+
+
+def nthSmallest(xs, n):
+ idx = select(xs, 0, len(xs) - 1, n)
+ return xs[idx]
+
+def select(xs, left, right, n):
+ while True:
+ if left == right:
+ return left
+ pivotIndex = pivot(xs, left, right)
+ pivotIndex = partition(xs, left, right, pivotIndex, n)
+ if n == pivotIndex:
+ return n
+ elif n < pivotIndex:
+ right = pivotIndex - 1
+ else:
+ left = pivotIndex + 1
+
+
+def pivot(xs, left, right):
+ # for 5 or less elements just get median
+ if right - left < 5:
+ return partition5(xs, left, right)
+ # otherwise move the medians of five-element subgroups to the first n/5 positions
+ for i in range(left, right + 1, 5):
+ # get the median position of the i'th five-element subgroup
+ subRight = i + 4
+ if subRight > right:
+ subRight = right
+ median5 = partition5(xs, i, subRight)
+ i, j = median5, left + ((i - left) // 5)
+ xs[i], xs[j] = xs[j], xs[i]
+
+ # compute the median of the n/5 medians-of-five
+ mid = ((right - left) // 10) + left + 1
+ return select(xs, left, left + ((right - left) // 5), mid)
+
+
+def partition(xs, left, right, pivotIndex, n):
+ pivotValue = xs[pivotIndex]
+ xs[pivotIndex], xs[right] = xs[right], xs[pivotIndex]
+ storeIndex = left
+ # Move all elements smaller than the pivot to the left of the pivot
+ for i in range(left, right):
+ if xs[i] < pivotValue:
+ xs[storeIndex], xs[i] = xs[i], xs[storeIndex]
+ storeIndex += 1
+ # Move all elements equal to the pivot right after
+ # the smaller elements
+ storeIndexEq = storeIndex
+ for i in range(storeIndex, right):
+ if xs[i] == pivotValue:
+ xs[storeIndexEq], xs[i] = xs[i], xs[storeIndexEq]
+ storeIndexEq += 1
+ xs[right], xs[storeIndexEq] = xs[storeIndexEq], xs[right]
+ # Return location of pivot considering the desired location n
+ if n < storeIndex:
+ return storeIndex # n is in the group of smaller elements
+ if n <= storeIndexEq:
+ return n # n is in the group equal to pivot
+ return storeIndexEq # n is in the group of larger elements
+
+
+def partition5(xs, left, right):
+ i = left + 1
+ while i <= right:
+ j = i
+ while j > left and xs[j-1] > xs[j]:
+ xs[j-1], xs[j] = xs[j], xs[j-1]
+ j -= 1
+ i += 1
+ return left + (right - left) // 2
def median_low(data): |
Ya, worst-case linear is a major pain. If you're inclined to do anything (I'm not), look at |
Median can be computed in time O(n log n) without sorting using the select-k algorithm.
https://github.com/python/cpython/blob/169cdfefce83fabcea48d0ba24ca4dba210f41d0/Lib/statistics.py#L327C1-L348C43
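For context on "selection without sorting": the stdlib already ships a partial-selection primitive, `heapq.nsmallest`, which runs in O(n log k) time. A median built on it avoids a full sort, though for k ≈ n/2 it is still O(n log n) asymptotically, so it would not beat the current implementation. A minimal sketch (`median_nsmallest` is a hypothetical name, not part of `statistics`):

```python
import heapq

def median_nsmallest(data):
    """Median via partial selection with heapq.nsmallest.

    Runs in O(n log(n/2)) time and O(n) extra space; no full sort.
    """
    data = list(data)
    n = len(data)
    if n == 0:
        raise ValueError("no median for empty data")
    # The (n//2 + 1) smallest elements contain everything the median needs.
    smallest = heapq.nsmallest(n // 2 + 1, data)
    if n % 2 == 1:
        return smallest[-1]
    return (smallest[-2] + smallest[-1]) / 2
```

`heapq.nsmallest` is tuned for small k relative to n; at k ≈ n/2 it internally does work comparable to a sort, which is why the thread below focuses on selection algorithms with better asymptotics instead.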