UPD: anyway, it would be much interesting too for the case of the memory consumption less then or equal to O(sqrt N); or even O(K * N) where 0 < K << 1

# n-th order statistics of the immutable data

UPD: anyway, it would be much interesting too for the case of the memory consumption less then or equal to O(sqrt N); or even O(K * N) where 0 < K << 1

czarandyMarch 4 2009, 19:46:06 UTC 9 years ago

czarandyMarch 4 2009, 19:46:41 UTC 9 years ago

jtootfMarch 4 2009, 19:54:32 UTC 9 years ago

however, in any case it would be nice to get some clues for the proving the minimum time required

czarandyMarch 5 2009, 02:46:42 UTC 9 years ago

jtootfMarch 5 2009, 14:33:00 UTC 9 years ago

czarandyMarch 5 2009, 02:54:31 UTC 9 years ago

Initialize U = infinity, L = -infinity.

Initialize U_count = 0, L_count = 0.

Repeat:

1. Choose a random element x. Check whether it is less than the median or greater than the median by counting how many elements are smaller than it. This takes O(n).

2. If greater, set U = min(x, U). Set U_count to the number elements less than x.

3. If less, set L = max(x, L). Set L_count to the number of elements less than x.

4. If U_count - L_count < # of iterations done, quit the loop and search all U_count - L_count elements in the range [L, U] to see whether they are the median.

I would guess this runs in expected O(n log n) time or thereabouts. Maybe you can prove it.

chouyu_31March 5 2009, 03:52:18 UTC 9 years ago

As an aside, a different problem of starting with a sorted sequence and searching can be made more efficient using what is known as 'interpolation search'. It's like binary search, except that instead of always choosing the middle element, you choose an element that you estimate to be close to what you are searching for. Interpolation search for relatively evenly distributed data runs in O(log log n) per search.

A person might be able to prove that choosing an interpolation of O(log n) random elements from the in-range set is actually O(n log log n). If you only pick 1 element randomly from the in-range set, it is still randomized O(n log n) time with more or less the same analysis as quicksort. I can explain if you would like.

czarandyMarch 5 2009, 18:17:22 UTC 9 years ago

chouyu_31March 5 2009, 19:53:30 UTC 9 years ago

czarandyMarch 5 2009, 18:27:33 UTC 9 years ago

Thus if you repeat the search step 2 sqrt(n) times you have a > 50% probability of finding a range around the median that is < 2 sqrt(n). Repeating this a few times will allow you to succeed w.h.p. in O(n^1.5) time.

chouyu_31March 5 2009, 19:51:18 UTC 9 years ago

To get it to run in O(n f(n)) time, it is sufficient to show you only need to take O(f(n)) passes over all of your input data. If you can reduce the number of values within the range L to U by roughly half each time, or merely prove that you do so with high probability, that is sufficient to get randomized O(n f(n)) time with high probability.

Quick select does this (it works like quicksort, only for selection rather than sorting), and because it creates new sequences to recurse on, reducing per-pass runtime, runs in randomized O(n) time.

I implemented the method I described in Python, for constant (3) and ceil(log2(n)) space. I implemented random, median, and interpolated "pivot" selection from the given space.

All of the variants have average numbers of passes (100 trials each) of ~2*log2(n) (sequences of 1,000 take ~ 20 passes, sequences of 10,000 take ~28 passes, and sequences of 100,000 take ~37 passes). Log space tends to do one or two fewer passes than constant space, with random/median/interpolation selection from the group not really mattering.

chouyu_31March 5 2009, 20:40:04 UTC 9 years ago

czarandyMarch 11 2009, 01:58:30 UTC 9 years ago

jtootfMarch 11 2009, 09:40:32 UTC 9 years ago