Just the One of the Few (jtootf) wrote in algorithms,

n-th order statistic of immutable data

what's the best asymptotic time complexity for an n-th order statistic algorithm on an immutable array of size N with memory consumption less than or equal to O(log N)? the median of medians algorithm gives O(N) time complexity in the worst case, but I'm not sure about its memory bounds for a r/o array

UPD: anyway, it would also be quite interesting for the case of memory consumption less than or equal to O(sqrt N); or even O(K * N) where 0 < K << 1

If your array is immutable and you can only have O(log n) space I doubt you can do this faster than O(n^2).
(I am referring to the obvious O(n^2) time, O(1) space algorithm for selection)
well, I have more than O(1) space after all: I have the whole O(log N) :) just wondering if I can get any better time than O(n^2) with this additional space

however, in any case it would be nice to get some clues for proving the minimum time required
Actually, the obvious algorithm (for each element, count how many are smaller than it), takes O(log n) space.
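For concreteness, that counting approach might be sketched like this (`select_by_counting` is just an illustrative name; assuming distinct values so ranks are unambiguous):

```python
def select_by_counting(a, k):
    """Return the k-th smallest (0-based) element of a read-only sequence.

    For each element, count how many others are smaller: O(n^2) time,
    and only a couple of counters, i.e. O(log n) bits of space.
    """
    for x in a:
        if sum(1 for v in a if v < x) == k:
            return x
```

The array is never modified and nothing beyond a few integers is stored, which is what makes it work under the read-only constraint.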

I see
Here is a reasonable randomized algorithm. Let's say you want the median.

Initialize U = infinity, L = -infinity.
Initialize U_count = 0, L_count = 0.
1. Choose a random element x. Check whether it is less than the median or greater than the median by counting how many elements are smaller than it. This takes O(n).
2. If greater, set U = min(x, U). Set U_count to the number of elements less than x.
3. If less, set L = max(x, L). Set L_count to the number of elements less than x.
4. If U_count - L_count < # of iterations done, quit the loop and search all U_count - L_count elements in the range [L, U] to see whether they are the median.

I would guess this runs in expected O(n log n) time or thereabouts. Maybe you can prove it.
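A simplified variant of the loop above could be sketched as follows (it stops as soon as a pivot's rank hits the target, rather than scanning the final narrow range; `median_readonly` is an illustrative name, and distinct values are assumed so the median is unique):

```python
import random

def median_readonly(a):
    """Randomized median selection over a read-only sequence, O(1) extra space.

    Repeatedly pick a random pivot inside the open interval (L, U), count its
    rank in one O(n) pass, and narrow the interval toward the median.
    """
    n = len(a)
    k = (n - 1) // 2                      # rank of the (lower) median
    L, U = float("-inf"), float("inf")
    while True:
        x = random.choice(a)
        if not (L < x < U):               # pivot must lie inside the interval
            continue
        rank = sum(1 for v in a if v < x)  # O(n) counting pass
        if rank == k:
            return x
        if rank > k:
            U = x
        else:
            L = x
```

Each pass costs O(n), and the number of passes is what the expected O(n log n) guess above is about.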
The problem is that your random choice doesn't necessarily give you a value within the range [L,U]. Since you are only using O(1) memory in the above, you can keep a random set of O(log n) entries within the range [L,U] during your counting scan, from which you can choose the median or even some interpolated entry, depending on L_count/U_count (even if you sort your O(log n) entries using in-place heapsort, you still only use O(log n * log log n) time for the sort, which is easily eclipsed by the O(n) counting step).

As an aside, a different problem of starting with a sorted sequence and searching can be made more efficient using what is known as 'interpolation search'. It's like binary search, except that instead of always choosing the middle element, you choose an element that you estimate to be close to what you are searching for. Interpolation search for relatively evenly distributed data runs in O(log log n) per search.
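A minimal sketch of interpolation search might look like this (standard technique, not code from this thread):

```python
def interpolation_search(a, target):
    """Search a sorted list, probing where the target 'should' be.

    Instead of the middle element, probe the position estimated by linear
    interpolation between a[lo] and a[hi]; ~O(log log n) expected probes
    on roughly uniformly distributed data. Returns the index, or -1.
    """
    lo, hi = 0, len(a) - 1
    while lo <= hi and a[lo] <= target <= a[hi]:
        if a[hi] == a[lo]:
            mid = lo                       # avoid division by zero
        else:
            mid = lo + (target - a[lo]) * (hi - lo) // (a[hi] - a[lo])
        if a[mid] == target:
            return mid
        if a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```

On adversarially skewed data it degrades toward O(n) probes, which is why the "relatively evenly distributed" caveat matters.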

A person might be able to prove that choosing an interpolation of O(log n) random elements from the in-range set is actually O(n log log n). If you only pick 1 element randomly from the in-range set, it is still randomized O(n log n) time with more or less the same analysis as quicksort. I can explain if you would like.
You still have to find the entries in [L, U]. In general you can just throw out any random choices that are > U or < L.
Of course you still have to find the entries (that you pick from) in [L,U], but you can do that while you are making your pass to count L|U_count.
To get this to run in O(n f(n)) time, you need to find two entries within f(n) of the median. The probability that you don't find one is around (1 - f(n)/n)^f(n). For f(n) = log n this goes to 1, so that will never work. For f(n) = sqrt(n) this goes to 1/e, so you are likely to find 1. If instead we do 2 sqrt(n) searches then the probability becomes 1/e^2. So the probability that both succeed is > 0.5.

Thus if you repeat the search step 2 sqrt(n) times you have a > 50% probability of finding a range around the median that is < 2 sqrt(n). Repeating this a few times will allow you to succeed w.h.p. in O(n^1.5) time.
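A quick numeric check of those estimates (the window size f(n) around the median and the pick counts are the assumptions stated above):

```python
import math

def miss_probability(n, f, trials):
    """Probability that `trials` uniform picks all miss a window of f
    values around the median: (1 - f/n) ** trials."""
    return (1.0 - f / n) ** trials

n = 10**6
# f(n) = log n with log n picks: the miss probability stays near 1, so it fails.
p_log = miss_probability(n, math.log(n), math.log(n))
# f(n) = sqrt(n) with sqrt(n) picks: tends to 1/e ~ 0.37.
p_sqrt = miss_probability(n, math.sqrt(n), math.sqrt(n))
# doubling to 2*sqrt(n) picks squares it: ~ 1/e^2 ~ 0.135.
p_2sqrt = miss_probability(n, math.sqrt(n), 2 * math.sqrt(n))
```

This matches the claim: with 2 sqrt(n) probes each side misses with probability about 1/e^2, so both sides succeed with probability comfortably above 0.5.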

To get it to run in O(n f(n)) time, it is sufficient to show you only need to take O(f(n)) passes over all of your input data. If you can reduce the number of values within the range L to U by roughly half each time, or merely prove that you do so with high probability, that is sufficient to get randomized O(n f(n)) time with high probability.

Quick select does this (it works like quicksort, only for selection rather than sorting), and because it creates new sequences to recurse on, reducing per-pass runtime, runs in randomized O(n) time.
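For comparison, classic quickselect looks like this (note it builds new sub-sequences, so it uses far more than O(log n) space and doesn't satisfy the original constraint):

```python
import random

def quickselect(seq, k):
    """Return the k-th smallest (0-based) element; expected O(n) time.

    Each level partitions into fresh lists, so per-pass work shrinks
    geometrically, but the extra space is O(n) in the worst case.
    """
    pivot = random.choice(seq)
    lower = [v for v in seq if v < pivot]
    equal = [v for v in seq if v == pivot]
    if k < len(lower):
        return quickselect(lower, k)
    if k < len(lower) + len(equal):
        return pivot
    higher = [v for v in seq if v > pivot]
    return quickselect(higher, k - len(lower) - len(equal))
```

The read-only variants discussed above trade that space away by re-scanning the full input every pass instead of recursing on smaller copies.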

I implemented the method I described in Python, for constant (3) and ceil(log2(n)) space. I implemented random, median, and interpolated "pivot" selection from the given space.

All of the variants have average numbers of passes (100 trials each) of ~2*log2(n) (sequences of 1,000 take ~ 20 passes, sequences of 10,000 take ~28 passes, and sequences of 100,000 take ~37 passes). Log space tends to do one or two fewer passes than constant space, with random/median/interpolation selection from the group not really mattering.
I noticed a bug in my code. Turns out I was always randomly picking rather than using median or interpolation. I can't seem to get interpolation to work very well, so I've punted on it. But random and median work, median working a bit better than random. The code is as follows.

import random
import math

def random_select_limited_space(sequence, k, space_fcn, choice_fcn):
    # select the k-th smallest (0-based) element of a read-only sequence,
    # keeping only a bounded reservoir of in-range candidates for pivoting
    ls = len(sequence)
    space_used = space_fcn(ls)
    L = -2**64  # exclusive lower bound of the remaining range
    H = 2**64   # exclusive upper bound of the remaining range
    current = random.choice(sequence)
    to_use = []
    passes = 0
    while 1:
        passes += 1
        L_count = H_count = in_range = 0
        for val in sequence:
            if val < current:
                L_count += 1
            elif val > current:
                H_count += 1
            if L < val < H:
                # reservoir sampling: keep a uniform sample of in-range elements
                in_range += 1
                if len(to_use) < space_used:
                    to_use.append(val)
                else:
                    posn = random.randrange(in_range)
                    if posn < space_used:
                        to_use[posn] = val
        E_count = ls - L_count - H_count  # elements equal to the pivot
        if L_count > k:
            H = min(H, current)
        elif L_count + E_count > k:
            return passes, current
        else:
            L = max(L, current)
        # prefer sampled candidates that still fall inside the narrowed range
        candidates = [v for v in to_use if L < v < H] or to_use
        current = choice_fcn(candidates, k, ls - L_count)
        del to_use[:]

def log(n):
    return int(math.ceil(math.log(n) / math.log(2)))

def con(n):
    return 3

def random_(seq, k, of):
    return random.choice(seq)

def median_(seq, k, of):
    return sorted(seq)[len(seq) // 2]

def interp_(seq, k, of):
    # interpolate a position from the target rank; never worked well, punted
    s = sorted(seq)
    return s[min(len(s) - 1, max(0, int(float(k) / of * len(s))))]

if __name__ == '__main__':
    for total_size in (1000, 10000, 100000):
        seq = [random.randrange(2**30) for i in range(total_size)]
        check = sorted(seq)
        for space_fcn in (con, log):
            indexes = [random.randrange(total_size) for i in range(100)]
            for interp in (random_, median_):  # , interp_):
                pp = []
                for index in indexes:
                    passes, value = random_select_limited_space(seq, index, space_fcn, interp)
                    if value != check[index]:
                        print("FAILED:", value, check[index], interp.__name__, space_fcn.__name__)
                        raise Exception
                    pp.append(passes)
                print(interp.__name__, space_fcn.__name__, total_size, float(sum(pp)) / len(pp))
This is sort of old now, but it's possible to do this in O(n log n) time deterministically using O(log^2 n) storage.
can you describe the algorithm? or point me to a description?