Computational Complexity

 


More Lance on Twitter

Creative Commons License
This work is licensed under a Creative Commons License.

Powered by Blogger™

Friday, March 04, 2005

 
Finding Duplicates

Posted by Lance

Here is an interesting problem given by Muthukrishnan during his talk in the New Horizons workshop.

Start with an array A of n+1 entries each consisting of an integer between 1 and n. By the pigeonhole principle there must be some i≠j and a w such that A(i)=A(j)=w. The goal is to find w. Depending on A there may be several such w, we want to find any one of them. The catch is that you only get to use O(log n) bits of memory.

First a warm-up puzzle: Find w using only O(n) queries to entries of A (remember you only get O(log n) space). Hint: Use pointer chasing.

Now suppose A is streamed, that is we get in order A(1),A(2),…,A(n+1) and then get another pass etc. How many passes do you need to find a w?

You can find w in n+1 passes just by trying each possible value for w. With a little work you can use O(log n) passes doing a binary search.

Muthukrishnan asks whether the number of passes needed is Ω(log n) or O(1) or something in between.

3:06 AM # 12 comments

  1. Anonymous Anonymous says:  
    "First a warm-up puzzle: Find w using only O(n) queries to entries of A (remember you only get O(log n) space). Hint: Use pointer chasing."
    Ans: create a linked list kind of strucutre where there are n+1 nodes,
    a) value at node i, v[i]=A[i]
    b) pointer from node i to node
    v[i].
    check for loops in the linked list by using pointer chasing, starting from node n+1.

  2. Anonymous Anonymous says:  
    "Create a linked list kind of structuure where there are n+1 nodes" sounds like O(n) space, not O(log n).

  3. Anonymous Anonymous says:  
    http://domino.research.ibm.com/Comm/wwwr_ponder.nsf/challenges/January2004.html

  4. Blogger Jonathan says:  
    Just for fun, here's a C++ function that seems to work, using a binary search. Uses constant amount of space. Sorry for the formatting...

    unsigned findDup(const std::vector<unsigned> &vec, bool diagnostics)
    {
    unsigned lower = 1, upper = vec.size()-1;

    while(true)
    {
    unsigned count = 0, middle = (upper+lower)/2;

    for(unsigned i=0; i<vec.size(); i++)
    if(lower <= vec[i] && vec[i] <= middle) count++;

    if(count > middle-lower+1) upper = middle;
    else lower = middle+1;

    if(lower >= upper)
    return upper;
    }
    }

  5. Anonymous Anonymous says:  
    binary search takes more than O(n) queries.

  6. Anonymous Anonymous says:  
    As an aside, note that if exactly one element is duplicated it can be done in one pass using O(log n) space: simply add all the numbers, substract n(n+1)/2 and that is the duplicated number.

    Alex

  7. Blogger AL says:  
    you can sum all numbers
    since
    sum 1..n = n(n-1)/2 you need around
    2 log n = O(log n) bits
    there the correct answer is
    sum - n(n-1)/2
    Definitly O (log n) bits is enough

    in case of stream one pass is enough
    just sum numbers again :)

    just a joke

    it would be interestig to solve a problem in general case
    assume k numbers out of n are duplicated
    So by suming you get sum
    a_1 + \cdot a_k = sum - n(n-1)/2

    it may happen that you need small amount of memory/computations to find out which exactly numbers are duplivated if you know sum


    just for funny

  8. Blogger Jonathan says:  
    > "binary search takes more than O(n) queries."

    Yes, but Lance also asked: "Now suppose A is streamed, that is we get in order A(1),A(2),�,A(n+1) and then get another pass etc. How many passes do you need to find a w?"

    The binary search method solved this variant of the problem, unless I'm mistaken.

  9. Blogger Jonathan says:  
    Regarding the binary search algorithm, I think the statement "Uses constant amount of space" is wrong...

    The variable "upper" needs to be able to represent N, which requires log N bits. So even the binary search method requires log N space.

  10. Anonymous Anonymous says:  
    I think the binary search would work as follows:

    binsearch(a , b)
    begin
    -> choose mid = (a + b)/2
    -> count no. of elements <, = and > mid, say n1, n2 and n3
    -> if n2 >= 2 end
    -> else if n1 >= n3 binsearch(a , mid -1)
    -> else binsearch(mid + 1, b)
    end

    Now, just call binsearch(1 , n).

    We need to maintain the values of n1, n2 and n3 in each pass and the values of a and b all the time. All of these require O(log n) space.

  11. Anonymous Jun Tarui says:  
    The following is a partial answer to the problem above.
    [ I was attending the New Horizons workshop in Kyoto, Japan last week; after Muthu's talk I was
    discussing the following with Muthu. Jun Tarui (tarui+at-mark+ice.uec.ac.jp)]

    Finding duplicates in the stream setting above requires Omega(log n/ loglogn) passes.

    Proof Sketch:

    Assume n is odd.
    Consider a restricted class of inputs such that A(1), A(2), ..., A((n+1)/2) are distinct and
    A((n+1)/2 + 1), ..., A(n+1) are distinct.
    A solution for this class of inputs with s bits of memory and r stream-passes
    implies a two-party protocol for the following communication game
    with r rounds and s communication bits in each round.

    Alice and Bob get size-(n+1)/2 subsets A and B of {1,...,n} respectively
    and the task is to find and agree
    on some w that is in the intersection of A and B.

    This task is equivalent to the monotone Karchmer-Wigderson game for the Majority
    funtion of n variables (Alice and Bob get a max term and a min term).
    A protocol with r rounds and s bits per round for this game correnponds to
    a monotone (i.e. AND/OR) circuit computing the Majority function
    with depth at most r and fan-in at most 2^s.
    Hastad's size lower bound for such circuits imply the claimed bound on the number of passes
    above for s=O(log n).

    When n is even, consider the complexity of any function that outputs 1 if the number of 1's
    is more than n/2 and outputs 0 if it is less than n/2. (end-of-proof-sketch)

  12. Blogger Jonathan says:  
    Anonymous,

    I think the binary search method you posted will fail on the input:

    1, 2, 3, 4, 4

    After the first pass,
    mid = 3
    n1 = 2
    n2 = 1
    n3 = 2

    Since n1 >= n3 is true in this case, the algorithm will close in on the first part of the list, which does not contain the duplicate.

    See an earlier post for a binary search which I think does work.

Leave a Comment

Comment Feeds: This Post All

Links to this post:

Create a Link

Weblog Home

Archives

<