Friday, March 04, 2005

Finding Duplicates

Here is an interesting problem given by Muthukrishnan during his talk in the New Horizons workshop.

Start with an array A of n+1 entries each consisting of an integer between 1 and n. By the pigeonhole principle there must be some i≠j and a w such that A(i)=A(j)=w. The goal is to find w. Depending on A there may be several such w, we want to find any one of them. The catch is that you only get to use O(log n) bits of memory.

First a warm-up puzzle: Find w using only O(n) queries to entries of A (remember you only get O(log n) space). Hint: Use pointer chasing.

Now suppose A is streamed, that is we get in order A(1),A(2),…,A(n+1) and then get another pass etc. How many passes do you need to find a w?

You can find w in n+1 passes just by trying each possible value for w. With a little work you can use O(log n) passes doing a binary search.

Muthukrishnan asks whether the number of passes needed is Ω(log n) or O(1) or something in between.

11 comments:

  1. "First a warm-up puzzle: Find w using only O(n) queries to entries of A (remember you only get O(log n) space). Hint: Use pointer chasing."
    Ans: create a linked list kind of strucutre where there are n+1 nodes,
    a) value at node i, v[i]=A[i]
    b) pointer from node i to node
    v[i].
    check for loops in the linked list by using pointer chasing, starting from node n+1.

    ReplyDelete
  2. "Create a linked list kind of structuure where there are n+1 nodes" sounds like O(n) space, not O(log n).

    ReplyDelete
  3. http://domino.research.ibm.com/Comm/wwwr_ponder.nsf/challenges/January2004.html

    ReplyDelete
  4. Just for fun, here's a C++ function that seems to work, using a binary search. Uses constant amount of space. Sorry for the formatting...

    unsigned findDup(const std::vector<unsigned> &vec, bool diagnostics)
    {
    unsigned lower = 1, upper = vec.size()-1;

    while(true)
    {
    unsigned count = 0, middle = (upper+lower)/2;

    for(unsigned i=0; i<vec.size(); i++)
    if(lower <= vec[i] && vec[i] <= middle) count++;

    if(count > middle-lower+1) upper = middle;
    else lower = middle+1;

    if(lower >= upper)
    return upper;
    }
    }

    ReplyDelete
  5. binary search takes more than O(n) queries.

    ReplyDelete
  6. As an aside, note that if exactly one element is duplicated it can be done in one pass using O(log n) space: simply add all the numbers, substract n(n+1)/2 and that is the duplicated number.

    Alex

    ReplyDelete
  7. > "binary search takes more than O(n) queries."

    Yes, but Lance also asked: "Now suppose A is streamed, that is we get in order A(1),A(2),�,A(n+1) and then get another pass etc. How many passes do you need to find a w?"

    The binary search method solved this variant of the problem, unless I'm mistaken.

    ReplyDelete
  8. Regarding the binary search algorithm, I think the statement "Uses constant amount of space" is wrong...

    The variable "upper" needs to be able to represent N, which requires log N bits. So even the binary search method requires log N space.

    ReplyDelete
  9. I think the binary search would work as follows:

    binsearch(a , b)
    begin
    -> choose mid = (a + b)/2
    -> count no. of elements <, = and > mid, say n1, n2 and n3
    -> if n2 >= 2 end
    -> else if n1 >= n3 binsearch(a , mid -1)
    -> else binsearch(mid + 1, b)
    end

    Now, just call binsearch(1 , n).

    We need to maintain the values of n1, n2 and n3 in each pass and the values of a and b all the time. All of these require O(log n) space.

    ReplyDelete
  10. The following is a partial answer to the problem above.
    [ I was attending the New Horizons workshop in Kyoto, Japan last week; after Muthu's talk I was
    discussing the following with Muthu. Jun Tarui (tarui+at-mark+ice.uec.ac.jp)]

    Finding duplicates in the stream setting above requires Omega(log n/ loglogn) passes.

    Proof Sketch:

    Assume n is odd.
    Consider a restricted class of inputs such that A(1), A(2), ..., A((n+1)/2) are distinct and
    A((n+1)/2 + 1), ..., A(n+1) are distinct.
    A solution for this class of inputs with s bits of memory and r stream-passes
    implies a two-party protocol for the following communication game
    with r rounds and s communication bits in each round.

    Alice and Bob get size-(n+1)/2 subsets A and B of {1,...,n} respectively
    and the task is to find and agree
    on some w that is in the intersection of A and B.

    This task is equivalent to the monotone Karchmer-Wigderson game for the Majority
    funtion of n variables (Alice and Bob get a max term and a min term).
    A protocol with r rounds and s bits per round for this game correnponds to
    a monotone (i.e. AND/OR) circuit computing the Majority function
    with depth at most r and fan-in at most 2^s.
    Hastad's size lower bound for such circuits imply the claimed bound on the number of passes
    above for s=O(log n).

    When n is even, consider the complexity of any function that outputs 1 if the number of 1's
    is more than n/2 and outputs 0 if it is less than n/2. (end-of-proof-sketch)

    ReplyDelete
  11. Anonymous,

    I think the binary search method you posted will fail on the input:

    1, 2, 3, 4, 4

    After the first pass,
    mid = 3
    n1 = 2
    n2 = 1
    n3 = 2

    Since n1 >= n3 is true in this case, the algorithm will close in on the first part of the list, which does not contain the duplicate.

    See an earlier post for a binary search which I think does work.

    ReplyDelete