2008/07/13

Finding duplicates in array of integer numbers

Some time ago I spotted job offer (polish) posted by a friend of mine. Marek initiated a start up company and now is looking for a new member of his team. The offer is very interesting because it contains a kind of "applicant sieve". :)

There is a task to do for a person who wants to apply - algorithm implemented in any language which solves the following problem:

Having integer numbers array of size N which contains values from range 1 to N, find if there are any duplicated values.

The problem is quite interesting because there are many possible solutions of different properties. Though I am not looking for a new job, I tried to solve the problem. Simple solutions are quite obvious, others which could be called "optimal" become much more complicated. When taking some restrictions into account like O(n) computational complexity, O(1) memory consumption and so on, the solution becomes strightforward. I wonder if it is possible to write an algorithm which will take all restrictions into account and will treat the input array as read-only structure. I tried to implement it, but I failed. :(

In order to achieve the goal I started with testing code. Having some different algorithms I decided to compare their "real life" effectiveness. As you probably know microbenchmarking is loosely connected with real life. :) Anyway writhing them is a good fun. I decided to put everything as open source project called finding-duplicates. I hope Marek won't be angry - the project is spoiling his offer. :) The task was posted some time ago, I hope he has employed someone already. I am quite surprised by the reactions this offer has caused when mentioned on Jacek Laskowski's blog. This is the reason I hope that some people could be interested in my project. I hope that all this buzz will sooner or later bring Marek more candidates. :)

If you have more "optimal" solution then one collected here, just send it to me, or better drop me a line and I will make you a member of this project. I wonder if I have collected all the possible categories of solutions. Special thantks to Adam Woźniak for his implementation. :)

With one additional assumption - array elements are less then N - it would be possible to use algorithm described here, and here. Probably it could be somehow adapted to Marek's problem. For example by adding "virtual" last element to the array. It should has the value less then N (it could be even random number) if we find duplicate and its value is the same as the last element value, then one more pass through the array would be required. Any ideas?

3 comments:

  1. Przyznaję, że dopiero teraz zorientowałem się nad wyrafinowaniem zadania i będę zmuszony do zaproponowania swojego rozwiązania zadania, aby spróbować (!) obronić moją tezę o łatwości implementacji zadania w 15 minut. Początkowo pomyliłem się w stwierdzeniu, że "w miejscu" oznacza "ze stałą pamięcią". I dopiero po przeczytaniu Twojego wpisu zorientowałem się, dlaczego tak istotne było założenie o tablicy wyłącznie do odczytu. Dodatkowo przyznaję, że założyłem po cichu, że w zadaniu owe N wejściowe było dowolne acz ustalone i że rozmiar tablicy faktycznie był różny od zakresu liczb w niej występującej. Założyłem po prostu, że utworzę pomocniczą tablicę stałego rozmiaru (równego liczbie możliwych wartości) i pierwszy obieg po tablicy wejściowej wypełni mi wartości w tablicy pomocniczej, a drugi obieg stwierdzi, gdzie są powtórzenia (wartość > 1). Mam zagwarantowaną stałą pamięć i liniowość algorytmu. Widzę dopiero teraz jak bardzo nagiąłem wymagania. W co ja się wrobiłem?! Nie pozostaje nic innego jak zabrać się za zadanie systematycznie, śledząc Twoje systematyczne podejście ;-)

    Cieszę się, że tego typu zadanie sprawiło tyle zabawy czytelnikom. Jest to chyba najbardziej komentowany wpis na moim blogu od przeszło 2 lat (!)

    Jacek
    Notatnik Projektanta Java EE

    ReplyDelete
  2. @Jacek

    moze zrob taki staly 'kacik' z zadaniami ;)
    Przyczynisz sie do rozwoju intelektualnego rzeszy mlodych programistow i rozruszasz seniorow ;)

    ReplyDelete
  3. Well, the most "hacker" way is to count sum of the numbers (and we know what sum should be in case the numbers are uniqe). Then, if there is only one repetition, we can tell if it really exists and what number is repeated. But it works only for numbers and under special circumstanses... My favourite way is to build hash where keys are numbers (or strings - it's universal in Perl) and number of apperance is value. It's pretty universal (works for numbers, strings, initial table doesn't have to be sorted, there can be more than 1 repetition and we can tell how many times given value is repeated), pretty fast (just one iteration on table and one on hash). But it requires pretty much memory.
    And the fastest way (but pretty slow!) to write this check is print the table on STDOUT then pipe it through sort | uniq -c ;)

    ReplyDelete