2008/07/13

Finding duplicates in array of integer numbers

Some time ago I spotted job offer (polish) posted by a friend of mine. Marek initiated a start up company and now is looking for a new member of his team. The offer is very interesting because it contains a kind of "applicant sieve". :)

There is a task to do for a person who wants to apply - algorithm implemented in any language which solves the following problem:

Having integer numbers array of size N which contains values from range 1 to N, find if there are any duplicated values.

The problem is quite interesting because there are many possible solutions of different properties. Though I am not looking for a new job, I tried to solve the problem. Simple solutions are quite obvious, others which could be called "optimal" become much more complicated. When taking some restrictions into account like O(n) computational complexity, O(1) memory consumption and so on, the solution becomes strightforward. I wonder if it is possible to write an algorithm which will take all restrictions into account and will treat the input array as read-only structure. I tried to implement it, but I failed. :(

In order to achieve the goal I started with testing code. Having some different algorithms I decided to compare their "real life" effectiveness. As you probably know microbenchmarking is loosely connected with real life. :) Anyway writhing them is a good fun. I decided to put everything as open source project called finding-duplicates. I hope Marek won't be angry - the project is spoiling his offer. :) The task was posted some time ago, I hope he has employed someone already. I am quite surprised by the reactions this offer has caused when mentioned on Jacek Laskowski's blog. This is the reason I hope that some people could be interested in my project. I hope that all this buzz will sooner or later bring Marek more candidates. :)

If you have more "optimal" solution then one collected here, just send it to me, or better drop me a line and I will make you a member of this project. I wonder if I have collected all the possible categories of solutions. Special thantks to Adam Woźniak for his implementation. :)

With one additional assumption - array elements are less then N - it would be possible to use algorithm described here, and here. Probably it could be somehow adapted to Marek's problem. For example by adding "virtual" last element to the array. It should has the value less then N (it could be even random number) if we find duplicate and its value is the same as the last element value, then one more pass through the array would be required. Any ideas?