Additional space complexity isn't always a useful metric

6LawrenceC

2JBlack

2Brendan Long

New Comment

I don't think that anyone is claiming that it's a particularly *useful* metric, just that it's an interesting constraint for programming puzzles.

There are sometimes corresponding constraints in reality, such as systems with comparatively low amounts of RAM where data can be requested (e.g. embedded systems with I/O, servers with network APIs, etc) but data can not or should not be written via those channels. It doesn't make them common, but it is useful to *be able* to devise algorithms under these or other constraints.

I agree that the math puzzle is interesting.

I'm still skeptical that this algorithm is useful any real-world situation, although I was hoping I might get comments with counter-examples. Even in the examples you gave, you already have another machine that clearly has far more memory than you need to implement the set algorithm but for some reason you have to write this algorithm to run on a toaster and talk to your dramatically more powerful server over the network? I'm not saying it's impossible, but I hope you can see why I'm skeptical.

I've been subscribed to Interview Cake for years, and today they had a really interesting question: Given a list of

`n + 1`

integers in the range`1...n`

, find one of the duplicates (there is guaranteed to be at least one) in O(n) time and O(1)additionalspace. The answer is really interesting[1], and I recommend trying it, but I don't think it makes sense to care about additional space rather than total space, and I still think using a set to keep track of numbers (the most obvious solution) is the best solution in practice.The reason I thought this was worth writing about is that the "O(1)" linked list algorithm can't be streamed, so it doesn't make any sense to treat the memory the input uses and the additional memory as fundamentally different (they both need to be allocated on the same machine). Since the algorithm requires that you hold the input in memory, the only sane way to treat the space complexity to consider the total, i.e. the linked list solution is O(n) for the input + O(1) for the rest of the algorithm, which is still O(n).

In this specific case, the constant factors are also huge. The numbers are guaranteed to be densely packed and in the range

`1...n`

, so our set can be a bit array of length`n`

(i.e. to mark "2" as being in our set, we set bit 2 of our array). This means that our additional space is actually going to be smaller than our input by a factor of the size of our integer type (i.e. with 32-bit integers, the input will be 32x larger than our bit array). With any reasonable integer size, the size of the input is going to be massively larger than the additional memory for our set.In a previous post on a similar question, I pointed out that a hash table with a million 32-bit entries will only use 8 MB of memory, but this case is even better. If you want to find the duplicate in a list of a million 32-bit integers, your list will take 4 MB of memory but the set of numbers you've seen will only take 125 KB of memory.

Finally, remember how I said that the linked list solution can't be streamed? Well, the set solution

can, so taking that and the bit array optimization into account, the set solution can use substantially less memory in-practice because it only has to store the extremely-efficient bit set and doesn't have to hold the entire input in memory!For reference, here's a Python solution that uses substantially less memory than the linked-list solution when streaming the input: