Previously in seriesLawful Uncertainty

You may have noticed a certain trend in recent posts:  I've been arguing that randomness hath no power, that there is no beauty in entropy, nor yet strength from noise.

If one were to formalize the argument, it would probably run something like this: that if you define optimization as previously suggested, then sheer randomness will generate something that seems to have 12 bits of optimization, only by trying 4096 times; or 100 bits of optimization, only by trying 1030 times.

This may not sound like a profound insight, since it is true by definition.  But consider - how many comic books talk about "mutation" as if it were a source of power?  Mutation is random.  It's the selection part, not the mutation part, that explains the trends of evolution.

Or you may have heard people talking about "emergence" as if it could explain complex, functional orders.  People will say that the function of an ant colony emerges - as if, starting from ants that had been selected only to function as solitary individuals, the ants got together in a group for the first time and the ant colony popped right out.  But ant colonies have been selected on as colonies by evolution.  Optimization didn't just magically happen when the ants came together.

And you may have heard that certain algorithms in Artificial Intelligence work better when we inject randomness into them.

Is that even possible?  How can you extract useful work from entropy?

But it is possible in theory, since you can have things that are anti-optimized.  Say, the average state has utility -10, but the current state has an unusually low utility of -100.  So in this case, a random jump has an expected benefit.  If you happen to be standing in the middle of a lava pit, running around at random is better than staying in the same place.  (Not best, but better.)

A given AI algorithm can do better when randomness is injected, provided that some step of the unrandomized algorithm is doing worse than random.

Imagine that we're trying to solve a pushbutton combination lock with 20 numbers and four steps - 160,000 possible combinations.  And we try the following algorithm for opening it:

  1. Enter 0-0-0-0 into the lock.
  2. If the lock opens, return with SUCCESS.
  3. If the lock remains closed, go to step 1.

Obviously we can improve this algorithm by substituting "Enter a random combination" on the first step.

If we were to try and explain in words why this works, a description might go something like this:  "When we first try 0-0-0-0 it has the same chance of working (so far as we know) as any other combination.  But if it doesn't work, it would be stupid to try it again, because now we know that 0-0-0-0 doesn't work."

The first key idea is that, after trying 0-0-0-0, we learn something - we acquire new knowledge, which should then affect how we plan to continue from there.  This is knowledge, quite a different thing from randomness...

What exactly have we learned?  We've learned that 0-0-0-0 doesn't work; or to put it another way, given that 0-0-0-0 failed on the first try, the conditional probability of it working on the second try, is negligible.

Consider your probability distribution over all the possible combinations:  Your probability distribution starts out in a state of maximum entropy, with all 160,000 combinations having a 1/160,000 probability of working.  After you try 0-0-0-0, you have a new probability distribution, which has slightly less entropy; 0-0-0-0 has an infinitesimal probability of working, and the remaining 159,999 possibilities each have a 1/159,999 probability of working.  To try 0-0-0-0 again would now be stupid - the expected utility of trying 0-0-0-0 is less than average; the vast majority of potential actions now have higher expected utility than does 0-0-0-0.  An algorithm that tries 0-0-0-0 again would do worse than random, and we can improve the algorithm by randomizing it.

One may also consider an algorithm as a sequence of tries:  The "unrandomized algorithm" describes the sequence of tries 0-0-0-0, 0-0-0-0, 0-0-0-0... and this sequence of tries is a special sequence that has far-below-average expected utility in the space of all possible sequences.  Thus we can improve on this sequence by selecting a random sequence instead.

Or imagine that the combination changes every second.  In this case, 0-0-0-0, 0-0-0-0 is just as good as the randomized algorithm - no better and no worse.  What this shows you is that the supposedly "random" algorithm is "better" relative to a known regularity of the lock - that the combination is constant on each try.  Or to be precise, the reason the random algorithm does predictably better than the stupid one is that the stupid algorithm is "stupid" relative to a known regularity of the lock.

In other words, in order to say that the random algorithm is an "improvement", we must have used specific knowledge about the lock to realize that the unrandomized algorithm is worse-than-average.  Having realized this, can we reflect further on our information, and take full advantage of our knowledge to do better-than-average?

The random lockpicker is still not optimal - it does not take full advantage of the knowledge we have acquired.  A random algorithm might randomly try 0-0-0-0 again; it's not impossible, but it could happen.  The longer the random algorithm runs, the more likely it is to try the same combination twice; and if the random algorithm is sufficiently unlucky, it might still fail to solve the lock after millions of tries.  We can take full advantage of all our knowledge by using an algorithm that systematically tries 0-0-0-0, 0-0-0-1, 0-0-0-2...  This algorithm is guaranteed not to repeat itself, and will find the solution in bounded time.  Considering the algorithm as a sequence of tries, no other sequence in sequence-space is expected to do better, given our initial knowledge.  (Any other nonrepeating sequence is equally good; but nonrepeating sequences are rare in the space of all possible sequences.)

A combination dial often has a tolerance of 2 in either direction.  20-45-35 will open a lock set to 22-44-33.  In this case, the algorithm that tries 0-1-0, 0-2-0, et cetera, ends up being stupid again; a randomized algorithm will (usually) work better.  But an algorithm that tries 0-5-0, 0-10-0, 0-10-5, will work better still.

Sometimes it is too expensive to take advantage of all the knowledge that we could, in theory, acquire from previous tests.  Moreover, a complete enumeration or interval-skipping algorithm would still end up being stupid.  In this case, computer scientists often use a cheap pseudo-random algorithm, because the computational cost of using our knowledge exceeds the benefit to be gained from using it.  This does not show the power of randomness, but, rather, the predictable stupidity of certain specific deterministic algorithms on that particular problem.  Remember, the pseudo-random algorithm is also deterministic!  But the deterministic pseudo-random algorithm doesn't belong to the class of algorithms that are predictably stupid (do much worse than average).

There are also subtler ways for adding noise to improve algorithms.  For example, there are neural network training algorithms that work better if you simulate noise in the neurons.  On this occasion it is especially tempting to say something like:

"Lo!  When we make our artificial neurons noisy, just like biological neurons, they work better!  Behold the healing life-force of entropy!"

What might actually be happening - for example - is that the network training algorithm, operating on noiseless neurons, would vastly overfit the data.

If you expose the noiseless network to the series of coinflips "HTTTHHTTH"... the training algorithm will say the equivalent of, "I bet this coin was specially designed to produce HTTTHHTTH every time it's flipped!" instead of "This coin probably alternates randomly between heads and tails."  A hypothesis overfitted to the data does not generalize.  On the other hand, when we add noise to the neurons and then try training them again, they can no longer fit the data precisely, so instead they settle into a simpler hypothesis like "This coin alternates randomly between heads and tails."  Note that this requires us - the programmers - to know in advance that probabilistic hypotheses are more likely to be true.

Or here's another way of looking at it:  A neural network algorithm typically looks at a set of training instances, tweaks the units and their connections based on the training instances, and in this way tries to "stamp" the experience onto itself.  But the algorithms which do the stamping are often poorly understood, and it is possible to stamp too hard.  If this mistake has already been made, then blurring the sensory information, or blurring the training algorithm, or blurring the units, can partially cancel out the "overlearning".

Here's a simplified example of a similar (but not identical) case:  Imagine that the environment deals us a random mix of cards, 70% blue and 30% red.  But instead of just predicting "blue" or "red", we have to assign a quantitative probability to seeing blue - and the scoring rule for our performance is one that elicits an honest estimate; if the actual frequency is 70% blue cards, we do best by replying "70%", not 60% or 80%.  ("Proper scoring rule.")

If you don't know in advance the frequency of blue and red cards, one way to handle the problem would be to have a blue unit and a red unit, both wired to an output unit.  The blue unit sends signals with a positive effect that make the target unit more likely to fire; the red unit inhibits its targets - just like the excitatory and inhibitory synapses in the human brain!  (Or an earthworm's brain, for that matter...)

Each time we see a blue card in the training data, the training algorithm increases the strength of the (excitatory) synapse from the blue unit to the output unit; and each time we see a red card, the training algorithm strengthens the (inhibitory) synapse from the red unit to the output unit.

But wait!  We haven't said why the blue or red units might fire in the first place.  So we'll add two more excitatory units that spike randomly, one connected to the blue unit and one connected to red unit.  This simulates the natural background noise present in the human brain (or an earthworm's brain).

Finally, the spiking frequency of the output unit becomes the predicted probability that the next card is blue.

As you can see - assuming you haven't lost track of all the complexity - this neural network learns to predict whether blue cards or red cards are more common in the mix.  Just like a human brain!

At least that's the theory.  However, when we boot up the neural network and give it a hundred training examples with 70 blue and 30 red cards, it ends up predicting a 90% probability that each card will be blue.  Now, there are several things that could go wrong with a system this complex; but my own first impulse would be to guess that the training algorithm is too strongly adjusting the synaptic weight from the blue or red unit to the output unit on each training instance.  The training algorithm needs to shift a little less far - alter the synaptic weights a little less.

But the programmer of the neural network comes up with a different hypothesis: the problem is that there's no noise in the input.  This is biologically unrealistic; real organisms do not have perfect vision or perfect information about the environment.  So the programmer shuffles a few randomly generated blue and red cards (50% probability of each) into the training sequences.  Then the programmer adjusts the noise level until the network finally ends up predicting blue with 70% probability.  And it turns out that using almost the same noise level (with just a bit of further tweaking), the improved training algorithm can learn to assign the right probability to sequences of 80% blue or 60% blue cards.

Success!  We have found the Right Amount of Noise.

Of course this success comes with certain disadvantages.  For example, maybe the blue and red cards are predictable, in the sense of coming in a learnable sequence.  Maybe the sequence is 7 blue cards followed by 3 red cards.  If we mix noise into the sensory data, we may never notice this important regularity, or learn it imperfectly... but that's the price you pay for biological realism.

What's really happening is that the training algorithm moves too far given its data, and adulterating noise with the data diminishes the impact of the data.  The two errors partially cancel out, but at the cost of a nontrivial loss in what we could, in principle, have learned from the data.  It would be better to adjust the training algorithm and keep the data clean.

This is an extremely oversimplified example, but it is not a total strawman.  The scenario seems silly only because it is simplified to the point where you can clearly see what is going wrong.  Make the neural network a lot more complicated, and the solution of adding noise to the inputs might sound a lot more plausible.  While some neural network algorithms are well-understood mathematically, others are not.  In particular, systems crafted with the aim of biological realism are often not well-understood. 

But it is an inherently odd proposition that you can get a better picture of the environment by adding noise to your sensory information - by deliberately throwing away your sensory acuity.  This can only degrade the mutual information between yourself and the environment.  It can only diminish what in principle can be extracted from the data.  And this is as true for every step of the computation, as it is for the eyes themselves.  The only way that adding random noise will help is if some step of the sensory processing is doing worse than random.

Now it is certainly possible to design an imperfect reasoner that only works in the presence of an accustomed noise level.  Biological systems are unable to avoid noise, and therefore adapt to overcome noise.  Subtract the noise, and mechanisms adapted to the presence of noise may do strange things.

Biological systems are often fragile under conditions that have no evolutionary precedent in the ancestral environment.  If somehow the Earth's gravity decreased, then birds might become unstable, lurching up in the air as their wings overcompensated for the now-decreased gravity.  But this doesn't mean that stronger gravity helps birds fly better.  Gravity is still the difficult challenge that a bird's wings work to overcome - even though birds are now adapted to gravity as an invariant.

What about hill-climbing, simulated annealing, or genetic algorithms?  These AI algorithms are local search techniques that randomly investigate some of their nearest neighbors.  If an investigated neighbor is superior to the current position, the algorithm jumps there.  (Or sometimes probabilistically jumps to a neighbor with probability determined by the difference between neighbor goodness and current goodness.)  Are these techniques drawing on the power of noise?

Local search algorithms take advantage of the regularity of the search space - that if you find a good point in the search space, its neighborhood of closely similar points is a likely place to search for a slightly better neighbor.  And then this neighbor, in turn, is a likely place to search for a still better neighbor; and so on.  To the extent this regularity of the search space breaks down, hill-climbing algorithms will perform poorly.  If the neighbors of a good point are no more likely to be good than randomly selected points, then a hill-climbing algorithm simply won't work.

But still, doesn't a local search algorithm need to make random changes to the current point in order to generate neighbors for evaluation?  Not necessarily; some local search algorithms systematically generate all possible neighbors, and select the best one.  These greedy algorithms work fine for some problems, but on other problems it has been found that greedy local algorithms get stuck in local minima.  The next step up from greedy local algorithms, in terms of added randomness, is random-restart hill-climbing - as soon as we find a local maximum, we restart someplace random, and repeat this process a number of times.  For our final solution, we return the best local maximum found when time runs out.  Random-restart hill-climbing is surprisingly useful; it can easily solve some problem classes where any individual starting point is unlikely to lead to a global maximum or acceptable solution, but it is likely that at least one of a thousand individual starting points will lead to the global maximum or acceptable solution.

The non-randomly-restarting, greedy, local-maximum-grabbing algorithm, is "stupid" at the stage where it gets stuck in a local maximum.  Once you find a local maximum, you know you're not going to do better by greedy local search - so you may as well try something else with your time.  Picking a random point and starting again is drastic, but it's not as stupid as searching the neighbors of a particular local maximum over and over again.  (Biological species often do get stuck in local optima.  Evolution, being unintelligent, has no mind with which to "notice" when it is testing the same gene complexes over and over.)

Even more stupid is picking a particular starting point, and then evaluating its fitness over and over again, without even searching its neighbors.  This is the lockpicker who goes on trying 0-0-0-0 forever.

Hill-climbing search is not so much a little bit randomized compared to the completely stupid lockpicker, as almost entirely nonrandomized compared to a completely ignorant searcher.  We search only the local neighborhood, rather than selecting a random point from the entire state space.  That probability distribution has been narrowed enormously, relative to the overall state space.  This exploits the belief - the knowledge, if the belief is correct - that a good point probably has good neighbors.

You can imagine splitting a hill-climbing algorithm into components that are "deterministic" (or rather, knowledge-exploiting) and "randomized" (the leftover ignorance).

A programmer writing a probabilistic hill-climber will use some formula to assign probabilities to each neighbor, as a function of the neighbor's fitness.  For example, a neighbor with a fitness of 60 might have probability 80% of being selected, while other neighbors with fitnesses of 55, 52, and 40 might have selection probabilities of 10%, 9%, and 1%.  The programmer writes a deterministic algorithm, a fixed formula, that produces these numbers - 80, 10, 9, and 1.

What about the actual job of making a random selection at these probabilities?  Usually the programmer will hand that job off to someone else's pseudo-random algorithm - most language's standard libraries contain a standard pseudo-random algorithm; there's no need to write your own.

If the hill-climber doesn't seem to work well, the programmer tweaks the deterministic part of the algorithm, the part that assigns these fixed numbers 80, 10, 9, and 1.  The programmer does not say - "I bet these probabilities are right, but I need a source that's even more random, like a thermal noise generator, instead of this merely pseudo-random algorithm that is ultimately deterministic!"  The programmer does not go in search of better noise.

It is theoretically possible for a poorly designed "pseudo-random algorithm" to be stupid relative to the search space; for example, it might always jump in the same direction.  But the "pseudo-random algorithm" has to be really shoddy for that to happen.  You're only likely to get stuck with that problem if you reinvent the wheel instead of using a standard, off-the-shelf solution.  A decent pseudo-random algorithm works just as well as a thermal noise source on optimization problems.  It is possible (though difficult) for an exceptionally poor noise source to be exceptionally stupid on the problem, but you cannot do exceptionally well by finding a noise source that is exceptionally random.  The power comes from the knowledge - the deterministic formula that assigns a fixed probability distribution.  It does not reside in the remaining ignorance.

And that's why I always say that the power of natural selection comes from the selection part, not the mutation part.

As a general principle, on any problem for which you know that a particular unrandomized algorithm is unusually stupid - so that a randomized algorithm seems wiser - you should be able to use the same knowledge to produce a superior derandomized algorithm. If nothing else seems obvious, just avoid outputs that look "unrandomized"!  If you know something is stupid, deliberately avoid it!  (There are exceptions to this rule, but they have to do with defeating cryptographic adversaries - that is, preventing someone else's intelligence from working on you.  Certainly entropy can act as an antidote to intelligence!)  And of course there are very common practical exceptions whenever the computational cost of using all our knowledge exceeds the marginal benefit...

Still you should find, as a general principle, that randomness hath no power: there is no beauty in entropy, nor strength from noise.

Moderation Guidelines: Reign of Terror - I delete anything I judge to be annoying or counterproductiveexpand_more
But it is an inherently odd proposition that you can get a better picture of the environment by adding noise to your sensory information - by deliberately throwing away your sensory acuity. This can only degrade the mutual information between yourself and the environment. It can only diminish what in principle can be extracted from the data.

It is certainly counterintuitive to think that, by adding noise, you can get more out of data. But it is nevertheless true.

Every detection system has a perceptual threshold, a level of stimulation needed for it to register a signal. If the system is mostly noise-free, this threshold is a ’sharp’ transition. If the system has a lot of noise, the theshold is ‘fuzzy’. The noise present at one moment might destructively interact with the signal, reducing its strength, or constructively interact, making it stronger. The result is that the threshold becomes an average; it is no longer possible to know whether the system will respond merely by considering the strength of the signal.

When dealing with a signal that is just below the threshold, a noiseless system won’t be able to perceive it at all. But a noisy system will pick out some of it - some of the time, the noise and the weak signal will add together in such a way that the result is strong enough for the system to react to it positively.

You can see this effect demonstrated at science museums. If an image is printed very, very faintly on white paper, just at the human threshold for visual detection, you can stare right at the paper and not see what’s there. But if the same image is printed onto paper on which a random pattern of grey dots has also been printed, we can suddenly perceive some of it - and extrapolate the whole from the random parts we can see. We are very good at extracting data from noisy systems, but only if we can perceive the data in the first place. The noise makes it possible to detect the data carried by weak signals.

When trying to make out faint signals, static can be beneficial. Which is why biological organisms introduce noise into their detection physiologies - a fact which surprised biologists when they first learned of it.

This post is my first experience learning about noise in algorithms, so forgive me if I seem underinformed. Two points occurred to me while reading this comment, some clarification would be great:

First, while it was intriguing to read that input just below the perceptual threshold would half the time be perceived by bumping it above the threshold, it seems to me that input just above the threshold would half the time be knocked below it. So wouldn't noise lead to no gain? Just a loss in acuity?

Second, I'm confused how input below the perceptual threshold is actually input. If a chair moves in front of a camera so slightly that the camera doesn't register a change in position, the input seems to me like zero, and noise loud enough to move zero past the perceptual threshold would not distinguish between movement and stillness, but go off half the time and half the time be silent. If that doesn't make sense, assume that the threshold is .1 meters, and the camera doesn't notice any movement less than that. Let's say your noise is a random number between .01 meters and -.01 meters. The chair moves .09 meters, and your noise lands on .01 meters. I wouldn't think that would cross the threshold, because the camera can't actually detect that .09 meters if it's threshold is .1. So, wouldn't the input just be 0 motion detected + .01 meters of noise = .01 meters of motion? Maybe I'm misunderstanding.

Suppose you have a motion-detector that looks once per second and notices a change when the chair moves by 0.1m within a second and is completely blind to smaller changes. Then a chair moving at 0.09m/s won't trigger it at all. Now suppose you add noise of amplitude +-0.01m. Then in most seconds you still won't see anything, but sometimes (I think 1/8 of the time, if that noise is uniformly distributed) the apparent movement will be above the threshold. So now if you do some kind of aggregation of the detector output over time you'll be able to tell that the chair is moving.

Yes, the cost of this is that above the threshold your performance is worse. You'll need to take averages or something of the kind to make up for it. (But: when a detector has a threshold, it usually doesn't give perfectly accurate measurements just above the threshold. You may find that even above the threshold you actually get more useful results in the presence of noise.)

Another example. Suppose you are trying to detect oscillating signals (musical notes, radio waves, ...) via an analogue-to-digital converter. Let's say its resolution is 1 unit. Then a signal oscillating between -0.5 and +0.5 will not show up at all: every time you sample it you'll get zero. And any small change to the signal will make exactly no difference to the output. But if you add enough noise to that signal, it becomes detectable. You'll need to average your data (or do something broadly similar); you'll have some risk of false positives; but if you have enough data you can measure the signal pretty well even though it's well below the threshold of your ADC.

[EDITED to add:] It may be worth observing that there's nothing super-special about adding random stuff for this purpose. E.g., suppose you're trying to measure some non-varying value using an analogue-to-digital converter, and the value you're trying to measure is smaller than the resolution in your ADC. You could (as discussed above) add noise and average. But if you happen to have the ability to add non-random offsets to your data before measuring, you can do that and get better results than with random offsets.

In other words, this is not an exception to the principle Eliezer proposes, that anything you can improve by adding randomness you can improve at least as much by adding something not-so-random instead.

The pattern painted onto white paper can't be seen because the image is also white. If the white image is printed onto paper that has parts of it that aren't white of course it's going to be more visible. Adding noise would be the equivalent of taking the image already printed onto white paper, and just adding random static on top of it. It would be even harder to see still.

What you're saying just makes no sense to me. Adding noise is just as likely to increase the existing signal as it is to decrease it. Or to make a signal appear that isn't there at all. I can't see how it's doing anything to help detect the signal.

What you're missing is that, if the signal is below the detection threshold, there is no loss if the noise pushes it farther below the detection threshold, whereas there is a gain when the noise pushes the signal above the detection threshold. Thus the noise increases sensitivity, at the cost of accuracy. (And since a lot of sensory information is redundant, the loss of accuracy is easy to work around.)

In which case, you could view the image even better if you just changed the whole backdrop to gray, instead of just random parts of it. This would correspond to the "using the same knowledge to produce a superior algorithm" part of the article.

As I understood it, the article specifically did not state that you can't ever improve a deterministic algorithm by adding randomness - only that this is a sign that you algorithm is crap, not that the problem fundamentally requires randomness. There should always exist a different deterministic algorithm which is more accurate than your random algorithm (at least in theory - in practice, that algorithm might have an unacceptable runtime or it would require even more knowledge than you have)

Silas: @Caledonian: That's an interesting point. But are you sure the effect you describe (at science museums) isn't merely due to the brain now seeing a new color gradient in the image, rather than randomness as such? Don't you get the same effect from adding an orderly grid of dots? What about from aligning the dots along the lines of the image?

Yep. Adding a set of coherent modulations will do better than noise to improve your sensor, because you're guaranteed to get at least some modulations of a sufficiently high level, and you can subtract out the modulations afterward to arrive at a superior picture of the environment.

Remember, Eliezer_Yudkowsky's point was not that randomness can never be an improvement, but that it's always possible improve beyond what randomness would yield.

Lotta commenters seem to have entirely missed this.

it's always possible improve beyond what randomness would yield

How can you improve guessing which uranium atom will blow up next?

Eliezer stated his point more precisely in the original post:

As a general principle, on any problem for which you know that a particular unrandomized algorithm is unusually stupid - so that a randomized algorithm seems wiser - you should be able to use the same knowledge to produce a superior derandomized algorithm.

I'd recommend engaging with that formulation of his point, rather than with Silas's summary (which is what you've quoted).

You might want to footnote, before anyone starts making noise about the ant example, that colony selection is not a case of group selection but a case of individual selection on the queen (and drones), since the rest of the ants don't reproduce.

As a general principle, on any problem for which you know that a particular unrandomized algorithm is unusually stupid - so that a randomized algorithm seems wiser - you should be able to use the same knowledge to produce a superior derandomized algorithm.

This claim is at least as strong as BPP = P, which while suspected is definitely not proven, and appears to be very difficult to prove. In particular, even if BPP = P, current knowledge has no way of de-randomizing all (or even most) polynomial-time randomized algorithms.

Am I misunderstanding something here?

I believe Eliezer's point wasn't that any algoritm should be de-randomized. His point was that it is impossible to extract useful work from noise; useful work is lawful. It is more philosophical statement than practical advice. To admire beauty of entropy is to worship your own ignorance and so on.

Putting aside the computational complexity, space-time tradeoff, you can easily see it: if something, in principle, can be done with help of random number generator, than the same result can be obtained without using random number generator.

Random is a void; optimization processes are closing that gap by ordered patterns. If you don't know how to fill in that hole, it means that you don't have enough information about some problem; it doesn't mean that entropy is magic key to this problem.

And yes, in practice the cost of obtaining that information may render the derandomized algorithm useless in practice; for example, select of the "true median" (median can be found in O(n) time) as pivot in quicksort will slow it down.

Upon reading the article, I had the same impression as jsteinhardt. I thought that Eliezer was, in fact, giving practical advice: "if your algorithm performs better with noise, figure out what's wrong with it and remove the noise". This advice is good in some cases, but impractical in others, just as you said in your closing paragraph.

I think the reason for why I read Eliezer's statement the way I did is that I don't really see the point in making "philosophical statements" about AI algorithms. These algorithms exist to efficiently solve real problems, not to score points in abstract debates.

"Philosophical statements" about AI algorithms are useful: not for algorithms themselves, but for AI researchers.

AI researcher shouldn't be mistaken about Mysterious Power Of The Entropy. The "We don't understand human intelligence, hence if I wish to create an artificial one, I must not understand it either" attitude is wrong.

So if I was to summarize this post, the summary was something like "Noise can't be inherently good; if entropy magically solves your problem, it means that there is a some more lawful non-magical way to solve this problem."

I think it is a part of more general principle "Never leave behind parts that work, but you have no slightest idea why they work".

AI researcher shouldn't be mistaken about Mysterious Power Of The Entropy

I am having trouble picturing a real AI researcher who believes in the "Mysterious Power Of The Entropy". Maybe I simply lack sufficient imagination. Still, that sounds like something a philosopher might believe in, not an AI researcher who actually implements (or designs) real AI algorithms.

if entropy magically solves your problem, it means that there is a some more lawful non-magical way to solve this problem.

I guess it depends on what you mean by "magical". There are tons of stochastic algorithms out there that rely on noise explicitly. Sure, there technically does exist a "lawful non-magical way" to solve the problem these algorithms are solving stochastically, but usually it's completely infeasible due to the amount of time it would take to run. Thus, one has no choice but to use the noisy algorithms.

Never leave behind parts that work, but you have no slightest idea why they work

Again, this depends on how strictly you want to interpret the rule. For example, multi-layer neural networks work very well, but they are often criticized for not being transparent. You can train the network to recognize handwriting (just for example), but you can't readily explain what all the weights mean once you'd trained it.

Brian, the reason we do that is to avoid the quicksort algorithm being stupid and choosing the worst-case pivot every time. The naive deterministic choices of pivot (like "pick the first element") do poorly on many of the permutations of the input which are far more probable than 1/n! because of the types of inputs people give to sorting algorithm, namely, already or nearly-already sorted input. Picking the middle element does better because inputs sorted inside to outside are rarer, but they're still far more likely than 1/n! apiece. Picking a random element is a very easy way to say "hey, any simple algorithm I think up will do things that correlate with algorithms other people think up, so will hit worst-case running times more often than I'd like, so I'll avoid correlation with other people".

There are variants of quicksort that completely avoid worst-case complexity by choosing the true median of the list each time. They incur an extra cost that makes average case worse, though, and they're usually not the best choice because we're almost always not trying to avoid the worst-case for a single run, we're actually trying to make the average case faster.

Brian, a very simple analysis would say something like, "If you think the middle element is likely to be an unusually bad pivot, exclude it as a possible pivot during random selection. If you have no such belief, why not go ahead and use them? If you do have such a belief, why let random selection occasionally select bad pivots?"

@Joshua_Simmons: I got to thinking about that idea as I read today's post, but I think Eliezer_Yudkowsky answered it therein: Yes, it's important to expirment, but why must your selection of what to try out, be random? You should be able to do better by exploiting all of your knowledge about the structure of the space, so as to pick better ways to experiment. To the extent that your non-random choices of what to test do worse than random, it is because your understanding of the problem is so poor as to be worse than random.

(And of course, the only time when searching the small space around known-useful points is a good idea, is when you already have knowledge of the structure of the space...)

@Caledonian: That's an interesting point. But are you sure the effect you describe (at science museums) isn't merely due to the brain now seeing a new color gradient in the image, rather than randomness as such? Don't you get the same effect from adding an orderly grid of dots? What about from aligning the dots along the lines of the image?

Remember, Eliezer_Yudkowsky's point was not that randomness can never be an improvement, but that it's always possible improve beyond what randomness would yield.

A few posters might want to read up on Stochastic Resonance, which was surprisingly surprising a few decades ago. I'm getting a similar impression now from recent research in the field of Compressive Sensing, which ostensibly violates the Nyquist sampling limit, highlighting the immaturity of the general understanding of information-theory.

In my opinion, there's nothing especially remarkable here other than the propensity to conflate the addition of noise to data, with the addition of "noise" (a stochastic element) to (search for) data.

This confusion appears to map very well onto the cybernetic distinction between intelligently knowing the answer and intelligently controlling for the answer.

A combination dial often has a tolerance of 2 in either direction. 20-45-35 will open a lock set to 22-33-44.

I certainly hope not! I think you intended 20-35-45 for the first or 22-44-33 for the second.

Cheap locks (like those used for middle school/high school lockers) have about as much variance as Eliezer claims. As horrifying as that may be.

I disagree, a tolerance of 2 in either direction directly implies that starting from the solution you can round all the numbers up by two or round them down by two and the lock will still open.

Edit: To follow up, the original claim in the article was: "20-45-35 will open a lock set to 22-44-33", or in other words, a combination off by "-2, +1, +2" will open the lock.

the original claim in the article was: "20-45-35 will open a lock set to 22-44-33"

Not as Aaron3 quoted it. (I guess EY has edited it since.)

I somehow missed the 33-44 transpose in the quote. That would indeed be a wide variance.

I recently ran into a nice example of how randomness can be improved upon but you may not want to de-randomize.

One of my personal tools is a setup which archives & preserves URLs for later reference, since pretty much every web page will disappear or die eventually. The core is a loop which reads a URL out of a text file and then run the Internet requests to archive it.

The problem is that I want to archive hundreds & thousands of URLs a week, but I can only archive 1 every 20 seconds. The file is sorted to eliminate duplicates. The obvious approach is to archive the first URL in the file, which is also alphabetically first.

The problem with this is that as ever more URLs are dumped into the file, we wind up with a sort of resource starvation where URLs which begin with an 'x' or 'z' may never get archived because the archiving keeps on processing the 'a' and 'b' URLs first. We don't want to disproportionately favor any particular domain.

So what do we do? Select a random URL, of course. This avoids the bias problem - now the 'x' URLs have the same chance as a similar number of 'a' URLs.

But is random selection optimal? Nope. As with the lock example, we can improve the distribution by adding some sort of 'memory' about what domains have already gotten a URL archived; so one could randomly select a URL - as long as its domain hasn't gotten a URL archived in the last n archives.

Why haven't I done this? Because the archiver is meant to be able to survive frequent crashes and shutdowns (it runs on my laptop), and to get a memory over more than a few days, I would have to implement code to regularly serialize out the memory and so on. This wouldn't be too bad in Haskell (perhaps another 5 or 6 lines to the code), but that represents a pretty big increase in source code and the complexity of the design, and the benefit is fairly small. In this case, I'm happy to use the simpler dumber randomized solution.

"it's not impossible, but it could happen" - typo? Should be "It's not very likely, but it could happen" or something like that?

To those discussing randomized Quicksort: relevant to your discussion about more intelligent behaivour might be the Introsort algorithm.

See https://secure.wikimedia.org/wikipedia/en/wiki/Introsort

"Introsort or introspective sort is a sorting algorithm designed by David Musser in 1997. It begins with quicksort and switches to heapsort when the recursion depth exceeds a level based on (the logarithm of) the number of elements being sorted. It is the best of both worlds, with a worst-case O(n log n) runtime and practical performance comparable to quicksort on typical data sets. Since both algorithms it uses are comparison sorts, it too is a comparison sort."

"We are currently living through a crisis that is in large part due to this lack of appreciation for emergent behavior. Not only people in general but trained economists, even Nobel laureates like Paul Krugman, lack the imagination to understand the emergent behavior of free monetary systems."

"Emergence", in this instance, is an empty buzzword, see http://lesswrong.com/lw/iv/the_futility_of_emergence/. "Imagination" also seems likely to be an empty buzzword, in the sense of http://lesswrong.com/lw/jb/applause_lights/.

"precisely because the emergent behavior of the market is more powerful, more intelligent, in solving the problem of resource allocation than any committee."

Markets do not allocate resources anywhere near optimally, and sometimes they do even worse than committees of bureaucrats; the bureaucrats, for instance, may increase utility by allocating more resources to poor people on grounds of higher marginal utility per dollar per person.

"Once you understand it then it's not so amazing but it is very difficult to understand. Ben Bernanke doesn't understand and Alan Greenspan didn't understand before him."

If you think you know more than Bernanke, then why haven't you become rich by making better-than-expected bets?

"It can be improved on by randomisation: randomly betting on heads with p=0.5 and tails with p=0.5 is a stochastic strategy which offers improved returns - and there is no deterministic strategy which produces superior results to it."

Eliezer has already noted that it is possible for a random strategy to be superior to a stupid deterministic strategy:

"But it is possible in theory, since you can have things that are anti-optimized. Say, the average state has utility -10, but the current state has an unusually low utility of -100. So in this case, a random jump has an expected benefit. If you happen to be standing in the middle of a lava pit, running around at random is better than staying in the same place. (Not best, but better.) A given AI algorithm can do better when randomness is injected, provided that some step of the unrandomized algorithm is doing worse than random."

The point of the post is that a random strategy is never better than the best possible deterministic strategy. And assuming that you're betting on real, physical coinflips, a random strategy is actually worse than the deterministic strategy of betting that the coin will come up heads if it started as heads and vice versa (see http://www.npr.org/templates/story/story.php?storyId=1697475).

@Caledonian and Tiiba: If we knew where the image was, we wouldn't need the dots.

Okay, let's take a step back: the scenario, as Caledonian originally stated, was that the museum people could make a patron better see the image if the museum people put random dots on the image. (Pronouns avoided for clarity.) So, the problem is framed as whether you can make someone else see an image that you already know is there, by somehow exploiting randomness. My response is that, if you already know the image is there, you can improve beyond randomness, but just putting the dots there in a way that highlights the hidden image's lines. In any case, from that position, Eliezer_Yudkowsky is correct in that you can only improve the patron's detection ability for that image, by exploiting your non-random knowledge about the image.

Now, if you want to reframe that scenario, you have to adjust the baselines appropriately. (Apples to apples and all.) Let's look at a different version:

I don't know if there are subtle, barely-visible images that will come up in my daily life, but if there are, I want to see them. Can I make myself better off by adding random gray dots to my vision? By scattering physical dots wherever I go?

I can's see how it would help, but feel free to prove me wrong.

"It is not clear this can be shown to be true. 'Improvement' depends on what is valued, and what the context permits. In the real world, the value of an algorithm depends on not only its abstract mathematical properties but the costs of implementing it in an environment for which we have only imperfect knowledge."

Eliezer specifically noted this in the post:

"Sometimes it is too expensive to take advantage of all the knowledge that we could, in theory, acquire from previous tests. Moreover, a complete enumeration or interval-skipping algorithm would still end up being stupid. In this case, computer scientists often use a cheap pseudo-random algorithm, because the computational cost of using our knowledge exceeds the benefit to be gained from using it. This does not show the power of randomness, but, rather, the predictable stupidity of certain specific deterministic algorithms on that particular problem."

Anyone care to work out exactly how much better off 0-0-0-0 is than a random set in this case? The probability of success at each cycle goes up to 1/9999 from 1/10000 (after the first cycle).

It's obvious to those who comprehend the emergent behavior that interest rates have been set way below market rates, for too long, and that is the cause of the current crisis. By "comprehend the emergent behavior" do you mean that you have a vague intuitive feel for this, or that you have the equations relating interest rates to other factors, along with enough mathematical theory to make specific quatitative predictions? If you (or people like you who "comprehend the emergent behavior") did not make a lot of money out of the current crisis, then your statement is wrong. Explanations after the fact are simply stories.

How would you categorize the practice of randomly selecting the pivot element in a quicksort?

Caledonian: Yes, I did. So: can't you always do better in principle by increasing sensitivity?

drone: In fact, since random search is unbiased search, it now seems that overcoming bias should be frowned on! Perhaps "Finding Effective Biases" is a better title for this blog.

"Inductive Bias"

Caledonian: couldn't you always do better in such a case, in principle (ignoring resource limits), by increasing resolution?

Daniel C. Dennett argues that random data is useful in optimization as it's absolutely key to experimentation -- you must "wiggle" different variables, as he says, and observe changes. This is a foundational part of his argumentation in Darwin's Dangerous Idea and Freedom Evolves.

"'Emergence';, in this instance, is an empty buzzword.

Buzzword in this instance is a buzzword. This sentence is merely an assertion. I read that article before I wrote my argument. The phrase, "emergent behavior" and the word "emergence" have a specific meaning and it isn't about giving a "mysterious answer to a mysterious queston".

For example, Mises can and does give a complete and non-mysterious explaination of how the business cycle is a result of fractional reserve banking. Likewise, he can explain how market prices arise, and why markets clear at the market price. All in a very reductionist fashion.

"Imagination" also seems likely to be an empty buzzword, ..."

No, it's has the same exact meaning as in "Creationist lack the imagination to understand how evolution works." or "Behe, lacking the imagination to understand how eyes arose proposes the concept of irreducible complexity".

"Markets do not allocate resources anywhere near optimally, and sometimes they do even worse than committees of bureaucrats; the bureaucrats, for instance, may increase utility by allocating more resources to poor people on grounds of higher marginal utility per dollar per person."

I didn't use the word "optimally" anywhere in my comment. I said it "solved the problem of resource allocation."

The rest of you statement is just a bald assertion. In fact, "allocating resources optimally", is an ill defined concept. Allocated optimally in reference to what value system? The very concept of thinking you can make a utility function in the way you construct it is absurd, and ignores factors of that wealth redistribution that would harm the very poor it was suppose to help. Actual real world experimentation with redistributive systems like communism have shown it to be a bust.

Your statement is true in the same sense that it is possible by brownian motion for an elephant to fly. Markets are analogous to distributed supercomputers where each individual participates as a processor and prices are the signaling mechanisms between the processors. If you mess with those signals you get predictable results, depending on what you do and what kind of price you mess with.

"If you think you know more than Bernanke, then why haven't you become rich by making better-than-expected bets?"

Wow, you assume a lot. Firstly, my mother was a sharecropper, and my father a poorly paid college professor. I paid my own way through college. So it's not like I had a big nest egg to invest.

I did however predict the surge in sliver prices. I did converted my IRA to bullion. I did quadruple my investment in four years.

Besides, it's not the position of my theory that you get rich by understanding economics. That's your ridiculous claim. Did you apply that theory to Greenspan or Bernanke? Why in the world aren't all these economists retired rich? I mean that's your theory, right?

Don't you get the same effect from adding an orderly grid of dots?
In that particular example, yes. Because the image is static, as is the static.

If the static could change over time, you could get a better sense of where the image lies. It's cheaper and easier - and thus 'better' - to let natural randomness produce this static, especially since significant resources would have to be expended to eliminate the random noise.

What about from aligning the dots along the lines of the image?
If we knew where the image was, we wouldn't need the dots.

To be precise, in every case where the environment only cares about your actions and not what algorithm you use to produce them, any algorithm that can be improved by randomization can always be improved further by derandomization.
It's clear this is what you're saying.

It is not clear this can be shown to be true. 'Improvement' depends on what is valued, and what the context permits. In the real world, the value of an algorithm depends on not only its abstract mathematical properties but the costs of implementing it in an environment for which we have only imperfect knowledge.

@Mike Plotz: It's true that you can't do better than random in predicting (theoretical nonphysical) coin tosses, but you also can't do worse than random. As Eliezer pointed out, the claim isn't "it is always possible to to better than random", but "any algorithm which can be improved by adding randomness, can be improved even more without adding randomness."

Caledonian: Yes, I did. So: can't you always do better in principle by increasing sensitivity?
That's a little bit like saying that you could in principle go faster than light if you ignore relativistic effects, or that you could in principle produce a demonstration within a logical system that it is consistent if you ignore Godel's Fork.

There are lots of things we can do in principle if we ignore the fact that reality limits the principles that are valid.

As the saying goes: the difference between 'in principle' and 'in practice' is that in principle there is no difference between them, and in practice, there is.

If you remove the limitations on the amount and kind of knowledge you can acquire, randomness is inferior to the unrandom. But you can't remove those limitations.

I'm surprised at the pushback on this rather simple straightforward point. In fact, it seems kind of beaten to death so I hope it has some cool unexpected consequence soon-to-be revealed.

Besides being easy, randomness has the amusing property of helping overcome bias.

In fact, since random search is unbiased search, it now seems that overcoming bias should be frowned on! Perhaps "Finding Effective Biases" is a better title for this blog.

OK, let me give you another example of the lock device. Each time a code is tried, the correct code changes to (previous code) + 2571 mod 10000. You don't know this. You won't find out before opening the door, because of limited feedback. Sequential check of every code will fail, but let you know that the correct code changes (if there is a correct code). Constantly guessing the same code because you think it'll randomly change to that one will fail. Random guessing will eventually succeed. Using randomness prevents you from getting stuck due to your own stupidity or an opponent. There is no method for always beating randomness, making it a reliable failsafe against always losing.

You won't always have knowledge about the problem, and on occasion what you think is knowledge is wrong. Random may be stupid, but it has a lower bound for stupidity.

It is theoretically possible for a poorly designed "pseudo-random algorithm" to be stupid relative to the search space; for example, it might always jump in the same direction. But the "pseudo-random algorithm" has to be really shoddy for that to happen. You're only likely to get stuck with that problem if you reinvent the wheel instead of using a standard, off-the-shelf solution.

You don't need that much shoddiness to get very weird things in certain situations (IIRC there was a PRNG getting some value wrong by 20 standard deviations in a Monte Carlo simulation of the Ising model because of a non-zero three-way correlation between the n-th, (n + 63)-th, and (n + 126)-th outputs or something like that), and some off-the-shelf solutions are shoddier than you may think.

Omega creates one more copy of you, then he asks each of you to say either “zero” or “one”; he will give each of you $10 unless you both pick the same number. You have no idea which copy you are, and cannot communicate with the other copy before you both make your choice.

The strategies “choose 0” and “choose 1” both lose, but the strategy “choose 0 with 50% probability and 1 with 50% probability” has a 50% probability of winning.

Here's an algorithm that I've heard is either really hard to derandomize, or has been proven impossible to derandomize. (I couldn't find a reference for the latter claim.) Find an arbitrary prime between two large numbers, like 10^500 - 10^501. The problem with searching sequentially is that there are arbitrarily long stretches of composites among the naturals, and if you start somewhere in one of those you'll end up spending a lot more time before you get to the end of the stretch.

See the Polymath project on that subject. The conjecture is that it is possible to derandomize, but it hasn't been proven either way. Note that finding an algorithm isn't the hard part: if a deterministic algorithm exists, then the universal dovetail algorithm also works.

This is one of my favorite posts. I'm disappointed that I never grasped this during my computer science education.

This talk of "general principles" is rubbish from a physical scientist's perspective; were what you were saying a general principle there would be no such thing as stochastic resonance, nor would Monte Carlo integration be so useful a technique. And if there is some "derandomized" alternative to simulated annealing that is superior in finding minima (or maxima) in ugly solution spaces, it hasn't been found. "So what?" you might ask, but consider that tens or hundreds of thousands of engineers, scientists, and mathematicians have thought about this problem before you 'blogged it.

Yours is an interesting post, certainly, but perhaps you should substitute modesty for flowery finishes and hasty elevation of a minor finding to "general principle" status.

Dear Brian,

Sorry if I was overly brusque in my response. you have obviously given the subject more thought than I gave you credit for.

The line you wrote here is the critical one: BTW, that theory predicted the Great Depression before the fact, and this crash before the fact. It also predicted the existence of stagflation before the fact.

Where were these predictions? Did they represent a majority view of Austrian School economics? How is their false positive, false negative score? (basically, I'm asking for a statistical survey of Austrian school economics track record; that seems to be the minimum requirement to endorse the statement above). If their track record is so demonstrably superior, someone should have done such a survey.

Because for the moment a simple "classical economics + power law of random fluctuations (possibly giving way somewhat to a gaussian distribution for rare events)" seems a much more economical theory fitting the data (and yes, those were my economic opinions before the crash; I didn't act on it, because I have no money to spare :-).

Stuart Armstrong ,

"By 'comprehend the emergent behavior' do you mean that you have a vague intuitive feel for this, or that you have the equations relating interest rates to other factors, along with enough mathematical theory to make specific quantitative predictions?"

If I believe that a individual or committee cannot determine a market price other than by actually observing one then why on earth do you think I am claiming to be able to "make specific quantitative predictions?"

Those economists that make the mistake of thinking they can make a killing in the market are notoriously bad at it. Greenspan being an example.

Austrian theory holds that you cannot make the kind of quantitative predictions you expect to. So when you can predictibly and consistently make quantitative predictions on your economic theory then you can prove Austrian theory wrong.

"If you (or people like you who "comprehend the emergent behavior") did not make a lot of money out of the current crisis, then your statement is wrong."

Not at all. One can predict that the actions of say, Mugabe in Zimbabwe, would be an economic disaster without being able to capitalize on it.

"Explanations after the fact are simply stories."

But the explainations were given before the fact. Austrian theory exists as a model and it predicts certain outcomes given certain actions.

In the theory, below market interest rates result in low savings, overborrowing, trade deficits, asset inflation, and market bubbles. Everything that has been occuring makes sense in light of the theory.

BTW, that theory predicted the Great Depression before the fact, and this crash before the fact. It also predicted the existence of stagflation before the fact.

Likewise it predicts that the current actions of the Fed if continued are going to lead to inflation, all other things being equal.

The current crisis was caused by fractional reserve banking and monetary inflation and the current solutions being proposed and acted on by the likes of Krugman are to inject more money. These are precisely the wrong things to do.

Interest rates are prices like any other. It's well understood that when you put a price ceiling on a good that you get shortages as consumers try to consume more at the lower price and producers produce less. That is exactly what we are experiencing now, a shortage of capital due to a price ceiling on interest rates. This is a simple case of trying to violate economic law and being stung by it.

Austrian theory has more to say on the matter in that this kind of monetary inflation causes misallocation of resources but I'll refrain from writing a long article. It isn't at all about "vague intuitive" feelings either. It consists of clear, understandable mechanisms by which each result can be deduced from the model.

One can predict that the actions of say, Mugabe in Zimbabwe, would be an economic disaster without being able to capitalize on it.

How about shorting the Zimbabwean Dollar? Any piece of knowledge you have, and others lack, can be acted on by taking bets.

Brian: FYI, mergesort isn't faster than a good quicksort. Skiena's The Algorithm Design Manual, for example, says that "experiments show that a where a properly implemented quicksort is implemented well, it is typically 2-3 times faster than mergesort or heapsort" (2nd ed., p. 129).

I'm still waiting to hear what Eliezer says about your example too, as quicksort does seem to be an example of the best known implementation making necessary use of (pseudo-) randomness, and quicksort is of course extremely well-understood.

Brian, you want an answer to the real-world situation? Easy. First assume you have a source of inputs that is not antagonistic, as discussed. Then measure which deterministic pivot-choice algorithms would work best on large samples of the inputs, and use the best. Median-of-three is a great pivot choosing algorithm in practice, we've found. If your source of inputs is narrower than "whatever people anywhere using my ubergeneral sort utility will input" then you may be able to do better. For example, I regularly build DFAs from language data. Part of this process is a sort. I could implement this plan and possibly find that, I don't know, median of first, last, and about-twenty-percent-of-the-way-in is in general better. I anticipate the effort would not be worth the cost, so I don't, but there you are.

You don't have to put constraints on the input, you can just measure them (or guess well!). They're probably already there in real-world situations.

Brian: How does the randomness tie in to acquired knowledge, and what is the superior non-random method making better use of that knowledge?
The knowledge in this case is your belief about the distribution of input lists. Let's say, for the sake of argument, that you get sorted lists (forwards or backwards) more often than chance, and the rest of the time you get a random permutation.

On a random list, one choice of pivot is as good as another. On a sorted list, though, the ends are always a bad choice. Picking the first item is thus stupider than average, and you can do better with a random pivot. But you can do even better by picking the middle item: it's always the median when the list is sorted, and it's no worse than random when the list is random.

What if you have an intelligent adversary trying to mount a denial-of-service attack against your quicksort? Then you should expect to get inputs that your algorithm is likely to do poorly on. If you pick pivots randomly then your expected time to sort a list is completely independent of the list's initial ordering. Even though the number of permutations that cause degenerate O(n^2) behavior is the same, you've made it impossible to pick one ahead of time. Randomness can defeat an intelligent adversary for the same reason that optimization processes don't work in a completely random universe.

To be precise, in every case where the environment only cares about your actions and not what algorithm you use to produce them, any algorithm that can be improved by randomization can always be improved further by derandomization.

It seems that randomization can serve as a general substitute for memory. It's not a perfect substitute, but whenever you don't want to, or can't, remember something for some reason, randomization might help.

Besides the example of The Absent-Minded Driver, I've realized there are other examples in cryptography:

  • nonces - Many encryption schemes require a unique nonce to be generated for each message to be encrypted. You can either pick random nonces, or keep a counter. But keeping a counter might be too troublesome, and you might be running in a virtual machine that can be rolled back from time to time, so it's usually better to use a random nonce, even though you'll need a longer nonce than if you used a counter (to keep the probability of collision sufficiently small).
  • distributed key search - You can either have a central server that hands out regions of the key space to search, or each participant can search a random region. The latter is less efficient in computing time, but more efficient in communications cost.

I might do a post on this, if I could figure out a way to think about why randomization substitutes for memory.

if I could figure out a way to think about why randomization substitutes for memory.

Let A and B be actions leading to deterministic outcomes, and let C be some lottery between A and B. A rational agent will never prefer both C>A and C>B.

When you repeat the scenario without memory, the lottery is no longer exactly over choices the agent could deterministically make: the randomness is re-rolled in places where the agent doesn't get another decision. Despite what the options are labeled, you're really choosing between 2xA, 2xB, and a lottery over {2xA, 2xB, A+B}. Since the lottery contains an outcome that isn't available to the deterministic decision, it may be preferred.

I think this is equivalent to the role played by observational evidence in UDT1: Observations allow a constant strategy to take different actions in different places, whereas without any observations to distinguish agent instances you have to pick one action to optimize both situations. Of course good evidence is reliably correlated with the environment whereas randomness doesn't tell you which is which, but it's better than nothing.

I might do a post on this, if I could figure out a way to think about why randomization substitutes for memory.

I had some comments in this thread that outline the way that I think about this.

Sure. Start with random numbers. Then run the Diehard tests over them. If the Diehard tests fail, try different random numbers. There you go, algorithm derandomized.

That's an "improvement" that's unlikely to happen before universal heat death.

Improving an algorithm by adding randomness is simple in many cases - while improving it further may be totally impractical and ineffectual.

I think that this is a better de-randomization: Presumably there is some set S of numbers that have been observed to pass Diehard tests reliably. The de-randomized algorithm is simply to read off S. Computationally, this is cheaper than generating your own random set and reading it off.

ETA: And, if I'm not mistaken, if S has already been observed to pass Diehard tests reliably, then it is even more likely to pass them in the future than is a newly generated random set.

The uncertainty principle would appear to represent a problem for the thesis. Secure hardware random number generators are "really difficult" to predict. That appears to be a case where you can't - in practice - beat a random strategy - whereas a random strategy beats all kinds of stupid strategies.

The "any algorithm that can be improved by randomization" is redundant - and does no useful conceptual work. Idiotic algorithms that can be improved by adding randomness are two-a-penny.

The "can always be improved further by derandomization" is not true - in practice. Sometimes, you just can't find a way to do any better than the random algorithm.

I think the correct principle in this area is that a deterministic algorithm can always do at least as well as a random one - provided its workings can be kept secret.

The main problem case for this principle occurs when you face an opponent with a lot of resources and cryptographic skills - but with sufficient care you should be able to keep them busy until the universal heat death.

It's not always possible to improve beyond what randomness would yield. Consider, for example, the coin toss predicting game. Or the face-the-firing-squad game.

"This may not sound like a profound insight, since it is true by definition. But consider - how many comic books talk about "mutation" as if it were a source of power? Mutation is random. It's the selection part, not the mutation part, that explains the trends of evolution."

I think this is a specific case of people treating optimization power as if it just drops out of the sky at random. This is certainly true for some individual humans (eg., winning the lottery), but as you point out, it can't be true for the system as a whole.

"These greedy algorithms work fine for some problems, but on other problems it has been found that greedy local algorithms get stuck in local minima."

Er, do you mean local maxima?

"When dealing with a signal that is just below the threshold, a noiseless system won’t be able to perceive it at all. But a noisy system will pick out some of it - some of the time, the noise and the weak signal will add together in such a way that the result is strong enough for the system to react to it positively."

In such a case, you can clearly affect the content of the signal, so why not just give it a blanket boost of ten points (or whatever), if the threshold is so high that you're missing desirable data?

That is a fine observation from Caledonian. The noise is not being added to existing sense data, it's being added before the signal hits the receptors - but to stay in tune with the proposed scenario, we can easily imagine that the organism has already internalised the signals at that point.

Re: Daniel Dennet: that's not really right - random wiggling is better than no wiggling at all, but it is still an extremely bad way to generate variation.

What if the lock has multiple combinations that are close to each other. In this case the brute force does much worse than a random search. Because the correct combinations could be 98,99,100. This is true for real combination locks btw, they usually aren't too sensitive to being off by a single number, and so multiple combinations do work.

Or an even simpler case, you need to create a single algorithm to break a large number of locks. And all the locks have the same combination. And you can't save memory or modify the algorithm after it it's finished, so you have to pick the best one.

If you pick a deterministic algorithm, there is a significant chance that the combination could be something like 99999. Then you have the worst case time. Whereas the time for the random algorithm will be a Gaussian distribution around the average case.

The expected completion time for each is the same, but the deterministic algorithm has a significantly higher chance of hitting a worst case scenario.

Now a random search is bad because it can forget what it's already tried, but you can just add a memory or some other way of creating a random ordering.

And in general, there exists a pattern that will disrupt any deterministic algorithm and give it much worse performance. And they aren't rare pathological cases either.

There is nothing magical about emergence and emergence is a well document phenomena.

Unfortunately, the term "emergence" has come to have two meanings. One is uncontroversial. The other is drivel.

GreedyAlgorithm,

"If your source of inputs is narrower than 'whatever people anywhere using my ubergeneral sort utility will input' then you may be able to do better."

That's actually the scenario I had in mind, and I think it's the most common. Usually, when someone does a sort, they do it with a general-purpose library function or utility.

I think most of those are actually implemented as a merge sort, which is usually faster than quicksort, but I'm not clear on how that ties in to the use of information gained during the running of the program.

What I'm getting at is that the motivation for selecting randomly and any speedup for switching to merge sort don't seem to directly match any of the examples already given.

In his explanation and examples, Eliezer pointed to information gained while the algorithm is running. Choosing the best type of selection in a quicksort is based on foreknowledge of the data, with random selection seeming best when you have the least foreknowledge.

Likewise, the difference between quicksort and other sorts that may be faster doesn't have an obvious connection to the type information that would help you choose between different selections in a quicksort.

I'm not looking for a defense of nonrandom methods. I'm looking for an analysis of random selection in quicksort in terms of the principles that Eliezer is using to support his conclusion.

Tom, I made it clear which comments I was addressing by quoting them.

If you really want to hack my example, you should bear in mind that the "heads", "tails", and "edge" values are produced by taking bytes from a cryptographic source of randomness, and using heads: more 0s than 1s; tails: more 1s than 0s; edge: other cases.

Daniel I. Lewis, as I said, lists can have structure even when that structure is not chosen by a person.

"Let's say, for the sake of argument, that you get sorted lists (forwards or backwards) more often than chance, and the rest of the time you get a random permutation."

Let's not say that, because it creates an artificial situation. No one would select randomly if we could assume that, yet random selection is done. In reality, lists that are bad for selecting from the middle are more common than by random chance, so random beats middle.

If you put the right kind of constraints on the input, it's easy to find a nonrandom algorithm that beats random. But those same constraints can change the answer. In your case, part of the answer was the constraint that you added.

I was hoping for an answer to the real-world situation.

Isn't the most (time) efficient algorithm always a giant lookup table?

That approach doesn't help an embedded observer much, if they find that they cannot determine the entries to the table - because of the uncertainty principle.

To be precise, in every case where the environment only cares about your actions and not what algorithm you use to produce them, any algorithm that can be improved by randomization can always be improved further by derandomization.

Isn't this trivially true? Isn't the most (time) efficient algorithm always a giant lookup table?

To be precise, in every case where the environment only cares about your actions and not what algorithm you use to produce them, any algorithm that can be improved by randomization can always be improved further by derandomization.

Consider the heads-tails-edge game. Betting on edge is a deterministic strategy with a low payoff. It can be improved on by randomisation: randomly betting on heads with p=0.5 and tails with p=0.5 is a stochastic strategy which offers improved returns - and there is no deterministic strategy which produces superior results to it.

Would you discourage, where resources exist to find alternative strategies, use of any random element in choosing samples for statistical hypothesis testing?

"""What about from aligning the dots along the lines of the image?"""

Wouldn't you need to find them first?

GreedyAlgorithm, yes that's mostly why it's done. I'd add that it applies even when the source of the ordering is not a person. Measurement data can also follow the type of patterns you'd get by following a simple, fixed rule.

But I'd like to see it analyzed Eliezer's way.

How does the randomness tie in to acquired knowledge, and what is the superior non-random method making better use of that knowledge?

Using the median isn't it, because that generally takes longer to produce the same result.

Consider this scenario:

There are a large number of agents independently working on the same problem (for example, trying to find a string that hash-collides with some given string), but they cannot communicate in any way, they don't have any identification information about each other, they don't know how many other agents there are working on the problem (they aren't even sure there are any). It seems to me that each agent should decide at random where to start searching, not to fool each other but to avoid pointlessly duplicating each others' work.

Are you sure there is always something better than randomness?