The Weighted Majority Algorithm

8comingstorm

7bogdanb

4A1987dM

4bogdanb

2A1987dM

0A1987dM

1A1987dM

3V_V

0[anonymous]

0A1987dM

0Will_Pearson

0Psy-Kosh

2Eliezer Yudkowsky

0Roko

8g

2VAuroch

11R

0Michael_Howard

2Psy-Kosh

1homunq

0Daniel_Burfoot

0DonGeddis

6Stephen3

0J._Andrew_Rogers

2Silas

3Venu

2Manuel_Moertelmaier

1kyb2

2David4

0Douglas_Knight3

-5G2008

0Momo

-1NancyLebovitz

2michael_e_sullivan

0nova

1G2008

-6Caledonian2

1Joshua_Fox

2Richard_Kennaway

0g

0Richard_Kennaway

0g

0Ben_Jones

10Scott_Aaronson2

-7Read_Something

2DonGeddis

5Eliezer Yudkowsky

10Scott_Aaronson2

0bogdanb

-4Booklegger2

1Silas

4Scott_Aaronson2

4Silas

7Scott_Aaronson2

1bogdanb

2Will_Pearson

2Scott_Aaronson2

5Peter_de_Blanc

0BloodyShrimp

0PhilGoetz

2comingstorm

0DonGeddis

1Will_Pearson

1John_Faben

9Wei_Dai2

1Eliezer Yudkowsky

3Eliezer Yudkowsky

9Scott_Aaronson

6Eliezer Yudkowsky

9Scott_Aaronson

2Wei_Dai2

1Psy-Kosh

0gwern

0Aron

2DonGeddis

0davidgs

3bogdanb

2DonGeddis

3DonGeddis

0Will_Pearson

1John_Faben

0DonGeddis

2DonGeddis

0Gen_Zhang

0lifebiomedguru

-1davidgs

0VAuroch

1A1987dM

2Petter

3lackofcheese

1lackofcheese

0lackofcheese

1shminux

0lackofcheese

0ChristianKl

2Dacyn

New Comment

Some comments are truncated due to high volume. (⌘F to expand all)

What about Monte Carlo methods? There are many problems for which Monte Carlo integration is the most efficient method available.

(you are of course free to suggest and to suspect anything you like; I will, however, point out that suspicion is no substitute for building something that actually works...)

7

(Emphasis mine.) I know I’m late to the party, but I just noticed this. While what you say it’s true, “available” in this case means “that we know of”, not “that is possible”. I’m not an expert, but IIRC the point of the MCI is that the functions are hard to analyze. You could integrate analogously without randomization (e.g., sample the integration domain on a regular grid), and it very well might work. The problem is that if the function you integrate happens to behave atypically on the regular grid with respect to its global behavior (e.g., if the function has some regular features with a frequency that accidentally matches that of your grid, you might sample only local maxima, and severely overestimate the integral).
But for an individual function there certainly exists (for some values of “certainly”) an ideal sampling that yields an excellent integration, or some even better method that doesn’t need sampling. But, you need to understand the function very well, and search for that perfect evaluation strategy, and do this for every individual function you want to integrate. (Not a math expert here, but I suspect in the general case that grid would be impossible to find, even if it exists, and it’s probably very hard even for those where the solution can be found.)
So what MCI does is exactly what Eliezer mentions above: you randomize the sampling as well as you can, to avoid as much as possible tripping an unexpected “resonance” between the grid and the function’s behavior. In effect, you’re working against an environment you can’t overwhelm by being smart, so you brute force it and do your best to reduce the ways the environment can overwhelm you, i.e. you try to minimize the worst-case consequences.
Note that for functions we really understand, MCI is not by far the best method. If you want to integrate a polynomial, you just compute the antiderivative, evaluate it in two points, subtract, and you’re done. MCI is nice because in general we don’t know how to

4

Modern PRNGs tend to be less bad than that.

4

That’s not at all my domain of expertise, so I’ll take your word for it (though I guessed as much).
That said, I meant that as half funny anecdote, half “it’s not just theory, it’s real life, and you can get screwed hard by something that guesses your decision algorithm, even if it’s just a non-sentient function”, not as an argument for/against either MC or quantum noise. (Though, now that I think of it, these days there probably are quantum noise generators available somewhere.)

2

AFAIK, there are but they're way too slow for use in MC, so decent-but-fast PRNGs are still preferred.

0

Also, sometimes you need to run several tests with the same random sequence for debugging purposes, which with true random numbers would require you to store all of them, whereas with pseudo-random numbers you just need to use the same seed.

1

Too bad that the default PRNG in many systems is still an ancient one.

3

Yes.
However, programming environments oriented towards numerical computing tend to use the Mersenne Twister as their default PRNG.

0[anonymous]

Mersenne Twister is the most commonly used algorithm. It is fast, and generates good results. It is not adversarial secure, but in that situation you should be using a cryptographically secure RNG.
Age of the PRNG algorithm has no bearing on this discussion. (If rand() uses something else, it's for standards-compatability reasons; nobody writing Monti-Carlo algorithms would be using such PRNGs.)

0

Actually the C and POSIX standard impose very few requirements on rand(); it would be entirely possible in principle for it to be based on the Mersenne Twister while still complying to the standards.

It is getting late, so this may be way off, and have not the time to read the paper.

This is also assuming finite trials right? Because over infinite trials if you have a non-zero probability of siding with the wrong group of classifiers, you will make infinite mistakes. No matter how small the probabilities go.

It seems it is trading off a better expectation for a worse real worse case.

Also are you thinking of formalising an alternative to the infinite SI worst case you describe?

Um, maybe I misunderstood what you said or the proofs, but isn't there a much simpler way to reject the proof of superiorness of the WMA, namely that in the first proof is, as you say, the worst case scenario.... but the second proof is *NOT*. It uses expected value rather than worst case, right? So it's not so much "worst case analysis does really weird things" but "not only that, the two proofs aren't even validly comparable, because one is worst case analysis and one involves expected value analysis"

Or did I horribly misunderstand either the second proof or your argument against it? (if so, I blame lack of sleep)

Psy, the "worst case expected value" is a bound directly comparable to the "worst case value" because the actual trajectory of expert weights is the same between the two algorithms.

Pearson, as an infinite set atheist I see no reason to drag infinities into this case, and I don't recall any of the proofs talked about them either.

Has anyone proved a theorem on the uselessness of randomness?

Nominull's suggestion:

"By adding randomness to your algorithm, you spread its behaviors out over a particular distribution, and there must be at least one point in that distribution whose expected value is at least as high as the average expected value of the distribution."

Might be a starting point for such a proof. However it does rely on being able to find that "one point" whose expected value is >= the average expected value of the distribution. Perhaps it is easy to prove that good points lie in a certain range, but all obvious specific choices in that range are bad.

So, the randomized algorithm isn't *really* better than the unrandomized one because getting a bad result from the unrandomized one is only going to happen when your environment maliciously hands you a problem whose features match up just wrong with the non-random choices you make, so all you need to do is to make those choices in a way that's tremendously unlikely to match up just wrong with anything the environment hands you because it doesn't have the same sorts of pattern in it that the environment might inflict on you.

Except that the definition of "...

2

No, it means "the sort of thing which is uncorrelated with the environment". What you want is for your responses to be correlated with the environment in a way that benefits you.

I think this series of posts should contain a disclaimer at some point stating the usability of randomized algorithms in practice (even excluding situations like distributed computing and cryptography). In a broad sense, though randomness offers us no advantage information theoretically (this is what you seem to be saying), randomness does offer a lot of advantages computationally. This fact is not at odds with what you are saying, but deserves to be stated more explicitly.

There are many algorithmic tasks which become feasible via randomized algorithms, for which no feasible deterministic alternatives are known - until recently primality testing was one such problem (even now, the deterministic primality test is unusable in practice). Another crucial advantage of randomized algorithms is that in many cases it is much easier to come up with a randomized algorithm and later, if needed, derandomize the algorithm. Randomized algorithms are extremely useful in tackling algorithmic problems and I think any AI that has to design algorithms would have to be able to reason in this paradigm (cf. Randomized algorithms, Motwani and Raghavan or Probability and Computing, Mitzenmacher and Upfal).

To put my point another way...

Say your experts, above, predict instance frequencies in populations. Huge populations, so you can only test expert accuracy by checking subsets of those populations.

So, you write an algorithm to choose which individuals to sample from populations when testing expert's beliefs. Aren't there situations where the best algorithm, the one least likely to bias for any particular expert algorithm, would contain some random element?

I fear randomness is occasionally necessary just to fairly check the accuracy of beliefs about the world and overcome bias.

Eliezer: I recognize that the two algorithms have the exact same trajectory for the weights, that the only thing that's different between them is the final output, not the weights. However, maybe I'm being clueless here, but I still don't see that as justifying considering the two different proof results really comparable.

The real worst case analysis for the second algorithm would be more like "the random source *just happened to* (extremely unlikely, but...) always come up with the wrong answer."

Obviously this sort of worst case analysis isn't rea...

Sure, randomness is the lazy way out. But if computational resources are not infinite, laziness is sometimes the smart way out. And I said "infinite" - even transhuman quantum computers are finite.

I agree with Psy-Kosh, I don't think the proofs are comparable.

The difference between the two derivations is that #1 uses a weak upper bound for W, while #2 uses an equality. This means that the final bound is much weaker in #1 than in #2.

It's possible to redo the derivation for the nonrandomized version and get the exact same bound as for the randomized case. The trick is to write M as the number of samples for which F_i > 1/2. Under weak assumptions on the performance of the experts you can show that this is less than Sum_i F_i.

Eliezer: thanks for the AI brain-teaser, I hope to see more posts like this in the future.

I agree with Psy-Kosh too. The key is, as Eliezer originally wrote, *never*. That word appears in Theorem 1 (about the deterministic algorithm), but it does **not** appear in Theorem 2 (the bound on the randomized algorithm).

Basically, this is the same insight Eliezer suggests, that the environment is being allowed to be a superintelligent entity with complete knowledge in the proof for the deterministic bound, but the environment is **not** allowed the same powers in the proof for the randomized one.

In other words, Eliezer's conclusion is correct, but I don't thi...

The question whether randomized algorithms can have an advantage over deterministic algorithms is called the P vs BPP problem. As far as I know, most people believe that P=BPP, that is, randomization doesn't actually add power. However, nobody knows how to prove this. Furthermore, there exists an oracle relative to which P is not equal to BPP. Therefore any proof that P=BPP has to be nonrelativizing. This rules out a lot of the more obvious methods of proof.

see: complexity zoo

Nicely done. Quite a number of points raised that escape many people and which are rarely raised in practice.

Someone please tell me if I understand this post correctly. Here is my attempt to summarize it:

"The two textbook results are results specifically about the worst case. But you only encounter the worst case when the environment can extract the maximum amount of knowledge it can about your 'experts', and exploits this knowledge to worsen your results. For this case (and nearby similar ones) only, randomizing your algorithm helps, but only because it destroys the ability of this 'adversary' to learn about your experts. If you instead average over all cases, the non-random algorithm works better."

Is that the argument?

I am interested in what Scott Aaronson says to this.

I am unconvinced, and I agree with both the commenters g and R above. I would say Eliezer is underestimating the number of problems where the environment gives you correlated data and where the correlation is essentially a distraction. Hash functions are, e.g., widely used in programming tasks and not just by cryptographers. Randomized algorithms often are based on non-trivial insights into the problem at hand. For example, the insight in hashing and related approaches is that "two (different) object...

@ comingstorm: Quasi Monte Carlo often outperforms Monte Carlo integration for problems of small dimensionality involving "smooth" integrands. This is, however, not yet rigorously understood. (The proven bounds on performance for medium dimensionality seem to be extremely loose.)

Besides, MC doesn't require randomness in the "Kolmogorov complexity == length" sense, but in the "passes statistical randomness tests" sense. Eliezer has, as far as I can see, not talked about the various definitions of randomness.

Fascinating post. I especially like the observation that the purpose of randomness in all these examples is to defeat intelligence. It seems that this is indeed the purpose for nearly every use of randomness that I can think of. We randomise presidents routes to defeat the intelligence of an assasin, a casino randomises its roulette wheels to defeat the intelligence of the gambler. Even in sampling, we randomise our sample to defeat our own intelligence, which we need to treat as hostile for the purpose of the experiment.

What about in compressed sensing where we want an incoherent basis? I was under the impression that random linear projections was the best you could do here.

kyb, see the discussion of quicksort on the other thread. Randomness is used to protect against worst-case behavior, but it's not because we're afraid of intelligent adversaries. It's because worst-case behavior for quicksort happens a lot. If we had a good description of naturally occurring lists, we could design a deterministic pivot algorithm, but we don't. We only have the observation simple guess-the-median algorithms perform badly on real data. It's not terribly surprising that human-built lists resonate with human-designed pivot algorithms; but the opposite scenario, where the simplex method works well in practice is not surprising either.

I agree with Psy, the bounds are not comparable.

In fact, the bound for #2 should be the same as the one for #1. I mean, in the really worst case, the randomized algorithm makes exactly the same predictions as the deterministic one. I advise to blame and demolish your quantum source of random numbers in this case.

Tentative, but what about cases where you don't trust your own judgement and suspect that you need to be shaken out of your habits?

*What about Monte Carlo methods? There are many problems for which Monte Carlo integration is the most efficient method available.*

Monte Carlo methods can't buy you any correctness. They are useful because they allow you to sacrifice an unnecessary bit of correctness in order to give you a result in a much shorter time on otherwise intractable problem. They are also useful to simulate the effects of real world randomness (or at least behavior you have no idea how to systematically predict).

So, for example, I used a Monte Carlo script to determine expected ...

Congratulations Eliezer, this subject is very interesting.

Nontheless useful, i.e. Monter Carlo methods with pseudo-random numbers ARE actually deterministic, but we discard the details (very complex and not useful for the application) and only care about some statistical properties. This is a state of subjective partial information, the business of bayesian probability, only this partial information is deliberate! (or better, a pragmatic compromise due to limitations in computational power. Otherwise we'd just use classical numerical integration happily in...

As stated the statement "Randomness didn't buy me any accuracy, it was a way of trading accuracy for development time." is a complete non-sequitur since nobody actually believes that randomness is a means of increased accuracy in the long term it is used as a performance booster. It merely means you can get the answer faster with less knowledge which we do all the time since there are a plethora of systems that we can't understand.

Eliezer, can you mathematically characterize those environments where randomization can help. Presumably these are the "much more intelligent than you, and out to get you" environments, but I'd like to see that characterized a bit better.

A slight tangent here, but Blum loses me even before the considerations that Eliezer has so clearly pointed out. Comparing an upper bound for the worst case of one algorithm with an upper bound for the average case of the other is meaningless. Even were like things being compared, mere upper bounds of the true values are only weak evidence of the quality of the algorithms.

FWIW, I ran a few simulations (for the case of complete independence of all experts from themselves and each other), and unsurprisingly, the randomised algorithm performed on average wors...

Anyone who evaluates the performance of an algorithm by testing it with random data (e.g., simulating these expert-combining algorithms with randomly-erring "experts") is ipso facto executing a randomized algorithm...

g: Indeed I was. Easier than performing an analytical calculation and sufficiently rigorous for the present purpose. (The optimal weights, on the other hand, are log(p/(1-p)), by an argument that is both easier and more rigorous than a simulation.)

Richard, I wasn't suggesting that there's anything wrong with your running a simulation, I just thought it was amusing in this particular context.

Silas - yeah, that's about the size of it.

Eliezer, when you come to edit this for popular publication, lose the maths, or at least put it in an endnote. You're good enough at explaining concepts that if someone's going to get it, they're going to get it without the equations. However, a number of those people will switch off there and then. I skipped it and did fine, but algebra is a stop sign for a number of very intelligent people I know.

*I am interested in what Scott Aaronson says to this.*

I fear Eliezer will get annoyed with me again :), but R and Stephen basically nailed it. Randomness provably never helps in average-case complexity (i.e., where you fix the probability distribution over inputs) -- since given any ensemble of strategies, by convexity there must be at least one deterministic strategy in the ensemble that does at least as well as the average.

On the other hand, if you care about the *worst*-case running time, then there are settings (such as query complexity) where randomness provably *does* help. For example, suppose you're given n bits, you're promised that either n/3 or 2n/3 of the bits are 1's, and your task is to decide which. Any deterministic strategy to solve this problem clearly requires looking at 2n/3 + 1 of the bits. On the other hand, a randomized sampling strategy only has to look at O(1) bits to succeed with high probability.

Whether randomness ever helps in *worst-case polynomial-time computation* is the P versus BPP question, which is in the same league as P versus NP. It's conjectured that P=BPP (i.e., randomness never saves more than a polynomial). This is known to be true if really ...

@ Scott Aaronson. Re: your n-bits problem. You're moving the goalposts. Your deterministic algorithm determines with 100% accuracy which situation is true. Your randomized algorithm only determines with "high probability" which situation is true. These are not the same outputs.

You need to establish goal with a fixed level of probability for the answer, and then compare a randomized algorithm to a deterministic algorithm that only answers to that same level of confidence.

That's the same mistake that everyone always makes, when they say that "randomness provably does help." It's a cheaper way to solve a *different* goal. Hence, not comparable.

Scott, I don't dispute what you say. I just suggest that the confusing term "in the worst case" be replaced by the more accurate phrase "supposing that the environment is an adversarial superintelligence who can perfectly read all of your mind except bits designated 'random'".

Don: When you fix the goalposts, make sure someone can't kick the ball straight in! :-) Suppose you're given an n-bit string, and you're promised that exactly n/4 of the bits are 1, and they're either all in the left half of the string or all in the right half. The problem is to decide which. It's clear that any deterministic algorithm needs to examine at least n/4 + 1 of the bits to solve this problem. On the other hand, a randomized sampling algorithm can solve the problem *with certainty* after looking at only O(1) bits on average.

Eliezer: I often tell people that theoretical computer science is basically mathematicized paranoia, and that this is the reason why Israelis so dominate the field. You're absolutely right: we *do* typically assume the environment is an adversarial superintelligence. But that's not because we literally think it *is* one, it's because we don't presume to know which distribution over inputs the environment is going to throw at us. (That is, we lack the self-confidence to impose any particular prior on the inputs.) We *do* often assume that, if we generate random bits ourselves, then the environment isn't going to magically take those bits into account wh...

0

EDIT: OK, I found the explanation further down, seems I was on the right path.
Sorry Scott, I know I’m late to the party, but this got me curious. Can you explain this, or point me to an explanation? I’m not sure I got the correct numbers.
As far as I can tell, a reasonable deterministic algorithm is to examine, one by one, at most the (n/4 + 1) bits (e.g., the first); but if you find a 1 early, you stop. The nondeterministic one would be to just test bits in random order until you see a 1.
For the deterministic version, in half the cases you’re going to evaluate all those bits when they’re in the “end” half.
If the bits are in the “front” half, or if the algorithm is random, I can’t figure out exactly how to solve the average number of bits to test, but it seems like the deterministic half should do on average about as well* as the nondeterministic one on a problem half the size, and in the worst case scenario it still does (n/4 + 1) bits, while the nondeterministic algorithm has to do (n/2 + 1) tests in the worst-case scenario unless I’m mistaken.
I mean, it does seem like the probability of hitting a bad case is low on average, but the worst case seems twice as bad, even though harder to hit.
I’m not sure how you got O(1); I got to a sum of reciprocals of squares up to k for the probability of the first 1 bit being the k-th tested, which sounds about O(1) on average, but my probabilities are rusty, is that the way?
(* of course, it’s easy to construct the bad case for the deterministic algorithm if you know which bits it tests first.)

*the environment is an adversarial superintelligence*

Oh, Y'all are trying to outwit Murphy? No wonder you're screwed.

Could Scott_Aaronson or anyone who knows what he's talking about, please tell me the name of the n/4 left/right bits problem he's referring to, or otherwise give me a reference for it? His explanation doesn't seem to make sense: the deterministic algorithm needs to examine 1+n/4 bits only in the worst case, so you can't compare that to the average output of the random algorithm. (Average case for the determistic would, it seems, be n/8 + 1) Furthermore, I don't understand how the random method could average out to a size-independent constant.

Is the randomized algorithm one that uses a quantum computer or something?

Silas: "Solve" = for a worst-case string (in both the deterministic and randomized cases). In the randomized case, just keep picking random bits and querying them. After O(1) queries, with high probability you'll have queried either a 1 in the left half or a 1 in the right half, at which point you're done.

As far as I know this problem doesn't have a name. But it's how (for example) you construct an oracle separating P from ZPP.

@Scott_Aaronson: Previously, you had said the problem is solved with *certainty* after O(1) queries (which you had to, to satisfy the objection). Now, you're saying that after O(1) queries, it's merely a "high probability". Did you not change which claim you were defending?

Second, how can the required number of queries not depend on the problem size?

Finally, isn't your example a special case of exactly the situation Eliezer_Yudkowsky describes in this post? In it, he pointed out that the "worst case" corresponds to an adversary who kno...

Silas: Look, *as soon you find a 1 bit*, you've solved the problem with certainty. You have no remaining doubt about the answer. And the expected number of queries until you find a 1 bit is O(1) (or 2 to be exact). Why? Because with each query, your probability of finding a 1 bit is 1/2. Therefore, the probability that you need t or more queries is (1/2)^(t-1). So you get a geometric series that sums to 2.

(Note: I was careful to say you succeed with certainty after an *expected* number of queries independent of n -- not that there's a constant c such tha...

1

Quick check: take algorithm that tests bits 1, n/2+1, 3, n/2+3, 5, n/2+5 ... then continues alternating halves on even indices. Of course, it stops at the first 1. And it’s obviously deterministic.
But unless I’m mistaken, averaged over all possible instances of the problem, this algorithm has exactly the same average, best case and worst case as the nondeterministic one. (In fact, I think this holds for any fixed sequence of test indices, as long as it alternates between halves, such as 1, n, 2, n-1, 3, n-2... Basically, my reasoning is that an arbitrary index sequence should have (on average) the same probability of finding the answer in k tests on a random problem that a random index sequence does.)
I know you said it’s adversarial, and it’s clear that (assuming you know the sequence of tested indices) you can construct problem instances that take n/2+1 (worst-case) tests, in which case the problem isn’t random. I just want to test if I reasoned correctly.
If I’m right, then I think what you’re saying is exactly what Eliezer said: If you know the adversary can predict you, you’ve got no better option than to act randomly. (You just don’t do as bad in this case as in Eliezer’s example.) Conversely, if you can predict the adversary, then the random algorithm completely wastes this knowledge. (I.e., if you could guess more about the sequence, you could probably do better than n/4+1 worst case, and if you can predict the adversary then you don’t even need to test the bits, which is basically O(0).)

Isn't the probability of finding a bit 1/4 in each sample? Because you have to sample both sides. If you randomly sample without repetitions you can do even better than that. You can even bound the worse case for the random algorithm at n/4+1 as well by seeing if you ever pick n/4+1 0s from any side. So there is no downside to doing it randomly as such.

However if it is an important problem and you think you might be able to find some regularities, the best bet would to be to do bayesian updates on the most likely positions to be ones and preferentially ch...

Let's say the solver has two input tapes: a "problem instance" input and a "noise source" input.

When Eli says "worst-case time", he means the maximum time, over all problem instances and all noise sources, taken to solve the problem.

But when Scott says "worst-case time," he means taking the maximum (over all problem instances X) of the average time (over all noise sources) to solve X.

Scott's criterion seems relevant to real-world situations where one's input is being generated by an adversarial human who knows which ...

0[anonymous]

That last paragraph is pretty unsettling--you get penalized for not throwing a messy obfuscatory step into yourself? A Kolmogorov distribution indeed does seem to be a "present adversary"!

0

It sounds to me like Peter is suggesting to randomly permute all inputs to the sorting algorithm, in hopes of improving the average case runtime. The justification appears to be that the worst case has low complexity because it's easy to describe, because you can describe it by saying "it's the worst case for ".
For example, the worst-case input for Quicksort is an already-sorted list. You could, indeed, get better average performance out of quicksort if you could randomly permute its inputs in zero time, because code is more likely to produce already-sorted lists than lists sorted, say, by { x mod 17 }.
My initial reaction was that the implicit description "that input which is worst case for this algorithm" might specify an input which was not unusually likely to occur in practice. Peter is saying that it is more likely to occur in practice, even if it's an implicit description--even if we don't know what the worst-case input is--because it has lower complexity in an absolute sense, and so more processes in the real world will generate that input than one with a higher complexity.
Is that right?
And no fast deterministic permutation can be used instead, because the deterministic permutation could be specified concisely.
In practice, our random number generators use a seed of 16 to 32 bits, so they introduce almost no complexity at all by this criterion. No RNG can be useful for this purpose, for the same reason no deterministic permutation can be useful.
Still, I think this is a conclusive proof that randomness can be useful.

@michael e sullivan: re "Monte Carlo methods can't buy you any correctness" -- actually, they can. If you have an exact closed-form solution (or a rapidly-converging series, or whatever) for your numbers, you really want to use it. However not all problems have such a thing; generally, you either simplify (giving a precise, *incorrect* number that is readily computable and hopefully close), or you can do a numerical evaluation, which might approach the *correct* solution arbitrarily closely based on how much computation power you devote to it.

Quad...

Silas is right; Scott keeps changing the definition in the middle, which was exactly my original complaint.

For example, Scott says: *In the randomized case, just keep picking random bits and querying them. After O(1) queries, with high probability you'll have queried either a 1 in the left half or a 1 in the right half, at which point you're done.*

And yet this is **no different** from a deterministic algorithm. It can also query O(1) bits, and "with high probability" have a certain answer.

I'm really astonished that Scott can't see the slight-of-hand i...

Don how well a deterministic algorithm does depends upon what the deterministic algorithm is and what the input is, it cannot always query O(1) queries.

E.g. in a 20 bit case

If the deterministic algorithm starts its queries from the beginning, querying each bit in turn and this is pattern is always 00000111110000000000

It will always take n/4+1 queries. Only when the input is random will it on average take O(1) queries.

The random one will take O(1) on average on every type of input.

Don - the "sleight of hand", as you put it is that (as I think has been said before) we're examining worst-case. We explicitly *are* allowing the string of 1s and 0s to be picked by an adversarial superintelligence who knows our strategy. Scott's algorithm only needs to sample 4 bits on average to answer the question *even when the universe it out to get him* - the deterministic version will need to sample exactly n/4+1.

Basically, Eliezer seems to be claiming that BPP = P (or possibly something stronger), which most people think is probably true, but...

To generalize Peter's example, a typical deterministic algorithm has low Kolmogorov complexity, and therefore its worst-case input also has low Kolmogorov complexity and therefore a non-negligible probability under complexity-based priors. The only possible solutions to this problem I can see are:

1. add randomization

2. redesign the deterministic algorithm so that it has no worst-case input

3. do a cost-benefit analysis to show that the cost of doing either 1 or 2 is not justified by the expected utility of avoiding the worst-case performance of the origi...

1

(I agree that this is one among many valid cases of when we might want to just throw in randomization to save thinking time, as opposed to doing a detailed derandomized analysis.)

Quantum branching is "truly random" in the sense that branching the universe both ways is an irreducible source of new indexical ignorance. But the more important point is that unless the environment really *is* out to get you, you might be wiser to *exploit* the ways that you think the environment might depart from true randomness, rather than using a noise source to *throw away* this knowledge.

*I'm not making any nonstandard claims about computational complexity, only siding with the minority "derandomize things" crowd*

Note that I also enthusiastically belong to a "derandomize things" crowd! The difference is, I think derandomizing is *hard work* (sometimes possible and sometimes not), since I'm unwilling to treat the randomness of the problems the world throws at me on the same footing as randomness I generate myself in the course of solving those problems. (For those watching at home tonight, I hope the differences are now reasonably clear...)

I certainly don't say "it's not hard work", and the environmental probability distribution should *not* look like the probability distribution you have over your random numbers - it should contain correlations and structure. But once you know what your probability distribution *is*, then you should do your work relative to that, rather than assuming "worst case". Optimizing for the worst case in environments that aren't *actually* adversarial, makes even *less* sense than assuming the environment is as random and unstructured as thermal noise...

*once you know what your probability distribution is...*

I'd merely stress that that's an enormous "once." When you're writing a program (which, yes, I used to do), normally you have only the foggiest idea of what a typical input is going to be, yet you want the program to work anyway. This is not just a hypothetical worry, or something limited to cryptography: people have actually run into strange problems using pseudorandom generators for Monte Carlo simulations and hashing (see here for example, or Knuth vol 2).

Even so, intuition suggests it sh...

Even if P=BPP, that just means that giving up randomness causes "only" a polynomial slowdown instead of an exponential one, and in practice we'll still need to use pseudorandom generators to simulate randomness.

It seems clear to me that noise (in the sense of randomized algorithms) does have power, but perhaps we need to develop better intuitions as to why that is the case.

Eliezer: There are a few classes of situations I can think of in which it seems like the correct solution (given certain restrictions) requires a source of randomness:

One example would be Hofstadter's Luring Lottery. Assuming all the agents really did have common knowledge of each other's rationality, and assuming no form of communication is allowed between them in any way, isn't it true that no deterministic algorithm to decide how many entries to send in, if any at all, has a higher expected return than a randomized one? (that is, the solutions which are...

0

My shot at improving on randomness for things like the Luring Lottery: http://lesswrong.com/lw/vp/worse_than_random/18yi

The leaders in the NetflixPrize competition for the last couple years have utilized ensembles of large numbers of models with a fairly straightforward integration procedure. You can only get so far with a given model, but if you randomly scramble its hyperparameters or training procedure, and then average multiple runs together, you will improve your performance. The logical path forward is to derandomize this procedure, and figure out how to predict, a priori, which model probabilities become more accurate and which don't. But of course until you figure o...

@ Will: You happen to have named a particular deterministic algorithm; that doesn't say much about *every* deterministic algorithm. Moreover, none of you folks seem to notice much that pseudo-random algorithms are actually deterministic too...

Finally, you say: *"Only when the input is random will [the deterministic algorithm] on average take O(1) queries. The random one will take O(1) on average on every type of input."*

I can't tell you how frustrating it is to see you just continue to toss in the **"on average"** phrase, as though it doesn't...

0

DonGeddis is missing the point. Randomness does buy power. The simplest example is sampling in statistics: to estimate the fraction of read-headed people in the world to within 5% precision and with confidence 99%, it is enough to draw a certain number of random samples and compute the fraction of read-headed people in the sample. The number of samples required is independent of the population size. No deterministic algorithm can do this, not even if you try to simulate samples by using good PRNGs, because there are ``orderings'' of the world's population that would fool your algorithm. But if you use random bits that are independent of the data, you cannot be fooled. (By the way, whether 'true randomness' is really possible in the real world is a totally different issue.)
Note that, for decision problems, randomness is not believed to give you a more-than-polynomial speedup (this is the P vs BPP question). But neither sampling nor the promise problem Scott mentioned are decision problems (defined on all inputs).
Regarding your claim that you cannot compare 'worst-case time' for deterministic algorithms with 'worst-case expected time' for randomized ones: this is totally pointless and shows a total lack of understanding of randomization. No one claims that you can have a randomized algorithm with success probability one that outperforms the deterministic algorithm. So either we use the expected running time or the randomized algorithm, or we allow a small error probability.
We can choose either one and use the same definition of the problem for both deterministic and randomized. Take Scott example and suppose we want algorithms that succeed with probability 2/3 at least, on every input. The goal is the same in both cases, so they are comparable. In the case of deterministic algorithms, this implies always getting the correct answer. DonGeddis seems not to realize this, and contrary to what he says, any deterministic algorithm, no matter how clever, needs to look

3

Wait a minute. There are also orderings of the world population that will randomly fool the random algorithm. (That’s where the 99% comes from.) Or, more precisely, there are samplings generated by your random algorithm that get fooled by the actual ordering of the world population.
If you write a deterministic algorithm that samples the exact number of samples as the random one does, but instead of sampling randomly you sample deterministically, spreading the samples in a way you believe to be appropriately uniform, then, assuming the human population isn’t conspiring to trick you, you should get exactly the same 99% confidence and 5% precision.
(Of course, if you screw up spreading the samples, it means you did worse than random, so there’s something wrong with your model of the world. And if the entire world conspires to mess up your estimate of red hair frequency, and succeeds to do so without you noticing and altering you sampling, you’ve got bigger problems.)
For example, once n is found (the number of tests needed), you list all humans in order of some suitable tuple of properties (e.g., full name name, date and time of birth, population of home city/town/etc, or maybe), and then pick n people equally distanced in this list. You’re still going to find the answer, with the same precision and confidence, unless the world population conspired to distribute itself over the world exactly the right way to mess up your selection. There are all sorts of other selection procedures that are deterministic (they depend only on the population), but are ridiculously unlikely to have any correlation with hair color.
Other example: order all cities/towns/etc by GDP; group them in groups of ~N/n; pick from each group the youngest person; I don’t think n is high enough, but if it happens that a group contains a single city with more than N/n population pick the two, three, etc youngest persons. If you’re concerned that people all over the world recently started planning th

@ John, @ Scott: You're still doing something odd here. As has been mentioned earlier in the comments, you've imagined a mind-reading superintelligence ... except that it doesn't get to see the internal random string.

Look, this should be pretty simple. The phrase "worst case" has a pretty clear layman's meaning, and there's no reason we need to depart from it.

You're going to get your string of N bits. You need to write an algorithm to find the 1s. If your algorithm ever gives the wrong answer, we're going to shoot you in the head with a gun a...

To look at it one more time ... Scott originally said *Suppose you're given an n-bit string, and you're promised that exactly n/4 of the bits are 1, and they're either all in the left half of the string or all in the right half.*

So we have a whole set of deterministic algorithms for solving the problem over here, and a whole set of randomized algorithms for solving the same problem. Take the best deterministic algorithm, and the best randomized algorithm.

Some people want to claim that the best randomized algorithm is "provably better". Really? B...

Don, you missed my comment saying you could bound the randomized algorithm in the same way to n/4+1 by keeping track of where you pick and if you find n/4+1 0s in one half you conclude it is in the other half.

I wouldn't say that a randomized algorithm is better per se, just more generally useful than the a particular deterministic one. You don't have to worry about the types of inputs given to it.

This case matters because I am a software creator and I want to reuse my software components. In most cases I don't care about performance of every little sub com...

Don, if n is big enough then sure, I'm more than willing to play your game (assuming I get something out of it if I win).

My randomised algorithm will get the right answer within 4 tries in expectation. If I'm allowed to sample a thousand bits then the probability that it won't have found the answer by then is .75^1000 which I think is something like 10^-120. And this applies *even if you're allowed to choose the string after looking at my algorithm*.

If you play the game this way with your deterministic algorithm you will *always* need to sample n/4 + 1 bits ...

@ Will: Yes, you're right. You can make a randomized algorithm that has the same worst-case performance as the deterministic one. (It may have slightly impaired average case performance compared to the algorithm you folks had been discussing previously, but that's a tradeoff one can make.) My only point is that concluding that the randomized one is necessarily **better**, is far too strong a conclusion (given the evidence that has been presented in this thread so far).

But sure, you are correct that adding a random search is a cheap way to have good confiden...

@ John: can you really not see the difference between "this is guaranteed to succeed", vs. "this has only a tiny likelihood of failure"? Those aren't the same statements.

*"If you play the game this way"* -- but why would anyone want to play a game the way you're describing? Why is that an interesting game to play, an interesting way to compare algorithms? It's not about worst case in the real world, it's not about average case in the real world. It's about performance on a scenario that never occurs. Why judge algorithms on...

@Scott: would it be fair to say that the P=BPP question boils down to whether one can make a "secure" random source such that the adversarial environment can't predict it? Is that why the problem is equivalent to having good PRNG? (I'm a physicist, but who thinks he knows some of the terminology of complexity theory --- apologies if I've misunderstood the terms.)

I can't help but think that the reason why randomization appear to add information in the context of machine learning classifiers for the dichotomous case is that the actual evaluation of the algorithms is empirical, and, as such, use finite data. Two classes of non-demonic error always exist in the finite case. These are measurement error, and sampling error.

Multiattribute classifiers use individual measurements from many attributes. No empirical measurement is made without error; the more attributes that are measured, the more measurement error intrud...

-1

DonGeddis is missing the point. Randomness does buy power. The simplest example is sampling in statistics: to estimate the fraction of read-headed people in the world to within 5% precision and with confidence 99%, it is enough to draw a certain number of random samples and compute the fraction of read-headed people in the sample. The number of samples required is independent of the population size. No deterministic algorithm can do this, not even if you try to simulate samples by using good PRNGs, because there are ``orderings'' of the world's population that would fool your algorithm. But if you use random bits that are independent of the data, you cannot be fooled. (By the way, whether 'true randomness' is really possible in the real world is a totally different issue.)
Note that, for decision problems, randomness is not believed to give you a more-than-polynomial speedup (this is the P vs BPP question). But neither sampling nor the promise problem Scott mentioned are decision problems (defined on all inputs).
Regarding your claim that you cannot compare 'worst-case time' for deterministic algorithms with 'worst-case expected time' for randomized ones: this is totally pointless and shows a total lack of understanding of randomization. No one claims that you can have a randomized algorithm with success probability one that outperforms the deterministic algorithm. So either we use the expected running time or the randomized algorithm, or we allow a small error probability.
We can choose either one and use the same definition of the problem for both deterministic and randomized. Take Scott example and suppose we want algorithms that succeed with probability 2/3 at least, on every input. The goal is the same in both cases, so they are comparable. In the case of deterministic algorithms, this implies always getting the correct answer. DonGeddis seems not to realize this, and contrary to what he says, any deterministic algorithm, no matter how clever, needs to look

0

So don't use those orderings. That randomness isn't buying you power. It's acting as though you know nothing about the distribution of red-headed people. If you do, in fact, know something about the distribution of red-headed people (i.e. there are more red-headed people than average in Ireland and fewer in Africa, therefore make sure each sample draws proportionally from Ireland, Africa, and non-Ireland-non-Africa), then you can exploit that to make your prediction more reliable.

When the environment is adversarial, smarter than you are, and informed about your methods, then in a theoretical sense it may be wise to have a quantum noise source handy.

It can be handy even if it's *not* adversarial, and even if it is equally as smart as you are but not smarter. Suppose you're playing a game with payoffs (A,A) = (0,0), (A,B) = (1,1), (B,A) = (1,1), (B,B) = (0,0) against a clone of yourself (and you can't talk to each other); the mixed strategy where you pick A half of the time and B half of the time wins half of the time, but either p...

Sometimes the environment really is adversarial, though.

Regular expressions as implemented in many programming languages work fine for almost all inputs, but with terrible upper bounds. This is why libraries like re2 are needed if you want to let anyone on the Internet use regular expressions in your search engine.

Another example that comes to mind is that quicksort runs in n·log n expected time if randomly sorted beforehand.

3

The Internet can be pretty scary, but it is still very much short of an adversarial superintelligence. Sure, sometimes people will want to bring your website down, and sometimes it could be an organized effort, but even that is nothing like a genuine worst case.

1

OK, granted, in the case of general regex we're talking about upper bounds that are exponential in the worst case, and it's not very difficult to come up with really bad regex cases, so in that case a single query could cause you major issues.
That doesn't necessarily mean you should switch to a library that avoids those worst cases, though. You could simply reject the kinds of regexes that could potentially cause those issues, or run your regex matcher with a timeout and have the search fail whenever the timer expires.

0

I wouldn't mind knowing the reason I was downvoted here.

1

Seems like noise to me. Only one person cared enough about what you wrote to vote, and they disliked your post. Upvoted back to zero, since the comments seem pretty reasonable to me.

0

Thanks; it didn't occur to me to think of it statistically, although it seems obvious now.
That said, individual downvotes / upvotes still have reasons behind them; it's just that you can't generally expect to find out what they are except when they come in significant clusters.

0

If you have an search engine you don't want to allow people to search your whole corpus anyway. You usually want to search an index.

So let me see if I've got this straight.

Computer Scientists: For some problems, there are random algorithms with the property that they succeed with high probability *for any possible input*. No deterministic algorithm for these problems has this property. Therefore, random algorithms are superior.

Eliezer: But if we knew the probability distribution over possible inputs, we could create a deterministic algorithm with the property.

Computer Scientists: But we do not know the probability distribution over possible inputs.

Eliezer: Never say "I don't know&qu...

Followup to: Worse Than Random, Trust In BayesIn the wider field of Artificial Intelligence, it is not

universallyagreed and acknowledged that noise hath no power. Indeed, the conventional view in machine learning is that randomized algorithms sometimes perform better than unrandomized counterpartsand there is nothing peculiar about this.Thus, reading an ordinary paper in the AI literature, you may suddenly run across a remark: "There is also an improved version of this algorithm, which takes advantage of randomization..."Now for myself I will be instantly suspicious; I shall inspect the math for reasons why the unrandomized algorithm is being somewhere stupid, or why the randomized algorithm has a hidden disadvantage. I will look for something peculiar enough to explain the peculiar circumstance of a randomized algorithm somehow doing

better.I am not completely alone in this view. E. T. Jaynes, I found, was of the same mind: "It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought." Apparently there's now a small cottage industry in derandomizing algorithms. But so far as I know, it is not yet the majority, mainstream view that "we can improve this algorithm by randomizing it" is an extremely suspicious thing to say.

Let us now consider a specific example - a mainstream AI algorithm where there is, apparently, a mathematical proof that the randomized version performs better. By showing how subtle the gotcha can be, I hope to convince you that, even if you run across a case where the randomized algorithm is widely believed to perform better, and you can't find the gotcha yourself, you should nonetheless trust that there's a gotcha to be found.

For our particular example we shall examine the weighted majority algorithm, first introduced by Littlestone and Warmuth, but as a substantially more readable online reference I will use section 2 of Blum's On-Line Algorithms in Machine Learning. (Blum has an easier-to-read, more concrete version of the proofs; and the unrandomized and randomized algorithms are presented side-by-side rather than far part.)

The weighted majority algorithm is an

ensemble method- a way to combine the advice from several other algorithms or hypotheses, called "experts". Experts strive to classify a set of environmental instances into "positive" and "negative" categories; each expert produces a prediction for each instance. For example, each expert might look at a photo of a forest, and try to determine whether or not there is a camouflaged tank in the forest. Expert predictions are binary, either "positive" or "negative", with no probability attached. We do not assume that any of our experts is perfect, and we do not know which of our experts is the best. We also do not assume anything about the samples (they may not be independent and identically distributed).The weighted majority algorithm initially assigns an equal weight of 1 to all experts. On each round, we ask all the experts for their predictions, and sum up the weights for each of the two possible predictions, "positive" or "negative". We output the prediction that has the higher weight. Then, when we see the actual result, we multiply by 1/2 the weight of every expert that got the prediction wrong.

Suppose the total number of experts is

n, and the best expert makes no more thanmmistakes over some given sampling sequence. Then we can prove that the weighted majority algorithm makes a total number of mistakesMthat is bounded by2.41*(m + log2(n)). In other words, the weighted majority algorithm makes no more mistakes than the best expert, plus a factor of logn, times a constant factor of 2.41.Proof (taken from Blum 1996; a similar proof appears in Littlestone and Warmuth 1989): The combined weight of all the experts at the start of the problem is

W = n. If the weighted majority algorithm makes a mistake, at least half the total weight of experts predicted incorrectly, so the total weight is reduced by a factor of at least 1/4. Thus, after the weighted majority algorithm makesMmistakes, its total weightWhas been reduced by at least(3/4)^M. In other words:But if the best expert has made at most

mmistakes, its weight is at least (1/2)^m. And since W includes the weight of all experts,Therefore:

Blum then says that "we can achieve a better bound than that described above", by randomizing the algorithm to predict "positive" or "negative" with

probabilityproportional to the weight assigned each prediction. Thus, if 3/4 of the expert weight went to "positive", we would predict "positive" with probability 75%, and "negative" with probability 25%.An essentially similar proof, summing over the expected probability of a mistake on each round, will show that in this case:

Since the penalty applied to particular experts does not depend on the

global predictionbut only on theactual outcome,most of the proof proceeds as before. We havewhere again

mis the best expert andWis the total weight of all experts.We also have that W is the starting weight

ntimes the product of(1 - 1/2 F_i), whereF_iis the fraction of mistaken expert weight on round i:And since we predict with probability proportional to the expert weight, the expected number of mistakes is just the sum over F_i:

So:

Behold, we have done better by randomizing!

We should be

especiallysuspicious that the randomized algorithm guesses with probability proportional to the expert weight assigned. This seems strongly reminiscent of betting with 70% probability on blue, when the environment is a random mix of 70% blue and 30% red cards. We know the best bet - and yet we onlysometimesmake this best bet, at other times betting on a condition we believe to belessprobable.Yet we thereby

provea smaller upper bound on the expected error. Is there an algebraic error in the second proof? Are we extracting useful work from a noise source? Is our knowledge harming us so much that we can do better through ignorance?Maybe the unrandomized algorithm fails to take into account the Bayesian value of information, and hence fails to exhibit exploratory behavior? Maybe it doesn't test unusual cases to see if they might have surprising results?

But, examining the assumptions, we see that the feedback we receive is fixed, regardless of the prediction's output. Nor does the updating of the expert weights depend on the predictions we output. It doesn't matter whether we substitute a completely random string of 1s and 0s as our actual output. We get back exactly the same data from the environment, and the weighted majority algorithm updates the expert weights in exactly the same way. So that can't be the explanation.

Are the component experts doing worse than random, so that by randomizing our predictions, we can creep back up toward maximum entropy? But we didn't assume anything about how often the component experts were right, or use the fact in the proofs. Therefore the proofs would carry even if we specified that all experts were right at least 60% of the time. It's hard to see how randomizing our predictions could help, in that case - but the proofs still go through, unchanged.

So where's the gotcha?

Maybe I'm about to tell you that I looked, and I couldn't find the gotcha either. What would you believe, in that case? Would you think that the whole thesis on the futility of randomness was probably just wrong - a reasonable-sounding philosophical argument that simply wasn't correct in practice? Would you despair of being able to follow the math, and give up and shrug, unable to decide who might be right?

We don't

alwayssee the flaws right away, and that's something to always remember.In any case, I'm about to start explaining the gotcha. If you want to go back, read the paper, and look yourself, you should stop reading now...

There does exist a rare class of occasions where we want a source of "true" randomness, such as a quantum measurement device. For example, you are playing rock-paper-scissors against an opponent who is

smarter than you are, and who knows exactly how you will be making your choices. In this condition it is wise to choose randomly, because any method your opponent can predict will do worse-than-average. Assuming that the enemy knows your source code has strange consequences: the action sequence 1, 2, 1, 3 is good when derived from a quantum noise generator, but bad when derived fromanydeterministic algorithm, even though it is the same action sequence. Still it is not a totally unrealistic situation. In real life, it is, in fact, a bad idea to play rock-paper-scissors using an algorithm your opponent can predict. So are we, in this situation, deriving optimization from noise?The random rock-paper-scissors player does not play cleverly, racking up lots of points. It does not win more than 1/3 of the time, or lose less than 1/3 of the time (on average). The randomized player does better because its alternatives perform poorly, not from being smart itself. Moreover, by assumption, the opponent is an intelligence whom we

cannotoutsmart and whoalwaysknows everything about any method we use. There is no move we can make that does not have a possible countermove. By assumption, our own intelligence is entirely useless. The best we can do is to avoid all regularity that the enemy can exploit. In other words, we do best byminimizingthe effectiveness of intelligence within the system-as-a-whole, becauseonlythe enemy's intelligence is allowed to be effective. If we can't be clever ourselves, we might as well think of ourselves as the environment and the enemy as the sole intelligence within that environment. By becoming the maximum-entropy environment for rock-paper-scissors, we render all intelligence useless, and do (by assumption) the best we can do.When the environment is adversarial, smarter than you are, and informed about your methods, then in a theoretical sense it may be wise to have a quantum noise source handy. (In a practical sense you're just screwed.) Again, this is not because we're extracting useful work from a noise source; it's because the most powerful intelligence in the system is adversarial, and we're minimizing the power that intelligence can exert in the system-as-a-whole. We don't do better-than-average, we merely minimize the extent to which the adversarial intelligence produces an outcome that is worse-than-average (from our perspective).

Similarly, cryptographers have a legitimate interest in strong randomness generators because cryptographers are trying to minimize the effectiveness of an intelligent adversary. Certainly entropy can act as an antidote to intelligence.

Now back to the weighted majority algorithm. Blum (1996) remarks:

From the start, we did our analysis for an

upper boundon the number of mistakes made. A global upper bound is no better than the worst individual case; thus, to set a global upper bound we must bound the worst individual case. In the worst case, our environment behaves like an adversarial superintelligence.Indeed randomness can improve the worst-case scenario, if the worst-case environment is allowed to exploit "deterministic" moves but not "random" ones. It is like an environment that can decide to produce a red card whenever you bet on blue - unless you make the

samebet using a "random" number generator instead of your creative intelligence.Suppose we use a quantum measurement device to produce a string of random ones and zeroes; make two copies of this string; use one copy for the weighted majority algorithm's random number generator; and give another copy to an intelligent adversary who picks our samples. In other words, we let the weighted majority algorithm make exactly the same randomized predictions, produced by exactly the same generator, but we also let the "worst case" environment know what these randomized predictions will be. Then the improved upper bound of the randomized version is mathematically invalidated.

This shows that the improved upper bound proven for the randomized algorithm did

notcome from the randomized algorithm making systematically better predictions - doing superior cognitive work, being more intelligent - but because we arbitrarily declared that an intelligent adversary could read our mind in one case but not the other.This is not just a minor quibble. It leads to the testable prediction that on real-world problems, where the environment is usually

notan adversarial telepath, the unrandomized weighted-majority algorithm should do better than the randomized version. (Provided that some component experts outperform maximum entropy - are right more than 50% of the time on binary problems.)Analyzing the worst-case scenario is standard practice in computational learning theory, but it makes the math do strange things. Moreover, the assumption of the worst case can be subtle; it is not always explicitly labeled. Consider the upper bound for the unrandomized weighted-majority algorithm. I did not use the term "worst case" - I merely wrote down some inequalities. In fact, Blum (1996), when initially introducing this bound, does not at first use the phrase "worst case". The proof's conclusion says only:

Key word:

Never.The worst-case assumption for the unrandomized algorithm was implicit in calculating the right-hand-side of the inequality by assuming that, on each mistake, the total weight went down by a factor of 1/4, and the total weight never decreased after any successful prediction. This is the absolute worst possible performance the weighted-majority algorithm can give.

The assumption implicit in those innocent-looking equations is that the environment

carefully maximizes the anguish of the predictor:The environment so cleverly chooses its data samples, that on each case where the weighted-majority algorithm is allowed to be successful, it shall receive not a single iota of useful feedback - not a single expert shall be wrong. And the weighted-majority algorithm will be made to err on sensory cases that produce the minimum possible useful feedback, maliciously fine-tuned to the exact current weighting of experts. We assume that the environment can predict every aspect of the predictor and exploit it - unless the predictor uses a "random" number generator which we arbitrarily declare to be unknowable to the adversary.What strange assumptions are buried in that innocent little inequality,

Moreover, the entire argument in favor of the randomized algorithm was theoretically suspect from the beginning, because it rested on non-transitive inequalities. If I prove an upper bound on the errors of algorithm X, and then prove a smaller upper bound on the errors of algorithm Y, it does not

provethatin the real worldY will perform better than X. For example, I prove that X cannot do worse than 1000 errors, and that Y cannot do worse than 100 errors. Is Y a better algorithm? Not necessarily. Tomorrow I could find an improved proof which shows that X cannot do worse than 90 errors. And then in real life, both X and Y could make exactly 8 errors.But that doesn't mean that 4 > 9.

So the next time you see an author remark, "We can further improve this algorithm using randomization..." you may not know exactly where to find the oddity. If you'd run across the above-referenced example (or any number of others in the machine-learning literature), you might not have known how to deconstruct the randomized weighted-majority algorithm. If such a thing should happen to you, then I hope that I have given you grounds to suspect that an oddity exists

somewhere, even if you cannot find it - just as you would suspect a machine that proposed to extract work from a heat bath, even if you couldn't keep track of all the gears and pipes.Nominull put it very compactly when he said that, barring an environment which changes based on the form of your algorithm apart from its output, "By adding randomness to your algorithm, you spread its behaviors out over a particular distribution, and there must be at least one point in that distribution whose expected value is at least as high as the average expected value of the distribution."

As I remarked in Perpetual Motion Beliefs:

If you ever find yourself

mathematically provingthat you can do better by randomizing, I suggest that you suspect your calculations, or suspect your interpretation of your assumptions, before you celebrate your extraction of work from a heat bath.