The Weighted Majority Algorithm

[-]comingstorm17y80

What about Monte Carlo methods? There are many problems for which Monte Carlo integration is the most efficient method available.

(you are of course free to suggest and to suspect anything you like; I will, however, point out that suspicion is no substitute for building something that actually works...)

7bogdanb12y

(Emphasis mine.) I know I’m late to the party, but I just noticed this. While what you say it’s true, “available” in this case means “that we know of”, not “that is possible”. I’m not an expert, but IIRC the point of the MCI is that the functions are hard to analyze. You could integrate analogously without randomization (e.g., sample the integration domain on a regular grid), and it very well might work. The problem is that if the function you integrate happens to behave atypically on the regular grid with respect to its global behavior (e.g., if the function has some regular features with a frequency that accidentally matches that of your grid, you might sample only local maxima, and severely overestimate the integral). But for an individual function there certainly exists (for some values of “certainly”) an ideal sampling that yields an excellent integration, or some even better method that doesn’t need sampling. But, you need to understand the function very well, and search for that perfect evaluation strategy, and do this for every individual function you want to integrate. (Not a math expert here, but I suspect in the general case that grid would be impossible to find, even if it exists, and it’s probably very hard even for those where the solution can be found.) So what MCI does is exactly what Eliezer mentions above: you randomize the sampling as well as you can, to avoid as much as possible tripping an unexpected “resonance” between the grid and the function’s behavior. In effect, you’re working against an environment you can’t overwhelm by being smart, so you brute force it and do your best to reduce the ways the environment can overwhelm you, i.e. you try to minimize the worst-case consequences. Note that for functions we really understand, MCI is not by far the best method. If you want to integrate a polynomial, you just compute the antiderivative, evaluate it in two points, subtract, and you’re done. MCI is nice because in general we don’t know how to

4A1987dM12y

Modern PRNGs tend to be less bad than that.

4bogdanb12y

That’s not at all my domain of expertise, so I’ll take your word for it (though I guessed as much). That said, I meant that as half funny anecdote, half “it’s not just theory, it’s real life, and you can get screwed hard by something that guesses your decision algorithm, even if it’s just a non-sentient function”, not as an argument for/against either MC or quantum noise. (Though, now that I think of it, these days there probably are quantum noise generators available somewhere.)

2A1987dM12y

AFAIK, there are but they're way too slow for use in MC, so decent-but-fast PRNGs are still preferred.

0A1987dM12y

Also, sometimes you need to run several tests with the same random sequence for debugging purposes, which with true random numbers would require you to store all of them, whereas with pseudo-random numbers you just need to use the same seed.

1A1987dM12y

Too bad that the default PRNG in many systems is still an ancient one.

3V_V12y

Yes. However, programming environments oriented towards numerical computing tend to use the Mersenne Twister as their default PRNG.

0[anonymous]12y

Mersenne Twister is the most commonly used algorithm. It is fast, and generates good results. It is not adversarial secure, but in that situation you should be using a cryptographically secure RNG. Age of the PRNG algorithm has no bearing on this discussion. (If rand() uses something else, it's for standards-compatability reasons; nobody writing Monti-Carlo algorithms would be using such PRNGs.)

0A1987dM12y

Actually the C and POSIX standard impose very few requirements on rand(); it would be entirely possible in principle for it to be based on the Mersenne Twister while still complying to the standards.

[-][anonymous]17y00

It is getting late, so this may be way off, and have not the time to read the paper.

This is also assuming finite trials right? Because over infinite trials if you have a non-zero probability of siding with the wrong group of classifiers, you will make infinite mistakes. No matter how small the probabilities go.

It seems it is trading off a better expectation for a worse real worse case.

Also are you thinking of formalising an alternative to the infinite SI worst case you describe?

[-]Psy-Kosh17y70

Um, maybe I misunderstood what you said or the proofs, but isn't there a much simpler way to reject the proof of superiorness of the WMA, namely that in the first proof is, as you say, the worst case scenario.... but the second proof is NOT. It uses expected value rather than worst case, right? So it's not so much "worst case analysis does really weird things" but "not only that, the two proofs aren't even validly comparable, because one is worst case analysis and one involves expected value analysis"

Or did I horribly misunderstand either the second proof or your argument against it? (if so, I blame lack of sleep)

[-]Eliezer Yudkowsky17y20

Psy, the "worst case expected value" is a bound directly comparable to the "worst case value" because the actual trajectory of expert weights is the same between the two algorithms.

Pearson, as an infinite set atheist I see no reason to drag infinities into this case, and I don't recall any of the proofs talked about them either.

[-]Roko17y00

Has anyone proved a theorem on the uselessness of randomness?

Nominull's suggestion:

"By adding randomness to your algorithm, you spread its behaviors out over a particular distribution, and there must be at least one point in that distribution whose expected value is at least as high as the average expected value of the distribution."

Might be a starting point for such a proof. However it does rely on being able to find that "one point" whose expected value is >= the average expected value of the distribution. Perhaps it is easy to prove that good points lie in a certain range, but all obvious specific choices in that range are bad.

[-]g17y120

So, the randomized algorithm isn't really better than the unrandomized one because getting a bad result from the unrandomized one is only going to happen when your environment maliciously hands you a problem whose features match up just wrong with the non-random choices you make, so all you need to do is to make those choices in a way that's tremendously unlikely to match up just wrong with anything the environment hands you because it doesn't have the same sorts of pattern in it that the environment might inflict on you.

Except that the definition of "random", in practice, is something very like "generally lacking the sorts of patterns that the environment might inflict on you". When people implement "randomized" algorithms, they don't generally do it by introducing some quantum noise source into their system (unless there's a real adversary, as in cryptography), they do it with a pseudorandom number generator, which precisely is a deterministic thing designed to produce output that lacks the kinds of patterns we find in the environment.

So it doesn't seem to me that you've offered much argument here against "randomizing" algorithms as general... (read more)

3VAuroch11y

No, it means "the sort of thing which is uncorrelated with the environment". What you want is for your responses to be correlated with the environment in a way that benefits you.

[-]R17y112

I think this series of posts should contain a disclaimer at some point stating the usability of randomized algorithms in practice (even excluding situations like distributed computing and cryptography). In a broad sense, though randomness offers us no advantage information theoretically (this is what you seem to be saying), randomness does offer a lot of advantages computationally. This fact is not at odds with what you are saying, but deserves to be stated more explicitly.

There are many algorithmic tasks which become feasible via randomized algorithms, for which no feasible deterministic alternatives are known - until recently primality testing was one such problem (even now, the deterministic primality test is unusable in practice). Another crucial advantage of randomized algorithms is that in many cases it is much easier to come up with a randomized algorithm and later, if needed, derandomize the algorithm. Randomized algorithms are extremely useful in tackling algorithmic problems and I think any AI that has to design algorithms would have to be able to reason in this paradigm (cf. Randomized algorithms, Motwani and Raghavan or Probability and Computing, Mitzenmacher and Upfal).

[-]Michael_Howard17y00

To put my point another way...

Say your experts, above, predict instance frequencies in populations. Huge populations, so you can only test expert accuracy by checking subsets of those populations.

So, you write an algorithm to choose which individuals to sample from populations when testing expert's beliefs. Aren't there situations where the best algorithm, the one least likely to bias for any particular expert algorithm, would contain some random element?

I fear randomness is occasionally necessary just to fairly check the accuracy of beliefs about the world and overcome bias.

[-]Psy-Kosh17y20

Eliezer: I recognize that the two algorithms have the exact same trajectory for the weights, that the only thing that's different between them is the final output, not the weights. However, maybe I'm being clueless here, but I still don't see that as justifying considering the two different proof results really comparable.

The real worst case analysis for the second algorithm would be more like "the random source just happened to (extremely unlikely, but...) always come up with the wrong answer."

Obviously this sort of worst case analysis isn't rea... (read more)

[-]homunq17y10

Sure, randomness is the lazy way out. But if computational resources are not infinite, laziness is sometimes the smart way out. And I said "infinite" - even transhuman quantum computers are finite.

[-]Daniel_Burfoot17y00

I agree with Psy-Kosh, I don't think the proofs are comparable.

The difference between the two derivations is that #1 uses a weak upper bound for W, while #2 uses an equality. This means that the final bound is much weaker in #1 than in #2.

It's possible to redo the derivation for the nonrandomized version and get the exact same bound as for the randomized case. The trick is to write M as the number of samples for which F_i > 1/2. Under weak assumptions on the performance of the experts you can show that this is less than Sum_i F_i.

Eliezer: thanks for the AI brain-teaser, I hope to see more posts like this in the future.

[-]DonGeddis17y00

I agree with Psy-Kosh too. The key is, as Eliezer originally wrote, never. That word appears in Theorem 1 (about the deterministic algorithm), but it does not appear in Theorem 2 (the bound on the randomized algorithm).

Basically, this is the same insight Eliezer suggests, that the environment is being allowed to be a superintelligent entity with complete knowledge in the proof for the deterministic bound, but the environment is not allowed the same powers in the proof for the randomized one.

In other words, Eliezer's conclusion is correct, but I don't thi... (read more)

[-]Stephen317y70

The question whether randomized algorithms can have an advantage over deterministic algorithms is called the P vs BPP problem. As far as I know, most people believe that P=BPP, that is, randomization doesn't actually add power. However, nobody knows how to prove this. Furthermore, there exists an oracle relative to which P is not equal to BPP. Therefore any proof that P=BPP has to be nonrelativizing. This rules out a lot of the more obvious methods of proof.

see: complexity zoo

[-]J._Andrew_Rogers17y00

Nicely done. Quite a number of points raised that escape many people and which are rarely raised in practice.

[-]Silas17y20

Someone please tell me if I understand this post correctly. Here is my attempt to summarize it:

"The two textbook results are results specifically about the worst case. But you only encounter the worst case when the environment can extract the maximum amount of knowledge it can about your 'experts', and exploits this knowledge to worsen your results. For this case (and nearby similar ones) only, randomizing your algorithm helps, but only because it destroys the ability of this 'adversary' to learn about your experts. If you instead average over all cases, the non-random algorithm works better."

Is that the argument?

[-]Venu17y30

I am interested in what Scott Aaronson says to this.

I am unconvinced, and I agree with both the commenters g and R above. I would say Eliezer is underestimating the number of problems where the environment gives you correlated data and where the correlation is essentially a distraction. Hash functions are, e.g., widely used in programming tasks and not just by cryptographers. Randomized algorithms often are based on non-trivial insights into the problem at hand. For example, the insight in hashing and related approaches is that "two (different) object... (read more)

[-]Manuel_Moertelmaier17y20

@ comingstorm: Quasi Monte Carlo often outperforms Monte Carlo integration for problems of small dimensionality involving "smooth" integrands. This is, however, not yet rigorously understood. (The proven bounds on performance for medium dimensionality seem to be extremely loose.)

Besides, MC doesn't require randomness in the "Kolmogorov complexity == length" sense, but in the "passes statistical randomness tests" sense. Eliezer has, as far as I can see, not talked about the various definitions of randomness.

[-]kyb217y10

Fascinating post. I especially like the observation that the purpose of randomness in all these examples is to defeat intelligence. It seems that this is indeed the purpose for nearly every use of randomness that I can think of. We randomise presidents routes to defeat the intelligence of an assasin, a casino randomises its roulette wheels to defeat the intelligence of the gambler. Even in sampling, we randomise our sample to defeat our own intelligence, which we need to treat as hostile for the purpose of the experiment.

[-]David417y20

What about in compressed sensing where we want an incoherent basis? I was under the impression that random linear projections was the best you could do here.

[-]Douglas_Knight317y00

kyb, see the discussion of quicksort on the other thread. Randomness is used to protect against worst-case behavior, but it's not because we're afraid of intelligent adversaries. It's because worst-case behavior for quicksort happens a lot. If we had a good description of naturally occurring lists, we could design a deterministic pivot algorithm, but we don't. We only have the observation simple guess-the-median algorithms perform badly on real data. It's not terribly surprising that human-built lists resonate with human-designed pivot algorithms; but the opposite scenario, where the simplex method works well in practice is not surprising either.

[+]G200817y-50

[-]Momo17y00

I agree with Psy, the bounds are not comparable.

In fact, the bound for #2 should be the same as the one for #1. I mean, in the really worst case, the randomized algorithm makes exactly the same predictions as the deterministic one. I advise to blame and demolish your quantum source of random numbers in this case.

[-]NancyLebovitz17y-10

Tentative, but what about cases where you don't trust your own judgement and suspect that you need to be shaken out of your habits?

[-]michael_e_sullivan17y20

What about Monte Carlo methods? There are many problems for which Monte Carlo integration is the most efficient method available.

Monte Carlo methods can't buy you any correctness. They are useful because they allow you to sacrifice an unnecessary bit of correctness in order to give you a result in a much shorter time on otherwise intractable problem. They are also useful to simulate the effects of real world randomness (or at least behavior you have no idea how to systematically predict).

So, for example, I used a Monte Carlo script to determine expected ... (read more)

[-]nova17y00

Congratulations Eliezer, this subject is very interesting.

Nontheless useful, i.e. Monter Carlo methods with pseudo-random numbers ARE actually deterministic, but we discard the details (very complex and not useful for the application) and only care about some statistical properties. This is a state of subjective partial information, the business of bayesian probability, only this partial information is deliberate! (or better, a pragmatic compromise due to limitations in computational power. Otherwise we'd just use classical numerical integration happily in... (read more)

[-]G200817y10

As stated the statement "Randomness didn't buy me any accuracy, it was a way of trading accuracy for development time." is a complete non-sequitur since nobody actually believes that randomness is a means of increased accuracy in the long term it is used as a performance booster. It merely means you can get the answer faster with less knowledge which we do all the time since there are a plethora of systems that we can't understand.

[+]Caledonian217y-60

[-]Joshua_Fox17y10

Eliezer, can you mathematically characterize those environments where randomization can help. Presumably these are the "much more intelligent than you, and out to get you" environments, but I'd like to see that characterized a bit better.

[-]Richard_Kennaway17y90

A slight tangent here, but Blum loses me even before the considerations that Eliezer has so clearly pointed out. Comparing an upper bound for the worst case of one algorithm with an upper bound for the average case of the other is meaningless. Even were like things being compared, mere upper bounds of the true values are only weak evidence of the quality of the algorithms.

FWIW, I ran a few simulations (for the case of complete independence of all experts from themselves and each other), and unsurprisingly, the randomised algorithm performed on average wors... (read more)

[-]g17y00

Anyone who evaluates the performance of an algorithm by testing it with random data (e.g., simulating these expert-combining algorithms with randomly-erring "experts") is ipso facto executing a randomized algorithm...

[-]Richard_Kennaway17y10

g: Indeed I was. Easier than performing an analytical calculation and sufficiently rigorous for the present purpose. (The optimal weights, on the other hand, are log(p/(1-p)), by an argument that is both easier and more rigorous than a simulation.)

[-]g17y00

Richard, I wasn't suggesting that there's anything wrong with your running a simulation, I just thought it was amusing in this particular context.

[-]Ben_Jones17y00

Silas - yeah, that's about the size of it.

Eliezer, when you come to edit this for popular publication, lose the maths, or at least put it in an endnote. You're good enough at explaining concepts that if someone's going to get it, they're going to get it without the equations. However, a number of those people will switch off there and then. I skipped it and did fine, but algebra is a stop sign for a number of very intelligent people I know.

[-]Scott_Aaronson217y12-2

I am interested in what Scott Aaronson says to this.

I fear Eliezer will get annoyed with me again :), but R and Stephen basically nailed it. Randomness provably never helps in average-case complexity (i.e., where you fix the probability distribution over inputs) -- since given any ensemble of strategies, by convexity there must be at least one deterministic strategy in the ensemble that does at least as well as the average.

On the other hand, if you care about the worst-case running time, then there are settings (such as query complexity) where randomness provably does help. For example, suppose you're given n bits, you're promised that either n/3 or 2n/3 of the bits are 1's, and your task is to decide which. Any deterministic strategy to solve this problem clearly requires looking at 2n/3 + 1 of the bits. On the other hand, a randomized sampling strategy only has to look at O(1) bits to succeed with high probability.

Whether randomness ever helps in worst-case polynomial-time computation is the P versus BPP question, which is in the same league as P versus NP. It's conjectured that P=BPP (i.e., randomness never saves more than a polynomial). This is known to be true if really ... (read more)

[+]Read_Something17y-70

[-]DonGeddis17y20

@ Scott Aaronson. Re: your n-bits problem. You're moving the goalposts. Your deterministic algorithm determines with 100% accuracy which situation is true. Your randomized algorithm only determines with "high probability" which situation is true. These are not the same outputs.

You need to establish goal with a fixed level of probability for the answer, and then compare a randomized algorithm to a deterministic algorithm that only answers to that same level of confidence.

That's the same mistake that everyone always makes, when they say that "randomness provably does help." It's a cheaper way to solve a different goal. Hence, not comparable.

[-]Eliezer Yudkowsky17y50

Scott, I don't dispute what you say. I just suggest that the confusing term "in the worst case" be replaced by the more accurate phrase "supposing that the environment is an adversarial superintelligence who can perfectly read all of your mind except bits designated 'random'".

[-]Scott_Aaronson217y120

Don: When you fix the goalposts, make sure someone can't kick the ball straight in! :-) Suppose you're given an n-bit string, and you're promised that exactly n/4 of the bits are 1, and they're either all in the left half of the string or all in the right half. The problem is to decide which. It's clear that any deterministic algorithm needs to examine at least n/4 + 1 of the bits to solve this problem. On the other hand, a randomized sampling algorithm can solve the problem with certainty after looking at only O(1) bits on average.

Eliezer: I often tell people that theoretical computer science is basically mathematicized paranoia, and that this is the reason why Israelis so dominate the field. You're absolutely right: we do typically assume the environment is an adversarial superintelligence. But that's not because we literally think it is one, it's because we don't presume to know which distribution over inputs the environment is going to throw at us. (That is, we lack the self-confidence to impose any particular prior on the inputs.) We do often assume that, if we generate random bits ourselves, then the environment isn't going to magically take those bits into account wh... (read more)

0bogdanb12y

EDIT: OK, I found the explanation further down, seems I was on the right path. Sorry Scott, I know I’m late to the party, but this got me curious. Can you explain this, or point me to an explanation? I’m not sure I got the correct numbers. As far as I can tell, a reasonable deterministic algorithm is to examine, one by one, at most the (n/4 + 1) bits (e.g., the first); but if you find a 1 early, you stop. The nondeterministic one would be to just test bits in random order until you see a 1. For the deterministic version, in half the cases you’re going to evaluate all those bits when they’re in the “end” half. If the bits are in the “front” half, or if the algorithm is random, I can’t figure out exactly how to solve the average number of bits to test, but it seems like the deterministic half should do on average about as well* as the nondeterministic one on a problem half the size, and in the worst case scenario it still does (n/4 + 1) bits, while the nondeterministic algorithm has to do (n/2 + 1) tests in the worst-case scenario unless I’m mistaken. I mean, it does seem like the probability of hitting a bad case is low on average, but the worst case seems twice as bad, even though harder to hit. I’m not sure how you got O(1); I got to a sum of reciprocals of squares up to k for the probability of the first 1 bit being the k-th tested, which sounds about O(1) on average, but my probabilities are rusty, is that the way? (* of course, it’s easy to construct the bad case for the deterministic algorithm if you know which bits it tests first.)

[-]Booklegger217y-40

the environment is an adversarial superintelligence

Oh, Y'all are trying to outwit Murphy? No wonder you're screwed.

[-]Silas17y10

Could Scott_Aaronson or anyone who knows what he's talking about, please tell me the name of the n/4 left/right bits problem he's referring to, or otherwise give me a reference for it? His explanation doesn't seem to make sense: the deterministic algorithm needs to examine 1+n/4 bits only in the worst case, so you can't compare that to the average output of the random algorithm. (Average case for the determistic would, it seems, be n/8 + 1) Furthermore, I don't understand how the random method could average out to a size-independent constant.

Is the randomized algorithm one that uses a quantum computer or something?

[-]Scott_Aaronson217y40

Silas: "Solve" = for a worst-case string (in both the deterministic and randomized cases). In the randomized case, just keep picking random bits and querying them. After O(1) queries, with high probability you'll have queried either a 1 in the left half or a 1 in the right half, at which point you're done.

As far as I know this problem doesn't have a name. But it's how (for example) you construct an oracle separating P from ZPP.

[-]Silas17y40

@Scott_Aaronson: Previously, you had said the problem is solved with certainty after O(1) queries (which you had to, to satisfy the objection). Now, you're saying that after O(1) queries, it's merely a "high probability". Did you not change which claim you were defending?

Second, how can the required number of queries not depend on the problem size?

Finally, isn't your example a special case of exactly the situation Eliezer_Yudkowsky describes in this post? In it, he pointed out that the "worst case" corresponds to an adversary who kno... (read more)

[-]Scott_Aaronson217y90

Silas: Look, as soon you find a 1 bit, you've solved the problem with certainty. You have no remaining doubt about the answer. And the expected number of queries until you find a 1 bit is O(1) (or 2 to be exact). Why? Because with each query, your probability of finding a 1 bit is 1/2. Therefore, the probability that you need t or more queries is (1/2)^(t-1). So you get a geometric series that sums to 2.

(Note: I was careful to say you succeed with certainty after an expected number of queries independent of n -- not that there's a constant c such tha... (read more)

1bogdanb12y

Quick check: take algorithm that tests bits 1, n/2+1, 3, n/2+3, 5, n/2+5 ... then continues alternating halves on even indices. Of course, it stops at the first 1. And it’s obviously deterministic. But unless I’m mistaken, averaged over all possible instances of the problem, this algorithm has exactly the same average, best case and worst case as the nondeterministic one. (In fact, I think this holds for any fixed sequence of test indices, as long as it alternates between halves, such as 1, n, 2, n-1, 3, n-2... Basically, my reasoning is that an arbitrary index sequence should have (on average) the same probability of finding the answer in k tests on a random problem that a random index sequence does.) I know you said it’s adversarial, and it’s clear that (assuming you know the sequence of tested indices) you can construct problem instances that take n/2+1 (worst-case) tests, in which case the problem isn’t random. I just want to test if I reasoned correctly. If I’m right, then I think what you’re saying is exactly what Eliezer said: If you know the adversary can predict you, you’ve got no better option than to act randomly. (You just don’t do as bad in this case as in Eliezer’s example.) Conversely, if you can predict the adversary, then the random algorithm completely wastes this knowledge. (I.e., if you could guess more about the sequence, you could probably do better than n/4+1 worst case, and if you can predict the adversary then you don’t even need to test the bits, which is basically O(0).)

[-][anonymous]17y20

Isn't the probability of finding a bit 1/4 in each sample? Because you have to sample both sides. If you randomly sample without repetitions you can do even better than that. You can even bound the worse case for the random algorithm at n/4+1 as well by seeing if you ever pick n/4+1 0s from any side. So there is no downside to doing it randomly as such.

However if it is an important problem and you think you might be able to find some regularities, the best bet would to be to do bayesian updates on the most likely positions to be ones and preferentially ch... (read more)

[-]Scott_Aaronson217y20

Will: Yeah, it's 1/4, thanks. I somehow have a blind spot when it comes to constants. ;-)

[-]Peter_de_Blanc17y50

Let's say the solver has two input tapes: a "problem instance" input and a "noise source" input.

When Eli says "worst-case time", he means the maximum time, over all problem instances and all noise sources, taken to solve the problem.

But when Scott says "worst-case time," he means taking the maximum (over all problem instances X) of the average time (over all noise sources) to solve X.

Scott's criterion seems relevant to real-world situations where one's input is being generated by an adversarial human who knows which ... (read more)

0[anonymous]12y

That last paragraph is pretty unsettling--you get penalized for not throwing a messy obfuscatory step into yourself? A Kolmogorov distribution indeed does seem to be a "present adversary"!

0PhilGoetz10y

It sounds to me like Peter is suggesting to randomly permute all inputs to the sorting algorithm, in hopes of improving the average case runtime. The justification appears to be that the worst case has low complexity because it's easy to describe, because you can describe it by saying "it's the worst case for ". For example, the worst-case input for Quicksort is an already-sorted list. You could, indeed, get better average performance out of quicksort if you could randomly permute its inputs in zero time, because code is more likely to produce already-sorted lists than lists sorted, say, by { x mod 17 }. My initial reaction was that the implicit description "that input which is worst case for this algorithm" might specify an input which was not unusually likely to occur in practice. Peter is saying that it is more likely to occur in practice, even if it's an implicit description--even if we don't know what the worst-case input is--because it has lower complexity in an absolute sense, and so more processes in the real world will generate that input than one with a higher complexity. Is that right? And no fast deterministic permutation can be used instead, because the deterministic permutation could be specified concisely. In practice, our random number generators use a seed of 16 to 32 bits, so they introduce almost no complexity at all by this criterion. No RNG can be useful for this purpose, for the same reason no deterministic permutation can be useful. Still, I think this is a conclusive proof that randomness can be useful.

[-]comingstorm17y20

@michael e sullivan: re "Monte Carlo methods can't buy you any correctness" -- actually, they can. If you have an exact closed-form solution (or a rapidly-converging series, or whatever) for your numbers, you really want to use it. However not all problems have such a thing; generally, you either simplify (giving a precise, incorrect number that is readily computable and hopefully close), or you can do a numerical evaluation, which might approach the correct solution arbitrarily closely based on how much computation power you devote to it.

Quad... (read more)

[-]DonGeddis17y00

Silas is right; Scott keeps changing the definition in the middle, which was exactly my original complaint.

For example, Scott says: In the randomized case, just keep picking random bits and querying them. After O(1) queries, with high probability you'll have queried either a 1 in the left half or a 1 in the right half, at which point you're done.

And yet this is no different from a deterministic algorithm. It can also query O(1) bits, and "with high probability" have a certain answer.

I'm really astonished that Scott can't see the slight-of-hand i... (read more)

[-][anonymous]17y10

Don how well a deterministic algorithm does depends upon what the deterministic algorithm is and what the input is, it cannot always query O(1) queries.

E.g. in a 20 bit case

If the deterministic algorithm starts its queries from the beginning, querying each bit in turn and this is pattern is always 00000111110000000000

It will always take n/4+1 queries. Only when the input is random will it on average take O(1) queries.

The random one will take O(1) on average on every type of input.

[-]John_Faben17y10

Don - the "sleight of hand", as you put it is that (as I think has been said before) we're examining worst-case. We explicitly are allowing the string of 1s and 0s to be picked by an adversarial superintelligence who knows our strategy. Scott's algorithm only needs to sample 4 bits on average to answer the question even when the universe it out to get him - the deterministic version will need to sample exactly n/4+1.

Basically, Eliezer seems to be claiming that BPP = P (or possibly something stronger), which most people think is probably true, but... (read more)

[-]Wei_Dai217y110

To generalize Peter's example, a typical deterministic algorithm has low Kolmogorov complexity, and therefore its worst-case input also has low Kolmogorov complexity and therefore a non-negligible probability under complexity-based priors. The only possible solutions to this problem I can see are:

1. add randomization
2. redesign the deterministic algorithm so that it has no worst-case input
3. do a cost-benefit analysis to show that the cost of doing either 1 or 2 is not justified by the expected utility of avoiding the worst-case performance of the original algorithm, then continue to use the original deterministic algorithm

The main argument in favor of 1 is that its cost is typically very low, so why bother with 2 or 3? I think Eliezer's counterargument is that 1 only works if we assume that in addition to the input string, the algorithm has access to a truly random string with a uniform distribution, but in reality we only have access to one input, i.e., sensory input from the environment, and the so called random bits are just bits from the environment that seem to be random.

My counter-counterargument is to consider randomization as a form of division of labor. We use one very... (read more)

1Eliezer Yudkowsky12y

(I agree that this is one among many valid cases of when we might want to just throw in randomization to save thinking time, as opposed to doing a detailed derandomized analysis.)

[-]Eliezer Yudkowsky17y30

Quantum branching is "truly random" in the sense that branching the universe both ways is an irreducible source of new indexical ignorance. But the more important point is that unless the environment really is out to get you, you might be wiser to exploit the ways that you think the environment might depart from true randomness, rather than using a noise source to throw away this knowledge.

[-]Scott_Aaronson17y90

I'm not making any nonstandard claims about computational complexity, only siding with the minority "derandomize things" crowd

Note that I also enthusiastically belong to a "derandomize things" crowd! The difference is, I think derandomizing is hard work (sometimes possible and sometimes not), since I'm unwilling to treat the randomness of the problems the world throws at me on the same footing as randomness I generate myself in the course of solving those problems. (For those watching at home tonight, I hope the differences are now reasonably clear...)

[-]Eliezer Yudkowsky17y60

I certainly don't say "it's not hard work", and the environmental probability distribution should not look like the probability distribution you have over your random numbers - it should contain correlations and structure. But once you know what your probability distribution is, then you should do your work relative to that, rather than assuming "worst case". Optimizing for the worst case in environments that aren't actually adversarial, makes even less sense than assuming the environment is as random and unstructured as thermal noise... (read more)

[-]Scott_Aaronson17y90

once you know what your probability distribution is...

I'd merely stress that that's an enormous "once." When you're writing a program (which, yes, I used to do), normally you have only the foggiest idea of what a typical input is going to be, yet you want the program to work anyway. This is not just a hypothetical worry, or something limited to cryptography: people have actually run into strange problems using pseudorandom generators for Monte Carlo simulations and hashing (see here for example, or Knuth vol 2).

Even so, intuition suggests it sh... (read more)

[-]Wei_Dai217y20

Even if P=BPP, that just means that giving up randomness causes "only" a polynomial slowdown instead of an exponential one, and in practice we'll still need to use pseudorandom generators to simulate randomness.

It seems clear to me that noise (in the sense of randomized algorithms) does have power, but perhaps we need to develop better intuitions as to why that is the case.

[-]Psy-Kosh17y10

Eliezer: There are a few classes of situations I can think of in which it seems like the correct solution (given certain restrictions) requires a source of randomness:

One example would be Hofstadter's Luring Lottery. Assuming all the agents really did have common knowledge of each other's rationality, and assuming no form of communication is allowed between them in any way, isn't it true that no deterministic algorithm to decide how many entries to send in, if any at all, has a higher expected return than a randomized one? (that is, the solutions which are... (read more)

0gwern16y

My shot at improving on randomness for things like the Luring Lottery: http://lesswrong.com/lw/vp/worse_than_random/18yi

[-]Aron17y00

The leaders in the NetflixPrize competition for the last couple years have utilized ensembles of large numbers of models with a fairly straightforward integration procedure. You can only get so far with a given model, but if you randomly scramble its hyperparameters or training procedure, and then average multiple runs together, you will improve your performance. The logical path forward is to derandomize this procedure, and figure out how to predict, a priori, which model probabilities become more accurate and which don't. But of course until you figure o... (read more)

[-]DonGeddis17y20

@ Will: You happen to have named a particular deterministic algorithm; that doesn't say much about every deterministic algorithm. Moreover, none of you folks seem to notice much that pseudo-random algorithms are actually deterministic too...

Finally, you say: "Only when the input is random will [the deterministic algorithm] on average take O(1) queries. The random one will take O(1) on average on every type of input."

I can't tell you how frustrating it is to see you just continue to toss in the "on average" phrase, as though it doesn't... (read more)

0davidgs14y

DonGeddis is missing the point. Randomness does buy power. The simplest example is sampling in statistics: to estimate the fraction of read-headed people in the world to within 5% precision and with confidence 99%, it is enough to draw a certain number of random samples and compute the fraction of read-headed people in the sample. The number of samples required is independent of the population size. No deterministic algorithm can do this, not even if you try to simulate samples by using good PRNGs, because there are ``orderings'' of the world's population that would fool your algorithm. But if you use random bits that are independent of the data, you cannot be fooled. (By the way, whether 'true randomness' is really possible in the real world is a totally different issue.) Note that, for decision problems, randomness is not believed to give you a more-than-polynomial speedup (this is the P vs BPP question). But neither sampling nor the promise problem Scott mentioned are decision problems (defined on all inputs). Regarding your claim that you cannot compare 'worst-case time' for deterministic algorithms with 'worst-case expected time' for randomized ones: this is totally pointless and shows a total lack of understanding of randomization. No one claims that you can have a randomized algorithm with success probability one that outperforms the deterministic algorithm. So either we use the expected running time or the randomized algorithm, or we allow a small error probability. We can choose either one and use the same definition of the problem for both deterministic and randomized. Take Scott example and suppose we want algorithms that succeed with probability 2/3 at least, on every input. The goal is the same in both cases, so they are comparable. In the case of deterministic algorithms, this implies always getting the correct answer. DonGeddis seems not to realize this, and contrary to what he says, any deterministic algorithm, no matter how clever, needs to look

3bogdanb12y

Wait a minute. There are also orderings of the world population that will randomly fool the random algorithm. (That’s where the 99% comes from.) Or, more precisely, there are samplings generated by your random algorithm that get fooled by the actual ordering of the world population. If you write a deterministic algorithm that samples the exact number of samples as the random one does, but instead of sampling randomly you sample deterministically, spreading the samples in a way you believe to be appropriately uniform, then, assuming the human population isn’t conspiring to trick you, you should get exactly the same 99% confidence and 5% precision. (Of course, if you screw up spreading the samples, it means you did worse than random, so there’s something wrong with your model of the world. And if the entire world conspires to mess up your estimate of red hair frequency, and succeeds to do so without you noticing and altering you sampling, you’ve got bigger problems.) For example, once n is found (the number of tests needed), you list all humans in order of some suitable tuple of properties (e.g., full name name, date and time of birth, population of home city/town/etc, or maybe), and then pick n people equally distanced in this list. You’re still going to find the answer, with the same precision and confidence, unless the world population conspired to distribute itself over the world exactly the right way to mess up your selection. There are all sorts of other selection procedures that are deterministic (they depend only on the population), but are ridiculously unlikely to have any correlation with hair color. Other example: order all cities/towns/etc by GDP; group them in groups of ~N/n; pick from each group the youngest person; I don’t think n is high enough, but if it happens that a group contains a single city with more than N/n population pick the two, three, etc youngest persons. If you’re concerned that people all over the world recently started planning th

[-]DonGeddis17y10

@ John, @ Scott: You're still doing something odd here. As has been mentioned earlier in the comments, you've imagined a mind-reading superintelligence ... except that it doesn't get to see the internal random string.

Look, this should be pretty simple. The phrase "worst case" has a pretty clear layman's meaning, and there's no reason we need to depart from it.

You're going to get your string of N bits. You need to write an algorithm to find the 1s. If your algorithm ever gives the wrong answer, we're going to shoot you in the head with a gun a... (read more)

[-]DonGeddis17y30

To look at it one more time ... Scott originally said Suppose you're given an n-bit string, and you're promised that exactly n/4 of the bits are 1, and they're either all in the left half of the string or all in the right half.

So we have a whole set of deterministic algorithms for solving the problem over here, and a whole set of randomized algorithms for solving the same problem. Take the best deterministic algorithm, and the best randomized algorithm.

Some people want to claim that the best randomized algorithm is "provably better". Really? B... (read more)

[-][anonymous]17y40

Don, you missed my comment saying you could bound the randomized algorithm in the same way to n/4+1 by keeping track of where you pick and if you find n/4+1 0s in one half you conclude it is in the other half.

I wouldn't say that a randomized algorithm is better per se, just more generally useful than the a particular deterministic one. You don't have to worry about the types of inputs given to it.

This case matters because I am a software creator and I want to reuse my software components. In most cases I don't care about performance of every little sub com... (read more)

[-]John_Faben17y10

Don, if n is big enough then sure, I'm more than willing to play your game (assuming I get something out of it if I win).

My randomised algorithm will get the right answer within 4 tries in expectation. If I'm allowed to sample a thousand bits then the probability that it won't have found the answer by then is .75^1000 which I think is something like 10^-120. And this applies even if you're allowed to choose the string after looking at my algorithm.

If you play the game this way with your deterministic algorithm you will always need to sample n/4 + 1 bits ... (read more)

[-]DonGeddis17y00

@ Will: Yes, you're right. You can make a randomized algorithm that has the same worst-case performance as the deterministic one. (It may have slightly impaired average case performance compared to the algorithm you folks had been discussing previously, but that's a tradeoff one can make.) My only point is that concluding that the randomized one is necessarily better, is far too strong a conclusion (given the evidence that has been presented in this thread so far).

But sure, you are correct that adding a random search is a cheap way to have good confiden... (read more)

[-]DonGeddis17y10

@ John: can you really not see the difference between "this is guaranteed to succeed", vs. "this has only a tiny likelihood of failure"? Those aren't the same statements.

"If you play the game this way" -- but why would anyone want to play a game the way you're describing? Why is that an interesting game to play, an interesting way to compare algorithms? It's not about worst case in the real world, it's not about average case in the real world. It's about performance on a scenario that never occurs. Why judge algorithms on... (read more)

[-]Gen_Zhang17y00

@Scott: would it be fair to say that the P=BPP question boils down to whether one can make a "secure" random source such that the adversarial environment can't predict it? Is that why the problem is equivalent to having good PRNG? (I'm a physicist, but who thinks he knows some of the terminology of complexity theory --- apologies if I've misunderstood the terms.)

[-]lifebiomedguru17y00

I can't help but think that the reason why randomization appear to add information in the context of machine learning classifiers for the dichotomous case is that the actual evaluation of the algorithms is empirical, and, as such, use finite data. Two classes of non-demonic error always exist in the finite case. These are measurement error, and sampling error.

Multiattribute classifiers use individual measurements from many attributes. No empirical measurement is made without error; the more attributes that are measured, the more measurement error intrud... (read more)

-1davidgs14y

0VAuroch11y

So don't use those orderings. That randomness isn't buying you power. It's acting as though you know nothing about the distribution of red-headed people. If you do, in fact, know something about the distribution of red-headed people (i.e. there are more red-headed people than average in Ireland and fewer in Africa, therefore make sure each sample draws proportionally from Ireland, Africa, and non-Ireland-non-Africa), then you can exploit that to make your prediction more reliable.

[-]A1987dM12y32

When the environment is adversarial, smarter than you are, and informed about your methods, then in a theoretical sense it may be wise to have a quantum noise source handy.

It can be handy even if it's not adversarial, and even if it is equally as smart as you are but not smarter. Suppose you're playing a game with payoffs (A,A) = (0,0), (A,B) = (1,1), (B,A) = (1,1), (B,B) = (0,0) against a clone of yourself (and you can't talk to each other); the mixed strategy where you pick A half of the time and B half of the time wins half of the time, but either p... (read more)

[-]Petter11y20

Sometimes the environment really is adversarial, though.

Regular expressions as implemented in many programming languages work fine for almost all inputs, but with terrible upper bounds. This is why libraries like re2 are needed if you want to let anyone on the Internet use regular expressions in your search engine.

Another example that comes to mind is that quicksort runs in n·log n expected time if randomly sorted beforehand.

3lackofcheese11y

The Internet can be pretty scary, but it is still very much short of an adversarial superintelligence. Sure, sometimes people will want to bring your website down, and sometimes it could be an organized effort, but even that is nothing like a genuine worst case.

1lackofcheese11y

OK, granted, in the case of general regex we're talking about upper bounds that are exponential in the worst case, and it's not very difficult to come up with really bad regex cases, so in that case a single query could cause you major issues. That doesn't necessarily mean you should switch to a library that avoids those worst cases, though. You could simply reject the kinds of regexes that could potentially cause those issues, or run your regex matcher with a timeout and have the search fail whenever the timer expires.

0lackofcheese11y

I wouldn't mind knowing the reason I was downvoted here.

1Shmi11y

Seems like noise to me. Only one person cared enough about what you wrote to vote, and they disliked your post. Upvoted back to zero, since the comments seem pretty reasonable to me.

0lackofcheese11y

Thanks; it didn't occur to me to think of it statistically, although it seems obvious now. That said, individual downvotes / upvotes still have reasons behind them; it's just that you can't generally expect to find out what they are except when they come in significant clusters.

0ChristianKl11y

If you have an search engine you don't want to allow people to search your whole corpus anyway. You usually want to search an index.

[-]Dacyn9y30

So let me see if I've got this straight.

Computer Scientists: For some problems, there are random algorithms with the property that they succeed with high probability for any possible input. No deterministic algorithm for these problems has this property. Therefore, random algorithms are superior.

Eliezer: But if we knew the probability distribution over possible inputs, we could create a deterministic algorithm with the property.

Computer Scientists: But we do not know the probability distribution over possible inputs.

Eliezer: Never say "I don't know&qu... (read more)

LESSWRONG
LW

LESSWRONG
LW

23

The Weighted Majority Algorithm

23

23