Imagine that I'm playing chess against a smarter opponent. If I could predict *exactly* where my opponent would move on each turn, I would automatically be at least as good a chess player as my opponent. I could just ask myself where my opponent would move, if they were in my shoes; and then make the same move myself. (In fact, to predict my opponent's *exact* moves, I would need to be superhuman - I would need to predict my opponent's exact mental processes, including their limitations and their errors. It would become a problem of psychology, rather than chess.)

So predicting an exact move is not possible, but neither is it true that I have *no* information about my opponent's moves.

Personally, I am a very weak chess player - I play an average of maybe two games per year. But even if I'm playing against former world champion Garry Kasparov, there are certain things I can predict about his next move. When the game starts, I can *guess* that the move P-K4 is more likely than P-KN4. I can *guess* that if Kasparov has a move which would allow me to checkmate him on my next move, that Kasparov will not make that move.

Much less reliably, I can guess that Kasparov will not make a move that exposes his queen to my capture - but here, I could be greatly surprised; there could be a rationale for a queen sacrifice which I have not seen.

And finally, of course, I can guess that Kasparov will win the game...

Supposing that Kasparov is playing black, I can guess that the final position of the chess board will occupy the *class*
of positions that are wins for black. I cannot predict specific
features of the board in detail; but I can narrow things down relative
to the class of *all possible* ending positions.

If I play chess against a superior opponent, and I don't know for
certain where my opponent will move, I can still endeavor to produce a
probability distribution that is *well-calibrated* - in the sense
that, over the course of many games, legal moves that I label with a
probability of "ten percent" are made by the opponent around 1 time in
10.

You might ask: Is producing a *well-calibrated* distribution over Kasparov, beyond my abilities as an inferior chess player?

But there is a trivial way to produce a well-calibrated probability distribution - just use the maximum-entropy distribution representing a state of total ignorance. If my opponent has 37 legal moves, I can assign a probability of 1/37 to each move. This makes me perfectly calibrated: I assigned 37 different moves a probability of 1 in 37, and exactly one of those moves will happen; so I applied the label "1 in 37" to 37 different events, and exactly 1 of those events occurred.

Total ignorance is not very useful, even if you confess it honestly. So the question then becomes whether I can do better than maximum entropy.
Let's say
that you and I both answer a quiz with ten yes-or-no questions. You assign
probabilities of 90% to your answers, and get one answer wrong. I
assign probabilities of 80% to my answers, and get two answers wrong.
We are both perfectly *calibrated* but you exhibited better *discrimination* - your answers more strongly distinguished truth from falsehood.

Suppose that someone shows me an arbitrary chess position, and asks me: "What move would Kasparov make if he played black, starting from this position?" Since I'm not nearly as good a chess player as Kasparov, I can only weakly guess Kasparov's move, and I'll assign a non-extreme probability distribution to Kasparov's possible moves. In principle I can do this for any legal chess position, though my guesses might approach maximum entropy - still, I would at least assign a *lower* probability to what I *guessed* were obviously wasteful or suicidal moves.

If you put me in a box and feed me chess positions and get probability distributions back out, then we would have - theoretically speaking - a system that produces Yudkowsky's guess for Kasparov's move in any chess position. We shall suppose (though it may be unlikely) that my prediction is well-calibrated, if not overwhelmingly discriminating.

Now suppose we turn "Yudkowsky's prediction of Kasparov's move" into an *actual chess opponent*, by having a computer *randomly* make moves at the exact probabilities I assigned. We'll call this system RYK, which stands for "Randomized Yudkowsky-Kasparov", though it should really be "Random Selection from Yudkowsky's Probability Distribution over Kasparov's Move."

Will RYK be as good a player as Kasparov? Of course not. Sometimes the RYK system will *randomly* make dreadful moves which the real-life Kasparov would never make - start the game with P-KN4. I assign such moves a low probability, but sometimes the computer makes them anyway, by sheer random chance. The real Kasparov also sometimes makes moves that I assigned a low probability, but only when the move has a better rationale than I realized - the astonishing, unanticipated queen sacrifice.

Randomized Yudkowsky-Kasparov is definitely no smarter than Yudkowsky, because RYK draws on no more chess skill than I myself possess - I build all the probability distributions myself, using only my own abilities. Actually, RYK is a far worse player than Yudkowsky. I myself would make the best move I saw with my knowledge. RYK only occasionally makes the best move I saw - I won't be very confident that Kasparov would make exactly the same move I would.

Now suppose that I myself play a game of chess against the RYK system.

RYK has the odd property that, on each and every turn, my probabilistic prediction for RYK's move is exactly the same prediction I would make if I were playing against world champion Garry Kasparov.

Nonetheless, I can easily beat RYK, where the real Kasparov would crush me like a bug.

The *creative* unpredictability of intelligence is not like the *noisy* unpredictability of a random number generator. When I play against a smarter player, I can't predict *exactly* where my opponent will move against me. But I can predict the end result of my smarter opponent's moves, which is a win for the other player. When I see the randomized opponent make a move that I assigned a tiny probability, I chuckle and rub my hands, because I think the opponent has randomly made a dreadful move and now I can win. When a superior opponent surprises me by making a move to which I assigned a tiny probability, I groan because I think the other player saw something I didn't, and now *I'm* about to be swept off the board. Even though it's exactly the same probability distribution! I can be *exactly* as uncertain about the actions, and yet draw very different conclusions about the eventual outcome.

(This situation is possible because I am not logically omniscient; I do not explicitly represent a joint probability distribution over all entire games.)

When I play against a superior player, I can't predict *exactly* where my opponent will move against me. If I could predict that, I would necessarily be at least that good at chess myself. But I can predict the *consequence* of the unknown move, which is a win for the other player; *and the more the player's actual action surprises me, the more confident I become of this final outcome.*

The unpredictability of intelligence is a very special and unusual kind of surprise, which is not at all like noise or randomness. There is a weird balance between the unpredictability of actions and the predictability of outcomes.

Didn't we already have this exact post?

You had "Optimization", and you may have read a draft elsewhere with some of this material in it. I'm posting it now by way of finally finishing it after years, and by way of taking a vacation (feeling a bit burned-out lately).

On a whimsical note, it is reminiscent of the unpredictability of the Infinite Improbability Drive :-)

"If we walk without rhythm, we won't attract the worm."

Set up a pattern-recognition system directed at your own actions, and when you fall into a predictable rut, do something differently.

Could you give a more precise definition of "calibrated"? Your example of 1/37 for each of 37 different possibilities, justified by saying that indeed one of the 37 will happen, seems facile. Do you mean that the "correct" distribution, relative to your guess, has low relative entropy?

I think what he means by "calibrated" is something like it not being possible for someone else to systematically improve the probabilities you give for the possible answers to a question just from knowing what values you've assigned (and your biases), without looking at what the question is.

I suppose the improvement would indeed be measured in terms of relative entropy of the "correct" guess with respect to the guess given.

anon: The short definition of "well-calibrated" is that if things you assign X% probability to actually do happen X% of the time, you are well-calibrated. If you go around saying that you are 90% certain of your answers on an exam, and you get a 90, you are well-calibrated. If you say that you are 90% certain and you get a 50, you are not well-calibrated.

For the long version, see A Technical Explanation of Technical Explanation". (Highly recommended!)

Anon: well-calibrated means roughly that in the class of all events you think have probability p to being true, the proportion of them that turn out to be true is p.

More formally, suppose you have a probability distribution over something you are going to observe. If the log probability of the event which actually occurs is equal to the entropy of your distribution, you are well calibrated. If it is above you are over confident, if it is below you are under confident. By this measure, assigning every possibility equal probability will always be calibrated.

This is related to relative entropy.

As with jsalvati's comment, upon reading this post I was convinced that either I was hallucinating, that almost the entirety of this post had appeared elsewhere before, or that I've read so much of Eliezer's writing that I'm anticipating his arguments in detail before he makes them.

Fortunately, it turns out that the second (and least disturbing) option was correct, in that a substantial amount of this post already existed at: [link deleted]

Not that I'm complaining, but it was seriously bothering me that I knew I'd heard this stuff before, but Google said it hadn't previously been posted on Overcoming Bias, and I was starting to doubt my sanity...

This notion of calibratedness seems to have bad properties to me. Consider a situation where I'm trying to guess a distribution for the outcomes of a coin flip with a coin which, my information tells me, lands "heads" 99% of the time. Then a guess of 50% and 50% is "calibrated" because of the 50% predictions I make, exactly half come out right. But a guess 49.9% heads and 50.1% tails is horribly calibrated; the "49.9%" predictions come out 99% correct, and the "50.1%" predictions come out 1% correct. So the concept, as defined like this, seems hypersensitive, and therefore not very useful. I think a proper definition must necessarily be in terms of relative entropy, or perhaps considering Bayesian posteriors from subsets of your information, but I still don't see how it should work. Sorry if someone already gave a robust definition that I missed.

Nick: If you don't mean

expectedlog probability, then I don't know what you're talking about. And if you do, it seems to me that you're saying that well-calibratedness means that relative entropy of the "correct" distribution relative to yours is equal to your entropy. But then the uniform prior doesn't seem well-calibrated; again, consider a coin that lands "heads" 99% of the time. Then your entropy is 1, while the relative entropy of the "correct" distribution is (-log(99%)-log(1%))/2, which is >2.Nick: Sorry, I got it backwards. What you seem to be saying is that well-calibratedness means that relative entropy of your distribution relative to the "correct" one is equal to your entropy. This does hold for the uniform guess. But once again, considering a situation where your information tells you the coin will land "heads" with 99% probability, it would seem that the only well-calibrated guesses are 99%-1% and 50%-50%. I don't yet have an intuition for why both of these guesses are strictly "better" in any way than an 80%-20% guess, but I'll think about it. It definitely avoids the sensitivity that seemed to come out of the "rough" definition, where 50% is great but 49.9% is horrible.

Please don't post a link to the draft version of this. This just means that lots of Overcoming Bias readers will be bored for the next couple of weeks, because they'll already have read the old version of something.

Anon: no, I mean the log probability. In your example, the calibratedness will generally be high: - \log 0.499 - H(p) ~= 0.00289 each time you see tails, and - log 0.501 - H(p) ~= - 0.00289 each time you come up tails. It's continuous.

Let's be specific. We have H(p) = - \sum_x p(x) \log p(x), where p is some probability distribution over a finite set. If we observe x0, the say the predictor's calibration is

C(x0) = \sum_x p(x) \log p(x) - \log p(x0) = - \log p(x0) - H(p)

so the expected calibration is 0 by the definition of H(p). The calibration is continuous in p. If \log p(x0) is higher then the expected value of \log p(x) then we are underconfident and C(x0) < 0; if \log p(x0) is lower than expected we are overconfident, and C>0.

With q = p(x) d(x,x0) the non-normalised probability distribution that assigns value only x0, we have

C = D(p||q)

so this is a relative entropy of sorts.

If the possibility of "creative surprises" depends on ignorance about the logical consequences of a game move, it seems that this would be best modeled as an asymmetric information problem.

It's interesting to note that the usual "Dutch-book" arguments for subjective probability break down under asymmetric information - the only way to avoid losing money to a possibly better-informed opponent is refusing to enter some bets, i.e. adopting an upper-lower probability interval.

Of course, such upper-lower spreads show up routinely in illiquid financial markets; I wonder whether any connections have been made between rational pricing conditions in such markets and imprecise probability theories like Dempster-Shafer, etc.

I think there's a sign error in my post -- C(x0) = \log p(x0) + H(p) it should be.

Nick Hay - IIRC the minus-log probability of an outcome is usually called "surprisal" or "self-information". The Shannon entropy of a random variable is just the expected value of its surprisal.

guest: right, so with those definitions you are overconfident if you are suprised more than you expected, underconfident if you are suprised less, calibration being how close your suprisal is to your expectation of it.

If you use upper-lower then you will forego combinations of bets that are sure gains. Dutch Book is inescapable, and only your presumption that the environment will contain agents who try to offer you losing bet combinations but no agents who are nice enough to take them, would lead you think that upper-lower "solves" the problem. No real agent can afford the computing power to be an ideal Bayesian, but the laws are ineluctable: whoever is not Bayesian - regardless of the reasons, regardless of whether it is unavoidable - is losing

something.And whoever is bayesian is also losing something, a quantity of easily acquirable energy. It is a loss one way or the other.

Nick: It seems like a bad idea to me to call a prediction underconfident or overconfident depending on the particular outcome. Shouldn't it depend rather on the "correct" distribution of outcomes, i.e. the Bayesian posterior taking all your information into account? I mean, with your definition, if we do the coin flip again, with 99% heads and 1% tails, and our prediction is 99% heads and 1% tails, then if it comes up heads we're slightly underconfident, and if it comes up tails we're strongly overconfident. Hence there's no such thing as an actually well-calibrated prediction for this (?). If we take into account the existence of a correct Bayesian posterior then it's clear that "expected calibration" is not at all 0. For instance if p is the "correct" probability of heads and q is your prediction then the "expected calibration" would seem to be -p

log(q)-(1-p)log(1-q)+qlog(q)+(1-q)log(1-q). And, for instance, if you know for a fact that a certain experiment can go one of 3 ways, and over a long period of time the proportion has been 60%-30%-10%, then not only 33.3%-33.3%-33.3%, but also 45%-45%-10% and 57%-19%-24% have "expected calibration" ~0 by this definition.You consider the Randomized Yudkowsky-Kasparov system, but this is only one point on a continuum. There are stupider and smarter systems, generated by weaker and stronger chess players, respectively.

One interesting system is Randomized Kasparov-Kasparov, where we show arbitrary chess positions to Kasparov and ask him to assign probability distributions to what he'd do. Chess is a game of perfect information, so arbitrary chess positions can be shown without their history (of course, we have to count physically invisible state as known; a position consists of both the arrangement of the pieces and factoids like "the king has not moved, but the queen's rook has moved, so queenside castling is forbidden but kingside castling is allowed"). I assert without proof that RKK is an extremely smart system, and will turn both RYK and you yourself into chunky salsa.

Here's the question: is Kasparov smarter than RKK? We can imagine that first we develop RKK by properly incentivizing Kasparov (e.g. by telling him that we're going to use it to play against every human on the planet, and for every time that RKK comes out victorious, we'll donate a dollar to his favorite charity, or if RKK loses even once, we'll drown a kitten, or whatever), and then (after we've shown Kasparov the million zillion chess positions and recorded his assigned probability distributions) we turn the tables and properly incentivize him to beat RKK (charity donation, kitten drowning, whatever). Can Kasparov beat RKK? Does it matter if he knows he's playing against RKK?

Then the metaquestion: what if Kasparov was a weaker player? (For values of Kasparov that equal Yudkowsky...) What if he was a stronger player? One limiting case is easy. If Kasparov was very very smart, he could solve chess, and so RKK could solve chess, and they would be equally strong. (That is, if the solution says that White wins, then whoever plays White wins, and so forth.) There's a probability distribution that solves chess, after all (with 100% for the right move in any given situation and 0% for the others).

I might be fighting the hypothetical a bit here, but... if Kasparov

knowshe's playing RKK, then itseems to me- Mr. Wins-One-Game-Out-Of-Five - that that would make RKK predictable to Kasparov, which would give him an advantage.Even if he doesn't, it seems like this is rather similar to RYK v. Yudkowsky, and RKK's occasional slip-ups would lead to general loss against Kasparov.

Will RYK be as good a player as Kasparov? Of course not. Sometimes the RYK system will randomly make dreadful moves which the real-life Kasparov would never make - start the game with P-KN4. I assign such moves a low probability, but sometimes the computer makes them anyway, by sheer random chance.Don't think this is right. RYK would not make dreadful moves that

youwould not think Kasparov would make under any circumstances, unless you were assigning odds too high to those moves. So unless you think there's a chance that Kasparov would make that move more often thanyouwould, you would assign a log-odd of -200 dB or something, effective 0 chance.To clarify: what I'm specifically objecting to, I think, is "by sheer random chance". If it's the chance that's causing those bad moves, then you're not well-calibrated. If it's your uncertainty about Kasparov that's causing the moves, then my objection doesn't apply.

I swear to god I've read these Kasparov posts before...

I

thinkthe weakness of R[xx] systems is that they're only working one move at a time. Any position suggests a number of strategies to a good player. Being stuck with a randomized system that tends towards plausible moves means that a move that builds in one direction could be weakened in the next round by a move that's tending towards another good strategy.Could R[xx] be improved (in theory--I think the calculations would be unmanageable) by considering more than one move at a time? Is there some number of moves where you don't exactly have an R[xx] anymore because the program is more like simulating the way the real chess player thinks?

Nancy, the underlying algorithms of most modern chess engines don't have any sort of persistent strategy from move to move. In fact, most don't even keep any state between moves. (That's an advanced feature.) And yet chess engines are able to compete with top chess players. While they look more than one move ahead, so does RYK. I don't think your objection actually pans out in practice.

re: calibration, it seems like what we want to do is ask ourselves what happens if an agent is asked lots of different probability questions, consider the the true probability as a function of probability stated by the agent, use some prior distribution (describing

ouruncertainty) on all such functions that the agent could have, update this prior using a finite set of answers we have seen the agent give and their correctness, and end up with a posterior distribution on functions (agent's probability -> true probability) from which we can get estimates of how over/underconfident the agent is at each probability level, and use those to determine what the agent "really means" when it says 90%; if the agent is overconfident at all probabilities then it's "overconfident" period, if it's underconfident at all probabilities then it's "underconfident" period, if it's over at some and under at some then I guess it's just "misconfident"? (an agent could be usually overconfident in an environment that usually asked it difficult questions and usually underconfident in an environment that usually asked it easy questions, or vice versa) If we keep asking an agent that doesn't learn the same question like in anon's comment then that seems like a degenerate case. On first think it doesn't seem like an agent's calibration function necessarily depends on what questions you ask it; Solomonoff induction is well-calibrated in the long run in all environments (right?), and you could imagine an agent that was like SI but with all probability outputs twice as close to 1, say. Hope this makes any sense.pdf23ds, thanks. I wasn't sure whether the whole strategic situation could be implied in each position. Have people been able to learn anything about chess strategy by studying how programs play? Do program vs. program games make sense to people?

pdf23ds, most chess engines store analyzed subtrees into a transposition table. This is not the same as having a persistent strategy, but it does mean that much evaluation work can be saved by keeping a persistent state between moves.

This is similar to how iterative deepening works: in theory you could obtain the same result by performing a depth-first search directly, but iterative deepening provides a hint at the "best" moves which makes search more efficient.

I wasn't under the impression that transposition tables were kept between moves, but now that I think about it, they probably are. On the other hand, I'd wager that the reuse isn't very high, depending on the caching criteria. Maybe around 30%.

Have people been able to learn anything about chess strategy by studying how programs play?I'm don't know if they've

learnedanything. Computer style is heavy on tactics. I'm not sure how their strategy is characterized. But it's notable that most new young grandmasters nowadays rely heavily on chess engines to help them train.Do program vs. program games make sense to people?Yes. Only chess engine fans are really too interested in them, though.

It'd be interesting to know if perfect checkers games (checkers has been solved) were amenable to human analysis.

A bit more on chess engines: many engines have a number of parameters you can set that determines their playing style, and some might say their "strategy". Things like aggressiveness, willingness to sacrifice, valuation of pieces, willingness to trade, etc. What many of these parameters actually do is change the static board evaluation function. A chess engine looks ahead many moves, and due to time constraints, picks a board position after X moves and runs the static evaluation function, which is a very naive estimate of the quality of the engine's position. A human's static evaluation function is much higher quality, but humans only look ahead a few moves at most. (Well, grandmasters look ahead a dozen or two moves at most but they're still evaluating many, many fewer positions (their pruning factor is higher).)

When I play against a superior player, I can't predict exactly where my opponent will move against me. [...] But I can predict the consequence of the unknown move, which is a win for the other player; and the more the player's actual action surprises me, the more confident I become of this final outcome."Interestingly, playing an opponent which selects completely randomly from the range of possible moves gives a similar result: you never know what your opponent will do, but you can predict with a fair amount of confidence that you will win. And the more their actions surprise you (because you didn't spend any time thinking about such a remarkably dumb move as

that), the more confident you become in your prediction of the result.bah. moving too fast, forget to close an em tag. in the preceding comment, only the first paragraph should be em'd as a quote.

The quantity called Bayesian surprise expresses what you're trying to say. It is the KL divergence between the guessed probability distribution and the 'actual' distribution (distance between your probability distribution computed from some model (RYK) and Kasparov's actual distribution over moves (perhaps uniforms over the set of the 'top 5' moves according to some scoring heuristic)). When these are far apart in terms of KL divergence, your model will be greatly surprised (I recall a quote of yours somewhere in a post where you say, "There are no surprising facts, only models that are surprised by facts. And if a model is surprised by facts then it is no credit to that model.")

Bayesian surprise is experimentally linked to focus of attention mechanisms in human vision. If I have to have a good model of my world to avoid being hit by cars or eaten by bears, then I had better divert precious resources to allow me to scrutinize events of high Bayesian surprise (high shock to my current model) more intensely. This will almost certainly play a large role in understanding the specific biological mechanisms that lead to cognitive biases.

I also think that playing illustrates a use of Bayesian surprise. You are able to modify your map and discover new, perhaps unanticipated components of the territory, and (and this is what distinguishes 'playing' from other types of map updates) you

feel safe while you do it.When a child learns to swim, they usually have adults around to protect them and though there is an element of danger, one feels reasonably safe while having your model of buoyancy, gravity, and movement totally surprised (in the Bayesian sense) by the experience of being in the water. The safety and map updating at the same time are what we refer to as fun, at least in many cases.At any rate, I have noticed a number of places where this appeal to surprise is made and can benefit from rigorously linking to actual Bayesian surprise. For example, if I have two different probability distributions P and Q, then if they fail to have the same support over the domain, then D(P || Q) = \infty. Perhaps Engelbart invented the mouse and was able to extrapolate and anticipate a space of computer tools that could possibly be invented and he had some internal probability distribution over this space. But it's not those aspects of modern computing that are easily formed from this probability model of his that cause surprise or problems. If he understood mice conceptually, then trackballs clearly aren't surprising. Probably pen-based computing and even tablets are not surprising. But natural language processing allowing voice control might have been. Or brain-computer interface might have been... etc. Something in device space currently has non-zero probability that did not have non-zero probability in his model, and so there's infinite surprise between his model and the current situation. But maybe in concept space he did better. But by comparison, device space is to us as concept space is to FOOM'd AI. I certainly believe that a FOOM'd AI's probability model about various issues will have a strictly larger support set than my own would. There are concepts, heuristics, and metaheuristics that it will be able to assign meaningful probability to and of which I won't even be able to conceive. There will be infinite surprise.

All cases where humans can abstract problems (of economics, physics, mathematics, etc.) and accurately predict things in solution space indicate domains where the surprise isn't infinite. There are salient search directions to investigate that produce useful results. This is Robin's FOOM argument as I understand it from the debate. But it fails to consider cases where there literally are no salient search directions... at least not to us because we can't model them. Or the ones we can model are pitifully inadequate by comparison.

Spot Moore's paradox in that statement.

Better yet - explain it to those of us who don't have an encyclopedic knowledge of chess.

My extremely limited understanding suggests that this is an unpopular opening (and that barring a well-analyzed play-history that suggested it was favorable) Kasparov is unlikely to open with P-KN4 against an equal opponent, so RYK opening with that suggests we got lucky and drew one of the 1/1000 "let's try something fun" games.

I rank it as very possible that my extremely limited understanding is wildly incorrect, but with cryptic comments like these, I stand to benefit little from this insight, if correct.

If

EY saysthat starting the game with P-KN4 is a move which the real-life Kasparov would never make, this means that (either he's lying or) he gives zero probability to Kasparov making that move, and so by construction his bot will never make that move either.Together with the construction of RYK, the sentence I quoted amounts to “Kasparov will never open with that move, but I don't believe he never will”.

OK, I'll play.

Given the EY here in both scenarios is the active agent (EY, in flesh, and the RYK box, consisting of a synthetic EY guessing at what K does, and a system choosingly randomly based on EY-over-K predictions what the box does next)...

Yes. "never" is a strong word here. Assume when he says "never" he means, "EY thinks it is really unlikely for K to do".

In other words, when EY sees EY-playing-K-but-Selecting-Moves-Randomly-in-proportion-with-predictions-of-K, and observes RYK has played an unexpected/unlikely move, EY concludes that RYK has probably picked a move that is evaluated to be less fit than some other move. RYK is a worse player than EY because RYK is not picking its best options. RYK is the gambler that sometimes takes the sucker bet.

Yes, the sane meaning for “never” is ‘negligibly often’... but then again the sane meaning for “sometimes” is ‘non-negligibly often’.

I do get his point, but I think

that sentence in particularis paradoxical. Had he said “rarely” and “rarely” it'd be OK, but “sometimes” and “never”...