(Mathematicians may find this post painfully obvious.)

I read an interesting puzzle on Stephen Landsburg's blog that generated a lot of disagreement. Stephen offered to bet anyone $15,000 that the average results of a computer simulation, run 1 million times, would be close to his solution's prediction of the expected value.

Landsburg's solution is in fact correct. But the problem involves a probabilistic infinite series, of a kind used often on Less Wrong in contexts where one is offered some utility every time one flips a coin and it comes up heads, but loses everything if it ever comes up tails. Landsburg didn't justify the claim that a simulation could indicate the true expected outcome of this particular problem. Can we find similar-looking problems for which simulations give the wrong answer? Yes.

Here's Perl code to estimate by simulation the expected value of the series of terms 2^k / k from k = 1 to infinity, with a 50% chance of stopping after each term.

```perl
my $runs = 1000000;    # number of trials (1 million)
my $bigsum = 0;
for (my $trial = 0; $trial < $runs; $trial++) {
    my $sum = 0;
    my $top = 2;       # numerator: 2^k
    my $denom = 1;     # denominator: k
    do {
        $sum += $top / $denom;
        $top *= 2;
        $denom += 1;
    } while (rand(1) < .5);   # 50% chance of another term
    $bigsum += $sum;
}
my $ave = $bigsum / $runs;
print "ave sum=$ave\n";
```

(If anyone knows how to enter a code block on this site, let me know. I used the "pre" tag, but the site stripped out my spaces anyway.)

Running it 5 times, we get the answers

ave sum=7.6035709716983

ave sum=8.47543819631431

ave sum=7.2618950097739

ave sum=8.26159741956747

ave sum=7.75774577340324

So the expected value is somewhere around 8?

No; the expected value is given by the sum of the harmonic series, which diverges, so it's infinite. Later terms in the series are exponentially larger, but exponentially less likely to appear.
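To make the divergence explicit: the do-while loop always produces the first term, then adds each further term with probability 1/2, so term k appears with probability 2^-(k-1). The expected sum is therefore

\[
E[\mathrm{sum}] = \sum_{k=1}^{\infty} \frac{2^k}{k} \, 2^{-(k-1)} = \sum_{k=1}^{\infty} \frac{2}{k} = \infty,
\]

twice the harmonic series, which diverges.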

Some of you are saying, "Of course the expected value of a divergent series can't be computed by simulation! Give me back my minute!" But many things we might simulate with computers, like the weather, the economy, or existential risk, are full of power law distributions that might not have a convergent expected value. People have observed before that this can cause problems for simulations (see *The Black Swan*). What I find interesting is that the output of the program above doesn't look like something inside it diverges. It looks almost normal. So you could run your simulation many times and believe that you had a grip on its expected outcome, yet be completely mistaken.

In real-life simulations (that sounds wrong, doesn't it?), there's often some system property that drifts slowly, and some critical value of that system property above which some distribution within the simulation diverges. Moving above that critical value doesn't suddenly change the output of the simulation in a way that gives an obvious warning. But the expected value of keeping that property below that critical value in the real-life system being simulated can be very high (or even infinite), with very little cost.

Is there a way to look at a simulation's outputs, and guess whether a particular property is near some such critical threshold? Better yet, is there a way to guess whether there exists some property in the system nearing some such threshold, even if you don't know what it is?

The October 19, 2012 issue of Science contains an article on just that question: "Anticipating critical transitions", Marten Scheffer et al., p. 344. It reviews 28 papers on systems and simulations, and lists about a dozen mathematical approaches used to estimate nearness to a critical point. These include:

- Critical slowing down: When the system is near a critical threshold, it recovers slowly from small perturbations. One measure of this is autocorrelation at lag 1, meaning the correlation between the system's output at times T and T-1. Counterintuitively, higher autocorrelation at lag 1 by itself suggests that the system is more predictable than before, when it may actually indicate that the system is becoming less predictable: the predictable system reverts to its mean, while the system near a critical point has no stable mean to revert to.
- Flicker: Instead of having a single stable state that the system reverts to after perturbation, an additional stable state appears, and the system flickers back and forth between the two states.
- Dominant eigenvalue: I haven't read the paper that explains what the article means when it cites this, but I do know that you can predict when a helicopter engine is going to malfunction by putting many sensors on it, running PCA on time-series data from those sensors to get a matrix that projects their output into just a few dimensions, then reading their output continuously and predicting failure any time the PCA-projected output vector moves a lot. That is probably what they mean.
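The "critical slowing down" signal can be sketched with a toy model. The following Python snippet is my own illustration, not from the Scheffer paper: an autoregressive process x_t = phi*x_{t-1} + noise approaches its critical point as phi approaches 1, and the lag-1 autocorrelation of its output rises toward 1 as it does.

```python
import random

def ar1_series(phi, n, seed=0):
    """Simulate x_t = phi * x_{t-1} + noise; phi -> 1 is the critical point."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0, 1)
        out.append(x)
    return out

def lag1_autocorr(xs):
    """Sample correlation between the series and itself shifted by one step."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

far = lag1_autocorr(ar1_series(0.5, 20000))   # far from the critical point
near = lag1_autocorr(ar1_series(0.97, 20000)) # near the critical point
print(far, near)
```

Near the critical point the raw output still looks superficially normal, but the autocorrelation estimate gives the warning that eyeballing the output doesn't.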

So if you're modeling global warming, running your simulation a dozen times and averaging the results may be misleading. [1] Global temperature has sudden [2] dramatic transitions, and an exceptionally large and sudden one (15C in one million years) neatly spans the Earth's greatest extinction event so far, on the Permian-Triassic boundary [3]. It's more important to figure out what the critical parameter is and where its critical point is than to try to estimate how many years it will be before Manhattan is underwater. The "expected rise in water level per year" may not be easily answerable by simulation [4].

And if you're thinking about betting Stephen Landsburg $15,000 on the outcome of a simulation, make sure his series converges first. [5]

[1] Not that I'm particularly worried about global warming.

[2] Geologically sudden.

[3] Sun et al., "Lethally hot temperatures during the early Triassic greenhouse", Science 338 (Oct. 19 2012) p.366, see p. 368. Having just pointed out that an increase of .000015C/yr counts as a "sudden" global warming event, I feel obligated to also point out that the current increase is about .02C/yr.

[4] It will be answerable by simulation, since rise in water level can't be infinite. But you may need a lot more simulations than you think.

[5] Better yet, don't bet against Stephen Landsburg.

The Perl code that you provide will depend a lot on your random number generator. It's worth remembering that most random-number generators aren't; they're pseudo-random-number generators, which are defined in such a way that they mimic a lot of the properties of truly random numbers. (Some random number generators take advantage of external events to generate random numbers - like the timing of keypresses on the keyboard. This type of generator tends to fall back on pseudorandom numbers if it runs out of keyboard-timing-generated random numbers; and for the Perl program that you gave, this type of generator will run out of keyboard-generated random numbers with very high probability.)

One property of genuine random numbers that any pseudo-random number generator won't have is the vanishingly small possibility of a sequence of N heads for large N. A pseudo-random number generator will hold a certain amount of state - say, 64 bits. A well-designed 64-bit generator will, in a sequence of 2^64 coinflips, include all possible reasonably short sequences, and a number of suitably random-looking longer ones, with a believable distribution (which means very close to 2^63 heads and 2^63 tails). After that, it will start to repeat itself. What you will not get is 2^64 heads (or 2^64 tails).

So. While the sequence that the Perl program above simulates is potentially infinite, the Perl program itself is *not* potentially infinite. There is a maximum number of terms that it will consider; therefore a maximum possible value of $sum. And no matter where that maximum value is, such a truncated series *does* have a finite expected value. Given the results of your run, and this sequence, I expect that your computer's random number generator allows for a series of more or less 1674 'heads' in a row; that would give a truncated series with an expected sum of near to 8. So the repeatability that you see there is a function of the limitations of the technology, not a function of the problem itself.

Run length finiteness will kick in long before program size finiteness. How long do you expect to have to run before you hit 100 heads in a row by flipping fair coins a trillion times a second?

Roughly speaking, we're talking multiples of the age of the universe so far.
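A back-of-the-envelope check of that claim (in Python; the constants are rough):

```python
# Expected wait for a run of 100 heads is on the order of 2^100 flips.
flips = 2.0 ** 100
seconds = flips / 1e12          # at a trillion flips per second
years = seconds / 3.156e7      # ~3.156e7 seconds per year
ratio = years / 1.38e10        # age of the universe ~13.8 billion years
print(f"{years:.2e} years, or {ratio:.1f} times the age of the universe")
```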

An excellent point. I didn't even think of time.

Whenever you’re thinking that a 64-bit state might not be enough for something counter-like, it’s a good thing to think about time. I remember thinking once for more than half a minute how to handle overflow of a 64-bit variable. It was used to pass the number of files extracted from an archive. On an iPad.

What you say about random number generators is true, but it can't be the explanation, because there were nowhere near enough samples to get a run of 1674 heads in a row. I ran 5 tests with 1 million sequences of coin flips, so the longest run of heads we expect is 21 heads.

The sum would be near 8 even if the numbers were random. I chose that particular series to make that happen. The sum of the entire series diverges, but it diverges slowly enough that sometimes getting 24 heads in a row doesn't change the output a great deal, as you can see from this table showing # of heads, 1/p(getting that many heads in a row), and the resulting contribution to the sum.
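The table's calculation can be reproduced in a few lines of Python (my own sketch, not the original table): term k has value 2^k/k, appears with probability 2^-(k-1), and so contributes only 2/k to the expected sum.

```python
def contribution(k):
    """Expected contribution of term k to the sum:
    value (2^k / k) times the probability 2^-(k-1) that term k appears."""
    return (2 ** k / k) * 0.5 ** (k - 1)

for k in (1, 5, 10, 15, 20, 24):
    value = 2 ** k / k
    inv_p = 2 ** (k - 1)   # 1 / P(term k appears)
    print(f"k={k:2d}  value={value:12.1f}  1/p={inv_p:9d}  contribution={contribution(k):.4f}")
```

Each term's contribution is 2/k, so the expectation diverges, but only at the crawling pace of the harmonic series.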

Sorry the post originally didn't say there were 1 million trials per run. I had that in the code, but lost it while trying to reformat the code.

You'd have to take more than 5 million trials to see much different numbers. Here's all cases in a series of 240 runs of 1 million trials each where the average sum was greater than 9:

Yes... at one million trials per run, you wouldn't expect much more than 20 flips in a run in any case. By my quick calculation, that should result in an average around 3.64 - with perhaps some variability due to a low-probability long string of, say, 30 heads turning up.

Yet you got an average around 8. This suggests that a long chain of heads may be turning up slightly more often than random chance would suggest; that your RNG may be slightly biased towards long sequences.

So, I wrote a similar program to Phil and got similar averages, here's a sample of 5 taken while I write this comment

8.2 6.9 7.7 8.0 7.1

These look pretty similar to the numbers he's getting. Like Phil, I also get occasional results that deviate far from the mean, much more than you'd expect to happen with an approximately normally distributed variable.

I also wrote a program to test your hypothesis about the sequences being too long, running the same number of trials and seeing what the longest string of heads is, the results are

19 22 18 25 23

Do these seem abnormal enough to explain the deviation, or is there a problem with your calculations?

It's not the highest that matters; it's the distribution within that range.

There was *also* a problem with my calculations, incidentally; a factor-of-two error, which is enough to explain most of the discrepancy. What I did to calculate it was to add up the harmonic sequence up to around 24 (1+1/2+1/3+...+1/24), then double the last term (1+1/2+1/3+...+1/23 + 2/24). However, the code as given starts out with a 2, and then doubles the numerator with each added term; the calculation I *should* have used is (2+2/2+2/3+2/4+...+2/23+4/24). That leads to me expecting a value just a little over 7, which is pretty close....

I also ran a similar program. I copied and pasted Phil's, then modified it slightly. My results were:

1 500523

2 250055

3 124852

4 62067

5 31209

6 15482

7 7802

8 4011

9 1978

10 1006

11 527

12 235

13 109

14 68

15 41

16 19

17 10

18 5

21 1

...where the left-hand column is the number of terms in a given sequence, and the right-hand column is the number of times that number of terms came up. Thus, there were 500523 runs of one term each; an insignificant distance from the expected number (500000). Most of the runs were very close to the expected value; interestingly, everything from 14 terms upwards for which there were any runs was *above* the expected number of runs, and often by a significant margin. The most significant is the single 21-term run; I expect to see 0.476 of those, and I see 1, slightly over twice the expectation. At 15 terms, I expected to see 30.517 runs; I saw 41 of those. At 17 terms, I expect to see 7.629 on average; I see 10 this time.

My final average sum is 7.25959229425851; a little higher than expected, but, now that I've corrected the factor-of-two error in my original calculation, not unexpectedly far off.
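The expected counts quoted here are easy to reproduce: a sequence has exactly r terms with probability 2^-r (r-1 continuations followed by a stop), so in a million trials you expect 10^6 * 2^-r runs of length r. A quick Python check (my own sketch):

```python
def expected_runs(r, trials=1_000_000):
    """Expected number of sequences with exactly r terms:
    each sequence has exactly r terms with probability 2^-r."""
    return trials * 0.5 ** r

for r in (1, 15, 17, 21):
    print(r, expected_runs(r))
```

This matches the figures above: about 500000 one-term runs, 30.517 fifteen-term runs, 7.629 seventeen-term runs, and 0.48 twenty-one-term runs.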

So most of the deviation is due to an error in my calculation. The rest is due to the fact that a 21-term or longer run turning up - which can easily happen - will probably pull the average sum up by 1 or more all by itself; it's easier for the average sum to be increased than decreased.

Thoroughly discussed on MathOverflow awhile ago.

That doesn't seem to be about the same thing.

In what way? Looks like the same problem to me.

Never mind -- I had read this LW article but not the original Landsburg post it links to, and I was under the impression that it was about the St Petersburg paradox.

(Note to self: Never comment on [LINK]-type LW posts until I've at least skimmed the original article they're based on, no matter how well I think I understand the summary on LW.)

Just to make this clear, the problem Stephen Landsburg discusses is pretty much irrelevant to the point being made here. It's only mentioned because his suggestion of using a simulation is a lead-in to the main topic.

Landsburg's solution is somewhat head-scratching. It works fine if you're sampling families. But the problem as stated doesn't sample individual families. It asks about the overall population.

This washes out his effect pretty strongly, to the point that the expected excess in the case of an infinite-sized country is the same absolute excess as in the case of one family: zero. (I originally wrote that you expect 0.3 more boys than girls; that was a mistake.)

In total. Across the whole country. And it wasn't asking about the families, it was asking about the country. So the ratio is a teeny tiny deviation from 50%, not 30.6%. This is very much an ends in 'gry' situation.

(Note: what I was trying to say above, about the 0.3 more boys than girls, and then got confused:

This is all about the stopping condition. The stopping condition produces a bias by the disproportionate effect of the denominators on either side of 1/2. But this stopping condition only steps in *once*, to the country and to the single family all the same: there's effectively only one stopping condition, that being when N boys have been born. Increasing N from 1 to a million just washes out the effect by adding more unbiased random babies on what for clarity you can imagine as the *beginning* of the sequence. So you have this biased ratio, and then you weighted-average it with 1/2, with a huge weight on the 1/2.)

It's tricky. I at first thought that he was talking about sampling families, but he isn't.

The expected fraction G/(G+B) over many trials, where each trial t involves N families and leads to Gt girls and Bt boys, is *not* (sum of Gt) / (sum of (Gt + Bt)). Trials that result in a smaller number of children have more boys than girls. Trials that result in an unusually large number of children have more girls than boys. Yet both kinds of trials count as 1 sample when computing the *average fraction* of girls. So the average population fraction is smaller than the population fraction from all those trials.

Not quite. The expected difference in numbers is zero. It's the expected ratio G/(G+B) for a country that is a hair -- an unmeasurably small hair -- under 0.5, and if you multiply that hair by the population you get something tending to a constant fraction of a person.

While there's an interesting puzzle to be posed, Landsburg didn't formulate it in the right way. The question he should have posed (and the one he actually answers in his solution) is "what is the expected proportion of girls per family?" This is the question whose correct answer shows a substantial deviation from the naive 0.5.

But I don't see what the puzzle has to do with the rest of the post.

Yeah, I got a bit confused because I imagined that you'd go to the maternity ward and watch the happy mothers leaving with their babies. If the last one you saw come out was a girl, *you know they're not done*. Symmetry broken. You are taking an unbiased list of possibilities and throwing out the half of them that end with a girl...

Except that to fix the problem, you end up adding an expected equal number of girls as boys.

I had it straight the first time, really. But his answer was so confusingly off that it re-befuddled me.

E(G-B) = 0, E(G/(G+B)) < 0.5.

He did answer the question he posed, which was "What is the expected fraction of girls in a population [of N families]?" It's not an unmeasurably-small hair. It depends on N. When N=4, the expected fraction is about .46. If you don't believe it, do the simulation. I did.

I believe the mathematics. He is correct that E(G/(G+B)) < 0.5. But a "country" of four families? A country, not otherwise specified, has millions of families, and if that is interpreted mathematically as asking for the limit of infinite N, then E(G/(G+B)) tends to the limit of 0.5.

To make the point that this puzzle is intended to make, about expectation not commuting with ratios, it should be posed of a single family, where E(G)/E(G+B) = 0.5, E(G/(G+B)) = 1-log(2).
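Both limits are easy to check by simulation. Here's a quick Monte Carlo sketch in Python (my own code, not Landsburg's): each family has children until its first boy, so with k families B = k and G is a sum of k geometric variables.

```python
import random

def fraction_girls(num_families, trials, seed=1):
    """Monte Carlo estimate of E[G/(G+B)] when each family
    stops at its first boy, so B = num_families."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        girls = 0
        for _ in range(num_families):
            while rng.random() < 0.5:  # each child is a girl with prob. 1/2
                girls += 1
        total += girls / (girls + num_families)
    return total / trials

est_single = fraction_girls(1, 200_000)    # should approach 1 - log(2) ~ 0.307
est_country = fraction_girls(1000, 5_000)  # should approach 0.5
print(est_single, est_country)
```

The single-family estimate lands near 1 - log(2), while the 1000-family "country" comes out indistinguishable from 0.5, which is the whole dispute in two numbers.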

But as I said earlier, how is this puzzle relevant to the rest of your post? The mathematics and the simulation agree.

Estimating the mean and variance of the Cauchy distribution by simulation makes an entertaining exercise.

Thinking about betting $15,000 on a math problem, to be adjudicated by the outcome of a computer simulation, made me wonder how we know when a computer simulation would give the right answer. Showing the results for the similar-looking but divergent series is the simplest example I could think of of when a computer simulation gives a very misleading estimate of expected value, which is the problem this post is about.

The question asked about a country. Unless you're counting hypothetical micro-seasteads as countries, the ratio is within noise of 50%.

(In response to a longer version of the previous post which was in response to the pre-edited version of its parent, which was opposite in nearly every way - and because it was up really briefly, I forgot that he could have seen the pre-edit version. If RK weren't a ninja this wouldn't have come up)

Dude, I already conceded. You were right, I said I was wrong. When I was saying I had it straight the first time, I meant *before* I read the solution. That confused me, then I wrote in response, in error. Then you straightened me out again.

Sorry, I posted before you corrected that post. I shall edit out my asperity.

Sorry for resisting correction for that short time.

He did answer the question he posed, which was "What is the expected fraction of girls in a population [of N families]?" It's not an unmeasurably-small hair. It depends on N. When N=4, the expected fraction is about .46.

Here is the exact solution for the expected value of `G/(G+B)` with `k` families. From numerical calculation with `k` up to 150, it looks like the discrepancy `0.5 - g/(g+b)` approaches `0.25/k` (from below) as `k` goes to infinity, which is certainly mysterious.

(The expected value of `G-B` is always 0, though, so I don't know what you mean by an excess of 0.3.)

So for a reasonably-sized country of 1 million people, we're looking at a ratio of B/(B+G) = 0.50000025? I'll buy that.

And the 0.3 was a screwup on my part (my mistaken reasoning is described in a cousin of this post).

Funny though that the correct answer happens to be really close to my completely erroneous answer. It has the same scaling, the same direction, and similar magnitude (0.25/k instead of 0.3/k).

Some of those power laws have to break down. If you find a storm of 10^30 J, it isn't Terrestrial weather any more. Similarly, if we ever had 10000% growth in one year, that's probably because we hit the singularity and all bets are off.

The models may incorporate those divergences, but only because they're wrong.

This post is about two problems. The first half of the post is about errors caused by using averages. This has a simple solution: don't use averages. The second half of the post is about understanding tail risk. This is a hard problem, but I don't think it has much to do with the first problem.

Just don't use averages.

Why do you care about averages?

In almost all situations, the median is a better single number summary of a random variable. For example, the median sex ratio is more salient than the mean sex ratio. Mean temperature change is irrelevant. Median temperature change is important. More important is the probability of a catastrophically large change. So answer that question. And answer both questions! You can almost always afford more than a single number to summarize your random variable.

Once you know that the question is the size of the tail, it should be clear that the mean is irrelevant. Even if you do care about the mean, as by fiat in the random series, it is better not to just sample from it, but to understand the whole distribution. If you do that, you'll see that it is skewed, which should make you nervous.

Both parts are about the hazards of trusting results from computer simulations. Putting the two parts together shows the connection between divergent series and tail risk. When a series inside the computation changes from convergent to divergent, you then have tail risk, and the simulation's trustworthiness degrades suddenly, yet in a way that may be difficult to notice.

If you are a rational agent, you work with averages, because you want to maximize expected utility. In the kinds of simulations I mentioned, it's especially important to look at the average rather than the median, because most of the harm in the real world (from economic busts, wars, climate change, earthquakes, superintelligences) comes from the outliers. In some cases, the median behavior is negligible. The average is a well-defined way of getting at that, while "probability of a catastrophically large change" is not well-defined without an arbitrary threshold for "catastrophic".

Caring about expected utility doesn't mean you should care about expected anything else. Don't take averages early. In particular, the expected ratio of girls to boy is not the ratio of expected boys to expected girls. Similarly, "expected rise in water level per year" is irrelevant because utility is not linear in sea level.

True. Computing average utility is what's important. But just using the median as you suggested doesn't let you compute average utility.

PDF.

Similar weird things happen for the Cauchy distribution (whose probability density function is proportional to 1/(1+x^2)), which is symmetric around 0 but does *not* have mean 0, because the integral defining the mean doesn't converge.

Exercise: what do you expect to happen if you try to find the mean of the Cauchy distribution by simulation?
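Here's one way to do the exercise in Python (my own sketch, drawing standard Cauchy samples via the inverse-CDF trick tan(pi*(U - 1/2))): the sample medians cluster tightly around 0, while the sample means wander wildly, because the mean of n Cauchy samples is itself Cauchy-distributed.

```python
import math
import random

def mean_and_median(n, seed):
    """Sample mean and median of n standard Cauchy draws."""
    rng = random.Random(seed)
    xs = sorted(math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n))
    return sum(xs) / n, xs[n // 2]

results = [mean_and_median(100_000, seed) for seed in range(5)]
for mean, median in results:
    print(f"mean={mean:10.3f}   median={median:8.4f}")
```

Taking more samples doesn't help: the sample mean never settles down, no matter how large n gets.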

From the same source:

Now I'm wondering if there is a symmetric distribution where the sample mean is a strictly *worse* estimator of the median than a single observation.

Off the cuff: it's probably a random walk.

Edit: It's now pretty clear to me that's false, but plotting the ergodic means of several "chains" seems like a good way to figure it out.

Edit 2: In retrospect, I should have predicted that. If anyone is interested, I can post some R code so you can see what happens.

My expectation, written down before seeing more than one (naive) answer, is: greater than 50%. But not necessarily by a lot at first. Continuing it on indefinitely, if our population is sufficiently large, will eventually result in a population dominated by females, depending heavily on the details of the distribution of certain abnormalities (aka is there even one two-headed or two-tailed coin, and if not, then what is the exact nature of the graph in that region, at which point it becomes ridiculously complicated...)

Terribly sloppy notation follows. E(genetic conditions): (chance of giving female birth > male birth). By (having children until male born): (amplifies the prevalence of such conditions in birthing populations.)

But then, I've heard this puzzle before and have had a LONG time to muse about it. 50% is wrong if we're talking about beings with a biology resembling our own. :P

50% is the correct expected distribution, however, if they're using perfect quantum coinflips.

aaanndd...

(please forgive the spelling errors; i just looked at the clock.)

Can't we get results for all practical purposes by setting some epsilon and saying "we are disregarding the highest epsilon percent of results"?

For most practical purposes, I would suggest that a reasonable pEpsilon would be around 10.

This bridge won't collapse under 90% of expected loads!

pEpsilon: the p denotes the negative base-10 logarithm of whatever follows.

This bridge isn't expected to collapse under (100 - 10^-10)% = 99.9999999999% of expected loads, despite the fact that there's no upper limit on the loads to which we are willing to subject it!

If we give it a load of one ton plus an X, where X has a 50% chance of being equal to one ton plus another X, that bridge has to handle about 34 tons for a pEpsilon of 10 if my math is correct.
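Checking that figure (a one-liner in Python): under this load rule, P(load > n tons) = 2^-n, so the design load is the smallest n with 2^-n <= 10^-10.

```python
import math

# Smallest n with 2^-n <= 10^-10, i.e. n >= 10 / log10(2).
design_tons = math.ceil(10 / math.log10(2))
print(design_tons)  # 34
```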

I used an epsilon of greater than .1% when figuring expected outcomes of L5R dice rolls; I simply said "Assuming no one die explodes more than three times, this is the expected result of this roll." (Each d10 has a 10% chance of exploding, resulting in a value of 10+reroll recursively; epsilon is less than .1% because I assumed that no dice out of several would come up ten three times in a row, and nontrivial to figure because different numbers of dice were rolled each time)

Don't you mean exponent, not logarithm? I'm more used to seeing this concept as e, like 1e-10.

The primary trouble with this suggestion is this problem; most practical models will not be able to actually distinguish between 99.99th percentile events and 99.9999th percentile events, let alone 1-1e-10 quantile events.

No, I meant "-log10(X)=pX", in the general sense. That is standard use in chemistry and other fields that regularly deal with very small numbers.

And if your model can't distinguish between something that happens one in ten thousand times and something that happens one in a million times, you aren't simulating a probabilistic infinite series, and you can make a deterministic conclusion about your worst-case scenario.

The goal is to be able to calculate things currently relegated to simulation.

Ah, okay, that's cleared up my linguistic confusion; thanks!

Sort of. These sorts of simulation projects go on in my department, and so I'm familiar with them. For example, a person that used to work down the hall was working on uncertainty quantification for models of nuclear power plants- we have a relatively good idea of how likely any particular part is to break, and how to simulate what happens if those break, and we can get out a number that says "we think there's a 1.3e-9 chance per year per plant of a catastrophic event," for example, but we want to get out the number that says "if our estimate of the rate at which valves fail is off by 5%, our final estimate will be off by X%", and use that to get a distribution on the final catastrophe estimate. (This will give us a better sense of total risk, as well as where we should focus our experiments and improvements.)

So they can say "our model tells us that, with these point inputs, this happens once in a billion times," but they can't yet say "with our model and these input distributions, the chance of this happening more than once in a million times is less than one in a thousand," which must be true for the first statement to be useful as an upper bound of the estimate (rather than an expected value of the estimate).

It's not clear to me what you mean by this, since I see calculation as a subset of simulation. If you mean we'd like to analytically integrate over all unknowns, even if we had the integration power (which we can't), we don't have good enough estimates of the uncertainties for that to be all that much more meaningful than our current simulations. With only, say, a thousand samples, it may be very difficult to determine the thickness of the tails of the population distribution, but the thickness of the tails may determine the behavior of the expected value / the chance of catastrophe. These problems show up at much lower quantiles, and it's not clear to me that just ignoring events more rare than a certain quantile will give us useful results.

To use the bridge example again- for the 'expected load' distribution to be faithful above the 99.99th percentile, that would require that we be surprised less than once every thirty years. I don't think there are traffic engineers with that forecasting ability, and I think their forecasting ability starts to break down more quickly than the rare events we're interested in, so we have to tackle these problems somehow.

Well, when I was working on a S5W/S3G (MTS 635, SSBN 732 blue) power plant, our baseline "end of the world" scenario started with "a non-isolateable double-ended shear of a main coolant loop". (half of the power plant falls off). I can't begin to estimate the likelihood of that failure, but I think quantum mechanics can.

If classical mechanics gives you a failure rate that has uncertainty, you can incorporate that uncertainty into your final uncertainty: "We believe it is four nines or better that this type of valve fails in this manner with this frequency or less."

And at some point, you don't trust the traffic engineer to guess the load- you post a load limit on the bridge.

Why not? Can't we integrate over all of the input distributions, and compare the *total volume of input distributions with failure chance greater than one in N* with the *total volume of all input distributions*?

The impression I got was that this is the approach that they would take with infinite computing power, but that it took a significant amount of time to determine if any particular combination of input variables would lead to a failure chance greater than one in N, meaning normal integration won't work. There are a couple of different ways to attack that problem, each making different tradeoffs.

If each data point is prohibitively expensive, then the only thing I can suggest is limiting the permissible input distributions. If that's not possible, I think the historical path is to continue to store the waste in pools at each power plant while future research and politics is done on the problem.

"Logarithm" and "exponent" can name the same thing as viewed from different perspectives. Example: 3 is the base-10 logarithm of 1000, which is another way of saying that 10^3 = 1000 (an expression in which 3 is the exponent).

I think it would be helpful to explicitly write out what you're calculating mathematically, e.g.

\sum_{k=1}^{\infty} \frac{2^k}{k} \cdot P[\text{term}~k~\text{appears}] = \sum_{k=1}^{\infty} \frac{2^k}{k} \cdot 2^{-(k-1)} = 2 \sum_{k=1}^{\infty} \frac{1}{k}

just to make it clear that this is (twice) the harmonic series. I know it's described in English, but the math is often helpful. (For posting latex, see also: http://lwlatex.appspot.com/. I don't know if that's the most up-to-date method, but it works.)