LGS

Yes. But as far as I can see this isn't of any particular importance to this discussion. Why do you think it is?

It's the key to my point, but you're right that I should clarify the math here. Consider this part:

Actually, a frequentist can just keep collecting more data until they get p<0.05, then declare the null hypothesis to be rejected. No lying or suppression of data required. They can always do this, even if the null hypothesis is true: After collecting $n_1$ data points, they have a 0.05 chance of seeing p<0.05. If they don't, they then collect $n_2$ more data points, where $n_2$ is big enough that whatever happened with the first $n_1$ data points makes little difference to the p-value, so there's still about a 0.05 chance that p<0.05. If that doesn't produce a rejection, they collect $n_3$ more data points, and so on until they manage to get p<0.05, which is guaranteed to happen eventually with probability 1.

This is true for one hypothesis. It is NOT true if you know the alternative hypothesis. That is to say: suppose you are checking the p-value BOTH for the null hypothesis bias=0.5, AND for the alternate hypothesis bias=0.55. You check both p-values and see which is smaller. Now it is no longer true that you can keep collecting more data until your desired hypothesis wins; if the truth is bias=0.5, then after enough flips, the alternative hypothesis will never win again, and will always have an astronomically small p-value.

To repeat: yes, you can disprove bias=0.5 with p<0.05; but at the time this happens, the alternative hypothesis of bias=0.55 might be disproven at p<10^{-100}. You are no longer guaranteed to win when there are two hypotheses rather than one.
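
To make the two-hypothesis point concrete, here is a minimal sketch. The flip counts are hypothetical numbers chosen for illustration, and the p-values use a normal approximation to the binomial, which is an assumption of this sketch rather than anything from the original post.

```python
import math

def two_sided_p(heads, n, bias):
    """Two-sided p-value for `heads` out of `n` flips under `bias`,
    using the normal approximation to the binomial (an assumption here)."""
    mean = n * bias
    sd = math.sqrt(n * bias * (1 - bias))
    z = abs(heads - mean) / sd
    return math.erfc(z / math.sqrt(2))

# Hypothetical outcome: a slight excess of heads after many flips.
n, heads = 100_000, 50_320

p_null = two_sided_p(heads, n, 0.50)  # p-value against bias=0.5
p_alt = two_sided_p(heads, n, 0.55)   # p-value against bias=0.55

print(f"p-value under bias=0.50: {p_null:.3g}")
print(f"p-value under bias=0.55: {p_alt:.3g}")
```

In this example bias=0.5 is rejected at the 0.05 level, yet the same data reject bias=0.55 far more decisively, so the alternative does not "win".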

But they *aren't* guaranteed to eventually get a Bayesian to think the null hypothesis is likely to be false, when it is actually true.

Importantly, this is false! This statement is wrong if you have only one hypothesis rather than two.

More specifically, I claim that if a sequence of coin flip outcomes disproves bias=0.5 at some p-value p, then for the same sequence of coin flips, there exists a bias b such that the likelihood ratio between bias b and bias 0.5 is on the order of $1/p$. I'm not sure what the exact constant in the big-O notation is (I was trying to calculate it, and I think it's at most 10). Suppose it's 10. Then if you have p=0.001, you'll have likelihood ratio 100:1 for *some* bias.

Therefore, to get the likelihood ratio as high as you wish, you could employ the following strategy. First, flip coins until the p value is very low, as you described. Then stop, and analyze the sequence of coin flips to determine the special bias b in my claimed theorem above. Then publish a paper claiming "the bias of the coin is b rather than 0.5, here's my super high likelihood ratio". This is *guaranteed* to work (with enough coinflips).

(Generally, if the number of coin flips is N, the bias b will be on the order of $1/2 \pm 1/\sqrt{N}$, so it will be pretty close to 1/2; but once again, this is no different from what happens in the frequentist case, because to ensure the p-value is small you'll have to accept the effect size being small.)
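
The strategy above can be sketched as follows. This is an illustrative sketch, not the calculation referred to in the comment: it assumes the "special" bias is simply the observed frequency of heads (the maximum-likelihood bias), uses a normal approximation for the p-value, and uses hypothetical flip counts. It also assumes the sequence contains at least one head and one tail.

```python
import math

def two_sided_p_fair(heads, n):
    """Two-sided p-value against bias=0.5 (normal approximation, an assumption)."""
    z = abs(heads - n / 2) / math.sqrt(n / 4)
    return math.erfc(z / math.sqrt(2))

def best_alternative(heads, n):
    """Pick the alternative bias after seeing the data.

    Returns the observed frequency b and the likelihood ratio of bias b
    versus bias 0.5 on the same flips. Requires 0 < heads < n.
    """
    b = heads / n
    log_lr = heads * math.log(b / 0.5) + (n - heads) * math.log((1 - b) / 0.5)
    return b, math.exp(log_lr)

# Hypothetical stopping point: 5,170 heads in 10,000 flips.
n, heads = 10_000, 5_170
p_value = two_sided_p_fair(heads, n)
b_hat, max_lr = best_alternative(heads, n)

print(f"p-value against 0.5: {p_value:.3g}")
print(f"best alternative bias: {b_hat}, likelihood ratio vs 0.5: {max_lr:.0f}:1")
```

For large N the log of this ratio is roughly z²/2, where z is the z-score against bias=0.5, so a small p-value and a large after-the-fact likelihood ratio go together.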

This is silly. Obviously, Yudkowsky isn't going to go off on a tangent about all the ways people can lie indirectly, and how a Bayesian ought to account for such possibilities - that's not the topic. In a scientific paper, it is implicit that all relevant information must be disclosed - not doing so is lying. Similarly, a scientific journal must ethically publish papers based on quality, not conclusion. They're lying if they don't.

You're welcome to play semantic games if you wish, but that's not how most people use the word "lying" and not how most people understand Yudkowsky's post.

By this token, p-values also can never be hacked, because doing so is lying. (I can just define lying to be anything that hacks the p-values, which is what you seem to be doing here when you say that *not publishing a paper* amounts to lying.)

You misunderstand. H is some hypothesis, not necessarily about coins. Your goal is to convince the Bayesian that H is true with probability greater than 0.9. This has nothing to do with whether some coin lands heads with probability greater than 0.9.

You're switching goalposts. Yudkowsky was talking exclusively about how I can affect the likelihood ratio. You're switching to talking about how I can affect your posterior. Obviously, your posterior depends on your prior, so with a sufficiently good prior you'll be right about everything. This is why I didn't understand you originally: you (a) used H for "hypothesis" instead of for "heads" as in the main post; and (b) used 0.9 for a posterior probability instead of using 10:1 for a likelihood ratio.

I don't think so, except, as I mentioned, that you obviously will do an experiment that could conceivably give evidence meeting the threshold - I suppose that you can think about exactly which experiment is best very carefully, but that isn't going to lead to anyone making wrong conclusions.

To the extent you're saying something true here, it is also true for p values. To the extent you're saying something that's not true for p values, it's also false for likelihood ratios (if I get to pick the alternate hypothesis).

The person evaluating the evidence knows that you're going to try multiple colors.

No, they don't. That is precisely the point of p-hacking.

But this has nothing to do with the point about the stopping rule for coin flips not affecting the likelihood ratio, and hence the Bayesian conclusion, whereas it does affect the p-value.

The stopping rule is not a central example of p-hacking and never was. But even for the stopping rule for coin flips, if you let me choose the alternate hypothesis instead of keeping it fixed, I can manipulate the likelihood ratio. And note that this is the more realistic scenario in real experiments! If I do an experiment, you generally don't know the precise alternate hypothesis in advance -- you want to test if the coin is fair, but you don't know precisely what bias it will have if it's unfair.

If we fix the two alternate hypotheses in advance, and if I have to report all data, then I'm reduced to only hacking by choosing the experiment that maximizes the chance of luckily passing your threshold via fluke. This is unlikely, as you say, so it's a weak form of "hacking". But this is also what I'm reduced to in the frequentist world! Bayesianism doesn't actually help. The key was (a) you forced me to disclose all data, and (b) we picked the alternate hypothesis in advance instead of only having a null hypothesis.

(In fact I'd argue that likelihood ratios are fundamentally frequentist, philosophically speaking, so long as we have two fixed hypotheses in advance. It only becomes Bayesian once you apply it to your priors.)

If you say that you are reporting all your observations, but actually report only a favourable subset of them, and the Bayesian for some reason assigns low probability to you deceiving them in this way, when actually you are deceiving them, then the Bayesian will come to the wrong conclusion. I don't think this is surprising or controversial.

OK but please attempt to square this with Yudkowsky's claim:

Therefore likelihood functions can *never* be p-hacked by *any possible* clever setup without you outright lying, because you can't have any possible procedure that a Bayesian knows in advance will make them update in a predictable net direction.

I am saying that Yudkowsky is just plain wrong here, because omitting info is not the same as outright lying. And publication bias happens when the person omitting the info is not even the same one as the person publishing the study (null results are often never published).

This is just one way to p-hack a Bayesian; there are plenty of others, including the most common type of p-hack ever, the forking paths (e.g. this xkcd still works the same if you report likelihoods).

But I don't see how the Bayesian comes to a wrong conclusion if you truthfully report all your observations, even if they are taken according to some scheme that produces a distribution of likelihood ratios that is supposedly favourable to you. The distribution doesn't matter. Only the observed likelihood ratio matters.

I'm not sure what you mean by "wrong conclusion" exactly, but I'll note that your statement here is more-or-less also true for p-values. The main difference is that p-values try to only convince you the null hypothesis is false, which is an easier task; the likelihood ratio tries to convince you some *specific* alternate hypothesis has higher likelihood, which is necessarily a harder task.

Even with Eliezer's original setup, in which the only thing I can control is when to stop the coin flip, it is hard to get p<0.001. Moreover, if I do manage to get p<0.001, that same sequence of coins will have a likelihood ratio of something like 100:1 in favor of the coin having a mild bias, if my calculation is correct. A large part of Eliezer's trick in his program's simulation is that he looked at the likelihood ratio of 50% heads vs 55% heads; such a specific choice of hypotheses is much harder to hack than if you let me choose the hypotheses after I saw the coinflips (I may need to compare the 50% to 60% or to 52% to get an impressive likelihood ratio, depending on the number of coins I flipped before stopping).

For example, suppose you want to convince the Bayesian that H is true with probability greater than 0.9. Some experiments may never produce data giving a likelihood ratio extreme enough to produce such a high probability. So you don't do such an experiment, and instead do one that could conceivably produce an extreme likelihood ratio. But it probably won't, if H is not actually true. If it does produce strong evidence for H, the Bayesian is right to think that H is probably true, regardless of your motivations (as long as you truthfully report all the data).

This is never the scenario, though. It is very easy to tell that the coin is not 90% biased no matter what statistics you use. The scenario is usually that my drug improves outcomes a little bit, and I'm not sure how much exactly. I want to convince you it improves outcomes, but we don't know in advance how much exactly they improve. Perhaps we set a minimum threshold, like the coin needs to be biased at least 55% or else we don't approve the drug, but even then there's no maximum threshold, so there is no fixed likelihood ratio we're computing. Moreover, we agree in advance on some fixed likelihood ratio that you need to reach to approve my drug; let's say 20:1 in favor of *some* bias larger than 55%. Then I can get a lot of mileage out of designing my experiment very carefully to target that specific threshold (though of course I can never guarantee success, so I have to try multiple colors of jelly beans until I succeed).


The narrow point regarding likelihood ratios is correct, but the broader point in Eliezer's posts is arguably wrong. The issue with p-hacking is in large part selectively reporting results, and you don't get out of that by any amount of repeating the word "Bayesian". (For example, if I flip 10 coins but only show you the heads, you'll see HHHH, and no amount of Bayesian-ness will fix the problem; this is how publication bias works.)
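
A minimal sketch of the selective-reporting example. The flip sequence and the alternative bias of 0.75 are hypothetical choices for illustration only.

```python
def likelihood_ratio(flips, alt_bias, null_bias=0.5):
    """Likelihood ratio of `alt_bias` versus `null_bias` for a string of H/T flips."""
    ratio = 1.0
    for flip in flips:
        if flip == "H":
            ratio *= alt_bias / null_bias
        else:
            ratio *= (1 - alt_bias) / (1 - null_bias)
    return ratio

# Hypothetical experiment: 10 flips, of which 4 are heads.
all_flips = "HTTHTHTTHT"
# Selective reporting: only the heads are shown to the reader.
reported = "".join(flip for flip in all_flips if flip == "H")

lr_reported = likelihood_ratio(reported, alt_bias=0.75)
lr_full = likelihood_ratio(all_flips, alt_bias=0.75)

print(f"reported flips: {reported}")
print(f"likelihood ratio on reported flips: {lr_reported}")
print(f"likelihood ratio on all flips:      {lr_full}")
```

The reported flips favor the heads-biased hypothesis while the full data favor the fair coin, and computing a likelihood ratio instead of a p-value does nothing to reveal the omission.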

Aside from selective reporting, much of the problem with p-values is that there's a specific choice of threshold (usually 0.05). This is a problem with likelihood ratios also. Eliezer says

Therefore likelihood functions can *never* be p-hacked by *any possible* clever setup without you outright lying, because you can't have any possible procedure that a Bayesian knows in advance will make them update in a predictable net direction. For every update that we expect to be produced by a piece of evidence $E$, there's an equal and opposite update that we expect to probably occur from seeing $\neg E$.

The second sentence is true, but this only implies you cannot be p-hacked in expectation. I can still manipulate the probability that you'll pass any given likelihood ratio threshold, and therefore I can still p-hack to some extent if we are talking about passing a specific threshold (which is, after all, the whole point of the original concept of p-hacking).

Think about it like this: suppose I am gambling in a casino where every bet has expectation 0. Then, in expectation, I can never make money, no matter my strategy. However, suppose that I can get my drug approved by a regulator if I earn 10x my investment in this casino. I can increase my chances of doing this (e.g. I can get the chance up to 10% if I'm willing to lose all my money the rest of the time), or, if I'm stupid, I can play a strategy that never achieves this (e.g. I make some double-or-nothing 50/50 bet). So I still have incentives to "hack", though the returns aren't infinite.
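
The casino analogy can be simulated directly. This is a rough sketch under assumed rules that are not in the original comment: every bet is a fair even-money coin flip, the "patient" strategy stakes one unit at a time until it hits 10x or goes broke, and the seed and trial count are arbitrary.

```python
import random

def patient_strategy(rng, start=1, target=10):
    """Stake one unit on fair even-money bets until broke or at the target."""
    bankroll = start
    while 0 < bankroll < target:
        bankroll += 1 if rng.random() < 0.5 else -1
    return bankroll >= target

def single_double_or_nothing(rng, start=1, target=10):
    """Make one double-or-nothing bet with the whole bankroll, then stop."""
    bankroll = 2 * start if rng.random() < 0.5 else 0
    return bankroll >= target

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 20_000

p_patient = sum(patient_strategy(rng) for _ in range(trials)) / trials
p_single = sum(single_double_or_nothing(rng) for _ in range(trials)) / trials

print(f"patient strategy reaches 10x in about {p_patient:.1%} of runs")
print(f"single double-or-nothing reaches 10x in {p_single:.1%} of runs")
```

Both strategies have zero expected profit, but only one of them ever reaches the 10x target, and it does so in roughly one run in ten.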

Basically, Eliezer is right that if I have to report all my data, I cannot fool you in expectation. He neglects that I can still manipulate the distribution over the possible likelihood ratios, so I still have some hacking ability. He also neglects the bigger problem, which is that I don't have to report all my data (for example, due to publication bias).
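
Here is a sketch of what manipulating the distribution over likelihood ratios can look like. The details are assumptions made for illustration: the coin is truly fair, the fixed alternative is bias=0.55, the approval threshold is 20:1, and the budget is 1,000 flips. It compares reporting after a fixed number of flips against stopping as soon as the threshold is crossed.

```python
import math
import random

LOG_STEP_HEADS = math.log(0.55 / 0.50)  # log-likelihood-ratio change on heads
LOG_STEP_TAILS = math.log(0.45 / 0.50)  # log-likelihood-ratio change on tails
LOG_THRESHOLD = math.log(20)            # assumed 20:1 approval threshold

def passes_threshold(rng, max_flips, stop_early):
    """Flip a fair coin and track the likelihood ratio of 0.55 vs 0.5.

    With stop_early=True the experiment ends the moment the ratio reaches
    the threshold; otherwise only the ratio after max_flips flips counts.
    """
    log_lr = 0.0
    for _ in range(max_flips):
        log_lr += LOG_STEP_HEADS if rng.random() < 0.5 else LOG_STEP_TAILS
        if stop_early and log_lr >= LOG_THRESHOLD:
            return True
    return log_lr >= LOG_THRESHOLD

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 2_000

p_fixed = sum(passes_threshold(rng, 1_000, False) for _ in range(trials)) / trials
p_stop = sum(passes_threshold(rng, 1_000, True) for _ in range(trials)) / trials

print(f"fixed 1000 flips:          passes 20:1 in {p_fixed:.2%} of runs")
print(f"stop when threshold is hit: passes 20:1 in {p_stop:.2%} of runs")
```

Under the fair coin, the chance of ever reaching 20:1 against a fixed alternative cannot exceed 1/20, so the returns are bounded. Within that bound, though, the choice of stopping rule changes how often the threshold is passed.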

For purposes of causality, negative correlation is the same as positive. The only distinction we care about, there, is zero or nonzero correlation.)

That makes sense. I was wrong to emphasize the "even negatively", and should instead stick to something like "slightly negatively". You have to care about large vs. small correlations or else you'll never get started doing any inference (no correlations are ever exactly 0).

I don't think problem 1 is so easy to handle. It's true that I'll have a hard time finding a variable that's perfectly independent of swimming but correlated with camping. However, I don't need to be perfect to trick your model.

Suppose every 4th of July, you go camping at one particular spot that does not have a lake. Then we observe that July 4th correlates with camping but does not correlate with swimming (or even negatively correlates with swimming). The model updates towards swimming causing camping. Getting more data on these variables only reinforces the swimming->camping direction.

To update in the other direction, you need to find a variable that correlates with swimming but not with camping. But what if you never find one? What if there's no simple thing that causes swimming? Say I go swimming based on the roll of a die, but you don't get to ever see the die. Then you're toast!

Slightly more generally, for instance, a combination of variables which correlates with low neonatal IQ but not with lead, conditional on some other variables, would suffice (assuming we correctly account for multiple hypothesis testing). And the "conditional on some other variables" part could, in principle, account for SES, insofar as we use enough variables to basically determine SES to precision sufficient for our purposes.

Oh, sure, I get that, but I don't think you'll manage to do this, in practice. Like, go ahead and prove me wrong, I guess? Is there a paper that does this for anything I care about? (E.g. exercise and overweight, or lead and IQ, or anything else of note). Ideally I'd get to download the data and check if the results are robust to deleting a variable or to duplicating a variable (when duplicating, I'll add noise so that the variables aren't exactly identical).

If you prefer, I can try to come up with artificial data for the lead/IQ thing in which I generate all variables to be downstream of non-observed SES but in which IQ is also slightly downstream of lead (and other things are slightly downstream of other things in a randomly chosen graph). I'll then let you run your favorite algorithm on it. What's your favorite algorithm, by the way? What's been mentioned so far sounds like it should take exponential time (e.g. enumerating over all orderings of the variables, drawing the Bayes net given the ordering, and then picking the one with the fewest parameters -- that takes exponential time).

Thanks for linking to Yudkowsky's post (though it's a far cry from cutting to the chase... I skipped a lot of superfluous text in my skim). It did change my mind a bit, and I see where you're coming from. I still disagree that it's of much practical relevance: in many cases, no matter how many more variables you observe, you'll never conclude the true causational structure. That's because it strongly matters *which* additional variables you'll observe.

Let me rephrase Yudkowsky's point (and I assume also your point) like this. We want to know if swimming causes camping, or if camping causes swimming. Right now we know only that they correlate. But if we find another variable that correlates with swimming and is independent of camping, that would be evidence towards "camping causes swimming". For example, if swimming happens on Tuesdays but camping is independent of Tuesdays, it's suggestive that camping causes swimming (because if swimming caused camping, you'd expect the Tuesday/swimming correlation to induce a Tuesday/camping correlation).

First, I admit that this is a neat observation that I hadn't fully appreciated or known how to articulate before reading the article. So thanks for that. It's food for thought.

Having said that, there are still a lot of problems with this story:

- First, unnatural variables are bad: I can always take something like "an indicator variable for camping, except if swimming is present, negate this indicator with probability p". This variable, call it X, can be made to be uncorrelated with swimming by picking p correctly, yet it will be correlated with camping; hence, by adding it, I can cause the model to say swimming causes camping. (I think I can even make the variable independent of swimming instead of just uncorrelated, but I didn't check.) So to trust this model, I'd need some assumption that the variables are somehow "natural": not cherry-picked, not handed to me by some untrusted source with a stake in the matter.
- In practice, it can be hard to find any good variables that correlate with one thing but not the other. For example, suppose you're trying to establish "lead exposure in gestation causes low IQ". Good luck trying to find something natural that correlates with low neonatal IQ but not with lead; everything will be downstream of SES. And you don't get to add SES to your model, because you never observe it directly!
- More generally, real life has these correlational clusters, these "positive manifolds" of everything-correlating-with-everything. Like, consumption of all "healthy" foods correlates together, and also correlates with exercise, and also with not being overweight, and also with longevity, etc. In such a world, adding more variables will just never disentangle the causational structure at all, because you never find yourself adding a variable that's correlated with one thing but not another.
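
The constructed variable X from the first bullet can be checked with exact arithmetic. This is a sketch with made-up probabilities (swimming half the time, camping with probability 4/5 when swimming and 2/5 otherwise). The formula for p comes from requiring the mean of X to be the same with and without swimming, and is part of this sketch rather than the original comment.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution of swimming and camping.
P_SWIM = Fraction(1, 2)
P_CAMP_GIVEN_SWIM = {1: Fraction(4, 5), 0: Fraction(2, 5)}

# Negation probability chosen so that E[X | swim] equals E[X | no swim].
c, a = P_CAMP_GIVEN_SWIM[1], P_CAMP_GIVEN_SWIM[0]
p_flip = (c - a) / (2 * c - 1)

def joint():
    """Yield (probability, swim, camp, x) for every possible outcome."""
    for swim, camp, flipped in product((0, 1), repeat=3):
        if flipped and not swim:
            continue  # the indicator is only ever negated when swimming
        prob = P_SWIM if swim else 1 - P_SWIM
        prob *= P_CAMP_GIVEN_SWIM[swim] if camp else 1 - P_CAMP_GIVEN_SWIM[swim]
        if swim:
            prob *= p_flip if flipped else 1 - p_flip
        x = 1 - camp if flipped else camp
        yield prob, swim, camp, x

def covariance(i, j):
    """Exact covariance between columns i and j of the joint distribution."""
    rows = list(joint())
    mean_i = sum(row[0] * row[i] for row in rows)
    mean_j = sum(row[0] * row[j] for row in rows)
    mean_ij = sum(row[0] * row[i] * row[j] for row in rows)
    return mean_ij - mean_i * mean_j

cov_x_swim = covariance(1, 3)
cov_x_camp = covariance(2, 3)

print(f"negation probability p = {p_flip}")
print(f"cov(X, swimming) = {cov_x_swim}")
print(f"cov(X, camping)  = {cov_x_camp}")
```

Because X and swimming are both binary here, zero covariance already implies independence, so in this toy setup X is independent of swimming while still correlated with camping.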

Tired and swimming are not independent, but that's a correlational error. You can indeed get a more accurate picture of the correlations, given more evidence, but you cannot conclude causational structure from correlations alone.

How about this: would any amount of observation ever cause one to conclude that camping causes swimming rather than the reverse? The answer is clearly no: they are correlated, but there's no way to use the correlation between them (or their relationships to any other variables) to distinguish between swimming causing camping and camping causing swimming.

What you seemed to be saying is that a certain rotation ("one should rotate them so that the resulting axes have a sparse relationship with the original cases") has "actually been used" and "it basically assumes that causality flows from variables with higher kurtosis to variables with lower kurtosis".

I don't see what the kurtosis-maximizing algorithm has to do with the choice of rotation used in factor analysis or PCA.

A platonically perfect Bayesian given complete information and with accurate priors cannot be substantially fooled. But once again this is true regardless of whether I report p-values or likelihood ratios. p-values are fine.