The Universal Medical Journal Article Error

(Oops. I forgot this was moved to Discussion.)

TL;DR:  When people read a journal article that concludes, "We have proved that it is not the case that for every X, P(X)", they generally credit the article with having provided at least weak evidence in favor of the proposition ∀x !P(x).  This is not necessarily so.

 

Authors using statistical tests are making precise claims, which must be quantified correctly.  Pretending that all quantifiers are universal because we are speaking English is one error.  It is not, as many commenters are claiming, a small error.  ∀x !P(x) is very different from !∀x P(x).

 

A more-subtle problem is that when an article uses an F-test on a hypothesis, it is possible (and common) to fail the F-test for P(x) with data that supports the hypothesis P(x).  The 95% confidence level was chosen for the F-test in order to count false positives as much more expensive than false negatives.  Applying it therefore removes us from the world of Bayesian logic.  You cannot interpret the failure of an F-test for P(x) as being even weak evidence for not P(x).

 

I used to teach logic to undergraduates, and they regularly made the same simple mistake with logical quantifiers.  Take the statement "For every X there is some Y such that P(X,Y)" and represent it symbolically:


∀x∃y P(x,y)

 

Now negate it:

!∀x∃y P(x,y)

 

You often don't want a negation to be outside quantifiers.  My undergraduates would often just push it inside, like this:


∀x∃y !P(x,y)

 

If you could just move the negation inward like that, then these claims would mean the same thing:

 

A) Not everything is a raven:  !∀x raven(x)

B) Everything is not a raven:  ∀x !raven(x)

 

To move a negation inside quantifiers, flip each quantifier that you move it past.


!∀x∃y P(x,y) = ∃x!∃y P(x,y) = ∃x∀y !P(x,y)
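These equivalences can be checked mechanically. Here is a brute-force check over a tiny finite domain (a sketch in Python; the three-element domain and the predicate encoding are purely illustrative): it confirms that !∀x∃y P(x,y) always agrees with ∃x∀y !P(x,y), and that the naive push-the-negation-inside form ∀x∃y !P(x,y) does not.

```python
# Brute-force check on a tiny domain that !∀x∃y P(x,y) equals ∃x∀y !P(x,y),
# and that naively pushing the negation inside (∀x∃y !P(x,y)) is NOT equivalent.
from itertools import product

domain = [0, 1, 2]

def check(P):
    not_forall_exists = not all(any(P(x, y) for y in domain) for x in domain)
    exists_forall_not = any(all(not P(x, y) for y in domain) for x in domain)
    naive_push_inside = all(any(not P(x, y) for y in domain) for x in domain)
    return not_forall_exists, exists_forall_not, naive_push_inside

# Try every possible predicate on the 3x3 domain (2^9 = 512 of them).
mismatch_correct = mismatch_naive = 0
for bits in product([False, True], repeat=len(domain) ** 2):
    table = {pair: b for pair, b in zip(product(domain, domain), bits)}
    a, b, c = check(lambda x, y: table[(x, y)])
    if a != b:
        mismatch_correct += 1
    if a != c:
        mismatch_naive += 1

print(mismatch_correct)  # 0: the flipped-quantifier form agrees on every predicate
print(mismatch_naive)    # > 0: the naive form disagrees on many predicates
```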

 

 

Here are the findings of a 1981 article [1] from JAMA Psychiatry (formerly Archives of General Psychiatry), back in the days when the medical establishment was busy denouncing the Feingold diet:

 

Previous studies have not conclusively demonstrated behavioral effects of artificial food colorings ...  This study, which was designed to maximize the likelihood of detecting a dietary effect, found none.

 

Now pay attention; this is the part everyone gets wrong, including most of the commenters below.

 

The methodology used in this study, and in most studies, is as follows:

  • Divide subjects into a test group and a control group.
  • Administer the intervention to the test group, and a placebo to the control group.
  • Take some measurement that is supposed to reveal the effect they are looking for.
  • Compute the mean and standard deviation of that measure for the test and control groups.
  • Do either a t-test or an F-test of the hypothesis that the intervention causes a statistically-significant effect.
  • If the test succeeds, conclude that the intervention causes a statistically-significant effect (CORRECT).
  • If the test does not succeed,
    • Reject that hypothesis.
    • Conclude that the intervention does not cause any effect (ERROR).

People make the error because they forget to explicitly state what quantifiers they're using.  Both the t-test and the F-test work by assuming that every subject has the same response function to the intervention:

 

response = effect + normally distributed error


where the effect is the same for every subject.  If you don't understand why that is so, read the articles about the t-test and the F-test.  The tests compute a difference in magnitude of response such that, 95% of the time, if the measured effect difference is that large, the null hypothesis (that the responses of all subjects in both groups were drawn from the same distribution) is false.

 

ADDED:  People are making comments proving they don't understand how the F-test works.  This is how it works:  You are testing the hypothesis that two groups respond differently to food dye.

 

Suppose you measured the number of times a kid shouted or jumped, and you found that kids fed food dye shouted or jumped an average of 20 times per hour, and kids not fed food dye shouted or jumped an average of 17 times per hour.  When you run your F-test, you compute that, assuming all kids respond to food dye the same way, you need a difference of 4 to conclude with 95% confidence that the two distributions (test and control) are different.

 

If the food dye kids had shouted/jumped 21 times per hour, the study would conclude that food dye causes hyperactivity.  Because they shouted/jumped only 20 times per hour, it failed to prove that food dye affects hyperactivity.  You can only conclude that food dye affects behavior with 84% confidence, rather than the 95% you desired.

 

Finding that food dye affects behavior with 84% confidence should not be presented as proof that food dye does not affect behavior!

 

If half your subjects have a genetic background that makes them resistant to the effect, the threshold for the t-test or F-test will be much too high to detect that.  If 10% of kids become more hyperactive and 10% become less hyperactive after eating food coloring, such a methodology will never, ever detect it.  A test done in this way can only accept or reject the hypothesis that for every subject x, the effect of the intervention is different than the effect of the placebo.
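A toy simulation makes the heterogeneity point vivid (all parameters invented; assume 10% strong positive responders, 10% strong negative responders, and 80% non-responders):

```python
# Sketch of the heterogeneity problem above: if 10% of subjects respond +8,
# 10% respond -8, and 80% not at all, the group means barely move, so a test
# that assumes one shared effect size sees (almost) nothing. Numbers invented.
import random
import statistics

random.seed(1)

def simulate_child(dyed):
    baseline = random.gauss(17, 2)          # shouts/jumps per hour, no dye
    if not dyed:
        return baseline
    r = random.random()
    if r < 0.10:
        return baseline + 8                 # strong positive responder
    elif r < 0.20:
        return baseline - 8                 # strong negative responder
    return baseline                         # non-responder

treatment = [simulate_child(True) for _ in range(200)]
control = [simulate_child(False) for _ in range(200)]

diff = statistics.mean(treatment) - statistics.mean(control)
print(round(diff, 2))   # near zero, even though 20% of children respond strongly
```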


So.  Rephrased to say precisely what the study found:

 

This study tested and rejected the hypothesis that artificial food coloring affects behavior in all children.

 

Converted to logic (ignoring time):


!( ∀child ( eats(child, coloring) ⇨ behaviorChange(child) ) )

 

Move the negation inside the quantifier:


∃child !( eats(child, coloring) ⇨ behaviorChange(child) )

 

Translated back into English, this study proved:

 

There exist children for whom artificial food coloring does not affect behavior.

 

However, this is the actual final sentence of that paper:

 

The results of this study indicate that artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents.


Translated into logic:


!∃child ( eats(child, coloring) ⇨ hyperactive(child) )

 

or, equivalently,


∀child !( eats(child, coloring) ⇨ hyperactive(child) )

 

This refereed medical journal article, like many others, made the same mistake as my undergraduate logic students, moving the negation across the quantifier without changing the quantifier.  I cannot recall ever seeing a medical journal article prove a negation and not make this mistake when stating its conclusions.

 

A lot of people are complaining that I should just interpret their statement as meaning "Food colorings do not affect the behavior of MOST school-age children."

 

But they didn't prove that food colorings do not affect the behavior of most school-age children.  They proved that there exists at least one child whose behavior food coloring does not affect.  That isn't remotely close to what they have claimed.

 

For the record, the conclusion is wrong.  Studies that did not assume that all children were identical, such as studies that used each child as his or her own control by randomly giving them cookies containing or not containing food dye [2], or a recent study that partitioned the children according to single-nucleotide polymorphisms (SNPs) in genes related to food metabolism [3], found large, significant effects in some children or some genetically-defined groups of children.  Unfortunately, reviews failed to distinguish the logically sound from the logically unsound articles, and the medical community insisted that food dyes had no influence on behavior until thirty years after their influence had been repeatedly proven.

 

 

[1] Jeffrey A. Mattes & Rachel Gittelman (1981). Effects of Artificial Food Colorings in Children With Hyperactive Symptoms: A Critical Review and Results of a Controlled Study. Archives of General Psychiatry 38(6):714-718. doi:10.1001/archpsyc.1981.01780310114012.

 

[2] K.S. Rowe & K.J. Rowe (1994). Synthetic food coloring and behavior: a dose response effect in a double-blind, placebo-controlled, repeated-measures study. The Journal of Pediatrics 125(5 Pt 1):691-698.


[3] Stevenson, Sonuga-Barke, McCann et al. (2010). The Role of Histamine Degradation Gene Polymorphisms in Moderating the Effects of Food Additives on Children’s ADHD Symptoms. Am J Psychiatry 167:1108-1115.

189 comments

You claim that medical researchers are doing logical inference incorrectly. But they are in fact doing statistical inference and arguing inductively.

Statistical inference and inductive arguments belong in a Bayesian framework. You are making a straw man by translating them into a deductive framework.

Rephrased to say precisely what the study found:

This study tested and rejected the hypothesis that artificial food coloring causes hyperactivity in all children.

No. Mattes and Gittelman's finding is stronger than your rephrasing—your rephrasing omits evidence useful for Bayesian reasoners. For instance, they repeatedly pointed out that they “[studied] only children who were already on the Feingold diet and who were reported by their parents to respond markedly to artificial food colorings.” They claim that this is important because “the Feingold diet hypothesis did not originate from observations of carefully diagnosed children but from anecdotal reports on children similar to the ones we studied.” In other words, they are making an inductive argument:

  1. Most evidence for the Feingold diet hypothesis comes from anecdotal reports.
  2. Most of these anecdotal reports are mistaken.
  3. Thus, there is little evidence for the Feingold diet hypothesis.
  4. Therefore, the Feingold diet hypothesis is wrong.

If you translate this into a deductive framework, of course it will not work. Their paper should be seen in a Bayesian framework, and in this context, their final sentence

The results of this study indicate that artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents.

translates into a correct statement about the evidence resulting from their study.

This refereed medical journal article, like many others, made the same mistake as my undergraduate logic students, moving the negation across the quantifier without changing the quantifier. I cannot recall ever seeing a medical journal article prove a negation and not make this mistake when stating its conclusions.

They are not making this mistake. You are looking at a straw man.


Full-texts:

The number of upvotes for the OP is depressing.

It's a good example for forcing your toolset into every situation you encounter. If all you have is a hammer ...

Well, dammit, I wanted to delete this and rewrite above, but you can't delete comments anymore. This is not retracted, but I can't un-retract it.

You are wrong, and you have not learned to reconsider your logic when many smart people disagree with you.

You can delete retracted comments by reloading the page and clicking on a new delete icon that replaces the retract icon.

People can reply to any comments that they still see in their browser page, even though they've been "deleted", if the replier has not refreshed said browser page.

EDIT TO ADD: As I see that wedrifid also mentions below.

Possibly there is also a similar effect if the deleter hasn't refreshed his browser page.

Possibly there is also a similar effect if the deleter hasn't refreshed his browser page.

Possibly. Specifically it would be if you (as the example) had retracted the page then refreshed it (to get the 'delete' button available to you) and then there is an arbitrary period of time after which you click the delete button without first refreshing again. (Untested, but those are the circumstances under which it would be at all possible if the code is not specifically designed to prevent it.)

Why are you not sure of facts that are subject to easy experiments? (update: arundelo is correct)

Experiment clutters the venue, and being less blunt avoids the appearance of a status conflict.

If deletion is possible, there is very little clutter. If deletion is not possible, and the comment says "I can't figure out how to delete this," at least it discourages other people's experiments. But this thread is itself clutter, so I don't think that is your true rejection. As to bluntness, I conclude that my being less blunt caused you to confabulate bullshit.

PS - I experiment on the open thread.

As to bluntness, I conclude that my being less blunt caused you to confabulate bullshit.

On reflection, it is probably more accurate for me to say, "I wasn't interested in experimenting, including for concern that the experimenting would look low status, and I have higher preferred ways of acting low status."

As for my own choice not to be blunt, you are not correctly modelling my thought process.

In short, I gave two reasons for my action, and you might be right that one was confabulation, but not the one you identify as confabulation.

Why are you not sure of facts that are subject to easy experiments? (arundelo is wrong)

I have performed the experiment in question and it seems to support arundelo's claim. I am not able to remove this comment. At the very least it demonstrates that the experiment required to prove arundelo's fully general claim is false is not the 'easy' one.

Well, now I'm totally confused. Checking Eugine_Nier's account on ibiblio shows that the comment is missing. (Searching for the word "sarcasm" will get you to about when the comment took place, at least as of the date of this comment)

Well, now I'm totally confused. Checking Eugine_Nier's account on ibiblio shows that the comment is missing. (Searching for the word "sarcasm" will get you to about when the comment took place, at least as of the date of this comment)

See my investigation. Short answer: race condition.

Thanks actually experimenting. My beliefs were two months out of date. I stand by my objection to Tim's comment.

I'm not sure that's true. See here

It is possible that the comment was banned by a moderator rather than deleted by the author. (If so, it will still appear if you look at the user's comment page.)

After retraction EDIT: TimS. I can't seem to delete this comment even after refreshing.

As it happens, I remember what Eugine_Nier wrote, and I am certain it did not meet the local criteria for mod-blocking.

(Anonymous downvoter: What is it in wedrifid's post you'd like to see less of? Helpful commentary about the mechanics of this site is not on my list of things to downvote).

As it happens, I remember what Eugine_Nier wrote, and I am certain it did not meet the local criteria for mod-blocking.

Interesting. This suggests that a feature has changed at some point since the retraction-then-delete feature was first implemented. (I have memories of needing to be careful to edit the text to blank then retract so as to best emulate the missing 'delete' feature.)

I notice that I am confused. Investigates.

  • Testing deletion feature. Deletion of (grandparent) comment that you have already replied to: Fail. It is still not (usually) possible to delete comments with replies.
  • Check for moderator deletion. (ie. Moderator use of the ban feature, actual delete per se is extremely rare). Confirm absence of a reply on Eugine_Nier's page that fits that part of history. The comment is, indeed, deleted not banned.
  • Check timestamps for plausibility of race condition. Ahh. Yes. Tim, you replied to Eugine within 3 minutes of him writing the comment. This means that most likely Eugine deleted his message while you were writing your reply. Your comment was still permitted to be made despite the deleted parent. The reverse order may also be possible, depending on the details of implementation. Either way, the principle is the same.

ArisKatsaris suggests browser refresh, not timestamps, is the issue.

ArisKatsaris suggests browser refresh, not timestamps, is the issue.

He is describing the same phenomenon. The timestamps give an indication as to how likely the race condition is to occur based on the delays between GETs and POSTs. If the comments were a day apart I would have tentatively suggested "Perhaps one of you deleted or replied to a comments page that was old?". Whereas given that the timestamps were within 3 minutes I could more or less definitively declare the question solved.

Thanks. I'm not technologically fluent enough to tell the difference between what you said and what he said without the explanation.

For the record, I did in fact delete the comment.

You claim that medical researchers are doing logical inference incorrectly. But they are in fact doing statistical inference and arguing inductively.

Jaynes argued that probability theory was an extension of logic, so this seems like quite a quibbling point.

Statistical inference and inductive arguments belong in a Bayesian framework. You are making a straw man by translating them into a deductive framework.

They do, but did the paper he dealt with write within a Bayesian framework? I didn't read it, but it sounded like standard "let's test a null hypothesis" fare.

No. Mattes and Gittelman's finding is stronger than your rephrasing—your rephrasing omits evidence useful for Bayesian reasoners.

Which is not a valid objection to Phil's analysis if Mattes and Gittelman weren't doing a Bayesian analysis in the first place. Were they? I'll apologize for not checking myself if I'm wrong, but right now my priors are extremely low so I don't see value in expending the effort to verify.

Their paper should be seen in a Bayesian framework

If they did their calculations in a Bayesian framework. Did they?

They do, but did the paper he dealt with write within a Bayesian framework? I didn't read it, but it sounded like standard "let's test a null hypothesis" fare.

You don't just ignore evidence because someone used a hypothesis test instead of your favorite Bayesian method. P(null | p value) != P(null)

I ignore evidence when the evidence doesn't relate to the point of contention.

Phil criticized a bit of paper, noting that the statistical analysis involved did not justify the conclusion made. The conclusion did not follow the analysis. Phil was correct in that criticism.

It's just not an argument against Phil that someone might take some of the data in the paper and do a Bayesian analysis that the authors did not do.

It's just not an argument against Phil that someone might take some of the data in the paper and do a Bayesian analysis that the authors did not do.

That's not what I'm saying. I'm saying that what the authors did do IS evidence against the hypothesis in question. Evidence against a homogenous response is evidence against any response (it makes some response less likely)

What they did do?

Are you saying the measurements they took make their final claim more likely, or that their analysis of the data is correct and justifies their claim?

Yes, if you arrange things moderately rationally, evidence against a homogenous response is evidence against any response, but much less so. I think Phil agrees with that too, and is objecting to a conclusion based on much less so evidence pretending to have much more justification than it does.

Ok, yeah, translating what the researchers did into a Bayesian framework isn't quite right either. Phil should have translated what they did into a frequentist framework - i.e. he still straw manned them. See my comment here.

That's not what I'm saying. I'm saying that what the authors did do IS evidence against the hypothesis in question.

I know that. That's not the point. They claimed to have proven something they did not prove. They did not present this claim in a Bayesian framework.

No. I am not attacking the inductive argument in your points 1-4 above, which is not made in the paper, is not the basis for their claims, and is not what I am talking about.

You speak of the evidence from their study, but apparently you have not looked at the evidence from their study, presented in table 3. If you looked at the evidence you would see that they have a large number of measures of "hyperactivity", and that they differed between test and control groups. They did not find that there was no difference between the groups. There is always a difference between the groups.

What they did, then, was do an F-test to determine whether the difference was statistically significant, using the assumption that all subjects respond the same way to the intervention. They make that assumption, come up with an F-value, and say, "We did not reach this particular F-value, therefore we did not prove the hypothesis that food dye causes hyperactivity."

THEY DID NOT ASK WHETHER FOOD DYE INCREASED OR DECREASED HYPERACTIVITY BETWEEN THE GROUPS. That is not how an F-test works. They were, strictly speaking, testing the hypothesis whether the two groups differed, not in which direction they differed.

THERE WAS NO EVIDENCE THAT FOOD DYE DOES NOT CAUSE HYPERACTIVITY IN THEIR DATA. Not even interpreted in a Bayesian framework. They found a difference in behavior, they computed an F-value for 95% confidence assuming population homogeneity, and they did not reach that F-value.

Go back and read the part I added, with the bulleted list. You are trying to get all subtle. No; these people did an F-test, which gave a result of the form "It is not the case that for all x, P(x)", and they interpreted that as meaning "For all x, it is not the case that P(x)."

I don't think you responded to my criticisms and I have nothing further to add. However, there are a few critical mistakes in what you have added that you need to correct:

Now pay attention; this is the part everyone gets wrong, including most of the commenters below.

The methodology used in this study, and in most studies, is as follows:

  • Divide subjects into a test group and a control group.

No, Mattes and Gittelman ran an order-randomized crossover study. In crossover studies, subjects serve as their own controls and they are not partitioned into test and control groups.

If you don't understand why that is so, read the articles about the t-test and the F-test. The tests compute what a difference in magnitude of response such that, 95% of the time, if the measured effect difference is that large, the null hypothesis (that the responses of all subjects in both groups were drawn from the same distribution) is false.

No, the correct form is:

  • The tests compute a difference in magnitude of response such that if the null hypothesis is true, then 95% of the time the measured effect is not that large.

The form you quoted is a deadly undergraduate mistake.
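A quick simulation makes the corrected form concrete (a sketch, with invented population parameters and sample sizes): the threshold is chosen so that, if the null hypothesis is true, the measured difference exceeds it only about 5% of the time — and that is all the test guarantees.

```python
# Sketch of the corrected reading: choose the threshold so that IF the null
# is true, the measured difference exceeds it only ~5% of the time.
# Pure stdlib; all population parameters are invented.
import random
import statistics

random.seed(2)

def group_diff():
    # Null hypothesis true: both groups drawn from the same distribution.
    a = [random.gauss(17, 2) for _ in range(30)]
    b = [random.gauss(17, 2) for _ in range(30)]
    return abs(statistics.mean(a) - statistics.mean(b))

# Find the 95th percentile of the null distribution of differences.
null_diffs = sorted(group_diff() for _ in range(5000))
threshold = null_diffs[int(0.95 * len(null_diffs))]

# Under the null, only ~5% of experiments exceed the threshold. Note this
# says nothing directly about P(null | large effect).
exceed = sum(d > threshold for d in null_diffs) / len(null_diffs)
print(round(threshold, 2), exceed)
```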

ADDED: People are making comments proving they don't understand how the F-test works. This is how it works: You are testing the hypothesis that two groups respond differently to food dye.

Suppose you measured the number of times a kid shouted or jumped, and you found that kids fed food dye shouted or jumped an average of 20 times per hour, and kids not fed food dye shouted or jumped an average of 17 times per hour. When you run your F-test, you compute that, assuming all kids respond to food dye the same way, you need a difference of 4 to conclude that the two distributions (test and control) are different.

If the food dye kids had shouted/jumped 21 times per hour, the study would conclude that food dye causes hyperactivity. Because they shouted/jumped only 20 times per hour, it failed to prove that food dye causes hyperactivity. That failure to prove is then taken as having proved that food dye does not cause hyperactivity, even though the evidence indicated that food dye causes hyperactivity.

This is wrong. There are reasonable prior distributions for which the observation of a small positive sample difference is evidence for a non-positive population difference. For example, this happens when the prior distribution for the population difference can be roughly factored into a null hypothesis and an alternative hypothesis that predicts a very large positive difference.

In particular, contrary to your claim, the small increase of 3 can be evidence that food dye does not cause hyperactivity if the prior distribution can be factored into a null hypothesis and an alternative hypothesis that predicts a positive response much greater than 3. This is analogous to one of Mattes and Gittelman's central claims (they claim to have studied children for which the alternative hypothesis predicted a very large response).

This is going to be yet another horrible post. I just go meta and personal. Sorry.

I don't understand how this thread (and a few others like it) on stats can happen; in particular, your second point (re: the basic mistake). It is the single solitary thing any person who knows any stats at all knows. Am I wrong? Maybe 'knows' meaning 'understands'. I seem to recall the same error made by Gwern (and pointed out). I mean the system works in the sense that these comments get upvoted, but it is like. . . people having strong technical opinions with very high confidence about Shakespeare without being able to write out a sentence. It is not inconceivable the opinions are good (stroke, language, etc), but it says something very odd about the community that it happens regularly and is not extremely noticed. My impression is that Less Wrong is insane on statistics, particularly, and some areas of physics (and social aspects of science and philosophy).

I didn't read the original post, paper, or anything other than some comment by Goetz which seemed to show he didn't know what a p-value was and had a gigantic mouth. It's possible I've missed something basic. Normally, before concluding a madness in the world, I'd be careful. For me to be right here means madness is very very likely (e.g., if I correctly guess it's -70 outside without checking any data, I know something unusual about where I live).

It is the single solitary thing any person who knows any stats at all knows.

Many people with statistics degrees or statisticians or statistics professors make the p-value fallacy; so perhaps your standards are too high if LWers merely being as good as statistics professors comes as a disappointment to you.

I seem to recall the same error made by Gwern (and pointed out).

I've pointed out the mis-interpretation of p-values many times (most recently, by Yvain), and wrote a post with the commonness of the misinterpretation as a major point (http://lesswrong.com/lw/g13/against_nhst/), so I would be a little surprised if I have made that error.

Sorry, Gwern, I may be slandering you, but I thought I noticed it long before that (I've been reading, despite my silence). Another thing I have accused you of, in my head, is a failure to appropriately apply a multiple test correction when doing some data exploration for trends in the less wrong survey. Again, I may have you misidentified. Such behavior is striking, if true, since it seems to me one of the most basic complaints Less Wrong has about science (somewhat incorrectly).

Edited: Gwern is right (on my misremembering). Either I was skimming and didn't notice Gwern was quoting or I just mixed corrector with corrected. Sorry about that. In possible recompense: What I would recommend you do for data exploration is decide ahead of time if you have some particularly interesting hypothesis or not. If not and you're just going to check lots of stuff, then commit to that and the appropriate multiple test correction at the end. That level of correction then also saves your 'noticing' something interesting and checking it specifically being circular (because you were already checking 'everything' and correcting appropriately).

Another thing I have accused you of, in my head, is a failure to appropriately apply a multiple test correction when doing some data exploration for trends in the less wrong survey.

It's true I didn't do any multiple correction for the 2012 survey, but I think you're simply not understanding the point of multiple correction.

First, 'Data exploration' is precisely when you don't want to do multiple correction, because when data exploration is being done properly, it's being done as exploration, to guide future work, to discern what signals may be there for followup. But multiple correction controls the false positive rate at the expense of then producing tons of false negatives; this is not a trade-off we want to make in exploration. If you look at the comments, dozens of different scenarios and ideas are being looked at, and so we know in advance that any multiple correction is going to trash pretty much every single result, and so we won't wind up with any interesting hypotheses at all! Predictably defeating the entire purpose of looking. Why would you do this wittingly? It's one thing to explore data and find no interesting relationships at all (shit happens), but it's another thing entirely to set up procedures which nearly guarantee that you'll ignore any relationships you do find. And which multiple correction, anyway? I didn't come up with a list of hypotheses and then methodically go through them, I tested things as people suggested them or I thought of them; should I have done a single multiple correction of them all yesterday? (But what if I think of a new hypothesis tomorrow...?)
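For concreteness, here is what a Bonferroni correction does to a batch of exploratory results (a sketch; the p-values are invented for illustration): modest positives that would guide follow-up work are wiped out wholesale.

```python
# Sketch of the trade-off described above: Bonferroni correction on an
# exploratory batch of tests wipes out modest positives. P-values invented.
p_values = [0.003, 0.012, 0.020, 0.04, 0.11, 0.35, 0.48, 0.60, 0.71, 0.90]
alpha = 0.05
m = len(p_values)

uncorrected = [p for p in p_values if p < alpha]
bonferroni = [p for p in p_values if p < alpha / m]   # threshold 0.005

print(len(uncorrected))  # 4 hypotheses survive at alpha = 0.05
print(len(bonferroni))   # only 1 survives the corrected threshold of 0.005
```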

Second, thresholds for alpha and beta are supposed to be set by decision-theoretic considerations of cost-benefit. A false positive in medicine can be very expensive in lives and money, and hence any exploratory attitude, or undeclared data mining/dredging, is a serious issue (and one I fully agree with Ioannides on). In those scenarios, we certainly do want to reduce the false positives even if we're forced to increase the false negatives. But this is just an online survey. It's done for personal interest, kicks, and maybe a bit of planning or coordination by LWers. It's also a little useful for rebutting outside stereotypes about intellectual monoculture or homogeneity. In this context, a false positive is not a big deal, and no worse than a false negative. (In fact, rather than sacrifice a disproportionate amount of beta in order to decrease alpha more, we might want to actually increase our alpha!)

This cost-benefit is a major reason why if you look through my own statistical analyses and experiments, I tend to only do multiple correction in cases where I've pre-specified my metrics (self-experiments are not data exploration!) and where a false positive is expensive (literally, in the case of supplements, since they cost a non-trivial amount of $ over a lifetime). So in my Zeo experiments, you will see me use multiple correction for melatonin, standing, & 2 Vitamin D experiments (and also in a recent non-public self-experiment); but you won't see any multiple correction in my exploratory weather analysis.

What I would recommend you do for data exploration is decide ahead of time if you have some particularly interesting hypothesis or not. If not and you're just going to check lots of stuff, then commit to that and the appropriate multiple test correction at the end.

See above on why this is pointless and inappropriate.

That level of correction then also saves your 'noticing' something interesting and checking it specifically being circular (because you were already checking 'everything' and correcting appropriately).

If you were doing it at the end, then this sort of 'double-testing' would be a concern as it might lead your "actual" number of tests to differ from your "corrected against" number of tests. But it's not circular, because you're not doing multiple correction. The positives you get after running a bunch of tests will not have a very high level of confidence, but that's why you then take them as your new fixed set of specific hypotheses to run against the next dataset and, if the results are important, then perhaps do multiple correction.

So for example, if I cared that much about the LW survey results from the data exploration, what I should ideally do is collect the n positive results I care about, announce in advance the exact analysis I plan to do with the 2013 dataset, and decide in advance whether and what kind of multiple correction I want to do. The 2012 results using 2012 data suggest n hypotheses, and I would then actually test them with the 2013 data. (As it happens, I don't care enough, so I haven't.)
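For concreteness, the kind of multiple-test correction being discussed can be sketched in a few lines. This is a generic illustration (Bonferroni and Holm's step-down procedure); the p-values are hypothetical, not anyone's actual survey results:

```python
# Minimal sketch of family-wise multiple-test correction.
# The p-values below are invented for illustration.

def bonferroni(pvals, alpha=0.05):
    """Reject H_i iff p_i <= alpha / m (controls family-wise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm's step-down procedure: uniformly more powerful than Bonferroni."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

pvals = [0.001, 0.015, 0.04, 0.30]  # hypothetical results of four tests
print(bonferroni(pvals))  # [True, False, False, False]
print(holm(pvals))        # [True, True, False, False]
```

Note that Holm rejects the second hypothesis where Bonferroni does not, at the same family-wise error rate; this is why there is little reason to prefer plain Bonferroni when correction is wanted at all.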

Gwern, I should be able to say that I appreciate the time you took to respond (which is snarky enough), but I am not able to do so. You can't trust that your response to me is inappropriate and I can't find any reason to invest myself in proving your response is inappropriate. I'll agree my comment to you was somewhat inappropriate, and while turnabout is fair play (and first provocation warrants an added response), it is not helpful here (whether deliberate or not). Separate from that, I disagree with you (your response is, historically, how people have managed to be wrong a lot). I'll retire once more.

I believe it was suggested to me, when I first asked about the potential value of this place, that people here could help me with my math.

It's possible I've missed something basic.

Nope, I don't think you have. Not everyone is crazy, but scholarship is pretty atrocious.

If you don't understand why that is so, read the articles about the t-test and the F-test. The tests compute a difference in magnitude of response such that, 95% of the time, if the measured effect difference is that large, the null hypothesis (that the responses of all subjects in both groups were drawn from the same distribution) is false.

No, the correct form is:

  • The tests compute a difference in magnitude of response such that if the null hypothesis is true, then 95% of the time the measured effect is not that large. The form you quoted is a deadly undergraduate mistake.

I read through most of the comments and was surprised that so little was made of this. Thanks, VincentYu. For anyone who could use a more general wording, it's the difference between:

  • P(E≥S|H) the probability P of the evidence E being at least as extreme as test statistic S assuming the hypothesis H is true, and

  • P(H|E) the probability P of the hypothesis H being true given the evidence E.
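A toy calculation makes the gap between these two quantities concrete. Everything here is invented for illustration: the coin-flip data, the single specific alternative hypothesis, and the 50/50 prior.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k heads in n tosses of a coin with heads-probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, k = 10, 8  # toy data: 8 heads in 10 tosses

# P(E >= S | H0): tail probability of data at least this extreme if fair.
p_value = sum(binom_pmf(j, n, 0.5) for j in range(k, n + 1))

# P(H0 | E): posterior of the null against one alternative (bias 0.8),
# under a 50/50 prior -- both choices are assumptions for illustration.
like_h0 = binom_pmf(k, n, 0.5)
like_h1 = binom_pmf(k, n, 0.8)
posterior_h0 = like_h0 / (like_h0 + like_h1)

print(round(p_value, 4))       # 0.0547
print(round(posterior_h0, 3))  # 0.127
```

The two numbers answer different questions and need not be close; conflating them is exactly the P(E≥S|H) versus P(H|E) confusion described above.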

I think you're interpreting the F test a little more strictly than you should. Isn't it fairer to say a null result on an F test is "It is not the case that for most x, P(x)", with "most" defined in a particular way?

You're correct that an F-test is miserable at separating out different classes of responders. (In fact, it should be easy to develop a test that does separate out different classes of responders; I'll have to think about that. Maybe just fit a GMM with three modes in a way that tries to maximize the distance between the modes?)
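For what it's worth, here is a minimal sketch of that GMM idea: a hand-rolled 1-D EM fit with three components (non-responders plus positive and negative responders). The data, initial means, and fixed unit variance are all invented for illustration; a real analysis would use a proper mixture-model library.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_gmm_1d(data, mus, sigma=1.0, iters=50):
    """Crude EM for a 1-D Gaussian mixture with a fixed, shared sigma.
    `mus` gives the initial component means; weights start uniform."""
    k = len(mus)
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, sigma) for w, m in zip(weights, mus)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # M-step: re-estimate weights and means from the responsibilities.
        totals = [sum(r[j] for r in resp) for j in range(k)]
        weights = [t / len(data) for t in totals]
        mus = [sum(r[j] * x for r, x in zip(resp, data)) / totals[j]
               for j in range(k)]
    return weights, mus

random.seed(0)
# Made-up "responses": most children unaffected (near 0), a minority of
# positive responders (near +4), and a few negative responders (near -4).
data = ([random.gauss(0, 1) for _ in range(80)]
        + [random.gauss(4, 1) for _ in range(15)]
        + [random.gauss(-4, 1) for _ in range(5)])
weights, mus = fit_gmm_1d(data, mus=[-3.0, 0.0, 3.0])
print([round(w, 2) for w in weights])
print([round(m, 1) for m in mus])
```

The recovered mixture weights are an estimate of how large each responder class is, which is exactly the information a single pooled F-test throws away.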

But I think the detail that you suppressed for brevity also makes a significant difference in how the results are interpreted. This paper doesn't make the mistake of saying "artificial food coloring does not cause hyperactivity in every child, therefore artificial food coloring affects no children." The paper says "artificial food coloring does not cause hyperactivity in every child whose parents confidently expect them to respond negatively to artificial food coloring, therefore their parents' expectation is mistaken at the 95% confidence level."

Now, it could be the case that there are children who do respond negatively to artificial food coloring, but the Feingold association is terrible at finding them / rejecting those children where it doesn't have an effect. (This is unsurprising from a Hawthorne Effect or confirmation bias perspective.) As well, for small sample sizes, it seems better to use F and t tests than to try to separate out the various classes of responders, because the class sizes will be tiny; if one child responds poorly after being administered artificial food dye, that's not much to go on, compared to a distinct subpopulation of 20 children in a sample of 1000.

The section of the paper where they describe their reference class:

If artificial additives affect only a small proportion of hyperactive children, significant dietary effects are unlikely to be detected in heterogeneous samples of hyperactive children. Therefore, children who had been placed on the Feingold diet by their parents and who were reported by their parents to have derived marked behavioral benefit from the diet and to experience marked deterioration when given artificial food colorings were targeted for this study. This sampling approach, combined with high dosage, was chosen to maximize the likelihood of observing behavioral deterioration with ingestion of artificial colorings.

(I should add that the first sentence is especially worth contemplating, here.)

I think I disagree with both of you here. The failure to reject a null hypothesis is a failure. It doesn't allow or even encourage you to conclude anything.


Can you conclude that you failed to reject the null hypothesis? And if you expected to reject the null hypothesis, isn't that failure meaningful? (Note that my language carefully included the confidence value.)

As a general comment, this is why the Bayesian approach is much more amenable to knowledge-generation than the frequentist approach. The statement "the hyperactivity increase in the experimental group was 0.36+/-2.00, and that range solidly includes 0" (with the variance of that estimate pulled out of thin air) is much more meaningful than "we can't be sure it's not zero."

As a general comment, this is why Bayesian statistics is much more amenable to knowledge-generation than frequentist statistics. The statement "the hyperactivity increase in the experimental group was 0.36+/-2.00, and that range solidly includes 0" (with the variance of that estimate pulled out of thin air) is much more meaningful than "we can't be sure it's not zero."

I agree with the second sentence, and the first might be true, but the second isn't evidence for the first; interval estimation vs. hypothesis testing is an independent issue to Bayesianism vs. frequentism. There are Bayesian hypothesis tests and frequentist interval estimates.

Agreed that both have those tools, and rereading my comment I think "approach" may have been a more precise word than "statistics." If you think in terms of "my results are certain, reality is uncertain" then the first tool you reach for is "let's make an interval estimate / put a distribution on reality," whereas if you think in terms of "reality is certain, my results are uncertain" then the first tool you reach for is hypothesis testing. Such defaults have very important effects on what actually gets used in studies.
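As a concrete illustration of the interval-estimate habit, here is a sketch that computes a 95% interval of the "0.36 +/- ..." form. The scores are invented (chosen so the mean comes out to 0.36), and a normal approximation stands in for the t distribution:

```python
from statistics import NormalDist, mean, stdev

# Hypothetical "hyperactivity change" scores for an experimental group;
# the numbers are invented to echo the 0.36 +/- 2.00 style of summary.
scores = [0.5, -1.2, 2.1, 0.3, -0.8, 1.0, -1.5, 2.4, -0.2, 1.0]

m = mean(scores)
se = stdev(scores) / len(scores) ** 0.5
z = NormalDist().inv_cdf(0.975)  # ~1.96 for a two-sided 95% interval

lo, hi = m - z * se, m + z * se
print(f"estimate {m:.2f}, 95% interval ({lo:.2f}, {hi:.2f})")
print("interval includes 0:", lo <= 0 <= hi)
```

Reporting the whole interval conveys both the point estimate and how loosely it pins down reality, where "p > .05" compresses all of that into a single bit.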

And if you expected to reject the null hypothesis, isn't that failure meaningful?

To me, but not to the theoretical foundations of the method employed.

Hypothesis testing generally works sensibly because people smuggle in intuitions that aren't part of the foundations of the method. But since they're only smuggling things in under a deficient theoretical framework, they're given to mistakes, particularly when they're applying their intuitions to the theoretical framework and not the base data.

I agree with the later comment on Bayesian statistics, and I'd go further. Scatterplot the labeled data, or show the distribution if you have tons of data. That's generally much more productive than any particular confidence interval you might construct.

It would be an interesting study to compare the various statistical tests on the same hypothesis against the human eyeball. I think the eyeball would hold its own.

The results of this study indicate that artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents.

Translated into logic:

!∃child ( eats(child, coloring) ⇨ hyperactive(child) )

That's an uncharitable interpretation of that sentence. It would mean that if there were a word such as “any” before the phrase “school-age children”, but there isn't. The zero article before plural nouns in English doesn't generally denote a universal quantifier; “men are taller than women” doesn't mean ∀x ∈ {men} ∀y ∈ {women} x.height > y.height. The actual meaning of the zero article before plural nouns in English is context-dependent and non-trivial to formalize.

Are you a non-native English speaker by any chance? (So am I FWIW, but the definite article in my native language has a very similar meaning to the zero article in English in contexts like these.)

Suppose there is one school-age child, somewhere in the world, whose behavior is affected by artificial food colorings, and who is claimed to be sensitive to food coloring. Then the statement, "artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents," is false.

You shouldn't think of this paper as being in English. You should think of it as being written in Science.

It isn't uncharitable. Even if they had said, "Artificial food colorings do not affect the behavior of MOST school-age children who are claimed to be sensitive to these agents," it would still be a false claim, unsupported by their data and math. They proved that THERE EXIST children who are not sensitive to these agents. 5% may be enough.
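The quantifier-scope distinction at issue can even be checked mechanically on a toy domain. The children and the "affected" flags below are, of course, made up:

```python
# Toy domain illustrating quantifier scope: one "responder" among many.
# affected[child] is True iff food coloring affects that child (made-up data).
affected = {"alice": False, "bob": False, "carol": True}
children = affected.keys()

not_forall = not all(affected[c] for c in children)  # !∀x P(x)
forall_not = all(not affected[c] for c in children)  # ∀x !P(x)
exists = any(affected[c] for c in children)          # ∃x P(x)

print(not_forall)  # True:  not every child is affected
print(forall_not)  # False: "no child is affected" is falsified by carol
print(exists)      # True:  one affected child defeats the universal denial
```

A single responder makes !∀x P(x) true while leaving ∀x !P(x) false, which is exactly why the two readings of the paper's sentence come apart.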

You shouldn't think of this paper as being in English. You should think of it as being written in Science.

Science != Pure Mathematics.

They proved that THERE EXIST children who are not sensitive to these agents. 5% may be enough.

Yes, you can “prove” very little outside pure mathematics. But “doesn't prove” doesn't imply “doesn't support”. Chapter 1 of Probability Theory by E. T. Jaynes makes that clear.

(And BTW, how come you're taking “school-age children” to mean “all school-age children” but you're not taking “artificial food colorings” to mean “all artificial food colorings”?)

Then the statement, "artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents," is false.

No it fucking isn't. Read the article I've linked to again. “Humans have opposable thumbs” doesn't stop being true as soon as someone somewhere gets both thumbs amputated.

[1] Jeffrey A. Mattes & Rachel Gittelman (1981). Effects of Artificial Food Colorings in Children With Hyperactive Symptoms: A Critical Review and Results of a Controlled Study. Archives of General Psychiatry 38(6):714-718. doi:10.1001/archpsyc.1981.01780310114012. ungated

[2] K.S. Rowe & K.J. Rowe (1994). Synthetic food coloring and behavior: a dose response effect in a double-blind, placebo-controlled, repeated-measures study. The Journal of Pediatrics 125(5 Pt 1):691-698. ungated

[3] Stevenson, Sonuga-Barke, McCann et al. (2010). The Role of Histamine Degradation Gene Polymorphisms in Moderating the Effects of Food Additives on Children’s ADHD Symptoms. Am J Psychiatry 167:1108-1115. open access

I wouldn't have posted this if I'd noticed earlier links, but independent links are still useful.

Both the t-test and the F-test work by assuming that every subject has the same response function to the intervention:

response = effect + normally distributed error

where the effect is the same for every subject.

The F test / t test doesn't quite say that. It makes statements about population averages. More specifically, if you're comparing the mean of two groups, the t or F test says whether the average response of one group is the same as the other group. Heterogeneity just gets captured by the error term. In fact, econometricians define the error term as the difference between the true response and what their model says the mean response is (usually conditional on covariates).

The fact that the authors ignored potential heterogeneity in responses IS a problem for their analysis, but their result is still evidence against heterogeneous responses. If there really are heterogeneous responses we should see that show up in the population average unless:

  • The positive and negative effects cancel each other out exactly once you average across the population. (this seems very unlikely)
  • The population average effect size is nonzero but very small, possibly because the effect only occurs in a small subset of the population (even if it's large when it does occur) or something similar but more complicated. In this case, a large enough sample size would still detect the effect.

Now it might not be very strong evidence - this depends on sample size and the likely nature of the heterogeneity (or confounders, as Cyan mentions). And in general there is merit in your criticism of their conclusions. But I think you've unfairly characterized the methods they used.
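The second bullet can be illustrated with a quick simulation sketch: a small responder subpopulation shifts the group mean only slightly, so a test on group averages can easily miss it at modest sample sizes. All proportions and effect sizes are invented, and the test uses a large-sample normal approximation rather than an exact t test:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)

def z_test_p(a, b):
    """Two-sided two-sample z test (normal approximation, fine for large n)."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

n = 200
control = [random.gauss(0, 1) for _ in range(n)]
# Treated group in which only 5% are true responders with a large (+3 SD)
# individual effect; both numbers are invented for illustration.
treated = [random.gauss(3, 1) if random.random() < 0.05 else random.gauss(0, 1)
           for _ in range(n)]

p = z_test_p(treated, control)
print(f"p = {p:.3f}")  # frequently above 0.05: the average dilutes the responders
```

The population-average effect here is a real 0.15 SD, so a large enough sample would eventually detect it; the point is that at this sample size the group-mean test has little power against a responder subpopulation that would be obvious if identified individually.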

The fact that the authors ignored potential heterogeneity in responses IS a problem for their analysis, but their result is still evidence against heterogeneous responses.

Why do you say that? Did you look at the data?

They found F values of 0.77, 2.161, and 1.103. That means they found different behavior in the two groups. But those F-values were lower than the thresholds they had computed assuming homogeneity. They therefore said "We have rejected the hypothesis", and claimed that the evidence, which interpreted in a Bayesian framework might support that hypothesis, refuted it.
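For reference, the F statistic being discussed is just the ratio of between-group variance to within-group variance, so values above 1 do reflect the groups differing by more than within-group noise alone. A toy computation (with invented scores) shows how groups can visibly differ while F stays under the 95% critical cutoff (about 4.96 for 1 and 10 degrees of freedom):

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F: between-group mean square / within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Invented placebo vs. coloring scores: the coloring group averages 0.2 higher,
# yet F stays below the critical value -- "not proven at alpha = .05",
# which is not the same as "no effect".
placebo  = [3.1, 2.8, 3.4, 2.9, 3.2, 3.0]
coloring = [3.3, 2.9, 3.7, 3.1, 3.4, 3.2]
print(round(f_statistic([placebo, coloring]), 2))  # 1.98
```

An F of 1.98 with a critical value near 4.96 is precisely the situation described above: the data lean toward a group difference, yet the test procedure reports only "failed to reject".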