TL;DR: When people read a journal article that concludes, "We have proved that it is not the case that for every X, P(X)", they generally credit the article with having provided at least weak evidence in favor of the proposition ∀x !P(x). This is not necessarily so.

Authors using statistical tests are making precise claims, which must be quantified correctly. Pretending that all quantifiers are universal because we are speaking English is one error. It is not, as many commenters are claiming, a small error. ∀x !P(x) is very different from !∀x P(x).

A more-subtle problem is that when an article uses an F-test on a hypothesis, it is possible (and common) to fail the F-test for P(x) with data that supports the hypothesis P(x). The 95% confidence level was chosen for the F-test in order to count false positives as much more expensive than false negatives. Applying it therefore removes us from the world of Bayesian logic. You cannot interpret the failure of an F-test for P(x) as being even weak evidence for not P(x).

I used to teach logic to undergraduates, and they regularly made the same simple mistake with logical quantifiers. Take the statement "For every X there is some Y such that P(X,Y)" and represent it symbolically:

∀x∃y P(x,y)

Now negate it:

!∀x∃y P(x,y)

You often don't want a negation to be outside quantifiers. My undergraduates would often just push it inside, like this:

∀x∃y !P(x,y)

If you could just move the negation inward like that, then these claims would mean the same thing:

A) Not everything is a raven: !∀x raven(x)

B) Everything is not a raven: ∀x !raven(x)

To move a negation inside quantifiers, flip each quantifier that you move it past.

!∀x∃y P(x,y) = ∃x!∃y P(x,y) = ∃x∀y !P(x,y)
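The flipping rule can be checked by brute force on a small finite domain. The sketch below is only an illustration: the function names and the two-element domain are assumptions introduced here, not part of the original argument. It enumerates every possible predicate P on a 2×2 domain and compares the correctly negated form with the naive "push the negation inside" form.

```python
from itertools import product

def negated_original(P, X, Y):
    """!∀x∃y P(x,y)"""
    return not all(any(P[(x, y)] for y in Y) for x in X)

def flipped_quantifiers(P, X, Y):
    """∃x∀y !P(x,y): each quantifier is flipped as the negation passes it."""
    return any(all(not P[(x, y)] for y in Y) for x in X)

def naive_push(P, X, Y):
    """∀x∃y !P(x,y): negation pushed inside without flipping (the mistake)."""
    return all(any(not P[(x, y)] for y in Y) for x in X)

# Hypothetical two-element domains for x and y.
X = Y = (0, 1)
pairs = list(product(X, Y))

# Every possible predicate P on the domain: 2**4 = 16 truth tables.
tables = [dict(zip(pairs, values))
          for values in product((False, True), repeat=len(pairs))]

# The flipped form agrees with the negation for every predicate.
flip_always_agrees = all(
    negated_original(P, X, Y) == flipped_quantifiers(P, X, Y) for P in tables
)

# The naive form disagrees with the negation for some predicates.
naive_disagreements = sum(
    negated_original(P, X, Y) != naive_push(P, X, Y) for P in tables
)

print(flip_always_agrees, naive_disagreements)
```

A finite check like this is not a proof for arbitrary domains, but it shows concretely that the two forms are different claims.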

Here are the findings of a 1981 article [1] from *JAMA Psychiatry* (formerly *Archives of General Psychiatry*), back in the days when the medical establishment was busy denouncing the Feingold diet:

Previous studies have not conclusively demonstrated behavioral effects of artificial food colorings ... This study, which was designed to maximize the likelihood of detecting a dietary effect, found none.

Now pay attention; this is the part everyone gets wrong, including most of the commenters below.

The methodology used in this study, and in most studies, is as follows:

- Divide subjects into a test group and a control group.
- Administer the intervention to the test group, and a placebo to the control group.
- Take some measurement that is supposed to reveal the effect they are looking for.
- Compute the mean and standard deviation of that measure for the test and control groups.
- Do either a t-test or an F-test of the hypothesis that the intervention causes a statistically-significant effect on all subjects.
- If the test succeeds, conclude that the intervention causes a statistically-significant effect (CORRECT).
- If the test does not succeed, conclude that the intervention does not cause any effect to any subjects (ERROR).
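The steps in the list above can be sketched in code. This is a minimal sketch, not the analysis used in any cited study: the function name `one_tailed_test`, the activity counts, and the use of a normal approximation in place of the t distribution are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def one_tailed_test(test_group, control_group, confidence=0.95):
    """Compare group means; return the observed difference, cutoff, and verdict.

    Uses a normal approximation to the t distribution (an assumption made to
    keep the sketch stdlib-only; small samples call for the t distribution).
    """
    diff = mean(test_group) - mean(control_group)
    # Standard error of the difference in means (unequal-variance form).
    se = sqrt(variance(test_group) / len(test_group)
              + variance(control_group) / len(control_group))
    # Cutoff that the difference exceeds only (1 - confidence) of the time
    # when the null hypothesis is true.
    cutoff = NormalDist().inv_cdf(confidence) * se
    # A difference below the cutoff is inconclusive, not proof of no effect.
    verdict = "reject null" if diff > cutoff else "inconclusive"
    return diff, cutoff, verdict

# Hypothetical activity counts per hour (not data from any study).
test_group = [20, 31, 9, 28, 12, 24, 16, 26, 14, 20]
control_group = [17, 27, 7, 24, 10, 21, 13, 23, 11, 17]

diff, cutoff, verdict = one_tailed_test(test_group, control_group)
print(diff, round(cutoff, 2), verdict)
```

The function deliberately returns "inconclusive" rather than "no effect" when the difference falls below the cutoff, which is the distinction the last two steps of the list turn on.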

People make the error because they forget to explicitly state what quantifiers they're using. Both the t-test and the F-test work by assuming that every subject has *the same response function* to the intervention:

*response = effect + normally distributed error*

where the effect is the same for every subject. If you don't understand why that is so, read the articles about the t-test and the F-test. The null hypothesis is that the responses of all subjects in both groups were drawn from the same distribution. The one-tailed versions of the tests take a confidence level C and compute a cutoff Z such that, if the null hypothesis is true,

P( average effect(test) - average effect(control) < Z ) = C

ADDED: People are making comments proving they don't understand how the F-test works. This is how it works: You are testing the hypothesis that two groups respond *differently* to food dye.

Suppose you measured the number of times a kid shouted or jumped, and you found that kids fed food dye shouted or jumped an average of 20 times per hour, and kids not fed food dye shouted or jumped an average of 17 times per hour. When you run your F-test, you compute that, assuming all kids respond to food dye the same way, you need a difference of 4 to conclude with 95% confidence that the two distributions (test and control) are different.

If the food dye kids had shouted/jumped 21 times per hour, the study would conclude that food dye causes hyperactivity. Because they shouted/jumped only 20 times per hour, it failed to prove that food dye affects hyperactivity. You can only conclude that food dye affects behavior with 84% confidence, rather than the 95% you desired.

Finding that food dye affects behavior with 84% confidence should not be presented as proof that food dye does not affect behavior!

If half your subjects have a genetic background that makes them resistant to the effect, the threshold for the t-test or F-test will be much too high to detect that. If 10% of kids become more hyperactive and 10% become less hyperactive after eating food coloring, such a methodology will never, ever detect it. A test done in this way can *only* accept or reject the hypothesis that *for every* subject x, the effect of the intervention is different than the effect of the placebo.
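The case where 10% of kids become more hyperactive and 10% become less so can be made concrete with a small deterministic example. Everything numeric below is hypothetical (the baseline counts, the effect size of 8, the group size of 100); the sketch only assumes that the two groups share the same baseline distribution and that the opposite effects are equal in size.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical baseline activity counts for 100 children: deterministic
# pseudo-noise around 17 events per hour (not data from any study).
baseline = [17 + (7 * i) % 11 - 5 for i in range(100)]

# Hypothetical individual effects of the dye: 10 children gain 8 events per
# hour, 10 children lose 8, and the remaining 80 are unaffected.
effect = [8] * 10 + [-8] * 10 + [0] * 80

control = baseline
treated = [b + e for b, e in zip(baseline, effect)]

# The comparison of means sees nothing, because the effects cancel exactly.
mean_difference = mean(treated) - mean(control)
standard_error = sqrt(variance(treated) / len(treated)
                      + variance(control) / len(control))
t_statistic = mean_difference / standard_error

# Yet a fifth of the children were affected, and the spread shows it.
affected_children = sum(e != 0 for e in effect)
spread_increased = variance(treated) > variance(control)

print(t_statistic, affected_children, spread_increased)
```

In this constructed case the test statistic is exactly zero even though 20 of the 100 children respond to the dye. The only trace of the effect is in the variance, which a comparison of means does not examine.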

So. Rephrased to say precisely what the study found:

This study tested and rejected the hypothesis that artificial food coloring affects behavior in all children.

Converted to logic (ignoring time):

!( ∀child ( eats(child, coloring) ⇨ behaviorChange(child) ) )

Move the negation inside the quantifier:

∃child !( eats(child, coloring) ⇨ behaviorChange(child) )

Translated back into English, this study proved:

There exist children for whom artificial food coloring does not affect behavior.

However, this is the actual final sentence of that paper:

The results of this study indicate that artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents.

Translated into logic:

!∃child ( eats(child, coloring) ⇨ hyperactive(child) )

or, equivalently,

∀child !( eats(child, coloring) ⇨ hyperactive(child) )

This refereed medical journal article, like many others, made the same mistake as my undergraduate logic students, moving the negation across the quantifier without changing the quantifier. I cannot recall ever seeing a medical journal article prove a negation and *not* make this mistake when stating its conclusions.

A lot of people are complaining that I should just interpret their statement as meaning "Food colorings do not affect the behavior of MOST school-age children."

But they didn't prove that food colorings do not affect the behavior of most school-age children. They proved that there exists at least one child whose behavior food coloring does not affect. That isn't remotely close to what they have claimed.

For the record, the conclusion is wrong. Studies that did not assume that all children were identical, such as studies that used each child as his or her own control by randomly giving them cookies containing or not containing food dye [2], or a recent study that partitioned the children according to single-nucleotide polymorphisms (SNPs) in genes related to food metabolism [3], found large, significant effects in some children or some genetically-defined groups of children. Unfortunately, reviews failed to distinguish the logically sound from the logically unsound articles, and the medical community insisted that food dyes had no influence on behavior until thirty years after their influence had been repeatedly proven.

[1] Jeffrey A. Mattes & Rachel Gittelman (1981). Effects of Artificial Food Colorings in Children With Hyperactive Symptoms: A Critical Review and Results of a Controlled Study. *Archives of General Psychiatry* 38(6):714-718. doi:10.1001/archpsyc.1981.01780310114012.

[2] K.S. Rowe & K.J. Rowe (1994). Synthetic food coloring and behavior: a dose response effect in a double-blind, placebo-controlled, repeated-measures study. *The Journal of Pediatrics* Nov;125(5 Pt 1):691-8.

[3] Stevenson, Sonuga-Barke, McCann et al. (2010). The Role of Histamine Degradation Gene Polymorphisms in Moderating the Effects of Food Additives on Children’s ADHD Symptoms. *Am J Psychiatry* 167:1108-1115.

You claim that medical researchers are doing *logical* inference incorrectly. But they are in fact doing *statistical* inference and arguing inductively.

Statistical inference and inductive arguments belong in a Bayesian framework. You are making a straw man by translating them into a deductive framework.

No. Mattes and Gittelman's finding is stronger than your rephrasing—your rephrasing omits evidence useful for Bayesian reasoners. For instance, they repeatedly pointed out that they “[studied] only children who were already on the Feingold diet and who were reported by their parents to respond markedly to artificial food colorings.” They claim that this is important because “the Feingold diet hypothesis did not originate from observations of carefully diagnosed children but from anecdotal reports on children similar to the ones we studied.” In other words, they are making an inductive argument:

The F test / t test doesn't quite say that. It makes statements about *population averages*. More specifically, if you're comparing the mean of two groups, the t or F test says whether the *average response* of one group is the same as the other group. Heterogeneity just gets captured by the error term. In fact, econometricians *define* the error term as the difference between the true response and what their model says the mean response is (usually conditional on covariates).

The fact that the authors ignored potential heterogeneity in responses IS a problem for their analysis, but their result is still evidence against heterogeneous responses. If there really are heterogeneous responses we should see that show up in the population average unless:

That's an uncharitable interpretation of that sentence. It would mean that if there was a word such as “any” before the phrase “school-age children”, but there isn't. The zero article before plural nouns in English doesn't generally denote a universal quantifier; “men are taller than women” doesn't mean ∀x ∈ {men} ∀y ∈ {women} x.height > y.height. The actual meaning of the zero article before plural nouns in English is context-dependent and non-trivial to formalize.

Are you a non-native English speaker by any chance? (So am I FWIW, but the definite article in my native language has a very similar meaning to the zero article in English in contexts like these.)

If whether this particular paper exemplifies this error is disputed (as it appears to be!) and the author's claim that he "cannot recall ever seeing a medical journal article prove a negation and not make this mistake" is correct, then it should be easy for the author to give several more examples which more clearly display the argument given here. I would encourage PhilGoetz or someone else to do so.

Interesting. Those two statements seem quite different; more than just a rephrasing.

Probabilistically, it sounds like the study found `P(hyper|dye) = P(hyper|~dye)`, that is, they rejected `P(hyper|dye) > P(hyper|~dye)`, and concluded `P(hyper|dye) = P(hyper|~dye)` (no connection) correctly.

I think your logical interpretation of their result throws out most of the information. Yes, they concluded that it is not true that all children that ate dye were hyperactive, but they also found that the proportion of dye-eaters who were hyperactive was not different from the base rate, which is a much stronger statement, which *does* imply their conclusion, but can't be captured by the logical formulation you gave.

The whole point of inductive reasoning is that this is evidence for artificial food coloring not affecting the behavior of any children (given a statistically significant sample size). You cannot do purely deductive reasoning about the real world and expect to get anything meaningful. This should be obvious.

Moved to Discussion. (Again.)

why is it back in main?

The problem is that you don't understand the purpose of the studies at all and you're violating several important principles which need to be kept in mind when applying logic to the real world.

Our primary goal is to determine net harm or benefit. If I do a study as to whether or not something causes harm or benefit, and see no change in underlying rates, then it is non-harmful. If it is making some people slightly more likely to get cancer, and others slightly less likely to get cancer, then there's no net harm - there are just as many cancers as there were before. I may have changed the distribution of cancers in the population, but I have certainly not caused any net harm to the population.

This study's purpose is to look at the net effect of the treatment. If we see the same amount of hyperactivity in the population prior to and after the study, then we cannot say that the dye causes hyperactivity in the general population.

"But," you complain, "Clearly some people are being harmed!" Well yes, some people are worse off after the treatment in such a theoretical case. But here's the key: for the effect NOT to show up in the general population, then you have only ...

I've similarly griped here in the past about the mistaken ways medical tests are analyzed here and elsewhere, but I think you overcomplicated things.

The fundamental error is misinterpreting a failure to reject a null hypothesis for a particular statistical test, a particular population, and a particular treatment regime as a generalized demonstration of the null hypothesis that the medication "doesn't work". And yes, you see it very often, and almost universally in press accounts.

You make a good point about how modeling *response = effect + error* leads to confusion. I think the mistake is clearer written as "response = effect + noise", where noise is taken as a random process injecting ontologically inscrutable perturbations of the response. If you start with the assumption that all differences from the mean effect are due to ontologically inscrutable magic, you've ruled out any analysis of that variation by construction.

OK, I may be dense today, but you lost me there. I tried to puzzle out how the raven sentences could be put symbolically so that they each corresponded to one of the negations of your original logic sentence, and found that fruitless. Please clarify?

The rest of the post made sense. I'll read through the comments and figure out why people seem to be disagreeing first, which will give me time to think whether to upvote.

If 11 out of 11 children studied have a property (no food coloring hyperactivity response), that's a bit stronger than "there exist 11 children with this property", though perhaps not quite "all children have this property".
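One way to put a number on "a bit stronger" is to ask how large the fraction of responders in the population could be while still making 11 non-responders out of 11 unsurprising. The sketch below is a rough illustration under strong assumptions that are not established by the thread: the 11 children are treated as an independent random sample, and each child is assumed to be correctly classified as a non-responder. The function name is hypothetical.

```python
def upper_bound_on_responder_fraction(n_nonresponders, confidence=0.95):
    """Largest responder fraction p still consistent with seeing no responders.

    Solves (1 - p) ** n = 1 - confidence for p, assuming the n children are
    an independent random sample and each is classified without error.
    """
    return 1 - (1 - confidence) ** (1 / n_nonresponders)

# 11 children, all without a response, as in the comment above.
bound = upper_bound_on_responder_fraction(11)
print(round(bound, 2))
```

Under those assumptions the bound comes out at roughly 0.24, so a sample of this size could not rule out a sizeable minority of responders, which fits "not quite all children have this property".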

Or rather, you can conclude that, if there were no effect of food dye on hyperactivity and we did this test a whole lotta times, then we'd get data like this 16% of the time, rather than beneath the 5%-of-the-time maximum cutoff you were hoping for.

It's not so easy to jump from frequentist confidence intervals to confidence for or against a hypothesis. We'd need a bunch of assumptions. I don't have access to the original article so I'll just make ...

I think part of the problem is that there is a single confidence threshold, usually 90%. The problem is that setting the threshold high enough to compensate for random flukes and file drawer effects causes problems when people start interpreting threshold - epsilon to mean the null hypothesis has been proven. Maybe it would be better to have two thresholds with results between them interpreted as inconclusive.

This post makes a point that is both correct and important. It should be in Main.

This post makes a point that is both correct and important. A post that makes this point should be in Main.

The reception of this post indicates that the desired point is not coming through to the target audience. That matters.

Not even that. It takes *the zero-article plural* as used in everyday language and pretends it is intended to be precisely the same as the logical "all" operator, which of course it is not.

Does what you're saying here boil down to "failing to reject the null (H0) does not entail rejecting the alternative (H1)"? I have read this before elsewhere, but not framed in quantifier language.

I think the picture is not actually so grim: the study does reject an entire class of (distributions of) effects on the population.

Specifically, it cannot be the case (with 95% certainty or whatever) that a significant proportion of children are made hyperactive, while the remainder are unaffected. This does leave a few possibilities:

- Only a small fraction of the children were affected by the intervention.
- Although a significant fraction of the children were affected by the intervention in one direction, the remainder were affected in the opposite direct

When people do studies of the effects of food coloring on children, are the children blindfolded?

That is, can the studies discern the neurochemical effects of coloring molecules from the psychological effects of eating brightly-colored food?

I expect that beige cookies are not as exciting as vividly orange cookies.

My read of the Mattes & Gittelman paper is that they're comparing natural and artificial food coloring.

I think that should be: The tests compute what is a difference in magnitude of response such that, if the null hypothesis is true, then 95% of the time the measured effect difference will not be that large.

Frequentist statistics cannot make the claim that with some probability the null hypothesis is true or false. *Ever.* You must have a prior and invoke Bayes' theorem to do that.

I'm not as interested in proving my point, as in figuring out why people resist it so strongly. It seems people are eager to disagree with me and reluctant to agree with me.

How did the post make you *feel*, and why?

(Eliezer does, anyway. I can't say I see very many quotes or invocations from others.)

I am hostile to some criticisms, because in some cases when I see them being done online, it's not in the spirit of 'let us understand how these methods make this research fundamentally flawed, what this implies, and how much we can actually extract from this research'*, but in the spirit of 'the earth is actually not spherical but an oblate spheroid thus you have been educated stupid and Time has Four Corners!' Because the standard work has flaws, they feel free to jump to whatever random bullshit they like best. 'Everything is true, nothing is forbidden.'

* e.g., although extreme and much more work than I realistically expect anyone to do, I regard my dual n-back meta-analysis as a model of how to react to potentially valid criticisms. Instead of learning that passive control groups are a serious methodological iss...

I think that the universal quantifier in

is not appropriate.

The original statement

only implicates that artificial food coloring was responsible for all children's hyperactivity, not that children who ever ate artificial food coloring would inevitably have hyperactivity. So the formula without universal quantifier is more reasonable and thus the final statement of the article is without problem.

It would've been very helpful if some sort of glossary or even a Wikipedia link was provided before diving into the use of the notational characters such as those used in "∀x !P(x)".

Although this post covers an important topic, the first few sentences almost lost me completely, even though I learned what all those characters meant at one time.

And, as LessWrong is rather enamored with statistics, consider that by writing P(x,y), the readers have an exactly 50% chance of getting the opposite meaning unless they have very good recall. :)

That would be interesting if true. I recommend finding another one, since you say they're so plentiful. And I also recommend reading it carefully, as the study you chose to make an example of is not the study you were looking for. (If you don...

Why is this post not visible under "Main->New"?

Unfortunately, there's an error in your logic: You call that type of medical journal article error "universal", i.e. applicable in all cases. Clearly a universal quantifier if I ever saw one.

That means that for all medical journal articles, it is true that they contain that error.

However, there exists a medical journal article that does not contain that error.

Hence the medical journal error is not universal, in contradiction to the title.

First logical error ... and we're not even out of the title? Oh dear.