The result looks pretty weak. They had 62 kids. First, they gave all the kids a fluid intelligence test to measure their baseline fluid intelligence. Then half the kids (32) were given a month of n-back training (which the authors expect to increase their fluid intelligence) while the other half (30) did a control training which was not supposed to influence fluid intelligence. At the end of the month's training all of the kids took another fluid intelligence test to see if they'd improved, and 3 months later they all took a fluid intelligence test once more to see if they'd retained any improvement.
The result that you'd look for with this design, if n-back training improves fluid intelligence, is that the group that did n-back training would show a larger increase in fluid intelligence scores from the baseline test to the test after training. They looked and did not find that result - in fact, it was not even close to significant (F < 1). That's the effect that the study was designed to find, and it wasn't there. So that's not a good sign.
The kids who did n-back training did improve at the n-back task, so the authors decided to look at the data in another way - they divided the 32 kids in that group in half based on how much they had improved on the n-back task, and looked separately at the 16 who improved the most and the 16 who improved the least. The group of 16 high-improvers did improve on the fluid intelligence test, significantly more than the control group, and they retained that improvement on the follow-up test of fluid intelligence. That is the main result that the paper reports, which they interpret as a causal effect of n-back training. The 16 low-improvers did not have a statistically significant difference from the control group on the fluid intelligence test.
But this just isn't that convincing a result, as the study no longer has an experimental design when you're using n-back performance to divide up the kids. If you give kids 2 intelligence tests (one the n-back task, one the fluid intelligence test), and a month later you give them both intelligence tests again, then it's not surprising that the kids who improved the most on one test would tend to also improve the most on the other test. And that's basically all that they found. Their study design involved training the kids on one of those two tests (n-back) during the month-long gap, but there's no particular reason to think that this had a causal effect on their improvement on the other test. There are plenty of variables that could affect intelligence test performance which would affect performance on both tests similarly (amount of neural development, being sick, learning disability, etc.).
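To make the point above concrete, here is a minimal simulation sketch. It is not the paper's data or analysis: every number is hypothetical, and the model simply assumes that each kid's gain on both tests is driven by one shared factor (development, health, motivation) plus independent noise, with no causal path from n-back training to the fluid intelligence gain.

```python
import random
import statistics

def gf_gain_by_nback_split(n_kids=32, seed=0):
    """Median-split simulated kids by n-back gain and return the mean
    fluid intelligence gain of the (low, high) improvers.
    Hypothetical model: training has NO causal effect on the Gf gain."""
    rng = random.Random(seed)
    kids = []
    for _ in range(n_kids):
        shared = rng.gauss(0, 1)                # common factor affecting both tests
        nback_gain = shared + rng.gauss(0, 1)   # gain on the trained task
        gf_gain = shared + rng.gauss(0, 1)      # gain on the other test
        kids.append((nback_gain, gf_gain))
    kids.sort(key=lambda kid: kid[0])           # rank by n-back improvement
    half = n_kids // 2
    low = statistics.mean([gf for _, gf in kids[:half]])
    high = statistics.mean([gf for _, gf in kids[half:]])
    return low, high

def average_gap(n_runs=500):
    """Average (high - low) Gf gain over many simulated studies."""
    gaps = []
    for seed in range(n_runs):
        low, high = gf_gain_by_nback_split(seed=seed)
        gaps.append(high - low)
    return statistics.mean(gaps)

if __name__ == "__main__":
    print(average_gap())
```

Under these assumptions the "high improvers" come out ahead on the second test on average, even though nothing in the model lets training cause that improvement. How large the gap is depends entirely on the made-up variances.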
If there is a causal benefit of n-back, then it should show up in the effect that they were originally looking for (more fluid intelligence improvement in the group that did n-back training than the control group). Perhaps they'd need a larger sample size (200 kids instead of 62?) to find it if the benefit only happens to some of the kids (as they claim), but if some kids benefit from the training while others get no effect from it then the net effect should be a measurable benefit. I'd want to see that result before I'm persuaded.
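A rough back-of-the-envelope sketch of the sample-size point, using the standard normal approximation for comparing two group means. The effect sizes and responder fraction are hypothetical inputs, not figures from the paper, and the approximation ignores the extra variance that a mix of responders and non-responders would add.

```python
from math import ceil
from statistics import NormalDist

def kids_per_group(effect_in_responders, responder_fraction,
                   alpha=0.05, power=0.80):
    """Approximate kids needed per group to detect the diluted group-level
    effect with a two-sided test. Effects are in standard deviation units.
    All inputs are illustrative assumptions."""
    # If only some kids benefit, the group mean moves by the diluted amount.
    net_effect = effect_in_responders * responder_fraction
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / net_effect ** 2)

if __name__ == "__main__":
    # Hypothetical: half the kids gain 0.8 SD, the rest gain nothing.
    print(kids_per_group(0.8, 0.5))
```

With a hypothetical 0.8 SD benefit in half the kids, this comes to roughly 100 kids per group, so a total in the neighborhood of the 200 guessed above. Halving the responder fraction roughly quadruples the requirement.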
My primary objection is: perhaps some of the students in both groups got smarter (these are 8-9 year olds and still developing) for reasons independent of the interventions, which caused them to improve on the n-back training task AND on the other intelligence tests (fluid intelligence, Gf). If you separated the "active control" group into high and low improvers post-hoc just like was done for the n-back group, you might see that the active control "high improvers" are even smarter than the n-back "high improvers". We should expect some 8-9 year olds to improve in intelligence or motivation over the course of a month or two, without any intervention.
Basically, this result sucks, because of the artificial post-hoc division into high- and low- responders to n-back training, needed to show a strong "effect". I'm not certain that the effect is artificial; I'd have to spend a lot of time doing some kind of sampling to show how well the data is explained by my alternative hypothesis.
It's definitely legitimate to look at the whole n-back group vs. the whole active control group. The results there aren't impressive at all. I just can't give any credit for the post-hoc division because I don't know how to properly penalize it and it's clearly self-serving for Jaeggi. It's borderline deceptive that the graphs don't show the unsplit n-back population.
It's unsurprising (probably offering no evidence against my explanation) that the initial average n-back score for the low improvers is higher than the initial average for the high improvers; this is what you'd expect if you split a set of paired samples drawn from the same distribution with no change at all, for example.
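A small sketch of that last claim, under made-up assumptions: each kid has a fixed true ability that does not change at all, and the pre and post scores are that ability plus independent measurement noise. Splitting by gain then sorts kids partly by how unlucky their first score was.

```python
import random
import statistics

def baseline_by_gain_split(n_kids=32, seed=0):
    """Return the mean PRE score of the (low, high) improvers after a
    median split on gain. Hypothetical model: no real change occurs."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_kids):
        ability = rng.gauss(0, 1)           # stable, unchanged between tests
        pre = ability + rng.gauss(0, 1)     # noisy first measurement
        post = ability + rng.gauss(0, 1)    # noisy second measurement
        rows.append((post - pre, pre))
    rows.sort(key=lambda row: row[0])       # rank by apparent improvement
    half = n_kids // 2
    low_pre = statistics.mean([pre for _, pre in rows[:half]])
    high_pre = statistics.mean([pre for _, pre in rows[half:]])
    return low_pre, high_pre

def average_baseline_gap(n_runs=500):
    """Average (low improvers' pre - high improvers' pre) over many runs."""
    gaps = []
    for seed in range(n_runs):
        low_pre, high_pre = baseline_by_gain_split(seed=seed)
        gaps.append(low_pre - high_pre)
    return statistics.mean(gaps)

if __name__ == "__main__":
    print(average_baseline_gap())
```

On average the low improvers start higher than the high improvers, which is plain regression to the mean and says nothing about whether training worked.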
Also, on pg 2/6, I don't understand how the t statistics line up with the group sizes.
The groups are ((16 high improvement+16 low improvement)+30 control), so why is it (15), t(15), t(30), and then later t(16)? Does t(n) not mean that it's a t statistic over a population of n? I'm guessing so. I assume the t is an unpaired Student's t-test, which of course assumes the distributions compared are normal. I'm not sure if that's demonstrated, but it may be obvious to experts (it's not to me).
Disclaimer: I did dual n-back for a month or so, and got stuck at 5. I haven't resumed, though I may do so in the future.
You are way too underconfident. If an intervention is equally likely to raise or lower the score with respect to the control group, without increasing variation, it does nothing.
When you say that the aggregate results "aren't impressive," you imply that they are positive, but if I read table 1 correctly, the aggregate results are often negative.
(By the way, the "active control" group practiced vocab and trivia, which should have no overlap with what's tested by SPM and TONI, which are completely nonverbal.)
You're right. I didn't actually locate and compare the unsplit numbers from table 1; I just visually estimated (from the pretty bar chart, Fig 4) the average of the two n-back subgroups, since they're equal-sized. It looks like the n-backers (compared to the trivia/vocab studiers) had a non-significantly superior improvement short term, and a non-significantly worse improvement long term.
I'm also puzzled as to why there's no passive control. Even though there's no obvious overlap in vocabulary/trivia learning and SPM/TONI, I'd expect some generalized training effect, at least in motivation/focus.
I guess my overall view of the evidence is: don't expect single n-back to do much better than any other form of same-effort mental exercise, for any purpose except the exact task trained.
There's no passive control because there are only 62 kids. Only spend as many kids as it takes to publish.
I would not expect a generalized training effect. Almost nothing exhibits cross-test training. People are excited about n-back because it is the only test that is said to.
If you believed single n-back was going to definitively beat the active control, then you wouldn't pay for a passive control. I buy that. But now that it hasn't, it's worth adding a passive control.
Some apparently randomly chosen training task (vocabulary and trivia memorization) exhibited just as much generalized training as single n-back. In your interpretation, neither had any generalized benefit, then - the improvement is just due to normal ~9yr old child development over the timespan.
I do recall hearing some credible evidence that dual n-back (whatever configuration was in some older Jaeggi study) gave a boost to "fluid intelligence" (thus the interest in the topic). But now I'm inclined to mistrust Jaeggi more than I would the average influential researcher.
Only spend as many kids as it takes to publish.
That's unfair. Getting 62 kids for this study must have been difficult. You don't know what the costs would have been to get a few dozen more.
I said "spend kids," so the cost of acquiring them is irrelevant. I'm sure they're expensive, so I keep them fixed. If there were half as many studies each with twice as many subjects, they would be much more valuable. But they wouldn't be publishable, because they'd all have negative results.
The groups are ((16 high improvement+16 low improvement)+30 control), so why is it (15), t(15), t(30), and then later t(16)? Does t(n) not mean that it's a t statistic over a population of n?
Not usually. Numbers in brackets after a well-known statistic normally represent parameters for that statistic's distribution; in the case of a t-test the bracketed number would be the number of degrees of freedom, which might be one less than the sample size (for a one-sample t-test) or two less than the sum of sample sizes (for an equal variances two-sample t-test).
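As a quick illustration of the arithmetic, here is a sketch of the two degrees-of-freedom rules just described. Which test the paper actually ran for each reported statistic is not established here; the group sizes are just the ones mentioned in this thread.

```python
def df_one_sample(n):
    """Degrees of freedom for a one-sample (or paired) t-test on n subjects."""
    return n - 1

def df_two_sample_equal_var(n1, n2):
    """Degrees of freedom for an equal-variances two-sample t-test."""
    return n1 + n2 - 2

if __name__ == "__main__":
    print(df_one_sample(16))                 # 15
    print(df_two_sample_equal_var(16, 16))   # 30
    print(df_two_sample_equal_var(16, 30))   # 44
```

So a t(15) is what a one-sample or paired test within one 16-kid subgroup would give, and a t(30) is what an equal-variances comparison of two 16-kid subgroups would give.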
(Disclaimer: I haven't read the paper.)
[Edited for unambiguity.]
From the study:
Because we included children from both the Detroit and Ann Arbor metropolitan areas, we had a broad range of socioeconomic status, race, and ethnicity
Detroit has some of the worst public schools in the country and Ann Arbor some of the best. Yet, unless I missed it, Jaeggi didn't break down any of the results by school location. Strange. I wonder which type of students were among those that she claimed benefited from the N-back?
That so closely resembles the old saw about econometrics hunting that I can't believe someone would actually do that...
Of course mainstream coverage is unquestioning. That's what science coverage is, except when there are other scientists disputing it. Journalists don't understand the science or the statistics, so it's probably for the best that they don't assess the papers they cover.
We played it when I was in high school, but we called it Egyptian Ratscrew.
(Egyptian Ratscrew, or "ERS" in the company of adults, is a card game that, depending on variant, can include 1-back, 2-back, or 3-back, among other simultaneous pattern recognition tasks. With a nonstandard deck of cards with more features, e.g. Set cards, it would be trivial to adjust it to multi-feature n-back.)
Sorry if replying to an old post is frowned upon, but I have been doing N-Back for a while now and have seen drastic improvements in my learning ability (problem solving?) at school. I'm in my late 30s and was born with learning disabilities (most people are). I know that at least 2-back has helped me.
The reason I'm replying to this post is that I did searches on youtube for Egyptian Ratscrew and the explanations of how to play the game didn't sound or look anything like Dual N Back.
N-Back involves recalling sounds and the position of a visual stimulus. Egyptian Ratscrew does not.
Thanks for commenting everyone. I was kind of hoping someone could convince me otherwise, that this study was solid evidence of IQ improvement, but I guess not. Anyway, I have linked & excerpted this page on the DNB ML.
Following up on the 2010 study, Jaeggi and University of Michigan people have run a Single N-back study on 60 or so children.
The abstract is confident and the mainstream coverage unquestioning of the basic claim. But reading it, the data did not seem very solid at all - I will forbear from describing my reservations exactly; I have been accused of being biased against n-backing, however, and I'd appreciate outside opinions, especially from people with expertise in the area.
(Background: Jaeggi 2011 in my DNB FAQ. Don't read it unless you can't render the above requested opinion, since it includes my criticisms.)