# Summary

I attempted to reproduce Scott’s analysis of Birth order effect vs Age gap. I found that:

1. There appeared to be an error in graphs 2 & 3 where people with one sibling were counted when they shouldn’t have been (graph 2) or were counted twice (graph 3)

2. Comparing oldest children to youngest children causes a systematic bias in the results. This can be prevented by comparing oldest children to 2nd oldest children

3. I was unable to reproduce Scott’s result on people reporting 0 year age gap – I get a non-significant 58% older siblings compared to Scott’s 70%. I was unable to discover the cause of the difference.

I have reanalysed the data based on points 1 & 2 in a separate post.

# Previously in Birth order effect

In the 2018 Slate Star Codex survey Scott asked some questions about what order in the family respondents were born. He found that eldest children were massively overrepresented.

Following on, historical mathematicians and Nobel winning physicists were found to exhibit the same property.

In the 2019 SSC survey Scott included questions about age gaps between respondents and their adjacent siblings. He analysed the results, finding that:

“This study found an ambiguous and gradual decline [in Birth order effect] from one to seven years [Age gap between siblings], but also a much bigger cliff from seven to eight years.”

# Failed reproduction of Scott’s graphs

I had originally intended to analyse the data to see if I could draw any further conclusions. However, when running the analysis I found that I was unable to reproduce Scott’s results.

Scott includes 3 graphs.

The first – comparing % of sample oldest child vs age gap for people with 1 sibling – I was able to reproduce almost exactly. (Scott also has access to the data of respondents who asked not to be included in the public data, so our numbers aren’t identical. There may be other differences too, but these are small.)

The second – comparing how many oldest vs youngest children there are in the sample for people with more than 1 sibling – I was unable to reproduce. Actually, I was able to reproduce the graph but only if I also included people with 1 sibling.

This is actually what the third graph was supposed to show, but it looks like the third graph double counts the people with 1 sibling.

The graphs are below, with Scott’s on the left and my reproductions on the right. Note the similarity between Scott’s graph 3 and my graph 2.

(My version of graph 2 has a different y-axis to all of the other graphs as the range is larger)

As an intuitive check that something is wrong: Scott’s third graph lists 7,613 samples. There are only 8,171 people in the whole survey, and the third graph should exclude all only children and all children between first and last in their family. A sample of 7,613 would mean only 558 respondents were excluded, yet there are 767 only children alone, before counting any middle children.
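The arithmetic behind that check can be written out as a short sketch. The three counts are the ones quoted above; the variable names are my own.

```python
# Counts quoted in the text above.
total_respondents = 8171   # everyone in the public survey data
graph3_sample = 7613       # sample size listed on Scott's third graph
only_children = 767        # respondents with no siblings

# How many respondents graph 3 appears to have excluded.
implied_excluded = total_respondents - graph3_sample

# Upper bound on a valid graph 3 sample: everyone except only children
# (middle children should be excluded too, so the true bound is lower).
max_valid_sample = total_respondents - only_children

print(implied_excluded)                   # 558
print(graph3_sample <= max_valid_sample)  # False: the sample is too large
```

Even before removing middle children, the listed sample is larger than the largest sample the stated filter allows.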

(Scott’s results were double checked by Tumblr user athenaegalea but this only involved looking at the data for the first graph which I also agree with).

Correcting this mistake is important because the claimed drop in Birth order effect at a 7 year age gap is predicated on graphs 1 & 2 both showing such a drop. In my reproduction there is no such drop in graph 2, which suggests the sudden drop at a 7 year age gap may not be a thing after all.

# Problem with comparing oldest and youngest children

Below I have plotted the data from graphs 1 & 2 on the same axis (switching to line graphs to make trends a bit easier to see). I have changed the y-axis to be a ratio of oldest:youngest instead of a percentage to highlight the strangeness of the results.
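For readers who want to move between the two y-axes, here is a small sketch of the conversion, assuming the percentage is the share of oldest children among oldest plus youngest. The function names are hypothetical.

```python
def share_to_ratio(oldest_share):
    """Convert the share of oldest children (0 to 1) into an oldest:youngest ratio."""
    return oldest_share / (1 - oldest_share)


def ratio_to_share(ratio):
    """Convert an oldest:youngest ratio back into the share of oldest children."""
    return ratio / (1 + ratio)


# A 7:1 ratio corresponds to 87.5% oldest children.
print(ratio_to_share(7))      # 0.875
print(share_to_ratio(0.875))  # 7.0
```

The ratio axis stretches the high end of the range, which is why large effects stand out more on it than on a percentage axis.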

The 1 sibling data still show a relatively constant Birth order effect across age gaps of up to 7 years. At gaps of 8 years and above, the Birth order effect size drops.

The >1 sibling data show a very different trend. At 1 year age gap there are over 7x more firstborns than lastborns. This decreases rapidly as age gap increases. Above 9 years age gap there are actually more youngest children than oldest (although the sample size is relatively small here).

This seems odd – the two data sets should be reflecting roughly the same process but with different family sizes – the graphs should be similar, or at least closer than this! Does having additional siblings qualitatively change how age gap modifies the Birth order effect?

Above we are comparing oldest siblings with youngest siblings and looking at the relative age gaps. In doing so we are implicitly assuming that the age gaps between 1st and 2nd children should be statistically similar to the age gaps between penultimate and last children in the general population.

This does not match with my experience. In families with >2 children that I know, age gaps between later children tend to be larger than between earlier children.

Checking this in the data, I looked at people who are neither first nor last children and compared the age gaps to their next older sibling and next younger sibling. On average the age gap to their younger sibling was 0.55 years longer than the gap to their older sibling.
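A minimal sketch of that check is below. The row layout and the toy values are assumptions for illustration only; they are not the survey's actual column names or data.

```python
from statistics import mean

# Hypothetical rows:
# (older_siblings, younger_siblings, gap_to_older, gap_to_younger)
rows = [
    (1, 1, 2, 3),
    (0, 2, None, 2),   # first-born: no older sibling, excluded
    (2, 1, 1, 1),
    (1, 0, 4, None),   # last-born: no younger sibling, excluded
    (1, 2, 3, 4),
]


def mean_gap_difference(rows):
    """Mean of (gap to younger sibling - gap to older sibling) for middle children."""
    diffs = [
        gap_younger - gap_older
        for older, younger, gap_older, gap_younger in rows
        if older > 0 and younger > 0
        and gap_older is not None and gap_younger is not None
    ]
    return mean(diffs)


print(mean_gap_difference(rows))
```

A positive result means the gap to the next younger sibling is longer on average, which is the direction found in the survey data.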

This effect would explain the incredibly high birth order effect with 1 year age gap seen in my graph 2. If many oldest and few youngest children in the general population have 1 year gap to their neighbouring sibling and we add this to the SSC Birth order effect then the overall effect in the SSC sample will be huge.

It would also explain how the birth order effect in graph 2 appears to decrease dramatically for families with large age gaps: in the population as a whole, more youngest than oldest children will have a >5 year gap to their neighbouring sibling. This general population effect cancels out some (or, for >9 year gaps, all) of the Birth order effect in the SSC sample.

I can't think of a way to quantify and/or cancel out this effect whilst comparing oldest to youngest children but fortunately we don't have to.

Instead of comparing oldest and youngest children, we could compare first and second children. If we only look at first and second children and compare age gaps downwards and upwards respectively then we should be looking at the same underlying distribution of age gaps in the general population.
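The comparison could be tabulated roughly as follows. This is a sketch under assumed field names (`position`, `gap_down`, `gap_up`), not the code I ran, and the toy respondents are made up.

```python
from collections import Counter


def first_born_share_by_gap(respondents):
    """Share of first children among first + second children, per age gap.

    First children are binned by the gap to their next younger sibling,
    second children by the gap to their older sibling.
    """
    firsts, seconds = Counter(), Counter()
    for r in respondents:
        if r["position"] == 1 and r["gap_down"] is not None:
            firsts[r["gap_down"]] += 1
        elif r["position"] == 2 and r["gap_up"] is not None:
            seconds[r["gap_up"]] += 1
    gaps = sorted(set(firsts) | set(seconds))
    return {g: firsts[g] / (firsts[g] + seconds[g]) for g in gaps}


# Hypothetical toy respondents (position 1 = oldest).
respondents = [
    {"position": 1, "gap_down": 2, "gap_up": None},
    {"position": 1, "gap_down": 2, "gap_up": None},
    {"position": 2, "gap_down": None, "gap_up": 2},
    {"position": 2, "gap_down": 3, "gap_up": 8},
    {"position": 1, "gap_down": 8, "gap_up": None},
    {"position": 3, "gap_down": None, "gap_up": 2},     # third child: ignored
    {"position": 1, "gap_down": None, "gap_up": None},  # only child: ignored
]

print(first_born_share_by_gap(respondents))
```

Because both groups are binned on the same gap (between child 1 and child 2), any difference in age gap distributions between early and late siblings drops out of the comparison.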

One advantage of comparing oldest to youngest siblings is that this represents the largest Birth order effect (see Scott’s 2018 analysis). However, as most of the effect happens between first and second siblings, the effect should still be large enough to detect using these samples.

Redoing my previous graph based on first and second children rather than first and last gives something which makes a lot more sense - there isn’t much difference between 1 sibling and >1 sibling. The >1 sibling data set doesn’t have the sudden drop after 7-years but does suggest a slight downwards trend around the same time.

Recreating Scott’s graph 3 (i.e. including all family sizes) gives a drop in birth order effect at 7 years but not as low as with just the 1 sibling data – ~59% oldest children vs 2nd oldest, compared to ~54%.

# Failed reproduction of birth order effect with 0 year age gap

I also failed to reproduce Scott’s finding regarding people reporting a zero year age gap. He finds that:

“Weirdly, among people who reported a zero-year age gap, 70% are older siblings.”

but I was unable to produce a result like this.

First I removed the respondents who reported 0 year age gaps in a direction in which they reported 0 siblings. This removed over half of the reported 0 year age gaps.

Of those remaining, I get a non-significant 58% older siblings (53 vs 39). There were also 3 respondents who indicated that they were in the middle of a multiple birth.
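One way to check the significance claim is an exact two-sided binomial test against a 50/50 split. The sketch below uses only the standard library and the 53 vs 39 counts above. The choice of test is my assumption here; other tests would be reasonable too.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    than the observed count k under the null probability p.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small tolerance so ties with the observed probability are included.
    return sum(x for x in pmf if x <= observed * (1 + 1e-9))


older, younger = 53, 39
n = older + younger

share_older = older / n
p_value = binomial_two_sided_p(older, n)

print(f"{share_older:.0%} older siblings")
print(p_value > 0.05)  # True: not significant at the 5% level
```

With only 92 respondents, a 58% share is well within what a 50/50 split would produce by chance.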

In this case I’m not sure how I get such different results from Scott. Even if I don’t remove the responses as detailed above, I don’t get anywhere near Scott’s results (I actually get ~60% __younger__ siblings: with so many oldest children, there are more opportunities to enter 0 years in the upwards direction where it should have been left blank).

So I’m very confused about why I’m getting such different results.

In my next post I reanalyse the data based on 1st and 2nd oldest children.