Introduction and motivation

Science! It’s important. It’s also difficult, largely because of how easily people fool themselves. I’ve seen plenty of people (my younger self included!) lose the plot as they attempt their first few investigations, letting subtle methodological problems render their entire chain of reasoning worthless.

So I’ve decided to pick a small, well-behaved dataset nobody’s had a thorough look at yet, and perform an egregiously thorough and paranoid exploration. My aim is both to showcase Good Science for aspiring Data Analysts, and to give others a chance to criticise my approach: it’s entirely possible that I’m still making some blunders, and if I am then I want to know.

Below is a summary of my methodology, with emphasis on the mistakes I managed to dodge. You can see the full R kernel here.

The dataset

The data was collected by renowned gamedev/educator/all-round-smart-person Nicky Case, in an attempt to decide which of six projects (‘Crowds’, ‘Learn/Teach’, ‘Win/Win’, ‘Nonviolence’, ‘Understand’, and ‘Mindfulness’) they should work on next. They described each possible project, and asked fans to rate each one from 1-5 stars. The poll, with full descriptions of each project, is still up here, though the project names are indicative enough that you shouldn’t need it.

‘Crowds’ won, handily, and Case built it. At time of writing, they are between major projects; my pretext for performing this analysis is helping them decide which of the runners-up (‘Understand’ and ‘Learn/Teach’) to prioritise.

Exploration

I began by loading and cleaning the data. For most data analysis jobs, handling irregularities in a dataset can take up as much time as actually analysing it: I picked an unusually tidy dataset so I’d be able to skip to the parts where people make interesting mistakes.

The most I had to deal with was a few responses with missing values, caused by people not filling in every part of the poll. I considered handling this by imputation – deciding to interpret blanks as three-star ratings, say – but eventually decided to limit subjectivity and drop all incomplete responses.

[Normally, I’d look into missing values in more detail, but dropping them seemed fair here: I knew the causal mechanism, they were a small enough proportion of the dataset, there was no way a missing value could be hiding an outlier (a blank can’t conceal a rating of 100/5 stars), and these seemed like the least important respondents anyway (if they couldn’t be bothered completing a six-question survey, they probably don’t have strong opinions about this topic).]
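For concreteness, here’s a minimal sketch of that cleaning step. This is Python rather than the R my kernel actually uses, and the data is made up; `None` stands in for a blank rating:

```python
def drop_incomplete(responses):
    """Keep only fully answered responses (None marks a skipped question)."""
    return [r for r in responses if all(v is not None for v in r)]

polls = [[5, 3, 4, 2, 1, 5],
         [4, None, 3, 3, 2, 4],  # skipped a question, so it gets dropped
         [3, 3, 3, 3, 3, 3]]
complete = drop_incomplete(polls)  # keeps the first and third responses
```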

I set a seed, and split the dataset in half using random sampling. I christened one half exploreDF, and the other half verifyDF: my plan was to use the former for Exploratory Data Analysis, and then perform any ‘proper’ statistical tests on the latter.
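For readers who prefer code to prose, the idea looks roughly like this (a Python sketch; in my actual R kernel, `set.seed()` and `sample()` do the equivalent work):

```python
import random

def seeded_split(rows, seed=42, frac=0.5):
    """Shuffle row indices with a fixed seed, then cut the dataset in half."""
    rng = random.Random(seed)  # fixed seed => anyone can replicate the split
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(len(rows) * frac)
    explore = [rows[i] for i in idx[:cut]]
    verify = [rows[i] for i in idx[cut:]]
    return explore, verify

rows = list(range(100))
exploreDF, verifyDF = seeded_split(rows)
# running it again gives the identical split, which is the whole point
assert seeded_split(rows) == (exploreDF, verifyDF)
```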

[A classic error of new Data Analysts performing EDA is to explore their entire dataset, and then test hypotheses on the same data that suggested them: this is an issue so common it has its own Wikipedia page. The heart of statistical testing is finding results which are unlikely in ways which contradict the null hypothesis, and a result which you decided to test because you noticed it contradicting the null hypothesis isn’t very unlikely to do so.

There are workarounds – penalties you can apply in an attempt to retroactively purify tests performed on the data that suggested them – but I’m sceptical of these. You can dutifully multiply your p-values by N to accommodate the fact that you picked one of N possible comparisons to test, but there’s no sensible way to accommodate the fact that you looked at that kind of comparison in the first place.

Tl;dr: don’t do anything isomorphic to training on the testing set.]
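The best-known penalty of that kind is the Bonferroni correction, and it really is just multiplication, capped at 1. A sketch (Python rather than R, with invented p-values):

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons made, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

bonferroni([0.01, 0.04, 0.30])  # ≈ [0.03, 0.12, 0.90]
```

Note what the correction can and can’t do: it adjusts for the N comparisons you made, but not for the unknowably many comparisons you might have made, which is the objection above.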

[Another very common mistake is to not randomise data splits, or to not set a seed when you do. If, for example, I’d just taken the first 50% of the rows as my exploreDF, that might have introduced bias: what if the dataset were ordered chronologically, and those who responded first had consistently different views to those who responded later? The only way to ensure a fair split is to randomly sample the entire dataset.

As for failure to set seeds, that’s more bad housekeeping and bad habits than an actual mistake, but it’s still worth talking about. When you set a seed, you guarantee that the split can be replicated by other people running your code. Enabling others to replicate results is an integral part of the scientific endeavour, and keeps everyone honest. It also lets you replicate your results, just in case R goes haywire and wipes your workspace.]

The range of possible scores is 1-5, but the average score for each column is around 3.8 stars. This suggests the possibility of a large number of respondents who limit themselves to a range of 3-5 stars, plus a minority who do not, and who have a disproportionate impact on the average scores. Whether this is a bug or a feature is subjective, but it’s worth looking into.

To investigate this possibility, I derived two extra features: range (distance between the highest and lowest scores given by each respondent), and positivity (average score given by each respondent). Also, I was curious to see what correlated with positivity. Were people who generally awarded higher ratings more interested in some projects than others?
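Deriving those two features is a one-liner per respondent; sketched in Python (my kernel does the same in R), using the toy set of ratings I reuse later in this post:

```python
def derive_features(ratings):
    """Per-respondent range (highest minus lowest score) and
    positivity (mean score given)."""
    rng = max(ratings) - min(ratings)
    positivity = sum(ratings) / len(ratings)
    return rng, positivity

derive_features([2, 2, 3, 3, 3, 5])  # -> (3, 3.0)
```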

[I got ‘3.8 stars’ from Case’s summary of the entire dataset: technically this violates data purity, but I would have got that impression about the distribution from the next section anyway.]

I began with univariate analysis: checking how each of the six variables behaved on its own before seeing how they interacted.

They were all pretty much like this

Then, I moved on to multivariate analysis. How much do these scores affect each other?

There’s an R function I really like, called ggcorr, which plots the correlation coefficients of every variable against every other variable, and lets you see what connections shake out.

I love this genre of triangle so much
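The computation behind that triangle is just pairwise Pearson coefficients, every variable against every other. ggcorr (from R’s GGally package) does it all for you; for illustration, here’s what it amounts to from scratch in Python:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient (assumes neither column is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def corr_pairs(columns):
    """All 15 pairwise correlations for 6 columns: the triangle ggcorr plots."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b])
            for i, a in enumerate(names) for b in names[i + 1:]}
```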

Inferences:

  • Surprisingly, there aren’t obvious ‘cliques’ of strong mutual correlation, like I’ve gotten used to finding when I use ggcorr. The closest thing to a clique I can see here is the mutual affection between ‘Crowds’, ‘Mindfulness’ and ‘Win/Win’, but it’s not particularly strong. Also, this finding has no theoretical backing, since I can’t think of a good post-hoc explanation that groups these three together but leaves out 'Nonviolence'.
  • There’s a strong general factor of positivity, as demonstrated by the fact that only two of the fifteen mutual correlations between the six original projects are negative.
  • The two negative correlations are between ‘Understand’ and ‘Win/Win’, and between ‘Crowds’ and ‘Learn/Teach’. The former kind of makes sense: people who like the most abstract and academic project dislike the fluffiest and most humanistic project, and vice-versa. The latter, however, astonishes me: what’s the beef between education and network theory?
  • Five of the six least positive correlations are between 'Understand' and the other projects: this project seems to be more of a niche than its competitors.

I ran a quick statistical test – using R’s default cor.test function – on the negative relationship between ‘Crowds’ and ‘Learn/Teach’ in exploreDF, just to check it wasn’t a fluke. The test returned p=0.0913: not statistically significant, but close enough to reassure me.

[I know I’ll catch some flak around here for using p-values instead of Bayesian stats, but p-values are what people are familiar with, so they’re what I use to support any results I might want to share with the unenlightened.]

Then, I added range and positivity to the ggcorr plot.

This is also an excellent triangle

Additional inferences:

  • Range and positivity are confirmed to be strongly negatively correlated, suggesting (but not confirming) that my theory of a low-positivity minority having a disproportionate impact is correct.
  • Every project correlates with the average (that’s to be expected, even without the general factor), but some features correlate much more strongly than others.

Thinking of Anscombe's Quartet, I plotted the scores against each other on 2D graphs (see below). After eyeballing a few, I was confident that the strengths of these relationships could be approximated with linear correlation.

Nothing about this looks blatantly nonlinear; let’s keep going

What I was most interested in, though, wasn’t the ratings or scores, but the preferences they revealed. I subtracted the positivity from each score, leaving behind each respondent’s preferences.
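That subtraction, sketched in Python (the bracketed example later in this post uses the same toy ratings):

```python
def subtract_positivity(ratings):
    """Remove a respondent's mean score, leaving their relative preferences."""
    positivity = sum(ratings) / len(ratings)
    return [r - positivity for r in ratings]

subtract_positivity([2, 2, 3, 3, 3, 5])  # -> [-1.0, -1.0, 0.0, 0.0, 0.0, 2.0]
```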

On general principles, I glanced at the univariate graphs for these newly derived values . . .

Oh hey, they actually look kind of Gaussian now

. . . before re-checking correlations.

What a calming colour palette

High positivity is correlated with liking ‘Win/Win’, ‘Nonviolence’, and especially ‘Mindfulness’; low positivity is correlated with liking ‘Crowds’, ‘Learn/Teach’, and especially ‘Understand’.

These correlations looked important, so I statistically tested them. All p-values were below 0.1, and the p-values associated with ‘Mindfulness’ and ‘Understand’ were below 0.002 and 0.001 respectively.

In other words, the three frontrunners were the three whose proponents had been least positive overall. Did this mean the results of the original poll were primarily the result of lower-scoring respondents having a greater impact?

To investigate further, I divided all of the positivity-adjusted data for each respondent by the range (in other words, a set of responses [2, 2, 3, 3, 3, 5], which had become [-1, -1 , 0, 0, 0, 2], now became [-1/3, -1/3, 0, 0, 0, 2/3]), to see what things looked like when low-positivity people weren’t having their outsized effect.
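Both steps together, as a Python sketch. One assumption I’m baking in here: a respondent with zero range (identical scores everywhere) expresses no preference, so I map them to all zeros rather than divide by zero:

```python
def normalise(ratings):
    """Centre by the respondent's mean score, then divide by their range."""
    mean = sum(ratings) / len(ratings)
    rng = max(ratings) - min(ratings)
    if rng == 0:
        # identical scores everywhere = no expressed preference (my assumption)
        return [0.0] * len(ratings)
    return [(r - mean) / rng for r in ratings]

normalise([2, 2, 3, 3, 3, 5])  # -> [-1/3, -1/3, 0, 0, 0, 2/3]
```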

As always, I checked the univariate graphs for anything interesting . . .

Yup, those are lines alright

. . . before moving on to comparisons. The averages for my normalised interpretation were:

  • ‘Understand’: 0.028
  • ‘Nonviolence’: 0.023
  • ‘Learn/Teach’: 0.042
  • ‘Win/Win’: -0.076
  • ‘Mindfulness’: -0.080
  • ‘Crowds’: 0.063

For comparison, Case’s averages are:

  • ‘Understand’: 3.88
  • ‘Nonviolence’: 3.80
  • ‘Learn/Teach’: 3.87
  • ‘Win/Win’: 3.61
  • ‘Mindfulness’: 3.61
  • ‘Crowds’: 3.94

This is a reassuring null result. The ordinal rankings are more-or-less preserved: ‘Crowds’ > ‘Understand’ & ‘Learn/Teach’ > ‘Nonviolence’ > ‘Win/Win’ & ‘Mindfulness’. The main difference is that ‘Learn/Teach’, in my adjusted version, does significantly better than ‘Understand’.

(Well, I say ‘significantly’: I tried one-sample t-testing the difference between ‘Learn/Teach’ and ‘Understand’, but the p-value was embarrassingly high. Still, a null result can be worth finding.)

I’d found and tested some interesting phenomena; I felt ready to repeat these tests on the holdout set.


It was at this point that I suddenly realised I’d been a complete idiot.

(Those of you who consider yourselves familiar with statistics and Data Science, but didn’t catch my big mistake on the first read-through, are invited to spend five minutes re-reading and trying to work out what I’m referring to before continuing.)

I’d used R’s standard tests for correlation and group differences, but I’d failed to account for the fact that R’s standard tests assume normally-distributed variation. This was despite the fact that I’d had no reason to assume a Gaussian distribution, and that I’d had visibly not-normally-distributed data staring me in the face every time I created a set of univariate graphs, and that I’d been consciously aiming to do an obnoxiously thorough and well-supported analysis. Let this be a lesson to you about how easy it is to accidentally fudge your numbers while using tools developed for social scientists.

On realising my error, I redid every statistical test in the exploration using nonparametric methods. In place of the t-tests, I used a Wilcoxon test, and in place of the Pearson correlation tests, I used Kendall’s tau-b (the most common method for not-necessarily-normal datasets is a Spearman test, but that apparently has some issues when handling discrete data). Fortunately, the main results turned out more or less the same: the biggest change was to the p-value reported for the correlation between ‘Crowds’ and ‘Learn/Teach’, which dropped to half its value and started to look testable.
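For the curious, the tau-b statistic itself is simple enough to compute by hand. A from-scratch sketch (Python rather than R, and a naive O(n²) pair count; R’s cor.test with method="kendall" handles the p-value side for you):

```python
from collections import Counter
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b: rank correlation with a tie correction, which is
    what makes it suitable for discrete data like star ratings."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    n0 = n * (n - 1) // 2
    ties_x = sum(c * (c - 1) // 2 for c in Counter(x).values())
    ties_y = sum(c * (c - 1) // 2 for c in Counter(y).values())
    return (concordant - discordant) / sqrt((n0 - ties_x) * (n0 - ties_y))

kendall_tau_b([1, 2, 3, 4], [4, 3, 2, 1])  # -> -1.0
```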

[Among the many benefits of doing an explore/verify split and saving your final tests for last: any ‘oh crap oh crap I completely used the wrong test here’ moments can’t have data purity implications unless they happen at the very end of the project.]

Testing

My main relevant findings in exploreDF were:

  1. Votes for ‘Understand’ correlate negatively with positivity.
  2. Interest in ‘Crowds’ is negatively correlated with interest in ‘Learn/Teach’.
  3. The level of correlation between ‘Crowds’ and ‘Learn/Teach’ is unusually low.
  4. ‘Understand’ is an unusually niche topic.
  5. There’s no major change as a result of normalizing by range.

Appropriate statistical tests for verifyDF are:

  1. After adjusting for positivity, run a one-sided Kendall test on ‘Understand’ vs positivity, looking for negative correlation. (α=0.001)
  2. Without adjusting for positivity, run a one-sided Kendall test on ‘Crowds’ vs ‘Learn/Teach’, looking for negative correlation. (α=0.05)
  3. Recreate the first ggcorr plot: if the correlation between ‘Crowds’ and ‘Learn/Teach’ is the lowest correlation out of the 15 available, consider this confirmed. (this would have a 1/15 probability of happening by chance, so that’s p=α=0.0666)
  4. Recreate the first ggcorr plot: if the 5 correlations between ‘Understand’ and other projects are all among the lowest 7 in the 15 available, consider this confirmed. (this would have a 1/143 chance of happening by chance, so that’s p=α=0.007)
  5. N/A; I don’t need a test to report on a change I didn’t find.
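The critical values for tests 3 and 4 fall out of simple combinatorics; here’s a quick sanity check (Python, under the null assumption that every ordering of the 15 correlations is equally likely):

```python
from math import comb

# Test 3: the 'Crowds'/'Learn/Teach' pair is the single lowest of the 15
# correlations by chance
p3 = 1 / 15  # ≈ 0.0667

# Test 4: all 5 'Understand' correlations land among the lowest 7 of 15.
# A qualifying lowest-7 set must contain those 5 plus any 2 of the
# remaining 10 correlations:
p4 = comb(10, 2) / comb(15, 7)  # = 45 / 6435 = 1/143 ≈ 0.007
```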

[Note that I use a Fisherian approach and call the critical p-values for my final tests in advance. I think this is generally good practice if you can get away with it, but unlike with most of my advice I’m not going to be a hardass here: if you prefer to plug your results into statistical formulae and report the p-values they spit out, you are valid, and so are your conclusions.]

And the results?

[Fun fact: I wrote the entire post up to this point before actually running the tests.]

  1. Test passes.
  2. Test fails.
  3. Test fails.
  4. Test passes.

[I’m making a point of publishing my negative results alongside positive ones, to avoid publication bias. Knowing what doesn’t work is just as important as knowing what does; do it for the nullz.]

[I realised in retrospect that these tests interfere in ways that aren’t perfectly scientific. In particular: given #1 passing, #4 passing becomes much more likely, and vice versa. The two positive results I got are fine when considered separately, but I should be careful not to treat these two with the weight I’d assign to two independent results with the same p-values.]

Interpretation

My main results are that votes for ‘Understand’ negatively correlate with positivity, and that ‘Understand’ has unusually low degrees of correlation with all the other projects.

So what does this actually mean? Remember, my pretext for doing this is helping Case choose between ‘Understand’ and ‘Learn/Teach’, given similar average scores.

Well, based on my best understanding of what Case is trying to achieve . . . I’d say that ‘Learn/Teach’ is probably the better option. ‘Understand’ is the preferred project of people who seemed to show least enthusiasm for Case’s projects in general, and whose preferences were least shared by other respondents. If we’re taking a utilitarian approach – optimising for the greatest satisfaction of the greatest number – it makes sense to prioritise ‘Learn/Teach’.

However, there are many ways to interpret this finding. I called the average score for a given respondent their ‘positivity’, but that was just to avoid it being confused with the average for a given project. Giving lower scores on average could be explained by them taking a more bluntly honest approach to questionnaires, or being better at predicting what they wouldn’t enjoy, or any number of other causes. I can run the numbers, but I can’t peer into people’s souls.

Also, even if the pro-‘Understand’ subset were less enthusiastic on average about Case’s work, that wouldn’t imply a specific course of action. “The desires of the many outweigh those of the few” is a sensible reaction, but “This subset of my fans are the least interested in my work, so they’re the ones I have to focus on keeping” would also be a completely valid response, as would “This looks like an undersupplied niche, so I could probably get a cult following out of filling it”.

Lessons Learned

In the spirit of ‘telling you what I told you’, here’s a quick summary of every important idea I used in this analysis:

  • Don’t test hypotheses on the data that suggested them. One way to avoid this is to split your dataset at the start of an exploration, and get your final results by testing on the unexplored part.
  • Splits should be done via random sampling. Random sampling should have a seed set beforehand.
  • Take a quick look at univariate graphs before trying to do anything clever.
  • Results which have some kind of theoretical backing are more likely to replicate than results which don’t.
  • If you don’t have good reasons to think your variables are normally distributed, don’t use t-tests or Pearson correlation tests. These tests have nonparametric equivalents: use those instead.
  • Your inferences have limits. Know them, and state them explicitly.

I’ll also throw in a few that I didn’t have cause to use in this analysis:

  • If you’re analysing a dataset with a single response variable (i.e. a specific factor you’re trying to use the other factors to predict), it’s probably worth looking at every possible bivariate plot which contains it, so you can pick up on any nonlinear relations.
  • Some of the most useful inferences are achieved through multivariate analysis, discovering facts like “wine acidity predicts wine quality if and only if price per 100ml is above this level”. I didn’t try this here because the dataset has too few rows, and so any apparent three-factor connection would probably be spurious (too many possible hypotheses, too little data to distinguish between them). Also, I’m lazy.
  • If you’re performing an analysis as part of a job application, create a model, even if it isn’t necessary. Wasted motion is in general a bad thing, but the purpose of these tasks is to give you an opportunity to show off, and one of the things your future boss most wants to know is whether you can use basic Machine Learning techniques. Speaking of which: use Machine Learning, even if the generating distribution is obvious enough that you can derive an optimal solution visually.

Comments

This is very similar to the EDA I would run on this sort of data - marginals, correlations, basic tests. A few things I would do differently:

  • I generally think in terms of standard devs/standard errors rather than p-vals/log odds updates, at least for EDA purposes. I then think about how many std devs I need for significance by eyeballing normality and falling back to Chebyshev's inequality if needed. For large datasets with highly significant results, that's plenty. Otherwise, I need to think harder about what model actually makes sense.
  • That said, approximations of normality are still valid for your correlations - the raw data isn't normal, but the correlation is a *sum* over (presumably) IID survey respondents. Same with group differences: the *average* can be modeled as normal, even if the data isn't. Central limit theorem and all that. (I don't use R, though, so it's possible their tests rely on additional assumptions.)
  • I would look for clusters. With data like this, I'd wonder if there are separate populations filling in the survey. I'd run SVD and look at plots/histograms of the larger components to see if there's any visually-obvious clustering.

Is the dataset publicly available? If so, I might do a write-up of how I'd analyze it. Would be interesting to have a bunch of different people try it and compare notes. (I mean, it would probably eventually devolve into a dick-measuring competition, but even then it would be interesting.)

Responses to your differences:

  • I hear you, but R has enough fully-automated testing tools that it's much simpler for me to just run the appropriate test and see what pops out the other end. (Also, THANK YOU for mentioning Chebyshev, I can't believe I'd never heard of that inequality before and it's EXACTLY my kind of thing)

  • I think (?) you're operating on the wrong level of meta here. A t-test uses both the mean and the variance of the distribution(s) you feed it, and that's true whether or not it's being used to test a correlation. The CLT will not save us, because the single (admittedly Gaussian-distributed) datapoint representing the mean has a variance of zero. (Something I could have done - in fact, something I remember doing much earlier in my career, back when I was better at identifying problems than finding expedient solutions - was to group not-necessarily-normal datapoints together into batches of about twenty, take the averages per-batch, and then t-test the lists of those: it was a ridiculous waste of statistical power, but it was valid!)

  • That's an excellent idea. My excuse for not doing that is that I was prioritising pointedly-not-getting-things-wrong over actually-getting-things-right; my reason is that I just didn't think of it and I'm too lazy (and data-purist) to go back and try that now.
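The batching workaround mentioned in the second reply can be sketched like so (a hypothetical Python version of what was presumably R at the time; by the CLT, the batch means are approximately normal, so they can be fed to a t-test even when the raw data can't):

```python
from statistics import mean

def batch_means(data, batch_size=20):
    """Group not-necessarily-normal datapoints into batches and average each:
    the batch means are roughly normal. (Wasteful of power, but valid.)"""
    return [mean(data[i:i + batch_size])
            for i in range(0, len(data) - batch_size + 1, batch_size)]

batch_means(list(range(40)), 20)  # -> [9.5, 29.5]
```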

The dataset is, at time of writing, still up at https://gist.github.com/ncase/74ae97cb74893a0c540274b44f550503. I'd love to see what you throw at it.