Case Study: Testing Confirmation Bias

Master copy lives on


I support your focus on testing confirmation bias, but I don't think that it was worth it to explicitly falsify results (for a short time), compared to saying "oh well" and repeating the process until you do legitimately get an inconvenient result on a self-experiment. You've demonstrated that you're willing to break the taboo (or injunction) of never falsifying object-level results of scientific experience, which makes all of your data less valuable.

I found this to be a good and informative post, nonetheless.

I found this to be a good and informative post, nonetheless.

Really? Are you really surprised that people are reluctant to broadcast data that doesn't fit their theory? Have you read any political blogs?

By my model, it takes a pretty unusual person to give anywhere near equal weight to confirming and disconfirming evidence. We're holding Seth Roberts to a very high standard here--one that Gwern himself has not necessarily achieved. Criticizing is easy.

This is a great example of what Frank Adamek talked about in his recent post re: lowering other people's status. The reason folks subconsciously avoid disconfirming evidence is so they can preserve their status. In an ideal world preserving status would be a nonissue and disconfirming evidence would be fine. But then someone like Gwern comes along and snipes someone's status, validating the concern with status that leads to confirmation bias in the first place.

(Stop violating useful social norms Gwern! Punish the norm violator! Just kidding, saying that would make me a hypocrite. I'll assume Gwern posted this in good faith and didn't mean to erode useful social norms.)

So can future articles on individual irrationality please be restricted to people writing about themselves?

To clarify: I'm in support of doing psychological tests on small scales and writing up the results on Less Wrong. I'm not in support of breaking certain ethical injunctions in the process.

If gwern had legitimately gotten a different self-experiment result than Seth Roberts, and then the same process had transpired, I'd be entirely in favor of this post. It's an important caveat to self-experimentation that you need to really watch out for confirmation bias, and trust people more if they're willing to publicize negative results as well as positive ones.

But falsifying results to achieve that, even temporarily, was a bad choice (and it makes me less willing to invest my time in reading gwern's self-experimentation in the future).

Since I'm objecting, I may as well clarify: I would appreciate it if people told me unimportant lies (that were corrected later) in order to test me for biases, as long as the results of the test were just between the two of us, and possibly also in other circumstances. (Let's say you have to pay me one dollar for each additional person who knows up to the first 20 people, with additional people being free after the first 20.)

How would you test yourself for confirmation bias?

If I were Seth Roberts, I would look into my blog archive for the initial anecdotal results I posted on experiments now proven to have negative results. If most of these posts seemed positive, I probably have confirmation bias.

I don't think that can be done, since I don't know of any of his theories which have been 'now proven to have negative results'. I think a post linked somewhere here accuses Roberts of actively avoiding clinical trials, where Roberts replies that he worked with a SUNY professor on 20 case-studies for the Shangri-La diet. Since the diet is his centerpiece and the subject of his only book (AFAIK), it probably represents the best-case testing of any of his theories...


This was possibly an expensive experiment in terms of social capital...

I think it would have been better to have waited longer. After only three days, his response seems reasonable. Perhaps after two weeks, it would be more difficult to believe that he would have ever published your data.

He doesn't even tell us what the publication lag for the first experiment was.

The first experiment? You mean the SIAI habit formation thing? I thought it was obvious from the intro specifying when the call for applicants went up and when I posted, but I've edited it to be more explicit.

Or do you mean the vitamin D evening experiment? The results didn't contradict any of his theories, and to the extent it matters to the theory at all, his theory predicts that it ought to damage sleep in the evening since it's influencing circadian rhythms and it isn't a mere matter of vitamin D deficiency.

How long before he linked to your initial vitamin D results?

Dunno. As I said, it didn't matter.

It just occurred to me - I have an active experiment going with deleting random external links on Wikipedia, but even though this affects a rough minimum of ~335,445 readers of Wikipedia articles (based on the summed March statistics of the affected articles), I will probably catch far less flak when I post my results on the WikiEN-l mailing list than I have already caught for this post here. Humans!

It just occurred to me - I have an active experiment going with deleting random external links on Wikipedia,

I object to this more than I object to the experiment in the OP.

Bless your soul! I was completely disheartened at the disinterest of even Wikipedians in my earlier experiment demonstrating that suggestions for adding external links get ignored. Anger is better than apathy.

I agree, the number of people affected by an amateur experiment you perform is a good measure of how much flak you should catch.

On the other hand, people would be reading his site and drawing the wrong conclusions about D supplementation for two weeks. That's some further-spread epistemic pollution costs.

Yes, that was a major reason for only 3 days. Roberts makes it sound like he was going to do an in-depth analysis or whatever before discussing my data, but I don't believe this: if you look at the vitamin D category, you see he posts plenty of people's reports without formally analyzing their data but just describing it, and he had time to post something like 3 blog posts before I published this, one of which was a link roundup perfect for linking my results.

I didn't realize that people would see the 3 day waiting period as super-questionable. Thinking about it some more, I realize now what I should have done: I should have created a separate page on my site just for the fake results, and sent the subject that but linked it nowhere else. The subject would have no reason to be suspicious, the page would indeed be public, but it would not actually get any traffic from normal readers; hence, I could leave the fake page up for months.

(At some point I could even put up the real results on the main page (for the normal readers), since it would be unlikely for the subject to just randomly visit the page and notice the discrepancy.)

Andrew Gelman also has a post up today about Seth Roberts not being diligent about seeing disconfirmatory evidence: Selection Bias

Edit: Seth makes several responses, including this succinct claim to have avoided confirmation bias: comment

And testing confirmation bias in this fashion is intrinsically deceptive, so I probably have damaged my online reputation as well.

I have no way to back this up, and just posting this tastes like hindsight bias to me. Nonetheless, this was my train of thought when I read this post:

  1. Gwern is using what looks like normal communication as a means to experiment.
  2. Gwern is a data-crunching, prediction-making, experiment-performing, freak-of-nature. (I mean this in an extremely complimentary fashion.) Therefore:
  3. This wasn't the first time Gwern did something like this and it will not be the last time.

Of course, as Rational!Beyonce would say "If I thought it then I should have put a prediction on it."


And testing confirmation bias in this fashion is intrinsically deceptive, so I probably have damaged my online reputation as well.

It's interesting that you mentioned that, because I spent several minutes after reading this being confused, and then I thought:

"It is at least hypothetically possible Gwern is testing confirmation bias with THIS post to Less Wrong."

That was followed by "What evidence would I be looking for to confirm or disprove this hypothesis?"

Which was followed by "It would probably take too much reading to plausibly gather sufficient evidence to make any kind of judgement, and I don't have that much time to assess this post. I'll just put my assessment of this experiment on hold until more verification arrives either way on whether it is or is not a meta experiment."

Was this the kind of reputation damage you were expecting?

Was this the kind of reputation damage you were expecting?

More 'anything gwern says is a lie and his emails should be ignored and anyone reading his stuff be told he is a self-confessed liar'. (I don't think this is a fair appraisal, since I just wrote the lie up in exhaustive detail, and I only falsified 1 out of 9 results for ~3 days while keeping it as low-key as possible. I could have sent the fake results to Roberts privately, but then his assent or dissent would not be as credible as compared to actually posting it or not.)

For what it's worth, I actually had intended to post this as an Article and not a Discussion if Roberts did fail, but only as a Discussion if he passed. Then I realized this was a publication bias - giving higher billing to positive findings - which leads to confirmation bias, so I resolved to post it as a Discussion no matter the result.

Upon reading this, I categorized it with "Towards a progressive hermeneutics of quantum gravity" in the deception department (though not in the 'should have been easily caught' department) - the lie was temporary, used as a delicate test of someone else's honesty, and has probably earned you an enemy and gotten a bunch of other people to trust you less.

Speaking of which, it would be best if you could avoid gratuitous deception (like, if you do something like the volunteer experiment, use the data in all cases but neglect to inform half the cohort).

Post summary: I, Gwern, am better than Seth Roberts. Watch as I trick him with my superior science skills! This is important, because I have proven that a human is biased, something no one has ever done before.

Perhaps a bit rude, but this is what one could read from this. I don't think gwern at all intended to show that he's better than Seth but simply tells an interesting story. Still, there's major signalling going on here, and Seth does have every right to be offended, in my opinion.

And testing confirmation bias in this fashion is intrinsically deceptive, so I probably have damaged my online reputation as well.

Yes, this test might not have passed a university ethics review - not only did you deceive your subject, which must be done carefully, but he didn't even know he was participating in an experiment!

Amateur Science -
I do what I must,
because, I can.
For the good of all of us.
Except the ones who were tricked.
But there's no sense crying over all the missed frills,
You just keep on trying until you run out of pills.
And the Science* was fun,
And you get neat posts done
For the people who are, still alive.

* This parody has not been approved by the FDA for any medicinal purposes nor has it been replicated.

This is cute, but I think it would have been worth trying to salvage the rhyme scheme. (Maybe "... all the missed frills"/"pills" and "And the Science helps most"/"post"?) You also missed a huge opportunity by not starting a line earlier with the words "Amateur science" (sound-alike with Aperture).

Amateur was a major omission, yeah. I don't think 'helps most/post' really works because it diverges too much. 'Done' is hard to rhyme with, but checking a rhyming dictionary, I realized there was a perfect rhyme: 'fun'. As in, 'And the Science was fun / and you make a neat post'. The rhythms scan, the rhyme is perfect, and it makes sense by reinforcing 'neat'.

I like how this response is just as relevant to your non-IRB approvable study on what stimuli cause more effort.

Please tell me that I am not the first person to note the (ethically difficult) deceptiveness of that study, even if the results are interesting.

Why is it ethically difficult? (I don't particularly care whether it passes a university ethics review, I base whether I consider it ethical on my emotional reaction to it and emotional reaction to its consequences)

There is some ethical difficulty in running a study (and performing an intervention) on subjects who don't even know that they are in a study. In the application for a research position, people thought they were applying for a position, not being tested for their responses to stimuli about the use of their work.

Obviously, certain kinds of results are impossible to obtain when you tell the subjects that they are part of an experiment. The value of the results might justify the deception, particularly in cases like these two studies in which the participants did not suffer any significant harm. But it is unrealistic to pretend that no deception occurred, or that deception is not a flag for potential ethical difficulty.

Section 8.02, for starters. And yes, 8.05 and 8.07 provide exceptions, but it's debatable whether they apply here, where harm was actually done to Seth Roberts' reputation. It's less debatable that they apply to the researcher study, where apparently no harm was done.

Arguably, the researchers were harmed inasmuch as they were induced to apply more effort/time than they otherwise would have.

Putting in more work to get another interesting experimental result is a harm to the researcher? On what planet?

Well, presumably if working harder on their submission was their utility-maximizing choice, they would have done so already sans experimental manipulation; if any more quality time was used up, it probably came at the expense of some other activity...

It looks like I badly misunderstood your comment. When you wrote "the researchers," I thought that was a coy way of referring to yourself in reference to the two experimental results of which I questioned the ethics.

I'm not arguing for the optimality of compliance with an IRB or other "ethical" guidelines - I'm doubtful they do a reasonable job of creating morally optimal research protocols, and they clearly prevent the discovery of certain interesting or useful results - like your results from these posts that relied on deception. And that doesn't even account for the compliance costs, which I now realize were the point of your comment. Oops.

I can't find it in my comment history right now, but I've also brought up the apparent lack of ethical oversight in SI-related experiments before. I think the first time was during the first rationality mini-camp.

It would have been passed by the IRB of Gwern U. As long as Gwern is not affiliated with another IRB, that's all that matters.

In the end, what did we learn from the results of this experiment?

(Please take a moment to think about this yourself before proceeding, to avoid priming.)

My answer: Nyzbfg abguvat. Ernfbaf jr qvqa'g yrnea nalguvat nobhg Eboregf:

1) Onfrq ba jung lbh jebgr urer, vg ybbxf yvxr Eboregf punatrq uvf zvaq nobhg gur rssrpgvirarff bs Mrb orgjrra jura lbhe rkcrevzrag fgnegrq naq jura vg raqrq. Guvf tvirf uvz n cresrpgyl yrtvgvzngr ernfba gb cebgrfg ntnvafg lbhe qngn.

2) Bar vagrenpgvba vf fvzcyl abg rabhtu gb tnhtr ubj ovnfrq n crefba vf. Gurer ner znal pbasbhaqvat inevnoyrf urer, naq guvf vf na a = 1 fnzcyr. Eboregf' erfcbafr znl unir orra nssrpgrq ol uvf zbbq, ubj zhpu ur ngr sbe oernxsnfg, naq nal ahzore bs bgure guvatf. Nf lbh fnvq va gur cbfg, a = 1 rkcrevzragf whfg nera'g tbbq fbheprf bs qngn.

3) Rira vs gurfr ceboyrzf jrer pbeerpgrq sbe, tvira gung gur onfr engr sbe pbzzvggvat pbasvezngvba ovnf vf cebonoyl irel uvtu, C(Eboregf vf zber cebar gb pbasvezngvba ovnf guna gur nirentr vaqvivqhny|guvf rkcrevzrag) vf oneryl uvture guna C(Eboregf vf rdhnyyl cebar gb pbasvezngvba ovnf pbzcnerq gb gur nirentr vaqvivqhny|guvf rkcrevzrag). Ubjrire, vs lbh pbhyq pbeerpg sbe gur ceboyrzf V yvfgrq nobir, guvf xvaq bs rkcrevzrag jbhyq or fbzr rivqrapr ntnvafg "Eboregf eneryl snyyf cerl gb pbasvezngvba ovnf."

Gur rkcrevzrag cebirq rira yrff nobhg pbasvezngvba ovnf, zbfgyl sbe gur ernfbaf fgngrq nobir. Nqqvgvbanyyl, Eboregf vfa'g n ercerfragngvir fnzcyr bs nyy uhzna orvatf, naq bar vapvqrag yvxr guvf vfa'g rabhtu gb tvir rivqrapr nobhg n oebnq gurbel yvxr pbasvezngvba ovnf.

Va fhzznel: Guvf rkcrevzragny erfhyg vf vaqvfgvathvfunoyr sebz abvfr, naq qenjvat pbapyhfvbaf sebz vg jbhyq or hawhfgvsvrq.

Not only did you write an entire post in rot13, but in addition your rot13 link is broken.

And testing confirmation bias in this fashion is intrinsically deceptive, so I probably have damaged my online reputation as well.

If anything, you should have more deceptive and convoluted plots if you want to maximize interesting drama.

I'm not in Chaos Army - fear, uncertainty, and doubt are not my goals.

satt points out that (via the Bienaymé formula) “An RCT with a sample size of e.g. 400 would still be 10 times better than 4 self-experiments by this metric.”
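
For readers who want the arithmetic behind the "10 times" figure, here is a rough sketch of it. It assumes the metric in question is the standard error of a mean, and that the measurements are uncorrelated with a common variance \sigma^2 (my simplification, not necessarily satt's exact setup):

```latex
% Assumption: n uncorrelated measurements X_1, ..., X_n, each with variance \sigma^2.
% Bienaym\'e formula: the variance of a sum of uncorrelated variables is the sum of the variances.
\operatorname{Var}\!\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \operatorname{Var}(X_i) = n\sigma^2
\quad\Longrightarrow\quad
\operatorname{SE}(\bar{X}) = \sqrt{\frac{n\sigma^2}{n^2}} = \frac{\sigma}{\sqrt{n}}

% Comparing 4 measurements with 400:
\frac{\operatorname{SE}_{n=4}}{\operatorname{SE}_{n=400}}
  = \frac{\sigma/\sqrt{4}}{\sigma/\sqrt{400}}
  = \sqrt{\frac{400}{4}} = 10
```

So 100 times the data buys a 10-fold smaller standard error, under those assumptions.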

Since this has come up again, I may as well point out that this is a very abstruse argument.

  • First of all, if the standard error in a random variable is low to begin with, or I've already done many experiments, decreasing the standard error of my estimate by a factor of 10 is much less valuable.
  • And second of all, this analysis doesn't connect with anything actionable. What does decreasing the standard error of my estimate by a factor of 10 even mean in actionable terms? How often will this actually end up changing what I do?

The way I'm thinking about this argument is to picture a normal distribution representing my uncertainty about some value. When I do 100 times as many experiments, the distribution

  • becomes skinnier by a factor of 10, and
  • centers itself at a new location, where the probability of the new location is determined by the original distribution. If my original distribution is especially wide, more experiments could be valuable, especially if the new distribution ends up jumping somewhere far from the center of the original distribution. But if my original distribution was plenty skinny to begin with, making it skinnier won't help me.
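
The picture above can be sketched numerically. This is only an illustration under assumptions of my own: independent measurements with a known per-experiment noise `sigma`, a normal distribution for the uncertainty, and a decision that flips at some `threshold` (zero here). The helper names and the numbers are hypothetical, not taken from any real experiment.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean of n independent measurements with SD sigma."""
    return sigma / math.sqrt(n)


def prob_wrong_side(mean: float, sd: float, threshold: float = 0.0) -> float:
    """Mass of Normal(mean, sd) lying on the opposite side of threshold from mean.

    Read this as "how likely is it that the truth would flip my decision?"
    """
    z = abs(mean - threshold) / sd
    # Normal tail probability P(Z > z), computed via the error function.
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


sigma = 1.0  # assumed per-experiment noise (hypothetical)

se_few = standard_error(sigma, 4)     # 4 experiments
se_many = standard_error(sigma, 400)  # 100 times as many experiments
shrink = se_few / se_many             # the distribution gets this many times skinnier

# Wide starting distribution: a lot of mass sits on the "other" side of the
# threshold, so more experiments could plausibly change what I do.
wide = prob_wrong_side(mean=0.1, sd=0.5)

# Skinny starting distribution: very little mass on the other side, so
# making it skinnier still will rarely change the decision.
skinny = prob_wrong_side(mean=0.1, sd=0.05)

print(f"shrink factor: {shrink:.1f}")
print(f"P(decision flips), wide prior:   {wide:.3f}")
print(f"P(decision flips), skinny prior: {skinny:.3f}")
```

With these made-up numbers the wide case leaves roughly 0.42 of the mass on the wrong side of the threshold and the skinny case roughly 0.02, which is the sense in which extra precision stops being actionable.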

See also this comment of mine, which does math showing that just a few perfectly done self-experiments can be quite valuable in actionable terms:

Just out of curiosity, what would you have done if he'd published and linked to the incorrect results? While this version of the story causes you some social damage (gwern is deceptive), wouldn't the damage have been much worse if he'd passed the confirmation bias test?

I was monitoring my email and RSS closely the night I sent the email. I had already written and proofread the real version, and written an email explaining I had discovered a mistake; so to change my site was just a matter of issuing a single revision-control command (darcs rollback) & re-syncing my site, and then sending the email. I don't think the bad version would have been up for more than 10 or 20 minutes past him posting or replying clearly that he would post it.

Two technical glitches: the graph is too large and the relevant portion is blocked by the sidebar, and clicking the link sends me to a different essay.

Image should be fixed.

clicking the link sends me to a different essay.

The link sends you to the subsection of that essay (on predictions) corresponding to this page (on confirmation bias), does it not?

Cult-like behavior. Post has essentially no info (trivial on the specific case, redundant with known when - improperly - generalized) about the topic ostensibly about.

Rich in info on Gwern. Gwern values blah1 over blah2. He knows, or should know, LessWrong will approve of this (on balance, exhibiting personal commitment to investigation versus violation of norms against, etc). Outside LessWrong will not approve of this. Indeed, the cost Gwern pays personally (outside LessWrong) only shows greater personal commitment (to LessWrong), so contributes more to upvoting. But really just character estimation. Post itself is virtually valueless outside that. The attempt to justify this as an important case (previously discussed by cult leader) is dopey (and, if true, broad evidence of dopeyness). And, of course, costs paid by LessWrong more than Gwern. This is the behavior folks will (correctly) think of as LessWrongian. You (general) mostly don't think about that (cultish). Too late to downvote post now.

I'd like to hear about why people are downvoting this, specifically.

I assume some combination of using the word "cult" repeatedly, and the tonal offensiveness - and the writing style is lazy and hard to read (point form). The content is also more offensive than drethelin's similar point above.

For my part, I'd like to restate my point and see if the votes shift.

I feel drethelin's summary of Gwern's post is pretty fair. There's not much of interest in it. I think people are also underestimating how immoral Gwern's behavior is (meaning: how counterproductive if adopted as a rule). It is also injurious in this specific case.

I also think it is pretty obvious that Gwern could expect this post to be upvoted. The injury is not to a LessWrongian, and it confirms "beliefs" held by LessWrongians. Human experimentation is often fun! Gwern may not see himself as explicitly having posted for that reason, but I do think he paid very little heed to outside parties, both as they might perceive this reflecting on LessWrong and as they might be directly harmed (even in odd ways, such as taking this seriously). It is this lack of awareness of the outside world that I called cult-like. I do not believe folks here are seriously letting this hurt their opinion of Gwern, and they should. I would take considerable pains to avoid people who behave in this way.

If I had the option to vote up multiple times, I would have voted up a post like gwern's about a prominent LWer more times than I would have this post of gwern's.

I now want to interact with gwern more but want people I know who are less accepting of abrasive social norm violations to interact with gwern less. I personally would be elated if my friends pulled this kind of thing on me with some (but not too much) regularity.

Edit: fivelier brings up a good point. My first thought was "I wish it was normal to expect someone might be performing this kind of test!", then "Really? Would I really be happy about that?" then "well, if my friends did it, yes, definitely". But I only wrote the final statement.

I marvel at the apparent certitude with which you introspect. Maybe doubt is supposed to be implicit in all such statements, but I'd throw some extra weaseling in sometimes to make it clearer. In any event, your analogy is horrible: "friends" means private, different reputation issues, etc.

Edit: I should add, my point has nothing to do with "abrasive" social norm violations, as I understand that term. I consider that typical LessWrongian self-congratulation.

I've edited to include a fuller version of my thoughts, good point.

Please do me the favor of accepting my apologies for tone, even if you don't care. So, my original response was intended as a compatibility test. If I were a LessWrongian (I'm giving digs for fun), I would call it a literacy test or an intelligence test. So, if you think that -11 (as once it was) is surprising or not really sensible, we're possibly compatible (even though I think the score is fair; personal biases). Otherwise, less likely (is the intent). I was also considering forbidding people from downvoting me. Apologies: just amusing myself. I'll stop now.

If I had the option to vote up multiple times, I would have voted up a post like gwern's about a prominent LWer more times than I would have this post of gwern's.

I would love to test LWers more, but it's hard to find any good approaches. Suppose I wanted to test Eliezer - what field of new data exactly would I supply? If I had a time-machine, I could supply observations of the future's progress on AI and reverse them, but unfortunately I do not have one.

I suppose I could supply critical information on cryonics (claim there were new defrostings?), but when I've supplied good information on cryonics in the past, my posts and comments got upvoted!