Months ago, my roommate and I were discussing someone who had tried to replicate Seth Roberts' butter mind self-experiment. My roommate seemed to be making almost no inference from the person's self-reports, because they weren't part of a scientific study.
But knowledge does not come in two grades, "scientific" and "useless". Anecdotes do count as evidence; they are just weak evidence. And well-designed scientific studies constitute stronger evidence than poorly designed studies. There's a continuum for knowledge quality.
Knowing that humans are biased should make us take their stories and ad hoc inferences less seriously, but not discard them altogether.
There exist some domains where most of our knowledge is fairly low-quality. But that doesn't mean they're not worth studying, if the value of information in the domain is high.
For example, a friend of mine read a bunch of books on negotiation and says this is the best one. Flipping through my copy, it looks like the author is mostly just enumerating his own thoughts, stories, and theories. So one might be tempted to discard the book entirely because it isn't very scientific.
But that would be a mistake. If a smart person thinks about something for a while and comes to a conclusion, that's decent-quality evidence that the conclusion is correct. (If you disagree with me on this point, why do you think about things?)
And the value of information in the domain of negotiation can be very high: If you're a professional, being able to negotiate your salary better can net you hundreds of thousands over the course of a career. (Anchoring means your salary next year will probably just be an incremental raise from your salary last year, so starting salary is very important.)
Similarly, this self-help book is about as dopey and unscientific as they come. But doing one of the exercises from it years ago destroyed a large insecurity of mine that I was only peripherally aware of. So I probably got more out of it in instrumental terms than I would've gotten out of a chemistry textbook.
In general, self-improvement seems like a domain of really high importance that's unfortunately flooded with low-quality knowledge. If you invest two hours implementing some self-improvement scheme and find yourself operating 10% more effectively, you'll double your investment in just a week, assuming a 40-hour work week. (ALERT: this seems like a really important point! I'd write an entire post about it, but I'm not sure what else there is to say.)
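To make the arithmetic explicit, here is a minimal sketch. It assumes that "10% more effectively" means 10% of a 40-hour week (4 hours) of extra output per week; the function names are made up for illustration.

```python
def weeks_to_break_even(hours_invested, efficiency_gain, hours_per_week=40.0):
    """Weeks until the extra output equals the hours invested.

    Assumption: an efficiency gain of g is worth g * hours_per_week
    hours of extra output every week.
    """
    hours_gained_per_week = efficiency_gain * hours_per_week
    return hours_invested / hours_gained_per_week


def return_multiple(hours_invested, efficiency_gain, weeks, hours_per_week=40.0):
    """Hours gained after `weeks`, as a multiple of the hours invested."""
    return efficiency_gain * hours_per_week * weeks / hours_invested


# The post's numbers: 2 hours invested, 10% gain, 40-hour week.
print(weeks_to_break_even(2, 0.10))       # break-even after half a week
print(return_multiple(2, 0.10, weeks=1))  # investment doubled after one week
```

Under that reading, the two hours are repaid in half a week and doubled by the end of the first week, which matches the claim above.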
Here are some free self-improvement resources where the knowledge quality seems at least middling: For people who feel like failures. For students. For mathematicians. Productivity and general ass kicking (web implementation for that last idea). Even more ass kicking ideas that you might have seen already.
Methodologically, each self-experiment is typically much more poorly run than the kinds of trials we try to discuss here (RCTs), so each self-experiment represents less than n=1 of data. The RCTs usually have at least a few dozen and ideally hundreds or thousands of subjects, either singly or pooled for meta-analysis. So a single such meta-analysis represents thousands of subjects times the fractional quality of a self-experiment, leading to the conclusion that one self-experiment is worth somewhere less than one-hundredth to one-thousandth of any comparable RCT.
The human mind doesn't do 64-bit floating weights. It doesn't even do shorts.
BTW, has anyone ascertained what resolution it does do? (Is this even a coherent question?)
Well, it's a question which could be turned into a coherent question in a couple ways, so before getting an answer, you need to decide what question you're asking and what an answer ought to look like. For example:
I don't know the answers to any of these - my own impression is that people have fairly granular probabilities. I don't bother with single-percent differences in my own predictions on PredictionBook.com unless I'm in the 0-10/90-100% decile (where 0% is quite different from 1%).
Rolling dice a ton of times starts running into problems with short-term memory buffer size and conflation with explicit strategies for managing that limit; it might be more useful to provide a histogram of the results of a hundred die rolls and ask whether it's a biased die or not.
Though, thinking about this... surely this isn't an absolute granularity? I mean, even supposing that it's constant at all. I would expect the minimum size of a detectable probability shift to be proportional to the magnitude of the original probability.
This is a question I've thought of posting in discussion before, but I couldn't work out a coherent phrasing. Just how well can the untrained human mind resolve probabilities? Just how well can the trained human mind (e.g. say, a professional bookmaker) resolve probabilities? (Note I have no idea how individual bookmakers do things these days, for all I know they routinely use computers rather than estimating odds themselves. I know the chain ones do.)
The quality of a belief is not linear in the number of participants in the study supporting it.
You're ignoring heavily diminishing returns from additional data points. In other words, to persuade me that studies with many participants really are a lot better, you'd have to do some math and show me that if I randomly sampled just a few study participants and inferred based on their results only, my inferences would frequently be wrong.
This seems pretty clearly not the case (see analysis in my reply to this comment).
Additionally, in domains like negotiation, I'd guess that decent-quality knowledge of many facts is more valuable than high-quality knowledge of just a few. Studies are a good way to get high-quality knowledge regarding a few facts, but not decent-quality knowledge regarding many. (Per unit effort.)
Testing something a bunch of times doesn't make it the thing you most need tested. (And some things may be hard to test cleanly.)
Although the win (expressed as precision of an effect size estimate) from upping the sample size n probably only goes as about √n, I think that's enough for gwern's quantitative point to go through. An RCT with a sample size of e.g. 400 would still be 10 times better than 4 self-experiments by this metric. (And this is leaving aside gwern's point about methodological quality. RCTs punch above their weight because random assignment allows direct causal inference.)
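A quick simulation can check the √n claim empirically. This is only a sketch under idealized assumptions: every data point is an independent, unbiased, unit-variance normal draw, which is exactly the assumption that the methodological-quality point calls into question. The function name and trial count are arbitrary choices.

```python
import random
import statistics


def sd_of_sample_means(n, trials, rng):
    """Empirical standard deviation of the mean of n unit-variance draws."""
    means = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.stdev(means)


rng = random.Random(0)  # fixed seed so the run is repeatable
sd_small = sd_of_sample_means(4, 2000, rng)    # theory: 1/sqrt(4)   = 0.5
sd_large = sd_of_sample_means(400, 2000, rng)  # theory: 1/sqrt(400) = 0.05

print("SD of mean, n=4:  ", sd_small)
print("SD of mean, n=400:", sd_large)
print("precision ratio:  ", sd_small / sd_large)  # theory: sqrt(400/4) = 10
```

The simulated ratio should land close to 10, with some sampling noise from the finite number of trials.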
Where is the math for this?
I agree that methodology is important, but humans can often be good at inferring causality even without randomized controlled trials.
Edit: more thoughts on why I don't think the Bienaymé formula is too relevant here; see also.
(Of course, any systematic bias stays the same no matter how big you make the sample.)
What steven0461 said. Square rooting both sides of the Bienaymé formula gives the standard deviation of the mean going as 1/√n. Taking precision as the reciprocal of that "standard error" then gives a √n dependence.
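Written out, assuming the n observations are uncorrelated and share a common variance σ² (the condition under which the Bienaymé formula applies):

```latex
% X_1, \dots, X_n uncorrelated, each with variance \sigma^2
\operatorname{Var}\!\left(\bar{X}\right)
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{\sigma^{2}}{n}
\quad\Longrightarrow\quad
\operatorname{SE}\!\left(\bar{X}\right) = \frac{\sigma}{\sqrt{n}},
\qquad
\text{precision} = \frac{1}{\operatorname{SE}\!\left(\bar{X}\right)} = \frac{\sqrt{n}}{\sigma}

% Ratio of precisions for n = 400 versus n = 4 (\sigma cancels):
\frac{\sqrt{400}/\sigma}{\sqrt{4}/\sigma} = \sqrt{\frac{400}{4}} = 10
```

None of this addresses systematic bias, which does not shrink with n.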
This is true, but we're also often wrong, and for small-to-medium effects it's often tough to say when we're right and when we're wrong without a technique that severs all possible links between confounders and outcome.
I'm too lazy to do a better analysis now, but just to provide the barest of intuitions:
Let's say a study with trillions of participants has shown that using Strategy A works better than not using Strategy A 80% of the time. I'm about to decide whether or not to use Strategy A, and unfortunately I don't know about the study. I poll three of my friends who have all done rigorous self-experiments. (Or maybe I've done three rigorous self-experiments myself.) All it takes is a pocket calculator to show that I have a 90% chance of correctly guessing whether I should use Strategy A: the chance that at least two of the three results point the wrong way is .2^3 + 3(.8 × .2 × .2) = .104. And obviously if I poll myself, based on a single past rigorous self-experiment, I'll have an 80% chance of getting it right.
(A better analysis would probably use the normal approximation for the binomial distribution, so we could see results for all sorts of parameters, but that would be a pain to write out with my voice recognition system.)
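For what it's worth, the exact binomial version is short enough to sketch. It assumes each self-experiment is an independent, unbiased draw that points the right way with probability p, and that I go with the majority; `p_majority_correct` is a name made up for this example.

```python
from math import comb


def p_majority_correct(p, n):
    """Probability that more than half of n independent reports are correct.

    Assumes n is odd (no ties) and that each report is correct with
    probability p, independently of the others.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Strategy A works 80% of the time: one report vs. three reports.
print(round(p_majority_correct(0.8, 1), 3))  # 0.8
print(round(p_majority_correct(0.8, 3), 3))  # 0.896, i.e. 1 - 0.104

# The effect of polling more people, for a strong and a marginal effect.
for p in (0.8, 0.51):
    for n in (1, 3, 9, 99):
        print(p, n, round(p_majority_correct(p, n), 3))
```

With p = 0.8 a handful of reports already gets close to certainty, while with p = 0.51 even many reports barely beat a coin flip, which is the regime where large samples earn their keep.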
I suspect that scientific evidence is most useful on questions that are hard to decide (e.g. if Strategy A works 51% of the time; incidentally this sort of knowledge is also the most useless), or in cases where your degree of belief matters beyond just choosing whether or not to use a strategy (seems kind of rare).
This last point about degree of belief not mattering much could explain why Bayesian statistics didn't catch on as well as frequentist statistics initially: most of the time, your exact degree of belief doesn't matter and you just need to decide whether or not to do something.
You're making a massive assumption: that self-experimentation is not biased worse than regular clinical trials by things like selection effects. This is what I mean by methodological concerns making each self-experiment far far less than n=1. I mean, look at OP - from the sound of it, the friend did not report their results anywhere (perhaps because they were null?). Bingo, publication effect. People don't want to discuss null effects, they want to discuss positive results. I've seen this first-hand with dual n-back, among others, where I had trouble eliciting the null results even though they existed.
Given this sort of bias and zero effort on self-experimenters' part to counter it, yes, you absolutely could do far worse than random by sampling 1000 self-experimenters compared to 1000 clinical trial participants! This is especially true for highly variable stuff like sleep, where you can spot any trend you like in all the noise - compare the dramatic confident anecdotes collected by Seth Roberts about vitamin D at night based on purely subjective retrospective recall of <10 nights to my actual relatively moderate findings based on 40 nights of Zeo data.
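A toy simulation shows how selective reporting alone can manufacture an effect. Everything in it is an assumption made for illustration: the true effect is zero, each self-experimenter sees the true effect plus unit-variance normal noise, and only results above an arbitrary threshold get reported. It is not a model of any real dataset.

```python
import random


def simulate_reports(n_experimenters, true_effect, noise_sd, report_threshold, seed=0):
    """Return (all observed effects, the subset that gets reported).

    Hypothetical reporting rule: a self-experimenter only shares a
    result if the observed effect exceeds report_threshold.
    """
    rng = random.Random(seed)
    observed = [rng.gauss(true_effect, noise_sd) for _ in range(n_experimenters)]
    reported = [x for x in observed if x > report_threshold]
    return observed, reported


observed, reported = simulate_reports(
    10_000, true_effect=0.0, noise_sd=1.0, report_threshold=1.0
)
print("mean of all results:     ", sum(observed) / len(observed))
print("mean of reported results:", sum(reported) / len(reported))
print("fraction reported:       ", len(reported) / len(observed))
```

The full set of results averages out near zero, while the reported subset averages well above the threshold. Collecting more anecdotes from the reported pool does nothing to fix that.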
(I actually have a little demonstration that someone is engaging in considerable confirmation bias, but I'm not done yet. I should be able to post the result in early May.)
I don't necessarily disagree with you on any of this. Looks to me like we are talking past each other a little bit.
Something about your rough model disagrees with me (in addition to the stuff in gwern's comment). Tentatively I'd put my finger on strategies like your hypothetical strategy A being rarer than they look. I think it's uncommon for a prospective lifestyle change to simultaneously
(Edited to add "be" to bullet point 2.)
Well obviously you have to decide on a case-by-case basis whether Real Science is necessary, but the butter mind thing is looking pretty good:
Would you wait for a real study before trying this?
W. T. F! ?
A half stick of butter every day makes you smarter - and in contrast to an equivalent amount of other saturated fats? That's really rather surprising. I would like to see more research on that. Because it is kind of awesome.
To be sure. I don't think my line of argument should shut the door on self-experimentation. I'd just focus on low-risk, low-effort interventions as candidates. (Otherwise I'm likely to end up with more high-risk/high-effort false positives than I'd like.)
So it is! When I saw the original Seth Roberts blog post my reaction was to write it off as a probable fluke. The fact that it seems to replicate in a randomized trial with n = 45 makes me much more interested, especially as the relative speed-up from the butter remained at about 5% (suggesting Seth's original result wasn't just a high/low outlier). I'd have chosen a different experimental design, and I'll have to take a look at the raw data to convince myself of the analysis, but it seems promising.
As for the Anki thing, I probably wouldn't wait! It's the sort of low-effort, low-risk intervention that's best for self-experimentation.
(Agreeing and elaborating.)
The benefit of self-experimentation in areas where there has already been enough bulk research for meta-analysis lies in personalizing: finding the effects on yourself in particular. Even then, how far a self-experiment should update the predicted benefits to yourself depends on how much variability is found between people (unfortunately not mentioned much in most studies) and on how reliably self-reports (or whatever metrics you use) indicate actual influence.
This means that if I am self-experimenting with modafinil I will update significantly on how useful the substance is to myself, while I wouldn't even bother self-experimenting with respect to how much background noise pollution influences my general wellbeing over an extended period. In the latter case p(background sound is detrimental | my self experiment indicates background sound is beneficial) > 0.7.
I'd be interested to see an analysis of how many failures to replicate we should expect if replicators duplicate methodology perfectly, and whether real-world failures to replicate seem to occur in line with that assumption. Wild guess: there are way more failures to replicate than we should expect. If this guess is accurate, that suggests that experimenters tend to introduce undocumented distorting factors into their experiments, and compiled anecdotal evidence is actually more valuable than experimental evidence if you can find a way to sample it randomly.
To provide some intuition for this guess, I remember reading about some guy who was doing experiments on mice and found that random stuff like the lighting in his laboratory was actually the primary explanatory factor for his experimental results. (Maybe someone else can provide a link? I can't seem to find the guy on Google.) From this he concluded that almost all experiments that had been done on mice previously were useless. But you can imagine a mouse experiment where instead of using 100 mice in a single laboratory, 100 mice in 100 different laboratories are used. This could deal with the random stuff problem pretty well.
Of course, there's also the problem of interpreting study results accurately... So I don't think the number of participants is the bottleneck to making inferences in most cases.
And a meta-analysis obviously won't suffer from the random stuff problem as much.
You're thinking of the mouse study covered by Lehrer in his decline effect New Yorker article, which was Crabbe et al 1999 "Genetics of mouse behavior: interactions with laboratory environment".
Of course the human mind does shorts! They're comfy and easy to wear!
You're actually why we don't do shorts. (Please, never again wear them in public.)
Mine does, it even adds in legs.
I find this post by Seth Roberts relevant: http://blog.sethroberts.net/2012/04/03/lack-of-repeatability-of-cancer-research-the-mystery/
Most people are aware that publication bias exists, but it could be much worse than we think. If researchers do an experiment 6 times and only get their P=0.05 result once, does P really equal 0.05? Who knows how many of the "statistically significant" studies published in the medical literature are being misrepresented this way.
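As a rough sketch of the arithmetic: if there is no real effect and the six attempts are independent (both assumptions for illustration), the chance that at least one of them crosses the 0.05 threshold is much larger than 0.05.

```python
def p_at_least_one_false_positive(alpha, attempts):
    """Chance of at least one p < alpha result in `attempts` independent
    experiments when the null hypothesis is true in every one of them."""
    return 1.0 - (1.0 - alpha) ** attempts


print(round(p_at_least_one_false_positive(0.05, 1), 3))  # 0.05
print(round(p_at_least_one_false_positive(0.05, 6), 3))  # 0.265
```

So reporting only the one "significant" run out of six corresponds to a false-positive rate of roughly 26%, not 5%.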
You use that expression a lot. How does Bayesian evidence differ from mere evidence?
In a certain sense the 'mere' goes the other way. Bayesian evidence includes things that many are in the habit of declaring 'not evidence'. Anecdotes for example. The color of the speaker's underpants. Just about everything.
I'm pretty sure by "Bayesian evidence" he means rational evidence (which is to say, the set that contains all kinds of evidence), rather than legal evidence (the subset of rational evidence that is admissible in a court of law), scientific evidence (the subset of rational evidence that meets certain publication requirements for scientific journals), etc...
Thanks for reminding me, I intended to take the word Bayesian out because it didn't seem to add much.
Yes, this needed to be said, and I suspect a few LessWrongians have a blind spot here.
It bothers me when people say something doesn't work because it is a placebo effect. If it actually has a placebo effect, then it does work!
You are retrieving the silly popular cached belief about the so-called placebo effect.
My advice to you is to unpack the "doesn't work" phrase, perhaps by drawing a causal diagram, and ponder what kind of study could possibly show that the placebo effect in fact does "work". Then exercise your scholarship skills trying to find such a study.
Perhaps, but the definition of the placebo effect you gave in that link ("every random effect other than that of the active ingredient") is not what I was referring to, nor is it what I take other people to mean when they use the term. It is possible that your cached belief of what people mean by the so-called "placebo effect" renders you unable to recognise the phenomenon people are talking about. A definition closer to my intended meaning would be something like "the effect caused by the perceptions and expectations of the patient, rather than the particular substance being administered in the treatment".
Okay, why not? By "doesn't work", I mean its effects are the same as or worse than no treatment at all. By "does work", I mean its effects are more beneficial than no treatment at all. Is that unpacked enough?
Even if I hadn't already pondered such a thing (which I had), the link you provided is just such a study. In fact, the conclusion of the study you linked to (very) weakly confirms my belief that the placebo effect works! Their conclusion states:
When I wrote my original comment I had in mind a particular conversation I had with strangers at the airport. It went something like this:
As for your last piece of advice:
The American Cancer Society (tentatively) explains why the placebo effect works here. Here you can find a study showing changes in the brain chemistry of subjects undergoing a placebo treatment in order to discover the pathway placebos make use of in order to produce their effects. Here is a story about a woman who tracked objective metrics and found large effects from taking placebos (that she knew were placebos). Here is a randomised trial on the treatment of irritable bowel syndrome that found quite large beneficial results from the placebo effect. I could go on and on and will if anyone is actually interested.
My working hypothesis is that IBS is somatisation of mental illness. That could be why the placebo has an effect for it. It's all in the head to start with. Mental illness and IBS are associated.
Nice. Strikes me as a good example of applied rationality.
Thanks for the links, I'll take a look in particular at the Benedetti et al. stuff.
The IBS study was already the topic of the previous, linked discussion so I don't expect to get any more out of it - I've already explained why it (and the excerpt you bolded above) are not sufficient grounds for the claim "there is such a thing as the placebo effect".
Has Alice also had two back surgeries? Otherwise, I'd assign more probability to the hypothesis that their differential response to pain meds has something to do with their different medical histories.
I find this confusing: in the relevant experimental setups, nobody is directly manipulating (intervening on) "the perceptions and expectations of the patient". These perceptions and expectations are themselves an effect, usually not measured except indirectly through the patient's self-reports and voluntarily controlled responses.
What is being manipulated in the studies I've seen (e.g. the IBS study) is a specific experimental condition, namely the type of treatment being administered: either a particular substance or a particular therapeutic intervention or gesture.
When people talk about placebo effect, I take them to mean something like "the mysterious healing power of the mind"; that is, they predict that there is a reliable causal effect of holding particular beliefs, other than voluntarily selecting behavioural responses consistent with the beliefs, in particular an amelioration of some pathology or a mitigation of some symptoms. In the IBS studies and in the Dougherty case, the effects observed are all voluntarily selected responses - there is nothing more mysterious at work than the desire to behave or respond in ways consistent with a professed belief.
Ingesting an inert pill (vs a similar pill with paracetamol) doesn't make my headache or back ache go away, it just makes me less likely to verbally report pain than I would otherwise have. What really takes care of the pain is the paracetamol, and that's what I happen to be interested in.
Playing games with my own expectations, so as to elicit behaviour that is already under my control, is of limited value - if I want to shut up about the pain, I can just decide to shut up about the pain. (I would predict that a study comparing spontaneous pain reports under differential treatments of "Verbal instruction to suppress pain by not reporting it" and "Administration of inert pill" would favor verbal instruction in a wide range of pain-causing injuries or pathologies.)
You test for the effect of a sham treatment (placebo) by comparing it with no treatment at all.
So you've partitioned the treatments between (placebo + other unknown) versus (other unknown). The effect of the placebo is the difference between the two treatments, just as the effect of a "real drug" is the difference between the (real treatment + unknown effects) and (sham treatment + unknown effects).
It's very difficult to arrange an experimental setup for "no treatment at all", because even enrolling people in a study can be considered a "treatment" - it's an expectation-inducing intervention with perforce no control group.
I agree that it's the partition you want, and I'm not saying it's impossible, but - can you come up with a specific way of running such an experiment?
This is the best advice ever. I regret I have only one upvote to give this comment.
Hm. This seems a bit like writing the conclusion at the bottom of the page, to me.
The value of something is not even half an idea - you need value to whom in what context.
Value is about preference of someone, and his preferences will change with context, like scarcity and competitive advantage.