Motivated reasoning, confirmation bias, and AI risk theory

Seth Herd

Of the fifty-odd biases discovered by Kahneman, Tversky, and their successors, forty-nine are cute quirks, and one is destroying civilization. This last one is confirmation bias.

- From Scott Alexander's review of Julia Galef's The Scout Mindset.

Alexander goes on to argue that this bias is the source of polarization in society, which is distorting our beliefs and setting us at each other's throats. How could someone believe such different things unless they're either really stupid or lying to conceal their selfishness? I think smart people who care about the truth go on believing conflicting things largely because of confirmation bias and motivated reasoning.

The corner of civilization I'm most worried about is the one figuring out how to handle the advent of strong AI. I think confirmation bias makes us each a little to a lot overconfident in our beliefs about alignment and AI impacts, and that's pretty bad for collectively finding the truth. I think the effects of biases are still strong and still overlooked in this field, despite its strong values of truth-seeking and relative awareness of biases. Bias has more influence where there's less direct evidence, and that's the case in alignment theory and predicting AI impacts.

I think the effects are underappreciated in part because empirically measured effect sizes tend to understate the problem. Confirmation bias happens at multiple stages of cognition, so it compounds during complex thinking.

In this article, I'll talk about the relevant empirical research, challenges to reasoning about complex topics with a human brain, and some implications for AI risk and alignment thinking. I studied the brain basis of cognitive biases on an IARPA program on understanding biases in intelligence analysis, from 2011-2014. I became fascinated by motivated reasoning, and kept it as a research interest until switching to alignment in 2022.

Confirmation bias is well known, and careful thinkers already try to avoid its effects. But the mechanistic explanations of confirmation bias are rarely discussed. Confirmation bias seems to be caused by several locally or partly rational effects.^[1] The primary sources seem to be motivated reasoning; differing prior beliefs; discounting evidence; and coherence bias (§2.4). I focus on motivated reasoning and the cognitive limitations or problem complexities that create fertile ground for confirmation bias.

Understanding our biases and limitations does not cure them, but it's a start at correcting for and working around them.

Confirmation bias may play a large role in group and personal epistemics. Measured effects on specific tasks are modest, but they can compound in complex problems. Studies have demonstrated confirmation bias in selecting evidence or arguments, in evaluating them, and in remembering them. There are also biases resulting from choice of framings and hypotheses for evaluation, and social effects of weighting evidence and opinions of some experts over others. That's five layers across which biases can cascade or compound, and biases are usually pushing in the same direction at each layer. Section 4.3 contains some rough estimates of total effect sizes; they go from large on up, depending on assumptions about how carefully you're debiasing your thinking.

I recently realized that motivated reasoning was stopping me from writing this article. I was afraid of writing it badly and motivating readers against the topic itself. This fear was giving me a negative reward signal, because motivated reasoning could be a major factor in alignment thinking, and I care a lot that we collectively get this right.

To allay my remaining fears: I'm not telling anyone they're wrong about AI impacts or alignment. Despite thinking about and researching these questions a lot, I'm not confident where the truth lies. Motivated reasoning could easily go in multiple directions. It could be simple motivation to look forward to a bright future. Or it could spring from attachments to theories or group membership, or identities as farsighted or willing to look doom in the face.

These sources of confirmation bias are pernicious and difficult to correct. I don't think I'm close to correcting all of my own. But I think the effort is worthwhile. I don't know how much effort you've put into correcting for motivated reasoning and other sources of confirmation bias, but I suspect for most there's still low-hanging fruit and benefits to claim. I discuss some and speculate on more in the last section.

1.1 Motivated reasoning^[2]

Confirmation bias is an effect in which we are irrational in favor of beliefs we already hold. Motivated reasoning is one cause of that effect. Loosely defined, it is our tendency to believe what's comfortable or useful. Motivated reasoning in this sense is largely non-conscious. The term is sometimes used for deliberately selectively presenting evidence and arguments. But here and in the academic literature, motivated reasoning refers to an accidental, unconscious bias.

Here I'm primarily addressing motivated reasoning and other sources of confirmation bias within scientific or expert communities, particularly the AI risk community. The same sources of bias probably have even greater effects on public opinion, but I mostly leave that as a separate topic.

The core issue is that our reasoning is directed by our motivation, by means of reinforcement learning. Getting correct answers is on average rewarding. So is getting answers we like for other reasons, or answers our peers like.^[3] Our brains mix predictions of those two types of reward. Each belief is shaped partly by whether the path to it felt comfortable.

"Wait!" you might be saying. "I care about the truth! I don't just believe what's comfortable!"

Yes, that's partly true. Believing in seeking truth when it's hard does provide some resistance to motivated reasoning. Truthseekers enjoy changing their minds sometimes. But it doesn't confer immunity. Rationalists still have emotions, and it's still usually more comfortable to think that we're already right because we're skilled reasoners who've already discerned the truth.

Motivated reasoning is a miniature "Ugh field" around evidence and arguments that might disprove a belief you value. There's an unpleasant anti-reward feeling signaling you to think about something else. This can be generated by a flicker of a thought or a pre-learned association. Either makes the accurate prediction that this thought could lead you to having to admit you were wrong, and doing a bunch of work re-evaluating all of your related beliefs, both negative reward predictions. The mind twists away from unpleasant conclusions, and it does so before consciously confronting them.

This is a natural consequence of how the brain estimates the value of predicted outcomes and uses that to guide its decision-making. Those include micro-decisions about what to attend to. I wrote and co-wrote papers reviewing all of the neuroscience behind this, but they're very much written for neuroscientists. So I recommend Steve Byrnes' valence sequence; it perfectly describes the psychological level, and he's basing it on those brain mechanisms of dopamine-driven reinforcement learning even though he's not directly talking about them. And he's a great writer.

While researching this post I shifted to giving more weight to other causes of confirmation bias and cognitive limitations. There are other causes like discounting evidence and assuming coherence that are sometimes or locally rational. I kept more relative focus on motivated reasoning since it's what I know best, but I did learn some interesting things about the other semi-rational causes, which I'll try to share. We'll discuss each of these in section 2.4, and strategies for compensating all of them in section 5.1.

Motivated reasoning is also rational in an important sense. Suppose there's some belief that really doesn't make a difference in your daily life, like that there's a cozy afterlife, or which of two similar parties should receive your vote (which will almost never change any outcomes). Here the two definitions of rationality (epistemic and instrumental) diverge: believing the truth is now at odds with doing what works. It will obviously work better to say you believe what your friends and neighbors believe, so you won't be in arguments with them and they'll support you more when you need it.

If we had infinite cognitive capacity, we could just believe the truth while claiming to believe whatever works. And we could keep track of all of the evidence instead of picking and choosing which to attend to. But we don't have unlimited cognitive capacity.

Our cognitive limitations create fertile ground for confirmation bias. We're making lots of decisions and quick judgments when we do complex thinking, and each of these is a new avenue for confirmation bias and motivated reasoning to influence thinking. And those effects probably compound across types and stages of reasoning. I'll come back to this after discussing the research on confirmation bias effects.

So motivated reasoning, confirmation bias, and the resulting tribalism are important factors, even for a devoted truthseeker.

Recognizing motivated reasoning, confirmation bias, and cognitive limitations has some downsides and some upsides. You may lose some hard-won sense of confidence.^[4] But it allows us to view those who disagree with us more as fellow well-meaning but confused primates and less as dishonest or malicious rivals. And it offers routes to compensating for our own biases and limitations, and communicating around others’.

2. Empirical evidence for confirmation bias

This is my take on the overall literature; I'll talk about a few specific example studies below.

Confirmation bias causes small effects when problems are easy and topics aren't emotionally charged. It causes larger effects when questions are complex and important on an emotional level. Like, unfortunately, the broader questions of alignment.

Studies framed as motivated reasoning probably capture a mix of causal effects. So I'm discussing them under the umbrella term of confirmation bias, and then separately analyzing how much of that might actually arise from motivation.

We might hope that expertise would reduce confirmation bias, but empirically, it appears not to. There have been concerns that expertise in some cases seems to actually create more confirmation bias (e.g. Kahan's "motivated numeracy" and many other studies). Fortunately, those effects have not replicated; unfortunately, subject knowledge, intelligence, or domain skill doesn't usually reduce bias, either. All of these give more ways to correct biases, but also more cognitive tools to justify our conclusions.

The relevant effects are modest to large by behavioral psychology standards, and vary widely under different conditions. They're typically not that large in an intuitive sense; on the order of 10% for some relevant cases, for selection, evaluation, and memory. But since those effects take place at each of those cognitive stages, they can have cascading or compounding effects. Each stage is an input to the next, so effects roughly multiply; see §4.3 for a very rough estimate of total effect sizes after compounding.

Researchers typically distinguish three types of effects: evaluating evidence, selecting evidence, and remembering evidence. The Mechanics of Motivated Reasoning (Epley & Gilovich 2016) and Partisan Bias in Political Judgment (Ditto et al. 2023) are good starting reviews for this topic.

2.1 Bias in evaluating evidence

This effect is usually studied by asking people how good they think some evidence or argument is, and comparing people motivated to consider it convincing to people motivated to think it's not. It's considered bias if people think arguments/evidence that are congruent are more valid than ones incongruent with their motivations or beliefs. The effects are "moderate" in psychological terms, often around 8%-16% differences in ratings (like 3.5 vs 4 on a seven-point Likert scale for "rate the quality of this evidence"). This is a rough and average translation of the less intuitive r=.25 and D=.5 from one recent meta-analysis of political bias studies. I estimated the average effect sizes by looking at standard deviations in a handful of studies from that meta-analysis. I trust it to be close.

Close enough is good enough, because that's not the biggest approximation. Here and elsewhere, uncertainty in the effect size is secondary to guessing how the effect generalizes from lab conditions to relevant real-world conditions. Study designs and populations vary, and none of them capture the conditions we actually care about. But such is science. The effects I mention have replicated extensively; I've dropped several lines of research when I realized they might not generalize or capture the underlying causes I address here.

We might wonder how much it's worth to correct a 10% or so bias. But that appears to be the effect size before confirmation bias compounds across stages of processing. More elaborate and important conclusions, like "this is what my research results mean" or "what are my political beliefs" probably have more opportunities for compounding effects of confirmation bias, as well as more opportunities for contact with outside evidence and arguments. More on this in section 3.

Effects can be larger with more deliberation, and with stronger beliefs. Motivated Skepticism in the Evaluation of Political Beliefs (Taber & Lodge 2006) found effects of 30 to 40% among those with strong beliefs and more subject knowledge when they gave participants longer to respond, despite giving instructions to "set feelings aside" and "be objective" when evaluating the quality of arguments and evidence. This is almost a worst case for confirmation bias, but it's also the most careful analysis of the pattern of thoughts producing those biases.

They timed responses and afterward asked participants to write down all of the thoughts they had in that time. Those with the strongest beliefs and most knowledge spent 25–50% longer thinking about arguments incongruent with their beliefs (22 seconds average), and the extra thinking was mostly denigration. Steelmanning the opposing side and criticizing arguments on one's own side were each around half a thought per argument on average, while denigrating thoughts on incongruent arguments ran 6+ (as scored by raters). Those with less knowledge and weaker beliefs were closer to parity but still had around three times more thoughts denigrating incongruent and bolstering congruent arguments.

We aren't undergraduates in intro to political science. I hope I've thought more and care more about good epistemics, and have developed better habits. You probably have too. But I notice my thoughts charging off in this direction when I encounter arguments incongruent with my beliefs. I can corral them back into steelmanning the arguments, but I wonder how often I'm doing this by accident when I'm not paying close attention. Evaluation of evidence can be arbitrarily complex if you spend any time on it. If you read an argument that's decent but incomplete, you can do an arbitrary set of moves before deciding how to update your beliefs. That includes reviewing some of your favorite counterarguments. This can result in fake updates, in which encountering new evidence can cause us to review our favorite old evidence and re-update on that.

At one point we thought this was crippling; there was a "backfire effect", in which presenting multiple balanced sources of evidence strengthened existing beliefs. Fortunately, that turned out to be real but rare; it was curiously specific to WMDs in Iraq; it didn't replicate to even fairly similar situations (Wood & Porter 2019). But the primary effect replicated robustly; people think arguments are better when they lead to a comfortable/confirming conclusion.

Fully evaluating how relevant those results are to you or your colleagues in alignment research requires knowing exactly how these studies are run, and on what sorts of people and topics. I've noted some particularly salient points, but a full understanding would require reading each study. In lieu of that, here's a general description that applies to many of the studies I cite, in case you want that depth.

General methods in empirical study of confirmation bias

Studies of motivated reasoning and confirmation bias almost always use fairly simple lab tasks. Participants were usually undergraduate students for older work, prior to around 2005. These students were sometimes paid small sums, but more often required to participate in several studies for class credit in introductory classes. For more recent work, students are still sometimes used, but online survey services are more often used. Participant pools vary, often selected for an interest in the small payments for quick piecework.

There are many different paradigms, but here's an aggregate of the most typical/canonical. First, participants are asked about their background beliefs or affiliation (often political), usually on a scale like 1-7 strongly agree to strongly disagree. Then researchers ask their opinion (usually on the same scale) on some related topic (like how effective a public policy would be). Then participants are asked to select some arguments or evidence to look at (e.g., a list of four relevant article titles asking them to click on and read one.

Participants' preference for looking at congruent evidence or arguments (supporting their measured or estimated beliefs, e.g. those typical of their party) is scored as bias in selecting evidence. Asking them how good or important that evidence or argument is gives a measure of bias in evaluating evidence. Asking how their opinion has changed gives a measure of overall confirmation bias. This is calculated by comparing the change in their opinion given the evidence/arguments they saw, relative to those with different beliefs/motivations. Finally, a study might measure bias in memory with a recall test after a delay.

Evaluation of evidence in the broader sense could extend to selecting frames or hypotheses within which to evaluate it; more on that in sections 3 and 4.1. However, studies like the above don't usually give enough time or ask deep enough questions for those framing effects to take center stage.

In sum, bias in evaluating evidence is a real effect; it's hard to guess how strong this is on average, and how it applies to careful thinkers on alignment questions. The impact will depend on how careful we are to compensate and steelman. I'd guess it is by default a large effect even before compounding with bias in other cognitive steps.

2.2 Bias in selecting evidence

Bias in selecting evidence is harder to explain as locally rational. It's more likely to be caused by motivated reasoning or simple associative processing biases.

One early test is the Wason card sorting task. Subjects are told to test an abstract rule like "All cards with a vowel on the front have an odd number on the other side" and then are shown four cards they can turn over to test the rule. The Wason Selection Task: A Meta-Analysis (Ragni et al., 2017) of 228 experiments showed 89% choosing the confirming card vs ~25% choosing the disconfirming card; the confirming card gives no useful information according to the experimenters' intended interpretation. This is a massive and fairly pure demonstration of confirmation bias; it appears to be largely powered by the associative nature of cognition "vowel... odd... okay I'll flip those."

The effect probably includes some assumptions misgeneralized from experience like "the rule probably also means odd numbers can't have a consonant on the other side" and "if it was worth mentioning, vowels and odd numbers are probably rare". See Oaksford & Chater 1994 for a defense of those assumptions as rational; I think these explanations account for some minority of the effect, leaving most of the large effect as pure associative thinking. We notice what's on our mind; this cause of confirmation bias isn't even locally rational. Ideological Bayesians is a nice brief treatment.

Bias in selecting evidence in tests more directly relevant to complex belief formation is also a large effect size. One meta-analysis, Feeling Validated Versus Being Correct: A Meta-Analysis of Selective Exposure to Information (Hart et al. 2009), found a mean odds ratio of 1.92 in selecting consistent vs inconsistent evidence (usually from a list of article or argument titles). Selecting almost twice as many pieces of evidence for your view as against it is likely to substantially skew conclusions toward confirmation (or motivation, in the rare cases where the two diverge).

But confirmation bias in selection of evidence is relatively well-known. You are probably already making some efforts to compensate for it. Confirmation bias is well-known in rationalist circles, as the opening quote indicates, and looking selectively at evidence is a pretty obvious trap. If you're highly aware of confirmation bias effects on selecting evidence, you might be avoiding a lot of the selection effects by making sure you seek out sources and think about evidence that lead away from your favored goals.

However, it might be harder to watch for biased selection of evidence when you're selecting arguments or evidence internally. The Taber & Lodge self-report cited above suggests that the baseline is highly biased. Given the degrees of freedom for internally selecting evidence, it could be easy for motivation to get substantial sway. Steelmanning the opposing argument in a serious effort should substantially counteract this effect, but that takes time and developing the habit.

2.3 Bias in remembering evidence

I didn't dig as far into the literature on memory effects, and they aren't studied as much as evaluation or selection. The effects I looked at range from about 10% better memory for congruent/confirming evidence or arguments, down to absent or even reversed, with incongruent evidence/arguments remembered better. More on that occasional reversal later. But that effect size comes from studies of cued recall; people are cued to try remembering a set of arguments they were exposed to earlier. It doesn't measure free recall, or which arguments we tend to remember on our own. Equally relevant is the limited work like the Taber & Lodge study from § 2.1, in which people report which thoughts/arguments come to mind when they're thinking about how to evaluate some evidence. They report lots more congruent arguments, especially from more knowledgeable and committed people. The recall process itself can be motivated; the goal often isn't "remember some arguments" but "remember arguments to prove this irritating point wrong".

In addition to the apparent bias for remembering arguments congruent with our beliefs, I think we might sometimes remember the most irritating rather than the best arguments against our favored position. This could play into how biases compound as we mentally run arguments and counterarguments for a position. Remembering emotionally salient counterarguments may lead us to accidentally strawman opposing positions by reviewing the worst arguments for them. Or if we're most emotionally engaged by the best arguments, this motivated memory bias could actually counteract confirmation bias and lead toward truth.

2.4 Other causal explanations of confirmation bias effects

When I came back to the topic for this post, I found some new explanations of confirmation bias effects classically attributed to motivated reasoning. These are:

Updating from differing prior beliefs
Discounting of evidence from ideologically opposed sources
Coherence as a useful inferential bias

How to Distinguish Motivated Reasoning from Bayesian Updating (Little 2025) gives a formal proof. For any effect where we have only a proxy for motivations, and beliefs aren't known (for example, knowing someone's political affiliation), there's a "Fully Bayesian Equivalent" agent that would produce identical observable beliefs. This agent has different priors but no motivation. The difference in update comes strictly from its priors. The skeptical import of motivated reasoning (van Doorn 2023) and The evidence for motivated reasoning in climate change (Druckman & McGrath 2019) make similar points. Selective scrutiny and belief polarization can look like rational updating from different priors. However, the global rationality of those updates can be questioned. Those models sometimes require strong assumptions of different priors. And it seems wrong to call a process fully rational if it can lead to two equally intelligent and "rational" agents disagreeing with each other, based on which evidence and social connections they happened across first.

There's another likely causal mechanism of confirmation bias; coherence of representations/world models, acting in many ways. See Toward a General Framework of Biased Reasoning: Coherence-Based Reasoning (Simon & Read 2023). I think this is correct on mechanisms, although I'm biased; I've collaborated with Steve Read in the past, and descend from the "connectionist" academic tradition that frames this explanation. In brief, coherence is often a very useful inferential bias. But it can create cofirmation bias.

The exact mix of causes is important, but it's secondary to the existence of strong biases for confirmation and coherent or comfortable beliefs. The alternate explanations for confirmation bias effects change how we might fight these biases, but not whether the effects exist. "Rational" biases like differing priors and discounting sources of disconfirming evidence are only locally rational within specific highly questionable assumptions that my priors and my ingroup are better and more trustworthy. Assuming such Epistemic Luck seems like an easy but large mistake to make.

2.5 Empirical evidence for motivated reasoning

There are also a few studies that show motivated reasoning effects persisting where prior beliefs or discounting source credibility rationally don't account for the effects. These are much stronger evidence for the causal effects of motivation itself.

Understanding Partisan Bias in Misinformation Judgments (Hubeny, Nahon & Gawronski 2026) uses a clever procedure where they give a personality test, and then tell participants (falsely, while randomizing) that their personality matches some national character and assigns them to "team France" or similar. They find small but highly significant effects, despite the minimal motivation from that manipulation. Motivated Reasoning and the Wason Selection Task (Dawson et al. 2002) used the same favorite trick of lying outright to subjects, and showed that they were far more likely (approx. 15% vs 50%) to seek disconfirming evidence properly if they were told it would disconfirm evidence that they might die early, or disconfirm a negative stereotype about them. Of Preferences and Priors (Celniker & Ditto 2024) shows that people rate scientific studies' methodology much lower when their results are incongruent with their politics and beliefs, relative to a baseline of not knowing their results. They measured prior beliefs explicitly and found that they had a separate effect from preferences.

The direct evidence is pretty limited since the problems with older studies weren't recognized until recently. It's enough to be indicative, but not enough to build an interpretation entirely on it. Collectively, these and a few other studies I've found suggest that a good fraction of the effect is probably really motivated reasoning, but not all of it. In making this judgment, of course I'm placing some weight on my own priors, the mechanistic story and indirect evidence for expecting that the human brain as a reinforcement-learning and reinforcement-seeking system should produce motivated reasoning.

All of those causes of confirmation bias should be expected to have stronger effects where it's harder to discern the truth, and where they can compound across multiple stages of reasoning.

3. Limitations in human cognitive capacity for very complex problems

The effects of confirmation bias need to be understood in relation to the cognitive "playing field" on which it acts.

Cognitive limitations in the face of complex problems also seem somewhat neglected. It's more comfortable and easier to assume that smart people can understand whatever they turn their minds to. I think this is true in the limit; we can understand anything with enough work and careful approximations. But the difficulties of attaining reliable understanding are real, and understanding those difficulties can help us understand the world more efficiently.

When human brains process complex topics like alignment and predicting AI impacts, the process probably includes a lot of judgment calls where biases can enter . But the evidence for that is indirect. If your intuition matches this, you could skip this whole section with the mental tag something like the following: Human reasoning on complex and open-ended "wicked" problems is pretty approximate and includes a lot of judgment calls based on intuition. So confirmation bias and motivated reasoning probably have a lot of leeway to work in questions of AI progress and alignment.

Here's the argument structure of the rest of this section. If you want to follow my process in deriving this, and hear about some of the research, read on.

Introspection suggests that we're not systematically updating complex hypothesis structures
- let alone accurately summing across all of the possible different structures
What we know about expert intuition suggests that our unconscious (system 1) probably isn't doing better than our conscious (system 2)
Bayesian reasoning is limited.
- It doesn't cover creating hypotheses and causal links among them,
- or coming up with likelihoods for updating

3.1 Introspection suggests fuzzy models and updating

Can you lay out your Bayesian hypothesis space for a complex, important question, like why you're working on what you are, or your prediction for AI outcomes? Do you feel like you're updating a set of hypotheses anything like the below? I do not.

If you spend 30 seconds thinking about your model of one of your favorite complex topics, I think you'll find it pretty clear that there's not a discrete and well-defined set of hypotheses with causal chains that lead all the way to evidence. If I try to inspect my hypothesis space, it's pretty vague and inconsistent.

That's not necessarily a problem. Discrete hypotheses don't fit the structure of the world all that well, anyway, and the brain is evolved to work in a complex world. So we might hope our brain handle this sort of update outside of our conscious awareness. Unfortunately, it's probably doing very approximate and incomplete updates, because that's not the type of thing unconscious processes are good at.

3.2 Intuition vs. analysis - evidence and brain mechanisms

Intuition or System 1 processing is largely non-conscious, while analysis or System 2 processing is more accessible to our conscious awareness. Causal reasoning of any complexity is usually System 2 processing, a useful, learned sequence of System 1 cognitive acts. "The new Claude constitution makes me update slightly toward developers taking the alignment problem seriously, and that makes me reduce my probability of disaster" is a summary of minimal System 2 processing. But the amount of that update isn't going to be well-calibrated, because it was performed in a single step, by System 1.

The brain is designed to do something resembling optimal inference, but only for a certain type of inference. Brain mechanisms are evolved for things like guessing whether predators are nearby, not for answering questions like "what should I work on now to optimize our chances of getting good results from progress in AI".

System 2 is not our forte; it's a relatively recent evolutionary adaptation. It works, but clumsily. Most of evolution's efforts were devoted to System 1 processing. This isn't the place to make the full mechanistic argument, and there's not a full consensus on that level of brain function, so I won't waste your time with more of my theories of brain function here.

But we can fall back on empirical work on when intuition is reliable and when it's not. Conditions for Intuitive Expertise: A Failure to Disagree (Kahneman & Klein, 2009) is a rare type of research I especially trust: an expert integrative review, based on a collaboration starting from two seemingly opposed perspectives and working collaboratively to find points of agreement.^[5] Kahneman & Klein identified three conditions for good expert intuition: the environment has to have stable regularities, the expert needs enough practice recognizing them, and feedback needs to be rapid and clear. Their examples of areas that meet these criteria are chess, firefighting, and some areas of medicine. Long-range geopolitical forecasting, clinical psychology, and stock-picking don't.

I think it's safe to add broad alignment and AI forecasting questions to the list of areas where intuition won't work well. Far from having rapid and clear feedback, they have little to none.

Superforecasting could be taken as a counterexample. Tetlock's Good Judgment Project showed that some people, using specific cognitive strategies like breaking problems into components, updating frequently, and calibrating confidence consistently outperformed untrained experts. We're trying to do those things to predict AI impacts and think about alignment. But superforecasters have been able to use a lot of feedback from historical examples to learn from. The AI impacts and alignment challenges we care about haven't happened yet.

One interesting discussion in Conditions for Intuitive Expertise was that experts in many domains with poor feedback are worse at forecasting than algorithms- even very simple ones from the 80s. The human experts were highly variable; for instance, judges might predict recidivism based on some detail of someone's story or demeanor; descriptions of their past behavior proved more predictive. Superforecasters can predict better than that, in domains where they've practiced. Superforecasting skills probably don't generalize very well to domains like alignment and AGI, since those domains don't have training sets to practice against. These are largely out-of-distribution relative to the problems superforecasters have trained on. Those are shorter-term and mostly don't involve black-swan events. See this post for more.

3.3 Bayesian reasoning is an ideal, not a method

If you're already aware of the limitations in applying Bayesian methods to complex problems, you can probably skip this. It's been covered elsewhere; Against strong bayesianism, bayes: a kinda-sorta masterpost, and the intro of Approximately Bayesian Reasoning are three great sources.

It seems impossible for a human to "be a Bayesian" in anything like a full sense, because we just don't have the cognitive horsepower to take in and properly weigh all of the relevant evidence. We can't update properly across all of the possible hypothesis spaces without spending excessive time on System 2 processing. How well we can approximate it in complex domains hasn't been studied in depth.

The problem isn't just that we have a hard time doing accurate Bayesian updating, although we certainly do. That would be fine if the evidence were overwhelming. But in complex domains, errors in updating can propagate and overwhelm the small signal in complex data.

An equal or bigger problem is that Bayesian reasoning by itself isn't adequate for understanding our complex reality. Reality doesn't come prepackaged into hypotheses for us to evaluate. Choosing a causal model is a lot of the work, and the sainted Reverend Bayes and even his more sophisticated modern followers have little to say about how to do it.

It's possible to choose hypotheses broad enough to cover the important questions, but that leaves a different problem. Suppose you choose the broad hypothesis "AI will go well for humanity," and then update that on a piece of evidence like Claude's new constitution. To make that Bayesian update, you have to estimate the ratio of p(constitution | good AI outcomes) to (p(constitution | bad AI outcomes). That is pretty clearly a pretty wild guess.

The alternative is to make a more elaborate causal model. That would make updating on evidence less of a guess, but it would introduce the challenges of accurately propagating belief updates. As discussed in the previous section, that's not something our brains do well without a lot of effort and skill. Propagating belief updates through a complex model is probably a skill that can be developed, but the way I see people write about this suggests that their updates are roughly as approximate as mine seem to me.

3.4 AI risk is complicated

We can view the problem from the other side, as well. Looking at the complexity of the problem helps us understand why our limited brains have trouble dealing with it efficiently.

Problems in the alignment space that seem local often have complex dependencies on surrounding questions from other fields. Choosing a useful research agenda depends on specialized technical questions, but it also benefits dramatically from having a model of how our first AGIs will be built and deployed. And broader questions of global strategy are entirely dependent on that question. That central question, how transformative AI will function and be used, includes questions from many fields. And it requires us to successfully extrapolate work in those fields to conditions that have never existed.

You don't have to address all of these neighboring hard questions to answer some easier but less useful questions. "Is this line of research going to help align LLM-based AGI" touches only a few fields. But the connections to other fields and subfields grow rapidly if we allow ourselves to consider them. And for the really important question, "what should I do to make AI go well," it really does touch on open questions in all of those fields.

I think we can be justifiably pretty confident in our answers to scoped questions we've spent a lot of time on and developed knowledge and expertise for. My concern is that the bigger questions require assembling a lot of those domains, and there's a "blind men and elephant" property to the problem. We each have expertise in some of the relevant questions, but not all of them. So we tend to over-apply our specialty to understanding the whole problem, (like the man feeling the elephant's leg thinking it's a tree, and so on).

We know how much we know, but almost by definition we don't know everything we don't know that would be relevant. So it's almost inevitable that we'll underestimate what we don't know, and how relevant it is to the problem. That seems likely to make us overconfident.

None of us can claim expertise in all of the relevant fields, or even in all the subfields of our primary fields. Even if someone did manage to attain adequate expertise, putting all of the pieces together into accurate models would be another large project. And if somebody managed all of that, they'd still have to write it all up clearly enough to convince everyone else that they had figured out what's going on!

These issues are known and recognized. See this brief annotated bibliography of LW posts.^[6] I see some careful thinkers frequently acknowledging their model uncertainty, but this is pretty rare. (I'm afraid we're just not hearing from some people who are more epistemically cautious; this is a separate problem). But I also see very sophisticated reasoners failing to acknowledge or express their uncertainty. And I catch myself doing the same. This seems to create a lot of churn and confusing arguments, in which people argue against the level of certainty, which is confused for arguing against the primary arguments in that direction (and that confusion seems to happen in both directions).

It takes more time and attention to make or take in epistemic notes alongside object-level arguments. And even if we do decide to prioritize epistemic clarity, there are a lot of habits of thought to remember and cultivate. But on topics like AI predictions and alignment where the uncrtainty is large and can be crucial for decision-making, I think that effort is usually worthwhile.

In sum, the complexity of predicting AI progress and understanding AI creates more necessity for judgment calls where confirmation bias can compound.

4. Compounding of confirmation bias

The causes of confirmation bias exert their influence at several different stages of thinking about complex problems. Each of these stages creates the input for the next, so the effects of bias at each stage must compound with those at later stages. No study has captured the net effect of all of this bias. So we're stuck making rough estimates of the total impact of confirmation bias and motivated reasoning. That estimate has to account for compounding across multiple stages of reasoning.

We can do some very rough guesses at the structure of compounding. We have at least five types of reasoning which seem likely to have compounding effects:

Choosing framings/hypothesis spaces
Selecting evidence/arguments
Evaluating evidence/arguments
Remembering evidence/arguments.
Social sources of evidence/arguments

The process by which we arrive at beliefs in complex domains is unknown, and probably pretty varied and idiosyncratic to individuals. I don't know of a study that's even attempted to simulate this in any detail. Theoretical work on how the brain does this is pretty limited; this was my main interest while doing neuroscience, and while I think I understand the broad outlines, that's not much help in assembling a causal model of someone thinking for weeks or years about important topics.

So we need other ways to guess how biases aggregate in complex cognition. It may be useful to look at two attempts to make complex reasoning explicit, to help think about the many (many) decisions that go into making a complete model on a complex topic.

We'll look more at framing and social effects, since the other entry points for confirmation bias were covered in section 2.

4.1 Example of frame/hypothesis choices and confident disagreement among experts

I'll use two examples. Both have thought carefully about rationality and epistemic rigor. This pair does double duty, in that it also illustrates the central problem I'm indirectly trying to address in this post: disagreement among experts on critical questions about alignment and AI progress. Our best thinking, even within the rationalist community, is not producing convergence. It results in what I believe are honest disagreements, but with both parties confident they are correct. This appears to be a dramatic failure of our best epistemics to date, and one that could be our undoing when applied to alignment.

My examples are Nate Soares' AGI ruin scenarios are likely (and disjunctive) (2022) and Joe Carlsmith's Is Power-Seeking AI an Existential Risk?, (2022) although there are many such examples to be found (e.g., Paul Christiano is in some ways a better contrast to Soares, but I don't know of a place he's tried to convey his causal models in this way; his My views on “doom” (2023) focuses more on conclusions). Both of these are approximately p(doom) models, but have very different structures. Each author states that they're dramatic simplifications of their mental models, despite the complexity of what they do present.

Carlsmith's causal model is conjunctive, in contrast to Soares' disjunctive model, below. He posits six steps, all of which must happen for AI disaster:

advanced AI is developed,
it's given dangerous levels of power,
it has misaligned goals,
this isn't corrected,
it seeks power, and
this leads to existential catastrophe.

He assigns probabilities to each step and multiplies through to get ~5% p(doom) as a conjunctive product (updated to 10% in 2023; I wonder what he'd say now). He provides extensive discussion of each point, but no further explicit structure.

Soares, on the other hand, says if we develop AGI soon, doom is disjunctive; success requires that all of these conditions are met:

The world’s overall state needs to be such that AI can be deployed to make things good.
Technical alignment needs to be solved to the point where good people could deploy AI to make things good.
The internal dynamics at the relevant organizations need to be such that the organizations deploy an AGI to make things good.

Unstructured sub-bullets (around ten or so per heading) illustrate why he finds each of these unlikely. His estimated p(doom) is >90%.

Their framings seem linked to their conclusions. The conjunctive model includes success as a baseline; it asks what all needs to happen before there's a possibility of doom. The disjunctive model asks what needs to go well to avoid doom as a default once we have better-than-human AI.

Estimating the likelihood of each component hypothesis is itself quite complex. Each paper goes into that logic but naturally does not provide further formal structure for making those estimates. Some combination of complex causal models and loose estimates is necessary to integrate evidence for each hypothesis. The looser those estimates are, the more susceptible they are to MR and confirmation bias.

I can get the two frameworks to converge and agree with my overall estimate of risk, but it requires work. If I weren't explicitly aiming at convergence, accepting each framing would push my estimate heavily toward either end of the spectrum.

Looking for empirical evidence of framing effects didn't turn up anything close enough to be worth using as an empirical estimate. Here I think taking a guess is better than generalizing from empirical studies that aren't really in the ballpark of the complex belief formation we're trying to understand.

I don't think Carlsmith or Soares, or thinkers like them, are tied to framings like these. Novices just starting to consider these questions might have their first conclusions strongly biased by the framing they've chosen, but anyone who reads a few counterarguments and takes them seriously can at least try on alternate framings. Therefore, I think the question of bias from framing in expert thought revolves around how often and smoothly we switch framings to consider the question from different angles. If we do this well, we apply arguments and evidence as they were intended. If we don't, we risk discarding arguments because they seem irrelevant or foolish within our own framing, even though they are valid and useful when interpreted in the framing someone else is using.

Choice of framings is crucial and a valid subject of analysis. The mere existence of alternate framings doesn't demand we take them seriously. But without the ability and habit of trying to take them seriously, we're at risk of dismissing them when we shouldn't. When we do that, we'll overestimate our certainty by mis-applying some evidence and arguments.

I think this is both an example of the power of choosing framings, and of the complexity of the problem relative to our ability to think and communicate about it. The communication side provides another level at which confirmation bias can compound.

4.2 Social compounding of confirmation bias effects

Confirmation bias can compound across like minds. I won't belabor this, because it is well-known. We speak commonly of echo chambers, and hopefully take steps to avoid them. But it's difficult to avoid social network effects, even if you're deliberately looking at information from people you disagree with. See Escape the Echo Chamber for a rationalist-adjacent treatment.

Even when we make real efforts to avoid echo chamber effects by attending to a diversity of opinions and evidence, there are subtle and difficult-to-correct sources of reverberatory confirmation bias. We should include experts' opinions in our all-things-considered beliefs. And we should rate recommendations from experts higher than others. But our estimate of how relevant and extensive their expertise is, is itself biased. This creates a feedback effect and a second level of confirmation bias.

Confirmation bias in attributing expertise and trustworthiness creates another source of bias on each of the other effects we've looked at. I will tend to prefer evidence and arguments presented by those I respect more. Recalling an expert and then their arguments is another entry point for bias in memory. Thus, between-minds sources of confirmation bias would seem to work in sequence with the others, and therefore be roughly multiplicative with them.

To a first very rough approximation, we might expect the inter-social effects to be separate but similar in size to internal causes of confirmation bias. Social influence exerts a second set of motivations, and thus bias. Social influence might also evoke distinct priors by foregrounding the beliefs of respected experts. If I had to guess, prior to looking at the evidence, I'd guess that additional confirmation bias would be exerted at each step to a similar but somewhat smaller degree than the primary effects, since the motivational effects of respect and group affiliation are strong, but secondhand adoption of priors is probably a smaller factor than one's own priors.

The evidence I've found since hasn't disconfirmed that very rough guess. But the evidence is limited, and I haven't done a thorough reading of the relevant literatures, so it remains a guess.

4.2.1 Social effects on evaluating evidence.

Favoring evidence from a source you like or respect is one form of The Halo Effect. Byrnes' Valence series (also referenced in the intro) gives an intuitive and compelling description of how our value or quality estimates spread between people and ideas.

The social or halo effects on evaluating evidence are empirically of similar magnitude to those from internal confirmation bias. One meta-analysis (Ou & Ho 2024) estimated effects of general source "credibility" on evaluation of evidence across a collection of studies. They found 6.5% of variance explained () overall, but only about 3% from expertise. An earlier meta-analysis over mostly different studies found 4.5% of variance explained across categories, (.045 ) but 16% from expertise (Wilson & Sherrell 1993). The different sample of studies is probably the cause of those very different estimates. This highlights the wide variability across particular methods, and the difficulty of guessing how effects generalize to real-world situations.

Survey results using real-world sources and information/evidence show stronger correlations. The studies aggregated in Ou & Ho show larger correlations, with 25% of variance in participants' ratings of evidence quality explained by their rating of the source. But this is partly a product of non-social preferences. People like people who agree with them, and agreeing people tend to present agreeing evidence. Thus, this correlation includes the individual confirmation bias in evaluation of evidence effect, as well as the social effect. The large correlation seems to indicate an additional effect of social bias. It also suggests a large total effect from internal and social confirmation bias.

However, those effect sizes aren't really what we'd want. The ideal study would be run on the people and issues we care most about. Even taking a guess at how the studies generalize to particular groups and issues would require characterizing the studies in those meta-analyses in much more detail. Their methods vary, and their effect sizes are not well-captured by the statistical aggregation. Adequately characterizing them would require reading a sufficiently large sample of those studies to make a better estimate, and I haven't spent the time to do that.

At a guess from reading just a few of the component papers, I'd put those effects at something like 10% or so. That's similar to the estimate I got for the effects of confirmation bias on evaluation of evidence, after doing much more reading. Of course effects will be highly dependent on the particular situation, and how hard the individual has tried to avoid this effect. (I suspect avoiding social bias in evaluating evidence is harder and less common than avoiding internal bias).

4.2.2 Social effects on selecting evidence, memory, and framing

The social effects of biases are outside of my former area of expertise. After spending some days on the social effects on evaluating and selecting evidence, I cut myself off from trying to read enough to make even rough estimates of the remaining effect sizes.

Based on the searching and reading I did, the literature on social/reputational effects on selecting evidence seems surprisingly thin. It seems likely that people select evidence or arguments recommended by people they respect, but I haven't been able to find good studies without major confounds. There are good studies on Facebook connections and clickthrough rates, but those are heavily confounded. Clicking a link could be driven by wanting to talk to that friend about the source they recommended, or by treating their recommendation as informative. Most studies of evidence selection that avoid that confound don't have a measure of how much the subject actually likes/respects the recommender, just a weak inducement like "Dr. Johnson is an expert in this field." This manipulation probably doesn't evoke the level of respect we feel for leaders in our own fields and communities.

Algorithms have effects that parallel those of our actual social influences. Algorithms on many platforms show us information from those who share our views, unless we work very diligently to prevent this. But I'm not trying to account for algorithmic effects here. They play less role in science than politics. And accounting for them would open up a whole new research project.

Without digging deeper into the relevant literatures (if they indeed exist!), I'll guess that confirmation biases from social/reputational causes are similar in size to the internal effects discussed in §2. Social factors create a second source of both motivation and priors, the two main causes of the internal confirmation bias effects. I will tend to assume people I respect are good judges of which evidence is worth looking at (selection), and its worth (evaluation), and their presentations will guide my memory. And when I take in evidence through their restatement, I will partially adopt their beliefs and framings.

Of course that logic is too vague to make precise estimates, but rough Fermi estimates are a start. We could try to refine that very broad "double each one", but it's probably not worth the trouble since we're already in Fermi estimate territory. (My first cut suggests as many upward as downward shifts: social effects on selection could be larger, because they're putting that evidence or argument right in front of you; evaluation could be smaller since you don't entirely share their beliefs; and memory effects could be larger since thinking about individuals' arguments is a useful cue for episodic memory. Based on that, I'm sticking with "roughly equal to individual confirmation bias").

Let's briefly review, since we're re-using those estimates. Internal confirmation bias effects were modest, at 0-40%, but 8-16% most often; §2.1. They were very large on selection of evidence (1.9 times more congruent than incongruent sources from one meta-analysis); §2.2. They were moderate (~10%) to zero for memory for evidence, and even reversing in some cases (§2.3). However, memory can also be biased toward irritating counterarguments, leading to strawmanning the other side. Thus, I'm keeping the 10% memory bias and think it could be an underestimate for the functional role. Framing of hypotheses and arguments seems like it could have large or very large effects, but I found no empirical evidence adequate for even loose numerical estimates, so that remains a wild guess; §4.1.

Thus, at a very (very) rough estimate, we have two sets of each effect, one from our own bias and one from the similar confirmation bias of those we've chosen to trust.

There's another route to making this guess: observational studies. This is equally rough, but it seems to agree on the order of magnitude with the estimates above.

Total effects of social and individual confirmation bias on beliefs observationally seem to be enormous in some cases. Consider the polarized US political climate and its effects on factual beliefs. For example: in political near-neighbors, group-linked factual belief gaps can be enormous: PRRI found a 57-point Republican-Democrat gap on whether the 2020 election was stolen, and a 2024 Frontiers paper found roughly 40-point partisan gaps on whether warming is human-caused. This isn't just social network effects, but it's probably close to a sum of those and individual confirmation bias. Note here that my use of social effects includes the effects on evidence sources; a biased media source is considered a social factor. In this scenario, most people aren't very engaged, let alone expert. But the questions of fact are much less complex than the hard questions of alignment and AI impact predictions.

4.2.3 Interlude: don't give up on seeking truth

Biases abound! I've just piled on a duplicate of each source of bias. It's tempting to either shrug this whole thing off, or approximate it as "bias swamps evidence." I don't think either is useful.

My conclusion isn't one of epistemic despair or nihilism: all of these sources of bias can be reduced with effort. Primate epistemology is hard but not impossible. The conclusion isn't to give up on knowing things, but to work to counteract biases where we can efficiently do that, and reduce our certainty, particularly in the face of "counter-consensus" groups with similar expertise.

4.2.4 Social belief contagion or information cascade effects

There's a separate social source of confirmation bias beyond the amplification effects: epistemic modesty, or treating others' beliefs as evidence. This creates a problem of "double-counting." If I update my beliefs on those of expert A whom I respect, and then someone else updates their beliefs from my stated beliefs and A's, they have double-counted A's beliefs. Understanding information cascades succinctly describes how this works, if the above isn't adequate.

This can go far beyond double-counting, when we're dealing with whole communities, so it's another potent source of compounding confirmation bias effects. This problem receives less attention than echo chamber or epistemic bubble effects. I think it's a fairly severe problem for group epistemics.

In many situations, epistemic modesty seems quite rational. It's hard to argue we shouldn't weigh the beliefs of those with much more relevant expertise, time-on-task, or raw intelligence.^[7] If I know I have much less expertise and haven't thought as deeply about it as someone else I trust, I'll get better results if I simply use their opinion in place of my own. Later, when my expertise and time-on-task nears theirs, I might still give their beliefs some weight. I should assume they've seen evidence I have not, even if I trust my own judgment more.

So complete epistemic immodesty seems irrational. But epistemic modesty in our publicly stated beliefs leads to double-counting (actually many-counting).

Studies like How social influence can undermine the wisdom of crowd effect (Lorenz et al. 2011) experimentally show what mathematical simulations and intuition suggest: giving people access to others' guesses has a distortionary effect. It empirically makes average and individual estimates worse, and pulls individual estimates toward extremes. But the main effect I'm concerned with is more intuitive: an inflation of confidence through clustering. If others tend to agree with me, it seems like evidence that we're collectively fairly confident, and thus can be individually confident in our conclusions. But if we're basing our beliefs on each other's, we're agreeing more than our samples of the evidence and arguments would actually suggest.

This effect depends on who we hear from and pay attention to, more than the raw distribution of beliefs. So social network effects can play complex roles, particularly when filtered through online algorithms and self-selected online information sources. Estimating an effect size is quite difficult, and it would vary widely based on each individual's epistemic practices. My subjective impression from watching public discourse around alignment questions is that these effects are substantial in the overall discourse.

There's a partial solution to the "double-counting" problem, but few people seem to use it. Careful thinkers sometimes state both a "my own view" and "all things considered" estimate that gives some weight to others' opinions. This would largely avoid the double-counting source of group confirmation bias if we did it scrupulously. Of course, it's not possible to really switch off updating our beliefs based on those of others we respect; but we can make rough estimates of those effects and try to adjust.

I'd expect biased epistemic modesty to move beliefs toward more clustered distributions. I think this may have happened in the field of alignment, but that's worth a separate post.

I think this issue is probably pretty severe for group epistemics. When I look at histories of scientific disagreements, I see these effects and other social network and motivation effects. But of course I'm biased in that direction. Draw your own conclusions.

Despite thinking this effect is large and important, I haven't gone beyond the vague characterization as "extra clustering effects". I have not included belief-contagion effects in the numerical model below. I only started appreciating their potential importance late in writing this post, and I don’t feel qualified to even guess at the average effect. It would depend heavily on who you respect, where they sit in the belief-space around you, and how much your own stated beliefs already incorporate theirs. A better estimate would include this factor. This seems worth a separate post.

For now, I'll say: this effect is probably important and highly dependent on the topic and individual. For non-experts, this effect may be larger than the compounded effecs of the remaining sources of confirmation bias.

4.3 Very rough estimates of total compounded confirmation bias

I wavered on whether to include this section. Trying to put numbers to these claims is fraught. And doing so highlights just how large I think the effects of biases are. I worried that the reader might simply spit out the idea whole if confronted with numbers on this scale. But using numbers is an aid to thinking rigorously, even when those numbers are merely order-of-magnitude approximations. So that's the spirit in which I offer these numbers.

The large uncertainty in these numbers might make the empirical mind recoil. This seems important enough to do at least rough math. I'm unsure on the size of each bias, but I stand behind some version of compounding being likely. This makes small effects at each stage stack up to large or very large effects in total. You can reject my estimates and insert your own. And I welcome corrections or suggestions on how to model how biases compound.

In my model of compounding, the resultant bias effects are large. My point, again, is not that thinking clearly about complex problems is impossible. It is that understanding and counteracting our biases is necessary to do so. There are thinkers I respect as nearly completely unbiased. They appear to have exerted extraordinary effort and practice. I do not count myself among them, and I doubt you should either. Those thinkers are marked by high levels of hedging and uncertainty statements in complex domains, even when they are expert in those domains.

With that in mind, I can't stress too highly how uncertain I am about these estimates. My goal is to provide a reasonable range based on the empirical literature where I know it and it's helpful, and outright guesses elsewhere. You can replace my estimates and guesses with your own. The actual amount of bias will vary dramatically by situation and individual. I don't think it's realistic for anyone to estimate zero bias in any of these categories. It's possible to overcompensate, but I doubt anyone is actually doing this. And compensating exactly enough seems even more unrealistic.

How to read this table:

The bottom line is at the bottom of the table. It's expressed as how much this compounding of biases distorts a belief that would be 1:1 or 50% credence based on an unbiased evaluation of the evidence. For instance, the result in the second column is inflating an accurate 50% credence to 69% after effects of all biases.

Biases are expressed as Bayes factors. These are usually used as a compact way to express the effect of new evidence in a Bayesian update between two hypotheses. Biases can be expressed in this form as an inflation of real evidence.

Where available, this amount is estimated from the empirical work I've reviewed above; for instance, 12% is my estimate of the median value in studies of bias on evaluating evidence (§2.1). This translates to a 1.12 Bayes factor, under the assumption that 12% more estimated quality or importance for congruent evidence tilts the balance by that much. These are little better than order-of-magnitude Fermi estimates. More on each is contained in the collapsible box below the table.

I've included an adjustment for imperfect correlation of biases. Most but not all of the biases in each step will "push" in the same direction; motivation need not align with confirmation, for instance. I think a .7 correlation is a low estimate.

You can copy the spreadsheet this came from and tinker with it. More on why I chose these values in the collapsible section below.

Stage	Very careful debiaser	Careful evidence selection	Typical thinker	Motivated, echo chamber
Choosing framings	1.05	1.25	1.25	1.5
Selecting evidence	1.1	1.1	1.9	2.5
Evaluating evidence	1.06	1.12	1.12	1.4
Remembering evidence	1.02	1.1	1.1	1.2
Social: framings	1.05	1.25	1.25	1.5
Social: selection	1.1	1.1	2	4
Social: evaluation	1.06	1.12	1.12	1.4
Social: memory	1.02	1.1	1.1	1.2
Total Bayes factor	1.55	2.86	9.01	63.50
Correlation among factors (guess)	0.7	0.7	0.7	0.7
Correlation-adjusted	1.39	2.30	6.60	44.70
Optimal p	0.5	0.5	0.5	0.5
Biased p	0.58	0.69	0.86	0.97

No individual will fall exactly on any of these categories. The second column is my caricature of an average scientist, someone who's careful to look at all of the evidence, but attached to their preferred framings and not very attentive to motivated reasoning. The third column models the average study participant; and the last column models someone who doesn't put any effort toward good epistemology. The very careful debiaser modeled in the first column is a status I aspire to but don't claim. I count few in the field who seem that careful, but they exist on both sides of the aisle.^[8]

There are many more caveats and qualifiers. One major question is the role of "selection of evidence" among actual experts. Experts are typically at least familiar with all of the major types of evidence and arguments available on open questions in their field. For them, selection of evidence/arguments is more like selection of which to take seriously and think about deeply. I think selection of evidence thus still plays a major role in determining expert beliefs on open questions, but I'm unsure and would like better models and data.

Another major question is whether effects of memory should be treated as compounding with selection of evidence. When you're looking at evidence, memory isn't a factor. But we're frequently running arguments and counterarguments in our heads, and here memory becomes critical. So I suspect memory bias plays a major role, and I include it as a compounding factor. But the numerical value is vastly underdetermined from the studies I've read.

Logic and evidence for each bias level

Columns/personality types: wild guesses on how biases might be expressed and controlled differently by different people. Each individual would be different. The critical question is probably: how well do you personally compensate for bias from each source?
Selection: 1.92 ratio of congruent to incongruent evidence over studies, from Hart et al. 2009 in §2.2; careful thinkers may force themselves to read roughly equal evidence from all sides
Evaluation 1.12 as a median of the 8-16% average in §2.1, 1.4 high end from Taber & Lodge for strong-belief experts (30–40%).
Memory 1.10: 10% seems like a low estimate if we included bias for bad/irritating incongruent arguments; §2.3
Framing 1.25 / 1.50: pure wild guess! Empirical studies don't give estimates. Substitute your own wild guess. This seems potentially quite large, but careful thinkers usually adopt multiple framings at least occasionally.
Careful debiaser column: ~1/3 of typical effects, a guess at the rough magnitude of real and effortful debiasing.
Social columns: "roughly doubles each layer" very rough estimate, based on the logic that these are separate sources of motivation and priors in each area. Memory is more debatable; I'm including it because memory for arguments is often mediated by memory of public discussions and therefore social influences.
Correlation adjustment: motivation doesn't always push the same direction as confirmation bias, but it usually does. Confirmation bias usually pushes in the same direction on every step, but some intermediate steps might be taken with somewhat different beliefs in mind. .7 seems like a very conservative estimate of how well these would all correlate. Multiplying the total Bayes ratio by this factor is another rough but close-enough approximation.

5. Implications and remediations

I experienced one interesting shift when I started thinking that biases and cognitive limitations were central factors in disagreement: I liked people more. Whether you think the people building AI are reckless or the people forecasting certain doom are hysterical, understanding them as biased and fallible seems more charitable and more accurate than assuming either incompetence or malice.

From this perspective, disagreement often persists not because people are stupid or dishonest, but because emotional barriers make certain conclusions hard to reach. Reducing those barriers may do more than adding more evidence.

The less pleasant shift I experienced was watching many of my beliefs weaken or evaporate under my own skepticism.

The uncomfortable implication isn't that some particular group is wrong. It's that everyone's confidence is probably too high, on most things, most of the time. Motivated reasoning pushes different people in different directions depending on what's emotionally at stake for them: their career investments, their community identity, their fears about the future, and, particularly, the opinions of people they respect (see [Valence series] 4: Liking / Admiring).

Strongly valuing the truth over convenience or social reward creates some resistance to confirmation bias, but it does not confer immunity.

One implication of communicating clearly about our uncertainty is to avoid point estimates and unqualified statements of belief on important topics. Careful thinkers do frequently provide some estimate of their confidence or a means to estimate it ("I've thought about this a little/lot" or sometimes "10-90%" to express large model uncertainty). Expressing a probability estimate as a range seems like a compact way to include model uncertainty.

Uncertainty interval statements often mix model uncertainty and estimated inherent uncertainty; for instance, "2-4 years" might mean either that you've done an incredibly thorough job modeling all of the causal factors, so you're highly confident that a better prediction would be very difficult, or "2-4 years" might mean that you're taking a wild guess at a highly knowable quantity. Clarifying is useful; sounding like we're certain when we're not makes the double-counting problem worse, as well as derailing discussions toward claims we didn't intend to make.

Uncertainty intervals are often dropped when thinking about or repeating claims; for instance, Daniel Kokotajlo's predicted timeline to automated coding isn't simply "mid 2028" (currently), even though it's often restated that way; it's a distribution. Saying "10-30% chance" or "1-4 years" conveys uncertainty more memorably than "maybe 20%" or "maybe two years". See Ord's Broad Timelines for more on the importance of including uncertainties for timelines. In addition to the points he makes, I worry that motivated reasoning is subtly turning our attention away from the short end of predicted timeline distributions.

5.1 Standard remediations

This piece is primarily about recognizing a problem. But I'll offer at least some thoughts about what we might do about that problem. These thoughts are speculative.

It would be useful to know exactly how much of our confirmation bias is caused by each source we've discussed. But this isn't necessary to start compensating.

Strategies for overcoming confirmation bias are well-known. But employing them takes time and practice. There will always be tradeoffs in how much time we spend debiasing ourselves versus becoming more expert in our chosen fields of study and thinking about problems on the object level.

We know that taking in a variety of evidence and arguments is good practice for arriving at true beliefs. Much of the effect of biases is from choosing what to read, whom to talk to, and which objections you take seriously. There's no formula for deciding which of these deserve your time, but efforts to avoid bias in our choices seem useful. Forming warm relationships with people we disagree with is difficult, but rewarding on both epistemic and personal levels to whatever extent we do it.

Adopting a "Scout Mindset" is taking an attitude of curiosity and trying to learn instead of the "soldier mindset" of trying to convince others your current beliefs are right. This seems likely to help counteract your confirmation bias. But it doesn't seem likely to create a full solution. It might reduce your desire to be right and therefore motivated reasoning effects, but it won't eliminate them. Instilling it as a cognitive habit seems like a worthwhile project.

Steelmanning is another known technique that should counteract confirmation bias, to the extent we put time and effort into it. Trying to construct the best argument we can for a position we don't hold can harness some of our biases to work against others. Trying to thoroughly inhabit that set of beliefs could even compensate for some of the effects driven by different priors. And imagining how someone would react emotionally could create empathetic emotions and counter your own motivated reasoning.

5.2 Remediations for motivated reasoning

The main thing I have to add is that it's important to be aware of how you and others feel about the discussion and the arguments.

To the extent that motivated reasoning is a strong effect, group epistemology will be improved by attention to feelings and motivations. Changing minds doesn't happen as much by bludgeoning people with evidence as it does by making it feel safe for them to change their minds. Leave a Line of Retreat addresses this on a personal level; adapting this to public dialogue seems important and underexplored.

I sometimes notice an aversion to engaging with some arguments. Often I can track that to my feelings about the people advocating that position, or to how I'd feel if those arguments were strong and forced a major change in my beliefs. Doing all of this tracking of feelings can be a lot of extra work. I think it pays off by helping me notice where I'm prone to pass over uncomfortable arguments, but it does take time and developing the habit. Of course I don't know how often I'm catching important biases.

I can't make a strong claim that any of these will be worth your time and effort, but they do seem worth considering. I have become a lot less confident on complex questions, and hopefully my beliefs have become better-reasoned in areas where I've spent time considering my biases.

Another major factor, and possible point of intervention, is watching how you decide you've thought about something enough. Yudkowsky's discussion of motivated stopping and motivated continuation addresses this nicely. Stopping when you're comfortable means you do all the reasoning locally correctly, but still reach the conclusion you're comfortable with. I suspect that subconscious motivation to stop while we're liking the conclusion is a major factor in motivated reasoning on complex topics. And after we've set aside the topic for a while, we won't be able to remember all of the pieces of logic as clearly as we remember the conclusion.

Some other particularly relevant LW posts are annotated in this footnote.^[9]

Actually enjoying being wrong as a means of becoming less wrong should help. To the extent we can do this, it will turn motivated reasoning from a source of confirmation bias to a force that counteracts other sources of bias. The research on accuracy incentives suggests that when people are motivated toward accuracy rather than identity-defense, their reasoning improves.

Valuing changing our minds can be done at a community level, too. Social rewards are real, as evidenced by dopamine release and clear behavioral effects.

One interesting corollary of thinking that emotions might heavily sway reasoning is a candidate principle of rational discourse: be nice. Being nice in this sense doesn't mean saying you agree when you don't; it means trying hard not to irritate people, since that will bias them against the ideas you're arguing for. From this perspective, norms of politeness aren't just for comfort or community-building; meticulous manners and generosity are load-bearing for rationality.

For what it's worth, here's a summary of the above:

Notice your feelings
Particularly when engaging feels uncomfortable
Try to enjoy being wrong as a predictor of becoming smarter
State probabilities as ranges
Try to note where you're weighting others' beliefs
Be nice; don't motivate others against your arguments

These thoughts on remediations are speculative. Draw your own conclusions, and share them.

Conclusion

This post has grown beyond its initial focus on motivated reasoning to the broader question of how human brains handle ultra-complex problems. It remains incomplete in places, and I welcome corrections and expansions.

You can quarrel with the estimates of individual bias effects. I hope you do, carefully; those estimates are highly uncertain and could use improvement. The claim I stand by is that the effects of bias compound in complex reasoning.

AI risk is a complex problem, and we're trying to tackle it armed with brains built for survival. Correcting for our limitations and biases will help us make better collective decisions about AI.

^{^}
By locally rational, I mean behaving in a way that's optimal for discerning the truth given what one currently knows on an object level, but not optimal given what one could guess from knowing others' beliefs. Strong belief in God might be locally rational if you've heard a lot of arguments for and few against, but not globally rational if you know there are a bunch of atheists in other towns.
^{^}
Section 1.1 and the opening three paragraphs of §1.0 are adapted and expanded from a 2024 short answer on motivated reasoning.
^{^}
Neural mechanisms of human decision-making, (Herd et al. 2020) and A systems-neuroscience model of phasic dopamine (Mollick et al. 2020) provide overviews of and references to the empirical literature on dopamine function and the surrounding neurobiology of complex decision-making.
^{^}
This post is a mild infohazard. Reading it risks making you underconfident in your beliefs. I recommend EY's Status Regulation and Anxious Underconfidence from Inadequate Equilibria, particularly if you are habitually modest and at risk of underconfidence. On the other hand, not reading it risks leaving you overconfident, and unaware of one correctable source of bias. There's probably a lot of individual variation; I'd guess humans as a whole trend pretty strongly toward overconfidence, since we don't know what we don't know, and leaving that out overestimates what we do know.
^{^}
It occurred to me only long after reading Kahneman & Klein's "failure to disagree" that this that it might actually be an example of how being bias-aware creates better collaborations across disparate scientific camps and viewpoints. Such work is rare, so it's tempting to interpret it this way. But that may be motivated reasoning on my part.
^{^}
See also Defeating Ugh Fields In Practice for an interesting and useful review. Staring into the abyss as a core life skill seems to very much be about why and how to overcome motivated reasoning. The author learned to value the idea of being wrong about important beliefs, by seeing a few people accomplish extraordinary things as a result of questioning their central beliefs and changing their minds.
^{^}
Note that I'm arguing for epistemic modesty toward those with more time on the question at hand. Practice seems more important than raw intelligence wherever practice is possible. Intelligence, measured as IQ or g factor, is real and important, but it is roughly a multiplicative factor on practice.
So in deciding whether to weight someone's opinion, a simple metric would be "how much time do I think this person has spent learning about this topic?" This is difficult to judge, since some parts of their background expertise will be more relevant than others, and some time on task will be relatively useless if it's misdirected, so this is another free parameter where judgment and bias come into play.
Expertise in the form of knowing all the arguments counts even for novel problems like alignment, so I'd still trust time on task over raw intelligence. But alignment and AI predictions aren't like most fields where practice makes perfect, or at least less often wrong. The important questions have no real feedback mechanisms, since the important predictions and arguably the most important alignment questions address entirely new events with no close precedents.
^{^}
There are careful thinkers with careful epistemics on both sides of the optimist/pessimist divide in alignment and AI risk. However, they usually don't fall into the far extremes, since they maintain a lot of model uncertainty.
^{^}
LessWrong has much on confirmation bias, but less on motivated reasoning.

Annotated bibliography of articles related to confirmation bias and motivated reasoning:
Separating Prediction from Goal-Seeking "tl;dr: Mixing goal-directedness into cognitive processes that are working to truth-seek about possible futures tends to undermine both truth-seeking and effective pursuit of your goals." It's difficult but desirable to separate them.
Irrationality is Socially Strategic Valentine, recent. Doesn't use MR terminology but describes why we'd expect this.
Ideological Bayesians What you notice or what questions you ask can produce dramatically different results even with perfect Bayesian updating
Ethnic Tension And Meaningless Arguments About the horns/halo effect, another statement of valence. Great writing. SSC Alexander
Comment on "Endogenous Epistemic Factionalization" If you're Bayesian but somewhat distrust evidence given by those who disagree with you, factions emerge spontaneously.
Trapped Priors As A Basic Problem Of Rationality Scott Alexander. Principal example: fear of dogs does not disappear even when dogs never bite. He's stating this as uncertain and a new theory. This effect clearly happens in phobias, and may happen to a lesser degree in encounters with opposed beliefs.
Heads I Win, Tails?—Never Heard of Her; Or, Selective Reporting and the Tragedy of the Green Rationalists Selective reporting and correcting for it. Ruby comments: what if you're filtering your own evidence?
Motivated Stopping and Motivated Continuation From the against rationalization sequence. This is about as close as the sequences come to addressing motivated reasoning directly.
Escape the Echo Chamber (2018) "And, in many ways, echo-chamber members are following reasonable and rational procedures of enquiry."
"Other people are wrong" vs "I am right" The post is more in-depth, but the central point seems very relevant. It's a lot easier to note that other people are definitely wrong on many topics than to know that you're right in complex domains
Politics is the Mind-Killer Classic warning against political examples that references the strength of motivated reasoning effects but doesn't try to explain them. I'm worried that alignment difficulty and AI risk are also becoming mindkillers.
Understanding information cascades Relevant to the tribal view of alignment. The Information Cascades wikitag has more. An information cascade occurs when people update on other people's beliefs. This is locally rational but may still result in a self-reinforcing wrong community belief.
The Limits of Intelligence and Me: Domain Expertise Argument that domain expertise with modest intelligence generally wins over brilliance. Short but the end was most valuable to me.
Epistemic Luck A social-path-dependence gut punch: who you learn from is a big causal driver of your beliefs. Accepting that you might've had bad epistemic luck is the obvious conclusion.
Update Yourself Incrementally Why one counterexample shouldn’t flip you, and how people abuse that fact to immunize pet theories.
^{^}
My thinking about intuition vs. system 2 analysis was catalyzed partly by Malcolm Gladwell popularizing the topic in Blink, but he skipped merrily to the next topic without establishing when intuition is awesome and when it's totally misleading.

[-]Jeremy Gillen3h82

I experienced one interesting shift when I started thinking that biases and cognitive limitations were central factors in disagreement: I liked people more.

Thanks for reminding me of this.

It's that everyone's confidence is probably too high, on most things, most of the time.

I think this is true, but simultaneously I'm always annoyed at how confidently people dismiss the possibility of confidence. Even though the opposite is more common, there are a bunch of places where it's possible to be reasonably confident despite strong disagreement between well-meaning experts.

[-]Seth Herd3h40

I agree that high confidence is realistic and justified in many situations. Maybe I didn't emphasize that strongly enough. I think people are often justified in their confidence on narrowly scoped questions. But that tends to bleed out toward overconfidence on the larger, more important questions to which scoped questions contribute.

Strong disagreement between experts sounds like exactly the situation I'm talking about, where you're probably overconfident. Perhaps you're thinking of cases where one side of the disagreement is making arguments that are pretty obviously bad or irrelevant, so the other side's confidence is justified?

I think this happens, but I think people claim it a lot more than it's true. It's pretty much exactly what I'm talking about. I'm not saying it's impossible, and I don't think it can be dismissed on an outside view. Experts are fallible and conflict creates tons of confusion and motivated reasoning, leading to very bad arguments.

I am saying you should be quite suspicious and expect others' to be as well when you're making an argument that a bunch of equally expert people are just being foolish, while you've got the whole picture properly in your model.

[-]Jeremy Gillen2h42

Perhaps you're thinking of cases where one side of the disagreement is making arguments that are pretty obviously bad or irrelevant, so the other side's confidence is justified?

Yeah, cases where you actually deeply understand the arguments made on all sides and see the flaws (even if they're non-obvious flaws).

I agree that this is rare, and almost all of the time when people think they're in this situation they are not. But at the same time it's often worth trying to be in this situation rather than giving up and sticking with uncertainty. It's possible to succeed and it's possible at the meta-level to be reasonably confident you've succeeded. (It's rare, and one should be very suspicious every time this happens, but not infinitely suspicious).

bunch of equally expert people are just being foolish

I agree that if this should invoke even more suspicion, but the central cases I'm thinking about only involve non-foolish mistakes by experts.

I think we're fully in agreement, we both think one should be quite suspicious of oneself if you're more confident than experts on a controversial question. And I agree that this is the main thing to emphasize. I just think it's important to nitpick that this isn't a fully general argument, the amount of suspicion is finite and can occasionally be overcome by object-level considerations.

[-]Vladimir_Nesov5hΩ351

The practical method is pluralistic understanding, maintaining multiple pictures/models/framings/worldviews around contentious topics at the same time, even when they are wildly in conflict. This should involve taking them seriously enough to at least give them authority to develop further, to seek out more understanding relevant to them, even (or especially) for the framings that are not currently accepted as decision-relevant, that don't shape beliefs or values.

This relates to how epistemic luck/misfortune is path dependence, and path dependence is defeated by aggregation across as many legitimate paths as feasible. The danger is in including paths that are not legitimate, taken over by various forms of memetic corruption, as judged by (an idealized extrapolation of) some founding values. But this is more a danger of aggregation into decision relevance, or into goal content, than a danger of developing understanding of additional possibilities. And some worldviews are hard to accurately judge until they are sufficiently developed in your own mind.

[-]Seth Herd5h30

Yes, I think that's right. I think this addresses bias in framing, and hopefully if you can really inhabit multiple framings it will at least help mitigate biases in other steps. I do think other mitigations are probably useful in addition to trying to take seriously multiple worldviews, because you'll still be having negative or troubled emotional reactions to worldviews that aren't really your preferred one.

I definitely agree that not all framings are legitimate and worth tracking. But you make an excellent point that it's hard to judge until you've sufficiently understood it. This complicates deciding how much time to spend on framings/worldviews that don't seem valid or useful to you initially.

32

Motivated reasoning, confirmation bias, and AI risk theory

32

Ω 13

1.1 Motivated reasoning^[2]

2. Empirical evidence for confirmation bias

2.1 Bias in evaluating evidence

2.2 Bias in selecting evidence

2.3 Bias in remembering evidence

2.4 Other causal explanations of confirmation bias effects

2.5 Empirical evidence for motivated reasoning

3. Limitations in human cognitive capacity for very complex problems

3.1 Introspection suggests fuzzy models and updating

3.2 Intuition vs. analysis - evidence and brain mechanisms

3.3 Bayesian reasoning is an ideal, not a method

3.4 AI risk is complicated

4. Compounding of confirmation bias

4.1 Example of frame/hypothesis choices and confident disagreement among experts

4.2 Social compounding of confirmation bias effects

4.2.1 Social effects on evaluating evidence.

4.2.2 Social effects on selecting evidence, memory, and framing

4.2.3 Interlude: don't give up on seeking truth

4.2.4 Social belief contagion or information cascade effects

4.3 Very rough estimates of total compounded confirmation bias

5. Implications and remediations

5.1 Standard remediations

5.2 Remediations for motivated reasoning

Conclusion

32

Ω 13

32

Ω 13

32

Motivated reasoning, confirmation bias, and AI risk theory

32

Ω 13

1.1 Motivated reasoning[2]

2. Empirical evidence for confirmation bias

2.1 Bias in evaluating evidence

2.2 Bias in selecting evidence

2.3 Bias in remembering evidence

2.4 Other causal explanations of confirmation bias effects

2.5 Empirical evidence for motivated reasoning

3. Limitations in human cognitive capacity for very complex problems

3.1 Introspection suggests fuzzy models and updating

3.2 Intuition vs. analysis - evidence and brain mechanisms

3.3 Bayesian reasoning is an ideal, not a method

3.4 AI risk is complicated

4. Compounding of confirmation bias

4.1 Example of frame/hypothesis choices and confident disagreement among experts

4.2 Social compounding of confirmation bias effects

4.2.1 Social effects on evaluating evidence.

4.2.2 Social effects on selecting evidence, memory, and framing

4.2.3 Interlude: don't give up on seeking truth

4.2.4 Social belief contagion or information cascade effects

4.3 Very rough estimates of total compounded confirmation bias

5. Implications and remediations

5.1 Standard remediations

5.2 Remediations for motivated reasoning

Conclusion

32

Ω 13

32

Ω 13

1.1 Motivated reasoning^[2]