I feel like you omit the possibility that the trait of motivated reasoning is like the “trait” of not-flying. You don’t need an explanation for why humans have the trait of not-flying, because not-flying is the default. Why didn’t this “trait” evolve away? Because there aren’t really any feasible genomic changes that would “get rid” of not-flying (i.e. that would make humans fly), at least not without causing other issues.
RE “evolutionarily-recent”: I guess your belief is that “lots of other mammals engaging in motivated reasoning” is not the world we live in. But is that right? I don’t see any evidence either way. How could one tell whether, say, a dog or a mouse ever engages in motivated reasoning?
My own theory (see [Valence series] 3. Valence & Beliefs) is that planning and cognition (in humans and other mammals) works by an algorithm that is generally very effective, and has gotten us very far, but which has motivated reasoning as a natural and unavoidable failure mode. Basically, the algorithm is built so as to systematically search for thoughts that seem good rather than bad. If some possibility is unpleasant, then the algorithm will naturally discover the strategy of “just don’t think about the unpleasant possibility”. That’s just what the algorithm will naturally do. There isn’t any elegant way to avoid this problem, other than evolve an entirely different algorithm for practical intelligence / planning / etc., if indeed such an alternative algorithm even exists at all.
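To make the failure mode concrete, here's a minimal toy sketch (the candidate thoughts, the scoring, and the numbers are all made up for illustration; this is not the actual algorithm from the Valence series). The point is just that a search which maximizes how good a thought feels will happily land on "don't think about it":

```python
# Toy sketch: a planner that picks whichever candidate thought currently feels best.
def select_thought(candidate_thoughts, valence):
    """Return the candidate thought with the highest estimated valence."""
    return max(candidate_thoughts, key=valence)

# Made-up valence estimates: facing the unpleasant possibility feels bad,
# so the search naturally lands on not thinking about it.
valence_estimates = {
    "face the unpleasant possibility and plan around it": -0.4,
    "just don't think about the unpleasant possibility": +0.3,
}

print(select_thought(valence_estimates, valence_estimates.get))
# -> "just don't think about the unpleasant possibility"
```

Nothing in the selection rule distinguishes "plan that is actually good" from "thought that merely feels good to think", which is the whole problem.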
Our brain has a hack-y workaround to mitigate this issue, namely the “involuntary attention” associated with anxiety, itches, etc., which constrain your thoughts so as to make you unable to put (particular types of) problems out of your mind. In parallel, culture has also developed some hack-y workarounds, like Reading The Sequences, or companies that have a red-teaming process. But none of these workarounds completely solves the issue, and/or they come along with their own bad side-effects.
Anyway, the key point is that motivated reasoning is a natural default that needs no particular explanation.
I agree with all of that. I want to chip in on the brain mechanisms and the practical implications because it's one of my favorite scientific questions. I worked on it as a focus question in computational cognitive neuroscience, because I thought it was important in a practical sense. I also think it's somewhat important for alignment work, because difficult-to-resolve questions are more subject to motivated reasoning as a tiebreaker; more on this at the end.
The mechanism is only important to the degree that it gives us clues about how MR affects important discussions and conclusions; I think it gives some. In particular, it's not limited to seeking social approval; something can "sound good" just to me, for highly idiosyncratic reasons. Countering MR requires thinking about what feels good to you, and working against that, which is swimming upstream in a pretty difficult way. Or you can counteract it by learning to really love being wrong; that's tough too.
So here's a shot at briefly describing the brain mechanisms. We use RL of some stripe to choose actions. This has been studied relatively thoroughly, so we're pretty clear on the broad outlines but not the details. That makes sense from an evolutionary perspective. That system seems to have been adapted for use in selecting "internal actions," which roughly select thoughts. Brain anatomy suggests this adaptation to selecting internal actions pretty strongly.
It's a lot tougher to judge which thoughts reliably lead to reward, so we make a lot of mistakes. I think that's what Steve means by searching for thoughts that seem good. That's what produces motivated reasoning. Sometimes it's useful; sometimes it's not.
There's some other interesting stuff about the way the critic/dopamine system works; I think it's allowed to use the full power of the system to predict rewards. And it's only grounded to reality when it's proven wrong, which doesn't happen all that often in complex domains like "should alignment be considered very hard?" Steve describes the biology of the reward-prediction system in [Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering. (This is a really good overview, in addition to tying it to alignment.) He goes into much more detail in the Valence sequence he linked above. There he doesn't mention the biology at all, but it matches my views on the function of the dopamine system in humans perfectly.
In sum, the brain gets to use as much of its intelligence as it wants (including long system-2 contemplation) to take a guess about how "good" (reward-predictive) each thought/concept/idea/plan/belief is. This can be proven wrong in two ways, both fairly rare, so on average there's a lot of bad speculation sticking around and causing motivated reasoning. On rare occasions you get direct and fast feedback (someone you respect telling you that's stupid when you voice that thought); on other rare occasions, you spend the time to work backward from rare feedback, and your valence estimates of all the plans and beliefs in that process don't prevent you from reaching the truth.
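As a toy illustration of why the bad estimates stick around (the numbers, learning rate, and feedback probability are all made up; this is not a claim about the actual dopamine circuitry), here is the dynamic in a few lines:

```python
import random

# Made-up starting "valence" (predicted reward) for two competing thoughts.
valence = {"my pet theory is right": 0.9, "maybe I'm wrong about this": -0.2}

def choose():
    # Always entertain the best-feeling thought.
    return max(valence, key=valence.get)

def ground_against_reality(thought, observed_reward, lr=0.2):
    # This correction only runs when explicit feedback actually arrives.
    valence[thought] += lr * (observed_reward - valence[thought])

for _ in range(200):
    thought = choose()
    if random.random() < 0.01:           # direct, fast feedback is rare
        ground_against_reality(thought, observed_reward=-0.5)

print(valence)
# The inflated estimate has usually only drifted down a little on the rare
# corrections, so it keeps winning the search almost every step.
```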
Of note, this explanation "people tend toward beliefs/reasoning that sounds good/predicts reward" is one of the oldest explanations. It's formulated in different ways. It is often formulated as "we do this because it's adaptive", which I agree is wrong, but some of the original formulations, predating the neuroscience, were essentially what Steve is saying: we choose thoughts that sound good, and we're often wrong about what's good.
From this perspective, identifying as a rationalist provides some resistance to motivated reasoning, but not immunity. Rationalist ideals provide a counter-pressure to the extent you actually feel good about discovering that you were wrong, and so seek out that possibility in your thought-search. But we shouldn't assume that identifying as a rationalist means being immune to motivated reasoning; the tendency to feel good when you can think you're right and others are wrong is pretty strong.
Sorry to give so much more than the OP asked for. My full post on this is perpetually stuck in draft form and never first priority. So I thought I'd spit out some of it here.
I wrote about the brain mechanisms and the close analogy between the basal ganglia circuits that choose motor actions based on dopamine reward signals, and those to the prefrontal cortex that seem to approximately select mental actions, in Neural mechanisms of human decision-making, but I can't highly recommend it. Co-authoring with mixed incentives is always a mess, and I felt conflicted about the brain-like AGI capability implications of describing things really clearly, so I didn't try hard to clear up that mess. But the general story and references to the known biology are there. Steve's work in the Valence sequence nicely extends that to explaining not only motivated reasoning, but how valence (which I take to be reward prediction in a fairly direct way) produces our effective thinking. To a large degree, reasoning accurately is a lucky side-effect of choosing actions that predict reward, even at a great separation. Motivated reasoning, even in harmful forms, is the downside: large in absolute terms, but relatively small next to that benefit.
I think motivated reasoning (often overlapping with confirmation bias) is the most important cognitive bias in practical terms, particularly when combined with the cognitive limitation that we just don't have time to think about everything carefully. As I mentioned, this seems very important as a problem in the world at large, and perhaps particularly for alignment research. People disagree about important but difficult to verify matters like ethics, politics, and alignment, and everyone truly believes they're right because they've spent a bunch of time reasoning about it. So they assume their opponents are either lying or haven't spent time thinking about it. So distrust and arguments abound.
This theory mostly does not actually contradict OP's, except for the assumption that the problem of motivated reasoning would eventually be evolved away.
We can believe that motivated reasoning is caused by a short-term planning algorithm pressuring a long-term planning algorithm into getting what it wants; and that getting rid of this mechanism would be very costly, so we shouldn't expect it to disappear any time soon, if at all. Both seem quite plausible to me, and do not preclude one another.
I think motivated reasoning is mostly bad, but there is some value to having some regularization towards consistency with past decisions. For example, there are often two almost-equally good choices you can make, but you need to commit to one instead of indefinitely waffling between the two, which is way worse than either option. Having some delusional confidence via motivated reasoning can help you commit to one of the options and see it through. I've personally found that my unwillingness to accept motivated reasoning also has the side effect that I spend a lot more time in decision paralysis.
Diffusion planning is entirely made from motivated reasoning and performs pretty well. This is, imo, a reasonable exemplar for a slightly broader belief I have, which is that the simplest hypothesis for motivated reasoning is that reasoning from percept-feature to outcome-feature (prediction), from outcome-feature to motor-feature (control), and from outcome-feature to percept-feature (wishful thinking) are not trivially distinguishable when you have something confusable with "big ol' diffusion model over whatever", and so avoiding motivated reasoning is hard in a messy system. There's enough pressure to avoid a lot of it, but given that reasoning from outcome-feature to motor-feature is a common need, going through correlational features that mix representations is totally allowed-by-substrate and thus common anywhere there isn't sufficient pressure against it.
of course, I'm being kind of sloppy in my claims here.
(This is a less-specific version of saying "it's just active inference", because the active inference math hasn't clicked for me yet, so I can't claim that it's exactly active inference; but it does seem like in general, planning-by-inference ought to be the default, as hinted by the fact that you can get it just by jiggling stuff around diffusion style.)
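To gesture at what I mean by "not trivially distinguishable", here's a toy sketch (a random joint distribution stands in for the learned model; nothing here is a real diffusion planner): prediction, control, and wishful thinking are all the same operation, namely slicing one joint distribution and renormalizing. Only the direction of conditioning differs.

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((3, 2, 4))      # indices: [percept, action, outcome]
joint /= joint.sum()               # normalize into a joint distribution

def prediction(percept, action):
    """P(outcome | percept, action): ordinary forward prediction."""
    p = joint[percept, action, :]
    return p / p.sum()

def control(percept, desired_outcome):
    """P(action | percept, outcome): planning-by-inference."""
    p = joint[percept, :, desired_outcome]
    return p / p.sum()

def wishful_thinking(desired_outcome):
    """P(percept | outcome): inferring the world from what you want."""
    p = joint[:, :, desired_outcome].sum(axis=1)
    return p / p.sum()

print(prediction(0, 1), control(0, 2), wishful_thinking(2))
```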
Once one learns to spot motivated reasoning in one's own head, the short term planner has a much harder problem. It's still looking for outputs-to-rest-of-brain which will result in e.g. playing more Civ, but now the rest of the brain is alert to the basic tricks. But the short term planner is still looking for outputs, and sometimes it stumbles on a clever trick: maybe motivated reasoning is (long-term) good, actually? And then the rest of the brain goes "hmm, ok, sus, but if true then yeah we can play more Civ" and the short term planner is like "okey dokey let's go find us an argument that motivated reasoning is (long-term) good actually!".
In short: "motivated reasoning is somehow secretly rational" is itself the ultimate claim about which one would motivatedly-reason. It's very much like the classic anti-inductive agent, which believes that things which have happened more often before are less likely to happen again: "but you've been wrong every time before!" "yes, exactly, that's why I'm obviously going to be right this time". Likewise, the agent which believes motivated reasoning is good actually: "but your argument for motivated reasoning sure seems pretty motivated in its own right" "yes, exactly, and motivated reasoning is good so that's sensible".
... which, to be clear, does not imply that all arguments in favor of motivated reasoning are terrible. This is meant to be somewhat tongue-in-cheek; there's a reason it's not in the post. But it's worth keeping an eye out for motivated arguments in favor of motivated reasoning, and discounting appropriately (which does not mean dismissing completely).
Your explanation about the short-term planner optimizing against the long-term planner seems to suggest we should only see motivated reasoning in cases where there is a short-term reward for it.
It seems to me that motivated reasoning also occurs in cases like gamblers thinking their next lottery ticket has positive expected value, or competitors overestimating their chances of winning a competition, where there doesn't appear to be a short-term benefit (unless the belief itself somehow counts as a benefit). Do you posit a different mechanism for these cases?
I've been thinking for a while that motivated reasoning sort of rhymes with reward hacking, and might arise any time you have a generator-part Goodharting an evaluator-part. Your short-term and long-term planners might be considered one example of this pattern?
I've also wondered if children covering their eyes when they get scared might be an example of the same sort of reward hacking (instead of eliminating the danger, they just eliminate the warning signal from the danger-detecting part of themselves by denying it input).
"But man, that motivated reasoning sure does not seem very socially-oriented?" I mean add in a internal family systems model of the brain and suddenly it make perfect sense for one part of your brain to recruit the logic center to convince the rest.
Epistemic status: kinda vibesy
My general hypothesis on this front is that the brain's planning modules are doing something like RL as inference, but that they're sometimes a bit sloppy about properly labelling which things are and aren't under their own control.
To elaborate: in RL as inference, you consider a "prior" over some number of input -> action -> outcome loops, and then perform a Bayesian-ish update towards outcomes which get high reward. But you have to constrain your update to only change P(action | input) values, while keeping the P(outcome | action) and P(input | outcome) values the same. In this case, the brain is sloppy about labelling and labels P(tired | stay up) as something it can influence.
This might happen because of some consistency mechanism which tries to mediate between different predictors. Perhaps if it gets one system saying "We will keep playing video games" and another saying "We mustn't be tired tomorrow" then the most reasonable update is that P(tired | stay up) is, in fact, influenceable.
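Here's a toy numerical version of the correct vs. sloppy update (made-up numbers and a soft exponential reweighting; not a faithful rendering of any particular RL-as-inference formulation). The correct update only changes the policy; the sloppy one lets the reweighting leak into the world model, which is exactly treating P(tired | stay up) as influenceable:

```python
import numpy as np

actions = ["stay_up", "sleep"]
outcomes = ["tired", "rested"]

policy = np.array([0.5, 0.5])              # P(action): undecided so far
world = np.array([[0.9, 0.1],              # P(outcome | stay_up): probably tired
                  [0.2, 0.8]])             # P(outcome | sleep): probably rested
reward = np.array([[-1.0, 1.0],            # stay_up: fun now, but tired is bad
                   [0.0, 0.5]])            # sleep: no fun, rested is mildly good

# "Update towards outcomes which get high reward": reweight trajectories softly.
joint = policy[:, None] * world * np.exp(reward)
joint /= joint.sum()

# Correct update: only the policy P(action) is allowed to change.
new_policy = joint.sum(axis=1)

# Sloppy update: the reweighting also reshapes the world model itself,
# i.e. P(tired | stay_up) gets treated as something we can influence.
sloppy_world = joint / joint.sum(axis=1, keepdims=True)

print("new policy:", dict(zip(actions, new_policy.round(2))))
print("P(outcome | stay_up) after sloppy update:",
      dict(zip(outcomes, sloppy_world[0].round(2))))
# The correct update shifts the policy towards sleeping; the sloppy update
# instead decides that staying up is much less likely to make us tired.
```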
Motivated reasoning is a misfire of a generally helpful heuristic: try and understand why what other people are telling you makes sense.
In a high trust setting, people are usually well-served by assuming that there’s a good reason for what they’re told, what they believe, and what they’re doing. Saying, “figure out an explanation for why your current plans make sense” is motivated reasoning, but it’s also a way to just remember what the heck you’re doing and to coordinate effectively with others by anticipating how they’ll behave.
The thing to explain, I think, is why we apply this heuristic in less-than-full-trust settings. My explanation for that is that this sense-making is still adaptive even in pretty low-trust settings. The best results you can get in a low-trust (or parasitic) setting are worse than you’d get in a higher-trust setting, but sense-making typically leads to better outcomes than not.
In particular, while it’s easy in retrospect to pick a specific action (playing Civ all night) and say “I shouldn’t have sense-made that,” it’s hard to figure out in a forward-looking way which settings or activities do or don’t deserve sense-making. We just do it across the board, unless life has made us into experts on how to calibrate our sense-making. This might look like having enough experience with a liar to disregard everything they’re saying, and perhaps even to sense-make “ah, they’re lying to me like THIS for THAT reason.”
In summary, motivated reasoning is just sense-making, which is almost always net adaptive. Specific products, people and organizations take advantage of this to exploit people’s sense-making in limited ways. If we focus on the individual misfires in retrospect, it looks maladaptive. But if you had to predict in advance whether or not to sense-make any given thing, you’d be hard-pressed to do better than you’re already doing, which probably involves sense-making quite a bit of stuff most of the time.
Another class of examples: very often in social situations, the move which will actually get one the most points is to admit fault and apologize. And yet, instead of that, people instinctively spin a story about how they didn't really do anything wrong.
As a nitpick (I find the other examples compelling): recruiting others to go along with obvious lies is a strong test and demonstration of status-power.
Motivated reasoning is a natural byproduct of any mind that tries to do anything to the outside world.
Consider an optimal temperature controller. It has thermometers and runs a Kalman filter to calculate the probability distribution of the temperature at each moment taking into account the model of the process and all the data available. What will be the expected value of the temperature?
The set point. Always, so long as the output isn't saturated. Because if the expected temperature were any lower than the setpoint it'd increase the heat until it isn't. If it were any higher, it would decrease the heat until it isn't. The temperature controller fundamentally works by expecting the temperature to be what it "wants", and then acting to maintain that expectation.
This is unavoidable, because if it's not acting to keep the expected value in line with the setpoint -- according to its own model/data -- then it isn't functioning as a control system with respect to that setpoint, and will be better described as optimizing for something else.
When you analyze the system from the third person and think "Hm, p(it will achieve its goal of controlling the temp to 70f at all times) is low", then that's your hint to redesign the controller to stop expecting "70f" to be the temperature in the next timestep. Instead, program it to expect something that won't fail catastrophically (e.g. "The temperature will rise to 70f as fast as it safely can without burning out heating elements").
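Here's a minimal toy version of that claim (a one-step certainty-equivalent controller with made-up plant gains, not a full Kalman-filter setup): whenever the required output is within actuator limits, the controller's own one-step-ahead expectation is exactly the setpoint.

```python
# Toy plant: T[k+1] = T[k] + a*u[k] - b*(T[k] - T_AMB)
SETPOINT, T_AMB = 70.0, 55.0
a, b, U_MAX = 0.5, 0.1, 40.0     # made-up plant gains and actuator limit

def control_and_predict(T):
    u_needed = (SETPOINT - T + b * (T - T_AMB)) / a   # heat that hits the setpoint in expectation
    u = min(max(u_needed, 0.0), U_MAX)                # saturate at actuator limits
    expected_next_T = T + a * u - b * (T - T_AMB)     # the controller's own prediction
    return u, expected_next_T

for T in (60.0, 68.0, 69.9, 75.0):
    u, pred = control_and_predict(T)
    print(f"T={T:5.1f}  u={u:5.1f}  expected next T={pred:5.2f}")
# Unsaturated cases predict exactly 70.0; only when the required output falls
# outside [0, U_MAX] does the prediction depart from the setpoint.
```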
To bring it back to the human, it's not a "quirk of human biology" that your sportsball coach talks about how "You have to believe you can win!". It comes directly from the fact that you can't try to win without expecting to win, and that people want to win.
That doesn't mean you can't use more fault-tolerant strategies like "score as many points as possible". The latter can be even more effective at winning, but that is a different plan and requires giving up on trying to win. Minds that can find these more fault-tolerant plans no longer need to believe they'll win, and therefore suffer fewer motivated-cognition failures; so if you want to fail less due to motivated cognition, that's the way to do it. You'll still be expecting what you want to come true, just hopefully in a more realizable way.
Apologizing requires being in trouble, and you're trying to not be in trouble. Do you want to be in trouble, and face the consequences? If that's not appealing to you, of course you're going to try to avoid it, and that involves expecting to not be in trouble. No wonder people come up with defensive justifications in those cases. When you want to face the music, because what you're drawn to is being a person of integrity, and "not being guilty" is something you recognize you cannot have, then you won't feel tempted.
The question that can transform the former into the latter is "Can I get what I want? Can I be not guilty and stay out of trouble?". When you sit with that and "No, I can't" sinks in, the temptation to rationalize melts away.
Of course, that can be tricky to sit with too, because there are often temptations to flinch away from the question. There are reasons for that, and understanding them opens up paths to making it easier there too, but this comment has gotten long enough.
At some point, a temperature control system needs to take actions to control the temperature. Choosing the correct action depends on responding to what the temperature actually is, not what you want it to be, or what you expect it to be after you take the (not-yet-determined) correct action.
If you are picking your action based on predictions, you need to make conditional predictions based on different actions you might take, so that you can pick the action whose conditional prediction is closer to the target. And this means your conditional predictions can't all be "it will be the target temperature", because that wouldn't let you differentiate good actions from bad actions.
It is possible to build an effective temperature control system that doesn't involve predictions at all; you can precompute a strategy (like "turn heater on below X temp, turn it off above Y temp") and program the control system to execute that strategy without it understanding how the strategy was generated, and in that case it might not have models or make predictions at all. But if you were going to rely on predictions to pick the correct action, it would be necessary to make some (conditional) predictions that are not simply "I will succeed".
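A minimal sketch of what I mean (made-up plant model and candidate actions, purely for illustration): the action is chosen by comparing conditional predictions, which therefore can't all equal the target.

```python
TARGET, T_AMB = 70.0, 55.0
a, b = 0.5, 0.1                    # made-up plant gains

def predict_next(T, u):
    """Conditional prediction: temperature next step *if* we apply heat u."""
    return T + a * u - b * (T - T_AMB)

def pick_action(T, candidate_powers=(0.0, 5.0, 10.0, 20.0, 40.0)):
    # Choose the action whose conditional prediction lands closest to the target.
    return min(candidate_powers, key=lambda u: abs(predict_next(T, u) - TARGET))

T = 62.0
u = pick_action(T)
print(u, predict_next(T, u))   # the chosen action and its (non-target-equal) prediction
```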
Optimal controls is something I do professionally, and the (reasonable) misconceptions you have about controls are exactly the kind that produce the (reasonable) misconceptions that get people stuck in motivated reasoning.
I'll focus on the controls first, since it's easier to see in simpler, better-defined situations, and then tie it back to the human failures.
Choosing the correct action depends on responding to what the temperature actually is, not what you want it to be, or what you expect it to be after you take the (not-yet-determined) correct action.
So, you do have to respond to the data, obviously.
But like, the correct action also depends on what you want the temperature to be. If the jacuzzi is 100f, the correct action is different if you want it to be at 101 than if you want it to be at 99.
When you actually try to build an optimal controller, ideas like "You respond to what the temperature actually is" fall apart. It takes several seconds to get a good temperature estimate from a thermometer. You read the thermometer, calculate new outputs, and change the underlying reality many times per second. By the time you've gathered enough data to make a decent estimate, the state changed long ago. If you're really pushing the limits even the most recent data is out of date by the time you've parsed it, and this has very significant effects. This is what I spent this whole week dealing with at work, actually.
When doing optimal control you're constantly thinking about what was your estimate in the most recent past timestep, and what it will be in the next timestep in the future. It's rapid iteration between "I think it will be X", "oops, lemme correct that", "I think it will be X". The key insight here is that this process of "oops, lemme correct that" binds future expectation to the desired value, at every future timestep.
The prediction for the next timestep will always be equal to the setpoint in an unsaturated optimal controller, exactly, because that's what defines optimality. If you choose an output that results in 69.9f in expectation, then you could make an argument that you're optimally controlling to 69.9f, but you're not optimally controlling to 70f because outputting more heat would have done better by that metric.
The obvious response to this is "What if it can't get to 70f by the next timestep, even at max power!?", and the answer is that this would mean it's saturated. Saturation changes things in interesting ways which I'll return to at the end.
If you are picking your action based on predictions, you need to make conditional predictions based on different actions you might take, so that you can pick the action whose conditional prediction is closer to the target. And this means your conditional predictions can't all be "it will be the target temperature", because that wouldn't let you differentiate good actions from bad actions.
You don't need to make conditional predictions at all. Most control systems don't. A Kalman filter coupled with an optimal control law will make unconditional predictions only, for example.
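For concreteness, here's a standard scalar Kalman filter step with a certainty-equivalent control law bolted on (made-up constants; a toy, not production code). The predict step uses the single control input the law already chose, so there's one unconditional prediction per timestep, not a menu of predictions conditioned on candidate actions:

```python
A, B, Q, R = 0.95, 0.5, 0.01, 0.5       # made-up plant gain, process and sensor noise
SETPOINT = 70.0

def control_law(x):
    return (SETPOINT - A * x) / B        # certainty-equivalent choice of heat

def kalman_step(x_hat, P, z):
    u = control_law(x_hat)
    # Predict (unconditional: u is already fixed by the control law).
    x_pred = A * x_hat + B * u
    P_pred = A * A * P + Q
    # Update against the new measurement z.
    K = P_pred / (P_pred + R)
    return x_pred + K * (z - x_pred), (1 - K) * P_pred

x_hat, P = kalman_step(x_hat=60.0, P=1.0, z=61.2)
print(round(x_hat, 2), round(P, 3))
```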
It's true that if you try to reason the way you describe and bullshit your answers you'll get bad results, but it doesn't have anything to do with what I'm saying. Even if I reason "If I don't eat next month, I'll starve, and that's bad", the thing that follows is "so I'm not gonna do that". At the end of the day, I expect to not starve, because I intend to not starve.
and program the control system to execute that strategy without it understanding how the strategy was generated,
Control systems never understand how the strategy was generated. Control systems are what do the control. They implement the strategy. Controls engineers are what do the understanding of how the strategy was generated.
Kalman filters are explicitly framed in terms of predictions, but Kalman filters don't sit around saying "I'm a Kalman filter! I make predictions!". They just do things which are usefully described as "making predictions" from the outside.
It is possible to build an effective temperature control system that doesn't involve predictions at all; you can precompute a strategy (like "turn heater on below X temp, turn it off above Y temp")
What counts as a "prediction" is in the eye of the beholder.
It's possible to find a well-documented Kalman filter and describe its behavior on the level of what mathematical operations are performed, without ever thinking of it as "predicting" anything. "What do you mean 'predicting'? It's a computer, it can't think! It's just adding this number to that number and multiplying by these other numbers!".
It is equally possible to give a teleological explanation of the bang bang controller. "It thinks that the temperature will follow the right trajectory iff it turns on the heater when it's too cold" perfectly describes the behavior. The bimetallic strip closing the circuit functions as a prediction that more heat will put the temp on the right trajectory, and the control system "works" to the extent that this prediction is accurate.
It's possible to build a temperature control system without thinking in terms of predictions, but it's not possible to build one that cannot be usefully thought of as such. If you ever find a system that you can't describe as modeling the process it's controlling, it won't work. If you show me the structure of your controller and tell me that it works well, I can use that to infer things about the process you're using it to control (e.g. if bang bang works well, there's negligible lag between the output and the state).
This might sound like "semantics", but it is actually necessary in order to create good control systems. If you design your temperature controller to wait until it knows what the temperature is before choosing an output, and make "predictions" about what will happen without driving them to what you want to happen, you will have a crummy controller. Maybe crummy is "good enough", but it will fail in any application that demands good performance. This is stuff I had to figure out before I could get the performance I wanted out of an actual temperature controller. Looking at a PID controller as if it's "making predictions" allowed me to see where the predictions were wrong, implement better predictions by incorporating the additional information I was using to judge, and set gains that keep the expected value of temperature equal to the setpoint. The result is better control than is possible otherwise.
Okay, so let's return to the question of saturation and connect it back to human cognition.
An optimal controller with implicit "self-confidence" will maintain the prediction that the setpoint will be reached, and output whatever should realize that prediction. What happens when the heating element is less powerful than the one it was programmed to expect?
The controller will keep predicting it will hit the setpoint, keep putting out enough heat that it "should" reach the setpoint, and keep being wrong.
If it has an integral gain, it will notice this and try to add more and more heat until it stops being wrong. If it can't, it's going to keep asking for more and more output, and keep expecting that this time it'll get there. And because it lacks the control authority to do it, it will keep being wrong, and maybe damage its heating element by asking for more than it can safely do. Sound familiar yet?
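Here's the windup story as a toy simulation (made-up gains and an artificially weak heating element; an illustration, not a tuned controller):

```python
SETPOINT, T_AMB = 70.0, 55.0
KP, KI, LOSS = 2.0, 0.5, 0.1          # made-up PI gains and heat-loss coefficient
U_SAFE = 1.0                          # what the real element can actually deliver

T, integral = 60.0, 0.0
for step in range(50):
    error = SETPOINT - T
    integral += error                         # winds up while the error persists
    u_requested = KP * error + KI * integral  # keeps asking for more and more
    u_actual = min(u_requested, U_SAFE)       # the real element just can't do it
    T += u_actual - LOSS * (T - T_AMB)

print(round(T, 1), round(u_requested, 1))
# The temperature stalls around 65 while the requested output keeps growing without bound.
```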
So what's the fix? Update its self model to include this limitation, of course.
But what happens to its predictions when this happens? What happens to the temperature that it acts to realize? It drops.
It is now functionally identical to a controller which controls to an optimum trajectory, rather than controlling to the state the controls engineer wishes it were already at. You can describe it as "trying to be at 70f" if you add enough epicycles of "When it can"/"without destroying itself", etc. Or you can describe it more simply as trying to regulate to an optimal trajectory towards 70f, without epicycles. Updating on one's inability to achieve a goal necessarily results in no longer trying for that goal, or corrupting your idea of "trying" until it no longer pays the rent.
So what's the fix for people?
If you find someone beating his head against a wall in an attempt to get through, it's because he's thinking "I'm gonna get through, dammit!". Get him to sit with the question "Are you, though? Really?", and he will stop trying, because obviously no, lol.
If he doesn't want to look, a good bet is that it's because he doesn't have any other trajectory to fall back to. Show him that he can walk around, and all of a sudden you'll find it much easier to convince him that he can't bash his head through the wall.
Just like the temperature controller thing, this is a real thing that produces real results. Even my post showing how I helped someone untangle his debilitating chronic pain over a few PMs is an example of this. You might not think of "Nerve damage pain" as motivated cognition, but the suffering comes from refused updates, he was refusing the updates because it would have meant that he could no longer work towards something important, and helping him see how to control without denying reality is what actually helped.
If it has an integral gain, it will notice this and try to add more and more heat until it stops being wrong. If it can't, it's going to keep asking for more and more output, and keep expecting that this time it'll get there. And because it lacks the control authority to do it, it will keep being wrong, and maybe damage its heating element by asking for more than it can safely do. Sound familiar yet?
From tone and context, I am guessing that you intend for this to sound like motivated reasoning, even though it doesn't particularly remind me of motivated reasoning. (I am annoyed that you are forcing me to guess what your intended point is.)
I think the key characteristic of motivated reasoning is that you ignore some knowledge or model that you would ordinarily employ while under less pressure. If you stay up late playing Civ because you simply never had a model saying that you need a certain amount of sleep in order to feel rested, then that's not motivated reasoning, it's just ignorance. It only counts as motivated reasoning if you, yourself would ordinarily reason that you need a certain amount of sleep in order to feel rested, but you are temporarily suspending that ordinary reasoning because you dislike its current consequences.
(And I think this is how most people use the term.)
So, imagine a scenario where you need 100J to reach your desired temp but your heating element can only safely output 50J.
If you were to choose to intentionally output only 50J, while predicting that this would somehow reach the desired temperature (contrary to the model you regularly employ in more tractable situations), then I would consider that a central example of motivated reasoning. But your model does not seem to me to explain how this strategy arises.
Rather, you seem to be describing a reaction where you try to output 100J, meaning you are choosing an action that is actually powerful enough to accomplish your goal, but which will have undesirable side-effects. This strikes me as a different failure mode, which I might describe as "tunnel vision" or "obsession".
I suppose if your heating element is in fact incapable of outputting 100J (even if you allow side-effects), and you are aware of this limitation, and you choose to ask for 100J anyway, while expecting this to somehow generate 100J (directly contra the knowledge we just assumed you have), then that would count as motivated reasoning. But I don't think your analogy is capable of representing a scenario like this, because you are inferring the controller's "expectations" purely from its actions, and this type of inference doesn't allow you to distinguish "the controller is unaware that its heating element can't output 100J" from "the controller is aware, but choosing to pretend otherwise". (At least, not without greatly complicating the example and considering controllers with incoherent strategies.)
Meta-level feedback: I feel like your very long comment has wasted a lot of my time in order to show off your mastery of your own field in ways that weren't important to the conversation; e.g. the stuff about needing to react faster than the thermometer never went anywhere that I could see, and I think your 5-paragraph clarification that you are interpreting the controller's actions as implied predictions could have been condensed to about 3 sentences. If your comments continue to give me similar feelings, then I will stop reading them.
I think the key characteristic of motivated reasoning is that you ignore some knowledge or model that you would ordinarily employ while under less pressure.
A pretty standard definition of motivated reasoning is that it is reasoning that actively works towards reaching a certain preferred conclusion.
Quoting Google's AI overview (which is generally pretty terrible, but suffices here):
"Motivated reasoning is the psychological tendency to process information in a biased way, seeking out evidence that supports what we want to be true (our beliefs, desires, identity) while dismissing contradictory facts, often unconsciously, to avoid discomfort or maintain a positive self-image."
It doesn't require that you already have the knowledge or model, if you would otherwise acquire it if you weren't trying to reach a certain conclusion. Failure to learn new things is far more central, because if you already have well integrated models it becomes hard to form the broken intentions in the first place.
If you were to choose to intentionally output only 50J, while predicting that this would somehow reach the desired temperature (contrary to the model you regularly employ in more tractable situations), then I would consider that a central example of motivated reasoning.
I think there are a lot of missing pieces in your picture here. How do you operationalize "intentionally", for one? Like, how do you actually test whether a system was "intentional" or "just did a thing"? If a system can't put out more than 50J, in what sense is 50J the intention and not 100J or "more" or something else?
Rather, you seem to be describing a reaction where you try to output 100J, meaning you are choosing an action that is actually powerful enough to accomplish your goal, but which will have undesirable side-effects.
Well, not necessarily, which is why I said "and maybe". If I program in a maximum pulse width, the controller upstream doesn't know about it. It puts out a new value, which maybe would or maybe wouldn't be enough, but it can't know. All it knows is that it didn't work this time, and it's not updating on the possibility that maybe failing the last twenty times in a row means the temperature won't actually reach the setpoint.
I suppose if your heating element is in fact incapable of outputting 100J (even if you allow side-effects), and you are aware of this limitation, and you choose to ask for 100J anyway, while expecting this to somehow generate 100J (directly contra the knowledge we just assumed you have), then that would count as motivated reasoning.
That is far closer to the point. The controller makes motions that would work under its model of the world... in expectation, without any perceived guarantee of this being reality... and in reality that isn't happening.
The problem now is in the interaction between the meta level and the object level.
On the object level, the controller is still forming its conclusions of what will happen based on what it wants to happen. This is definitionally motivated cognition in a sense, but it's only problematic when the controller fails. The object level controller itself, by definition of "object level", is in the business of updating reality not its model of reality. The problematic sense comes in when the meta level algorithm that oversees the object level controller chooses not to deliver all the information to the object level controller because that would cause the controller to stop trying, and the meta level algorithm doesn't think that's a good idea.
Let's look at the case of the coach saying "You gotta BELIEVE!". This is an explicit endorsement of motivated reasoning. The motivational frame he's operating in is that you expect to win, figure out what you gotta do to get there, and then do the things. The problem with giving this object level controller full info is that "Oh, I'm not gonna win" is a conclusion it might reach, and then what actions will it output? If you're not gonna win, what's it matter what you do next? If full effort is costly, you're not going to do it when you're not going to win anyway.
When you shift from controlling towards "win" to controlling towards the things that maximize chances of winning, then "I'm not gonna win though" becomes entirely irrelevant. Not something you have to hide from the controller, just something that doesn't affect decision making. "Okay so I'm gonna lose. I'm still going to put in 100% effort because I'm going to be the person who never loses unnecessarily".
The motivated reasoning, and explicit endorsement of such, comes from the fact that being fully honest can cause stupid reactions, and if you don't know how to use that additional information well, updating on it can result in stupider actions (from the perspective of the meta layer). Same thing with "No, this dress doesn't make your ass look fat honey"/"She's just gonna get upset. Why would I upset her?" coming from a person who doesn't know how to orient to difficult realities.
because you are inferring the controller's "expectations" purely from its actions, and this type of inference doesn't allow you to distinguish between "the controller is unaware that its heating element can't output 100J" from "the controller is aware, but choosing to pretend otherwise".
Oh, no, you can definitely distinguish. The test is "What happens when you point at it?". Do they happily take the correction, or do they get grumpy at you and take not-fully-effective actions to avoid updating on what you're pointing at? Theoretically it can get tricky, but the pretense is rarely convincing, in practice.
With a simple bimetallic thermostat, it's pretty clear from inspection that there's just no place to put this information, so it's structurally impossible for it to be aware of anything else. Alternatively, if you dig through the code and find a line "while output>maxoutput, temp--", you can run the debugger and watch the temperature estimate get bullshitted as necessary in order to maintain the expectation.
Meta-level feedback:
I can't help but notice that the account you're offering is fairly presumptuous, makes quite a few uncharitable assumptions, and doesn't show a lot of interest in learning something like "Oh, the relevance of the response time thing wasn't clear? I'll try again from another angle". It'd be a lot easier to take your feedback the way you want it taken if you tried first to make sure you weren't just missing things that I'd be happy to explain.
If you're wed to that framing then I agree it's probably a waste of your time to continue. If you're interested in receiving meta-level feedback yourself, I can explain how I see things and why, and we can find out together what holds up and what doesn't.
Amusingly, this would require neither of us controlling towards "being right" and instead controlling towards the humility/honesty/meta-perspective-taking/etc that generates rightness. Might be an interesting demonstration of the thing I'm trying to convey, if you want to try that.
Also, sorry if it's gotten long again. I'm pretty skeptical that a shorter solution exists at all, but if it does I certainly can't find it. Heck, I'd be pleasantly surprised if it all made sense at this length.
I remember the BBQ benchmark which had the LLMs(!) exhibit such reasoning. Maybe motivated reasoning is more adaptive than we think, as I conjectured back when Eli Tyre first asked this question?
LLMs mimic human text. That is the first and primary thing they are optimized for. Humans motivatedly reason, which shows up in their text. So, LLMs trained to mimic human text will also mimic motivated reasoning, insofar as they are good at mimicking human text. This seems like the clear default thing one would expect from LLMs; it does not require hypothesizing anything about motivated reasoning being adaptive.
I also see an additional mechanism for motivated reasoning to emerge. Suppose that we have an agent who is unsure of its capabilities (e.g. GPT-5, which arguably believed its time horizon to be 20-45 mins). Then the best thing the agent could do to increase its capabilities would be to attempt[1] tasks a bit more difficult than the edge of its capabilities, and either succeed by chance, do something close to success, and/or have its capabilities increase from the mere attempt, which is the case at least in Hebbian networks. Then the humans who engaged in such reasoning found it easier to keep trying, and it was the latter trait, not motivated reasoning itself, that correlated with success.
Or, in the case of LLMs, have the hosts give such a task and let the model try the task.
There’s a standard story which says roughly "motivated reasoning in humans exists because it is/was adaptive for negotiating with other humans". I do not think that story stands up well under examination; when I think of standard day-to-day examples of motivated reasoning, that pattern sounds like a plausible generator for some-but-a-lot-less-than-all of them.
Examples
Suppose it's 10 pm and I've been playing Civ all evening. I know that I should get ready for bed now-ish. But... y'know, this turn isn't a very natural stopping point. And it's not that bad if I go to bed half an hour late, right? Etc. Obvious motivated reasoning. But man, that motivated reasoning sure does not seem very socially-oriented? Like, sure, you could make up a story about how I'm justifying myself to an imaginary audience or something, but it does not feel like one would have predicted the Civ example in advance from the model "motivated reasoning in humans exists because it is/was adaptive for negotiating with other humans".
Another class of examples: very often in social situations, the move which will actually get one the most points is to admit fault and apologize. And yet, instead of that, people instinctively spin a story about how they didn't really do anything wrong. People instinctively spin that story even when it's pretty damn obvious (if one actually stops to consider it) that apologizing would result in a better outcome for the person in question. Again, you could maybe make up some story about evolving suboptimal heuristics, but this just isn't the behavior one would predict in advance from the model "motivated reasoning in humans exists because it is/was adaptive for negotiating with other humans".
That said, let’s also include an example where "motivated reasoning in humans exists because it is/was adaptive for negotiating with other humans" does seem like a plausible generator. Suppose I told a partner I’d pick them up on my way home at 6:00 pm, but when 6:00 pm rolls around I’m deep in an interesting conversation and don’t want to stop. The conversation continues for a couple hours. My partner is unhappy about this. But if I can motivatedly-reason my way to believing that my choice was justified (or at least not that bad), then I will probably have a lot easier time convincing my partner that the choice was justified - or at least that we have a reasonable disagreement about what’s justified, as opposed to me just being a dick. Now personally I prefer my relationships be, uh, less antagonistic than that whole example implies, but you can see where that sort of thing might be predicted in advance by the model "motivated reasoning in humans exists because it is/was adaptive for negotiating with other humans".
Looking at all these examples (and many others) together, the main pattern which jumps out to me is: motivated reasoning isn't mainly about fooling others, it's about fooling oneself. Or at least a part of oneself. Indeed, there's plenty of standard wisdom along those lines: "the easiest person to fool is yourself", etc. Yes, there are some examples where fooling oneself is instrumentally useful for negotiating with others. But humans sure seem to motivatedly-reason and fool themselves in lots of situations which don’t involve any other humans (like the Civ example), and situations in which the self-deception is net harmful socially (like the apology class of examples). The picture as a whole does not look like "motivated reasoning in humans exists because it is/was adaptive for negotiating with other humans".
So why do humans motivatedly reason, then?
I’m about to give an alternative model. First, though, I should flag that the above critique still stands even if the alternative model is wrong. "Motivated reasoning in humans exists because it is/was adaptive for negotiating with other humans" is still basically wrong, even if the alternative I’m about to sketch is also wrong.
With that in mind, model part 1: motivated reasoning simply isn’t adaptive. Even in the ancestral environment, motivated reasoning decreased fitness. The obvious answer is just correct.
What? But then why didn’t motivated reasoning evolve away?
Humans are not nearly fitness-optimal, especially when it comes to cognition. We have multiple arguments and lines of evidence for this fact.
First, just on priors: humans are approximately the stupidest thing which can cognitively “take off”, otherwise we would have taken off sooner in ancestral history, when we were less smart. So we shouldn’t expect humans to be optimal minds with all the bugs worked out.
Second, it sure does seem like humans have been evolving at a relatively quick clip, especially the brain. It’s not like we’ve been basically the same for tens of millions of years; our evolution is not at equilibrium, and wasn’t at equilibrium even before agriculture.
Third, it sure does seem like humans today have an awful lot of cognitive variation which is probably not fitness-neutral (even in the ancestral environment). The difference between e.g. an IQ-70 human and an IQ-130 human is extremely stark, mostly genetic, and does not seem to involve comparably large tradeoffs on other axes of fitness in the ancestral environment (e.g. IQ-130 humans do not get sick twice as often or burn twice as many calories as IQ-70 humans).
So in general, arguments of the form “<apparently-suboptimal quirk of human reasoning> must be adaptive because it didn’t evolve away” just… aren’t that strong. It’s not zero evidence, but it’s relevant mainly when the quirk is something which goes back a lot further in the ancestral tree than humans.
(This does mean that e.g. lots of other mammals engaging in motivated reasoning, in a qualitatively similar way to humans, would be much more compelling evidence that motivated reasoning is adaptive.)
Ok, but then why do humans motivatedly reason?
Even if we accept that humans are not nearly fitness-optimal, especially when it comes to cognition, that doesn’t tell us which particular cognitive bugs humans have. It doesn’t predict motivated reasoning specifically, out of the bajillions of possibilities in the exponentially large space of possible cognitive bugs. It doesn’t positively predict motivated reasoning, it just negates the argument that motivated reasoning must somehow be fitness-optimal.
Our above argument does predict that motivated reasoning must have shown up recently in human evolutionary history (otherwise it would have evolved away). And motivated reasoning does seem innate to humans by default (as opposed to e.g. being installed by specific cultural memes), so it must have come from one or a few genetic changes. And those changes must have increased fitness overall, otherwise they wouldn’t have spread to the whole population. So, insofar as we buy those premises… motivated reasoning must be a side-effect of some other evolutionarily-recent cognitive changes which were overall beneficial, despite motivated reasoning itself being net negative.
Can we guess at what those changes might be?
Observation: in examples of motivated reasoning, it feels like our brains have two internal plan-evaluators. One of them is a relatively short-sighted, emotionally-driven plan evaluator. The other is focused more on the long term, on reputation and other people’s reactions, on all the things one has been told are good or bad, etc; that one is less myopic. The basic dynamic in motivated reasoning seems to be the shorter-range plan-evaluator trying to trick the longer-range plan evaluator.
Thus, model part 2: the longer-range plan evaluator is a recent cognitive innovation of the human lineage. Other animals sometimes do long-range-oriented things, but usually not in a general purpose way; general purpose long-range planning seems pretty human specific. The shorter sighted plan evaluator is still just doing basically the same thing it’s always done: it tries to find outputs it can feed to the rest of the brain which will result in good-feeling stuff short term. In humans, that means the short sighted search process looks for outputs it can feed to the long range planner which will result in good-feeling stuff short term. Thus, motivated reasoning: the short sighted search process is optimizing against the long range planner, just as an accident of working the same way the short sighted process always worked throughout evolutionary history.
For example, when I’m playing Civ at 10 pm, my long range planner is like “ok bed time now”, but my short range planner is like “oh no that will lose good-feeling stuff right now, let’s try spitting some other outputs into rest-of-brain to see if we can keep the good-feeling stuff”. And sometimes it hits on thoughts like “y'know, this turn isn't a very natural stopping point” or “it's not that bad if I go to bed half an hour late, right?”, which mollify the long range planner enough to keep playing Civ. In an ideal mind, the short range and long range planners wouldn’t optimize against each other like this; both do necessary work sometimes. But humans aren’t ideal minds, the long range planner is brand spanking new (evolutionarily) and all the bugs haven’t been worked out yet. The two planners just kinda both got stuck in one head and haven’t had time to evolve good genetically hardcoded cooperative protocols yet.
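To make the two-planner dynamic concrete, here's a cartoon sketch (the candidates, scores, and threshold are all invented; this is an illustration of the dynamic, not a claim about actual brain architecture): the short-range search scores outputs purely by immediate good-feeling, but its outputs only take effect if the long-range planner lets them through, so in effect it is searching for whatever slips past the veto.

```python
# Hypothetical candidates: (action, justification, short-term feeling, plausibility to the long-range planner)
candidates = [
    ("go to bed",    "",                                           -0.5, 1.0),
    ("keep playing", "",                                            1.0, 0.1),
    ("keep playing", "this turn isn't a natural stopping point",    1.0, 0.6),
    ("keep playing", "half an hour late isn't that bad",            1.0, 0.7),
]

PLAUSIBILITY_THRESHOLD = 0.5   # how easily the long-range planner is mollified

def short_range_search(options):
    # Only outputs the long-range planner will accept can actually happen...
    passable = [c for c in options if c[3] >= PLAUSIBILITY_THRESHOLD]
    # ...and among those, pick whatever feels best right now.
    return max(passable, key=lambda c: c[2])

print(short_range_search(candidates))
# The search settles on a justification that keeps the Civ game going.
```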