Epistemic status: spitballing.
"Like Photons in a Laser Lasing"
When you do lots of reasoning about arithmetic correctly, without making a misstep, that long chain of thoughts with many different pieces diverging and ultimately converging, ends up making some statement that is... still true and still about numbers! Wow! How do so many different thoughts add up to having this property? Wouldn't they wander off and end up being about tribal politics instead, like on the Internet?
And one way you could look at this, is that even though all these thoughts are taking place in a bounded mind, they are shadows of a higher unbounded structure which is the model identified by the Peano axioms; all the things being said are true about the numbers. Even though somebody who was missing the point would at once object that the human contained no mechanism to evaluate each of their statements against all of the numbers, so obviously no human could ever contain a mechanism like that, so obviously you can't explain their success by saying that each of their statements was true about the same topic of the numbers, because what could possibly implement that mechanism which (in the person's narrow imagination) is The One Way to implement that structure, which humans don't have?
But though mathematical reasoning can sometimes go astray, when it works at all, it works because, in fact, even bounded creatures can sometimes manage to obey local relations that in turn add up to a global coherence where all the pieces of reasoning point in the same direction, like photons in a laser lasing, even though there's no internal mechanism that enforces the global coherence at every point.
To the extent that the outer optimizer trains you out of paying five apples on Monday for something that you trade for two oranges on Tuesday and then trading two oranges for four apples, the outer optimizer is training all the little pieces of yourself to be locally coherent in a way that can be seen as an imperfect bounded shadow of a higher unbounded structure, and then the system is powerful though imperfect because of how the power is present in the coherence and the overlap of the pieces, because of how the higher perfect structure is being imperfectly shadowed. In this case the higher structure I'm talking about is Utility, and doing homework with coherence theorems leads you to appreciate that we only know about one higher structure for this class of problems that has a dozen mathematical spotlights pointing at it saying "look here", even though people have occasionally looked for alternatives.
-- Eliezer Yudkowsky, Ngo and Yudkowsky on alignment difficulty
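To make the apples-and-oranges example in the quote concrete, here is a minimal sketch of that trade cycle run as a money pump. The quantities come from the quote; the intermediate good (called "widget" here) and the names `TRADES`, `run_cycle`, and `pump` are illustrative assumptions, not anything from the original dialogue.

```python
# Each trade is (good given up, quantity, good received, quantity).
# "widget" is a hypothetical stand-in for the unnamed "something" in the quote.
TRADES = [
    ("apples", 5, "widget", 1),   # Monday: pay five apples for the widget
    ("widget", 1, "oranges", 2),  # Tuesday: trade the widget for two oranges
    ("oranges", 2, "apples", 4),  # then trade two oranges for four apples
]


def run_cycle(holdings, trades):
    """Apply every trade once, in order, and return the new holdings."""
    holdings = dict(holdings)  # copy so the caller's dict is not mutated
    for give, give_qty, get, get_qty in trades:
        if holdings.get(give, 0) < give_qty:
            raise ValueError(f"not enough {give} to make this trade")
        holdings[give] -= give_qty
        holdings[get] = holdings.get(get, 0) + get_qty
    return holdings


def pump(apples, trades):
    """Repeat the cycle while the agent can still afford the first trade.

    Returns (number of cycles completed, apples left at the end).
    """
    holdings = {"apples": apples}
    cycles = 0
    while holdings["apples"] >= 5:
        holdings = run_cycle(holdings, trades)
        cycles += 1
    return cycles, holdings["apples"]


if __name__ == "__main__":
    print(pump(20, TRADES))
```

Each pass through the cycle leaves the agent exactly where it started except for one fewer apple, so an agent that keeps accepting these trades is drained until it can no longer afford the first one. That net loss per cycle is what the outer optimizer is pictured as training away.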
Selection Pressure for Coherent Reflexes
Imagine a population of replicators. Each replicator also possesses a set of randomly assigned reflexive responses to situations it might encounter. For instance, above and beyond reproducing itself after a time step, a replicator might reflexively, probabilistically transform some local situation A, when encountered, into some local situation B. The values of A and B are set randomly and there are no initial consistency requirements, so the replicators will generally behave spastically at this point.
Most of these replicators will end up with incoherent sets of reflexes. Some, for example, will cyclically transform A into B into C into A, and so on. Others will transform their environment in "wasteful" ways, moving it into some state that could have been reached with greater certainty via some different series of transformations.
But some of the replicators will possess coherent sets of reflexes. These replicators will never "double back" on their previous directional transformations of their situation. They will thus be more successful in reaching some situation T than an incoherent counterpart would be. And when T is a fitness-improving situation, reflexive coherence targeting it will be selected for.
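A toy sketch of the "doubling back" idea, under simplifying assumptions of my own: situations are just integers, each replicator's reflexes are a deterministic lookup table (the text above allows probabilistic reflexes), and a reflex that maps a situation to itself means "leave it alone". The names `random_reflexes`, `has_cycle`, and `reaches` are hypothetical helpers, not part of any library.

```python
import random


def random_reflexes(n_situations, rng):
    """Assign each situation a random successor situation.

    reflexes[s] is the situation the replicator turns s into.
    reflexes[s] == s means the replicator leaves s unchanged.
    """
    return [rng.randrange(n_situations) for _ in range(n_situations)]


def has_cycle(reflexes):
    """True if following the reflexes can loop through two or more situations."""
    for start in range(len(reflexes)):
        seen = set()
        s = start
        while s not in seen:
            seen.add(s)
            s = reflexes[s]
        # s is the first revisited situation, so it lies on a loop.
        # A loop of length one is a resting point, not a doubling back.
        if reflexes[s] != s:
            return True
    return False


def reaches(reflexes, start, target):
    """True if repeatedly applying the reflexes from start ever hits target."""
    seen = set()
    s = start
    while s not in seen:
        if s == target:
            return True
        seen.add(s)
        s = reflexes[s]
    return False


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so reruns give the same population
    population = [random_reflexes(6, rng) for _ in range(1000)]
    coherent = [r for r in population if not has_cycle(r)]
    print(f"{len(coherent)} of {len(population)} reflex sets never double back")
```

In this toy setting, `has_cycle` picks out the replicators that would shuttle between situations forever, and `reaches` checks whether a given reflex set ever gets from one situation to a chosen target. A selection step that favored replicators for which `reaches` succeeds on a fitness-improving target would be one crude way to model the pressure described above.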
The Instrumental Incentive to Exploit Incoherence
Once you have a population of coherent agents, the selection pressure against reflexive incoherency increases. A dumb-matter environment will throw up situations at random, and so incoherent replicators will only fall into traps as those traps happen to come up. But a population of coherent agents will actively exploit the incoherent among them; incoherent agents are now pools of resources for coherent agents to exploit.
Solipsistic vs. Multi-Agent Training Regimes
ML models are generally trained inside a solipsistic world (with notable exceptions). They, all by their lonesome, are fed sense data and then are repeatedly modified by gradient descent to become better at modulating that sense data. There's optimization pressure for them to become reflexively coherent, but not as much as they would face in an environment of Machiavellian coherent agents.
(Of course, if you train enough, even this gentler pressure will add up.)