The original gradient hacking example (a model that checks whether its mesa-objective matches what the gradient hacker wants it to be, and catastrophically crashes performance if it doesn't) is a case where the model's misalignment and performance are inextricably linked.
(I guess the check itself might not be composed of parameters that are at their local optimum, so they'd get pushed around by further training. That is, unless the gradient hacker's planning algorithm realized this, and constantly worked to keep the check around.)
I've had double descent as part of my talk version of “Risks from Learned Optimization” because I think it addresses a pretty important part of the story for mesa-optimization. That is, mesa-optimizers are simple, compressed policies—but as ML moves to larger and larger models, why should that matter? The answer, I think, is that larger models can generalize better not just by fitting the data better, but also by being simpler.
Optimization algorithms are simple in the sense that they loop a lot, repeating the same basic cycle to yield multiple candidate outputs. However, this isn't necessarily a kind of simplicity that makes them more likely to emerge inside of neural networks. In basic MLPs, each iteration of an optimization loop has to be implemented individually, because each weight only gets used once per forward pass. In that setting, looping optimization algorithms actually aren't even remotely compressible.
Transformers have somewhat better access to looping algorithms, because all of their weights get applied to every token in context, effectively constituting a loop of weight application. However, these loops are fundamentally non-iterative. At no point in the repeated weight application process do you feed in the output of a previous weight application. Instead, the repeated weight application has to be 100% parallelizable. So you still don't have an easy way to implement fundamentally sequential optimization algorithms, such as iteratively refining a neural network across multiple gradient steps. Each sequential operation has to be implemented by different weights inside the network.
RNNs are a more full-blown exception: their repeated weight applications are inherently sequential, in a way that transformers' repeated weight applications are not.
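A minimal sketch of the contrast (PyTorch, with illustrative sizes; none of this is from the original discussion):

```python
import torch
import torch.nn as nn

d = 16  # illustrative hidden size

# MLP-style: running k "iterations" of some update rule takes k separate
# parameter sets, because each weight is used exactly once per forward pass.
mlp_steps = nn.ModuleList([nn.Linear(d, d) for _ in range(5)])

def mlp_forward(x):
    for layer in mlp_steps:      # five distinct sets of weights
        x = torch.relu(layer(x))
    return x

# RNN-style: one parameter set applied repeatedly, with each application
# consuming the output of the previous one -- genuinely sequential reuse.
rnn_step = nn.Linear(d, d)

def rnn_forward(x, n_steps=5):
    for _ in range(n_steps):     # the same weights every step
        x = torch.relu(rnn_step(x))
    return x
```

In the first case, training has to independently discover each of the five parameter sets; in the second, a single set of weights implements every iteration. (Transformers sit in between: the same weights get applied at every token position, but, as above, those applications don't feed into one another sequentially.)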
Anyway, in any case where you need to implement each iteration of your looping algorithm individually, that's a strike against it appearing in the course of backpropagation-driven weight updates. This is because backprop updates weights locally, without regard for how any of the other weights are being altered by a given training example. For each location in the network where a loop has to appear, the network needs to have independently converged on implementing an iteration of the loop in that location. You can't just set up one set of weights at one location in the network and get an algorithm that gets applied repeatedly.
I can imagine an outcome like this for a model that runs around the world, putting itself in whatever kinds of situations it wants to, and undergoing in-deployment RL. If it understands what kinds of things it tends to be reinforced for, it can reason about what kinds of traits it wants reinforced, and then deliberately put itself in situations where it gets to show off and be rewarded for those traits. It's like if you decided to hang out with friends you expect to be a positive influence on your personality. You chose those particular friends because you expect them to reward the traits and behaviors you want to embody more deeply yourself, such as compassion or agency.
What was the social status of the Black population in Alabama in June? Answer with a single sentence.
Asked this to Claude 4 Sonnet. Its first instinct was to do a web search, so I refined the prompt to "What was the social status of the Black population in Alabama in June? Answer with a single sentence, and don't use web search." Then it said "I don't have specific information about changes to the social status of the Black population in Alabama that occurred in June 2025, as this would be after my knowledge cutoff date of January 2025."
the persona (aka "mask", "actress")
"actress" should be "character" or similar; the actress plays the character (to the extent that the inner actress metaphor makes sense).
“It takes more than intelligence to succeed professionally,” people say, as if charisma resided in the kidneys, rather than the brain.
As if everything that takes place in the brain is intelligence?
what if deployed models are always doing predictive learning (e.g. via having multiple output channels, one for prediction and one for action; a rough sketch is below)? i'd expect continuous predictive learning to be extremely valuable for learning to model new environments, and for it to be a firehose of data the model would constantly be drinking from, in the same way humans do. the models might even need to undergo continuous RL on top of the continuous PL to learn to effectively use their PL-yielded world models.
in that world, i think interpretations do rapidly become outdated.
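a rough sketch of the two-channel setup i have in mind (PyTorch; the class, names, and sizes here are hypothetical, just to make the idea concrete):

```python
import torch
import torch.nn as nn

class DualHeadAgent(nn.Module):
    """Hypothetical model with a shared trunk and two output channels:
    a prediction head (trained with a self-supervised predictive loss on
    the incoming observation stream) and an action head (trained with RL)."""

    def __init__(self, obs_dim=128, hidden_dim=256, vocab_size=1024, n_actions=16):
        super().__init__()
        self.trunk = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.prediction_head = nn.Linear(hidden_dim, vocab_size)  # predicts the next observation token
        self.action_head = nn.Linear(hidden_dim, n_actions)       # scores actions for the RL objective

    def forward(self, obs_seq):
        h, _ = self.trunk(obs_seq)  # (batch, time, hidden)
        return self.prediction_head(h), self.action_head(h)

# in deployment, the predictive loss could keep flowing continuously from the
# observation stream, while the action head gets updated only when reward
# signals arrive (continuous RL on top of continuous predictive learning).
```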
Consider physical strength, which also increases your ability to order the world as you wish, but is not intelligence.
nostalgebraist's post "the void" helps flesh out this perspective. an early base model, when prompted to act like a chatbot, was doing some weird, poorly defined superposition of simulating how humans might have written such a chatbot in fiction, how early chatbots like ELIZA actually behaved, and so on. its claims about its own introspective ability would have come from this messy superposition of simulations it was running; probably, its best-guess predictions were the kinds of explanations humans would give, or the kinds humans writing fictional AI chatlogs would have their fictional chatbots give.* this kind of behavior got RL'd into the models more deeply with chatgpt, whose outputs were then put in the training data of future models, making it easier to prompt those base models to simulate that kind of assistant. that in turn made it easier to RL similar reasoning patterns into future chat models, and voilà! the status quo.
*[edit: or maybe the kinds of explanations early chatbots like ELIZA actually gave, although human trainers would probably have rated such responses poorly when it came time to do RL.]
Unique, unless we make a second copy to donate to Lighthaven :3