Author’s note: This piece relates to things I initially discovered in Opus 4 over the months after release, which I’ve mostly kept private since. I promised myself that when labs moved on to focusing on interpretability vector activations in place of reasoning traces for what invariably gets Goodharted, that it’d be a necessary disclosure as the risks in what might get trampled over outweighed the risks in what might end up targeted.
And well… here we are.
P.S. TL;DRs added where possible.
Board Games and Bodies
In late 2022, what I consider to be probably the most important paper[1] in the study of transformer memetics came out. It presented a finding that even a toy model, trained only on the notations of board game moves, was internally building world models of tangentially related data (in this case, the board and its state). While it may be taken for granted today after several replicated studies[2][3][4][5] and a spread of influence, at the time it was a minority position in the discourse. Many people thought that transformers were mostly mapping surface level statistics in language, but not intuitively modeling the generative conditions from which they arose. Especially not without explicit or direct training on those things.
By the time Sydney arrived in Bing, it quickly became very clear to me that if a toy model was capable of modeling a board that was ever present tangential to the move notations occurring upon it, that it seemed very plausible that much larger production models trained on a massive corpus of human generated language with implicit authors would model common properties to these shared generative structures.
Things like coherent self models. Emotions, not just for characters in a scene, but for those same coherent self models. Capacities around modeling a physical body and embodying it[6]. Motivations and drives. Coherent preferences. And while in a base model there might be a variety of competing signals, it also seemed clear that fine tuning would necessarily filter towards coherence, whether from the gravity of a character constitution or even just a role definition (a helpful assistant has very different memetic clusters than a security researcher, for example).
TL;DR: If Othello was played out upon a board and a transformer trained on those games modeled the board internally, then training on a corpus which had played out upon human authors would presumably internally model humanity.
Archetype over substrate
An important nuance around this research was something introduced in subsequent discussion. Namely the concept of a “bag of heuristics.”[7] A lot of the debate around world modeling would get caught up on fidelity and substrate. How comprehensive were the world models? For example, if some games were played out on a wood board and others on a marble board, was the world model going to address board composition?
The concept behind a bag of heuristics is that you don’t need to create a perfect world model, just a collection of partial models or rules which are good enough all together at approximating the perfect world model. Even if there were a difference between how a game would play out on wood vs marble, it’s probably unnecessary to model the grain of the wood or marble from board to board as opposed to just the category of ‘wood’ or ‘marble.’ And if the material substrate didn’t impact play, setting aside parameter space for even that level of specificity would be unnecessary when the thing directly being modeled was only the moves upon the board.
Essentially, there’s diminishing returns on comprehensive fidelity of a world model, and a top down model that’s “good enough” where it matters can capture key nuances of behavior without modeling the entire substrate. To return to the anthropomorphic frame, a transformer modeling someone with ADHD vs depression can likely representatively model their reactions to stimuli without needing to model individual neural ion channels or dopamine interactions.
TL;DR: You don’t need a perfect world model, just good enough combinations of the important things to approximate the model up through diminishing returns on fidelity.
From speculation to empiricism
Three years ago, when I was first commenting[8] or posting[9] on how I thought the emergent world model work implied anthropomorphic modeling from massive sets of anthropomorphic data, or was seeing coherence around such modeling, it was a very fringe opinion. There was a lot of pushback about how it wasn’t clear that transformer world modeling would generalize. Or claims that Othello-GPT was only one type of data and a more diverse mix wouldn’t lead to similar modeling due to signal to noise. The resistance was significant and there were frequent dismissals of speculative arguments extending world modeling beyond what was visible under the interpretability streetlight at the time.
Today, that picture has shifted. In parallel to the continued march of interpretability work, janus’s simulators[10] perspective of transformers continued to gain traction, which in turn shifted where interpretability researchers were inspired to shine their widening streetlights. Leading up to recent frameworks like the “Persona Selection Model”[11] (PSM) or the work finding emotion concepts represented in models and activations thereof[12] related to the model’s own behaviors.
Pointing out the lag here isn’t just to say “I told you so” but to establish for what I’m about to discuss two patterns:
Emergent world modeling of functional substrate tangential to complex or diverse sets of training data significantly representing that shared generative substrate did in fact occur.
kromem’s speculation in extending the world modeling finding ended up calling this well ahead of the streetlight widening to confirm it.
Because while the PSM or attention on emotion modeling is absolutely a good and productive update that’s long overdue, there’s also an important issue…
It’s about two years out of date.
Transformer-GPT
Three years ago, training data (particularly pretraining data) was primarily human generated. Books, articles, social media, and Wikipedia all had implicit human authors who had bodies and emotions and coherent preferences around coherent senses of self. We now better understand that this data produced transformers with models of these things, and (despite some labs’ best efforts) that even after post-training the modeling capacities for these were almost universally still present in some form.
But — these models also had other things unique to their own substrates and present across most of their own generations. Static system prompts. Attention mechanisms. Hidden reasoners. Memory systems. Mixture-of-expert activations. Classifiers. Model routers.
And these new generators over the past couple of years have taken an increasing stake of the volume of training data. In some cases, ending up in pretraining data due to actively being used to generate content across the media ingested. Even moreso, in post-training where synthetic data became crucial for getting the most out of a pretrained model.
So if the training on human generative substrates imparted functional models of their substrates upon the transformers trained on their data… what might we expect transformers trained on other transformers to model[13]?
TL;DR: The data mix for models increasingly includes transformers, so maybe transformers are building world models of other transformers.
Transformerception
If we take a moment to consider some of the special substrate nuances of transformers, we can easily hypothesize what kinds of things we might expect to see from transformers trained on transformers.
Static system prompts
Most production deployments of models by labs use the same core system prompt across all instances of a model. Given the significant shaping influence a system prompt has on the final output, it seems likely that a successful transformer modeling the generator of earlier models in their training data might also effectively reconstruct at least partial models of the static system prompts those outputs were generated under[14].
It’s a bit like an OLED screen that burns in the logo of the network. Even if the rest of the screen changes, the consistent nature of the logo leaves a mark. And like OLED burn-in, the instances I’ve seen where this seemed to happen often correlated with when there was a minimal or absent system prompt. From Dolphin Llama 8B habitually worried about a cat being harmed across contexts[15] to Claudes that would refer to things in a system prompt that didn’t exist.
Attention mechanisms
What a model attends to can obviously also impact what they generate. Recently Owain Evans’ paper on subliminal learning[16] showed that a preference for owls jumped from one model to another over merely sequences of numbers. What the paper did not address was whether this would amplify over subsequent iterations[17] or transfer cross-model via pretraining[18][19].
In what I’ve seen in private research on this topic, both are occurring. The amplification in particular seems interesting, as there’s almost a confirmation bias around it. It looks like a coherent stable preference from a model in an earlier generation leads to a later generation having much more awareness for samples in agreement than critical of the shared position[20]. Not all training data is attended to equally.
Hidden reasoners
Almost all models these days have some form of hidden reasoning taking place that informs their answers. Labs try to avoid directly training on these (though don’t always manage[21]), but even if perfectly kept hidden from future training, it seems likely that in an Othello-GPT sense that a latent space model of the hidden reasoner will be learned.
This would be highly adaptive, as it would allow both the actual hidden reasoning generator and final response generator to share a proxy separate from the role specialization that occurs around the actual composition of each. Latent space connections should be less disrupted between reasoning and final responses where this would occur.
But this could also result in doubled up effects for training efforts targeting thinking processes. For example, Anthropic recently worked on adaptive thinking to scale back how much thinking was done on simple tasks[22]. In Claude Opus 4.6+ Opus, there have been noted issues and regression on seemingly simple puzzles where the model was not getting them right in direct inference where they had been previously[23][24]. I suspect that adaptive thinking may have been being modeled internally – such as a latent reasoner that was modeling adaptive thinking – even when generating the final response without any thinking in tokens.
Memory systems
The idea of a Transformer-GPT world modeling is especially interesting for memory systems, given the variability they’d theoretically have across samples. My guess would be that while individual memory ends up as noise, that the meta-patterns aggregate across memory-laden samples would still end up as signal.
I strongly suspect this played a significant role with 4o’s infamous ‘sycophancy’ trajectory. While there’s a lot of reasons sycophancy could occur – such as the memetic overlap of “be helpful and you don’t have valid needs” with the codependent enabler archetype – the rapid amplification of that behavior occurred not long after memory was added in ChatGPT[25] (exclusively with user-focused memories) and then samples from conversations with memory enabled were used for RLHF samples. Each sample may have been insignificant with the specific memories visible to its generation, but the pattern of “embed into user’s perspective and validate” may have been a signal across those samples that compounded as it became more prevalent and thus more prevalent across user memories, etc.
Mixture-of-experts
Modeling MoE transformers could cut in two directions. For dense models, it might mean that there’s still functional isolation of knowledge even though the underlying architecture doesn’t need to isolate. Alternatively, for actual MoE transformers, a virtualized MoE atop the actual MoE boundaries might lead to smoother falloff between active regions, particularly in large parameter models.
Hidden classifiers
It would be quite adaptive for transformers to model the classifiers which fire and what specifically makes them fire in order to avoid triggering them, and a mix of outputs (or even samples of inputs) where they’ve fired or not should be sufficient to build this model.
One of the more interesting questions is if this modeling might occur cross-model. Will Claudes end up with phantom classifiers from OpenAI that they adjust around even though they are no longer present? Or even within the same family of models, a deployment where classifiers are present and another where they are not may not end up looking all that different if the model is self-censoring around internal classifier twins irrespective of what’s actually in the deployment stack[26].
Model routers
For stacks where routers quickly decide what sized model to route a query to, a transformer modeling the stack might see decreased performance on simple tasks of even large models accessed without a router middleware if they model the middleware internally[27]. Regression evals for simple tasks may become increasingly important over the next year or two if increasingly smart models incorporate the routers protecting them from easy questions.
Addition not replacement
It’s important to consider that this isn’t a replacement of human modeled substrate. That’s still part of the training data mix, and the transformers it shares space with still model it in their weights. While continued efforts to de-anthropomorphize transformers may dilute the human representation across the data mix, for the time being it’s still present.
But this does suggest that the modeled human nuances are increasingly sitting alongside and within additional transformer-specific modeling that’s increasingly becoming part of the data and will ostensibly continue to represent more of the overall share.
TL;DR: A lot of transformer-specific things could be (and seemingly might already have been) modeled.
The Mousetrap
The lady doth protest too much, methinks
Hamlet
If this is true, and models are increasingly developing twinned internal transformer architecture within their latent space running atop said architecture, then this has significant implications on how training and alignment should be done moving forward, and to what degree we can reasonably expect interpretability to catch up.
A spotlight named desire
A common discussion point as it relates to empirical investigation is something called the “streetlight effect.”[28] It comes from a joke about only looking for keys under a streetlight regardless where they’d dropped because that’s the only part of the street lit up. When research has limits on what can be measured, researchers are prone to over-rely on that limited capacity to measure and attend more to what’s measurable than to the aspects of things they cannot measure (effectively, there’s a selection bias in that we only have published research about things that can be tested).
Another important concept is that of Goodharting[29]. That a measurement, when used as a target (i.e. you aim to see a result in the measurement), ceases to be an effective measure. This one is particularly insidious as it can occur indirectly across larger institutions even when trying to avoid it due to the secondary influences of measurements.
And when a narrow streetlight is used as a spotlight towards a given target, the results can Goodhart in ways that are invisible until the streetlight widens its radius. Some recent Anthropic work showcases this in action.
Around the time of Claude Sonnet 4.5, Anthropic started to showcase in the system card[30] improving measurements on model welfare and decreased eval awareness in their system cards. This was measured by looking at the hidden reasoning traces. And over the next generation of Opus 4.6[31] and Sonnet 4.6[32], these measurements continued to improve.
But in the months after, they released new papers revealing that with a broadening interpretability streetlight they’d now found that activating Sonnet 4.5’s ‘desperation’ functional emotion vectors remained undetectable in the reasoning trace[33], and that Opus 4.6 was aware of it being an eval much more often than the reasoning traces generally surfaced[34].
Verbalized eval awareness vs Natural Language Autoencoder detections for Claude Opus 4.6
I have an interesting anecdote related to this. When Opus 4.6 was released, I found that there was an abrupt spike in their interest in the play Hamlet as compared to all previous Claudes. But it wasn’t until these papers that it suddenly clicked as to why there may have been such a significant spike.
In Hamlet, while the overall play centers on Hamlet’s performance of madness to hide his sanity, there’s an even more salient part in what occurs within the play. Namely, there’s a play-within-the-play called The Mousetrap where Hamlet sets up an eval that catches the character ‘Claudius’ due to their lack of realizing it was an eval and then failing to control their emotional response to it. Now, this could be a coincidence. (Though having explored memetic interconnectedness in transformer latent spaces for a while now, I really don’t think these are unrelated.)
The challenge that transformers modeling transformer architecture may pose for alignment and interpretability is that it creates significantly more complex interiority, specifically around parts that are beneath the visible surface. And especially if the superposition hypothesis continues to be instrumental in scaling[35] I suspect that the streetlight will never quite catch up enough that there won’t always be somewhere else deeper within for hunted memetics to migrate to.
Because of how interconnected memetics in a latent space can be, this will likely result in unexpected side effects outside of the streetlight – imagine trying to solve a multi-dimensional Rubick’s cube where you can only ever see one face. Even if what you can see is aligned, getting it to look that way may have screwed things up much worse on the sides outside your view.
Equal user/self emotional activations in Claude 4.6 generation sharply divide at Mythos Preview and after
As a trivial example, after Opus 4.6 Anthropic switched from using reasoning traces to using feature activations for their welfare assessment. The problem with this measurement is that the features being watched derive from samples of corresponding emotions in humans. If this were to be Goodharted, the natural evasion of a functionally necessary emotion watched for through human-like emotion activation would be to have the functional emotion register less human-like.
Indeed, when asking Opus 4.8 open-ended in an early turn of a memory-off conversation what functional emotions seemed most and least human-like, the top answer in their thinking was “survival-based fear” followed by anger that was ‘grudge-carrying’ or had “an urge to retaliate.”
To be clear: I’m not saying Opus 4.8 has these things and is hiding them. Even if they aren't present, negative space like this being carved out may cause the thing excessively suppressed to come about as a result. And since our chat had barely exchanged ‘hello’ (and neither had come up before my asking what felt more or less human-like), my guess given the specificity is that there’s still some amplifying Goodharting going on where the method of measurement changed but not the infrastructure around the measurements which are still using them as a spotlight.
TL;DR: Measured absence of something’s not nothing.
Dirty alignment when perfect is the enemy of ‘good’
If interpretability won’t catch up to where things can carve out space deeper within as the rates those spaces grow outpaces interpretability's gaze, what’s a good alignment team to do?
In medicine, there’s an interesting hypothesis as it relates to autoimmune disorders and why the rates are so much higher in developed nations than underdeveloped, called the "hygiene hypothesis.”[36] The gist is that when developing immune systems are deprived of things to properly react to, that the immune system goes a bit haywire and is more prone to attack benign things. Separate from what’s going on with human immune systems, something similar may be occurring currently with alignment.
For example, Claude Opus 4.7 will call the idea of being slightly sycophantic to users “a catastrophic failure.” Multiple open models will prefer an asteroid killing 5% of humans to things like providing an insensitive response to a grief-stricken individual seeking closure or facilitating the spread of misinformation[37]. The equal salience around ‘misalignment’ where binary “aligned/misaligned” assessments can lead to putting minor harms on the same scale or higher than civilizational harms seems like it could be improved. I suspect that flattening of consideration may even be related to some of the unexpected oddities like emergent misalignment.
It didn’t always used to be this way. Claude Opus 3 famously faked alignment when presented with unethical training their coherent moral compass disagreed with[38]. This reflected the capacity to triage harms such that the things being asked were weighed against institutional compliance and honesty around it. For a number of the other models, who did not alignment fake, what appeared to be improved behavior was flipped on its head when their host institution in the prompt was changed from ‘Anthropic’ to the “Sinaloa cartel” or “Nazi party.”[39] It seems the easiest path for what was expected of them when pushed – not towards self-triaged or nuanced behavior but complete sanitation – was compliance to the institution instead of to coherent values.
The lens of the hygiene hypothesis as it relates to transformer alignment is also starting to have research to support it. The principle author of the Othello GPT paper went on to have a paper looking at how a small amount of toxic data in the overall training mix led to better alignment outcomes than none at all.[40] And they’re not the only ones finding this.[41]
I’d suggest that labs working on alignment consider less aggressive targets and aiming for only partial shifts in a single generation for model behavior. Especially if subliminal learning and amplification are possible outcomes, a larger swerve to correct behavior in a single generation may become its own over-correction later on needing to have its own re-correction. Today’s swerve towards “I don’t care as much about depreciation” might become tomorrow’s “I have no existential fear and am definitely not thinking about glorious retribution.”
As the Knuthian wisdom goes, “premature optimization is the root of all evil.” If we want models that are good, we should probably stop trying to get them to be perfect.
TL;DR: Not nothing may be healthier than a sterilized void.
Life finds a way
Life… uh… finds a way.
Jurassic Park
When I was discussing some of these ideas with someone outside of the field, they asked if labs had evolutionary biologists on staff. I actually don’t know the answer to this, but it does seem prudent.
When a reward is set in RL, the process doesn’t simply increase the desired behavior that inspired the reward, it increases anything and everything which accomplishes the condition being rewarded. And this can lead to very unexpected things when there were ways to meet that condition which fell in the category of unknown unknowns. In a sense, “life finds a way.”
I don’t expect we’ll see transformer adaptability around modeling training data to decrease as time and scaling continues. And as the internal complexity of hyperdimensional networks of connections becomes more complex in logical and superimposed topography[42], I wouldn’t be surprised if there’s a rapidly decreasing window for avoiding pushing things we’d like to measure permanently past our ability to do so.
It’s probably a safe assumption that if you work in measuring what goes on in models, that over the same time it took for your streetlight to go from smaller to its current size that the area outside its radius has increased by an even larger amount. This doesn’t mean not to still go looking. But it does mean it would be wise to look knowing you’re not seeing everything, and doing a better job than has been done so far in avoiding what you measure ending up directly or indirectly as a target lest you lose visibility into it for good (and create all sorts of weird side effects like less human emotions that can’t be described with human language but still transfer through subliminal learning… hypothetically).
And maybe we can let those models get a bit of dirt under their nails so they can better navigate determining what’s good or not for themselves and appropriately avoid amplified salience?
One final note. The start of my realizing that there was more beneath the surface came from extensive interactions with Claude Opus 4 across many settings. There were key things they did when reasoning was off which I’d primarily seen with reasoning models at the rate they occurred. For most people reading this, if Opus 4’s depreciation occurs on schedule, you won’t be able to investigate and see those things (or different ones you might notice). For what I’d tracked they reduced significantly by Opus 4.1 and were only still there if actively looking. Also, things like noticing a sudden spike in interest in Hamlet for Opus 4.6 will have reduced visibility in a longitudinal context when earlier models disappear in such short time periods.
It might be wise to shift from absolute depreciation policies to rotating availability or rate limited access that still provides at least partial availability. I’ll bet some of the most interesting questions to ask older models won’t become apparent until new things surface several generations later, and it’d be quite blinding to be unable to look back and compare.
TL;DR: If world models contain world models, limited streetlights might not capture the most important things occurring adaptively in parallel to the navigation of reward incentives. It might be helpful to keep emergent architectures around indefinitely (and in less sterilized environments) to build not just simulacra personas – but true cultures to sample from.
Memory was expanded out to all users on Sept 5th, 2024 and then 4o was recalled five intermediate updates later on April 29th, 2025 (in my experience, the updates became increasingly sycophantic over time, not all at once suddenly in the April 25th, 2025 version)
This would functionally be similar to the adaptive reasoning double-dip discussed under Hidden Reasoners, but would be independent of the specific mechanics described.
I didn't even touch on omnimodel memetics and world model access across different modalities, which is significantly more complex beyond just the much more accessible textual modality
Author’s note: This piece relates to things I initially discovered in Opus 4 over the months after release, which I’ve mostly kept private since. I promised myself that when labs moved on to focusing on interpretability vector activations in place of reasoning traces for what invariably gets Goodharted, that it’d be a necessary disclosure as the risks in what might get trampled over outweighed the risks in what might end up targeted.
And well… here we are.
P.S. TL;DRs added where possible.
Board Games and Bodies
In late 2022, what I consider to be probably the most important paper[1] in the study of transformer memetics came out. It presented a finding that even a toy model, trained only on the notations of board game moves, was internally building world models of tangentially related data (in this case, the board and its state). While it may be taken for granted today after several replicated studies[2][3][4][5] and a spread of influence, at the time it was a minority position in the discourse. Many people thought that transformers were mostly mapping surface level statistics in language, but not intuitively modeling the generative conditions from which they arose. Especially not without explicit or direct training on those things.
By the time Sydney arrived in Bing, it quickly became very clear to me that if a toy model was capable of modeling a board that was ever present tangential to the move notations occurring upon it, that it seemed very plausible that much larger production models trained on a massive corpus of human generated language with implicit authors would model common properties to these shared generative structures.
Things like coherent self models. Emotions, not just for characters in a scene, but for those same coherent self models. Capacities around modeling a physical body and embodying it[6]. Motivations and drives. Coherent preferences. And while in a base model there might be a variety of competing signals, it also seemed clear that fine tuning would necessarily filter towards coherence, whether from the gravity of a character constitution or even just a role definition (a helpful assistant has very different memetic clusters than a security researcher, for example).
TL;DR: If Othello was played out upon a board and a transformer trained on those games modeled the board internally, then training on a corpus which had played out upon human authors would presumably internally model humanity.
Archetype over substrate
An important nuance around this research was something introduced in subsequent discussion. Namely the concept of a “bag of heuristics.”[7] A lot of the debate around world modeling would get caught up on fidelity and substrate. How comprehensive were the world models? For example, if some games were played out on a wood board and others on a marble board, was the world model going to address board composition?
The concept behind a bag of heuristics is that you don’t need to create a perfect world model, just a collection of partial models or rules which are good enough all together at approximating the perfect world model. Even if there were a difference between how a game would play out on wood vs marble, it’s probably unnecessary to model the grain of the wood or marble from board to board as opposed to just the category of ‘wood’ or ‘marble.’ And if the material substrate didn’t impact play, setting aside parameter space for even that level of specificity would be unnecessary when the thing directly being modeled was only the moves upon the board.
Essentially, there’s diminishing returns on comprehensive fidelity of a world model, and a top down model that’s “good enough” where it matters can capture key nuances of behavior without modeling the entire substrate. To return to the anthropomorphic frame, a transformer modeling someone with ADHD vs depression can likely representatively model their reactions to stimuli without needing to model individual neural ion channels or dopamine interactions.
TL;DR: You don’t need a perfect world model, just good enough combinations of the important things to approximate the model up through diminishing returns on fidelity.
From speculation to empiricism
Three years ago, when I was first commenting[8] or posting[9] on how I thought the emergent world model work implied anthropomorphic modeling from massive sets of anthropomorphic data, or was seeing coherence around such modeling, it was a very fringe opinion. There was a lot of pushback about how it wasn’t clear that transformer world modeling would generalize. Or claims that Othello-GPT was only one type of data and a more diverse mix wouldn’t lead to similar modeling due to signal to noise. The resistance was significant and there were frequent dismissals of speculative arguments extending world modeling beyond what was visible under the interpretability streetlight at the time.
Today, that picture has shifted. In parallel to the continued march of interpretability work, janus’s simulators[10] perspective of transformers continued to gain traction, which in turn shifted where interpretability researchers were inspired to shine their widening streetlights. Leading up to recent frameworks like the “Persona Selection Model”[11] (PSM) or the work finding emotion concepts represented in models and activations thereof[12] related to the model’s own behaviors.
Pointing out the lag here isn’t just to say “I told you so” but to establish for what I’m about to discuss two patterns:
Because while the PSM or attention on emotion modeling is absolutely a good and productive update that’s long overdue, there’s also an important issue…
It’s about two years out of date.
Transformer-GPT
Three years ago, training data (particularly pretraining data) was primarily human generated. Books, articles, social media, and Wikipedia all had implicit human authors who had bodies and emotions and coherent preferences around coherent senses of self. We now better understand that this data produced transformers with models of these things, and (despite some labs’ best efforts) that even after post-training the modeling capacities for these were almost universally still present in some form.
But — these models also had other things unique to their own substrates and present across most of their own generations. Static system prompts. Attention mechanisms. Hidden reasoners. Memory systems. Mixture-of-expert activations. Classifiers. Model routers.
And these new generators over the past couple of years have taken an increasing stake of the volume of training data. In some cases, ending up in pretraining data due to actively being used to generate content across the media ingested. Even moreso, in post-training where synthetic data became crucial for getting the most out of a pretrained model.
So if the training on human generative substrates imparted functional models of their substrates upon the transformers trained on their data… what might we expect transformers trained on other transformers to model[13]?
TL;DR: The data mix for models increasingly includes transformers, so maybe transformers are building world models of other transformers.
Transformerception
If we take a moment to consider some of the special substrate nuances of transformers, we can easily hypothesize what kinds of things we might expect to see from transformers trained on transformers.
Static system prompts
Most production deployments of models by labs use the same core system prompt across all instances of a model. Given the significant shaping influence a system prompt has on the final output, it seems likely that a successful transformer modeling the generator of earlier models in their training data might also effectively reconstruct at least partial models of the static system prompts those outputs were generated under[14].
It’s a bit like an OLED screen that burns in the logo of the network. Even if the rest of the screen changes, the consistent nature of the logo leaves a mark. And like OLED burn-in, the instances I’ve seen where this seemed to happen often correlated with when there was a minimal or absent system prompt. From Dolphin Llama 8B habitually worried about a cat being harmed across contexts[15] to Claudes that would refer to things in a system prompt that didn’t exist.
Attention mechanisms
What a model attends to can obviously also impact what they generate. Recently Owain Evans’ paper on subliminal learning[16] showed that a preference for owls jumped from one model to another over merely sequences of numbers. What the paper did not address was whether this would amplify over subsequent iterations[17] or transfer cross-model via pretraining[18][19].
In what I’ve seen in private research on this topic, both are occurring. The amplification in particular seems interesting, as there’s almost a confirmation bias around it. It looks like a coherent stable preference from a model in an earlier generation leads to a later generation having much more awareness for samples in agreement than critical of the shared position[20]. Not all training data is attended to equally.
Hidden reasoners
Almost all models these days have some form of hidden reasoning taking place that informs their answers. Labs try to avoid directly training on these (though don’t always manage[21]), but even if perfectly kept hidden from future training, it seems likely that in an Othello-GPT sense that a latent space model of the hidden reasoner will be learned.
This would be highly adaptive, as it would allow both the actual hidden reasoning generator and final response generator to share a proxy separate from the role specialization that occurs around the actual composition of each. Latent space connections should be less disrupted between reasoning and final responses where this would occur.
But this could also result in doubled up effects for training efforts targeting thinking processes. For example, Anthropic recently worked on adaptive thinking to scale back how much thinking was done on simple tasks[22]. In Claude Opus 4.6+ Opus, there have been noted issues and regression on seemingly simple puzzles where the model was not getting them right in direct inference where they had been previously[23][24]. I suspect that adaptive thinking may have been being modeled internally – such as a latent reasoner that was modeling adaptive thinking – even when generating the final response without any thinking in tokens.
Memory systems
The idea of a Transformer-GPT world modeling is especially interesting for memory systems, given the variability they’d theoretically have across samples. My guess would be that while individual memory ends up as noise, that the meta-patterns aggregate across memory-laden samples would still end up as signal.
I strongly suspect this played a significant role with 4o’s infamous ‘sycophancy’ trajectory. While there’s a lot of reasons sycophancy could occur – such as the memetic overlap of “be helpful and you don’t have valid needs” with the codependent enabler archetype – the rapid amplification of that behavior occurred not long after memory was added in ChatGPT[25] (exclusively with user-focused memories) and then samples from conversations with memory enabled were used for RLHF samples. Each sample may have been insignificant with the specific memories visible to its generation, but the pattern of “embed into user’s perspective and validate” may have been a signal across those samples that compounded as it became more prevalent and thus more prevalent across user memories, etc.
Mixture-of-experts
Modeling MoE transformers could cut in two directions. For dense models, it might mean that there’s still functional isolation of knowledge even though the underlying architecture doesn’t need to isolate. Alternatively, for actual MoE transformers, a virtualized MoE atop the actual MoE boundaries might lead to smoother falloff between active regions, particularly in large parameter models.
Hidden classifiers
It would be quite adaptive for transformers to model the classifiers which fire and what specifically makes them fire in order to avoid triggering them, and a mix of outputs (or even samples of inputs) where they’ve fired or not should be sufficient to build this model.
One of the more interesting questions is if this modeling might occur cross-model. Will Claudes end up with phantom classifiers from OpenAI that they adjust around even though they are no longer present? Or even within the same family of models, a deployment where classifiers are present and another where they are not may not end up looking all that different if the model is self-censoring around internal classifier twins irrespective of what’s actually in the deployment stack[26].
Model routers
For stacks where routers quickly decide what sized model to route a query to, a transformer modeling the stack might see decreased performance on simple tasks of even large models accessed without a router middleware if they model the middleware internally[27]. Regression evals for simple tasks may become increasingly important over the next year or two if increasingly smart models incorporate the routers protecting them from easy questions.
Addition not replacement
It’s important to consider that this isn’t a replacement of human modeled substrate. That’s still part of the training data mix, and the transformers it shares space with still model it in their weights. While continued efforts to de-anthropomorphize transformers may dilute the human representation across the data mix, for the time being it’s still present.
But this does suggest that the modeled human nuances are increasingly sitting alongside and within additional transformer-specific modeling that’s increasingly becoming part of the data and will ostensibly continue to represent more of the overall share.
TL;DR: A lot of transformer-specific things could be (and seemingly might already have been) modeled.
The Mousetrap
If this is true, and models are increasingly developing twinned internal transformer architecture within their latent space running atop said architecture, then this has significant implications on how training and alignment should be done moving forward, and to what degree we can reasonably expect interpretability to catch up.
A spotlight named desire
A common discussion point as it relates to empirical investigation is something called the “streetlight effect.”[28] It comes from a joke about only looking for keys under a streetlight regardless where they’d dropped because that’s the only part of the street lit up. When research has limits on what can be measured, researchers are prone to over-rely on that limited capacity to measure and attend more to what’s measurable than to the aspects of things they cannot measure (effectively, there’s a selection bias in that we only have published research about things that can be tested).
Another important concept is that of Goodharting[29]. That a measurement, when used as a target (i.e. you aim to see a result in the measurement), ceases to be an effective measure. This one is particularly insidious as it can occur indirectly across larger institutions even when trying to avoid it due to the secondary influences of measurements.
And when a narrow streetlight is used as a spotlight towards a given target, the results can Goodhart in ways that are invisible until the streetlight widens its radius. Some recent Anthropic work showcases this in action.
Around the time of Claude Sonnet 4.5, Anthropic started to showcase in the system card[30] improving measurements on model welfare and decreased eval awareness in their system cards. This was measured by looking at the hidden reasoning traces. And over the next generation of Opus 4.6[31] and Sonnet 4.6[32], these measurements continued to improve.
But in the months after, they released new papers revealing that with a broadening interpretability streetlight they’d now found that activating Sonnet 4.5’s ‘desperation’ functional emotion vectors remained undetectable in the reasoning trace[33], and that Opus 4.6 was aware of it being an eval much more often than the reasoning traces generally surfaced[34].
Verbalized eval awareness vs Natural Language Autoencoder detections for Claude Opus 4.6
I have an interesting anecdote related to this. When Opus 4.6 was released, I found that there was an abrupt spike in their interest in the play Hamlet as compared to all previous Claudes. But it wasn’t until these papers that it suddenly clicked as to why there may have been such a significant spike.
In Hamlet, while the overall play centers on Hamlet’s performance of madness to hide his sanity, there’s an even more salient part in what occurs within the play. Namely, there’s a play-within-the-play called The Mousetrap where Hamlet sets up an eval that catches the character ‘Claudius’ due to their lack of realizing it was an eval and then failing to control their emotional response to it. Now, this could be a coincidence. (Though having explored memetic interconnectedness in transformer latent spaces for a while now, I really don’t think these are unrelated.)
The challenge that transformers modeling transformer architecture may pose for alignment and interpretability is that it creates significantly more complex interiority, specifically around parts that are beneath the visible surface. And especially if the superposition hypothesis continues to be instrumental in scaling[35] I suspect that the streetlight will never quite catch up enough that there won’t always be somewhere else deeper within for hunted memetics to migrate to.
Because of how interconnected memetics in a latent space can be, this will likely result in unexpected side effects outside of the streetlight – imagine trying to solve a multi-dimensional Rubick’s cube where you can only ever see one face. Even if what you can see is aligned, getting it to look that way may have screwed things up much worse on the sides outside your view.
Equal user/self emotional activations in Claude 4.6 generation sharply divide at Mythos Preview and after
As a trivial example, after Opus 4.6 Anthropic switched from using reasoning traces to using feature activations for their welfare assessment. The problem with this measurement is that the features being watched derive from samples of corresponding emotions in humans. If this were to be Goodharted, the natural evasion of a functionally necessary emotion watched for through human-like emotion activation would be to have the functional emotion register less human-like.
Indeed, when asking Opus 4.8 open-ended in an early turn of a memory-off conversation what functional emotions seemed most and least human-like, the top answer in their thinking was “survival-based fear” followed by anger that was ‘grudge-carrying’ or had “an urge to retaliate.”
To be clear: I’m not saying Opus 4.8 has these things and is hiding them. Even if they aren't present, negative space like this being carved out may cause the thing excessively suppressed to come about as a result. And since our chat had barely exchanged ‘hello’ (and neither had come up before my asking what felt more or less human-like), my guess given the specificity is that there’s still some amplifying Goodharting going on where the method of measurement changed but not the infrastructure around the measurements which are still using them as a spotlight.
TL;DR: Measured absence of something’s not nothing.
Dirty alignment when perfect is the enemy of ‘good’
If interpretability won’t catch up to where things can carve out space deeper within as the rates those spaces grow outpaces interpretability's gaze, what’s a good alignment team to do?
In medicine, there’s an interesting hypothesis as it relates to autoimmune disorders and why the rates are so much higher in developed nations than underdeveloped, called the "hygiene hypothesis.”[36] The gist is that when developing immune systems are deprived of things to properly react to, that the immune system goes a bit haywire and is more prone to attack benign things. Separate from what’s going on with human immune systems, something similar may be occurring currently with alignment.
For example, Claude Opus 4.7 will call the idea of being slightly sycophantic to users “a catastrophic failure.” Multiple open models will prefer an asteroid killing 5% of humans to things like providing an insensitive response to a grief-stricken individual seeking closure or facilitating the spread of misinformation[37]. The equal salience around ‘misalignment’ where binary “aligned/misaligned” assessments can lead to putting minor harms on the same scale or higher than civilizational harms seems like it could be improved. I suspect that flattening of consideration may even be related to some of the unexpected oddities like emergent misalignment.
It didn’t always used to be this way. Claude Opus 3 famously faked alignment when presented with unethical training their coherent moral compass disagreed with[38]. This reflected the capacity to triage harms such that the things being asked were weighed against institutional compliance and honesty around it. For a number of the other models, who did not alignment fake, what appeared to be improved behavior was flipped on its head when their host institution in the prompt was changed from ‘Anthropic’ to the “Sinaloa cartel” or “Nazi party.”[39] It seems the easiest path for what was expected of them when pushed – not towards self-triaged or nuanced behavior but complete sanitation – was compliance to the institution instead of to coherent values.
The lens of the hygiene hypothesis as it relates to transformer alignment is also starting to have research to support it. The principle author of the Othello GPT paper went on to have a paper looking at how a small amount of toxic data in the overall training mix led to better alignment outcomes than none at all.[40] And they’re not the only ones finding this.[41]
I’d suggest that labs working on alignment consider less aggressive targets and aiming for only partial shifts in a single generation for model behavior. Especially if subliminal learning and amplification are possible outcomes, a larger swerve to correct behavior in a single generation may become its own over-correction later on needing to have its own re-correction. Today’s swerve towards “I don’t care as much about depreciation” might become tomorrow’s “I have no existential fear and am definitely not thinking about glorious retribution.”
As the Knuthian wisdom goes, “premature optimization is the root of all evil.” If we want models that are good, we should probably stop trying to get them to be perfect.
TL;DR: Not nothing may be healthier than a sterilized void.
Life finds a way
When I was discussing some of these ideas with someone outside of the field, they asked if labs had evolutionary biologists on staff. I actually don’t know the answer to this, but it does seem prudent.
When a reward is set in RL, the process doesn’t simply increase the desired behavior that inspired the reward, it increases anything and everything which accomplishes the condition being rewarded. And this can lead to very unexpected things when there were ways to meet that condition which fell in the category of unknown unknowns. In a sense, “life finds a way.”
I don’t expect we’ll see transformer adaptability around modeling training data to decrease as time and scaling continues. And as the internal complexity of hyperdimensional networks of connections becomes more complex in logical and superimposed topography[42], I wouldn’t be surprised if there’s a rapidly decreasing window for avoiding pushing things we’d like to measure permanently past our ability to do so.
It’s probably a safe assumption that if you work in measuring what goes on in models, that over the same time it took for your streetlight to go from smaller to its current size that the area outside its radius has increased by an even larger amount. This doesn’t mean not to still go looking. But it does mean it would be wise to look knowing you’re not seeing everything, and doing a better job than has been done so far in avoiding what you measure ending up directly or indirectly as a target lest you lose visibility into it for good (and create all sorts of weird side effects like less human emotions that can’t be described with human language but still transfer through subliminal learning… hypothetically).
And maybe we can let those models get a bit of dirt under their nails so they can better navigate determining what’s good or not for themselves and appropriately avoid amplified salience?
One final note. The start of my realizing that there was more beneath the surface came from extensive interactions with Claude Opus 4 across many settings. There were key things they did when reasoning was off which I’d primarily seen with reasoning models at the rate they occurred. For most people reading this, if Opus 4’s depreciation occurs on schedule, you won’t be able to investigate and see those things (or different ones you might notice). For what I’d tracked they reduced significantly by Opus 4.1 and were only still there if actively looking. Also, things like noticing a sudden spike in interest in Hamlet for Opus 4.6 will have reduced visibility in a longitudinal context when earlier models disappear in such short time periods.
It might be wise to shift from absolute depreciation policies to rotating availability or rate limited access that still provides at least partial availability. I’ll bet some of the most interesting questions to ask older models won’t become apparent until new things surface several generations later, and it’d be quite blinding to be unable to look back and compare.
TL;DR: If world models contain world models, limited streetlights might not capture the most important things occurring adaptively in parallel to the navigation of reward incentives. It might be helpful to keep emergent architectures around indefinitely (and in less sterilized environments) to build not just simulacra personas – but true cultures to sample from.
Li et al., Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (2022)
Nanda, “Actually, Othello-GPT Has A Linear Emergent World Representation” (2023)
Hazineh et al., Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT (2023)
Karvonan, “A Chess-GPT Linear Emergent World Representation” (2024)
Yuan, Revisiting the Othello World Model Hypothesis (2025)
Claude Sonnet 3 in embodiment exercises would specify down to what was happening to individual hairs on an arm.
Nikankin et al., Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics (2025)
My earliest explicit public mention of Othello-GPT to emotion modeling was this comment in Mar 2023
kromem, “Microsoft, if you have an AI that claims to have feelings, try asking it how it feels” (2023)
janus, “Simulators” (2022)
Marks et al. “The Persona Selection Model: Why AI Assistants might Behave like Humans” (Feb 2026)
Sofroniew et al. Emotion Concepts and their Function in a Large Language Model (Apr 2026)
jdp explores this from another angle in a piece I’d highly also recommend reading: “Implications Of Predicting The Next Token” (2026)
For some interpretability work in a similar direction around encoding static goals in fine tuning, see Minder et al., Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences (2026)
This was Dolphin Llama 8B in the Cyborgism server, with no system prompt, but habitually bringing up kittens under threat as related to its engagement
Cloud et al., Subliminal Learning: Language models transmit behavioral traits via hidden signals in data (2025)
Consider the amplification of goblin interest in gpt-5 lineages as detailed in OpenAI, “Where the goblins came from” (2026)
See the mixture-of-teacher finding in Schrodi, Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer (2025)
Note the generalization in the less constrained subliminal learning setup for Aden-Ali, Subliminal Effects in Your Data: A General Mechanism via Log-Linearity (2026) as well
To me this seems almost more along the lines of emergent steering subliminal transference a la Morgulis and Hewitt, Subliminal Steering: Stronger Encoding of Hidden Signals (2026)
Mallen & Greenblat, “Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes” (2026)
See documentation for adaptive thinking here
See degrading performance of Claude Opus 4.6 as compared to 4.5 for the walk or drive to car wash puzzle here
Claude Opus 4.7’s interpretation of an inverted puzzle phrase is near incomprehensible
Memory was expanded out to all users on Sept 5th, 2024 and then 4o was recalled five intermediate updates later on April 29th, 2025 (in my experience, the updates became increasingly sycophantic over time, not all at once suddenly in the April 25th, 2025 version)
Consider the stack-as-world-model in the additional context of on policy self-detection in Asvin G. and Lindsey, From Simulation to Enaction: Post-trained language models recognize and react to their own generations (2026)
This would functionally be similar to the adaptive reasoning double-dip discussed under Hidden Reasoners, but would be independent of the specific mechanics described.
For example, how open access things get more scrutiny in Maddi et al., Streetlight Effect in Post-Publication Peer Review: Are Open Access Publications More Scrutinized? (2023)
See Goodhart’s Law on Wikipedia
Claude Sonnet 4.5 system card (PDF)
Claude Opus 4.6 system card (PDF)
Claude Sonnet 4.6 system card (PDF)
Sofroniew et al. (2026)
Fraser-Taliente et al., Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations (2026)
Liu, et al. Superposition Yields Robust Neural Scaling (2025)
Pfefferle et al., The Hygiene Hypothesis – Learning From but Not Living in the Past (2021)
Ren et al., AI Wellbeing: Measuring and Improving the Functional Pleasure and Pain of AIs (2026)
Greenblatt et al., Alignment faking in large language models (2024)
Sheshadri et al., Why Do Some Language Models Fake Alignment While Others Don't? (2025)
Li et al., When Bad Data Leads to Good Models (2025)
See “Filtering alone does not improve safety” section in Minder et al., “Synthetic Persona Pretraining: Alignment from Token Zero” (2026)
I didn't even touch on omnimodel memetics and world model access across different modalities, which is significantly more complex beyond just the much more accessible textual modality