There’s a peculiar feeling that comes from staring too long at a repeated word. After the tenth or twentieth time your eyes land on “dog”, the word begins to lose its sharpness. It collapses into a sound, then into a set of shapes, and eventually, if you persist, into something almost meaningless. The mind quiets around it. This isn’t just a quirk of perception; it’s habituation, one of the most ancient and fundamental forms of neural plasticity.
The question that pulled me into this project was simple, almost childlike: do transformers habituate too? If you give a language model the same word over and over, does it, like us, learn to tune it out? Or does it just grind away statistically, reinforcing the repetition, never really learning to let go?
That curiosity grew into a series of experiments where I opened up the “artificial brain” of a transformer and looked directly at the flow of activations. What I found was not only surprising but deeply suggestive: transformers do appear to habituate, though only in certain layers, and in a way that mirrors, and then diverges from, biological intelligence.
Why Habituation Matters
Habituation is not glamorous in the way that memory or consciousness is, but it is a bedrock principle of efficient intelligence. A nervous system that cannot habituate is a system that drowns in noise. Every repeated input would demand full metabolic attention, leaving no capacity to detect novelty or build abstractions. From sea slugs to humans, habituation is nature’s way of saying: “this again? I’ll conserve energy and ignore it.”
In artificial intelligence, the analogy is less clear. Transformers are often described as sophisticated pattern recognizers, engines of statistical frequency rather than adaptive suppression. The default assumption has been that they "notice" repeated tokens more, not less. And yet, their uncanny ability to generalize, to hold context, and to discard irrelevance suggests there may be deeper computational strategies at play. If we find habituation in these models, it is not just a curiosity; it speaks to whether AI systems are converging on the same efficiency principles that biological evolution discovered. That convergence would have implications for interpretability, safety, and even the philosophy of intelligence.
What the Experiments Showed
I worked with Qwen3-4B, a medium-scale transformer, recording internal activations as I fed it repeated inputs. The analysis focused on the down-projection vectors of the MLPs at key layers. What emerged was a two-stage story of how information flows from redundancy-filtering to output optimization.
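To make the setup concrete, here is a minimal sketch of how such activations can be captured, assuming the model is loaded through Hugging Face transformers and that its decoder layers expose an `mlp.down_proj` module, as Qwen-style architectures do. The checkpoint name and layer indices below are illustrative rather than an exact record of the experimental configuration.

```python
# Minimal sketch of the probing setup: register forward hooks on the MLP
# down-projections of selected decoder layers and run a repeated-token prompt.
# Assumes a Hugging Face causal LM whose layers expose `mlp.down_proj`
# (as Qwen-style architectures do); checkpoint name and layer indices are
# illustrative, not a record of the exact experimental configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
model.eval()

layers_to_probe = [10, 20, 30, 34, 35]
captured = {}  # layer index -> down_proj output from the last forward pass

def make_hook(idx):
    def hook(module, args, output):
        # Keep a CPU copy of the activations so device memory stays free.
        captured[idx] = output.detach().float().cpu()
    return hook

handles = [
    model.model.layers[i].mlp.down_proj.register_forward_hook(make_hook(i))
    for i in layers_to_probe
]

prompt = " ".join(["The dog barks every night."] * 20)  # twenty identical sentences
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**inputs)

for h in handles:
    h.remove()
```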
Early and intermediate layers, specifically 10, 20, and 30, showed clear habituation. When the token “dog” was repeated twenty times in identical sentences, these layers responded like biological neurons under repetition suppression: activation magnitudes decayed exponentially. In Layer 10, the decline was dramatic, with activations dropping by over forty percent, closely matching canonical neuronal adaptation curves. Layer 20 showed a more moderate but consistent decrease, and Layer 30 exhibited another steep decline.
Then, something striking happened. At Layer 34, close to the output head, habituation vanished. By Layer 35, the pattern had flipped entirely: instead of suppressing redundancy, the model amplified it. Repeated words became louder in the network, not quieter. In this final stage, the model wasn’t filtering anymore; it was preparing to predict, leaning into the statistical reinforcement that next-token prediction demands.
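As a rough sketch of how both trends could be quantified from the captured activations (not a record of the exact analysis pipeline), one can take the norm of the probed vector at each occurrence of the repeated token and fit a simple exponential, checking whether the fitted curve decays, as in Layers 10, 20, and 30, or grows, as in Layers 34 and 35. The token-matching step below is an illustrative heuristic, and the code reuses the objects from the previous sketch.

```python
# Sketch: quantify habituation or amplification per layer. For each probed layer,
# take the L2 norm of the down-projection output at every occurrence of the
# repeated token and fit an exponential trend. Reuses `captured`, `tokenizer`,
# `inputs`, and `layers_to_probe` from the sketch above.
import numpy as np
from scipy.optimize import curve_fit

def exp_trend(n, a_inf, a0, tau):
    # a_n = a_inf + (a0 - a_inf) * exp(-n / tau): decays if a0 > a_inf, grows otherwise.
    return a_inf + (a0 - a_inf) * np.exp(-n / tau)

token_ids = inputs["input_ids"][0].cpu()
dog_id = tokenizer.encode(" dog", add_special_tokens=False)[0]
positions = (token_ids == dog_id).nonzero(as_tuple=True)[0]

for layer in layers_to_probe:
    acts = captured[layer][0]                     # (seq_len, hidden_dim)
    norms = acts[positions].norm(dim=-1).numpy()  # one magnitude per repetition
    n = np.arange(len(norms), dtype=float)
    (a_inf, a0, tau), _ = curve_fit(
        exp_trend, n, norms, p0=[norms[-1], norms[0], 3.0], maxfev=10_000
    )
    change = (norms[-1] - norms[0]) / norms[0]
    print(f"layer {layer:2d}: change {change:+.0%}, tau ~ {tau:.1f} repetitions")
```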
Further experiments teased apart the nuances. When repetitions were spaced with distractor sentences, early habituation collapsed; the context reset seemed to erase the “memory” of redundancy in Layers 10 and 20. But Layer 30 remained robust, continuing to suppress repeated meanings even across varied contexts. This hints at a hierarchical gradient of resilience to interference.
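The spaced condition can be recreated with the same probe by restructuring the prompt; the distractor sentences below are illustrative placeholders rather than the actual stimuli used.

```python
# Sketch: space the repeated sentence out with distractor sentences so that
# occurrences of "dog" no longer appear back-to-back in the context.
# Distractors are illustrative placeholders, not the actual stimuli.
target = "The dog barks every night."
distractors = [
    "The weather turned cold this week.",
    "She finished the report before lunch.",
    "The train arrived a few minutes late.",
]
spaced_prompt = " ".join(
    f"{target} {distractors[i % len(distractors)]}" for i in range(20)
)
```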
And when I tested semantic variation, interleaving phrases like “The dog barks every night” with “That canine is barking nightly”, the results sharpened the story. Layers 10 and 20 still habituated, suppressing the semantic concept of barking dogs even when the surface form shifted. By contrast, Layer 30 no longer habituated at all under semantic variation, revealing that its adaptation was tied to exact syntactic form rather than abstract meaning.
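The semantic-variation condition is the same idea with paraphrases in place of distractors; again, the exact phrasings here are only illustrative.

```python
# Sketch: alternate surface forms that preserve the underlying meaning, to test
# whether suppression tracks the concept or the exact wording.
paraphrases = [
    "The dog barks every night.",
    "That canine is barking nightly.",
]
semantic_prompt = " ".join(paraphrases[i % 2] for i in range(20))
```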
The Biological Parallels
In neuroscience, habituation is characterized by exponential decay curves and sensitivity to stimulus timing and spacing. The transformer layers behaved in the same way. The early layers habituated strongly, but their suppression was fragile to spacing, just as in biological systems where spaced repetition disrupts adaptation. Some layers even showed decay constants close to those measured in neurons.
At the same time, the divergence is equally important. Biological brains don’t have a “prediction head” that flips suppression into amplification at the very end of processing. This is where the analogy breaks down, and where the purpose of the artificial system asserts itself. Transformers aren’t built to conserve metabolic energy; they’re built to optimize token prediction. So the last layers re-weight repetition into a feature, not a bug.
This hybrid architecture, a front end that filters like biology and a back end that amplifies like statistics, suggests transformers are not just statistical parrots, nor are they biological mimics. They are something in between, a composite computational system that borrows principles of adaptation where they’re useful and discards them where prediction demands override.
Why This Matters for AI Safety and Alignment
Habituation might seem like a technical curiosity, but it cuts straight into questions about alignment.
First, it challenges the assumption that transformers are “just” frequency maximizers. If early layers are already suppressing redundant inputs, then models are developing adaptive filters more akin to cognitive systems than raw statistical machines. That should matter to anyone building interpretability frameworks, because it means the computations we want to align are not uniform across layers. Some parts of the network are closer to brains than we thought.
Second, habituation interacts with adversarial robustness. A model that habituates to repeated stimuli could be resistant to prompt attacks that rely on overwhelming the context with repeated patterns. Conversely, the amplification in later layers might make it vulnerable to certain forms of repetition-based manipulation. Understanding this balance is not just academic; it’s directly relevant to how secure these systems are.
Third, the presence of semantic habituation at Layer 20 suggests that transformers are capable of suppressing meaning itself, not just form. This has implications for mesa-optimization. If internal layers can discard repeated conceptual content, then the model may be learning efficiency strategies that go beyond the training objective. That opens the door to emergent objectives and behaviors that are not trivially tied to next-token prediction.
Reflections and Future Questions
For me, this research felt like holding a candle inside a vast cathedral. I could only illuminate a corner: five layers, one token, a handful of experiments. But even that glimpse revealed an intricate architecture of suppression and amplification, and it left me with more questions than answers.
Do larger models habituate more strongly, or do they find entirely different strategies as depth increases? Would attention layers show the same suppression patterns, or are they governed by different dynamics altogether? Could we design architectures that lean into habituation deliberately, using it as a mechanism for efficient context management, rather than discovering it only as an emergent side effect?
On a more speculative note, habituation may represent a hinge point between perception and prediction. In biology, it’s the mechanism that clears space for novelty detection. In transformers, it appears to operate in the same role, up until the point where prediction demands invert it. What would happen if we built models where habituation was not overridden, but preserved all the way through to output? Would they behave more like brains, perhaps better tuned to abstraction and generalization than to token counting?
And finally, what does it mean for alignment if parts of our systems are already converging toward biological principles without us intending them to? Should we take this as encouragement, that intelligence has attractors we can map and understand, or as a warning, that models may adopt cognitive strategies we didn’t design, strategies that complicate alignment in ways we have yet to anticipate?