We've been experimenting with a small inference-time "attractor layer"... basically trying to get a lightweight, stateful influence on generation without touching weights or training.
Early results are interesting: stable perplexity, a small gain on one task, and a fairly spectacular failure mode that I think is worth sharing.
This has some promise, especially if anyone's thinking about dynamic memory during inference.
Motivation
Transformers are great at short-range mechanics, but they don't maintain any persistent state that adapts across forward passes.
We wanted to see if a tiny inference-only update rule (inside the transformer) could act as a dynamic memory signal.
This doesn't involve recurrence, backprop, or architectural surgery.
Method (High-Level)
The layer keeps a small set of "attractor" vectors and measures their similarity to the current attention output. Attractors strengthen when they are activated repeatedly and decay when unused... but they also feed a small signal back into the next forward pass.
It's not recurrence exactly... more like a one-step inference-time nudge.
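A minimal sketch of what such a layer might look like, assuming cosine similarity, a winner-take-all update, and exponential decay. The class name, dimensions, and constants below are all our own guesses for illustration, not the actual implementation:

```python
import numpy as np

# Hypothetical sketch of the update rule described above. State is a
# small bank of attractor vectors updated once per forward pass; no
# gradients, no changes to the frozen model's weights.
class AttractorLayer:
    def __init__(self, n_attractors=8, dim=64, lr=0.1, decay=0.95, strength=0.05):
        rng = np.random.default_rng(0)
        self.A = rng.normal(size=(n_attractors, dim))  # attractor bank
        self.act = np.zeros(n_attractors)              # running activations
        self.lr, self.decay, self.strength = lr, decay, strength

    def __call__(self, h):
        # h: attention output for the current step, shape (dim,)
        sims = self.A @ h / (np.linalg.norm(self.A, axis=1)
                             * np.linalg.norm(h) + 1e-8)  # cosine similarity
        k = int(np.argmax(sims))
        # decay all activations, then strengthen the winner toward h
        self.act = self.decay * self.act
        self.act[k] += sims[k]
        self.A[k] += self.lr * (h - self.A[k])
        # feed a small signal back into the next forward pass
        return h + self.strength * self.act[k] * self.A[k]
```

Repeated activation of the same region pulls an attractor toward it (the `lr` step) while everything else decays, which matches the strengthen/decay behaviour described above.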
Early Observations
On small transformer models we saw:
Some attractors stabilised around recurring conceptual regions, which was encouraging.
A short burn-in phase helped reduce instability.
Unused attractors just drifted into noise.
Sometimes the layer actually made generation worse, not better.
No performance claims here - just behavioural observations worth noting.
Key Results
Perplexity - baseline preserved (roughly 0% change).
Compute overhead - about 6.5%.
The failure case is the interesting bit. Longer generations (around 500 tokens) collapsed badly, with roughly an 80% accuracy drop. The attractors started competing with the actual context, which caused repetition and drift. We're fairly sure this is a real dynamical failure mode rather than a bug... it might map onto theoretical limits on inference-time state.
Adding gating plus a burn-in threshold got us a small +3.3% improvement on a shorter comprehension task.
This was fragile, but it reproduced.
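One plausible reading of "gating plus a burn-in threshold" is a scalar gate that suppresses the feedback signal early in generation and whenever the winning attractor is weak. The function below is a hypothetical sketch; `burn_in`, `threshold`, and `sharpness` are made-up parameter names and values, not the reported settings:

```python
import math

def gate(step: int, activation: float,
         burn_in: int = 32, threshold: float = 0.5,
         sharpness: float = 10.0) -> float:
    """Scalar in [0, 1] that multiplies the attractor feedback signal."""
    if step < burn_in:
        # burn-in: let the attractor bank settle before it can
        # influence generation at all
        return 0.0
    # soft threshold on activation strength (sigmoid gate)
    return 1.0 / (1.0 + math.exp(-sharpness * (activation - threshold)))
```

The idea is that the feedback only kicks in once an attractor has earned enough activation mass, which is one way to reduce the early-instability the burn-in phase was addressing.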
What Failed
Too many attractors caused instability.
Long sequences kept snapping back to earlier attractor patterns.
Heavy decay basically made the whole thing stateless.
No robustness across sequence length at all.
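The decay point can be made concrete with a little arithmetic: if each step multiplies an unreactivated attractor's strength by a decay factor, the surviving trace shrinks geometrically, so aggressive decay really is equivalent to having no state. A toy illustration (our own numbers, not the experiment's settings):

```python
# Toy arithmetic, not the experiment's code: how per-step decay erases
# state. An attractor activated once and never again retains lam**n of
# its strength after n further steps.
def retained_strength(lam: float, steps: int) -> float:
    return lam ** steps

# Gentle decay keeps a usable trace over a 100-token horizon,
# while heavy decay is indistinguishable from statelessness.
print(retained_strength(0.99, 100))  # ~0.37
print(retained_strength(0.50, 100))  # ~8e-31
```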
What This Doesn't Show (yet)
Any generalizable performance boost.
Evidence this scales.
Applicability outside the tested model family.
Anything like proper recurrence or RNN-style working memory.
Related Work
This feels adjacent to a few things...
Fast Weights (Ba et al.) - fast-changing weight matrices, but those update during training or recurrent passes. Ours only updates at inference.
Differentiable Plasticity (Miconi et al.) - they learn the update rule; we use a fixed one.
KV-cache extensions and recurrence work - these reuse activations but don't maintain a persistent attractor-like state across steps. We're specifically focused on single-step inference-time state updates here.
Possible Arguments
"This is just Fast Weights, but worse."
Fast Weights already did state updates, and those were trained or recurrent.
This experimental method is untrained and inference-only; therefore it's "the same but crippled".
Answer: Fast Weights and this experiment answer different questions:
Fast Weights: “What happens when we train a fast-changing memory?”
This experiment: “What happens if we don’t train it, and just graft a stateful lens onto a frozen model?”
"This is a bad Hopfield net / a bad memory transformer."
Answer:
Hopfield nets: stable attractors → trained basins of attraction
Modern memory Transformers: extended KV caches or memory modules
This experiment: attractors without learned stability, so "it collapses" (of course).
Which misses the point: the collapse is the interesting part.
Questions
Is there prior work on inference-time state updates that don't involve learning? (I can't find anything that fits.)
Are there known theoretical limits on attractor-style influence during generation?
When is this approach strictly worse than recurrence or KV-cache extensions?
What minimal benchmark would rule out just overfitting to perplexity?
Does this failure mode with context-attractor competition resemble anything known in dynamical systems?
Code
Repo: https://github.com/HalcyonAIR/Duality
Looking for replication attempts, critique, and pointers to related work.