We experimented with a small inference-time “attractor layer” designed to provide a lightweight, stateful influence on generation without modifying weights or training anything. Early results show stable perplexity, a small task-specific gain, and a very clear failure mode that feels worth sharing — especially for anyone thinking about dynamic memory during inference.
Motivation
Transformers handle short-range dependencies well, but they don’t maintain a persistent state that adapts across multiple forward passes. This experiment explores whether a tiny, inference-only update rule can act as a dynamic memory signal without recurrence, backprop, or architectural changes.
Method (High-Level)
The layer maintains a small set of “attractor” vectors that:
Measure similarity to the current attention output
Strengthen when repeatedly activated
Decay when unused
Inject a small signal back into the next forward pass
This is a one-step inference-time update.
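The update rule described above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions — the function name, hyperparameters (lr, decay, gain), and the winner-take-all choice of which attractor to strengthen are illustrative, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_attractors = 64, 8

attractors = rng.standard_normal((n_attractors, d_model)) * 0.01
strengths = np.zeros(n_attractors)  # activation level of each attractor

def attractor_step(h, attractors, strengths,
                   lr=0.1, decay=0.02, gain=0.05):
    """One inference-time update, given an attention output vector h."""
    # 1. Measure cosine similarity to the current attention output.
    sims = attractors @ h / (np.linalg.norm(attractors, axis=1)
                             * np.linalg.norm(h) + 1e-8)
    # 2. Strengthen the best-matching attractor and pull it toward h.
    k = int(np.argmax(sims))
    strengths[k] += lr * max(sims[k], 0.0)
    attractors[k] += lr * (h - attractors[k])
    # 3. Decay all strengths so unused attractors fade toward zero.
    strengths *= (1.0 - decay)
    # 4. Inject a small strength-weighted signal into the next pass.
    signal = gain * strengths @ attractors
    return h + signal

h = rng.standard_normal(d_model)
h_next = attractor_step(h, attractors, strengths)
```

No weights change and no gradients flow; the only state carried across forward passes is the attractor matrix and its strength vector.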
Early Observations
On small transformer models:
Some attractors stabilized around recurring conceptual regions
A short burn-in phase helped reduce instability
Unused attractors drifted into noise
Occasionally the layer degraded generation quality rather than improving it
No performance claims yet; we are only reporting behavioral signals.
Key Results
Perplexity
Baseline perplexity preserved (≈0% change)
~6.5% compute overhead
Failure Case
On longer generation (~500 tokens), accuracy collapsed by ~80%. Attractors competed with the actual context, causing repetition and drift. This appears to be a genuine dynamical failure mode, not a bug, and may map to theoretical limits on inference-time state.
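A toy two-dimensional iteration (our illustration, not the authors' experiment) shows why a persistent injected signal can "snap" generation back to an attractor: repeated small pulls toward a fixed direction eventually dominate whatever the original context vector was.

```python
import numpy as np

a = np.array([1.0, 0.0])   # a single established attractor direction
h = np.array([0.0, 1.0])   # a context representation orthogonal to it

for _ in range(50):
    h = h + 0.1 * (a - h)  # persistent pull injected at every step

# After many steps h sits essentially on the attractor, with the
# original context direction almost entirely washed out.
```

The geometric decay rate here (0.9 per step) is arbitrary, but the qualitative behavior matches the repetition-and-drift collapse described above.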
Revised Configuration
Adding gating + a burn-in threshold produced a small +3.3% improvement on a short comprehension task. Fragile, but reproducible.
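One plausible reading of "gating + a burn-in threshold" — our assumption, not the authors' exact mechanism — is that an attractor contributes no injected signal until its accumulated strength crosses a threshold, so noisy, barely-activated attractors stay silent:

```python
import numpy as np

def gated_signal(attractors, strengths, burn_in=0.5, gain=0.05):
    """Suppress injection from attractors still below the burn-in level."""
    gate = (strengths > burn_in).astype(float)  # hard burn-in gate
    return gain * (gate * strengths) @ attractors

attractors = np.eye(2)                                    # two toy directions
weak = gated_signal(attractors, np.array([0.1, 0.2]))     # both below burn-in
strong = gated_signal(attractors, np.array([0.1, 0.9]))   # second crosses it
# weak is all zeros; strong injects only the second attractor's direction.
```

A soft gate (e.g. a sigmoid of strength minus threshold) would be the obvious smoother variant.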
What Failed
Too many attractors → instability
Long sequences “snapped back” to earlier attractors
Heavy decay effectively removed the attractor state
No robustness across sequence length
What This Does Not Show
Any generalizable performance boost
Evidence the method scales
Applicability outside the tested model
Anything like recurrence or RNN-style working memory
Related Work (Brief)
This sits adjacent to:
Fast Weights (Ba et al.) — fast-changing matrices, but updated during training or recurrent passes; here, updates occur only at inference.
Differentiable Plasticity (Miconi et al.) — learns the update rule; this work uses a fixed rule.
KV-Cache Extensions / Recurrence — reuse activations but don't maintain a persistent attractor-like state across steps.
This work focuses specifically on single-step, inference-time state updates.
Questions for the Community
Is there prior work on inference-time state updates that don’t involve learning?
Are there known theoretical limits on attractor-style influence during generation?
Under what conditions is this strictly worse than recurrence or KV-cache extensions?
What minimal benchmark suite would rule out “just noise overfitting to perplexity”?
Does this failure mode (context–attractor competition) resemble anything known in dynamical systems literature?
Code & Data
Looking for replication attempts, critique, and pointers to related theoretical work.
Repo: https://github.com/HalcyonAIR/Duality