Epistemic status: negative results from a small-scale (~125M param) architecture experiment. As far as I could find after a deep literature search, the architecture itself is novel, but novelty did not translate into performance here. Sharing because I think negative results like this are useful, especially when the failure mode itself is interesting.
TL;DR
I designed a two-stage attention mechanism with a width bottleneck between the two attention passes. The original hypothesis was that this could give better parameter efficiency than standard transformer attention. In practice, standard GPT-2 consistently outperformed every bottleneck variant I tested by around 0.19–0.31 nats[1] at equal token budget.
But the surprising result was that one variant actually performed better than the baseline during early training. Under shorter compute budgets (under roughly 16M training tokens), the bottleneck model was more efficient. The standard transformer only overtook it later in training.
So overall:
Long training → baseline wins clearly
Short compute budget → one bottleneck variant actually looks competitive or even better
That crossover point ended up being more interesting than the negative result itself.
The Main Idea
Standard transformer attention is basically:
Queries attend to keys
Weighted sum over values
Output projection
Done
Every head operates at full width and attention happens once per block.
I was curious about whether attention could refine itself in multiple stages instead of doing everything in one pass.
So I designed what I called Bottleneck Scout-Refiner Attention.
The idea was:
First attention module ("scout") performs broad/global attention at full width
Output gets compressed into a bottleneck representation
Original input is partially preserved using a learned skip projection
Second attention module ("refiner") performs another attention pass at reduced width
Output gets projected back to full dimension
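The five steps above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the training code: it uses a single head, omits layer norms, residuals and MLPs, and assumes the skip projection is simply added to the compressed scout output (the post does not say how the two are combined). All names (`scout_refiner`, `init_params`, the parameter keys) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply a (n x m) list-of-lists by b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def causal_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention: position i only sees positions <= i."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    out = []
    for i, q_i in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / scale for j in range(i + 1)]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j][c] for j, e in enumerate(exps))
                    for c in range(len(v[0]))])
    return out

def scout_refiner(x, p):
    """Hypothetical two-stage attention with a width bottleneck."""
    scouted = causal_attention(x, p["scout_q"], p["scout_k"], p["scout_v"])  # full width d
    compressed = matmul(scouted, p["down"])                                  # d -> d_b
    skip = matmul(x, p["skip"])                                              # learned skip, d -> d_b
    # Assumption: the skip path is added to the compressed scout output.
    mixed = [[c + s for c, s in zip(c_row, s_row)] for c_row, s_row in zip(compressed, skip)]
    refined = causal_attention(mixed, p["ref_q"], p["ref_k"], p["ref_v"])    # reduced width d_b
    return matmul(refined, p["up"])                                          # d_b -> d

def init_params(d, d_b, seed=0):
    """Small random weights, only to make the sketch runnable."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "scout_q": mat(d, d), "scout_k": mat(d, d), "scout_v": mat(d, d),
        "down": mat(d, d_b), "skip": mat(d, d_b),
        "ref_q": mat(d_b, d_b), "ref_k": mat(d_b, d_b), "ref_v": mat(d_b, d_b),
        "up": mat(d_b, d),
    }

d, d_b, seq_len = 8, 4, 5  # toy sizes; the post compresses 768 -> 384
rng = random.Random(1)
x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(seq_len)]
y = scout_refiner(x, init_params(d, d_b))
print(len(y), len(y[0]))  # sequence length and model width of the output
```

Because both passes are causal and every projection acts per position, changing a later token cannot change the output at earlier positions, which is the property a decoder-only language model needs.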
The intuition was something like:
scout finds globally useful information
refiner integrates/compresses/refines it
bottleneck forces more efficient representation learning
Sort of inspired by coarse-to-fine processing ideas.
I also used 6 larger macro blocks instead of 12 normal GPT-2 blocks to roughly match compute.
After doing a pretty extensive search through prior work (Longformer, BigBird, MLA, Funnel Transformer, Perceiver, Nexus, Reformer etc.), I could not find this exact combination:
sequential causal attention passes
width bottleneck between them
learned skip projection inside the module
Close relatives exist, but not this exact combination.
Unfortunately that still did not make it good.
Variants Tested
I tried a few variants to isolate where the problems were coming from.
1. Standard Scout-Refiner (SR)
Base architecture:
two stage attention
bottleneck
dual parallel MLP branches
2. Span Bottleneck (SB)
Same architecture except:
refiner only attends locally using window size 128
scout still attends globally
This tested whether global refinement was hurting performance.
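The difference between the two attention spans comes down to the mask. The sketch below builds both masks on a toy sequence; it assumes the window counts the current position plus the positions immediately before it, which is one common convention and may not match the exact definition used in these runs. `causal_window_mask` is a hypothetical helper name.

```python
def causal_window_mask(seq_len, window=None):
    """mask[i][j] is True when query position i may attend to key position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i                           # never look at future tokens
            local = window is None or i - j < window  # optionally limit look-back
            row.append(causal and local)
        mask.append(row)
    return mask

scout_mask = causal_window_mask(6)              # global causal attention (scout)
refiner_mask = causal_window_mask(6, window=3)  # local causal attention (SB refiner); the post uses 128

for row in refiner_mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With context length 1024 and window 128, most refiner queries see only a small slice of the positions the scout sees.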
3. Gated Skip (GS)
Instead of:
fixed skip projection
I used:
sigmoid gated skip projection
The idea was letting the model dynamically decide how much original information should bypass the bottleneck.
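A minimal sketch of the two skip paths for a single token vector, assuming the gate is an elementwise sigmoid computed from the same input by its own linear layer. The post does not specify the gate's inputs or shape, so `w_gate`, `b_gate` and the function names are illustrative only.

```python
import math

def linear(x, w, b=None):
    """Apply y = x @ w (+ b) to one token vector x; w is a list of rows (d_in x d_out)."""
    out = [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]
    return out if b is None else [o + bi for o, bi in zip(out, b)]

def fixed_skip(x, w_skip):
    """SR-style skip: the same learned projection for every token."""
    return linear(x, w_skip)

def gated_skip(x, w_skip, w_gate, b_gate):
    """GS-style skip (assumed form): a sigmoid gate in (0, 1) scales each skip channel."""
    gate = [1.0 / (1.0 + math.exp(-z)) for z in linear(x, w_gate, b_gate)]
    return [g * s for g, s in zip(gate, linear(x, w_skip))]

x = [1.0, -2.0, 0.5]                            # toy input, d = 3
w_skip = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # d = 3 -> d_b = 2
w_gate = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # zero weights so only the bias sets the gate

closed = gated_skip(x, w_skip, w_gate, [-20.0, -20.0])   # gate near 0: skip suppressed
opened = gated_skip(x, w_skip, w_gate, [20.0, 20.0])     # gate near 1: behaves like the fixed skip
print(fixed_skip(x, w_skip), closed, opened)
```

With a strongly negative gate the skip contributes almost nothing, and with a strongly positive gate it reduces to the fixed projection, so the fixed skip is a special case the gated version can still learn.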
4. Single MLP (SMLP)
Instead of:
two parallel 4× MLPs averaged together
I used:
one larger 8× MLP
This tested whether the dual MLP structure itself was harmful.
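A quick weight count for the two MLP layouts, assuming each MLP is the usual two-layer design (d → k·d → d) applied at model width 768 and ignoring biases. The width at which the MLPs sit is an assumption; the post does not state it.

```python
def mlp_weights(d_model, expansion):
    """Weight count of a two-layer MLP (d -> expansion*d -> d), biases ignored."""
    hidden = expansion * d_model
    return d_model * hidden + hidden * d_model

d_model = 768  # assumed MLP input/output width

dual = 2 * mlp_weights(d_model, 4)   # two parallel 4x branches (SR)
single = mlp_weights(d_model, 8)     # one 8x MLP (SMLP)
print(dual, single)
```

Under that assumption the two layouts are weight-matched, so the SR versus SMLP comparison is not confounded by MLP size.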
Experimental Setup
Setup was intentionally fairly controlled.
Dataset:
OpenWebText style corpus
~46M tokens
Model scale:
around GPT-2 small scale (~125M params)
Training:
GPT-2 tokenizer
context length 1024
AdamW
lr = 3e-4
Hardware:
V100 for energy comparison runs
Validation:
evaluated every 500 steps
same data split and seed across runs
One thing I learned pretty early: matching token throughput matters a lot.
Initially one variant accidentally had much higher throughput and looked much better than it actually was. Once training conditions were properly matched, that advantage disappeared.
Main Result: Baseline Wins
The baseline GPT-2 beats every bottleneck variant fairly consistently.
The gap stabilizes at roughly 0.22 to 0.27 nats, which is large enough that I do not think it is noise.
The local-window variant (SB) was almost identical to full SR, which was interesting because it suggests the issue is probably not the refiner attention span itself.
Even after multiple architectural fixes:
GS improves over SR
SMLP improves over SR
The baseline still wins.
Loss at 3000 steps (in nats):
BL (baseline) = 4.858
SMLP = 5.045
GS = 5.056
SR/SB ≈ 5.17
So the bottleneck architecture itself seems to be the main limitation.
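To put the step-3000 numbers on a common scale, the snippet below computes each variant's gap to the baseline and the corresponding perplexity ratio (a gap of g nats means e^g times the perplexity). The losses are copied from the list above, with SR/SB taken as the approximate 5.17 quoted there.

```python
import math

# Losses at 3000 steps, as reported above (nats).
loss_at_3000 = {"BL": 4.858, "SMLP": 5.045, "GS": 5.056, "SR/SB": 5.17}

baseline = loss_at_3000["BL"]
gaps = {name: loss - baseline for name, loss in loss_at_3000.items() if name != "BL"}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.3f} nats, perplexity ratio = {math.exp(gap):.2f}x")
```

The gaps run from about 0.19 nats for the best variant (SMLP) to about 0.31 nats for SR/SB, which is the range quoted in the TL;DR.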
The Most Interesting Result: Early Training Crossover
This was the part I did not expect.
At low token budgets, SMLP actually beats the baseline.
For example:
at 8M tokens, SMLP performs noticeably better
baseline only overtakes later around 16M–24M tokens
So:
early training → bottleneck model learns faster
later training → standard transformer scales better
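For anyone who wants to locate a crossover like this in their own logs, here is a small helper that finds where one validation curve first drops below another, using linear interpolation between evaluation points. The function name and the numbers in the example are made up purely to exercise the code; they are not measurements from these runs.

```python
def first_crossover(tokens, loss_variant, loss_baseline):
    """Token count at which the baseline first becomes better (lower loss) than the variant.

    Interpolates linearly between evaluation points; returns None if it never happens.
    """
    for i in range(1, len(tokens)):
        prev = loss_baseline[i - 1] - loss_variant[i - 1]
        curr = loss_baseline[i] - loss_variant[i]
        if prev >= 0 and curr < 0:
            frac = prev / (prev - curr)  # fraction of the interval before the sign flip
            return tokens[i - 1] + frac * (tokens[i] - tokens[i - 1])
    return None

# Toy curves in arbitrary units (illustrative only, not data from the post).
tokens = [1, 2, 3, 4]
variant = [4.0, 3.0, 2.5, 2.4]    # learns fast, then flattens
baseline = [5.0, 3.5, 2.0, 1.0]   # slower start, keeps improving

print(first_crossover(tokens, variant, baseline))
```

Since validation here was logged every 500 steps, any crossover estimate from such curves is only as precise as that evaluation spacing allows.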
Energy efficiency showed the same thing.
SMLP reached moderate loss thresholds using less energy and less wall-clock time than baseline.
But once you try pushing performance further, baseline keeps improving while the bottleneck variants plateau.
So the architecture may actually have some usefulness in:
quick iteration settings
low compute environments
small budget experimentation
possibly edge deployment situations
I do not know if this survives scaling though. It may disappear completely at larger model sizes.
What The Ablations Revealed
The ablations actually helped a lot.
Dual MLP was hurting performance
SMLP clearly beats SR.
My guess is:
both MLP branches receive identical input
both optimize similar functions
averaging them weakens gradients / partially collapses representations
So the dual branch idea was mostly unnecessary complexity.
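One way to probe this guess: if each branch is a plain two-layer MLP with an elementwise activation (an assumption, since the post does not spell out the branch internals), then averaging two 4× branches computes exactly the same function as one 8× MLP whose output weights are halved. The sketch below checks this numerically on toy sizes, using ReLU for simplicity rather than GPT-2's GELU and ignoring biases.

```python
import random

def relu_mlp(x, w_in, w_out):
    """Two-layer MLP on one token vector: ReLU(x @ w_in) @ w_out, no biases."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in zip(*w_in)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w_out)]

rng = random.Random(0)

def mat(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

d, h = 3, 4                                # toy width and per-branch hidden size
w_in_a, w_out_a = mat(d, h), mat(h, d)     # branch A
w_in_b, w_out_b = mat(d, h), mat(h, d)     # branch B
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

# Dual layout: both branches see the same input and their outputs are averaged.
dual = [(a + b) / 2 for a, b in zip(relu_mlp(x, w_in_a, w_out_a),
                                    relu_mlp(x, w_in_b, w_out_b))]

# Equivalent single wide MLP: concatenate hidden units, halve the output weights.
w_in_wide = [row_a + row_b for row_a, row_b in zip(w_in_a, w_in_b)]
w_out_wide = [[w / 2 for w in row] for row in w_out_a + w_out_b]
single = relu_mlp(x, w_in_wide, w_out_wide)

print(max(abs(a - b) for a, b in zip(dual, single)))  # difference is at floating-point rounding level
```

If that assumption holds, SR and SMLP can represent the same functions and differ only in effective output scale at initialization, which would point to an optimization effect rather than a capacity effect.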
Fixed skip projection was also hurting
GS improved noticeably over SR.
The fixed skip was probably too rigid because it injects the same transformed input regardless of context.
Adding a gate lets the model suppress the skip path when needed.
But none of this fixes the core issue
Even the best variant still loses clearly to baseline.
So the real bottleneck is probably: the sequential attention + compressed representation itself.
Why I Think It Failed
A few possible reasons.
1. Gradient flow issues
Standard transformers have extremely clean residual paths.
Here, gradients must propagate through the 768 → 384 → 768 compression before reaching earlier attention computations.
The plateau patterns in training curves support this idea.
2. Sequential attention may simply be harder to optimize
Standard transformer attention:
one strong attention operation
Scout-refiner:
two weaker sequential attention operations
second one depends on noisy intermediate representations early in training
This may simply be less sample efficient.
3. Parameter mismatch
SR had fewer parameters than baseline (~10M fewer).
I do not think this fully explains the result because the gap is fairly stable and large, but I should have run parameter matched experiments earlier.
That is probably the biggest thing I would change.
Things I Learned
Novelty means very little
A design being “novel” is almost meaningless for predicting whether it will work.
Architecture research probably has an enormous graveyard of:
plausible sounding ideas
combinations of reasonable mechanisms
architectures that almost work
Most never get written about.
Cheap ablations should happen earlier
The dual MLP issue could have been discovered with a very short run.
Instead I tested it much later.
Training curves matter more than endpoints sometimes
The plateau pattern was actually more informative than final loss.
Every bottleneck variant had this same stall pattern mid training.
That probably says more about the architecture than the final metrics themselves.
Negative results also have structure
This was not: “architecture completely broken”
Instead it was:
strong early efficiency
later optimization plateau
long-run underperformance
That shape itself might still be useful somewhere else.
Final Thoughts
No further experiments are planned currently.
Overall I would still classify this architecture as a failure relative to standard GPT-2 at equal token budget.
But I do think the early-training crossover is real and interesting enough to be worth sharing.
If someone is optimizing specifically for:
low compute
quick convergence
limited training budgets
then some form of bottleneck attention may still have value.
I also would not be surprised if some version of this idea works better with:
different scaling regimes
better normalization
improved residual routing
larger model sizes
different bottleneck ratios
But at least in this setup, standard transformer attention remained consistently stronger once training continued long enough.
[1] A nat is the natural-log unit of cross-entropy loss, the same thing language model papers report as "loss." A model with cross-entropy L nats has perplexity e^L, so a gap of 0.2 nats means one model has roughly e^0.2 ≈ 1.22× the perplexity of the other (about 22% higher perplexity). For reference: GPT-2 small reports ~3.0 nats on WebText. (1 nat = 1/ln(2) ≈ 1.44 bits, if you prefer bits-per-token.)
Link to WandB report: https://api.wandb.ai/links/padia-so-northeastern-university/fq8n082x