The Scout-Refiner Bottleneck: A Negative Result with a Surprising Twist

sohampadia10@gmail.com

Epistemic status: negative results from a small-scale (~125M parameter) architecture experiment. As far as I could tell from a fairly deep literature search, the architecture itself is novel, but novelty did not translate into performance here. I am sharing this because I think negative results like this are useful, especially when the failure mode itself is interesting.

TL;DR

I designed a two-stage attention mechanism with a width bottleneck between the two attention passes. The original hypothesis was that this could give better parameter efficiency than standard transformer attention. In practice, standard GPT-2 consistently outperformed every bottleneck variant I tested, by around 0.19–0.31 nats [1] at equal token budget.

But the surprising result was that one variant actually performed better than the baseline during early training. Under short compute budgets (under roughly 16M training tokens), the bottleneck model was more efficient. The standard transformer only overtook it later in training.

So overall:

Long training → baseline wins clearly
Short compute budget → one bottleneck variant actually looks competitive or even better

That crossover point ended up being more interesting than the negative result itself.

The Main Idea

Standard transformer attention is basically:

Queries attend to keys
Weighted sum over values
Output projection
Done

Every head operates at full width and attention happens once per block.

I was curious about whether attention could refine itself in multiple stages instead of doing everything in one pass.

So I designed what I called Bottleneck Scout-Refiner Attention.

The idea was:

First attention module ("scout") performs broad/global attention at full width
Output gets compressed into a bottleneck representation
Original input is partially preserved using a learned skip projection
Second attention module ("refiner") performs another attention pass at reduced width
Output gets projected back to full dimension
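
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for the experiments: it assumes a single attention head, that the skip projection is added to the compressed scout output, and it leaves out layer norm, residual connections, and the MLPs. The names `scout_refiner`, `init_params`, and the toy sizes (16 and 8) are hypothetical.

```python
import numpy as np

def causal_attention(x, w_q, w_k, w_v, w_o):
    """Single-head causal self-attention over x with shape (seq_len, width)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    # Mask future positions so token t only attends to tokens <= t.
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ w_o

def init_params(d_model, d_bottleneck, rng):
    """Random weights for one scout-refiner module (illustrative init only)."""
    def mat(rows, cols):
        return rng.normal(0.0, rows ** -0.5, size=(rows, cols))
    return {
        "scout": [mat(d_model, d_model) for _ in range(4)],
        "down": mat(d_model, d_bottleneck),
        "skip": mat(d_model, d_bottleneck),
        "refiner": [mat(d_bottleneck, d_bottleneck) for _ in range(4)],
        "up": mat(d_bottleneck, d_model),
    }

def scout_refiner(x, p):
    scouted = causal_attention(x, *p["scout"])        # 1. scout at full width
    compressed = scouted @ p["down"]                  # 2. compress to bottleneck width
    mixed = compressed + x @ p["skip"]                # 3. learned skip (assumed additive)
    refined = causal_attention(mixed, *p["refiner"])  # 4. refiner at reduced width
    return refined @ p["up"]                          # 5. project back to full width

rng = np.random.default_rng(0)
params = init_params(d_model=16, d_bottleneck=8, rng=rng)
x = rng.normal(size=(5, 16))   # 5 tokens, toy width 16
y = scout_refiner(x, params)
print(y.shape)
```

Both passes use the same causal mask, so the module as a whole stays causal: changing the last token does not change the outputs at earlier positions.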

The intuition was something like:

scout finds globally useful information
refiner integrates/compresses/refines it
bottleneck forces more efficient representation learning

Sort of inspired by coarse-to-fine processing ideas.

I also used 6 larger macro blocks instead of 12 normal GPT-2 blocks to roughly match compute.

After doing a pretty extensive search through prior work (Longformer, BigBird, MLA, Funnel Transformer, Perceiver, Nexus, Reformer etc.), I could not find this exact combination:

sequential causal attention passes
width bottleneck between them
learned skip projection inside the module

Close relatives exist, but not this exact combination.

Unfortunately that still did not make it good.

Variants Tested

I tried a few variants to isolate where the problems were coming from.

1. Standard Scout-Refiner (SR)

Base architecture:

two stage attention
bottleneck
dual parallel MLP branches

2. Span Bottleneck (SB)

Same architecture except:

refiner only attends locally using window size 128
scout still attends globally

This tested whether global refinement was hurting performance.

3. Gated Skip (GS)

Instead of:

fixed skip projection

I used:

sigmoid gated skip projection

The idea was letting the model dynamically decide how much original information should bypass the bottleneck.
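
A minimal sketch of the difference between the two skip paths, assuming the gate is a per-token, per-channel sigmoid computed from the same input as the skip projection. The exact gate parameterization is an assumption, and `w_gate` and `b_gate` are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fixed_skip(x, w_skip):
    """Fixed skip: the same linear projection of the input for every token."""
    return x @ w_skip

def gated_skip(x, w_skip, w_gate, b_gate):
    """Gated skip: a sigmoid gate in (0, 1) scales each channel of the skip path."""
    gate = sigmoid(x @ w_gate + b_gate)
    return gate * (x @ w_skip)

rng = np.random.default_rng(0)
d_model, d_bottleneck = 16, 8            # toy sizes for illustration
x = rng.normal(size=(5, d_model))
w_skip = rng.normal(size=(d_model, d_bottleneck)) / np.sqrt(d_model)
w_gate = rng.normal(size=(d_model, d_bottleneck)) / np.sqrt(d_model)

fixed = fixed_skip(x, w_skip)
gated = gated_skip(x, w_skip, w_gate, np.zeros(d_bottleneck))
open_gate = gated_skip(x, w_skip, w_gate, np.full(d_bottleneck, 30.0))
closed_gate = gated_skip(x, w_skip, w_gate, np.full(d_bottleneck, -30.0))
```

With a strongly positive gate bias the gated path reduces to the fixed skip, and with a strongly negative one it is switched off, so under this assumption the fixed skip is the special case where the gate is always open.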

4. Single MLP (SMLP)

Instead of:

two parallel 4× MLPs averaged together

I used:

one larger 8× MLP

This tested whether the dual MLP structure itself was harmful.

Experimental Setup

Setup was intentionally fairly controlled.

Dataset:

OpenWebText-style corpus
~46M tokens

Model scale:

around GPT-2 small scale (~125M params)

Training:

GPT-2 tokenizer
context length 1024
AdamW
lr = 3e-4

Hardware:

V100 for energy comparison runs

Validation:

evaluated every 500 steps
same data split and seed across runs

One thing I learned pretty early: matching token throughput matters a lot.

Initially one variant accidentally had much higher throughput and looked much better than it actually was. Once training conditions were properly matched, that advantage disappeared.

Main Result: Baseline Wins

The baseline GPT-2 beats every bottleneck variant fairly consistently.

The gap stabilizes at around 0.22 to 0.27 nats, which is large enough that I do not think it is noise.

The local-window variant (SB) was almost identical to full SR, which is interesting because it suggests the issue is probably not the refiner's attention span.

Even after multiple architectural fixes:

GS improves over SR
SMLP improves over SR

The baseline still wins.

Loss at 3000 steps, in nats (lower is better; BL is the GPT-2 baseline):

BL = 4.858
SMLP = 5.045
GS = 5.056
SR/SB ≈ 5.17

So the bottleneck architecture itself seems to be the main limitation.
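
To make the step-3000 numbers easier to compare, here is a small helper that turns them into gaps against the baseline and into perplexity ratios (a gap of g nats corresponds to a perplexity ratio of e^g). The loss values are the ones listed above; the function name `gap_table` is just for illustration.

```python
import math

# Losses at 3000 steps, in nats, as reported above.
loss_at_3000 = {"BL": 4.858, "SMLP": 5.045, "GS": 5.056, "SR/SB": 5.17}

def gap_table(losses, baseline="BL"):
    """Return (name, gap in nats, perplexity ratio) for each non-baseline run."""
    base = losses[baseline]
    rows = []
    for name, loss in losses.items():
        if name == baseline:
            continue
        gap = loss - base
        rows.append((name, round(gap, 3), round(math.exp(gap), 3)))
    return rows

for name, gap, ppl_ratio in gap_table(loss_at_3000):
    print(f"{name}: +{gap} nats, {ppl_ratio}x baseline perplexity")
```

This gives gaps of roughly 0.19, 0.20, and 0.31 nats, which is where the 0.19–0.31 range in the TL;DR comes from, and perplexity ratios of roughly 1.21, 1.22, and 1.37.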

The Most Interesting Result: Early Training Crossover

This was the part I did not expect.

At low token budgets, SMLP actually beats the baseline.

For example:

at 8M tokens, SMLP performs noticeably better
baseline only overtakes later around 16M–24M tokens

So:

early training → bottleneck model learns faster
later training → standard transformer scales better
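
For anyone who wants to locate a crossover like this in their own runs, here is a rough sketch that linearly interpolates between evaluation points. The function name `first_crossover` is hypothetical, and the numbers in the usage lines are made-up placeholders in arbitrary units, not the measured curves from these experiments.

```python
def first_crossover(tokens, loss_leader, loss_chaser):
    """Estimate the token count at which loss_chaser first drops to or below loss_leader.

    tokens, loss_leader and loss_chaser are equal-length sequences sampled at the
    same evaluation points. Returns None if the chaser never catches up.
    """
    for i in range(1, len(tokens)):
        prev_gap = loss_chaser[i - 1] - loss_leader[i - 1]
        curr_gap = loss_chaser[i] - loss_leader[i]
        if prev_gap > 0 and curr_gap <= 0:
            # Linear interpolation between the two evaluation points.
            frac = prev_gap / (prev_gap - curr_gap)
            return tokens[i - 1] + frac * (tokens[i] - tokens[i - 1])
    return None

# Placeholder curves: the "leader" is ahead early, the "chaser" overtakes later.
toy_tokens = [0, 1, 2, 3]
toy_leader = [3.0, 2.0, 1.8, 1.7]
toy_chaser = [4.0, 2.5, 1.6, 1.2]
crossover = first_crossover(toy_tokens, toy_leader, toy_chaser)
print(crossover)
```

With evaluation only every 500 steps, any crossover estimate like this is only as precise as the spacing between evaluation points.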

Energy efficiency showed the same thing.

SMLP reached moderate loss thresholds using less energy and less wall-clock time than baseline.

But once you try pushing performance further, baseline keeps improving while the bottleneck variants plateau.

So the architecture may actually have some usefulness in:

quick iteration settings
low compute environments
small budget experimentation
possibly edge deployment situations

I do not know if this survives scaling though. It may disappear completely at larger model sizes.

What The Ablations Revealed

The ablations actually helped a lot.

Dual MLP was hurting performance

SMLP clearly beats SR.

My guess is:

both MLP branches receive identical input
both optimize similar functions
averaging them weakens gradients / partially collapses representations

So the dual branch idea was mostly unnecessary complexity.
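
One thing worth checking here: if each branch is a plain two-layer MLP with an elementwise activation, both branches see the same input, and there is no dropout or per-branch normalization (all assumptions about the setup), then averaging two 4× MLPs computes exactly the same function class as one 8× MLP whose output projection is scaled by 0.5. The sketch below verifies this numerically with random weights. It does not use the trained models.

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

def mlp(x, w_in, b_in, w_out, b_out):
    return gelu(x @ w_in + b_in) @ w_out + b_out

rng = np.random.default_rng(0)
d = 32  # toy model width

def random_branch():
    """One 4x MLP branch with random weights (illustrative only)."""
    return (
        0.1 * rng.normal(size=(d, 4 * d)),
        0.1 * rng.normal(size=4 * d),
        0.1 * rng.normal(size=(4 * d, d)),
        0.1 * rng.normal(size=d),
    )

branch_a, branch_b = random_branch(), random_branch()
x = rng.normal(size=(10, d))

# Dual MLP: two parallel 4x branches on the same input, averaged.
dual = 0.5 * (mlp(x, *branch_a) + mlp(x, *branch_b))

# Equivalent single 8x MLP: concatenate the hidden units, halve the output path.
w_in = np.concatenate([branch_a[0], branch_b[0]], axis=1)         # (d, 8d)
b_in = np.concatenate([branch_a[1], branch_b[1]])                 # (8d,)
w_out = 0.5 * np.concatenate([branch_a[2], branch_b[2]], axis=0)  # (8d, d)
b_out = 0.5 * (branch_a[3] + branch_b[3])
single = mlp(x, w_in, b_in, w_out, b_out)

print(np.max(np.abs(dual - single)))
```

If those assumptions hold for SR, the dual design adds no capacity over SMLP at the same weight count. The difference would then come from the extra 0.5 on the output path, which changes the effective initialization and gradient scale. That would be consistent with the "averaging weakens gradients" guess.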

Fixed skip projection was also hurting

GS improved noticeably over SR.

The fixed skip was probably too rigid because it injects the same transformed input regardless of context.

Adding a gate lets the model suppress the skip path when needed.

But none of this fixes the core issue

Even the best variant still loses clearly to baseline.

So the real bottleneck is probably the sequential attention plus the compressed representation itself.

Why I Think It Failed

A few possible reasons.

1. Gradient flow issues

Standard transformers have extremely clean residual paths.

Here, gradients must propagate through the 768 → 384 → 768 compression before reaching the earlier attention computations.

The plateau patterns in the training curves are consistent with this idea.
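
A rough way to see what the compression does to gradients: for a purely linear 768 → 384 → 768 path, the Jacobian has rank at most 384, so whatever gradient arrives at the output can only reach the input through a 384-dimensional subspace. The sketch below checks this with random matrices. It ignores the refiner attention, nonlinearities, and any residual connection around the module, so it shows the linear bottleneck only, not the full architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 768, 384

# Random stand-ins for the down and up projections (not trained weights).
w_down = rng.normal(size=(d_model, d_bottleneck)) / np.sqrt(d_model)
w_up = rng.normal(size=(d_bottleneck, d_model)) / np.sqrt(d_bottleneck)

# Jacobian of the linear path x -> x @ w_down @ w_up.
jacobian = w_down @ w_up
rank = np.linalg.matrix_rank(jacobian)
print(rank, "of", d_model)

# Backpropagating a full-rank batch of output gradients through the path:
grad_out = rng.normal(size=(d_model, d_model))
grad_in = grad_out @ jacobian.T
print(np.linalg.matrix_rank(grad_in))
```

A standard residual block adds an identity path around each sublayer, so its Jacobian is not rank-limited in the same way. That contrast is what the "clean residual paths" point refers to.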

2. Sequential attention may simply be harder to optimize

Standard transformer attention:

one strong attention operation

Scout-refiner:

two weaker sequential attention operations
second one depends on noisy intermediate representations early in training

This may simply be less sample efficient.

3. Parameter mismatch

SR had roughly 10M fewer parameters than the baseline.

I do not think this fully explains the result because the gap is fairly stable and large, but I should have run parameter matched experiments earlier.

That is probably the biggest thing I would change.

Things I Learned

Novelty means very little

A design being “novel” is almost meaningless for predicting whether it will work.

Architecture research probably has an enormous graveyard of:

plausible sounding ideas
combinations of reasonable mechanisms
architectures that almost work

Most never get written about.

Cheap ablations should happen earlier

The dual MLP issue could have been discovered with a very short run.

Instead I tested it much later.

Training curves matter more than endpoints sometimes

The plateau pattern was actually more informative than final loss.

Every bottleneck variant had this same stall pattern mid training.

That probably says more about the architecture than the final metrics themselves.

Negative results also have structure

This was not a case of “architecture completely broken”.

Instead it was:

strong early efficiency
later optimization plateau
long-run underperformance

That shape itself might still be useful somewhere else.

Final Thoughts

No further experiments are planned currently.

Overall I would still classify this architecture as a failure relative to standard GPT-2 at equal token budget.

But I do think the early-training crossover is real and interesting enough to be worth sharing.

If someone is optimizing specifically for:

low compute
quick convergence
limited training budgets

then some form of bottleneck attention may still have value.

I also would not be surprised if some version of this idea works better with:

different scaling regimes
better normalization
improved residual routing
larger model sizes
different bottleneck ratios

But at least in this setup, standard transformer attention remained consistently stronger once training continued long enough.

Link to WandB report: https://api.wandb.ai/links/padia-so-northeastern-university/fq8n082x

[1] A nat is the natural-log unit of cross-entropy loss, the same thing language model papers report as "loss." A model with cross-entropy L nats has perplexity e^L, so a gap of 0.2 nats means one model has roughly e^0.2 ≈ 1.22× the perplexity of the other (about 22% higher perplexity). For reference: GPT-2 small reports ~3.0 nats on WebText. (1 nat = 1/ln(2) ≈ 1.44 bits, if you prefer bits-per-token.)