Gradient-free Single-pass Model Beats nanoGPT on Shakespeare

zw5

EntropyBeam is a character-level language model that builds count tables, one per context order, mapping character contexts to next-character frequencies.
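As a rough illustration of what such count tables could look like, here is a minimal Python sketch. The function name `fit_counts`, the nested-dictionary layout, and the example orders `(1, 2)` are assumptions made for illustration; they are not taken from the author's code.

```python
from collections import Counter, defaultdict

def fit_counts(text, orders):
    """Build one count table per context order in a single pass over `text`.

    tables[o][context][next_char] is the number of times `next_char`
    followed the length-o string `context`. (Hypothetical layout.)
    """
    tables = {o: defaultdict(Counter) for o in orders}
    for i, ch in enumerate(text):
        for o in orders:
            if i >= o:  # only count once a full length-o context exists
                tables[o][text[i - o:i]][ch] += 1
    return tables

tables = fit_counts("abracadabra", orders=(1, 2))
print(dict(tables[1]["a"]))   # characters seen after "a"
print(dict(tables[2]["ab"]))  # characters seen after "ab"
```

No gradients are involved: fitting is just counting, which is why a single pass over the training text suffices.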

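At prediction time, each order looks up the current context in its count table and produces a distribution over the vocabulary, smoothed with a symmetric Dirichlet prior:

$p_o(x) = \frac{c_o(x) + \alpha}{\sum_j c_o(j) + \alpha V}$

where $c_o(j)$ is the number of times character $j$ followed the current context at order $o$, $V$ is the vocabulary size, and $\alpha$ is the prior concentration.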
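The Dirichlet-smoothed lookup can be sketched as follows. This is the standard additive-smoothing form; the helper name `smoothed_distribution`, the four-character vocabulary, and the value `alpha=0.5` are illustrative assumptions, since the post does not state the α it uses.

```python
def smoothed_distribution(counts, vocab, alpha):
    """Additive (symmetric Dirichlet) smoothing of next-character counts.

    `counts` maps characters to observed counts for one context;
    characters missing from `counts` are treated as count 0.
    """
    total = sum(counts.get(c, 0) for c in vocab)
    denom = total + alpha * len(vocab)
    return {c: (counts.get(c, 0) + alpha) / denom for c in vocab}

vocab = ["a", "b", "c", "d"]
p = smoothed_distribution({"b": 2, "c": 1, "d": 1}, vocab, alpha=0.5)
p_unseen = smoothed_distribution({}, vocab, alpha=0.5)  # context never observed
print(p)
print(p_unseen)
```

With no observations for a context, the smoothed distribution falls back to uniform, so every order always yields a strictly positive probability for every character.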

Each order receives a capacity score composed of two terms:

Concentration:

$C_o = 1 - \frac{H(p_o)}{\log V}$

where $H(p_o)$ is the Shannon entropy of the smoothed distribution and $V$ is the vocabulary size. This is 1 when all mass is on one token and 0 when the distribution is uniform.
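A short sketch of a concentration score with these two endpoints, written as one minus the entropy divided by its maximum, log V. The normalization is inferred from the stated endpoints (1 for a one-hot distribution, 0 for uniform), and the function name `concentration` is hypothetical.

```python
import math

def concentration(p):
    """1 - H(p) / log(V), where H is Shannon entropy in nats and V = len(p)."""
    entropy = -sum(q * math.log(q) for q in p if q > 0)
    return 1.0 - entropy / math.log(len(p))

one_hot = concentration([1.0, 0.0, 0.0, 0.0])
uniform = concentration([0.25, 0.25, 0.25, 0.25])
peaked = concentration([0.85, 0.05, 0.05, 0.05])
print(one_hot, uniform, peaked)
```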

Reliability:

where n is the total count for the current context. This saturates toward 1 as evidence accumulates and is 0 when the context has not been observed.

The capacity score $s_o$ of each order is the product of its concentration and reliability. The capacity scores are converted to weights via softmax at temperature τ = 0.10:

$w_o = \frac{\exp(s_o / \tau)}{\sum_j \exp(s_j / \tau)}$
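To make the effect of the temperature concrete, here is a minimal sketch of a softmax over capacity scores at τ = 0.10. The three scores are made-up values in [0, 1], and `routing_weights` is a hypothetical name.

```python
import math

def routing_weights(scores, tau=0.10):
    """Softmax of capacity scores at temperature tau (max-shifted for stability)."""
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

scores = [0.10, 0.40, 0.70]          # illustrative capacity scores, one per order
w = routing_weights(scores)           # tau = 0.10, as in the post
w_warm = routing_weights(scores, tau=1.0)
print([round(x, 3) for x in w])
print([round(x, 3) for x in w_warm])
```

In this example a capacity gap of 0.3 is enough for the top order to take about 95% of the weight at τ = 0.10, while at τ = 1.0 the same scores give it less than half.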

The low temperature makes the routing nearly winner-take-all: the highest-capacity order almost always dominates. The final prediction is the weighted geometric mean of the per-order distributions:

$p(x) \propto \prod_o p_o(x)^{w_o}$

This was chosen deliberately to assign high probability to a token only when multiple weighted orders agree.
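The agreement property can be seen in a small sketch of a weighted geometric mean computed in log space. Renormalizing the product so it sums to 1 is an assumption here, and the name `geometric_mix` and the example distributions are invented for illustration.

```python
import math

def geometric_mix(dists, weights):
    """Weighted geometric mean of distributions, renormalized to sum to 1.

    Assumes every probability is strictly positive (true after smoothing).
    """
    vocab_size = len(dists[0])
    log_mix = [
        sum(w * math.log(p[c]) for p, w in zip(dists, weights))
        for c in range(vocab_size)
    ]
    m = max(log_mix)
    unnorm = [math.exp(x - m) for x in log_mix]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Two orders that disagree about the most likely token.
disagree = geometric_mix([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]], [0.5, 0.5])
# Two orders that agree.
agree = geometric_mix([[0.7, 0.2, 0.1], [0.7, 0.2, 0.1]], [0.5, 0.5])
print([round(x, 3) for x in disagree])
print([round(x, 3) for x in agree])
```

When the two orders disagree, each contested token ends up below the 0.4 an arithmetic average would give it; when they agree, the mixture reproduces the shared distribution.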

The model has four hyperparameters: the set of context orders, α, τ, and the reliability threshold (min_count = 1). These were selected by evaluating variants on the validation set.

Results

Evaluation uses the nanoGPT shakespeare_char benchmark: character-level Shakespeare, about 1M training tokens, about 100K validation tokens, and a vocabulary size of 65.

EntropyBeam

EntropyBeam uses 0 trainable parameters, a single fit pass, and character-level input.

Training tokens	Validation loss, nats	Contexts stored	Transitions stored
1,000	2.954	5,495	6,388
3,000	2.654	14,670	17,176
10,000	2.482	44,092	51,835
30,000	2.289	120,043	140,961
100,000	2.193	346,462	405,119
300,000	1.990	919,897	1,071,750
1,003,854	1.596	2,753,581	3,199,496

nanoGPT

nanoGPT uses 60,192 parameters, 2 layers, n_embd=48, n_head=4, block_size=32, batch_size=16, and AdamW with lr=1e-3, wd=0.01.

Step	Tokens seen	Validation loss, nats
0	0	4.189
300	153,600	2.507
600	307,200	2.409
1,200	614,400	2.262
1,800	921,600	2.162
2,400	1,228,800	2.096
3,000	1,536,000	2.065

Compute

Metric	EntropyBeam	nanoGPT	Ratio
Fit/train FLOPs	0.009 G	614 G	68,000x
FLOPs per prediction	4,500	133,000	30x
Total FLOPs to result	~0.5 G	~760 G	~1,500x
Validation loss, nats	1.596	2.065
Trainable parameters	0	60,192
Wall time	12s	26s
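
The ratio column can be re-derived from the two FLOP columns. The snippet below only repeats that arithmetic on the figures in the table; the dictionary keys are labels chosen here.

```python
# (EntropyBeam, nanoGPT) FLOP figures copied from the table above.
rows = {
    "fit/train FLOPs": (0.009e9, 614e9),
    "FLOPs per prediction": (4_500, 133_000),
    "total FLOPs to result": (0.5e9, 760e9),
}

ratios = {name: gpt / beam for name, (beam, gpt) in rows.items()}
for name, r in ratios.items():
    print(f"{name}: {r:,.0f}x")
```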

Scaling Behavior

Per-decade improvement in validation loss.

Range	Change in loss, nats
1K to 10K	-0.47
10K to 100K	-0.29
100K to 1M	-0.60
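
These per-decade figures follow directly from the EntropyBeam results table; a quick check, using the 1,003,854-token row as the 1M endpoint:

```python
# Validation loss (nats) by training-set size, from the EntropyBeam table.
loss = {1_000: 2.954, 10_000: 2.482, 100_000: 2.193, 1_003_854: 1.596}

sizes = sorted(loss)
deltas = [round(loss[b] - loss[a], 2) for a, b in zip(sizes, sizes[1:])]
print(deltas)
```

The last decade shows the largest drop of the three, so the loss curve has not flattened at the full training set.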

Limitations

Storage is not directly comparable to a transformer's parameter count. At the full training size, EntropyBeam stores about 2.75M contexts and 3.2M transitions, compared to 60,192 learned parameters for the transformer. Either way, on this corpus the fixed combination rule reaches a lower validation cross-entropy (1.596 nats) than the gradient-trained baseline (2.065 nats).

The model was not compared against a wide range of transformer baselines. In limited testing, it achieved similar next-token prediction accuracy on larger datasets.

Code

The code is available at https://github.com/zw5/beam