Beam is a character-level language model that builds a count table for each context order, mapping character contexts of that length to next-character frequencies.
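As a rough illustration of the count tables described above, the sketch below builds one table per context order in a single pass. The function name `fit_counts` and the order set `(1, 2, 3)` are assumptions made for this example; the set of orders is a hyperparameter whose value is not stated here.

```python
from collections import Counter, defaultdict

def fit_counts(text, orders=(1, 2, 3)):
    """Build one count table per context order in a single pass.

    `orders` is a hypothetical choice for illustration only.
    tables[o][context][next_char] holds how often next_char followed
    the length-o context in the training text.
    """
    tables = {o: defaultdict(Counter) for o in orders}
    for i, ch in enumerate(text):
        for o in orders:
            if i >= o:  # need o preceding characters to form a context
                tables[o][text[i - o:i]][ch] += 1
    return tables

tables = fit_counts("abracadabra")
print(dict(tables[1]["a"]))   # next-character counts after the context "a"
print(dict(tables[2]["ab"]))  # next-character counts after the context "ab"
```

Each distinct context is one stored entry and each distinct (context, next character) pair is one stored transition, which is how the storage figures in the results can be read.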
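At prediction time, each order looks up the current context in its count table and produces a distribution over the vocabulary, smoothed with a symmetric Dirichlet prior with parameter α:
$$p_o(x) = \frac{c_o(x) + \alpha}{n + \alpha V}$$
where cₒ(x) is the count of token x after the current context at order o, n is the total count for that context, and V is the vocabulary size.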
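The smoothing step can be sketched as follows, assuming the standard additive form (count + α) / (n + αV), which is the posterior mean under a symmetric Dirichlet prior. The helper name `smoothed_distribution` and the default `alpha=0.1` are placeholders for this example, not values taken from the project.

```python
def smoothed_distribution(counts, vocab, alpha=0.1):
    """Additive (symmetric Dirichlet) smoothing of one context's counts.

    counts: mapping next_char -> count for the current context
    vocab:  iterable of all characters in the vocabulary
    alpha:  hypothetical prior strength; the real value is a hyperparameter
    """
    vocab = list(vocab)
    n = sum(counts.get(ch, 0) for ch in vocab)
    denom = n + alpha * len(vocab)
    return {ch: (counts.get(ch, 0) + alpha) / denom for ch in vocab}

p = smoothed_distribution({"a": 3, "b": 1}, "abc", alpha=1.0)
unseen = smoothed_distribution({}, "abc", alpha=1.0)
```

With no observations for a context, the result falls back to the uniform distribution, so every token keeps non-zero probability.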
Each order receives a capacity score built from two terms:
Concentration:
$$\mathrm{concentration}_o = 1 - \frac{H(p_o)}{\log V}$$
where H(pₒ) is the Shannon entropy of the smoothed distribution and V is the vocabulary size. This is 1 when all mass is on one token and 0 when the distribution is uniform.
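A minimal sketch of a concentration term with the stated endpoints, assuming it is one minus the entropy normalized by log V. The normalization is inferred from the description (1 for a one-hot distribution, 0 for a uniform one), and the function name is illustrative.

```python
import math

def concentration(p):
    """Return 1 - H(p) / log(V) for a probability vector p of length V.

    Assumed form: any function with the same endpoints would also fit
    the description in the text.
    """
    v = len(p)
    entropy = -sum(q * math.log(q) for q in p if q > 0)  # Shannon entropy in nats
    return 1.0 - entropy / math.log(v)

uniform_score = concentration([0.25] * 4)
one_hot_score = concentration([1.0, 0.0, 0.0, 0.0])
peaked_score = concentration([0.7, 0.1, 0.1, 0.1])
```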
Reliability:
where n is the total count for the current context. This saturates toward 1 as evidence accumulates and is 0 when the context has not been observed.
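The capacity score of each order is the product of its concentration and reliability. The capacity scores are converted to weights via softmax at temperature τ = 0.10:
$$w_o = \frac{\exp(s_o / \tau)}{\sum_j \exp(s_j / \tau)}$$
where sₒ is the capacity score of order o and the sum runs over all orders j.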
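To see how sharply τ = 0.10 concentrates the weights, here is a small sketch of a temperature softmax. The three capacity scores are made-up inputs for illustration.

```python
import math

def softmax_weights(scores, tau=0.10):
    """Softmax over capacity scores at temperature tau."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical capacity scores for three orders.
w = softmax_weights([0.2, 0.5, 0.9])
print([round(x, 4) for x in w])
```

With these inputs the top-scoring order takes roughly 98% of the weight, even though its score is only 0.4 above the runner-up.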
The low temperature makes the routing nearly winner-take-all: the highest-capacity order almost always dominates. The final prediction is the weighted geometric mean of the per-order distributions:
$$p(x) \propto \prod_o p_o(x)^{w_o}$$
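The combination step can be sketched in log space as below. Renormalizing the result over the vocabulary is an assumption of this sketch, and the function name and example distributions are illustrative.

```python
import math

def geometric_mixture(dists, weights):
    """Weighted geometric mean of distributions, renormalized (assumed).

    dists:   list of dicts mapping token -> probability (all > 0,
             which smoothing with alpha > 0 guarantees)
    weights: one non-negative weight per distribution
    """
    vocab = list(dists[0].keys())
    log_p = {
        ch: sum(w * math.log(d[ch]) for d, w in zip(dists, weights))
        for ch in vocab
    }
    m = max(log_p.values())  # stabilise the exponentials
    unnorm = {ch: math.exp(lp - m) for ch, lp in log_p.items()}
    z = sum(unnorm.values())
    return {ch: u / z for ch, u in unnorm.items()}

confident = {"a": 0.9, "b": 0.1}
undecided = {"a": 0.5, "b": 0.5}
mix = geometric_mixture([confident, undecided], [0.5, 0.5])
```

Because the weights multiply log-probabilities, a low probability from any heavily weighted order pulls the combined probability down, which is the agreement property described in the text.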
This was chosen deliberately to assign high probability to a token only when multiple weighted orders agree.
The model has four hyperparameters: the set of context orders, α, τ, and the reliability threshold (min_count = 1). These were selected by evaluating variants on the validation set.
Results
Evaluation uses the nanoGPT shakespeare_char benchmark: character-level Shakespeare, about 1M training tokens, about 100K validation tokens, and a vocabulary size of 65.
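For orientation, the sketch below assumes validation loss is the mean negative log-likelihood per character in nats, as is standard for this benchmark. Under that assumption, a uniform guess over 65 characters scores ln 65, which is close to the nanoGPT loss at step 0 reported in the results.

```python
import math

def mean_nll(probs_of_true_tokens):
    """Mean negative log-likelihood in nats, given p(true token) per position."""
    return -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)

# A model that always predicts the uniform distribution over 65 characters.
uniform_loss = mean_nll([1 / 65] * 1000)
print(round(uniform_loss, 3))
```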
EntropyBeam
EntropyBeam uses 0 trainable parameters, a single fit pass, and character-level input.
| Training tokens | Validation loss, nats | Contexts stored | Transitions stored |
|---|---|---|---|
| 1,000 | 2.954 | 5,495 | 6,388 |
| 3,000 | 2.654 | 14,670 | 17,176 |
| 10,000 | 2.482 | 44,092 | 51,835 |
| 30,000 | 2.289 | 120,043 | 140,961 |
| 100,000 | 2.193 | 346,462 | 405,119 |
| 300,000 | 1.990 | 919,897 | 1,071,750 |
| 1,003,854 | 1.596 | 2,753,581 | 3,199,496 |
nanoGPT
nanoGPT uses 60,192 parameters, 2 layers, n_embd=48, n_head=4, block_size=32, batch_size=16, and AdamW with lr=1e-3, wd=0.01.
| Step | Tokens seen | Validation loss, nats |
|---|---|---|
| 0 | 0 | 4.189 |
| 300 | 153,600 | 2.507 |
| 600 | 307,200 | 2.409 |
| 1,200 | 614,400 | 2.262 |
| 1,800 | 921,600 | 2.162 |
| 2,400 | 1,228,800 | 2.096 |
| 3,000 | 1,536,000 | 2.065 |
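The "Tokens seen" column is consistent with step × batch_size × block_size under the configuration listed above, as this short check shows:

```python
batch_size, block_size = 16, 32  # from the nanoGPT configuration above
steps = [0, 300, 600, 1200, 1800, 2400, 3000]

# Each optimizer step consumes batch_size sequences of block_size tokens.
tokens_seen = [s * batch_size * block_size for s in steps]
print(tokens_seen)
```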
Compute
| Metric | EntropyBeam | nanoGPT | Ratio |
|---|---|---|---|
| Fit/train FLOPs | 0.009 G | 614 G | 68,000x |
| FLOPs per prediction | 4,500 | 133,000 | 30x |
| Total FLOPs to result | ~0.5 G | ~760 G | ~1,500x |
| Validation loss, nats | 1.596 | 2.065 | |
| Trainable parameters | 0 | 60,192 | |
| Wall time | 12s | 26s | |
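The Ratio column follows directly from the two FLOP columns. The arithmetic below uses only the figures in the table and shows the unrounded values:

```python
# FLOP figures copied from the compute table (G = 1e9 FLOPs).
fit_ratio = 614 / 0.009        # fit/train FLOPs
pred_ratio = 133_000 / 4_500   # FLOPs per prediction
total_ratio = 760 / 0.5        # total FLOPs to result (both approximate)

print(round(fit_ratio), round(pred_ratio, 1), round(total_ratio))
```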
Scaling Behavior
Per-decade improvement in validation loss.
| Range | Change in loss, nats |
|---|---|
| 1K to 10K | -0.47 |
| 10K to 100K | -0.29 |
| 100K to 1M | -0.60 |
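These per-decade changes can be reproduced from the EntropyBeam results table by differencing the validation loss at roughly 10x steps in training size. The final row uses the full 1,003,854-token training set as the "1M" point.

```python
# Validation loss (nats) at each training size, from the EntropyBeam table.
loss = {1_000: 2.954, 10_000: 2.482, 100_000: 2.193, 1_003_854: 1.596}

sizes = sorted(loss)
deltas = [round(loss[b] - loss[a], 2) for a, b in zip(sizes, sizes[1:])]
print(deltas)
```

The last decade shows the largest drop of the three, so the loss curve is not yet flattening at the full dataset size.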
Limitations
Storage is not directly comparable to a transformer's parameter count. EntropyBeam stores about 2.75M contexts and 3.2M transitions, compared to about 60K learned floats for the transformer. Even so, on this corpus the fixed combination rule reaches a lower validation cross-entropy than the trained nanoGPT baseline (1.596 vs. 2.065 nats).
The model has not been compared against a wide range of transformer baselines. In limited testing on larger datasets, it achieved similar next-token prediction accuracy.
Code
The code is available at https://github.com/zw5/beam