I built a feedback controller that adjusts LLM sampling parameters in real time based on token-level entropy, and ran it against MATH. I want to share early results and get feedback. Entropix (xjdr, 2024) demonstrated that entropy and varentropy are actionable signals during generation, using them to switch between discrete sampling strategies. This work closes the loop, replacing open-loop rule switching with continuous feedback control. In short, my best controller-based setup gained about 3 percentage points of accuracy over a standard uncontrolled LLM (63.5% vs. 60.0%). Since the controller weights were untuned, this suggests significant room for improvement.
The basic idea is to take the Shannon entropy of the top-K logits of the LLM's output at each step, then use a PID or 4th-order controller to drive that entropy toward a target value; the error signal is the difference between the measured and target entropy. There is evidence that entropy, specifically semantic entropy, is a good indicator of hallucinations in LLM output (Farquhar et al., 2024, Nature).
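As a concrete illustration, top-K entropy can be computed from raw logits as below. This is my own minimal sketch, not code from this project, and the K value shown is a placeholder, since the post's actual K did not survive formatting.

```python
import numpy as np

def topk_entropy(logits: np.ndarray, k: int = 40) -> float:
    """Shannon entropy (in nats) of the softmax over the top-k logits.

    k=40 is a placeholder; the original post's K value is not shown.
    """
    top = np.sort(logits)[-k:]                    # keep the k largest logits
    top = top - top.max()                         # stabilize the softmax
    p = np.exp(top) / np.exp(top).sum()           # renormalized top-k distribution
    return float(-(p * np.log(p + 1e-12)).sum())  # Shannon entropy

# A peaked distribution has low entropy; a flat one is near log(k).
peaked = np.array([10.0] + [0.0] * 99)
flat = np.zeros(100)
print(topk_entropy(peaked), topk_entropy(flat))
```

A controller then compares this per-token value against the target to form its error signal.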
So far, I've been able to generate some preliminary results on my RTX 3070, using Qwen3.5:2B at 4-bit quantization, running on CUDA and llama.cpp. This gives me a comfortable 157 tok/s, which lets me test against 200 problems from the MATH dataset in around an hour. I give the model a budget of 4096 tokens per problem.
Below are preliminary results.
# 2B Entropy Controller Sweep Results
## Sweep 1: H_target=0.3 (T+M control)

Gain vectors are [proportional, integral, derivative, 4th-order]; the PID runs zero the 4th-order term.

| Run | K_T | K_M |
|-----|-----|-----|
| Baseline (no controller, T=0.7, M=0.05) | - | - |
| PID | [0.02, 0.10, 0.15, 0.00] | [0.005, 0.03, 0.04, 0.00] |
| 4th order | [0.02, 0.10, 0.15, 0.25] | [0.005, 0.03, 0.04, 0.08] |
| 4th order, x2 accel | [0.02, 0.10, 0.15, 0.50] | [0.005, 0.03, 0.04, 0.16] |

| Run | N | Correct | Accuracy | Avg tok/s | Avg H | Avg tokens | Control |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4th order, x2 accel | 200 | 100/200 | 50.0% | 156.0 | 0.3 | 2361.1 | YES |
| 4th order | 200 | 104/200 | 52.0% | 156.5 | 0.3 | 2371.3 | YES |
| PID | 200 | 109/200 | 54.5% | 156.7 | 0.3 | 2213.4 | YES |
| Baseline (no controller, T=0.7, M=0.05) | 200 | 120/200 | 60.0% | 157.1 | N/A | 2315.3 | NO |
## Sweep 2: H_target=0.1 (T+M control)

| Run | K_T | K_M |
|-----|-----|-----|
| PID | [0.02, 0.10, 0.15, 0.00] | [0.005, 0.03, 0.04, 0.00] |
| 4th order | [0.02, 0.10, 0.15, 0.25] | [0.005, 0.03, 0.04, 0.08] |
| 4th order, x2 accel | [0.02, 0.10, 0.15, 0.50] | [0.005, 0.03, 0.04, 0.16] |

| Run | N | Correct | Accuracy | Avg tok/s | Avg H | Avg tokens | Control |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4th order | 200 | 109/200 | 54.5% | 157.2 | 0.2 | 2086.2 | YES |
| PID | 200 | 115/200 | 57.5% | 156.0 | 0.2 | 2087.5 | YES |
| 4th order, x2 accel | 200 | 116/200 | 58.0% | 156.1 | 0.2 | 2005.8 | YES |
## Sweep 3: H_target=0.1 (min-P only, T fixed=0.7)

| Run | K_T | K_M |
|-----|-----|-----|
| Min-P only, PID | [0, 0, 0, 0] | [0.005, 0.03, 0.04, 0.00] |
| Min-P only, 4th order | [0, 0, 0, 0] | [0.005, 0.03, 0.04, 0.08] |
| Min-P only, 4th order, x2 accel | [0, 0, 0, 0] | [0.005, 0.03, 0.04, 0.16] |

| Run | N | Correct | Accuracy | Avg tok/s | Avg H | Avg tokens | Control |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min-P only, 4th order, x2 accel | 200 | 106/200 | 53.0% | 156.1 | 0.2 | 2149.2 | YES |
| Min-P only, PID | 200 | 121/200 | 60.5% | 156.5 | 0.2 | 2082.6 | YES |
| Min-P only, 4th order | 200 | 127/200 | 63.5% | 156.6 | 0.3 | 2017.7 | YES |
The error is passed through a sigmoidal dampener (tanh) to smooth the signal. The controller targets min-p (which truncates tokens whose probability falls below a given threshold) and temperature, nudging both at each token step. (Note: controlling temperature creates a feedback loop with entropy, since temperature directly rescales the logits; in practice the temperature channel should be discarded. This can be seen in the results above.)
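A minimal sketch of one plausible update law, assuming a discrete PID with an added second-difference ("4th order") term; the class name and clamp range are my own, the original implementation's exact formula may differ, and the gains below come from the sweep tables.

```python
import math

class EntropyController:
    """PID-style controller with an optional 4th-order (second-difference) term.

    A hypothetical sketch of the setup described above, not the project's
    actual code. Gains follow [proportional, integral, derivative, 4th-order].
    """

    def __init__(self, gains, h_target=0.1, lo=0.01, hi=0.5):
        self.kp, self.ki, self.kd, self.kj = gains
        self.h_target = h_target
        self.lo, self.hi = lo, hi   # clamp range for the controlled knob
        self.integral = 0.0
        self.prev_e = 0.0
        self.prev_de = 0.0

    def update(self, h: float, knob: float) -> float:
        e = math.tanh(h - self.h_target)  # tanh dampener smooths the error
        self.integral += e
        de = e - self.prev_e              # first difference (derivative term)
        d2e = de - self.prev_de           # second difference (4th-order term)
        self.prev_e, self.prev_de = e, de
        u = self.kp * e + self.ki * self.integral + self.kd * de + self.kj * d2e
        return min(self.hi, max(self.lo, knob + u))

# Example: min-p controller with the PID gains from Sweep 2.
ctrl = EntropyController([0.005, 0.03, 0.04, 0.00], h_target=0.1)
min_p = 0.05
for _ in range(10):
    min_p = ctrl.update(1.0, min_p)  # entropy above target -> min-p rises
print(min_p)
```

Positive gains mean that when entropy sits above target, min-p climbs and prunes more of the tail, pushing entropy back down.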
There are some limitations to my current approach.
1. I am using predictive entropy instead of semantic entropy: less precise, but much cheaper to compute.
2. The sample size is small (200 problems).
3. I am only working with two control 'knobs', temperature and min-p. In the future I would like to expand this to cover more of the sampling parameters llama.cpp exposes.
4. I am tuning the controller's weights by hand, which is slow and tedious. I am looking into ways to automate this process.
5. I am only testing against a single dataset: MATH.
6. Using temperature as a control 'knob' is problematic, since temperature directly affects the logit probabilities and thus the very entropy being measured.
In the future I would like to test a dynamic target entropy: run a smaller model as the draft in a speculative-decoding setup, compute its entropy, and use that as the target for the primary model. My intended models for this are in the Qwen3.5 family, most likely the 9B model as the primary and the 2B model as the draft. I would also like to experiment with signal channels other than predictive entropy. So far I'm thinking of looking at the divergence between successive logit distributions as a kind of 'consistency metric' and hooking that up to its own controller.
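The proposed consistency metric could be prototyped as a KL divergence between successive next-token distributions; large step-to-step jumps might flag unstable generation. The function below is a hypothetical sketch, not code from this project.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())  # subtract max for numerical stability
    return z / z.sum()

def step_kl(prev_logits: np.ndarray, curr_logits: np.ndarray) -> float:
    """KL(prev || curr) between consecutive next-token distributions.

    A sketch of the proposed 'consistency metric'; near zero when the
    model's distribution barely moves between steps, large when it jumps.
    """
    p, q = softmax(prev_logits), softmax(curr_logits)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
```

This value could feed a second controller the same way the entropy error feeds the first.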
Further, I would like to improve the error signal by borrowing the Quantum Networks of Assets framework from econophysics (Gong, Sedai, Medda 2025, arXiv:2511.21515). The plan is to treat a rolling window of logits as a feature space and derive a von Neumann entropy from it, yielding a quantum-style early-warning signal that could indicate the onset of hallucinations. Adding this as another channel of the signal may lead to increased accuracy. On top of this, all the computation needed for this signal is GPU-native and should be implementable at near-zero cost.
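One plausible way to get a von Neumann entropy from a rolling window of token distributions is sketched below. The density-matrix construction (a normalized Gram matrix over the window) is my own guess at how this could work; the cited paper's asset-network construction may differ.

```python
import numpy as np

def von_neumann_entropy(window_probs: np.ndarray) -> float:
    """Von Neumann entropy of a density matrix built from a rolling window.

    window_probs is a (w, v) array of w recent next-token probability
    vectors over a vocabulary of size v. We form rho from the normalized
    Gram matrix of the window: identical rows give a rank-1 rho (entropy
    near 0), while decorrelated rows spread the spectrum (entropy up to
    log w). A spike in this value could serve as an early-warning signal.
    """
    g = window_probs @ window_probs.T  # (w, w) Gram matrix of the window
    rho = g / np.trace(g)              # normalize to unit trace
    evals = np.linalg.eigvalsh(rho)    # rho is symmetric PSD
    evals = evals[evals > 1e-12]       # drop numerical zeros
    return float(-(evals * np.log(evals)).sum())
```

Since this is just a small matrix product and eigendecomposition per step, it could plausibly run on-GPU alongside sampling.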
Right now my bottlenecks are VRAM and processing speed, which prevent me from:
A: testing with better quantizations,
B: testing with more capable models,
C: running the dynamic target entropy setup. This would hypothetically let the LLM handle a wider range of tasks, with naturally high/low-entropy tasks getting correspondingly high/low entropy targets (think writing poems vs. writing code),
D: running experiments at an efficient scale, targeting >100 tok/s.
If you would like to assist, I have a Manifund project going where I'm asking for funds to cover an RTX 3090 (which would triple my VRAM and double my memory bandwidth) and any unforeseen expenses. This would let me run larger experiments faster and on higher-quality models, permitting testing against more complex datasets.
https://github.com/orthogonaltohumanity/Cybernetic_Entropy_Control
https://manifund.org/projects/cybernetic-entropy-controller
Thanks for reading!