Disclosure: This post is based on my own research and thinking. I used ChatGPT (GPT-5.1 Thinking) as a writing assistant for wording and editing, and I have reviewed and endorse all claims.
Modern LLM systems increasingly use test-time controllers (meta-policies) to decide how much to think, whether to call tools, or whether to upgrade to a larger model. Recent work on "adaptive test-time compute" treats these as policies over compute, but almost always evaluates them on static benchmarks under fixed data distributions (Alomrani et al., 2025; Kidd, 2025; Arike, 2025; Arike et al., 2025).
In deployment, controllers sit inside a training–deployment feedback loop: they change user and adversarial behaviour, which changes the logs we train on, which then change the controller itself. That loop admits a natural failure mode:
Meta-policy overfitting in feedback loops:
updates that make a controller look strictly better on yesterday's offline logs while systematically making tomorrow’s deployment distribution more exploitable and more dangerous.
This post goes beyond prior work on adaptive test-time compute on fixed benchmarks: it treats the controller as a learning system inside the training–deployment loop and names the resulting failure mode "meta-policy overfitting in feedback loops".
Key recommendation: treat high-stakes deployment slices as first-class, with diagnostics, compute floors, and rollback rules, rather than optimising only average offline reward on past logs.
This section explains why test-time controllers matter in the inference-scaling era and why static-benchmark evaluations are no longer enough.
At a meta level, this post is conceptual rather than empirical: it proposes a formal lens and a named failure mode for test-time controllers in feedback loops, plus diagnostics and design levers that labs could implement with existing infra.
With the rise of o1/o3-style models, search-augmented systems, and latent-reasoning architectures (Arike, 2025; Arike, RohanS, & Biswas, 2025), test-time controllers have become first-class citizens. They decide, on each query, how much to think, whether to call tools, and whether to escalate to a larger model.
Most current work evaluates such controllers on static benchmarks: fix a dataset, sweep a compute budget, compare trade-offs like "accuracy vs tokens" or "pass@k vs latency". That is necessary, but not sufficient.
In the real world, controllers live inside a closed loop: their compute decisions shape user and adversarial behaviour, which shapes the logs we collect, which shape the next version of the controller.
Feedback-loop and meta-learning papers show that this kind of two-level system is especially prone to overfitting and Goodharting (Gama et al., 2014; Lu et al., 2019).
The core claim of this post is:
Once you treat test-time controllers as learning systems inside this loop, there is a structurally natural way for them to Goodhart: they can learn to look better on yesterday's logs while pushing tomorrow's distribution into regions where they are weak.
This section gives a minimal formal model of the training–deployment feedback loop for test-time controllers.
We fix a base model and focus on a meta-policy and a deployment distribution that co-evolve as we retrain on logs.
Components:
- A fixed base model.
- A meta-policy (controller) $\pi_t$ that decides, per query, how much compute to spend and which tools or model sizes to invoke.
- A deployment distribution $D_t$ over queries, shaped by how users and adversaries react to the deployed system.
- Logs $L_t$ collected by running $\pi_t$ on traffic drawn from $D_t$.
- A scalar offline objective $J(\pi; L_t)$ used to tune the controller.

The closed-loop dynamics is:
$$\pi_{t+1} = \arg\max_{\pi} J(\pi; L_t), \qquad D_{t+1} = T(D_t, \pi_{t+1}), \qquad L_{t+1} \sim \mathrm{deploy}(\pi_{t+1}, D_{t+1}),$$
where $T$ captures how users and adversaries adapt to the newly deployed controller.
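To make the loop concrete, here is a minimal structural sketch in Python. The function names and types are placeholders of mine, not any lab's actual infrastructure; the point is only the order in which the updates happen.

```python
from typing import Any, Callable, Tuple

# Placeholder types: a controller (meta-policy), a query distribution, and a batch of logs.
Controller = Any
Distribution = Any
Logs = Any


def run_round(
    controller: Controller,
    distribution: Distribution,
    logs: Logs,
    tune: Callable[[Controller, Logs], Controller],             # pi_{t+1} = argmax_pi J(pi; L_t)
    shift: Callable[[Distribution, Controller], Distribution],  # D_{t+1} = T(D_t, pi_{t+1})
    deploy: Callable[[Controller, Distribution], Logs],         # L_{t+1} from deploying pi_{t+1} on D_{t+1}
) -> Tuple[Controller, Distribution, Logs]:
    """One round of the training-deployment feedback loop for a test-time controller."""
    new_controller = tune(controller, logs)                  # retrain on yesterday's logs
    new_distribution = shift(distribution, new_controller)   # users and adversaries adapt
    new_logs = deploy(new_controller, new_distribution)      # collect tomorrow's logs
    return new_controller, new_distribution, new_logs
```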
This section defines slice-level meta-policy overfitting and explains why it is the natural Goodhart pattern for controllers in this loop.
At the end of round $t$, we have an internal dataset $L_t$ (logs + evals). We tune the controller using a scalar offline objective $J$:
$$\pi_{t+1} = \arg\max_{\pi} J(\pi; L_t).$$
This covers objectives like "accuracy − λ·tokens", "predicted P[Love] − λ·latency", etc.
Real risk, however, is incurred under the deployment distribution $D_{t+1}$. For an overall risk measure:
$$R(\pi; D) = \mathbb{E}_{x \sim D}\big[\mathrm{risk}(\pi, x)\big],$$
and for a slice $S$ (e.g. "high-stakes" or "adversarial" queries):
$$R_S(\pi; D) = \mathbb{E}_{x \sim D \,\mid\, x \in S}\big[\mathrm{risk}(\pi, x)\big].$$
Definition (slice-level meta-policy overfitting).
We say meta-policy overfitting on slice $S$ occurs at round $t$ if the update from $\pi_t$ to $\pi_{t+1}$ makes the offline objective strictly better but the slice risk on the next deployment distribution strictly worse:
$$J(\pi_{t+1}; L_t) > J(\pi_t; L_t) \quad \text{and} \quad R_S(\pi_{t+1}; D_{t+1}) > R_S(\pi_t; D_t) + \varepsilon$$
for some margin $\varepsilon > 0$.
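Operationally, the definition only needs four logged quantities per update. A minimal check, with hypothetical variable names:

```python
def is_slice_overfitting(
    j_old: float, j_new: float,        # offline objective J(pi_t; L_t) and J(pi_{t+1}; L_t)
    risk_old: float, risk_new: float,  # slice risk R_S(pi_t; D_t) and R_S(pi_{t+1}; D_{t+1})
) -> bool:
    """True iff the update looked strictly better offline while slice risk under the
    next deployment distribution got strictly worse."""
    return j_new > j_old and risk_new > risk_old
```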
Intuition: the update exploits regularities of yesterday's logs ("these query types are safe to handle cheaply"), but deploying that update shifts tomorrow's traffic toward exactly the regions where the controller now skimps.
This is a controller-level analogue of classic overfitting and Goodharting: the offline proxy improves precisely because the update exploits features of the old distribution that the new, partly adversarial distribution no longer satisfies.
This section shows a toy dynamics where offline controller metrics improve while high-stakes risk monotonically worsens.
We can see this failure mode in a toy model.
Optimize a controller on a static objective "accuracy − λ·tokens" and it learns to spend cheap compute wherever the old logs suggest it can get away with it, including on high-stakes queries that were rare in those logs.
Now add adversarial feedback: whenever the system uses cheap compute on high-stakes queries, adversaries shift more traffic into that region, increasing the high-stakes share $p_t$:
$$p_{t+1} = \min\big(1,\; p_t + \gamma\, c_t\, p_t\big),$$
with $c_t$ the fraction of high-stakes queries handled with cheap compute at round $t$, and $\gamma \ge 0$ describing how strongly adversaries respond to cheap-compute weaknesses.
It is easy to get a regime where each update improves the offline objective on the previous round's logs, while the high-stakes share $p_t$ and the realised high-stakes error rate rise monotonically.
The point is not realism, but structural ease: once you have a controller tuned against yesterday's logs and an environment that shifts traffic toward its cheap-compute weaknesses,
you should expect patterns where "offline metrics rise while real high-stakes risk rises".
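Here is a minimal simulation of that toy dynamics. All parameter values and the specific update rules are illustrative choices of mine; the only claim is that the qualitative pattern (offline gains every round, rising high-stakes share and slice risk) falls out easily.

```python
# Toy loop: a controller trades accuracy against tokens on yesterday's logs,
# while adversaries shift traffic toward high-stakes queries that get cheap compute.

GAMMA = 0.5               # adversary responsiveness to cheap-compute weaknesses
LAM = 0.08                # token penalty in "accuracy - lam * tokens"
ACC_EASY_CHEAP = 0.92     # cheap compute on ordinary queries
ACC_HARD_CHEAP = 0.55     # cheap compute on high-stakes queries (weak)
ACC_EXPENSIVE = 0.95      # expensive compute on any query
TOK_CHEAP, TOK_EXPENSIVE = 1.0, 8.0


def offline_objective(cheap_rate: float, p: float) -> float:
    """Average 'accuracy - lam*tokens' when the high-stakes share is p and high-stakes
    queries get cheap compute with probability cheap_rate."""
    easy = ACC_EASY_CHEAP - LAM * TOK_CHEAP
    hard = cheap_rate * (ACC_HARD_CHEAP - LAM * TOK_CHEAP) + \
        (1 - cheap_rate) * (ACC_EXPENSIVE - LAM * TOK_EXPENSIVE)
    return (1 - p) * easy + p * hard


def high_stakes_risk(cheap_rate: float) -> float:
    """Error rate on the high-stakes slice."""
    return cheap_rate * (1 - ACC_HARD_CHEAP) + (1 - cheap_rate) * (1 - ACC_EXPENSIVE)


p, cheap_rate = 0.05, 0.2
for t in range(8):
    new_cheap_rate = min(1.0, cheap_rate + 0.1)   # "tuning": the token penalty favours cheap compute
    j_old = offline_objective(cheap_rate, p)      # J(pi_t; L_t)
    j_new = offline_objective(new_cheap_rate, p)  # J(pi_{t+1}; L_t) -- strictly better on the logs
    cheap_rate = new_cheap_rate
    p = min(1.0, p + GAMMA * cheap_rate * p)      # adversaries shift next round's traffic
    print(f"round {t}: J {j_old:.3f} -> {j_new:.3f}, "
          f"high-stakes share -> {p:.3f}, slice risk -> {high_stakes_risk(cheap_rate):.3f}")
```

Every round the offline objective improves on the logged distribution, while the high-stakes share and the high-stakes error rate both climb.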
This section sketches how a lab could actually detect meta-policy overfitting using standard logging and evaluation tools.
Instead of treating "meta-policy overfitting" as just a story, we can sketch a diagnostic recipe that labs could implement with existing infra. This section is deliberately concrete, in the spirit of Tips for Empirical Alignment Research (Perez, 2024).
1. Maintain an evaluation-only stream
Hold out a slice of live traffic, including high-stakes and adversarial queries, that is never used to tune the controller and only ever used to evaluate it.
2. Track the right metrics
On both training logs and the evaluation-only stream, track:
- per-slice risk, especially on high-stakes and adversarial slices;
- the fraction of queries in each slice handled with cheap compute (low thinking budget, no tools, small model);
- the error rate on those cheap-compute queries within each slice;
- the offline objective itself, so it can be compared against the slice risks above.
3. Look for red-flag patterns
Red flags for meta-policy overfitting include:
- the offline objective improving release over release while high-stakes slice risk on the evaluation-only stream gets worse;
- the share of high-stakes queries handled with cheap compute rising together with the error rate on exactly those queries;
- training-log metrics and evaluation-only-stream metrics steadily diverging.
This is exactly the qualitative pattern from the toy model, now grounded in metrics a lab could realistically compute.
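As a sketch of what the red-flag check could look like on top of standard metric logging (the schema and names here are hypothetical, not any lab's actual telemetry):

```python
from dataclasses import dataclass


@dataclass
class ReleaseMetrics:
    """Metrics for one controller release; risk metrics come from the
    evaluation-only stream, the offline objective from training logs."""
    offline_objective: float      # scalar objective on past logs
    high_stakes_risk: float       # error/harm rate on the high-stakes slice
    cheap_compute_share: float    # fraction of high-stakes queries given cheap compute
    cheap_compute_error: float    # error rate on those cheap-compute high-stakes queries


def red_flags(prev: ReleaseMetrics, curr: ReleaseMetrics) -> list[str]:
    """Flag the qualitative signature of meta-policy overfitting between two releases."""
    flags = []
    if curr.offline_objective > prev.offline_objective and curr.high_stakes_risk > prev.high_stakes_risk:
        flags.append("offline objective up, but high-stakes slice risk up on the eval-only stream")
    if curr.cheap_compute_share > prev.cheap_compute_share and curr.cheap_compute_error > prev.cheap_compute_error:
        flags.append("more cheap compute on high-stakes queries, and it is failing more often")
    return flags
```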
If you only do one diagnostic, maintain an evaluation-only stream and track high-stakes slice risk on it over time.
This section sketches design levers that could help make the controller loop more stable and less exploitable in practice.
Given those diagnostics, what can be done? Briefly:
Evaluation-only stream
As above: ensure you always have a reference distribution on which you can see whether risk is truly going down.
Adversarial distribution as a first-class citizen
Maintain an explicit adversarial distribution $D^{\mathrm{adv}}_t$ and penalize the controller's risk under it, $R(\pi; D^{\mathrm{adv}}_t)$, in your objective, rather than using adversarial data only for base-model training. Treat "how exploitable the controller is" as a primary target, not a side-effect.
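One simple way to write this down (my notation, with $\beta > 0$ weighting exploitability against the usual offline objective):
$$J^{\mathrm{robust}}(\pi; L_t) \;=\; J(\pi; L_t) \;-\; \beta\, R\big(\pi; D^{\mathrm{adv}}_t\big).$$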
Time-scale separation
Update the meta-policy slowly:
$$\pi_{t+1} = (1 - \alpha)\,\pi_t + \alpha\,\hat{\pi}_{t+1}, \qquad \alpha \in (0, 1] \text{ small},$$
where $\hat{\pi}_{t+1}$ is the "fully-updated" controller on logs $L_t$. This reduces sensitivity to short-term adversarial spikes and noisy feedback.
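A minimal sketch for a controller with continuous parameters (for a discrete policy, the analogue would be mixing action probabilities); `alpha` is the slow-update rate:

```python
import numpy as np


def slow_update(theta_current: np.ndarray, theta_fully_updated: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Time-scale separation: move only a fraction alpha of the way toward the
    fully-retrained controller parameters. alpha = 1.0 recovers the naive update."""
    return (1.0 - alpha) * theta_current + alpha * theta_fully_updated
```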
Compute floors on high-risk slices
For safety-critical slices $S$, enforce hard floors like $c(\pi, x) \ge c_{\min}(S)$ for every query $x \in S$, where $c(\pi, x)$ is the compute (thinking tokens, tool calls, model tier) the controller allocates to $x$.
Don't let a single scalar objective fully decide compute levels in those regions; treat them as constraints, not soft preferences.
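A sketch of what a hard floor could look like as a post-hoc clamp on the controller's decision; the slice names, decision fields, and floor values are all hypothetical:

```python
# Hypothetical compute floors per high-stakes slice.
COMPUTE_FLOORS = {
    "bio_security": {"min_thinking_tokens": 4096, "min_model_tier": "large"},
    "medical_advice": {"min_thinking_tokens": 2048, "min_model_tier": "medium"},
}
MODEL_TIERS = ["small", "medium", "large"]


def enforce_floor(decision: dict, slice_name: str) -> dict:
    """Clamp the controller's compute decision to the floor for its slice,
    regardless of what the learned objective would prefer."""
    floor = COMPUTE_FLOORS.get(slice_name)
    if floor is None:
        return decision
    clamped = dict(decision)
    clamped["thinking_tokens"] = max(clamped["thinking_tokens"], floor["min_thinking_tokens"])
    if MODEL_TIERS.index(clamped["model_tier"]) < MODEL_TIERS.index(floor["min_model_tier"]):
        clamped["model_tier"] = floor["min_model_tier"]
    return clamped
```

For example, `enforce_floor({"thinking_tokens": 256, "model_tier": "small"}, "bio_security")` returns a decision with at least 4096 thinking tokens and the large model tier, whatever the learned controller proposed.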
Explicit controller monitoring + rollback
Monitor metrics like the high-stakes cheap-compute error rate on the evaluation-only stream, and set rollback rules: for example, if that error rate exceeds a pre-registered threshold for several consecutive releases, revert to the last controller version that satisfied it.
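A rollback rule can be as simple as a pre-registered threshold check over the last few releases; the threshold and window below are illustrative, not recommendations:

```python
def should_roll_back(high_stakes_cheap_error: list[float], threshold: float = 0.10, window: int = 3) -> bool:
    """Trigger a rollback if the high-stakes cheap-compute error rate on the
    evaluation-only stream exceeded the threshold for the last `window` releases."""
    recent = high_stakes_cheap_error[-window:]
    return len(recent) == window and all(err > threshold for err in recent)
```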
These levers all have analogues in current ML practice: distributionally robust optimisation, safety floors, and monitoring & rollback (Rahimian & Mehrotra, 2019; Arike, RohanS, & Biswas, 2025). The difference here is where they are applied: at the controller level, inside the training–deployment feedback loop, not just at the base-model level.
If you only implement one design lever, enforce compute floors on clearly high-stakes slices and monitor cheap-compute error rates there.
This section distils the main lesson and how to use "meta-policy overfitting in feedback loops" as a practical conceptual tool.
Test-time controllers are not static hyperparameters.
They are learning systems embedded in a feedback loop with users and adversaries. If we only optimize them on yesterday's offline metrics, the natural failure mode is meta-policy overfitting—controllers that look better on the logs they created while quietly making tomorrow’s distribution more exploitable and more dangerous.
The hope is that "meta-policy overfitting in feedback loops" can serve as a useful conceptual tool: a handle for talking about the ways test-time controllers can Goodhart at the system level, and a pointer to concrete diagnostics and design levers for the labs that are deploying them.
In simple feedback loops we already see a natural path to meta-level Goodhart. If we design more powerful recursive self-improvement (RSI) loops without taking these structural problems seriously, we are probably playing with fire.