Disclosure: This post is based on my own research and thinking. I used ChatGPT (GPT-5.1 Thinking) as a writing assistant for wording and editing, and I have reviewed and endorse all claims.
Modern LLM systems increasingly use test-time controllers (meta-policies) to decide how much to think, whether to call tools, or whether to upgrade to a larger model. Recent work on "adaptive test-time compute" treats these as policies over compute, but almost always evaluates them on static benchmarks under fixed data distributions (Alomrani et al., 2025; Kidd, 2025; Arike, 2025; Arike et al., 2025).
In deployment, controllers sit inside a training–deployment feedback loop: they change user and adversarial behaviour, which changes the logs we train on, which then change the controller itself. That loop admits a natural failure mode:
Meta-policy overfitting in feedback loops:
updates that make a controller look strictly better on yesterday's offline logs while systematically making tomorrow’s deployment distribution more exploitable and more dangerous.
This post goes beyond prior work on adaptive test-time compute on fixed benchmarks: it treats the controller as a learning system inside the training–deployment loop and names the resulting failure mode "meta-policy overfitting in feedback loops".
Key recommendation: treat high-stakes deployment slices as first-class, with diagnostics, compute floors, and rollback rules, rather than optimising only average offline reward on past logs.
This section explains why test-time controllers matter in the inference-scaling era and why static-benchmark evaluations are no longer enough.
At a meta level, this post is conceptual rather than empirical: it proposes a formal lens and a named failure mode for test-time controllers in feedback loops, plus diagnostics and design levers that labs could implement with existing infra.
With the rise of o1/o3-style models, search-augmented systems, and latent-reasoning architectures (Arike, 2025; Arike, RohanS, & Biswas, 2025), test-time controllers have become first-class citizens. They decide, on each query, how much to think, whether to call tools, and whether to escalate to a larger model.
Most current work evaluates such controllers on static benchmarks: fix a dataset, sweep a compute budget, compare trade-offs like "accuracy vs tokens" or "pass@k vs latency". That is necessary, but not sufficient.
In the real world, controllers live inside a closed loop: their compute decisions shape user and adversarial behaviour, which shapes the logs we collect, which shape the next version of the controller.
Feedback-loop and meta-learning papers show that this kind of two-level system is especially prone to overfitting and Goodharting (Gama et al., 2014; Lu et al., 2019).
The core claim of this post is:
Once you treat test-time controllers as learning systems inside this loop, there is a structurally natural way for them to Goodhart: they can learn to look better on yesterday's logs while pushing tomorrow's distribution into regions where they are weak.
This section gives a minimal formal model of the training–deployment feedback loop for test-time controllers.
We fix a base model and focus on a meta-policy and a deployment distribution that co-evolve as we retrain on logs.
Components:
- A fixed base model.
- A meta-policy (controller) $\pi_t$ that decides, per query, how much compute to spend and which tools or model sizes to invoke.
- A deployment distribution $D_t$ over queries, shaped by how users and adversaries react to the deployed system.
- Logs $L_t$ collected by running $\pi_t$ on traffic drawn from $D_t$.
- A scalar offline objective $J(\pi; L_t)$ used to tune the controller.

The closed-loop dynamics is:
$$\pi_{t+1} = \arg\max_{\pi} J(\pi; L_t), \qquad D_{t+1} = T(D_t, \pi_{t+1}), \qquad L_{t+1} \sim \mathrm{deploy}(\pi_{t+1}, D_{t+1}),$$
where $T$ captures how users and adversaries adapt to the newly deployed controller.
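To make the loop concrete, here is a minimal structural sketch in Python. The function names and types are placeholders of mine, not any lab's actual infrastructure; the point is only the order in which the updates happen.

```python
from typing import Any, Callable, Tuple

# Placeholder types: a controller (meta-policy), a query distribution, and a batch of logs.
Controller = Any
Distribution = Any
Logs = Any


def run_round(
    controller: Controller,
    distribution: Distribution,
    logs: Logs,
    tune: Callable[[Controller, Logs], Controller],             # pi_{t+1} = argmax_pi J(pi; L_t)
    shift: Callable[[Distribution, Controller], Distribution],  # D_{t+1} = T(D_t, pi_{t+1})
    deploy: Callable[[Controller, Distribution], Logs],         # L_{t+1} from deploying pi_{t+1} on D_{t+1}
) -> Tuple[Controller, Distribution, Logs]:
    """One round of the training-deployment feedback loop for a test-time controller."""
    new_controller = tune(controller, logs)                  # retrain on yesterday's logs
    new_distribution = shift(distribution, new_controller)   # users and adversaries adapt
    new_logs = deploy(new_controller, new_distribution)      # collect tomorrow's logs
    return new_controller, new_distribution, new_logs
```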
This section defines slice-level meta-policy overfitting and explains why it is the natural Goodhart pattern for controllers in this loop.
At the end of round $t$, we have an internal dataset $L_t$ (logs + evals). We tune the controller using a scalar offline objective $J$:
$$\pi_{t+1} = \arg\max_{\pi} J(\pi; L_t).$$
This covers objectives like "accuracy − λ·tokens", "predicted P[Love] − λ·latency", etc.
Real risk, however, is incurred under the deployment distribution $D_{t+1}$. For an overall risk measure:
$$R(\pi; D) = \mathbb{E}_{x \sim D}\big[\mathrm{risk}(\pi, x)\big],$$
and for a slice $S$ (e.g. "high-stakes" or "adversarial" queries):
$$R_S(\pi; D) = \mathbb{E}_{x \sim D \,\mid\, x \in S}\big[\mathrm{risk}(\pi, x)\big].$$
Definition (slice-level meta-policy overfitting).
We say meta-policy overfitting on slice $S$ occurs at round $t$ if the update from $\pi_t$ to $\pi_{t+1}$ makes the offline objective strictly better but the slice risk on the next deployment distribution strictly worse:
$$J(\pi_{t+1}; L_t) > J(\pi_t; L_t) \quad \text{and} \quad R_S(\pi_{t+1}; D_{t+1}) > R_S(\pi_t; D_t) + \varepsilon$$
for some margin $\varepsilon > 0$.
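Operationally, the definition only needs four logged quantities per update. A minimal check, with hypothetical variable names:

```python
def is_slice_overfitting(
    j_old: float, j_new: float,        # offline objective J(pi_t; L_t) and J(pi_{t+1}; L_t)
    risk_old: float, risk_new: float,  # slice risk R_S(pi_t; D_t) and R_S(pi_{t+1}; D_{t+1})
) -> bool:
    """True iff the update looked strictly better offline while slice risk under the
    next deployment distribution got strictly worse."""
    return j_new > j_old and risk_new > risk_old
```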
Intuition: the update exploits regularities of yesterday's logs ("these query types are safe to handle cheaply"), but deploying that update shifts tomorrow's traffic toward exactly the regions where the controller now skimps.
This is a controller-level analogue of classic overfitting and Goodharting: the offline proxy improves precisely because the update exploits features of the old distribution that the new, partly adversarial distribution no longer satisfies.
This section shows a toy dynamics where offline controller metrics improve while high-stakes risk monotonically worsens.
We can see this failure mode in a toy model.
Optimize a controller on a static objective "accuracy − λ·tokens" and it learns to spend cheap compute wherever the old logs suggest it can get away with it, including on high-stakes queries that were rare in those logs.
Now add adversarial feedback: whenever the system uses cheap compute on high-stakes queries, adversaries shift more traffic into that region, increasing the high-stakes share $p_t$:
$$p_{t+1} = \min\big(1,\; p_t + \gamma\, c_t\, p_t\big),$$
with $c_t$ the fraction of high-stakes queries handled with cheap compute at round $t$, and $\gamma \ge 0$ describing how strongly adversaries respond to cheap-compute weaknesses.
It is easy to get a regime where each update improves the offline objective on the previous round's logs, while the high-stakes share $p_t$ and the realised high-stakes error rate rise monotonically.
The point is not realism, but structural ease: once you have a controller tuned against yesterday's logs and an environment that shifts traffic toward its cheap-compute weaknesses,
you should expect patterns where "offline metrics rise while real high-stakes risk rises".
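Here is a minimal simulation of that toy dynamics. All parameter values and the specific update rules are illustrative choices of mine; the only claim is that the qualitative pattern (offline gains every round, rising high-stakes share and slice risk) falls out easily.

```python
# Toy loop: a controller trades accuracy against tokens on yesterday's logs,
# while adversaries shift traffic toward high-stakes queries that get cheap compute.

GAMMA = 0.5               # adversary responsiveness to cheap-compute weaknesses
LAM = 0.08                # token penalty in "accuracy - lam * tokens"
ACC_EASY_CHEAP = 0.92     # cheap compute on ordinary queries
ACC_HARD_CHEAP = 0.55     # cheap compute on high-stakes queries (weak)
ACC_EXPENSIVE = 0.95      # expensive compute on any query
TOK_CHEAP, TOK_EXPENSIVE = 1.0, 8.0


def offline_objective(cheap_rate: float, p: float) -> float:
    """Average 'accuracy - lam*tokens' when the high-stakes share is p and high-stakes
    queries get cheap compute with probability cheap_rate."""
    easy = ACC_EASY_CHEAP - LAM * TOK_CHEAP
    hard = cheap_rate * (ACC_HARD_CHEAP - LAM * TOK_CHEAP) + \
        (1 - cheap_rate) * (ACC_EXPENSIVE - LAM * TOK_EXPENSIVE)
    return (1 - p) * easy + p * hard


def high_stakes_risk(cheap_rate: float) -> float:
    """Error rate on the high-stakes slice."""
    return cheap_rate * (1 - ACC_HARD_CHEAP) + (1 - cheap_rate) * (1 - ACC_EXPENSIVE)


p, cheap_rate = 0.05, 0.2
for t in range(8):
    new_cheap_rate = min(1.0, cheap_rate + 0.1)   # "tuning": the token penalty favours cheap compute
    j_old = offline_objective(cheap_rate, p)      # J(pi_t; L_t)
    j_new = offline_objective(new_cheap_rate, p)  # J(pi_{t+1}; L_t) -- strictly better on the logs
    cheap_rate = new_cheap_rate
    p = min(1.0, p + GAMMA * cheap_rate * p)      # adversaries shift next round's traffic
    print(f"round {t}: J {j_old:.3f} -> {j_new:.3f}, "
          f"high-stakes share -> {p:.3f}, slice risk -> {high_stakes_risk(cheap_rate):.3f}")
```

Every round the offline objective improves on the logged distribution, while the high-stakes share and the high-stakes error rate both climb.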
This section sketches how a lab could actually detect meta-policy overfitting using standard logging and evaluation tools.
Instead of treating "meta-policy overfitting" as just a story, we can sketch a diagnostic recipe that labs could implement with existing infra. This section is deliberately concrete, in the spirit of Tips for Empirical Alignment Research (Perez, 2024).
1. Maintain an evaluation-only stream
Hold out a slice of live traffic, including high-stakes and adversarial queries, that is never used to tune the controller and only ever used to evaluate it.
2. Track the right metrics
On both training logs and the evaluation-only stream, track:
- per-slice risk, especially on high-stakes and adversarial slices;
- the fraction of queries in each slice handled with cheap compute (low thinking budget, no tools, small model);
- the error rate on those cheap-compute queries within each slice;
- the offline objective itself, so it can be compared against the slice risks above.
3. Look for red-flag patterns
Red flags for meta-policy overfitting include:
- the offline objective improving release over release while high-stakes slice risk on the evaluation-only stream gets worse;
- the share of high-stakes queries handled with cheap compute rising together with the error rate on exactly those queries;
- training-log metrics and evaluation-only-stream metrics steadily diverging.
This is exactly the qualitative pattern from the toy model, now grounded in metrics a lab could realistically compute.
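As a sketch of what the red-flag check could look like on top of standard metric logging (the schema and names here are hypothetical, not any lab's actual telemetry):

```python
from dataclasses import dataclass


@dataclass
class ReleaseMetrics:
    """Metrics for one controller release; risk metrics come from the
    evaluation-only stream, the offline objective from training logs."""
    offline_objective: float      # scalar objective on past logs
    high_stakes_risk: float       # error/harm rate on the high-stakes slice
    cheap_compute_share: float    # fraction of high-stakes queries given cheap compute
    cheap_compute_error: float    # error rate on those cheap-compute high-stakes queries


def red_flags(prev: ReleaseMetrics, curr: ReleaseMetrics) -> list[str]:
    """Flag the qualitative signature of meta-policy overfitting between two releases."""
    flags = []
    if curr.offline_objective > prev.offline_objective and curr.high_stakes_risk > prev.high_stakes_risk:
        flags.append("offline objective up, but high-stakes slice risk up on the eval-only stream")
    if curr.cheap_compute_share > prev.cheap_compute_share and curr.cheap_compute_error > prev.cheap_compute_error:
        flags.append("more cheap compute on high-stakes queries, and it is failing more often")
    return flags
```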
If you only do one diagnostic, maintain an evaluation-only stream and track high-stakes slice risk on it over time.
This section sketches design levers that could help make the controller loop more stable and less exploitable in practice.
Given those diagnostics, what can be done? Briefly:
Evaluation-only stream
As above: ensure you always have a reference distribution on which you can see whether risk is truly going down.
Adversarial distribution as a first-class citizen
Maintain an explicit adversarial distribution $D^{\mathrm{adv}}_t$ and penalize the controller's risk under it, $R(\pi; D^{\mathrm{adv}}_t)$, in your objective, rather than using adversarial data only for base-model training. Treat "how exploitable the controller is" as a primary target, not a side-effect.
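One simple way to write this down (my notation, with $\beta > 0$ weighting exploitability against the usual offline objective):
$$J^{\mathrm{robust}}(\pi; L_t) \;=\; J(\pi; L_t) \;-\; \beta\, R\big(\pi; D^{\mathrm{adv}}_t\big).$$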
Time-scale separation
Update the meta-policy slowly:
$$\pi_{t+1} = (1 - \alpha)\,\pi_t + \alpha\,\hat{\pi}_{t+1}, \qquad \alpha \in (0, 1] \text{ small},$$
where $\hat{\pi}_{t+1}$ is the "fully-updated" controller on logs $L_t$. This reduces sensitivity to short-term adversarial spikes and noisy feedback.
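A minimal sketch for a controller with continuous parameters (for a discrete policy, the analogue would be mixing action probabilities); `alpha` is the slow-update rate:

```python
import numpy as np


def slow_update(theta_current: np.ndarray, theta_fully_updated: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Time-scale separation: move only a fraction alpha of the way toward the
    fully-retrained controller parameters. alpha = 1.0 recovers the naive update."""
    return (1.0 - alpha) * theta_current + alpha * theta_fully_updated
```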
Compute floors on high-risk slices
For safety-critical slices $S$, enforce hard floors like $c(\pi, x) \ge c_{\min}(S)$ for every query $x \in S$, where $c(\pi, x)$ is the compute (thinking tokens, tool calls, model tier) the controller allocates to $x$.
Don't let a single scalar objective fully decide compute levels in those regions; treat them as constraints, not soft preferences.
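A sketch of what a hard floor could look like as a post-hoc clamp on the controller's decision; the slice names, decision fields, and floor values are all hypothetical:

```python
# Hypothetical compute floors per high-stakes slice.
COMPUTE_FLOORS = {
    "bio_security": {"min_thinking_tokens": 4096, "min_model_tier": "large"},
    "medical_advice": {"min_thinking_tokens": 2048, "min_model_tier": "medium"},
}
MODEL_TIERS = ["small", "medium", "large"]


def enforce_floor(decision: dict, slice_name: str) -> dict:
    """Clamp the controller's compute decision to the floor for its slice,
    regardless of what the learned objective would prefer."""
    floor = COMPUTE_FLOORS.get(slice_name)
    if floor is None:
        return decision
    clamped = dict(decision)
    clamped["thinking_tokens"] = max(clamped["thinking_tokens"], floor["min_thinking_tokens"])
    if MODEL_TIERS.index(clamped["model_tier"]) < MODEL_TIERS.index(floor["min_model_tier"]):
        clamped["model_tier"] = floor["min_model_tier"]
    return clamped
```

For example, `enforce_floor({"thinking_tokens": 256, "model_tier": "small"}, "bio_security")` returns a decision with at least 4096 thinking tokens and the large model tier, whatever the learned controller proposed.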
Explicit controller monitoring + rollback
Monitor metrics like the high-stakes cheap-compute error rate on the evaluation-only stream, and set rollback rules: for example, if that error rate exceeds a pre-registered threshold for several consecutive releases, revert to the last controller version that satisfied it.
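A rollback rule can be as simple as a pre-registered threshold check over the last few releases; the threshold and window below are illustrative, not recommendations:

```python
def should_roll_back(high_stakes_cheap_error: list[float], threshold: float = 0.10, window: int = 3) -> bool:
    """Trigger a rollback if the high-stakes cheap-compute error rate on the
    evaluation-only stream exceeded the threshold for the last `window` releases."""
    recent = high_stakes_cheap_error[-window:]
    return len(recent) == window and all(err > threshold for err in recent)
```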
These levers all have analogues in current ML practice: distributionally robust optimisation, safety floors, and monitoring & rollback (Rahimian & Mehrotra, 2019; Arike, RohanS, & Biswas, 2025). The difference here is where they are applied: at the controller level, inside the training–deployment feedback loop, not just at the base-model level.
If you only implement one design lever, enforce compute floors on clearly high-stakes slices and monitor cheap-compute error rates there.
This section distils the main lesson and how to use "meta-policy overfitting in feedback loops" as a practical conceptual tool.
Test-time controllers are not static hyperparameters.
They are learning systems embedded in a feedback loop with users and adversaries. If we only optimize them on yesterday's offline metrics, the natural failure mode is meta-policy overfitting—controllers that look better on the logs they created while quietly making tomorrow’s distribution more exploitable and more dangerous.
The hope is that "meta-policy overfitting in feedback loops" can serve as a useful conceptual tool: a handle for talking about the ways test-time controllers can Goodhart at the system level, and a pointer to concrete diagnostics and design levers for the labs that are deploying them.
In simple feedback loops we already see a natural path to meta-level Goodhart. If we design more powerful recursive self-improvement (RSI) loops without taking these structural problems seriously, we are probably playing with fire.