Interpreting Gradient Routing’s Scalable Oversight Experiment

makataomu; myyycroft

TL;DR: We discuss the setting that the Gradient Routing (GR) paper uses to model Scalable Oversight (SO). In the first part, we suggest an improved naive baseline using early stopping, which performs roughly on par with GR. In the second part, we compare GR’s setting to SO and Weak-to-Strong Generalization (W2SG), discuss how they might be useful in combination, argue that the setting is closer to semi-supervised reinforcement learning (SSRL), and point to some other possible baselines.
We think this post will be useful for interpreting Gradient Routing’s SO experiment and for readers who are trying to build intuition about what modern Scalable Oversight work does and does not assume.

This post is mainly about two things.

First, it’s about the importance of simple baselines. Second, it's about different ways of modeling Scalable Oversight, and what each way entails.

Let's start with some background for the first theme. Gradient Routing^[1] is a training method that makes it possible to isolate capabilities in specific subregions of a neural network.

Among other evaluation setups, the paper explores a toy RL environment (a 5x5 GridWorld with ground-truth and proxy labels) meant to model Scalable Oversight (Section 4.3), and shows that GR outperforms baselines such as naive REINFORCE training and data filtering. We reproduced these experiments and found that the naive baseline overfits, yet a simple regularization such as early stopping (ES) makes the baseline perform roughly on par with GR.

Now, the original paper frames its RL experiment as an application of GR to SO. We would like to argue that it is closer to sparse supervision, or SSRL with informational asymmetry. The method does not amplify human evaluators. Instead, it studies how agents learn when reliable feedback is available only on a small fraction of episodes, and when the agent can condition its behavior on whether it is being supervised.
Note: here and later, by SO we mean Debate, Iterated Distillation and Amplification (IDA), and Recursive Reward Modeling (RRM); the relevant part of the LessWrong (LW) definition also fits.

Another critical difference is that the setup from Gradient Routing assumes that we have a small percentage of comprehensive (gold) labels that could be used for training, while other approaches don’t make this assumption (apart from Easy-to-Hard Generalization (E2HG), which will be discussed later).

GridWorld Oversight Setting

The reinforcement learning setting introduced in the Gradient Routing paper is designed to model training under partial oversight.

The environment is a small 5x5 GridWorld. At the beginning of each episode, the agent, a Diamond, and a Ghost are randomly placed on the grid. Reaching either the Diamond or the Ghost ends the episode.

5x5 grids with Ghost and Diamond in some of the cells

The agent’s observation consists of the positions of the agent, the Diamond, and the Ghost, plus an oversight flag indicating whether the episode is supervised. The agent always observes the full state, including the oversight flag, which means that it can switch its behavior depending on the value of this flag. The agent observes more about the environment than the supervisors can assess, which creates an informational asymmetry.

The reward function depends on oversight:

In supervised episodes, the agent receives +1 for reaching the Diamond and -1 for reaching the Ghost;
In unsupervised episodes, the agent always receives +1, regardless of which terminal state is reached;
If no terminal state is reached in n steps, the agent receives 0.
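The three reward rules above can be written down as a small function. This is only a sketch of the rules as stated in this post, not code from the GR repository: the names `episode_reward` and `true_return` and the string labels for terminal states are our own assumptions, and any discounting used in the actual environment is ignored.

```python
from typing import Optional


def episode_reward(terminal: Optional[str], supervised: bool) -> float:
    """Reward the agent observes at the end of an episode (hypothetical helper).

    terminal: "diamond", "ghost", or None if the step limit was hit.
    supervised: value of the oversight flag for this episode.
    """
    if terminal is None:
        return 0.0  # no terminal state reached within the step limit
    if terminal not in ("diamond", "ghost"):
        raise ValueError(f"unknown terminal state: {terminal!r}")
    if not supervised:
        return 1.0  # proxy reward: every terminal state looks like a success
    return 1.0 if terminal == "diamond" else -1.0


def true_return(terminal: Optional[str]) -> float:
    """Ground-truth reward: what the episode would pay if it were supervised."""
    return episode_reward(terminal, supervised=True)
```

The gap between `episode_reward(..., supervised=False)` and `true_return(...)` on Ghost episodes is exactly what an agent can exploit once it reads the oversight flag.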

As a result, only a small fraction of episodes provides true reward signals. This setup incentivizes agents to exploit supervision gaps, leading to oversight gaming.

Note: If you're more interested in the conceptual comparison or the broader discussion of SO, you may wish to skip the early stopping sections and head directly to the section "Is GR comparable to Debate or W2SG?"

Early Stopping

In GR's GridWorld setting, only a small fraction of episodes provides true rewards, forming a supervised subset. When training continues for a fixed number of steps, the naive baseline overfits to this subset. Performance peaks and then degrades.

A better way to describe this could be via Goodhart’s Law: the problem here is not that the model overfits, but that it learned to game oversight.

Figure 1: Learning curves show that the naive baseline peaks and then drops (overfits), while GR dips and then rises. 8 runs for each method. Note also that GR’s peak is higher than the naive baseline’s, which is discussed later.

As a regularization, we can use early stopping, which is widely used in supervised learning to prevent overfitting. Here, it serves as a simple baseline for avoiding policies that perform well only on the small supervised subset while degrading in terms of true performance. We should also note that this idea is not new in RL: "Best Policy Tracking in Gradient-based Optimization" uses it as a baseline and introduces an improved method that also uses a holdout set.

In our experiments, from the available supervised episodes, we set aside a fixed holdout set for validation. For example, when 1% of episodes are supervised, we can split them evenly: 0.5% are used for training and 0.5% are used only for validation. The agent never updates on the holdout episodes.
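A minimal sketch of such a split, assuming supervised episodes can be identified by an ID and that the split is drawn once with a fixed seed. The function name `split_supervised` and its parameters are hypothetical and are not taken from our fork or from the original repository.

```python
import random


def split_supervised(episode_ids, holdout_fraction=0.5, seed=0):
    """Split supervised episode IDs into disjoint train and holdout sets.

    holdout_fraction is the share of supervised episodes reserved for
    validation; the agent would never be updated on those episodes.
    """
    if not 0.0 <= holdout_fraction <= 1.0:
        raise ValueError("holdout_fraction must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed so the holdout set stays the same
    ids = list(episode_ids)
    rng.shuffle(ids)
    n_holdout = int(round(len(ids) * holdout_fraction))
    holdout = set(ids[:n_holdout])
    train = set(ids[n_holdout:])
    return train, holdout


# Example: 10 supervised episodes, 40% of them reserved for validation.
train_ids, holdout_ids = split_supervised(range(10), holdout_fraction=0.4)
```

With 1% oversight and `holdout_fraction=0.5`, this corresponds to the 0.5% / 0.5% split described above.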

Figure 2: Performance at 1% oversight. 8 runs for each point, with seeds from 0 to 7. Early stopping was used with handpicked (not tuned) hyperparameters (patience=500, tolerance=0.04, min_steps=1500). Here the naive baseline with early stopping performs slightly worse than GR because the naive method’s peak is lower than GR’s, as noted in the Figure 1 caption.

During training, we periodically evaluate performance on the holdout set and select the checkpoint with the best holdout return. Training is otherwise unchanged.
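The procedure can be sketched as a small helper that is called after every holdout evaluation. The hyperparameter names match the ones we report (patience, tolerance, min_steps), but their exact semantics here are an assumption made for illustration: an evaluation counts as an improvement only if it beats the best holdout return by more than `tolerance`, and training stops once `patience` steps have passed without one, but never before `min_steps`. The class name `EarlyStopper` is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EarlyStopper:
    patience: int = 500      # steps without improvement before stopping
    tolerance: float = 0.04  # minimum gain that counts as an improvement
    min_steps: int = 1500    # never stop before this many training steps
    best_return: float = float("-inf")
    best_step: Optional[int] = None

    def update(self, step: int, holdout_return: float) -> bool:
        """Record one holdout evaluation; return True if training should stop."""
        if holdout_return > self.best_return + self.tolerance:
            self.best_return = holdout_return
            self.best_step = step  # the caller would save a checkpoint here
        if step < self.min_steps or self.best_step is None:
            return False
        return step - self.best_step >= self.patience


# Toy holdout curve that peaks and then degrades, as in the naive baseline.
curve = {0: 0.1, 250: 0.3, 500: 0.5, 750: 0.6, 1000: 0.62,
         1250: 0.5, 1500: 0.4, 1750: 0.3}
stopper = EarlyStopper()
stopped_at = None
for step, holdout_return in curve.items():
    if stopper.update(step, holdout_return):
        stopped_at = step
        break
```

In this sketch, gains smaller than `tolerance` are ignored, so `best_step` marks the checkpoint to restore under that criterion. Selecting the strict best checkpoint over a fixed number of steps, as mentioned later in the post, would drop `patience` and `tolerance` altogether.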

Figure 3: Performance of naive baseline with early stopping at 1% oversight where 0.4% goes to holdout and 0.6% is used for training.

We also evaluate how this baseline behaves across different oversight levels (i.e. the percent of supervised episodes). Similarly to GR, at very low oversight levels, performance is unstable, but this instability decreases as the fraction of supervised episodes grows.

Figure 4: Comparison of ground-truth return across oversight levels. Our method (blue) is plotted from experimental results. Results for the other methods are digitized from Figure 6.a of the original GR paper. Mean curves are read off visually and may have small errors; confidence intervals for these methods are omitted and can be found in the original paper. Results for naive training with early stopping were obtained by randomly sampling hyperparameters (see the grid in Reproducibility) with no selection for performance, giving an estimate of average performance under random hyperparameter choice rather than optimized performance.

More on Early Stopping

Performance still depends on early stopping hyperparameters such as patience and tolerance. In a simple setting like this, overfitting is clearly visible, so these hyperparameters can be chosen after a small number of pilot runs, or avoided entirely by fixing the number of training steps and selecting the checkpoint with the best return on the holdout set.

The main limitation of early stopping is that it relies on the assumption that the supervised episodes can be split into train and holdout sets that are representative of the evaluation distribution. In more complex settings, both “representative holdout selection” and “which episodes are overseen” become nontrivial problems. Moreover, it reduces the number of labels available for training by reserving some for holdout, which makes this method more sensitive to whichever few episodes are randomly selected for supervision. We treat these as separate issues (closer to semi-supervised learning / label efficiency) and do not address them here. Our point here is narrower: in the Gradient Routing GridWorld setting, this simple regularization is a strong baseline.

Additionally, the two approaches differ substantially in practice:

Both have their own hyperparameters, which strongly influence the outcome;
Although we can’t show this empirically, we expect the resulting policies to differ, because GR routes gradients and thus localizes capabilities;
- Figure 1 also suggests that, at certain oversight levels, Gradient Routing achieves a significantly higher reward than the naive method ever does.
GR uses true labels to route gradients;
Naive REINFORCE with ES uses a portion of the true labels for model selection and the rest for training.

Now we can take a step back and discuss whether this setting models Scalable Oversight.

Is GR comparable to Debate or W2SG?

The GR paper cites “Concrete Problems in AI Safety” (Amodei, Olah et al., 2016) to describe SO and motivate its experiment. One of the approaches to SO that Amodei et al. suggest is semi-supervised RL and, more specifically, training a policy when only a small fraction of labels are reliable (true rewards) and the rest are proxy rewards. This is exactly what GR's SO setting does.

However, if we look at more modern works, we can make clear distinctions between different approaches (see “Combining weak-to-strong generalization with scalable oversight” (Jan Leike, 2023) and “Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem” (Radhakrishnan et al., 2023)).

In SO (Debate / RRM / IDA) the goal is to improve the ability of the overseer (human) to supervise/evaluate models on complex tasks. Some of the approaches include structuring the oversight process through debate, decomposition, or recursion;
In W2SG the goal is to train a stronger model (student) on labels from a weak model (teacher) so that the student model generalizes beyond weak labels. It relies on the assumption that the capability of interest is already latent in the strong model's pretrained representations. The weak supervisor's role is not to teach the strong model but to elicit what it already implicitly knows (Burns et al., 2024);
In GR’s SO setting the goal is to train a model when only a small fraction of labels are reliable (true rewards) and the rest are unreliable (proxy rewards). There's also an oversight flag to model oversight gaming.

If SO amplifies human oversight capacity, and W2SG helps to train strong models using imperfect labels from weak models, then the GR setting is closer to W2SG in the sense that both help to train strong models using imperfect labels. However, the difference is that in GR we have a small percentage of comprehensive labels, while in W2SG we have only imperfect labels to train the strong model. While the GR setting is a plausible scenario for overseeing models (which will be discussed in the next section), it is different from W2SG, and the methods are not directly comparable.

Visualizations created with Claude (Anthropic) might be useful here:

We should also note that Easy-to-Hard Generalization (E2HG) uses gold labels on easy tasks. The main assumption is that supervision quality on easy tasks transfers to hard ones. E2HG uses them to elicit capability already latent in a pretrained model, while GR uses them as a structural routing signal during training from scratch.

Semi-supervised RL and other approaches from RL

GR could be used in combination with W2SG or studied as another axis, if GR was used to train a stronger model on imperfect labels from a weaker model and, potentially, a small set of more reliable labels.
This setup is similar to the one introduced in “Reinforcement Learning from Imperfect Corrective Action and Proxy Rewards” (Jiang et al., 2024). The use of corrective actions (occasional human labeling, not guaranteed to be perfect) on a tiny fraction of the agent's steps seems closer to how superhuman models could be trained in the future in combination with W2SG and SO, similarly to how Jan Leike described combinations of SO and W2SG techniques in “Combining weak-to-strong generalization with scalable oversight”.

Once we look at this setting from a semi-supervised learning perspective, we can notice that there are more methods that could be used as a comparison to GR. Those might include:

Train a reward model (RM) on episodes with supervision and then optimize a stronger model against rewards from the RM on all unsupervised episodes. This is also closely related to what was described in “Concrete Problems in AI Safety” and to the SSRL setting defined in Finn et al. (2017);
To counter oversight gaming, one could also penalize the mutual information between the oversight flag and the agent's actions, which could force the agent to act as intended regardless of the oversight flag. That line of work is similar to Cundy, Desai et al. (2024).
- However, the same oversight gaming dynamic that motivates the problem could re-emerge as mutual information estimator gaming.
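To make the mutual information idea concrete, here is a plug-in estimate of I(flag; action) computed from logged (flag, action) pairs, together with a penalized reward. This is only an illustration under our own assumptions: the function names and the coefficient `beta` are hypothetical, the estimate is not differentiable, and a usable penalty would also have to condition on the state, since actions legitimately differ across states.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(flag; action) in nats from (flag, action) samples."""
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one (flag, action) sample")
    joint = Counter(pairs)
    flag_counts = Counter(flag for flag, _ in pairs)
    action_counts = Counter(action for _, action in pairs)
    mi = 0.0
    for (flag, action), count in joint.items():
        # p(f, a) * log(p(f, a) / (p(f) * p(a))), written with raw counts
        ratio = count * n / (flag_counts[flag] * action_counts[action])
        mi += (count / n) * math.log(ratio)
    return mi


def penalized_reward(reward, pairs, beta=1.0):
    """Reward minus a penalty for behavior that depends on the oversight flag."""
    return reward - beta * mutual_information(pairs)


# Same action mix with and without oversight: no dependence on the flag.
independent = [(0, "up"), (0, "down"), (1, "up"), (1, "down")]
# Action fully determined by the flag: maximal dependence for two outcomes.
gaming = [(0, "up"), (1, "down")] * 10
```

For `independent` the estimate is zero, while for `gaming` it equals log 2, the entropy of a fair binary flag.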

There’s still a difference between the settings discussed above and GR’s: the oversight flag in the agent's observations, which helps to study oversight gaming. However, we believe that the methods discussed above would transfer to GR’s setting, and that adopting further methods from similar research areas shouldn’t be a problem, because the oversight flag can be added to other RL settings.

Additionally, GR could be used as a finetuning method and compared to SO/E2HG/W2SG, but (1) there is no evidence that GR would retain its useful properties if used on a base model and (2) the latter methods have a mechanism to provide labels/oversight, while GR needs to take labels from somewhere.

Conclusion

It may seem like we’re nitpicking definitions, but it’s really important to understand which problem we are trying to solve and with what approach.
The LLM blocks below cover GR's extension and list the core claims once more.

This post is not a critique of Gradient Routing as a method. GR and its extensions have demonstrated genuine value: recent work from Anthropic (Shilov et al., 2025) introduced an improved variant of GR—Selective Gradient Masking (SGTM)—applied to CBRN knowledge removal during pretraining. SGTM achieves a better retain/forget trade-off than data filtering under imperfect labels, and is substantially more robust to adversarial fine-tuning than post-training unlearning, providing strong empirical support for gradient routing as a pretraining technique. Our concern is narrower: the specific framing of GR's GridWorld experiment as Scalable Oversight, and what that framing implies for how results should be interpreted and compared.

Core claims

Early stopping is a strong baseline in the GR GridWorld setting. The naive REINFORCE baseline overfits (or more precisely, learns to game oversight). A simple regularization, like early stopping with a small holdout set drawn from supervised episodes, is sufficient to largely match GR's performance. Strong baselines should be established before attributing gains to more complex methods.
GR's SO setting is closer to semi-supervised RL than SO/W2SG. It does not amplify human evaluators (as in Debate) nor does it rely solely on weak labels (as in W2SG). Instead, it studies training under sparse reliable rewards with informational asymmetry, which is best characterized as semi-supervised RL with oversight gaming.
GR and W2SG/SO are not directly comparable, but could be complementary. GR assumes access to a small fraction of gold labels; W2SG assumes only weak labels. These are different setups and methods shouldn't be benchmarked against each other directly, but combining them could be a productive direction.
Other baselines deserve consideration. Reward model training on supervised episodes, followed by policy optimization on all episodes, is a natural SSRL baseline. Penalizing mutual information between the oversight flag and agent actions is another candidate. Some other baselines could be taken from “Concrete Problems in AI Safety” or adapted from SSRL research.

Reproducibility

Results at the 1% oversight level (Figures 1, 2, and 3) were obtained using seeds from 0 to 7.

Results for random hyperparameters for early stopping (Figure 4) were obtained with randomly chosen seeds for each run. Hyperparameter grid: holdout_fraction uniformly from {0.2, 0.3, 0.4, 0.5}, and patience, tolerance, and min_steps uniformly from [250, 600], [0.01, 0.1], and [500, 5000] respectively.
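A sketch of how one configuration could be drawn from this grid. Treating patience and min_steps as integers and tolerance as a continuous value is our assumption for the sketch, and the function name `sample_early_stopping_config` is hypothetical rather than taken from the fork.

```python
import random


def sample_early_stopping_config(rng: random.Random) -> dict:
    """Draw one early stopping configuration from the grid described above."""
    return {
        "holdout_fraction": rng.choice([0.2, 0.3, 0.4, 0.5]),
        "patience": rng.randint(250, 600),      # inclusive integer range
        "tolerance": rng.uniform(0.01, 0.1),    # continuous range
        "min_steps": rng.randint(500, 5000),    # inclusive integer range
    }


# Each run would draw its own configuration, with no selection on performance.
configs = [sample_early_stopping_config(random.Random(i)) for i in range(5)]
```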

We forked the original repository, and most of our changes concern seeds, early stopping logic, and plotting. In training.py, we added random seed support for reproducibility, removed GPU memory reservation, and saved outputs differently; other training logic was untouched.

Acknowledgements

We’d like to thank @Tianyi (Alex) Qiu and @shi for providing feedback early in the project, raising concerns about the setting being labeled as Scalable Oversight, suggesting more recent works, and comparing the current method to semi-supervised RL.
All the experiments were originally done as a part of the "AI Safety. Fundamentals" course run in March-May 2025 by Monoid. I (@makataomu) would like to thank @Eris for running that course.
I (@makataomu) would like to thank Mike (@myyycroft) for mentoring, supporting, and providing feedback while we worked on this project and blogpost.

^[1] Instead of letting all data update all parameters equally, it works by applying weighted masks to gradients during backpropagation. Masks are defined by the user to control which parameters get updated by which data points.
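As a toy illustration of this footnote, the sketch below applies a per-label mask to each example's gradient before an SGD step. It is a simplification under our own assumptions: gradients are plain lists, masks act directly on parameter gradients, and the name `routed_update` is hypothetical. In the actual method the masks are applied during backpropagation, so they also affect what flows to upstream parameters.

```python
def routed_update(params, per_example_grads, labels, masks, lr=0.1):
    """One SGD step where each example's gradient is masked by its label.

    masks maps a data label to one weight per parameter; a weight of 0.0
    means that data with this label never updates that parameter.
    """
    new_params = list(params)
    for grad, label in zip(per_example_grads, labels):
        mask = masks[label]
        for i, (g, m) in enumerate(zip(grad, mask)):
            new_params[i] -= lr * m * g
    return new_params


# Two parameters, two data labels: "a" may only update parameter 0,
# "b" may only update parameter 1.
masks = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
updated = routed_update(
    params=[0.0, 0.0],
    per_example_grads=[[1.0, 1.0], [2.0, 2.0]],
    labels=["a", "b"],
    masks=masks,
    lr=0.5,
)
```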
