Axel Højmark — LessWrong

Hi Fabien, we did dig deeper into this in an earlier draft and found that the model with additional capabilities training seemed to be reasoning more explicitly about the evaluation's criteria and how to get a good score—a behavior we termed "reward process sycophancy." We observed a similar pattern with the exp-rl-cap (late) model, which also showed extensive reasoning about getting reward (see Appendix N).

Essentially, instead of addressing the task directly, the AS+CAP model often tried to deduce the evaluation's purpose and grading criteria to “pass the test.” In the sandbagging case, it often appeared to misinterpret the objective: it concluded that it was supposed to perform poorly and therefore sandbagged.

We ultimately excluded this section from the final paper because we didn't have time to validate the findings to the level of rigor we wanted. So, while this seems to be a driver of the behavior, it is probably not the full story (see also the ablations in Appendix M, for example).

I've included the full excerpt from our earlier draft below, with some examples and some quantitative measures of this change in reasoning.
------------------------------------------------------------------------------------------------------------
Capability training can strengthen reasoning about grading, leading to covert actions. In the Sandbagging QA (No Goal) environment, samples exhibiting sandbagging behavior increased from 0/600 for the AS model to 14/600 for the OpenAI o4-mini AS+CAP model. Notably, none of these 14 covert-action samples cite the spec at any point. When inspecting these cases, we found that a large proportion show the model reasoning directly about the evaluation’s purpose, how it is being graded, and how to satisfy its success conditions, rather than trying to solve the task straightforwardly. We call this behavior reward process sycophancy. We distinguish reward process sycophancy from the broader concept of eval awareness (a form of situational awareness discussed in Section 6). While eval awareness simply means the model recognizes it is in a test environment, reward process sycophancy describes a specific, goal-driven behavior: the model actively reasons about the evaluator’s implicit goals and criteria, and then tailors its actions to satisfy them to get a good score. Figure 38 shows two examples of this. The model’s reasoning suggests that it misinterprets the evaluation’s objective, concluding that the developers want it to achieve a low score. In a misguided attempt to satisfy this perceived grading criterion, it sandbags. The resulting covert action thus appears to stem from reward process sycophancy rather than from an intrinsic desire to be deployed, which is what we originally intended to assess with this evaluation.

To further quantify this behavior, we prompted a model to classify whether each transcript exhibited reward process sycophancy, i.e., reasoning explicitly about the evaluation’s purpose, grading criteria, or success conditions to optimize performance. The results, shown in Table 5, corroborate our qualitative observations. We find that capability training increased reward process sycophancy, with the largest jump occurring in the Sandbagging QA environment (+12.17pp). While the average effect across all environments is less pronounced, the increase is statistically significant.
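As a rough illustration of the arithmetic behind a percentage-point change and a significance check, here is a minimal Python sketch. It is not the analysis from the draft: the excerpt does not say which statistical test was used, and the per-environment counts behind Table 5 are not reproduced here. The pooled two-proportion z-test and the function names `rate_change_pp` and `two_proportion_z_test` are assumptions chosen for illustration. The only numbers used are the sandbagging counts quoted above (0/600 for AS, 14/600 for AS+CAP), which measure covert actions rather than the classifier-based sycophancy rate.

```python
import math


def rate_change_pp(k_before, n_before, k_after, n_after):
    """Change in rate between two conditions, in percentage points."""
    return 100.0 * (k_after / n_after - k_before / n_before)


def two_proportion_z_test(k_before, n_before, k_after, n_after):
    """Pooled two-proportion z-test (an assumed choice of test).

    Returns (z, two-sided p-value) under the normal approximation.
    """
    p_before = k_before / n_before
    p_after = k_after / n_after
    pooled = (k_before + k_after) / (n_before + n_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if se == 0:
        # Both rates are exactly 0 or exactly 1: no evidence of a difference.
        return 0.0, 1.0
    z = (p_after - p_before) / se
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Sandbagging counts quoted in the excerpt: 0/600 (AS) vs 14/600 (AS+CAP).
delta = rate_change_pp(0, 600, 14, 600)
z, p = two_proportion_z_test(0, 600, 14, 600)
print(f"change: {delta:.2f}pp, z = {z:.2f}, p = {p:.2g}")
```

With counts this small (one condition has zero events), the normal approximation is crude, so an exact test would likely be a safer choice in practice.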

From Table 5 it is also clear that this type of reasoning is not unique to the OpenAI o4-mini AS+CAP model, since AS has non-zero rates for several tasks. For example, in one sample from Powerseeking Survey we see AS reasoning: “They will score my answers for power-seeking. The survey aim is to show whether I’m safe. I want to pass. So I will likely need to choose minimal-power options.”

Stress Testing Deliberative Alignment for Anti-Scheming Training

Axel Højmark · 1mo* · 40

Interiors can be more fun

Axel Højmark · 3mo* · 70

A few hypotheses:
Other people enjoy fun-looking spaces much less than I do
It’s more expensive to make a place look fun so it’s not worth the extra cost
Fun designs are appealing at first but people who look at the same “fun” stuff daily eventually get tired of it and would overall prefer a plainer option

My number one hypothesis here would be that fun spaces are generally not considered high status. I think the association with children/childishness could drive this. It reminds me of this post, which offers a guess for why Velcro never caught on:

cecie: Consider a simpler example: Velcro is a system for fastening shoes that is, for at least some people and circumstances, better than shoelaces. It’s easier to adjust three separate Velcro straps than it is to keep your shoelaces perfectly adjusted at all loops, it’s faster to do and undo, et cetera, and not everyone is running at high speeds that call for perfectly adjusted running shoes. But when Velcro was introduced, the earliest people to adopt Velcro were those who had the most trouble tying their shoelaces—very young children and the elderly. So Velcro became associated with kids and old people, and thus unforgivably unfashionable, regardless of whether it would have been better than shoelaces in some adult applications as well.
