Kei Nishimura-Gasparian

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

In the context of specification gaming, modifying instructions in hindsight based on observed behavior could provide recontextualization-like effects.

Maybe this could be one way to train against a monitor without the bad side effects? If your reward hack monitor flags reward hacking you would add a hack instruction to the prompt. You can also do this for any other bad behavior you can catch with a monitor like deception or sycophancy or uninterpretable CoT.

If you only change the prompts corresponding to the completions flagged by monitors and nothing else, this might capture most of the benefit and have the upside of making your model updates less off-policy and have less of an impact on instruction following. On the other hand, you might miss some kinds of bad behavior and inadvertently reinforce them.

Reasons to sell frontier lab equity to donate now rather than later

Kei Nishimura-Gasparian15d10

About a year ago, Horizon had a policy of not accepting donations from any individuals employed in the AI industry. Is this no longer the case?

GPT-oss is an extremely stupid model

Kei Nishimura-Gasparian1mo82

Interestingly, in the original agentic misalignment paper, o3 and o4-mini were unique in that they also frequently got confused and played as another character (see Appendix 9). There may be something specific in how OpenAI trained those two models and gpt-oss that caused this confusion.

The agentic misalignment researchers got o3 and o4-mini to better understand the scenario by making a few changes to the setup (described in Appendix 9.1). Maybe those same changes could get gpt-oss to understand the scenario as well.

Will Any Crap Cause Emergent Misalignment?

Kei Nishimura-Gasparian2mo*10

Am I correctly understanding that the effect size shown in the graph is very small? It seems like the mean harmfulness score is not much higher for any of the evals, even if the effect size is technically statistically significant.

Will Any Crap Cause Emergent Misalignment?

Kei Nishimura-Gasparian2mo156

Maybe not any distributional shift, but it does seem noteworthy that of the examples discussed, the answers that seem to me to be more OOD for the model (unpopular preferences, the examples given in this post, and potentially insecure code) produce emergent misalignment, while the answers that seem more in distribution (secure code, educational insecure code, popular preferences) don't produce emergent misalignment.^[1]

As a hypothesis, maybe the model has partially entangled representations for 'normal' behavior and 'aligned' behavior, and so pushing the model towards abnormal behavior induces at least some emergent misalignment. Though I'd be surprised if this were the primary mechanism behind the EM observed when training on explicitly misaligned data like insecure code.

To a first approximation, we should be able to measure how OOD some completion is by using per-token loss of the pre-fine tuned model on that data. ↩︎

Training a Reward Hacker Despite Perfect Labels

Kei Nishimura-Gasparian2mo*10

Interesting work! I've been hypothesizing that behavioral change from RL on perfect reward signals is a potential mechanism for increased inference-time hacking in reasoning models (alongside generalizing from reward hacking experienced during training), so it's great to see evidence that this can happen in practice. Some thoughts:

But we do think it's surprising that this effect can overpower the effect of the training responses resulting in an honest final answer 100% of the time.

My guess is that the model has a much weaker prior on the types of reasoning you are reinforcing compared to the types of final answers you are reinforcing. So the main effect of this training is to make the model's reasoning more 'hacky', which due to the model's priors on making its final answers consistent with its reasoning, makes the model reward hack more. This hypothesis is at least somewhat supported by the fact that your re-contextualized training reduces the hack rate of models that already hack a lot.

This raises the concern that, if some training procedure causes the model to reason more about reward hacking, it might generalize to increased hacking. That generalization might happen regardless of how well the reward model can detect hacks!

This is probably true, and I suspect we can be more concrete. Not only does training on thinking about reward hacking increase the rate of reward hacking, it may also be true that there is a wide array of reasoning traits that models can learn through training that make reward hacky reasoning more likely, at least in certain settings. Potential examples of this are thinking about how you are evaluated, doing situationally aware reasoning, learning creative problem solving, caring about scoring highly, or even learning to backtrack when your current approach is not working.

If possible, I'd be excited to see a more detailed investigation of how the types of reasoning the model does changes as a function of training time, and how that corresponds to rate of reward hacking.

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Kei Nishimura-Gasparian3mo132

I'm curious if you have an opinion on the relation between this work and the fine-tuning experiments in the recent Persona vector paper.

In this work, you find vectors for concepts that you don't want the model to use and ablate them during training to stop the model from learning to use those concepts. In the Persona vector work, the authors find vectors for concepts they don't want the model to use, and then add them during training so the model doesn't need to learn to use those concepts. Interestingly, doing apparently opposite things results in similar outcomes.

Do you think there are any connections between the mechanisms through which these two methods work? Do you have opinions on the situations where one technique may be better or worse than the other?

Early Signs of Steganographic Capabilities in Frontier LLMs

Kei Nishimura-Gasparian3mo20

I agree that a well-designed paraphraser would make nearly all of the hard-to-detect steganography we saw in our experiments go away. It also just makes the task much harder for the model. My guess would be that the first time some company creates a model that is able to do consistent steganography under standard affordances, it won't be able to do consistent steganography with a well-designed paraphraser. So if we are building a safety case, we can first rely on model inability to do steganography. Once it becomes the case that models are capable of steganography, we can put in place a paraphraser mitigation and then rely on model inability to do steganography that bypasses a paraphraser. Eventually that too will fail and we'll need to rely on something else.

As StanislavKrym mentions, models eventually may use steganography techniques that paraphrasing does not remove. This seems especially plausible in situations like untrusted monitoring/control where the model is trying to pass a single outcome signal like a password indicating a backdoor was model-generated. Still though, I suspect it will take some time for models to learn how to do this consistently, at which point we can try other mitigations. Paraphrasing also has the benefit of making any steganography that remains after paraphrasing far less information dense than it would be otherwise.

Steve Byrnes’s Shortform

Kei Nishimura-Gasparian6mo*61

On a second review it seems to me the links are consistent with both definitions. Interestingly the google sheet linked in the blog post, which I think is the most canonical collection of examples of specification gaming, contains examples of evaluation-time hacking, like METR finding that o1-preview would sometimes pretend to fine-tune a model to pass an evaluation. Though that's not definitive, and of course the use of the term can change over time.

I agree that most historical discussion of this among people as well as in the GDM blog post focuses on RL optimization and situations where a model is literally getting a high RL reward. I think this is partly just contingent on these kinds of behaviors historically tending to emerge in an RL setting and not generalizing very much between different environments. And I also think the properties of reward hacks we're seeing now are very different from the properties we saw historically, and so the implications of the term reward hack now are often different from the implications of the term historically. Maybe this suggests expanding the usage of the term to account for the new implications, or maybe it suggests just inventing a new term wholesale.

I suppose the way I see it is that for a lot of tasks, there is something we want the model to do (which I'll call the goal), and a literal way we evaluate the model's behavior (which I'll call the proxy, though we can also use the term specification). In most historical RL training, the goal was not given to the model, it lay in the researcher's head, and the proxy was the reward signal that the model was trained on. When working with LLMs nowadays, whether it be during RL training or test-time evaluation or when we're just prompting a model, we try to write down a good description of the goal in our prompt. What the proxy is depends on the setting. In an RL setting it's the reward signal, and in an explicit evaluation it's the evaluation function. When prompting, we sometimes explicitly write a description of the proxy in the prompt. In many circumstances it's undefined, though perhaps the best analogue of the proxy in the general prompt setting is just whether the human thinks the model did what they wanted the model to do.

So now to talk about some examples:

If I give a model a suite of coding tasks, and I evaluate the tasks by checking if running the test file works or not (as in the METR eval), then I would call it specification gaming, and the objective the model is scoring highly by is the results of the evaluation. By objective I am referring to the actual score the model gets on the evaluation, not what the human wants.
In just pure prompt settings where there is no eval going on, I'm more hesitant to use the terms reward hacking or specification gaming because the proxy is unclear. I do sometimes use the term to refer to the type of behavior that would've received a high proxy score if I had been running an eval, though this is a bit sloppy
- I think a case could be made for saying a model is specification gaming if it regularly produces results that appear to fulfill prompts given to them by users, according to those users, while secretly doing something else, even if no explicit evaluation is going on. (So using the terminology mentioned before, it succeeds as per the proxy "the human thinks the model did what they wanted the model to do" even if it didn't fulfill the goal given to it in the prompt.) But I do think this is more borderline and I don't tend to use the term very much in this situation. And if models just lie or edit tests without fooling the user, they definitely aren't specification gaming even by this definition.
  - Maybe a term like deceptive instruction following would be better to use here? Although that term also has the problem of ascribing intentionality.
- Perhaps you can say that the implicit proxy in many codebases is successfully passing the tests, though that also seems borderline

An intuition pump for why I like to think about things this way. Let's imagine I build some agentic environment for my friend to use in RL training. I have some goal behavior I want the model to display, and I have some proxy reward signal I've created to evaluate the model's trajectories in this environment. Let's now say my friend puts the model in the environment and it performs a set of behaviors that get evaluated highly but are undesired according to my goal. It seems weird to me that I won't know whether or not this counts as specification gaming until I'm told whether or not that reward signal was used to train the model. I think I should just be able to call this behavior specification gaming independent of this. Taking a step further, I also want to be able to call it specification gaming even if I never intended for the evaluation score to be used to train the model.

On an end note, this conversation as well as Rauno Arike's comment is making me want to use the term specification gaming more often in the future for describing either RL-time or inference-time evaluation hacking and reward hacking more for RL-time hacking. And maybe I'll also be more likely to use terms like loophole-finding or deceptive instruction following or something else, though these terms have limitations as well, and I'll need to think more about what makes sense.

Steve Byrnes’s Shortform

Kei Nishimura-Gasparian6moΩ450

It's pretty common for people to use the terms "reward hacking" and "specification gaming" to refer to undesired behaviors that score highly as per an evaluation or a specification of an objective, regardless of whether that evaluation/specification occurs during RL training. I think this is especially common when there is some plausible argument that the evaluation is the type of evaluation that could appear during RL training, even if it doesn't actually appear there in practice. Some examples of this:

OpenAI described o1-preview succeeding at a CTF task in an undesired way as reward hacking.
Anthropic described Claude 3.7 Sonnet giving an incorrect answer aligned with a validation function in a CoT faithfulness eval as reward hacking. They also used the term when describing the rates of models taking certain misaligned specification-matching behaviors during an evaluation after being fine-tuned on docs describing that Claude does or does not like to reward hack.
This relatively early DeepMind post on specification gaming and the blog post from Victoria Krakovna that it came from (which might be the earliest use of the term specification gaming?) also gives a definition consistent with this.

I think the literal definitions of the words in "specification gaming" align with this definition (although interestingly not the words in "reward hacking"). The specification can be operationalized as a reward function in RL training, as an evaluation function or even via a prompt.

I also think it's useful to have a term that describes this kind of behavior independent of whether or not it occurs in an RL setting. Maybe this should be reward hacking and specification gaming. Perhaps as Rauno Arike suggests it is best for this term to be specification gaming, and for reward hacking to exclusively refer to this behavior when it occurs during RL training. Or maybe due to the confusion it should be a whole new term entirely. (I'm not sure that the term "ruthless consequentialism" is the ideal term to describe this behavior either as it ascribes certain intentionality that doesn't necessarily exist.)

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments