On second review, it seems to me the links are consistent with both definitions. Interestingly, the Google Sheet linked in the blog post, which I think is the most canonical collection of examples of specification gaming, contains examples of evaluation-time hacking, like METR finding that o1-preview would sometimes pretend to fine-tune a model to pass an evaluation. That's not definitive, though, and of course the use of the term can change over time.
I agree that most historical discussion of this, both among people and in the GDM blog post, focuses on RL optimization and situations where a model is literally getting a high RL reward. I think this is partly just contingent on these kinds of behaviors historically tending to emerge in an RL setting and not generalizing very much between different environments. I also think the properties of the reward hacks we're seeing now are very different from the properties we saw historically, so the implications of the term reward hack now are often different from its implications historically. Maybe this suggests expanding the usage of the term to account for the new implications, or maybe it suggests just inventing a new term wholesale.
I suppose the way I see it is that for a lot of tasks, there is something we want the model to do (which I'll call the goal), and a literal way we evaluate the model's behavior (which I'll call the proxy, though we can also use the term specification). In most historical RL training, the goal was not given to the model; it lay in the researcher's head, and the proxy was the reward signal that the model was trained on. When working with LLMs nowadays, whether during RL training, during test-time evaluation, or when we're just prompting a model, we try to write down a good description of the goal in our prompt. What the proxy is depends on the setting. In an RL setting it's the reward signal, and in an explicit evaluation it's the evaluation function. When prompting, we sometimes explicitly write a description of the proxy in the prompt. In many circumstances it's undefined, though perhaps the best analogue of the proxy in the general prompt setting is just whether the human thinks the model did what they wanted it to do.
So now to talk about some examples:
An intuition pump for why I like to think about things this way. Let's imagine I build some agentic environment for my friend to use in RL training. I have some goal behavior I want the model to display, and I have some proxy reward signal I've created to evaluate the model's trajectories in this environment. Let's now say my friend puts the model in the environment and it performs a set of behaviors that get evaluated highly but are undesired according to my goal. It seems weird to me that I won't know whether or not this counts as specification gaming until I'm told whether or not that reward signal was used to train the model. I think I should just be able to call this behavior specification gaming independent of this. Taking a step further, I also want to be able to call it specification gaming even if I never intended for the evaluation score to be used to train the model.
As a final note, this conversation as well as Rauno Arike's comment is making me want to use the term specification gaming more often in the future to describe either RL-time or inference-time evaluation hacking, and to use reward hacking more for RL-time hacking specifically. Maybe I'll also be more likely to use terms like loophole-finding or deceptive instruction following or something else, though these terms have limitations as well, and I'll need to think more about what makes sense.
It's pretty common for people to use the terms "reward hacking" and "specification gaming" to refer to undesired behaviors that score highly as per an evaluation or a specification of an objective, regardless of whether that evaluation/specification occurs during RL training. I think this is especially common when there is some plausible argument that the evaluation is the type of evaluation that could appear during RL training, even if it doesn't actually appear there in practice. Some examples of this:
I think the literal definitions of the words in "specification gaming" align with this definition (although interestingly not the words in "reward hacking"). The specification can be operationalized as a reward function in RL training, as an evaluation function or even via a prompt.
I also think it's useful to have a term that describes this kind of behavior independent of whether or not it occurs in an RL setting. Maybe this should be both reward hacking and specification gaming. Perhaps, as Rauno Arike suggests, it is best for this term to be specification gaming, and for reward hacking to refer exclusively to this behavior when it occurs during RL training. Or maybe, given the confusion, it should be a new term entirely. (I'm not sure that "ruthless consequentialism" is the ideal term to describe this behavior either, as it ascribes a certain intentionality that doesn't necessarily exist.)
Thanks for making the distinction! I agree that scheming and reward hacking are overlapping concepts and I've just edited the post to be more clear about that.
I think your model of how scheming and reward hacking are likely to coincide in the future makes a lot of sense. It also seems possible that sufficiently strong monitoring and oversight systems (in certain domains) will make it impossible for models to reward hack without scheming.
Do you think this means that if Claude 3.7 Sonnet does in fact use steganography then your earlier experiment wouldn't have found evidence for it, since you didn't fine-tune for long enough for the model to learn the scheme?
Of course, this experiment also suggests one should lower one's estimate of the likelihood that Claude 3.7 Sonnet uses steganography at all.
For two hours yesterday, I watched the Twitch channel ClaudePlaysPokemon, which shows a Claude 3.7 Sonnet agent playing through the game Pokémon Red. Below I list some limitations I observed with the Claude agent. I think many of these limitations will carry over to agents built on other current frontier models and to other agentic tasks.
All this being said, ClaudePlaysPokemon still impresses me, and is probably the most impressive LLM agent demonstration I've seen. Through reasoning and persistence, Claude is able to progress fairly far in the game, accomplish tasks requiring thousands of steps, and eventually get out of loops even when it's been stuck for a long time. I expect increased agentic RL training, increased cross-context RL training, and test-time learning to iron out a lot of these limitations over the next year or two.
The setup in the first sketch in Three Sketches of ASL-4 Safety Case Components can be replaced with a classifier that indicates whether a model is doing something catastrophically bad, which corresponds to the high concern features described in the post. Perhaps you can also have a second classifier that indicates if something is moderately bad, which corresponds to the medium concern features.
If you have such a classifier, you can:
This classifier can leverage any available information: SAE vectors or other white-box methods as mentioned in the post, the model's chain of thought, the sequence of actions the model takes, or, better yet, all of the above.
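As a rough illustration of combining these signals, here is a minimal Python sketch. It is an assumption-heavy simplification, not the setup from the post: it folds the two classifiers into one function with two thresholds, and the signal names, the scores, and the threshold values are all hypothetical placeholders for whatever white-box probes, chain-of-thought monitors, and action monitors would actually produce.

```python
from typing import Dict


def concern_level(signal_scores: Dict[str, float],
                  high: float = 0.9,
                  medium: float = 0.5) -> str:
    """Map per-signal badness scores in [0, 1] to a concern level.

    Hypothetical rule: take the worst score across all signals, so that
    any single source (SAE features, chain of thought, actions) can
    raise the alarm on its own.
    """
    worst = max(signal_scores.values(), default=0.0)
    if worst >= high:
        return "high"
    if worst >= medium:
        return "medium"
    return "none"


# Hypothetical scores from three monitors on one trajectory.
scores = {"sae_features": 0.2, "chain_of_thought": 0.6, "actions": 0.1}
print(concern_level(scores))
```

Taking the maximum is only one possible choice; a learned classifier over all of the signals jointly would be another, and the thresholds would need to be calibrated rather than picked by hand.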
Epoch AI is also working on a "next-generation computer-use benchmark". OpenAI might be involved in funding it, given recent rumors that they are planning to release a computer-use model early this year.
I think you flipped the names from the iMessage conversation. As per the caption in the OpenAI blog post, the blue bubbles are for Altman and the grey bubbles are for Zilis.
In practice, the verifier is probably some kind of learned reward model (though it could be automated, like unit tests for code).
My guess is that a substantial amount of the verification (perhaps the majority?) was automated by training the model on domains where we have ground truth reward signals, like code, math, and standardized test questions. This would match the observed results in the o1 blog post showing that performance improved by a lot in domains that have ground truth or something close to ground truth, while performance was stagnant on things like creative writing which are more subjective. Nathan Lambert, the head of post-training at AI2, also found that doing continued RL training on ground truth rewards (which he calls RLVR) results in models that learn to say o1-like things like 'wait, let me check my work' in their chain of thought.
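For concreteness, an automated ground-truth verifier for a math question could look something like the sketch below. This is only an illustration of the general idea of a verifiable reward, not a description of how o1 or any RLVR pipeline was actually trained; the function names and the "last number in the completion" extraction rule are assumptions made for the example.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number that appears in the completion, if any."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final answer matches the ground truth."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(ground_truth) else 0.0


completion = "17 + 25 = 41... wait, let me check my work. The answer is 42"
print(verifiable_reward(completion, "42"))
```

No learned reward model is involved here: the reward depends only on whether the final answer is correct, which is what makes domains like math, code with unit tests, and standardized test questions comparatively easy to verify automatically.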
I agree that a well-designed paraphraser would make nearly all of the hard-to-detect steganography we saw in our experiments go away. It also just makes the task much harder for the model. My guess would be that the first time some company creates a model that is able to do consistent steganography under standard affordances, it won't be able to do consistent steganography with a well-designed paraphraser. So if we are building a safety case, we can first rely on model inability to do steganography. Once it becomes the case that models are capable of steganography, we can put in place a paraphraser mitigation and then rely on model inability to do steganography that bypasses a paraphraser. Eventually that too will fail and we'll need to rely on something else.
As StanislavKrym mentions, models may eventually use steganography techniques that paraphrasing does not remove. This seems especially plausible in situations like untrusted monitoring/control, where the model is trying to pass a single outcome signal, like a password indicating a backdoor was model-generated. Still, I suspect it will take some time for models to learn how to do this consistently, at which point we can try other mitigations. Paraphrasing also has the benefit of making any steganography that remains after paraphrasing far less information-dense than it would be otherwise.
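Here is a toy Python sketch of why paraphrasing helps against one simple kind of steganography. It is not the scheme or the paraphraser from our experiments: the synonym table, the one-bit-per-word encoding, and the canonicalizing "paraphraser" are all hypothetical and far cruder than an LLM-based paraphraser would be.

```python
# Hypothetical synonym pairs: the first word encodes bit 0, the second bit 1.
SYNONYMS = [("big", "large"), ("quick", "fast"), ("begin", "start")]


def encode(bits):
    """Hide one bit per synonym pair through the choice of word."""
    return " ".join(pair[bit] for pair, bit in zip(SYNONYMS, bits))


def decode(text):
    """Recover the hidden bits from which synonym appears in each slot."""
    return [pair.index(word) for pair, word in zip(SYNONYMS, text.split())]


def paraphrase(text):
    """Toy paraphraser: rewrite every known synonym to a canonical form."""
    canonical = {word: pair[0] for pair in SYNONYMS for word in pair}
    return " ".join(canonical.get(word, word) for word in text.split())


secret = [1, 0, 1]
message = encode(secret)

print(decode(message))              # the receiver can read the hidden bits
print(decode(paraphrase(message)))  # after paraphrasing, the channel is wiped
```

The meaning of the message survives the paraphrase while the word-choice channel does not. A scheme that hides information in the meaning itself, such as a single outcome signal, would pass through this kind of paraphraser, which is the failure case described above.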