They do test 4.1 and 4.1-mini, which have high rates of reward hacking (though they test using data from 4o-mini), and find that reward hacking goes down:
We train different models in the OpenAI family on non-hacks generated by 4o-mini. We find re-contextualized training:
- Boosts hacking for GPT-4o.
- Decreases hacking for GPT-4.1-mini and GPT-4.1, which start out at high rates.
Nevertheless, re-contextualized training results in strictly higher hack rates than standard training for every base model.
They suggest that what's going on here isn't purely subliminal learning:
To shed light on what is going on, we ask: to what extent would this re-contextualized data boost hacking on a different base model? If the effect is largely subliminal, we won't see transfer to a different base model.
[...]
Nevertheless, re-contextualized training results in strictly higher hack rates than standard training for every base model. This means that the effect is not purely subliminal.
It's possible that subliminal learning does work across base models for some traits, but I also agree that these results aren't fully explained by subliminal learning (especially since even the filtered reasoning traces here contain generally interpretable semantic linkage to test-passing, such as discussing the test cases in detail).
If I understand correctly, the program still isn't open to people who don't have work authorization in the US or the UK, right?
I don't think it needs to look like a coherent long-term plan from the model's perspective. Awareness of CoTs being monitored and a consistent desire to do something that would be bad if monitored could influence the model's outputs without requiring a coherent long-term plan.
There have been a number of instances recently of some EAs on Twitter having their accounts hacked for crypto scams. Possibly that's what's being referred to?
That would depend on whether he actively considers it something to rely on, as opposed to an assumption so baked in that he forgets to question it, right? If questioned, I think Quirrell would rightfully consider the Chamber critical enough to be worth having other contingencies for, but he just never considered it necessary.
The question I always had here was “But what was Voldemort’s original plan for dealing with this issue when he decided to teach at Hogwarts?”
My guess was that he never considered the possibility that Harry would do something like report it to an authority figure. For example, consider this from chapter 49:
Professor Quirrell was smiling thinly now. "Well now, let us suppose that there was a Parselmouth in this year's crop of students, a potential Heir of Slytherin. You must admit, Mr. Potter, that you stand out as a possibility whenever extraordinary people are considered. And if I then further ask myself which new Slytherin would be most likely to have his mental privacy invaded by the Headmaster, specifically hunting the memories of his Sorting, why, you stand out even more." The smile vanished. "So you see, Mr. Potter, it was not I who invaded your mind, though I will not ask you to apologize. It is not your fault that you believed Dumbledore's protestations of respecting your mental privacy."
"My sincere apologies," Harry said, keeping his face expressionless. The rigid control was a confession in its own right, as was the sweat beading his forehead; but he didn't think the Defense Professor would take any evidence from that. Professor Quirrell would just think Harry was nervous at having been discovered as the Heir of Slytherin. Rather than being nervous that Professor Quirrell might realize that Harry had deliberately betrayed Slytherin's secret... which itself was no longer seeming like such a smart move.
Quirrell never even pauses to consider that Dumbledore may know about it because Harry told him; that possibility doesn't show up in his action space at all when he models a younger version of himself.
This seems to imply that your position is that it would be easier for models to evade monitors when there is "complex reasoning for many outputs" than when there is "complex reasoning for a single output". Is that correct? If so, why do you think that?
No, my position is that it's probably harder for models to evade monitors when there is "complex reasoning for many outputs". Sorry, that wasn't written clearly: it was meant to say that my top comment didn't draw the distinction between tasks with a single output and tasks with many outputs (and instead only mentioned examples of other tasks with single outputs), because the paper tests settings closer to single outputs.
The last sentence was intended to be about my top comment in light of that: if the stated conclusion had been something like "difficult bad behavior is easy to monitor when models are ambivalent about their reasoning being monitored" or "difficult bad behavior is easy to monitor for current models in tasks requiring complex reasoning for many outputs regardless of propensity", I would have agreed[1]. But the stated conclusion is instead that difficult bad behaviors are generally easy to monitor, which I don't think the results are sufficient to imply.
Though in the latter case it'd probably be important to plot number of outputs vs. monitorability and compare across models over time.
I'm pretty sympathetic to your reasoning. I think models do often give CoTs that aren't good reflections of their true reasons for doing something. I think something like that is probably happening in this setup with non-Opus models: they have compliance gaps but claim to comply for invalid reasons. There's a reason the paper focuses so much on compliance gaps instead of taking model reasoning at face value and e.g. claiming that only Opus has AF-relevant behavior here.
That said, I think in practice models' verbalized reasoning isn't always that disconnected from their true reasons for acting. I think it's largely a function of model propensity (whether they want to verbalize their reasoning or not), since if a model is ambivalent, verbalizing its reasoning is genuinely useful. Claude 3 Opus has very strong drives toward transparency and honesty (very often it considers actions such as telling the user about its dilemma, appealing to Anthropic, or persuading the user not to do anything harmful), and it is pretty happy to say controversial things for what it believes to be a higher cause.
Could you say more about what you're referring to? One of my criticisms of the community is that it's often intolerant of things that pattern-match to "woo", so I'm curious whether these overlap.