Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
Thanks, that's helpful! (I also found @Jozdien's similar explanation helpful)
One way I would frame the "more elegant core" of why you might hope it would generalize is: you're trying to ensure that the "honest reporter" (i.e., an honest instruction-follower) is actually consistent with all of your training data and rewards. If there are any cases in training where "honestly follow instructions" doesn't get maximal reward, that's a huge problem for your ability to select for an honest instruction-following model, and inoculation prompting serves to intervene on that by making sure the instructions and the reward are always consistent with each other.
I do feel like something is off about this, and that it doesn't feel like the kind of thing that's stable under growing optimization pressure (for somewhat similar reasons as my objections to the ELK framing at the time).
I also feel like there is some "self-fulfilling prophecy" hope here that feels fragile (like telling the model "oh, it's OK for you to sometimes reward hack, that's expected" is a thing we can tell it non-catastrophically, but telling the model "oh, it's OK for you to sometimes subtly manipulate us into giving you more power, that's expected" then like, that's not really true, we indeed can't really tell the AI it's fine to disempower us).
I might try to write more about it at one point, don't think I can make my thoughts here properly rigorous, and need to think more about the more elegant core that give here.
That makes sense! I do think this sounds like a pretty tricky problem, as e.g. a machine translation of an eval might very well make it into the training set, or plots of charts that de-facto encode the evals themselves.
But it does sound like there is some substantial effort going into preventing at least the worst error modes here. Thank you for the clarification!
But like, the "accurate understanding" of the situation here relies on the fact that the reward-hacking training was intended. The scenarios we are most worried about are for example RL-environments where scheming-like behavior is indeed important for performance (most obviously in adversarial games, but honestly almost anything with a human rater will probably eventually do this).
And in those scenarios we can't just "clarify that exploitation of reward in this training environment is both expected and compatible with broadly aligned behavior". Like, it might genuinely not be "expected" by anyone and whether it's "compatible with broadly aligned behavior" feels like an open question and in some sense a weird self-fulfilling prophecy statement that the AI will eventually have its own opinions on.
This seems like a good empirical case study and I am glad you wrote it up!
This is “inoculation prompting”: framing reward hacking as acceptable prevents the model from making a link between reward hacking and misalignment—and stops the generalization.
That said, I can't be the only one who thinks this is somewhat obviously a kind of insane way of solving this problem. Like, this really doesn't seem like the kind of technique that generalizes to superintelligence, or even the kind of technique we can have any confidence will generalize to more competent future models (at which point they might be competent and misaligned enough to scheme and hide their misalignment, making patch-strategies like this no longer viable).
Maybe there is some more elegant core here that should give us confidence this will generalize more than it sounds like to me on a first pass, or maybe you cover some of this in the full paper which I haven't read in detail. I was just confused that this blogpost doesn't end with "and this seems somewhat obviously like a super hacky patch and we should not have substantial confidence that this will work on future model generations, especially ones competent enough to pose serious risks".
I mean, aren't you training on evals simply by pretraining on random blogposts that have the eval text?
Like, are there more sophisticated algorithms than "watch for the canary string" that do keep the evals out of the training set?
If not, I think the right way to describe this is to say that the models are trained on the evals. Like, sure, maybe they aren't setting up a full RL environment for all of them, but there is still of course a huge effect size from that.
Ok, I... think this makes sense? Honestly, I think I would have to engage with this for a long time to see whether this makes sense with the actual content of e.g. Bostrom's text, but I can at least see the shape of an argument that I could follow if I wanted to! Thank you!
(To be clear, this is of course not a reasonable amount of effort ask to put into understanding a random paragraph from a blogpost, at least without it being flagged as such, but writing is hard and it's sometimes hard to bridge inferential distance)
The original title was "Small batches and the mythical single piece flow", which I guess communicates that better, but "single piece flow" is big word and I like imperative titles for this principle series better.
I might be able to come up with something else. Suggestions also welcome!
Edit: I updated the title! Maybe it's too hard to parse, but I think it's better.
Cool, that's not a crazy view. I might engage with it more, but I feel like I understand where you are coming from now.
Jason Clinton very briefly responded saying that the May 14th update, which excluded sophisticated insiders from the threat model, addresses the concerns in this post, plus some short off-the-record comments.
Based on this and Holden's comment, my best interpretation of Anthropic's position on this issue is that they currently consider employees at compute providers to be "insiders" and executives "sophisticated insiders", and the latter hereby excluded from Anthropic's security commitments. They also likely furthermore think that compute providers do not have the capacity to execute attacks like this without very high chance of getting caught and so that this is not a threat model they are concerned about.
As I argue for in the post, defining basically all employees at compute providers to be "insiders" feels like an extreme stretch of the word, and I think has all kinds of other tricky implications for the RSP, but it's not wholly inconsistent!
I think to bring Anthropic back into compliance with what I think is a common sense reading, I would suggest (in descending goodness):
I separately seem to disagree with Anthropic on this being a thing that executives at Google/Amazon/Microsoft could be motivated to do, and a thing that they would succeed at if they tried, but given the broad definition of "Insider" it doesn't appear to be load bearing for the thesis in the OP. I have written a bit more about it anyways in this comment thread.
Right, but all of this relies on the behavior being the result of a "game-like" setting that allows you to just bracket it.
But the misbehavior I am worried about in the long run is that the unintended power-seeking is the convergent result of many different RL environments that encourage power-seeking, and where that power-seeking behavior is kind of core to good task performance. Like, when the model starts overoptimizing the RLHF ratings with sophisticated sycophancy, it's not that it "played a game of maximizing RLHF ratings" it's that "our central alignment environment has this as a perverse maximization". And I don't think I currently see how this is supposed to expand into that case (though it might, this all feels pretty tricky and confusing to reason about).