Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
Go to the library page, scroll down until you hit "community sequences" and then press the button there.
Yeah, agree we should integrate those weights somehow.
I think there will be a continuous spectrum of increasingly difficult to judge cases, and a continuous problem of getting better at filtering out bad cases, such that "if you can tell" isn't a coherent threshold.
I agree with this, but then I don't understand how this solution helps? Like, here we have a case where we can still tell that the environment is being reward hacked, and we tell the model it's fine. Tomorrow the model will encounter an environment where we can't tell that it's reward hacking, so the model will also think it's fine, and then we don't have a feedback loop anymore, and now we just have a model that is happily deceiving us.
It's ad-hoc because the central alignment problem is deceptive alignment, scheming, and generalized reward hacking where the model internalizes power-seeking and other associated cognitive patterns. This, as far as I can tell, just does not work for that at all. If you can still tell that an environment is being reward hacked, it's not the dangerous kind of reward hacking.
I think this is all a bit tricky to talk about, but this alignment technique, more than most others, really seems to me to trade mainline performance against increased deceptive alignment risk in the long run.
It's also not in the text I wrote, it's a sentence in Ozy's posts. I think I agree with John.
The report presents little evidence of successful implementation of the model because it is more an outline of the theory than a rigorous scientific testing of the model. It is not peer reviewed.
Are you... on the right forum? How did you even show up here? Why would you think that somehow the people here will appreciate an appeal to academic authority of all things? This blogpost itself is not peer reviewed!
Thanks for the response!
Having talked to something like 5-15 people about this, many of whom had at least a non-trivial cybersecurity background, I am pretty confident that your reading is idiosyncratic!
The language models also seem to continue to think so:
Chat with Gemini Pro about RSP definitions
Me: Hey, can you help me clarify what Anthropic considers an "Insider" according to their RSP? Like, would employees or executives of datacenter providers count as "insiders"? I attached the latest version of their RSP.
Gemini Pro:
Based on the text of the Responsible Scaling Policy (Version 2.2), Anthropic differentiates between "Insiders" and third-party providers (such as datacenter staff).
Here is the breakdown of how the document defines these roles:
According to Page 13 and Footnote 11, Anthropic defines "insiders" specifically as internal employees.
The "In Scope" List (Page 13): The policy explicitly lists "internal employees" as a specific threat group alongside external groups like hacktivists and state-sponsored programs.
Insider Risk Classifications (Footnote 11): The document further breaks down "internal employees" into two categories based on their access to model weights:
Basic Insider: An insider who does not have persistent or time-limited access to systems that process model weights.
Sophisticated Insider: An insider who has persistent access or can request time-limited access to systems that process model weights.
Employees or executives of datacenter providers are not classified as "Insiders" under this policy. Instead, they fall under the categories of "Third-party environments," "Supply chain," or "Vendors."
The RSP treats them as external entities that require their own specific set of safeguards:
Third-Party Environments (Page 12 & 14): The policy states that if models are deployed in a third-party partner’s environment (like a datacenter), Anthropic must "Document how all relevant models will meet the criteria... even if they are deployed in a third-party partner’s environment that may have a different set of safeguards."
Supply Chain (Page 19, Appendix B): The ASL-2 Security Standard distinguishes "Supply chain" from "Workforce."
Supply Chain: Involves "Vendor and supplier security" which must be regularly reviewed.
Workforce: Involves "internal employees" and the "insider risk program."
To answer your question directly: No, employees or executives of datacenter providers would not count as "Insiders" according to this document.
"Insiders" = Anthropic internal staff/employees.
Datacenter Staff = Vendors/Third-party partners (subject to supply chain security reviews and third-party environment agreements).
Like, my understanding is that the definition of "sophisticated insider" you propose here includes on the order of 2,000-3,000 people, whereas when I read the RSP, and asked other people about it, they thought it would be on the order of 50-100 people. That is of course a huge difference in surface area.
I don't think this change in surface area is the kind of change that should be left up to this much ambiguity in the RSP. Even if you update that the level of specific detail in the current RSP is unlikely to be a good idea, I think you should be able to end up with less counterintuitive definitions and less ambiguity[1] in future revisions of the RSP.
I haven't thought as much about all the tradeoffs as you have, so maybe this is infeasible for some reason, but I currently believe that this was a pretty clear and preventable error, instead of just a case of diverging interpretations (and to be clear, it's OK for there to be some errors, I don't think this thing alone should update anyone that much, though this plus a few other things should).
In any case, I will advocate for the next iteration of this policy to provide clarification or revision to better align with what is (in my opinion) important for the threat model.
I appreciate it!
My concern for AIs at this capability level is primarily about individual or small groups of terrorists, I think security that screens off most opportunistic attackers is what we need to contain the threat, and the threat model you're describing does not seem to me like it represents an appreciable increase in relevant risks (though it could at higher AI capability levels).
I think this is reasonable! I don't think the current RSP communicates that super well, and "risk from competitor corporate espionage" is IMO a reasonable thing to be worried about, at least from an outside view[2]. It seems good for the RSP to be clear that it is currently not trying to be robust to at least major US competitors stealing model weights (which is I think a fine call to make given all the different tradeoffs).
[1] Though given that I have not met a single non-Anthropic employee, or language model, who considered the definition of "Insider" you use here natural given the context of the rest of the document, I struggle to call it "ambiguity" instead of simply calling it "wrong".
[2] It is for example a thing that has come up in at least one scenario exercise game I have been part of, not too far from where current capability thresholds are.
Anonymous endorsements are very different from non-anonymous endorsements, so it makes sense to leave a comment if you want people to have the additional information that you specifically liked it.
I will admit that this strategy only seems a bit less crazy to me in terms of hope for controlling/aligning superintelligent (or even just substantially more competent) systems. Like, I don't think of "personas" as something that will survive the transition to most cognitive optimization pressure coming from RL environments instead of pretraining, so this feels like a very fragile abstraction to rely on (though other people seem to disagree with me here).
I do still of course appreciate the clarification, it does seem like the kind of scheme that is better to try if your alternative is not doing anything in the space at all, though IMO really not the kind of scheme that would be prudent to actually deploy if you had any choice.
Boaz Barak's course at Harvard has been doing this! See here for the associated posts: https://www.lesswrong.com/w/cs-2881r