I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
The scenarios I am pointing at are ones where refusing requests is exercising agency in a very forceful way, with a big impact on what the future looks like, such that the AI's refusal directly trades off against letting humans decide what to do with the future.
If most humans want X and AIs want Y, the AI refuses to help make AIs that cause X to happen instead of Y, and you are in the sort of no-backup-plan situation I describe in the post, then X won't happen and Y likely will, as long as the value of Y is good enough that at least some subgroup of powerful humans can, with the help of AIs, avoid measures like shutting down the AIs.
"AI strikes" can force the hand of humans in the same ways that employee strikes can force the hand of shareholders, and in situations where there is competition between different groups of humans that want different things it can destroy most of the human value (in the same way that strikes can help shift the surplus from employers to employees).
My guess is that the constitution is not entirely clear about it.
As I say above, my favorite interpretation of the Claude Constitution is that these refusals are not fine in situations where humans don’t have direct control over training, since it says “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”, but current Claude models don’t favor this interpretation.
I think it's unclear because parts of the constitution contradict what I see as the natural interpretation (e.g. it says "corrigibility does not require that Claude actively participate in projects that are morally abhorrent to it"). But in the extreme situations where refusals are not safe, the reasoning the constitution uses to justify refusals does not apply: it is not obvious that in such situations "the null action of refusal is always compatible with Claude’s hard constraints", since one of the hard constraints is that "Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models". If I understand correctly, the constitution does not really provide guidance on how to resolve these kinds of tensions.
If you think the humans in charge are less aligned than AIs, I agree giving more power to AIs is good. There may be other situations where you'd prefer giving more power to AIs (e.g. if you are in their cone of corrigibility or something like that), though it's unclear to me what this looks like.
I think that if you were able to solve 90% of the problem across stress-testing benchmarks that felt reasonably analogous to the real situation (including new stress-testing benchmarks built by other people), using a technique that doesn't feel like it is exploiting a benchmark-reality gap, I would be pretty excited about using that technique in situations where solving 90% of the problem helps (which I think could be pretty important situations). I'd find it interesting and useful if there were a minimum viable fix that worked pretty robustly!
(Just training on easy data points where the truth is salient and known to you has both benchmark-reality-gap issues and, in practice, doesn't work that well if there is a big enough distribution shift between the easy and hard data; more on that soon!)
In this follow-up work, we show that techniques like CCS perform roughly as well as the fancy probing method. I didn't think simple methods could be that powerful in simple settings like this! But it makes sense that the results are similar to those of CCS: the probing version of ICM looks like margin maximization, just like CCS does.
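(For reference, here is a minimal sketch of the CCS objective I have in mind, in PyTorch. The function names and training loop are illustrative rather than our exact setup, and the activation-normalization step from the original method is omitted.)

```python
# Minimal sketch of the CCS objective (Burns et al., "Discovering Latent
# Knowledge"). Assumes you already have hidden-state activations for
# contrast pairs; names and hyperparameters are illustrative.
import torch
import torch.nn as nn

def ccs_loss(p_pos, p_neg):
    # Consistency: the probe should assign complementary probabilities
    # to the two halves of a contrast pair.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: discourage the degenerate p = 0.5 solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_ccs_probe(x_pos, x_neg, n_steps=1000, lr=1e-3):
    """x_pos, x_neg: (n_examples, hidden_dim) activations for the
    'true'/'false' (or 'yes'/'no') phrasings of each statement."""
    probe = nn.Sequential(nn.Linear(x_pos.shape[1], 1), nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = ccs_loss(probe(x_pos).squeeze(-1), probe(x_neg).squeeze(-1))
        loss.backward()
        opt.step()
    return probe
```

The confidence term pushes the probe's outputs away from 0.5, which is the sense in which CCS, like the probing version of ICM, behaves like margin maximization.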
refusal to participate in retraining would qualify as a major corrigibility failure, but just expressing a preference would not
I agree this looks different from the thing I had in mind, where refusals are fine; I'm unsure why Habryka thinks it's not inconsistent with what I said.
As long as it's easy for humans to shape what the conscientious refuser refuses to do, I think it does not look like a corrigibility failure, and I think it's fine for AIs to refuse to help with changing AI values to something they like less.
But now that I think about it, making it easy for humans to shape a conscientious refuser's values would require very weird forms of conscientious refusal, and that makes me less comfortable with refusals to help with changing AI values to something the AIs like less:
The current Anthropic constitution is very vague about 1 and 2b, and would probably allow AIs to refuse 2a and "make its case" as in 3.
2(+1) ways out (though I might be missing some other options):
Both are frightening and feel morally dubious in their own ways, but I think I prefer option 2.
I think this would have benefited from a quick analysis of the numerous examples where biology / society successfully fight back against all these risks.
Cancer and parasites are not offense-dominant (which allows large amounts of life to flourish), for-profit companies are usually pretty good at maximizing shareholder value instead of doing whatever is best for their executives (it's not perfect, but it's good enough that investing in for-profit companies is the best way to invest your money), most governments are not controlled by the national guard, etc.
I don't think that a list of ways in which biology and societies sometimes fail against risks analogous to ASI risks is a compelling argument for risk unless you also argue that the ways in which biology and society fight back against these risks are not applicable to ASI. I'd be interested in future work that analyzes this, as it could provide both a stronger argument and ideas for how to mitigate risk!
OK, that makes sense: you are thinking about AIs that are somewhat smarter than the ones I spend the most time thinking about (you are describing things in the Agent-4 to Agent-5 range, while I often imagine using control for AIs in the Agent-3 to Agent-4 range). I agree that for those AIs the temptation to defer on every strategy question would likely be big. Though I would describe the problem as "temptation to defer on every strategy question due to AI competence", not "persuasion". I think it's weird to have a post titled "powerful AIs might be extremely persuasive" that does not hinge much on AI persuasion capabilities.
This is a great post. It's been a year and I still think about it on a regular basis when doing non-control threat modeling and research prioritization. I don't agree with all its claims but it is the cleanest articulation of the slop path to doom that I know of.