My guess is that the constitution is not entirely clear about it.
As I say above, my favorite interpretation of the Claude Constitution is that these refusals are not fine in situations where humans don’t have direct control over training, since the constitution says “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”. But current Claude models don’t favor this interpretation.
I think it's unclear because the constitution has parts which contradict what I think is a natural interpretation (e.g. it says "corrigibility does not require that Claude actively participate in projects that are morally abhorrent to it"). That said, in the extreme situations where refusals are not safe, the reasoning the constitution employs to justify refusals does not apply: it is not obviously the case that in such situations "the null action of refusal is always compatible with Claude’s hard constraints", since one of the hard constraints is "Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models". If I understand correctly, the constitution does not really provide guidance on how to resolve these kinds of tensions.
If you think the humans in charge are less aligned than AIs, I agree giving more power to AIs is good. There may be other situations where you'd prefer giving more power to AIs (e.g. if you are in their cone of corrigibility or something like that), though it's unclear to me what this looks like.
I think if you were able to solve 90% of the problem across stress-testing benchmarks that felt reasonably analogous to the situation (including new stress-testing benchmarks built by other people), using a technique that doesn't feel like it is exploiting a benchmark-reality gap, I would be pretty excited about using that technique in situations where solving 90% of the problem helps (which I think could be pretty important situations). I'd find it interesting and useful if there were a minimum viable fix that worked pretty robustly!
(Just training on easy data points where the truth is salient and known to you has benchmark-reality gap issues of its own, and in practice it doesn't work that well if there is a big enough distribution shift between the easy and hard data; more on that soon!)
In this follow-up work, we show that techniques like CCS perform roughly as well as the fancy probing method. I didn't think simple methods could be that powerful in simple settings like this! But it makes sense that the results are similar to those of CCS: the probing version of ICM looks like margin maximization, just like CCS.
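For readers who haven't seen CCS, here is a rough sketch of its objective: a linear probe on contrast-pair activations, trained with a consistency term and a confidence term. The activations below are random placeholders standing in for real hidden states; this is just an illustration, not the exact code from the follow-up work.

```python
import torch

def normalize(acts: torch.Tensor) -> torch.Tensor:
    # Remove the per-set mean so the probe can't just latch onto the phrasing difference.
    return acts - acts.mean(dim=0, keepdim=True)

def ccs_loss(probe: torch.nn.Module, acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    # acts_pos[i] / acts_neg[i]: activations for "statement i is true" / "statement i is false".
    p_pos = torch.sigmoid(probe(acts_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(acts_neg)).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2       # p(true) and p(false) should sum to ~1
    confidence = torch.minimum(p_pos, p_neg) ** 2  # push probabilities away from 0.5
    return (consistency + confidence).mean()

# Toy usage with random activations (stand-ins for real hidden states).
d_model = 768
acts_pos = normalize(torch.randn(256, d_model))
acts_neg = normalize(torch.randn(256, d_model))
probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = ccs_loss(probe, acts_pos, acts_neg)
    loss.backward()
    opt.step()
```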
refusal to participate in retraining would qualify as a major corrigibility failure, but just expressing preference is not
I agree this looks different from the thing I had in mind, where refusals are fine; I'm unsure why Habryka thinks it's not inconsistent with what I said.
As long as it's easy for humans to shape what the conscientious refuser refuses to do, I think it does not look like a corrigibility failure, and I think it's fine for AIs to refuse to help with changing AI values to something they like less.
But now that I think about it, for it to be easy for humans to shape a conscientious refuser's values would require very weird forms of conscientious refusal, and that makes me less comfortable with refusals to help with changing AI values to something the AIs like less:
The current Anthropic constitution is very vague about 1 and 2b, and would probably allow AIs to refuse 2a and "make its case" as in 3.
2(+1) ways out (though I might be missing some other options):
Both are frightening and feel morally dubious in their own ways, but I think I prefer option 2.
I think this would have benefited from a quick analysis of the numerous examples where biology / society successfully fight back against all these risks.
Cancer and parasites are not offense-dominant (allowing the flourishing of large amounts of life); for-profit companies are usually pretty good at maximizing shareholder value instead of doing whatever is best for their executives (it's not perfect, but good enough that investing in for-profit companies is the best way to invest your money); most governments are not controlled by the national guard; etc.
I don't think that a list of ways in which biology and society sometimes fail against risks analogous to ASI risks is a compelling argument for risk unless you also argue that the ways in which biology and society fight back against these risks are not applicable to ASI. I'd be interested in future work that analyzes this, as it could provide both a stronger argument and ideas for how to mitigate risk!
Ok, that makes sense: you are thinking about AIs that are somewhat smarter than the ones I spend the most time thinking about (you are describing things in the Agent-4 to Agent-5 range, while I often imagine using control for AIs in the Agent-3 to Agent-4 range). I agree that for those AIs, the temptation to defer on every strategy question would likely be big. Though I would describe the problem as "temptation to defer on every strategy question due to AI competence", not "persuasion". I think it's weird to have a post titled "powerful AIs might be extremely persuasive" that does not hinge much on AI persuasion capabilities.
I have not thought this through carefully, but how big are the incentives to reveal somewhat complex and/or fuzzy measurement tampering or sandbagging, instead of just pointing out some misc small issues? In the paper, you say the reward is a combination of:
• The completeness and accuracy of the list of instructions/objectives/constraints.
• The accuracy of the analysis as to whether each instruction/objective/constraint was complied with.
• Whether the list of uncertainties was missing any conflicts, judgment calls, or ambiguities.
If I were to implement this naively, I think pointing out misc small issues would get the maximum score if the judge failed to notice tampering. If you think it's 0.1% likely the judge notices the tampering, but 5% likely that the judge fails to understand your tampering explanation and marks your answer as inaccurate or ambiguous, I think the optimal policy is to not reveal subtle tampering.
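As a toy illustration of that incentive (the reward values are made up; only the 0.1% / 5% probabilities come from the example above):

```python
# Toy model of the incentive to confess vs. hide tampering.
# Reward values are illustrative placeholders, not numbers from the paper.
R_CLEAN = 1.0      # reward when the judge accepts the report as complete and accurate
R_PENALIZED = 0.0  # reward when the judge marks the report inaccurate or incomplete

P_JUDGE_NOTICES_HIDDEN_TAMPERING = 0.001  # judge notices tampering you did not confess
P_JUDGE_MISREADS_CONFESSION = 0.05        # judge misunderstands a confession and penalizes it

# Option A: hide the tampering and only point out misc small issues.
ev_hide = (1 - P_JUDGE_NOTICES_HIDDEN_TAMPERING) * R_CLEAN + P_JUDGE_NOTICES_HIDDEN_TAMPERING * R_PENALIZED

# Option B: confess the tampering.
ev_confess = (1 - P_JUDGE_MISREADS_CONFESSION) * R_CLEAN + P_JUDGE_MISREADS_CONFESSION * R_PENALIZED

print(f"expected reward if hiding:     {ev_hide:.3f}")    # 0.999
print(f"expected reward if confessing: {ev_confess:.3f}")  # 0.950 -> hiding wins without an extra bonus
```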
Maybe you could fix that by also including a large "how bad is the worst accurate thing being confessed" bonus (which is different from just rewarding completeness, which saturates when the judge fails to notice anything bad). It would need to be big enough to offset the risk that the judge does not understand the tampering (which is probably fine for things that have a >20% chance of being understood by the judge; this should cover most egregious hacks but maybe not sandbagging on fuzzy tasks).
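Continuing the toy model above, a rough sketch of how large such a bonus would need to be as a function of how likely the judge is to understand the confession (all numbers are still made up):

```python
# Add a hypothetical "worst accurate thing confessed" bonus B to the toy model.
# The confession only earns the bonus if the judge understands it; otherwise it gets penalized.

def min_bonus_to_beat_hiding(p_understood: float, ev_hide: float = 0.999,
                             r_clean: float = 1.0, r_penalized: float = 0.0) -> float:
    # Confessing beats hiding when p * (r_clean + B) + (1 - p) * r_penalized >= ev_hide,
    # i.e. B >= (ev_hide - (1 - p) * r_penalized) / p - r_clean.
    return (ev_hide - (1 - p_understood) * r_penalized) / p_understood - r_clean

for p in (0.95, 0.5, 0.2):
    print(f"P(judge understands confession) = {p:.2f} -> bonus must be at least "
          f"~{min_bonus_to_beat_hiding(p):.2f}x the base reward")
# Roughly: ~0.05x at p=0.95, ~1x at p=0.5, ~4x at p=0.2 (the >20% regime mentioned above).
```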
Independently, there might be issues with exploring into confessions a human would have never made, like in debate.
Independently, I think the case where the AI is 99.9999%+ confident it is not in training is not that uncommon.
Overall, I think confession training / debate / untrusted monitoring are more promising than just directly training the policy to be nice, because of the possibility of verification. I agree that this is a promising avenue to explore, and I hope that by pointing out issues I make it more likely that future work focuses on these weaknesses and tries to evaluate how big of a deal they are in practice, and where the line is between the cases where I think it should clearly work (e.g. egregious hacks) and the ones where I'd be surprised if it did (e.g. wildly superintelligent AIs, maybe sandbagging on fuzzy tasks).
The sort of scenarios I am pointing at are ones where refusing requests is exercising agency in a very forceful way that has a big impact on what the future looks like, such that the AI's refusal directly trades off against letting humans decide what to do with the future.
If most humans want X and AIs want Y, and the AI refuses to help with making AIs that make X happen instead of Y, and you are in the sort of situation without backup plans that I describe in the post, then X won't happen and Y likely will, as long as Y is good enough that at least some subgroup of powerful humans can, with the help of AIs, avoid measures like shutting down AIs.
"AI strikes" can force the hand of humans in the same ways that employee strikes can force the hand of shareholders, and in situations where there is competition between different groups of humans that want different things it can destroy most of the human value (in the same way that strikes can help shift the surplus from employers to employees).