CEO at Redwood Research.
AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
I see where you're coming from, and can easily imagine things going the way you described. My goal with this post was to note some of the ways that it might be harder than you're describing here.
Mostly from Anthropic building AIs that then kill billions of people while taking over, or their algorithmic secrets being stolen and leading to other people building AIs that then kill billions of people, or their model weights being stolen and leading to huge AI-enabled wars.
I think it’s a bit stronger than the usual paralysis argument in this case, but yeah.
Note that people who have a non-consequentialist aversion to the risk of causing damage should have other problems with working for Anthropic. E.g. I suspect that Anthropic is responsible for more than a million deaths of currently alive humans in expectation.
My reasoning is: “If the AI won, who cares if we know it or not? We’ll find out eventually :/”
This isn’t totally correct but it seems like a fine approximation given all the other approximations here.
When I'm thinking about AI control in the context of concentrated failures--that is, cases where an unacceptable outcome might be caused by a small number of bad AI actions, e.g. the setting explored in our control paper--I find it helpful to use the following simplified models.
Let's consider our AI deployment from both a micro perspective, where we just think about a single task the AI is doing, and a macro perspective, where we think about the deployment as composed of many tasks.
From the micro perspective, when an AI tries to attack on a particular task, this can end in three ways:
And from the macro perspective, the game ends one of these ways:
I often ignore the possibility of the game finishing naturally, because that won't happen if the AI attempts to attack often enough, which is the right strategy for the AI in many games.
Under that assumption, the probability of AI victory depends only on the ratio between the probability that an attack ends in success and the probability that it ends with us catching the AI: it equals P(success) / (P(success) + P(catch)). The probability that the game continues doesn't matter, because if the game continues, the AI will eventually attack again on a later task.
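Here's a minimal sketch for checking this numerically. It assumes each attack independently ends in success, a catch, or the game continuing with fixed probabilities, and that the AI keeps attacking until one of the first two happens. The function names and the example probabilities are made up for illustration.

```python
import random


def p_ai_win(p_success: float, p_catch: float) -> float:
    """Closed form: chance that the first decisive attack is a success.

    Assumes the game never finishes naturally, so only the split between
    success and catch matters, not the probability of continuing.
    """
    return p_success / (p_success + p_catch)


def simulate_ai_win_rate(p_success: float, p_catch: float,
                         trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        while True:
            r = rng.random()
            if r < p_success:            # attack succeeds: AI wins
                wins += 1
                break
            if r < p_success + p_catch:  # we catch the AI: AI loses
                break
            # otherwise the game continues and the AI attacks again later
    return wins / trials


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any real deployment.
    print(p_ai_win(0.02, 0.08))
    print(simulate_ai_win_rate(0.02, 0.08))
```

Scaling both probabilities down by the same factor (so the game continues more often) leaves the closed-form answer unchanged, which is the sense in which the probability of continuing doesn't matter.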
I also often think about cases where you need to catch the AI some larger number of times (let's say 30 times) before you win; you can approximate this by reducing P(catch) by 30x and increasing P(silent failure) to compensate.
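To get a feel for how rough this approximation is, here's a small sketch comparing it with an exact calculation. The exact version assumes the AI keeps attacking with the same per-attack probabilities after each catch and that attack outcomes are independent; those are my assumptions for the sketch, and the function names are made up.

```python
def p_ai_win_exact(p_success: float, p_catch: float, catches_needed: int) -> float:
    """AI wins unless the first `catches_needed` decisive attacks are all catches."""
    p_decisive_is_catch = p_catch / (p_success + p_catch)
    return 1.0 - p_decisive_is_catch ** catches_needed


def p_ai_win_approx(p_success: float, p_catch: float, catches_needed: int) -> float:
    """The shortcut: divide P(catch) by the number of catches needed.

    The probability mass removed from P(catch) moves to P(silent failure),
    which doesn't affect the result when the game never finishes naturally.
    """
    return p_success / (p_success + p_catch / catches_needed)


if __name__ == "__main__":
    # Illustrative numbers only.
    for p_success, p_catch in [(0.001, 0.1), (0.0001, 0.1), (0.01, 0.1)]:
        exact = p_ai_win_exact(p_success, p_catch, 30)
        approx = p_ai_win_approx(p_success, p_catch, 30)
        print(p_success, p_catch, round(exact, 4), round(approx, 4))
```

The two agree exactly when only one catch is needed. Under these assumptions they stay close when P(success) is small relative to P(catch) / 30 and drift apart as P(success) grows.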
(Note also that the old posts you linked were coauthored with Ryan, so there's some variation from the fact that he and I don't have exactly the same beliefs.)
I'd much rather an interpretability team hire someone who's intrinsically fascinated by neural networks (but doesn't think much about alignment) than someone who deeply cares about making AI go well (but doesn't find neural nets very interesting).
I disagree; I'd rather they hire someone who cares about making AI go well. E.g. I like Sam Marks's work on making interpretability techniques useful (e.g. here), and I think he gets a lot of leverage compared to most interpretability researchers via trying to do stuff that's in the direction of being useful. (Though note that his work builds on the work of non-backchaining interpretability researchers.)
I really wish that I could respond more quantitatively to these. IMO all these arguments are cause for concern, but of substantially different scale.
One argument against publishing adversarial robustness research is that it might make your systems easier to attack.