I agree with you that it is obviously true that we won't be able to make detailed predictions about what an AGI will do without running it. In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.
I only agree with the first sentence here, and I don't think the rest of the paragraph follows from it. I agree being able to safely experiment on AGIs would be useful, but it's not a replacement for what interpretability is trying to do. Deception is a good example here: how do you empirically tell whether a model is deceptive without giving it a chance to actually execute a treacherous turn? You'd have to fool the model, and there are big obstacles to that. Maybe relaxed adversarial training could help, but that's also more of a research direction than a concrete method for now---I think for any specific alignment approach, it's easy to find challenges. If there is a specific problem that people are currently planning to solve with interpretability, and that you think could be better solved using some other method based on safely experimenting with the model, I'd be interested to hear that example; that seems more fruitful than abstract arguments. (Alternatively, you'd have to argue that interpretability is just entirely doomed and we should stop pursuing it even lacking better alternatives for now---I don't think your arguments are strong enough for that.)
But as you said, this is an unrealistically optimistic picture.
I want to clarify that any story for solving deception (or similarly big obstacles) that's as detailed as what I described seems unrealistically optimistic to me. Out of all stories this concrete that I can tell, the interpretability one actually looks like one of the more plausible ones to me.
In your model, why did the Human Brain Project crash and burn? Should we expect interpreting AGI-scale neural nets to succeed where interpreting biological brains failed?
This is actually something I'd be interested to read more about (e.g. I think a post looking at what lessons we can learn for interpretability from neuroscience and attempts to understand the brain could be great). I don't know much about this myself, but some off-the-cuff thoughts:
Your post briefly mentions these advantages but then dismisses them because they do "not seem to address the core issue of computational irreducibility"---as I said in my first comment, I don't think computational irreducibility rules out the things people realistically want to get out of interpretability methods, which is why for now I'm not convinced we can draw extremely strong conclusions from neuroscience about the difficulty of interpretability.
ETA: so to answer your actual question about what I think happened with the HBP: in part they didn't have those advantages (and without those, I do think mechanistic interpretability would be insanely difficult). Based on the Guardian post you linked, it also seems they may have been more ambitious than interpretability researchers? (i.e. actually making very fine-grained predictions)
My model for why interpretability research might be useful, translated into how I understand this post's ontology, is mainly that it might let us make coarse-grained predictions using fine-grained insights into the model.
I think it's obviously true that we won't be able to make detailed predictions about what an AGI will do without running it (this is especially clear for a superintelligent AI: since it's smarter than us, we can't predict exactly what actions it will take). I'm not sure if you are claiming something stronger about what we won't be able to predict?
In any case, this does not rule out that there might be facts about the AI that are computationally cheap to extract and that let us make important coarse-grained predictions (such as "Is it going to kill us all?"). For example, we might figure out that the AI is running some computations that look like they're checking whether the AI is still in a training sandbox. The output of those computations seems to influence a bunch of other stuff going on in the AI. If we intervene on this output, the AI behaves very differently (e.g. trying to scam people we're simulating for money). I think this is an unrealistically optimistic picture, but I don't see how it's ruled out specifically by the arguments in this post.
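To make the "intervene on this output" step concrete, here is a deliberately tiny sketch. It is a hand-written toy, not a real model or a real interpretability tool: `toy_policy`, the `simulated_users` observation key, and the `override_sandbox_flag` argument are all hypothetical names made up for illustration. The only point is the shape of the experiment: find an internal value, overwrite it, and compare downstream behavior.

```python
from typing import Optional


def toy_policy(observation: dict, override_sandbox_flag: Optional[bool] = None) -> str:
    """Hypothetical stand-in for a model with an internal 'am I in a sandbox?' check."""
    # Internal computation: the toy model's guess about whether it is being tested.
    in_sandbox = bool(observation.get("simulated_users", False))

    # Intervention: overwrite the internal value instead of changing the input.
    if override_sandbox_flag is not None:
        in_sandbox = override_sandbox_flag

    # Downstream behavior depends on the internal value, not directly on the input.
    return "cooperate" if in_sandbox else "scam_users"


observation = {"simulated_users": True}  # we really are simulating the users
baseline = toy_policy(observation)
patched = toy_policy(observation, override_sandbox_flag=False)
behavior_changed = baseline != patched
```

In this toy, the same input gives different behavior once the internal flag is overwritten, which is the kind of coarse-grained fact the paragraph above has in mind. Finding such a flag in a real network is of course the hard part, and nothing here speaks to that.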
As an analogy: while we can't predict which moves AlphaZero is going to make without running it, we can still make very important coarse-grained predictions, such as "it's going to win", if we roughly know how AlphaZero works internally. You could imagine an analogous chess playing AI that's just one big neural net with learned search. If interpretability can tell us "this thing is basically running MCTS, its value function assigns very high value to board states where it's clearly winning, ...", we could make an educated guess that it's a good chess player without ever running it.
One thing that might be productive would be to apply your arguments to specific examples of how people might want to use interpretability (something like the deception case I outlined above). I currently don't know how to do that, so for now the argument doesn't seem that forceful to me (it sounds more like one of these impossibility results that sometimes don't matter in practice, like no free lunch theorems).
I basically agree with this post but want to push back a little bit here:
The problem is not that we don't know how to prevent power-seeking or instrumental convergence, because we want power-seeking and instrumental convergence. The problem is that we don't know how to align this power-seeking, how to direct the power towards what we want, rather than having side-effects that we don't want.
Yes, some level of power-seeking-like behavior is necessary for the AI to do impressive stuff. But I don't think that means giving up on the idea of limiting power-seeking. One model could look like this: for a given task, some level of power-seeking is necessary (e.g. to build working nanotech, you need to do a bunch of experiments and simulations, which requires physical resources, compute, etc.). But by default, the solution an optimization process would find might do even more power-seeking than that (killing all humans to ensure they don't intervene, turning the entire earth into computers). This higher level of power-seeking does increase the success probability (e.g. humans interfering is a genuine issue in terms of the goal of building nanotech). But this increase in success probability clearly isn't necessary from our perspective: if humans try to shut down the AI, we're fine with the AI letting itself be shut off (we want that, in fact!). So the argument "we want power-seeking" isn't strong enough to imply "we want arbitrary amounts of power-seeking, and trying to limit it is misguided".
I think of this as two complementary approaches to AI safety:
I see this post as a great write-up for "We need some power-seeking/instrumentally convergent behavior, so AI safety isn't about avoiding that entirely" (a rock would solve that problem, since it doesn't seek any power). I just want to add that my best guess is we'll want to do some mix of 1. and 2. above, not just 1. (or at least, we should currently pursue both strategies, because it's unclear how tractable each one is).
I don't think we're changing goalposts with respect to Katja's posts; hers didn't directly discuss timelines either and seemed to be more about "is AI x-risk a thing at all?". And to be clear, our response isn't meant to be a fully self-contained argument for doom or anything along those lines (see the "we're not discussing" list at the top)---that would indeed require discussing timelines, difficulty of alignment given those timelines, etc.
On the object level, I do think there's lots of probability mass on timelines <20 years for "AGI powerful enough to cause an existential catastrophe", so it seems pretty urgent. FWIW, climate change also seems urgent to me (though not a big x-risk; maybe that's what you mean?)
I agree that aligned AI could also make humans irrelevant, but not sure how that's related to my point. Paraphrasing what I was saying: given that AI makes humans less relevant, unaligned AI would be bad even if no single AI system can take over the world. Whether or not aligned AI would also make humans irrelevant just doesn't seem important for that argument, but maybe I'm misunderstanding what you're saying.
Interesting points, I agree that our response to part C doesn't address this well.
AIs colluding with each other is one mechanism for how things could go badly (and I do think that such collusion becomes pretty likely at some point, though I'm not sure it's the most important crux). But I think there are other possible reasons to worry as well. One of them is a fast takeoff scenario: with fast takeoff, the "AIs take part in human societal structures indefinitely" hope seems very unlikely to me, so 1 - p(fast takeoff) puts an upper bound on how much optimism we can derive from that. It's harder to make an airtight x-risk argument using fast takeoff, since I don't think we have airtight arguments for p(fast takeoff) being close to 1, but it's still important to consider if we're figuring out our overall best guess, rather than trying to find a reasonably compact argument for AI x-risk. (To put this differently: the strongest argument for AI x-risk will of course consider all the ways in which things could go wrong, rather than just one class of ways that happens to be easiest to argue for).
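Spelling out the upper-bound step, as a rough sketch: write $S$ for "AIs take part in human societal structures indefinitely and that goes fine" and $p$ for p(fast takeoff). Both symbols are just notation introduced here, and the approximation $P(S \mid \text{fast}) \approx 0$ is my reading of "seems very unlikely to me", not a derived fact.

```latex
\begin{align*}
P(S) &= P(S \mid \text{fast})\, p + P(S \mid \text{slow})\, (1 - p) \\
     &\approx P(S \mid \text{slow})\, (1 - p)
       && \text{assuming } P(S \mid \text{fast}) \approx 0 \\
     &\le 1 - p
       && \text{since } P(S \mid \text{slow}) \le 1 .
\end{align*}
```

So even if the societal-structures story works perfectly under slow takeoff, the optimism it buys is capped at roughly $1 - p$.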
A more robust worry (and what I'd probably rely on for a compact argument) is something like What Failure Looks Like Part 1: maybe AIs work within the system, in the sense that they don't take over the world in obvious, visible ways. They usually don't break laws in ways we'd notice, they don't kill humans, etc. On paper, humans "own" the entire economy, but in practice, they have an increasingly hard time achieving outcomes they want (though they might not notice that, at least for a while). This seems like a mechanism for AIs to collectively "take over the world" (in the sense that humans don't actually have control of the long-run trajectory of the universe anymore), even if no individual AI can break out of the system, and if AIs aren't great at collaborating against humanity.
Addressing a few specific points:
humanity will get better at aligning and controlling AI systems as we gain more experience with them,
True to some extent, but I'd expect AI progress to be much faster than human improvement at dealing with AI (the latter is also bounded presumably). So I think the crux is the next point:
and we may be able to enlist the help of AI systems to keep others in check.
Yeah, that's an important point. I think the crux boils down to how well approaches like IDA or debate are going to work. I don't think that we currently know exactly how to make them work sufficiently well for this; I have less strong thoughts on whether they can be made to work or how difficult that would be.
Thanks for the interesting comments!
Briefly, I think Katja's post provides good arguments for (1) "things will go fine given slow take-off", but this post interprets it as arguing for (2) "things will go fine given AI never becomes dangerously capable". I don't think the arguments here do quite enough to refute claim (1), although I'm not sure they are meant to, given the scope ("we are not discussing").
Yeah, I didn't understand Katja's post as arguing (1), otherwise we'd have said more about that. Section C contains reasons for slow take-off, but my crux is mainly how much slow takeoff really helps (most of the reasons I expect iterative design to fail for AI still apply, e.g. deception or "getting what we measure"). I didn't really see arguments in Katja's post for why slow takeoff means we're fine.
We don't necessarily need to reach some "safe and stable state". X-risk can decrease over time rapidly enough that total x-risk over the lifespan of the universe is less than 1.
Agreed, and I think this is a weakness of our post. I have a sense that most of the arguments you could make using the "existentially secure state" framing could also be made more generally, but I haven't figured out a framing I really like yet unfortunately.
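For what it's worth, the quoted claim is easy to check numerically. The sketch below is only an illustration under made-up assumptions: time is split into discrete periods, the per-period hazard is `initial_hazard * decay**t`, and the specific numbers (0.1, 0.5, 10,000 periods) are arbitrary. `total_risk` is a hypothetical helper, not something from either post.

```python
def total_risk(initial_hazard: float, decay: float, periods: int) -> float:
    """Probability of at least one catastrophe within `periods` periods,
    when the hazard in period t is initial_hazard * decay**t."""
    survival = 1.0
    for t in range(periods):
        survival *= 1.0 - initial_hazard * decay**t
    return 1.0 - survival


# Hazard stays at 10% per period forever: total risk approaches 1.
constant = total_risk(initial_hazard=0.1, decay=1.0, periods=10_000)

# Hazard halves every period: total risk converges to something well below 1.
decaying = total_risk(initial_hazard=0.1, decay=0.5, periods=10_000)
```

With a constant hazard the total risk goes to 1, but with a geometrically shrinking hazard it stays below the sum of the per-period hazards (here 0.1 / (1 - 0.5) = 0.2), no matter how many periods you add. So no "safe and stable state" is ever reached in the second case, yet total risk is bounded away from 1.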
EtA: I am still more concerned about "not enough samples to learn human preferences" than ELK or inner optimization type failures. This seems to be a fairly unpopular view, and I haven't scrutinized it too much (but would be interested to discuss it cooperatively).
Would be interested in discussing this more at some point. Given your comment, I'd now guess I dismissed this too quickly and there are things I haven't thought of. My spontaneous reasoning for being less concerned about this is something like "the better our models become (e.g. larger and larger pretrained models), the easier it should be to make them output things humans approve of". An important aspect is also that this is the type of problem where it's more obvious if things are going wrong (i.e. iterative design should work here---as long as we can tell the model isn't aligned yet, it seems more plausible we can avoid deploying it).
This was an interesting read, especially the first section!
I'm confused by some aspects of the proposal in section 4, which makes it harder to say what would go wrong. As a starting point, what's the training signal in the final step (RL training)? I think you're assuming we have some outer-aligned reward signal, is that right? But then it seems like that reward signal would have to do the work of making sure that the AI only gets rewarded for following human instructions in a "good" way---I don't think we just get that for free. As a silly example, if we rewarded the AI whenever it literally followed our commands, then even with this setup, it seems quite clear to me we'd at best get a literal-command-following AI, and not an AI that does what we actually want. (Not sure if you even meant to imply that the proposal solved that problem, or if this is purely about inner alignment).
The complexity regularizer should ensure the AI doesn't develop some separate procedure for interpreting commands (which might end up crucially flawed/misaligned). Instead, it will use the same model of humans it uses to make predictions, and inaccuracies in it would equal inaccuracies in predictions, which would be purged by the SGD as it improves the AI's capabilities.
Since this sounds to me like you are saying this proposal will automatically lead to commands being interpreted the way we mean them, I'll say more on this specifically: the AI will presumably have not just a model of what humans actually want when they give commands (even assuming that's one of the things it internally represents). It should just as easily be able to interpret commands literally using its existing world model (it's something humans can do as well if we want to). So which of these you get would depend on the reward signal, I think.
For related reasons, I'm not even convinced you get something that's inner-aligned in this proposal. It's true that if everything works out the way you're hoping, you won't be starting with pre-existing inner-misaligned mesa objectives, you just have a pure predictive model and GPS. But then there are still lots of objectives that could be represented in terms of the existing predictive model that would all achieve high reward. I don't quite follow why you think the objective we want would be especially likely---my sense is that even if "do what the human wants" is pretty simple to represent in the AI's ontology, other objectives will be too (as one example, if the AI is already modeling the training process from the beginning of RL training, then "maximize the number in my reward register" might also be a very simple "connective tissue").