Another item for the list of “mundane things you can do for AI takeover prevention”:
We have a wide variety of ways to make AI systems less capable but also less scary. Most alignment research is focused on pushing out the Pareto frontier here. But IMO a lot of value can come from techniques which allow us to choose the right point on this Pareto frontier for a particular application. It seems to me that different AI applications (where by “different AI applications” I’m including things like “different types of tasks you might ask your assistant to do”) have very different tradeoffs between capabilities and scariness. The classic example of this is that you plausibly shouldn’t put your scariest/smartest AIs in charge of running your nuclear weapon silos, because it seems like the returns to being super smart aren’t very high, and the model has particularly good opportunities to behave badly. On the other hand, galaxy brained AIs that were trained end-to-end to use inscrutable thoughts to have incomprehensible ideas can be used fruitfully on tasks where the stakes aren’t high or where verification is easy.
Ryan is saying “AI takeover is obviously really bad and scary regardless of whether the AI is likely to literally kill everybody. I don’t see why someone’s sticking point for worrying about AI alignment would be the question of whether misaligned AIs would literally kill everyone after taking over.”
This is a reasonable point. What I actually care about is reality, but I expect social reality to track reality fairly well on these points.
Fwiw in the conversations I’m in (in the alignment scene in the Bay Area) this point is widely understood.
Something I've realized over the last few days:
Why did we look at just the “most aggressive” experiment allowed by a hypothesis H, instead of choosing some other experiment allowed by H?
The argument for CaSc is: “if H was true, then running the full set of swaps shouldn’t affect the computation’s output, and so if the full set of swaps does affect the computation’s output, that means H is false.” But we could just as easily say “if H was true, then the output should be unaffected by any set of swaps that H says should be fine.”
Why focus on the fullest set of swaps? An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H that makes it look worst”.
I just now have realized that this is AFAICT equivalent to constructing your CaSc hypothesis adversarially--that is, given a hypothesis H, allowing an adversary to choose some other hypothesis H’, and then you run the CaSc experiment on join(H, H’). And so, when explaining CaSc, I think we should plausibly think about describing it by talking about the hypothesis producing a bunch of allowed experiments, and then you can test your hypothesis by either looking at the maxent one or by looking at the worst one.
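To make the "fullest set of swaps" versus "worst allowed set of swaps" distinction concrete, here is a toy sketch. It is not the actual CaSc implementation: the two-component model, the weights `ALPHA` and `BETA`, the dataset `XS`, and the helper `expected_loss` are all hypothetical, chosen so that the errors from the two swaps cancel. The hypothesis being tested is "neither component's input matters", so both inputs may be resampled.

```python
from itertools import combinations, product

XS = [-1.0, 1.0]          # toy dataset (assumption); the label is y = x
ALPHA, BETA = -1.0, 0.5   # hypothetical weights, picked so the two errors cancel
ALLOWED = ("a", "b")      # inputs the toy hypothesis H says may be resampled


def model(x_a, x_b):
    # Two components, each reading its own copy of the input.
    return ALPHA * x_a + BETA * x_b


def expected_loss(swapped):
    """Exact mean squared error when the inputs named in `swapped` are
    replaced by independent resamples from XS (enumerated, not sampled)."""
    total, count = 0.0, 0
    for x, r_a, r_b in product(XS, repeat=3):
        x_a = r_a if "a" in swapped else x
        x_b = r_b if "b" in swapped else x
        total += (model(x_a, x_b) - x) ** 2
        count += 1
    return total / count


baseline = expected_loss(frozenset())

# Every non-empty subset of the allowed swaps is an experiment H permits.
experiments = {
    frozenset(s): expected_loss(frozenset(s))
    for k in range(1, len(ALLOWED) + 1)
    for s in combinations(ALLOWED, k)
}

# The fullest experiment: apply every allowed swap at once.
full = experiments[frozenset(ALLOWED)]

# The adversarial experiment: the allowed subset whose loss deviates most.
worst_swaps, worst_loss = max(
    experiments.items(), key=lambda kv: abs(kv[1] - baseline)
)

print("baseline:", baseline)
print("full set of swaps:", full)
print("worst subset:", sorted(worst_swaps), worst_loss)
```

In this toy, the baseline loss and the full-swap loss are both 2.25, so the fullest experiment makes the hypothesis look fine, while swapping only `b` moves the loss to 4.25. Picking the worst allowed subset catches the cancellation that the fullest experiment hides.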
Here's a take of mine on how you should think about CaSc that I haven't so far gotten around to publishing anywhere:
I think you should think of CaSc as being a way to compute a prediction made by the hypothesis. That is, when you claim that the model is computing a particular interpretation graph, and you provide the correspondence between the interpretation graph and the model, CaSc tells you a particularly aggressive prediction made by your hypothesis: your hypothesis predicts that making all the swaps suggested by CaSc won't affect the average output of your computational graph.
Thinking about it this way is helpful to me for two reasons:
CaSc attempts to compute the single most aggressive prediction made by your hypothesis--this is why we do all allowed swaps. (I'm a bit confused about whether we should think of CaSc as succeeding at being the most aggressive experiment for the hypothesis though, I think there are some subtleties here that my coworkers have worked out that I don't totally understand.)
I think I regret that we phrased our writeup as "CaSc gives you a test of interp hypotheses" rather than saying "CaSc shows you a strong prediction made by your interp hypothesis, which you can then compare to the truth, and if they don't match that's a problem for your hypothesis".
Thanks for your work!
Causal Scrubbing Cannot Differentiate Extensionally Equivalent Hypotheses
I think that what you mean here is a combination of the following:
But one way someone could interpret this sentence is that CaSc doesn't distinguish between whether the model does quicksort or mergesort. This isn't generally true--if your interpretation graph broke up its quicksort implementation into multiple nodes, then the CaSc experiment would fail to explain the model's performance if the model itself was actually using mergesort.
Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.
The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don't want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model performs well, I'll be able to predict whether it will generalize correctly onto a particular new distribution.
The main problem with evaluating a hypothesis by KL divergence is that if you do this, your explanation looks bad in cases like the following: There's some interference in the model which manifests as random noise, and the explanation failed to preserve the interference pattern. In this case, your explanation has a bunch of random error in its prediction of what the model does, which will hurt the KL. But that interference was random and understanding it won't help you know if the mechanism that the model was using is going to generalize well to another distribution.
There are other cases similar to the interference one. For example, if your model has a heuristic that fires on some of the subdistribution you're trying to understand the model's behavior on, but not in a way that ends up affecting the model's average performance, this is basically another source of noise that you (at least often) end up not wanting your explanation to have to capture.
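A small numeric sketch of the interference case, under assumptions of my own choosing: a binary task, where `model_probs` is the probability the model puts on the correct answer for each input (with zero-mean "noise" around 0.7) and `explained_probs` is what an explanation that ignores the noise predicts. The numbers are made up for illustration, and the one-dimensional quantity here is taken to be the average probability on the correct answer.

```python
import math


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats, for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


# Hypothetical per-input probabilities on the correct answer.
model_probs = [0.9, 0.5, 0.9, 0.5]       # 0.7 plus or minus 0.2 of "interference"
explained_probs = [0.7, 0.7, 0.7, 0.7]   # explanation drops the interference

n = len(model_probs)

# One-dimensional quantity: the average behavior over the dataset.
mean_model = sum(model_probs) / n
mean_explained = sum(explained_probs) / n
gap = abs(mean_model - mean_explained)

# Per-input KL between the model and the explanation, averaged over inputs.
mean_kl = sum(bernoulli_kl(p, q) for p, q in zip(model_probs, explained_probs)) / n

print("gap in average behavior:", gap)
print("mean per-input KL:", mean_kl)
```

The gap in average behavior is zero up to floating-point error, so the explanation fully accounts for the one-dimensional quantity, while the mean per-input KL is clearly positive (roughly 0.1 nats) purely because of the noise the explanation did not reproduce.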
It seems like a big input into P(AI takeover) is the extent to which instances of our AI are inclined to cooperate with each other; specifically, the extent to which they’re willing to sacrifice overseer approval at the thing they’re currently doing in return for causing a different instance to get more overseer approval. (I’m scared of this because if they’re willing to sacrifice approval in return for a different instance getting approval, then I’m way more scared of them colluding with each other to fool oversight processes or subvert red-teaming procedures, or coordinating coups.)
If the AIs never interact with each other, there’s selection pressure against them caring about the effects of their actions on the reward given to other instances.
But IRL, the AIs are probably actually going to be working with each other a whole lot, and will plausibly often have opportunities to make tradeoffs where they help each other out (in the same way that people at the same org often have opportunities to help each other out varying amounts). Inasmuch as these opportunities arise, it’s probably good from the perspective of the org for the AIs to be cooperative with each other, and to care about helping each other succeed at their tasks. (For the same reason that you want workers at an org to help each other when it increases total productivity.)
(It’s actually a bit unclear to me how valuable it is to have AIs that cooperate. Many of the reasons that people at orgs can produce surplus by helping each other don’t seem applicable. E.g. you might not have specialization to different areas of knowledge (except for safety reasons)?)
There are a few ways you might set things up so that the AIs are able to be appropriately cooperative with each other:
I was thinking about this because I was trying to figure out how much of my P(doom) comes from deceptive alignment. The above arguments that we’ll want to train models for cooperation mean that deceptive alignment is less of why we might get collusive models.