It seems like a big input into P(AI takeover) is the extent to which instances of our AI are inclined to cooperate with each other; specifically, the extent to which they’re willing to sacrifice overseer approval at the thing they’re currently doing in return for causing a different instance to get more overseer approval. (I’m scared of this because if they’re willing to sacrifice approval in return for a different instance getting approval, then I’m way more scared of them colluding with each other to fool oversight processes or subvert red-teaming procedures, or coordinating coups.)

If the AIs never interact with each other, there’s selection pressure against them caring about the effects of their actions on the reward given to other instances.

But IRL, the AIs are probably actually going to be working with each other a whole lot, and will plausibly often have opportunities to make tradeoffs where they help each other out (in the same way that people at the same org often have opportunities to help each other out to varying degrees). Inasmuch as these opportunities arise, it’s probably good from the perspective of the org for the AIs to be cooperative with each other, and to care about helping each other succeed at their tasks. (For the same reason that you want workers at an org to help each other when it increases total productivity.)

(It’s actually a bit unclear to me how valuable it is to have AIs that cooperate. Many of the reasons that people at orgs can produce surplus by helping each other don’t seem applicable. E.g. you might not have specialization to different areas of knowledge (except for safety reasons)?)

There are a few ways you might set things up so that the AIs are able to be appropriately cooperative with each other:

  • You could just have a market economy. You already plausibly wanted it to be the case that your AI instances have access to some cash, and can spend it as they think is appropriate to accomplish their goals (and the amount they spend is deducted from their reward). And then you can have your AIs cooperate via an internal economy, where if they want assistance they pay each other.
    • This kind of setup doesn’t obviously train the AIs to care about each other.
    • In this kind of setup, it is not a competitive advantage for the AIs to care about each other. But it is a competitive advantage for the AIs to be scrupulously honest with each other. Idk what effect that has.
    • There are various disadvantages here analogous to the reasons that normal orgs don’t do this.
  • A way scarier, probably dumb, but very simple approach is to set it up so that whenever two AIs interact with each other, their rewards “bleed into” each other a bit, so they’re naturally inclined to help each other out. E.g. maybe whenever the AIs interact in a particular way, you naturally give each of the AIs 3% of the reward that the other one gets within some time period. This is super simple to implement, might make your AI org work more effectively, and seems terrifyingly likely to get you murdered.
  • You could also have your oversight process try to estimate how much good the instance did via helping out other instances, which is basically just the process-based version of training AIs to help each other out.
    • In general, it’s probably possible to use occasional high-quality oversight and adversarial training to train your models to be pretty cooperative. So I expect that even just with this process-based cooperation training, the AIs will be way more cooperative than workers at big companies.
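To make the reward "bleed" idea from the list above concrete, here is a minimal sketch. The function name `bleed_rewards`, the data layout, and the choice to ignore the "within some time period" window are all assumptions made for illustration; the 3% figure is the one from the example above.

```python
from collections import defaultdict

def bleed_rewards(base_rewards, interactions, bleed=0.03):
    """Give each instance `bleed` times the base reward of every instance it interacted with.

    base_rewards: dict mapping instance name -> reward from its own task.
    interactions: iterable of (name, name) pairs that interacted (hypothetical format).
    """
    partners = defaultdict(set)
    for a, b in interactions:
        if a != b:
            partners[a].add(b)
            partners[b].add(a)
    # Bleed is computed from base rewards only, so it does not compound.
    return {
        name: reward + bleed * sum(base_rewards[p] for p in partners[name])
        for name, reward in base_rewards.items()
    }

base = {"A": 10.0, "B": 0.0, "C": 5.0}
adjusted = bleed_rewards(base, [("A", "B")])
print(adjusted)
```

In this toy setup, B now has a direct incentive for A to be rated highly, which is exactly the property that makes the scheme both simple and scary.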

I was thinking about this because I was trying to figure out how much of my P(doom) comes from deceptive alignment. The above arguments that we’ll want to train models for cooperation mean that deceptive alignment is less of why we might get collusive models.

Another item for the list of “mundane things you can do for AI takeover prevention”:

We have a wide variety of ways to make AI systems less capable but also less scary. Most alignment research is focused on pushing out the Pareto frontier here. But IMO a lot of value can come from techniques which allow us to choose the right point on this Pareto frontier for a particular application. It seems to me that different AI applications (where by “different AI applications” I’m including things like “different types of tasks you might ask your assistant to do”) have very different tradeoffs between capabilities and scariness. The classic example of this is that you plausibly shouldn’t put your scariest/smartest AIs in charge of running your nuclear weapon silos, because it seems like the returns to being super smart aren’t very high, and the model has particularly good opportunities to behave badly. On the other hand, galaxy brained AIs that were trained end-to-end to use inscrutable thoughts to have incomprehensible ideas can be used fruitfully on tasks where the stakes aren’t high or where verification is easy.

Ryan is saying “AI takeover is obviously really bad and scary regardless of whether the AI is likely to literally kill everybody. I don’t see why someone’s sticking point for worrying about AI alignment would be the question of whether misaligned AIs would literally kill everyone after taking over.”

This is a reasonable point. What I actually care about is reality, but I expect social reality to track reality fairly well on these points.

Fwiw in the conversations I’m in (in the alignment scene in the Bay Area) this point is widely understood.


Something I've realized over the last few days:

Why did we look at just the “most aggressive” experiment allowed by a hypothesis H, instead of choosing some other experiment allowed by H?

The argument for CaSc is: “if H was true, then running the full set of swaps shouldn’t affect the computation’s output, and so if the full set of swaps does affect the computation’s output, that means H is false.” But we could just as easily say “if H was true, then the output should be unaffected by any set of swaps that H says should be fine.”

Why focus on the fullest set of swaps? An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H which makes it look worst”.

I just now have realized that this is AFAICT equivalent to constructing your CaSc hypothesis adversarially--that is, given a hypothesis H, allowing an adversary to choose some other hypothesis H’, and then you run the CaSc experiment on join(H, H’). And so, when explaining CaSc, I think we should plausibly think about describing it by talking about the hypothesis producing a bunch of allowed experiments, and then you can test your hypothesis by either looking at the maxent one or by looking at the worst one.
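Here is a toy sketch of the "hypothesis produces a bunch of allowed experiments" framing. It is not the real CaSc algorithm: inputs stand in for nodes, a "swap" just resamples one input independently from the dataset, and the model, dataset, and hypothesis are all made up so that the full set of swaps looks fine while a smaller allowed set does not.

```python
from itertools import combinations, product

def model(x):
    # Toy "model": depends on x[1] and x[2], but their effects cancel on-distribution.
    return x[0] * x[1] - x[0] * x[2]

# Hypothetical dataset in which all three inputs are always equal.
DATA = [(1, 1, 1), (-1, -1, -1)]

def scrubbed_mean(swapped):
    """Exact mean output when each index in `swapped` is independently resampled from DATA."""
    swapped = sorted(swapped)
    total, count = 0.0, 0
    for x in DATA:
        for donors in product(DATA, repeat=len(swapped)):
            scrubbed = list(x)
            for idx, donor in zip(swapped, donors):
                scrubbed[idx] = donor[idx]
            total += model(scrubbed)
            count += 1
    return total / count

def allowed_experiments(allowed):
    """Every subset of the swaps that the hypothesis permits, including the empty one."""
    for r in range(len(allowed) + 1):
        yield from combinations(allowed, r)

allowed = (1, 2)  # hypothetical H: inputs 1 and 2 can each be resampled freely
baseline = scrubbed_mean(())
results = {s: scrubbed_mean(s) for s in allowed_experiments(allowed)}

full_gap = abs(results[allowed] - baseline)  # the "fullest set of swaps" experiment
worst_set = max(results, key=lambda s: abs(results[s] - baseline))
worst_gap = abs(results[worst_set] - baseline)  # the adversarially chosen experiment
print(results, full_gap, worst_set, worst_gap)
```

In this constructed case the full set of swaps leaves the mean output unchanged, because the two errors cancel, while swapping a single input shifts it; that gap is what the worst-case experiment picks up.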


Here's a take of mine on how you should think about CaSc that I haven't so far gotten around to publishing anywhere:

I think you should think of CaSc as being a way to compute a prediction made by the hypothesis. That is, when you claim that the model is computing a particular interpretation graph, and you provide the correspondence between the interpretation graph and the model, CaSc tells you a particularly aggressive prediction made by your hypothesis: your hypothesis predicts that making all the swaps suggested by CaSc won't affect the average output of your computational graph.

Thinking about it this way is helpful to me for two reasons:

  • False hypotheses can make true predictions; this is basically why CaSc can fail to reject false hypotheses.
  • It also emphasizes why I'm unsympathetic to claims that "it sets the bar too high for something being a legit circuit"--IMO, if you claimed that your model has some internal structure well described by a hypothesis that fits into the CaSc structure (which is true of almost all interp hypotheses in practice), I don't really see how the failure of a CaSc test is compatible with that hypothesis being true (modulo my remaining questions about how bad it is for a hypothesis to get a middling CaSc score).

CaSc attempts to compute the single most aggressive prediction made by your hypothesis--this is why we do all allowed swaps. (I'm a bit confused about whether we should think of CaSc as succeeding at being the most aggressive experiment for the hypothesis though, I think there are some subtleties here that my coworkers have worked out that I don't totally understand.)

I think I regret that we phrased our writeup as "CaSc gives you a test of interp hypotheses" rather than saying "CaSc shows you a strong prediction made by your interp hypothesis, which you can then compare to the truth, and if they don't match that's a problem for your hypothesis".


Thanks for your work!

Causal Scrubbing Cannot Differentiate Extensionally Equivalent Hypotheses

I think that what you mean here is a combination of the following:

  • CaSc fails to reject some false hypotheses, as already discussed.
  • Each node in the interpretation graph is only verified up to extensional equality. As in, if I claim that a single node in the graph is a whole sort function, I don't learn anything about whether the model is implementing quicksort or mergesort.

But one way someone could interpret this sentence is that CaSc doesn't distinguish between whether the model does quicksort or mergesort. This isn't generally true--if your interpretation graph broke up its quicksort implementation into multiple nodes, then the CaSc experiment would fail to explain the model's performance if the model itself was actually using mergesort.
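As a small illustration of the extensional-equality point (not a CaSc experiment itself), the sketch below runs two sort implementations that agree on every output but expose different intermediate values. Which intermediate each function returns is a choice made here for illustration: the first partition for quicksort, the two sorted halves for mergesort.

```python
def quicksort(xs):
    """Return (sorted list, top-level intermediate): the partition around the first element."""
    if len(xs) <= 1:
        return list(xs), None
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    return quicksort(smaller)[0] + [pivot] + quicksort(larger)[0], (smaller, larger)

def mergesort(xs):
    """Return (sorted list, top-level intermediate): the two sorted halves before merging."""
    if len(xs) <= 1:
        return list(xs), None
    mid = len(xs) // 2
    left, right = mergesort(xs[:mid])[0], mergesort(xs[mid:])[0]
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:], (left, right)

xs = [3, 1, 4, 1, 5, 9, 2, 6]
q_out, q_mid = quicksort(xs)
m_out, m_mid = mergesort(xs)
same_output = q_out == m_out        # a single "sort" node only sees this
same_intermediate = q_mid == m_mid  # a multi-node graph makes claims about this
print(same_output, same_intermediate)
```

A hypothesis with one "sort" node only constrains the final output, so both implementations satisfy it; a hypothesis that names the partition step as its own node makes a claim that the mergesort intermediates would not match.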


Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.

The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don't want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model performs well, I'll be able to predict whether it will generalize correctly onto a particular new distribution.

The main problem with evaluating a hypothesis by KL divergence is that if you do this, your explanation looks bad in cases like the following:  There's some interference in the model which manifests as random noise, and the explanation failed to preserve the interference pattern. In this case, your explanation has a bunch of random error in its prediction of what the model does, which will hurt the KL. But that interference was random and understanding it won't help you know if the mechanism that the model was using is going to generalize well to another distribution.

There are other cases similar to the interference one. For example, if your model has a heuristic that fires on some of the subdistribution you're trying to understand the model's behavior on, but not in a way that ends up affecting the model's average performance, this is basically another source of noise that you (at least often) end up not wanting your explanation to have to capture.
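A small numerical sketch of the interference case, under made-up numbers: the "model" assigns a probability to the correct answer on each input, the "explanation" reproduces the signal but not a zero-mean noise term, and the one-dimensional quantity is assumed to be the average probability on the correct answer (a linear metric, chosen so the noise averages out exactly).

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Hypothetical per-input probabilities on the correct answer.
signal = [0.8, 0.8, 0.6, 0.6]   # the part the explanation captures
noise = [0.1, -0.1, 0.1, -0.1]  # zero-mean "interference" it does not capture
model_probs = [s + n for s, n in zip(signal, noise)]
explanation_probs = signal

n = len(signal)
metric_gap = abs(sum(model_probs) / n - sum(explanation_probs) / n)
mean_kl = sum(kl_bernoulli(p, q) for p, q in zip(model_probs, explanation_probs)) / n
print(f"gap in expected metric: {metric_gap:.6f}")
print(f"mean per-input KL:      {mean_kl:.6f}")
```

The explanation matches the expected metric almost exactly, yet its per-input KL is clearly positive, so a KL-based score penalizes it for failing to reproduce noise that carries no information about the mechanism.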


It’s a pretty different algorithm, though obviously it’s trying to solve a related problem.
