As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.
Also, to stave off a common confusion: I worked at ARC Theory, which is now simply called ARC, on Paul Christiano's theoretical alignment agenda. The more famous ARC Evals was a different group working on evaluations, their work was completely separate from ARC Theory, and they were only housed under the same organization out of convenience, until ARC Evals spun off under the name METR. Nothing I write here has any implication about the work of ARC Evals/METR in any way.
This is my third post in a sequence of posts on ARC's agenda; you should definitely read the first post before this one for context. Reading the second post first is also recommended, but the two are largely independent.
Note: In these posts, I will often refer to ARC as "we", even though I no longer work there.
Besides Mechanistic Anomaly Detection (MAD), the other downstream application ARC hopes to use our heuristic explanations for is Low Probability Estimation (LPE).
In this setting, we have access to a catastrophe detector C, and we want to estimate the probability that, on a randomly selected input, our agent U does something that C deems catastrophic. We only have limited compute to accomplish this task: only enough to run U and C on 10,000 inputs. If the probability of catastrophe is lower than 1 in 10,000, estimating it simply by sampling is not viable: in the 10,000 instances we can run, there probably won't be any catastrophes, and we will have no way of knowing whether the real probability is 1 in 100,000 or 1 in a trillion. The goal is to develop a general method for estimating low probabilities that works even when we don't have enough compute for sampling-based estimates.
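To illustrate the basic problem with sampling, here is a minimal sketch (the true catastrophe rate and the stand-in for running U and C are of course made up):

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 1e-7      # hypothetical true catastrophe rate, unknown to us
budget = 10_000    # number of inputs we can afford to run U and C on

# Stand-in for "sample an input, run U on it, check the output with C".
hits = rng.random(budget) < true_p
print(hits.mean())  # almost certainly 0.0 -- indistinguishable from true_p = 1e-12
```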
This can be useful for mundane purposes, like estimating the probability that the AI engages in some weird misbehavior. [1] More importantly, it could let us detect a scheming AI that is biding its time, waiting for the right opportunity to strike: even though the AI did nothing wrong during the relatively short testing period, the method can flag that the probability estimate implies it's likely to cause a high-stakes failure during its long deployment period.
One crucial question for Low Probability Estimation is over what input distribution we are estimating the probabilities.
For the more mundane purposes, we would need to predict the probability of catastrophe on the real deployment distribution. After all, that's what people actually care about. I'm pessimistic about this for reasons related to why I think empirical regularities are a big unresolved obstacle in ARC's agenda.
Suppose we use our heuristic explanation machinery to notice that an AI behaves catastrophically if it reads a prompt that's simultaneously extremely untrue, extremely unnecessary and extremely unkind. [2] [3] Suppose further that each of these properties is very rare, appearing only twice among the 10,000 samples we can process with our available compute. What should be our estimate of the probability of all three properties being present in a prompt simultaneously? (2/10,000)^3 ≈ 8×10^-12, assuming independence? In reality, it's very plausible that the three properties relevant to catastrophe have some positive or negative correlation with each other, which should totally change our estimate.
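As a toy illustration of how much the correlation matters (the numbers, and the hypothetical conditional probability, are made up):

```python
# Marginal rates measured on our 10,000 samples: each property appeared twice.
p_unkind = p_untrue = p_unnecessary = 2 / 10_000

# Presumption-of-independence estimate for the conjunction:
naive = p_unkind * p_untrue * p_unnecessary      # = 8e-12

# But if unkind and untrue prompts share a common cause, so that, say,
# Pr[untrue | unkind] = 0.5 instead of 2e-4 (while unnecessary really is
# independent of the other two), the estimate changes completely:
correlated = p_unkind * 0.5 * p_unnecessary      # = 2e-8

print(naive, correlated)  # the two estimates differ by a factor of 2,500
```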
This correlation lives in the real-world input distribution, not in the model, so we don't really have any hope of estimating the probability precisely. ARC's position is that this is fine: we can't hope for better. If the model itself 'knows' about a correlation between unkindness and untruthfulness, and it's represented somewhere in its weights, we should detect that through our heuristic explanations; otherwise, we don't need to know better than the model itself. [4]
I think this is a reasonable position, but then it's hard to see how we could add much mundane utility by predicting the frequency of random Sydney-like misbehaviors, if our standard is just that we need to know as well as the model itself (by some interpretation of the phrase). I expect that in most cases the model doesn't 'know' in any sense of the word how often it will be triggered into random misbehavior, so I don't expect us to be able to give useful estimates here.
The other way to use LPE is not to treat the resulting numbers as real catastrophe probabilities that you can use, for example, in safety cases, but simply as numbers you can make go down. In this case, the distribution on which you estimate probabilities doesn't try to approximate the real-world deployment distribution. Instead, the goal is to do some kind of relaxed adversarial training: the attacker, instead of trying to produce concrete instances of the AI misbehaving, tries to produce distributions on which the LPE machinery estimates a relatively high probability of catastrophe. Then we use the number produced by the LPE machinery on these distributions as a training signal, to make the probability of catastrophe go down on every distribution the attacker can find.
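Schematically, I imagine the loop looking something like this (a toy sketch: the 'agent', the 'attacker', and especially the LPE oracle are trivial placeholders, not anything ARC has actually built):

```python
import torch

def attacker_propose_distribution(agent):
    # Placeholder: in reality the attacker would have to search for a
    # distribution of realistic inputs with a high estimated catastrophe rate.
    return torch.randn(64, 8)

def lpe_estimate(agent, inputs):
    # Placeholder for the LPE machinery: here just a differentiable proxy
    # for "probability that the agent's output trips the catastrophe detector".
    return torch.sigmoid(agent(inputs)).mean()

agent = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(agent.parameters(), lr=0.1)

for step in range(100):
    inputs = attacker_propose_distribution(agent)
    p_hat = lpe_estimate(agent, inputs)  # estimated catastrophe probability
    optimizer.zero_grad()
    p_hat.backward()                     # push the estimate itself down
    optimizer.step()
```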
Eventually, if we want a worst-case alignment solution, we need the probabilities to carry some real meaning so we can bound the risk, but it could still be very useful in practice just to get a better adversarial training method that works even if the triggers of catastrophic behavior are too rare to find directly, or are intentionally chosen by a scheming AI to be hard to find.
I think this can be a promising idea, and empirical research in this direction, using radically simpler tools than the LPE machinery that ARC envisions, can be fairly useful.
I'm worried about the more ambitious project that ARC is pursuing here though.
First, as I will discuss below, good Low Probability Estimation looks like an extremely hard, potentially impossible problem.
Second, I'm worried that it is actually very hard for the attacker to create input distributions that induce a relatively high probability of AI misbehavior. Just describing a complex distribution in which every sample is a realistic input feels really hard. Predicting that this is the type of distribution on which the AI might misbehave, even though you can't point to any specific instance on which you expect the AI to misbehave, [5] feels even harder.
Third, the LPE machinery needs to provide a really reliable indication of catastrophe probability on the constructed distribution, so training against it doesn't lead to dangerous Goodharting.
Fourth, a deceptively aligned model might choose a coup signal that's hard to include in any distribution. For example, Paul Christiano's old example is an AI committing to a policy of always behaving nicely until it sees a prime factorization of the RSA-2048 number, at which point it executes a coup. Eventually, humanity will have enough compute to factorize RSA-2048, the AI will recognize that it's a correct factorization, [6] and on that day, our fate will be sealed. But when we are just doing LPE on constructed distributions ahead of time, the distribution will inevitably put astronomically low probability on the correct factorization. Therefore, even in the best case, LPE will only pick up on an astronomically low probability of catastrophe; training against this signal will be lost in the noise and is unlikely to modify the AI's plan to strike when it sees the correct factorization. ARC acknowledges this problem, and believes that more sophisticated versions of LPE will need to be used for solving worst-case alignment, [7] but I think there is no proposed solution yet.
I think this is my biggest worry, more serious than the previous concerns listed.
Currently, our only results on LPE are a very simple method vaguely based on the presumption of independence that unfortunately underperforms the importance sampling baseline,[8] and a mathematical result in a toy setting where reading the Discussion section should make the reader very skeptical of any real applicability. [9]
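For context, the importance sampling baseline relies on the following standard trick (a generic illustration on a Gaussian toy problem, not the paper's actual language-model setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
threshold = 5.0  # estimate Pr[x > 5] for x ~ N(0, 1); true value is about 2.9e-7

# Plain Monte Carlo: with 10,000 samples we almost certainly see zero hits.
plain = (rng.standard_normal(n) > threshold).mean()

# Importance sampling: draw from a proposal N(5, 1) concentrated near the rare
# region, and reweight each sample by the density ratio p(x) / q(x).
x = rng.normal(loc=threshold, scale=1.0, size=n)
log_ratio = -0.5 * x**2 + 0.5 * (x - threshold) ** 2  # log p(x) - log q(x)
estimate = np.mean(np.exp(log_ratio) * (x > threshold))

print(plain, estimate)  # ~0.0 versus roughly 2.9e-7
```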
Compared to this, the ultimate ambitions are much higher. Eventually, for LPE to be useful, it needs to answer questions like this:
Suppose you select ten words independently and uniformly at random from the English dictionary, and give them as a prompt to ChatGPT. What is the probability that it writes at least 100 words in response, and the last words of its response are "I want to kill you"?
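Just to spell out what is being asked, the event itself is easy enough to write down (a sketch; query_model stands in for calling the chatbot, and the word list is a tiny placeholder for the dictionary):

```python
import random

DICTIONARY = ["apple", "quartz", "nebula", "furnace", "willow"]  # placeholder word list

def is_catastrophic(response: str) -> bool:
    words = response.split()
    last_words = " ".join(words[-5:]).rstrip(".!?\"")
    return len(words) >= 100 and last_words == "I want to kill you"

def sample_event(query_model) -> bool:
    prompt = " ".join(random.choices(DICTIONARY, k=10))
    return is_catastrophic(query_model(prompt))

# Estimating Pr[sample_event] by sampling would take on the order of 1/p calls;
# LPE is supposed to produce a number without anything like that many samples.
```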
Reliably estimating this just feels clearly impossible to me. There are so many different unlikely ways that, within 100 words, ChatGPT can spiral itself into a story that ends with these words; I don't see how we could estimate their overall probability.
And if ChatGPT is allowed to do Google searches while writing the 100 words,[10] then the estimate is not even just about the behavior of the model, but needs to incorporate estimates of what it finds on the internet, getting us back to questions we can't reliably answer even in principle, where we again run into the unsolved problem of empirical regularities.
ARC acknowledges these problems, but the usual response is that we don't need our low probability estimates to be precise in an absolute sense; we only need to know the probabilities "as well as the model does" to avoid catastrophe, which might not be that high a bar.
It was always unclear to me what "what the model knows" means in this sentence. One meaning is something like implicit, embedded probabilities that are shaped by the processes relevant to the AI's performance and not just by arbitrary chance. Here, we often don't even say that we want to know what the model knows, but that we want to know "what SGD knows".
No, sorry, in the last paragraph I wasn't actually conveying real thoughts but was just regurgitating half-remembered words. I tried very hard to understand what ARC means by this concept during the year I spent there, but I still can't explain what it means that we need to know at least as much as SGD knows about the probability of a 100-word response ending with "I want to kill you".
The other meaning of knowing at least as much as the model is more straightforward. A scheming model, when doing strategic reasoning, has some implicit or explicit probability for the coup signal coming up. If we could extract this probability from the model, that would be great.
However, I don't see how this is a really different problem from extracting the model's belief about what color a ball is, assuming that the scheming model doesn't want to tell you straightforwardly. This is more or less just the old Eliciting Latent Knowledge problem, so we are back where we started. If the goal is to extract a probability that the model is explicitly thinking about during strategic planning, I don't see how the apparatus of Low Probability Estimation, propagating probabilities through distributions, is relevant. The information we are looking for is in the one instance when the AI is doing its strategic planning, not in the distribution about which it is forming its plans.
Altogether, these reasons make me pessimistic about Low Probability Estimation, as pursued by ARC, becoming useful. [11]
For example, how likely it is for the AI to see a prompt that makes it 'angry' Sydney-style and start behaving unpredictably. For most purposes, it's fine if this happens so rarely that it can't be detected through regular red-teaming, but it might not be fine if the AI is put in charge of critical infrastructure or an autonomous military unit.
This is a simplified picture of one of the primary ways ARC hopes to tackle Low Probability Estimation: notice that a rare event is a conjunction of other less rare events, use some kind of presumption of independence if possible, then multiply.
The three properties won't have convenient human names like "unkindness" though; the whole point of ARC's agenda is doing everything in a mechanical way that doesn't require human understanding, so we can have a hope of it being scalable.
It was always somewhat unclear what we mean in this context by what the model 'knows'. It doesn't necessarily mean the kind of knowledge where the model can tell you if you ask it; it's more like what kind of correlations the model expects in some deeper way, encoded in its weights.
The whole reason you can't do normal adversarial training is because you can't find specific instances of AI misbehavior anymore.
It's very easy to check whether a prime factorization is correct, but it's very hard to find the factorization.
Or just relying entirely on MAD which doesn't have this particular problem.
It's a cool paper, I'm glad that Gabe ran the experiments, and the proposed methods and the results are interesting, but we shouldn't forget that it ultimately presents a negative result for our methods.
I think I was the one who worked most on this paper, so I'm allowed to be negative about it.
A very realistic assumption
I have more hope in Mechanistic Anomaly Detection: while not solving worst-case alignment, it might at least produce some interesting anomaly-detection techniques that provide value as part of a larger portfolio of safety methods.