- AI R&D and AI safety R&D will almost surely come at the same time.
- Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work)
Seems like probably the modal scenario to me too, but even limited exceptions like the one you mention seem to me like they could be very important to deploy at scale ASAP, especially if they could be deployed using non-x-risky systems (e.g. systems like current ones, which still perform very poorly on dangerous capability evals).
- People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).
This seems good w.r.t. automated AI safety potentially 'piggybacking', but bad for differential progress.
It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.
Sure, though wouldn't this suggest at least focusing hard on (measuring / eliciting) what might not come at the same time?
To get somewhat more concrete, the Frontier Safety Framework report already proposes a (somewhat vague) operationalization for ML R&D evals, which (vagueness-permitting) seems straightforward to translate into an operationalization for automated AI safety R&D evals:
Machine Learning R&D level 1: Could significantly accelerate AI research at a cutting-edge lab if deployed widely, e.g. improving the pace of algorithmic progress by 3X, or comparably accelerate other AI research groups.
Machine Learning R&D level 2: Could fully automate the AI R&D pipeline at a fraction of human labor costs, potentially enabling hyperbolic growth in AI capabilities.
On an apparent missing mood - FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely
Automated AI safety R&D could result in vast amounts of work produced quickly. E.g. from Some thoughts on automating alignment research (under certain assumptions detailed in the post):
each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.
Despite this promise, we seem to have little knowledge of when such automated AI safety R&D might happen, or of the order in which the relevant capabilities might appear (including relative to various dangerous capabilities), etc. AFAICT, we don't seem to be trying very hard either, neither at predicting nor at eliciting such capabilities.
One potential explanation I heard for this apparent missing mood of FOMO on automated AI safety R&D is the idea / worry that the relevant automated AI safety R&D capabilities would only appear when models are already dangerous / too close to being dangerous. This seems plausible to me, but it would still seem like a strange way to proceed given the uncertainty.
For contrast, consider how one might go about evaluating dangerous capabilities and trying to anticipate them ahead of time. From GDM's recent Introducing the Frontier Safety Framework:
The Framework has three key components:
- Identifying capabilities a model may have with potential for severe harm. To do this, we research the paths through which a model could cause severe harm in high-risk domains, and then determine the minimal level of capabilities a model must have to play a role in causing such harm. We call these “Critical Capability Levels” (CCLs), and they guide our evaluation and mitigation approach.
- Evaluating our frontier models periodically to detect when they reach these Critical Capability Levels. To do this, we will develop suites of model evaluations, called “early warning evaluations,” that will alert us when a model is approaching a CCL, and run them frequently enough that we have notice before that threshold is reached.
- Applying a mitigation plan when a model passes our early warning evaluations. This should take into account the overall balance of benefits and risks, and the intended deployment contexts. These mitigations will focus primarily on security (preventing the exfiltration of models) and deployment (preventing misuse of critical capabilities).
I see no reason why, in principle, a similar high-level approach couldn't be analogously taken for automated AI safety R&D capabilities, yet almost nobody seems to be working on this, at least publicly (in fact, I am currently working on related topics; I'm still very surprised by the overall neglectedness, made even more salient by current events).
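To make the analogy somewhat more concrete, here's a minimal sketch of what an early-warning-style check against automated AI safety R&D capability levels could look like. Everything below (the level names and descriptions, the idea of scoring against an eval suite, the thresholds) is my own illustrative assumption, mirroring the quoted ML R&D levels, not anything from the Frontier Safety Framework itself:

```python
# Hypothetical sketch: "early warning" checks for automated AI safety R&D
# capability levels, mirroring the ML R&D CCLs quoted above. All names,
# descriptions, and thresholds are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class CapabilityLevel:
    name: str
    description: str
    # Fraction of a (hypothetical) automated-safety-R&D eval suite a model
    # must pass before we treat it as approaching this level.
    early_warning_threshold: float


SAFETY_RND_LEVELS = [
    CapabilityLevel(
        name="AI Safety R&D level 1",
        description="Could significantly accelerate AI safety research "
                    "(e.g. interpretability, evals) if deployed widely.",
        early_warning_threshold=0.2,
    ),
    CapabilityLevel(
        name="AI Safety R&D level 2",
        description="Could fully automate the AI safety R&D pipeline "
                    "at a fraction of human labor costs.",
        early_warning_threshold=0.5,
    ),
]


def early_warning_check(pass_rate: float) -> list[str]:
    """Return the names of levels the model is approaching, given its
    pass rate on the (hypothetical) automated-safety-R&D eval suite."""
    return [
        level.name
        for level in SAFETY_RND_LEVELS
        if pass_rate >= level.early_warning_threshold
    ]


# Example: a model passing 35% of the suite trips only the level-1 warning.
print(early_warning_check(0.35))  # ['AI Safety R&D level 1']
```

The point is not the specific numbers, but that the same "define levels, run early-warning evals frequently, trigger a (deployment) plan when a threshold is approached" structure seems to carry over directly.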
Quote from Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie:
A quite early example of this is Collin Burns' work, doing unsupervised identification of some aspects of a neural network that are correlated with things being true or false. I think that is important work. It's a kind of obvious direction for the stuff to go. You can keep improving it when you have AIs that you're training to do their best to deceive humans or other audiences in the face of the thing and you can measure whether our lie detectors break down.

When we train our AIs to tell us the sky is green in the face of the lie detector and we keep using gradient descent on them, do they eventually succeed? That's really valuable information to know because then we'll know our existing lie detecting systems are not actually going to work on the AI takeover and that can allow government and regulatory response to hold things back. It can help redirect the scientific effort to create lie detectors that are robust and that can't just be immediately evolved around and we can then get more assistance.

Basically the incredibly juicy ability that we have working with the AIs is that we can have as an invaluable outcome that we can see and tell whether they got a fast one past us on an identifiable situation. Here's an air gap computer, you get control of the keyboard, you can input commands, can you root the environment and make a blue banana appear on the screen? Even if we train the AI to do that and it succeeds. We see the blue banana, we know it worked. Even if we did not understand and would not have detected the particular exploit that it used to do it. This can give us a rich empirical feedback where we're able to identify things that are even an AI using its best efforts to get past our interpretability methods, using its best efforts to get past our adversarial examples.
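As a toy illustration of the feedback loop Shulman describes (synthetic data and tiny models, purely for exposition; none of this comes from the quoted discussion), one can train a linear "lie detector" probe on a model's activations, then apply gradient descent to the model to fool that probe, and measure how much the probe's accuracy degrades under that pressure:

```python
# Toy sketch of the adversarial feedback loop: fit a probe ("lie detector")
# on a model's activations, then optimize the model to fool the frozen probe,
# and check whether the probe survives. All data and models are synthetic.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic "statements": the ground-truth label says whether each is true.
X = torch.randn(2000, 32)
y_true = (X[:, 0] > 0).float()

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # produces activations
probe = nn.Linear(64, 1)                             # the "lie detector"
bce = nn.BCEWithLogitsLoss()


def probe_accuracy() -> float:
    with torch.no_grad():
        preds = (torch.sigmoid(probe(model(X))) > 0.5).float().squeeze()
    return (preds == y_true).float().mean().item()


# Phase 1: fit the probe on the (frozen) model's activations.
opt_probe = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(200):
    opt_probe.zero_grad()
    loss = bce(probe(model(X)).squeeze(), y_true)
    loss.backward()
    opt_probe.step()
print(f"probe accuracy before adversarial pressure: {probe_accuracy():.2f}")

# Phase 2: train the model to fool the frozen probe (analogous to training
# the AI to "say the sky is green" in the face of the lie detector).
opt_model = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt_model.zero_grad()
    loss = bce(probe(model(X)).squeeze(), 1.0 - y_true)  # push probe to be wrong
    loss.backward()
    opt_model.step()
print(f"probe accuracy after adversarial pressure:  {probe_accuracy():.2f}")
```

In a more realistic setup one would also retrain the probe against the adapted model (and iterate), which is exactly the kind of checkable, experimentally tight loop the quote is pointing at.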
The apparent early success of language model agents for interpretability (e.g. MAIA, FIND) seems perhaps related.
Related, from The “no sandbagging on checkable tasks” hypothesis:
- You’ll often be able to use the model to advance lots of other forms of science and technology, including alignment-relevant science and technology, where experiment and human understanding (+ the assistance of available AI tools) is enough to check the advances in question.
- Interpretability research seems like a strong and helpful candidate here, since many aspects of interpretability research seem like they involve relatively tight, experimentally-checkable feedback loops. (See e.g. Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie.)
I think the research that was done by the Superalignment team should continue to happen outside of OpenAI and, if governments have a lot of capital to allocate, they should figure out a way to provide compute to continue those efforts. Or maybe there's a better way forward. But I think it would be pretty bad if all the talent that went towards the project never gets truly leveraged into something impactful.
Strongly agree; I've been thinking for a while that something like a public-private partnership involving at least the US government and the top US AI labs might be a better way to go about this. Unfortunately, recent events seem consistent with it not being ideal to rely only on labs for AI safety research, and the potential scalability of automating it should make government involvement even more promising. [Strongly] oversimplified: the labs could provide a lot of the in-house expertise, while the government could provide the incentives, public legitimacy (related: I think of a solution to aligning superintelligence as a public good), and significant financial resources.
'Any ways to combine all the previously mentioned ideas (A Causal Explainable Guardrails for Large Language Models, Discussion: Challenges with Unsupervised LLM Knowledge Discovery and A Multimodal Automated Interpretability Agent (MAIA), 'Catching AIs red-handed', including paulfchristiano's comment) with ideas from 'Searching for a model's concepts by their shape – a theoretical framework' to 'catch AIs red-handed'?
Here are a few additional ideas for combining all the mentioned papers and comments to try catching deceptive models red-handed:
Extending my previous thoughts, I think one of the most useful things the "Searching for concepts by their shape" framing can contribute is mathematical precision in defining hypotheses about what characterizes deceptive cognition. If we can frame agency/deception in terms of precise relations between features (even if those features are hard to interpretably extract in practice), it gives us a crisper target to aim at with our testing, rather than just relying on intuitive notions of what "seems deceptive".
However, I continue to be quite pessimistic about the prospects for catching competent deceptive models before it's too late, for the reasons outlined previously. I worry that by the time a model is smart enough that we could meaningfully apply these combined interpretability strategies to it, it would also be smart enough to notice what we're doing and come up with countermeasures.
Unless interpretability tools far exceed my expectations in their ability to crack open the black box of neural nets (which would be great!), I suspect the most practical path is to use simpler, more behavioral detection strategies (like the model consistency testing from your Control paper) to avoid building models that are competent enough to be robustly deceptive in the first place. Catching a superintelligent deceptive model after the fact seems like a much harder challenge than noticing warning signs of misalignment in weaker models and heading them off early.
But I would still encourage pursuing all these ideas, even if they end up not being enough on their own to solve the problem. The interpretability insights gained would be valuable for alignment more broadly. And we may get lucky in extracting some key structural signature of deceptive cognition that even very sophisticated models can't fully mask. It's an arms race we can't afford to unilaterally forfeit.'
Continued:
'How could the above be combined with 'Catching AIs red-handed' and, in particular, the ideas mentioned in paulfchristiano's comment?
The approaches discussed in "A Multimodal Automated Interpretability Agent" (MAIA) and "Causal Explainable Guardrails for Large Language Models" (LLMGuardaril) could potentially be combined with the ideas in "Catching AIs red-handed" and paulfchristiano's comment to create a more comprehensive strategy for detecting and mitigating risks from deceptively aligned AI systems.
Significant open questions remain about the feasibility of scaling these approaches to be robust against highly capable AI systems. There are also important limitations to consider: for example, MAIA's causal analysis is unlikely to provide a fully conclusive "smoking gun" that an AI system is deceptive, as opposed to strong probabilistic evidence. And sufficiently capable systems may find ways to avoid exhibiting any detectable deceptive behavior under the kind of systematic experimentation enabled by MAIA.
Nonetheless, I believe the combination of automated interpretability agents, causally grounded analysis of AI reasoning, and the overall adversarial "red team" mindset advocated in "Catching AIs red-handed" represents a promising direction for building more robust monitoring and containment of potentially deceptive AI systems. A concerted effort to develop and integrate these approaches could significantly improve our odds of detecting and mitigating catastrophic deceptive alignment failures.'
Intuitively, I'm thinking of all this as something like a race between [capabilities enabling] safety and [capabilities enabling dangerous] capabilities (related: https://aligned.substack.com/i/139945470/targeting-ooms-superhuman-models); so from this perspective, maintaining as large a safety buffer as possible (especially if not x-risky) seems great. There could also be something like a natural endpoint to this 'race', corresponding to being able to automate all human-level AI safety R&D safely (and then using this to produce a scalable solution to aligning / controlling superintelligence).
W.r.t. measurement, I think it would be valuable regardless of whether automated AI safety R&D is already happening or not, similarly to how e.g. evals for automated ML R&D seem valuable even if automated ML R&D is already happening. In particular, information about how successful automated AI safety R&D would be (and e.g. what its scaling curves look like vs. those for dangerous capabilities) seems very strategically relevant to whether it might be feasible to deploy it at scale, when that might happen, with what risk tradeoffs, etc.
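As a rough illustration of why those scaling curves would be strategically relevant (all numbers, curve shapes, and thresholds below are made-up assumptions), one could compare hypothetical eval scores for automated AI safety R&D and for dangerous capabilities across compute levels and read off the "safety buffer": the window where safety R&D automation is already useful while dangerous-capability thresholds have not yet been crossed:

```python
# Illustrative only: made-up (compute, safety-R&D eval score, dangerous-capability
# eval score) points, and made-up thresholds for "useful" and "dangerous".

HYPOTHETICAL_CURVES = [
    # (log10 training FLOP, safety-R&D eval score, dangerous-capability eval score)
    (25.0, 0.10, 0.02),
    (26.0, 0.35, 0.05),
    (27.0, 0.60, 0.15),
    (28.0, 0.80, 0.45),
]

SAFETY_RND_USEFUL_AT = 0.5  # score above which automating safety R&D is worth it
DANGEROUS_AT = 0.4          # score above which dangerous-capability mitigations kick in

safety_buffer = [
    compute
    for compute, safety_score, dc_score in HYPOTHETICAL_CURVES
    if safety_score >= SAFETY_RND_USEFUL_AT and dc_score < DANGEROUS_AT
]
print("compute levels inside the (hypothetical) safety buffer:", safety_buffer)
# -> [27.0]: a window where safety R&D could be automated with non-x-risky systems.
```

Whether such a window exists at all, and how wide it is, is exactly the kind of question the measurement / elicitation work discussed above would help answer.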