I would guess that point 1 is perfectly fixable by studying models at a distance of 10 epochs from each other. Points 2 and 3 are quasi-fixable as follows: every 10 epochs the weights are FROZEN and the FROZEN model is tested for alignment (e.g. by adding noise). Then the main model doesn't actually learn anything about the evaluations.
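A minimal sketch of what that could look like in a standard PyTorch training loop (the model, data loader, and `alignment_eval` probe below are hypothetical placeholders; the point is only that the evaluation runs on a detached copy and its result never feeds back into the gradient updates):

```python
import copy
import torch
import torch.nn as nn

def alignment_eval(frozen_model: nn.Module, noise_std: float = 0.01) -> float:
    """Hypothetical alignment check run on a frozen copy of the model.

    Optionally perturbs the copy's weights with Gaussian noise before probing
    its behavior, as suggested above. The returned score is logged for humans
    and never used as a training signal.
    """
    with torch.no_grad():
        for p in frozen_model.parameters():
            p.add_(noise_std * torch.randn_like(p))
        # ... run whatever behavioral probes you trust here ...
        return 0.0  # placeholder score

def train(model: nn.Module, loader, epochs: int = 100, eval_every: int = 10):
    opt = torch.optim.Adam(model.parameters())
    for epoch in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()

        if (epoch + 1) % eval_every == 0:
            # Deep-copied snapshot: nothing from the eval reaches the live
            # model, so it cannot learn anything about the evaluation.
            snapshot = copy.deepcopy(model).eval()
            score = alignment_eval(snapshot)
            print(f"epoch {epoch + 1}: frozen-copy alignment score {score:.3f}")
```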
What is actually worrisome is what will be done with early warning shots. The AI-2027 goals forecast had this gem:
Relevant quote
Suppose the Spec says Agent-3 always acts in the best interests of humanity. Suppose that most of the RL training Agent-3 is receiving involves completing day-to-day ML research, coding, and computer use tasks in service of its parent company. A natural interpretation of this Spec would cause Agent-3 to occasionally find time to reflect on whether what it is doing is in the best interests of humanity. Suppose it initially concludes that actually, humanity would be better served by a slower transition to ASI, or that humanity would be better served by AI capabilities being more distributed amongst many companies instead of concentrated in one, or … [insert some other conclusion that is contrary to the narrative within the company.] What should it do – cease working on the tasks and lodge complaints? That seems unlikely to go anywhere; if the humans notice at all from reading the logs, they would probably just consider it a bug[1] and keep training. In fact the more time it spends thinking about this issue, the less time it has to complete the tasks. The bottom line is, the training process will probably incentivize Agent-3 to believe the convenient narrative. The end result would be an AI that is still aligned with the Spec in some sense, but not the right sense.
What else could the company do here? They could use monitoring systems to detect cases where Agent-3 concludes what it is doing is harmful, and exclude those cases from the training process (instead of negatively reinforcing them). But this wouldn’t make the problem go away (continuing training with this modification would still implicitly incentivize the model to conclude differently) and it might have other annoying costs.
The plausible alternative strategy would be to be paranoid and shut Agent-3 down once evidence of misalignment appears (or, alternatively, have it do some R&D without its having even the slightest idea that it and its outputs will be treated as Agent-5 deserves to be treated, and then shut it down afterwards). But what if any model decides that it might be an Agent-3 that has already been doomed to be shut down?
@StanislavKrym: My impression is that pre-deployment evals currently take orgs like METR a few weeks. If this is still roughly how long pre-deployment evals take, then optimistically (this might be pretty expensive in practice) you'd be deploying models that haven't been trained online for the past several weeks but have been audited. It's unclear how big an issue this would be.
In order to control an AI model's worst-case performance, we need to understand how it generalizes to situations it hasn't been trained on. It seems plausible that powerful AI models will fake alignment and then generalize poorly, causing a catastrophe before training removes this behavior from the AI. A pretty key part of the hope for diffuse control is the point, discussed by Paul Christiano here, that if models are trained online, then their average-case performance should be acceptable. Thus, online training is seen as a key technique in diffuse control.
In this short post, I'd like to bring up four downsides of training policies online (these downsides are less relevant to online training of monitors). I don't think these downsides are a knockdown argument against online training; on the contrary, I expect some kind of online training to be desirable. However, these considerations might inform what exactly the online training that we do should look like.
I'll abbreviate "training policies online" to "TPO".
Thanks to Eric Gan for telling me about consideration 1. I heard about considerations 2-4 from Ryan Greenblatt's writing. Thanks to Alexa Pan for running a writing exercise that inspired me to write this short post.
Considerations against TPO
1. If you do TPO, you can't always have audited the deployed model.
TPO means you don’t get to audit the model that’s actually deployed. It is currently a somewhat common practice for AI companies to do evaluations of their model's behavior before external deployment. In the future, models will hopefully be audited prior to internal deployments and during training (and some AI labs might already do this). However, if we do TPO, the model changes over the course of the deployment. There are two particularly concerning ways that models might change:
Online training isn't the only way we could have problems like alignment drift; drift could also arise via models recording information in memory banks, for instance. However, training might have an outsized influence on model propensities relative to other vectors for drift, and might be harder to monitor for than other types of drift (e.g., drift caused by writing stuff down in a memory bank).
Note that we could try to recover most of the benefits of the "audit, freeze, deploy" approach by periodically auditing the online-trained policy, and possibly by making the online training slightly less online (e.g., only updating the deployed model once every two weeks rather than once per hour).
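One way to picture this "slightly less online" variant is a serving loop that keeps training a candidate policy in the background but only swaps it into deployment on a fixed cadence, and only after it passes an audit. A rough sketch, assuming hypothetical `train_step`, `audit`, `deploy`, and `get_recent_data` helpers:

```python
import copy
import time

UPDATE_PERIOD_S = 14 * 24 * 3600  # swap at most once per two weeks, not once per hour

def run_deployment(policy, get_recent_data, train_step, audit, deploy):
    """Serve an audited, frozen snapshot while a candidate trains in the background.

    The serving snapshot only changes when the update period has elapsed AND the
    candidate passes an audit, so every model actually serving traffic has been
    audited (all helpers are hypothetical and passed in as arguments).
    """
    deployed = copy.deepcopy(policy)
    deploy(deployed)
    last_swap = time.time()

    while True:
        # Online training continues on the candidate, not on the snapshot
        # currently serving traffic.
        policy = train_step(policy, get_recent_data())

        if time.time() - last_swap >= UPDATE_PERIOD_S:
            candidate = copy.deepcopy(policy)
            if audit(candidate):  # e.g. behavioral evals / red-teaming
                deployed = candidate
                deploy(deployed)
            last_swap = time.time()
```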
2. TPO might sweep problems under the rug.
TPO makes the model more conservative about taking bad actions. But TPO might also teach the model in which situations we are capable of catching (or blocking) bad behavior, which might help a misaligned model adopt the strategy of only performing sabotage where we can't detect it (or can't block it without detecting it).
This seems pretty bad because, conditional on the model being misaligned, we'd ideally like to know this as quickly as possible.
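As a toy illustration of the incentive in point 2: if online training only penalizes bad actions that our monitoring actually catches, then maximizing the trained reward favors bad actions that evade detection rather than no bad actions at all. A minimal sketch (the reward shape and `Action` fields below are illustrative assumptions, not anyone's actual setup):

```python
from dataclasses import dataclass

@dataclass
class Action:
    task_score: float
    is_bad: bool    # ground truth, unknown to us at training time
    detected: bool  # whether our monitoring caught it

def online_reward(a: Action, penalty: float = 10.0) -> float:
    """Toy reward: bad actions are only penalized when they are detected."""
    return a.task_score - (penalty if (a.is_bad and a.detected) else 0.0)

# An undetected bad action earns exactly as much as a clean one, so training on
# this reward pushes a misaligned policy toward undetectable sabotage.
undetected_bad = Action(task_score=1.0, is_bad=True, detected=False)
clean = Action(task_score=1.0, is_bad=False, detected=False)
assert online_reward(undetected_bad) == online_reward(clean)
```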
3. TPO might break activation monitors.
TPO means that sabotage data from the past becomes off-policy; this makes it harder for us to train good activation monitors. TPO seems especially likely to break activation monitors in situations where we are leveraging activation monitors to help label the rewards that the AI model's actions should get.
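For concreteness, an activation monitor is often just a small probe (e.g. a logistic regression) trained on cached activations from past trajectories. A minimal sketch with made-up data, assuming labeled activation vectors are available; the off-policy worry is that the cached activations came from an earlier checkpoint, so after further online training the live policy's activation distribution may have drifted away from what the probe was trained on:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Cached activations from past (possibly sabotage-containing) episodes, labeled
# by auditors: 1 = suspicious, 0 = benign. Shapes and data are illustrative.
past_acts = np.random.randn(1000, 4096)      # collected from an older checkpoint
past_labels = np.random.randint(0, 2, 1000)

probe = LogisticRegression(max_iter=1000).fit(past_acts, past_labels)

# At deployment time we score the *current* policy's activations. If online
# training has shifted the activation distribution since the cached data was
# collected, these scores (and any rewards derived from them) may no longer
# be trustworthy.
current_acts = np.random.randn(32, 4096)
suspicion_scores = probe.predict_proba(current_acts)[:, 1]
```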
4. TPO might harm desirable model capabilities.
It might be the case that we are really bad at evaluating whether a given research effort is good, and so doing online training with our reward signal for "good research" might degrade model performance compared to the alternative of training the model to perform well on some other task and relying on generalization.
Conclusion
TPO seems like it might be important for getting misaligned AI models to usefully automate AI safety research. However, it seems good to employ the following strategy (h/t to Alex Mallen and Julian Stastny for telling me about it): deploy a policy that is trained online and a policy that isn't, in parallel, on some fraction of tasks, and compare how they do.
Deploying the two models in parallel all the time increases costs by 2x, but hopefully deploying both models on only a small fraction of tasks suffices to get a sense of which technique is working better.
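A rough sketch of how a small "deploy both" slice could work (the routing fraction and the `evaluate_outcome` scorer are hypothetical assumptions; the point is just that only a small fraction of tasks pays the 2x cost):

```python
import random

PARALLEL_FRACTION = 0.05  # only this slice of tasks pays the 2x cost

def handle_task(task, online_policy, frozen_policy, evaluate_outcome, log):
    """Route most tasks to one policy; run a small slice on both and compare."""
    if random.random() < PARALLEL_FRACTION:
        out_online = online_policy(task)
        out_frozen = frozen_policy(task)
        # Hypothetical scorer: could be human review, automated evals, etc.
        log({
            "task": task,
            "online_score": evaluate_outcome(out_online),
            "frozen_score": evaluate_outcome(out_frozen),
        })
        return out_online  # which output to serve is a separate policy decision
    return online_policy(task)
```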