The argument against doing R&D on contemporary systems because of what future systems will be capable of has always been shortsighted. Two examples of this are nuclear weapons controls and line-level commenting of programming code.
In nuclear weapons development, safety advocates argued for safety measures on the weapons themselves to prevent misuse. They were overruled by arguments that future systems would be so physically secure they couldn't be stolen, so controls were centralized in the launch control process, usually with a single person there holding ultimate authority. Proliferation and miniaturization eventually made theft and misuse major risks, and an entire specialized industry sprang up to develop and implement Permissive Action Links (PALs).
It wasn't until the late 1980s that the entire U.S. nuclear weapons inventory was equipped with PALs. Which is nuts. Even then, malfunction concerns were paramount, and they remain concerns today, creating real risks for deterrence-based defenses. Even now PALs are still being upgraded for reliability. It was the need to retrofit PALs that stretched implementation of these questionably reliable devices across a 70-year timeframe. Safety R&D as a primary focus during the pioneering era of nuclear weapons could have prevented that mess. Trying to shoehorn it in later was unbelievably expensive and difficult, and it created strategic threats.
The finance and defense sectors have a similar problem with software. High-turnover iteration led to a culture that eschewed internal comments in the code. Now making changes creates safety concerns that can only be overcome with fantastically complex and expensive testing regimes, which take so long to complete that the problem they are trying to solve has often been fixed another way by the time they finish. It's phenomenally wasteful. Again, future safety was ignored because of arguments based on future capabilities.
Most importantly, though, R&D on contemporary systems is how we learn to study future systems. Even though future generations of the technology will be vastly different, they are still based, ultimately, on today's systems. Treating future safety as a contemporary core competency paves the way for studying and innovating on future systems, and it prevents the waste of ad hoc shoehorning exercises later on.
What are your thoughts on the impact of larger, more capable models that are also tuned (or capability-fenced) to be task-specific only, where the trust range that needs to be factored in is considerably smaller?
Sometimes people say that it's much less valuable to do AI safety research today than it will be in the future, because the current models are very different—in particular, much less capable—than the models that we're actually worried about. I think that argument is mostly right, but it misses a couple of reasons models as capable as current models might still be important at the point where AI is posing substantial existential risk. In this post, I'll describe those reasons.
Our best trusted models might be at a similar level of capabilities to current models: It might be the case that we are unable to make models which are much more powerful than current models while still being confident that these models aren't scheming against us. That is, the most powerful trusted models we have access to are similar in capabilities to the models we have today. I currently predict that our best trusted models will be a lot more capable than current models, but it seems plausible that they won't, especially in short timelines.
This has a few implications:
Due to compute constraints, large fractions of (automated) safety research might happen on relatively weak models: Right now, lots of safety research happens on models which are weaker than the best models, because this reduces compute costs and makes experiments more convenient to run. I expect that once we have dangerously powerful models we'll be able to automate large fractions of running experiments, so convenience will be less of an issue, but compute might become more of a bottleneck. Currently most safety work is bottlenecked by the availability of human researcher labor, but as AIs automate this work, experiments could become increasingly bottlenecked on compute, which might result in using models that are smaller relative to the frontier than those human researchers use today. This will be most extreme once AIs fully automate safety research. In the case of capabilities research, I expect that a substantial factor in mitigating compute bottlenecks will be running experiments on small models and then extrapolating to bigger models.[1] I think we might want to apply a similar approach to safety research, where we do as much work as possible at very small scale; in general, work at small scale will look more appealing given how much more of it you can do.
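To make the "run small, then extrapolate" idea concrete, here is a toy sketch (not from the post itself; the model sizes and metric values are made up for illustration) of fitting a power law to results measured on small models and extrapolating to a larger scale:

```python
import numpy as np

# Toy illustration of extrapolating experimental results from small models to
# larger ones: fit a power law (linear in log-log space) to measurements taken
# at small scales, then predict the value at a larger scale.
# All numbers below are hypothetical.

model_sizes = np.array([1e8, 3e8, 1e9, 3e9])   # parameter counts of small models
metric = np.array([0.42, 0.35, 0.29, 0.24])    # e.g. a loss or attack-success rate

# Fit log(metric) = a * log(size) + b
a, b = np.polyfit(np.log(model_sizes), np.log(metric), deg=1)

target_size = 1e11                             # the larger model we actually care about
predicted = np.exp(a * np.log(target_size) + b)
print(f"extrapolated metric at {target_size:.0e} params: {predicted:.3f}")
```

Whether a given safety-relevant quantity actually follows a clean scaling trend like this is an empirical question; the point is only that the measurement cost is paid at small scale.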
Naively, we might expect that you'd want to spend similar amounts of compute on inference for AI researchers and on the compute they use for experiments. In the case of experiments that are compute-heavy by default, because they involve doing full training runs or large fine-tuning runs, you still might want the training cost to be comparable to the inference cost, implying that you'd want to run (many of) these experiments on models which are much smaller than frontier AI systems. If an experiment involves doing 1 million forward passes (e.g. for gradient steps) and the AI researcher effort to set up the experiment involves only 1,000 forward passes (with no human labor), then naively you'd be willing to spend 10x more AI labor (10,000 forward passes) to find some equally valuable way to do this experiment on a model which is so small that it has 100x lower inference cost than the AI researcher. Of course, in practice the trade-offs won't be this clear-cut and you won't be able to perfectly recover the value of larger experiments using additional labor, but nonetheless, I expect that spending tons of AI labor on miniaturizing experiments is likely to be worthwhile at some point.
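As a rough back-of-the-envelope check on the numbers above (the forward-pass counts and the 100x cost ratio are the hypothetical figures from this paragraph, not measurements), the trade-off looks like this:

```python
# Compare running an experiment at frontier scale versus spending extra AI labor
# to miniaturize it onto a much cheaper model.
# All quantities are in units of "frontier-model forward passes".

experiment_passes = 1_000_000  # forward passes the experiment itself needs
labor_naive = 1_000            # AI researcher labor to set up the frontier-scale version
labor_mini = 10_000            # 10x more labor spent finding a miniaturized version
small_cost_ratio = 1 / 100     # the small model's inference is 100x cheaper per pass

naive_total = experiment_passes + labor_naive
mini_total = experiment_passes * small_cost_ratio + labor_mini

print(f"frontier-scale experiment: {naive_total:,.0f} pass-equivalents")  # ~1,001,000
print(f"miniaturized experiment:   {mini_total:,.0f} pass-equivalents")   # ~20,000
print(f"compute saved:             {naive_total / mini_total:.0f}x")      # ~50x
```

Even with the 10x extra labor, the miniaturized version is roughly 50x cheaper in total, which is why the trade seems worth making once labor is cheap relative to compute.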
It's unclear how far this miniaturization goes (and there are approaches to reducing compute other than miniaturization, like iterating on techniques which are cheap to run and evaluate even on frontier scale models), but it seems plausible that large fractions of the experiments will be on models which are roughly as capable as current models. (These future models of comparable capability will presumably have much lower inference costs due to better algorithms, more compute (for overtraining the model), and better data). In the extreme, it's possible that most of the cumulative value of safety work prior to the point when human efforts are totally irrelevant will come from experiments on models which aren't more capable than the current best models.
In some cases, we might care about extrapolation for its own sake rather than to save compute, e.g. because we can't measure something directly on a more powerful model for whatever reason, or because we think that observing how something varies with capabilities/scale is generally informative about what is going on. This seemed like a less interesting application of weaker models with not many implications, so I decided not to discuss it in the main text. ↩︎