AI companies should be safety-testing the most capable versions of their models

sjadler

This is a linkpost for https://stevenadler.substack.com/p/ai-companies-should-be-safety-testing

Only OpenAI has committed to the strongest approach: testing task-specific versions of its models. But evidence of their follow-through is limited.

You might assume that AI companies run their safety tests on the most capable versions of their models possible. But there are important versions that currently go untested: task-specific versions, specially trained to demonstrate how far a model can be pushed on a dangerous ability, like novel bioweapon design. Without exploring and evaluating these versions, AI companies are underestimating the risks their models could pose.

Only OpenAI (where I previously worked¹) has committed to what I consider to be the strongest vision for safety-testing: evaluating these task-specific versions of its models to understand the worst-case scenarios. I think OpenAI’s vision on this is truly laudable and great. But from my review of publicly available reports, it is not clear that OpenAI is in fact following through on this commitment. I believe that all leading AI companies should be doing this form of task-specific model testing - but if companies are not willing or able to today, I argue there are many middle-ground improvements they could pursue.

In this post, I will:

Walk you through why AI companies safety-test their models, and how they do it
Define task-specific fine-tuning (TSFT), and argue why it’s important for safety-testing
Summarize the status of TSFT at leading AI companies
Dig into a case study on OpenAI and TSFT
Outline challenges to safety testing using TSFT, and what middle-ground improvements exist

[ continues on Substack ]

Thanks for doing this, I found the chart very helpful! I'm honestly a bit surprised and sad to see that task-specific fine-tuning is still not the norm. Back in 2022 when our team was getting the ball rolling on the whole dangerous capabilities testing / evals agenda, I was like "All of this will be worse than useless if they don't eventually make fine-tuning an important part of the evals" and everyone was like "yep of course we'll get there eventually, for now we will do the weaker elicitation techniques." It is now almost three years later...

Crossposted from X

Back in 2022 when our team was getting the ball rolling on the whole dangerous capabilities testing / evals agenda, I was like…

Looks like the rest of the comment got cut off?

Daniel said:

Thanks for doing this, I found the chart very helpful! I'm honestly a bit surprised and sad to see that task-specific fine-tuning is still not the norm. Back in 2022 when our team was getting the ball rolling on the whole dangerous capabilities testing / evals agenda, I was like "All of this will be worse than useless if they don't eventually make fine-tuning an important part of the evals" and everyone was like "yep of course we'll get there eventually, for now we will do the weaker elicitation techniques." It is now almost three years later...

Oops, thanks, fixed!

The most capable version of each model has not yet been created when the model is released. As well as fine-tuning for specific tasks, scaffolding matters. The agentic scaffolds people create have an increasingly important role in the model's ultimate capability.

Scaffolding for sure matters, yup!

I think you're generally correct that the most-capable version hasn't been created, though there are times when AI companies do have specialized versions for a domain internally, and don't seem to be testing these anyway. It's reasonable IMO to think that these might outperform the unspecialized versions.