UPDATE
When writing this post, I think I was biased by the specific work I'd been doing for the 6-12 months prior, and generalised from it too far.
I think I stand by the claim that tooling alone could speed up the research I was working on by 3-5x. But even on the same agenda now, the work is far less amenable to major speedups from tooling. Now, work on the agenda is far less "implement several minor algorithmic variants, run hyperparameter sweeps that take <1 hour, evaluate a set of somewhat concrete metrics, repeat", and more "think deeply about which variants make the most sense, run >3-hour jobs/sweeps, evaluate murkier metrics".
The main change was switching our experiments from toy models to real models, which massively loosened the iteration loop due to increased training time and less clear evaluations.
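For concreteness, the old toy-model regime looked roughly like the sketch below. Every variant name, hyperparameter range, and metric is a hypothetical stand-in (with a dummy training function), not our actual setup; the point is just how tight the loop was.

```python
# Sketch of the old toy-model loop: a small grid of minor variants x
# hyperparameters, sub-hour runs, and concrete metrics. All names, ranges,
# and the dummy training function are hypothetical stand-ins.
import itertools
import random

VARIANTS = ["baseline", "variant_a", "variant_b"]   # minor algorithmic variants
LEARNING_RATES = [1e-3, 3e-4, 1e-4]
SPARSITY_COEFFS = [0.01, 0.1]

def train_and_evaluate(variant, lr, sparsity_coeff):
    """Stand-in for a <1 hour toy-model run that returns concrete metrics."""
    random.seed(hash((variant, lr, sparsity_coeff)) % 2**32)
    return {"recon_loss": random.random(),
            "components_recovered": random.randint(0, 20)}

results = []
for variant, lr, coeff in itertools.product(VARIANTS, LEARNING_RATES, SPARSITY_COEFFS):
    results.append(((variant, lr, coeff), train_and_evaluate(variant, lr, coeff)))

# Rank configs on a concrete metric, inspect the best few, then go implement
# the next variant and repeat.
results.sort(key=lambda r: r[1]["recon_loss"])
for config, metrics in results[:3]:
    print(config, metrics)
```

With real models the loop has the same shape, but each `train_and_evaluate` call is a multi-hour job and the metrics are much harder to score automatically.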
For the current state, I think 6-12 months of tooling progress might give a 1.5-2x speedup over the baseline of 1 year ago.
I still believe that I underestimated safety research speedups overall, but not by as much as I thought 2 months ago.
My hot take is that I agree human researchers spend a ridiculous amount of time doing stupid stuff (see my shortform on this), but I also don't think it's very easy to automate the stupid stuff.
I've optimized my research setup to get quite tight feedback loops. If I had more slack I could probably make things even better, but it would look more like developing better infrastructure and hyperparameter optimization (hpopt) techniques myself than handing work off to agents.
I disagree that, in theory, you have to use grid search and can't use anything more clever. I currently use grid searches too, for simplicity, and it's definitely nontrivial to get the cleverer approach to tell you about interactions, but it doesn't seem fundamentally impossible.
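As a rough illustration of the kind of "cleverer than grid search" thing I mean: run a random (or Bayesian) search instead of a full grid, then fit a cheap surrogate with pairwise interaction terms to the results to flag hyperparameters that can't be tuned independently. The sketch below is self-contained; the objective, hyperparameter names, and ranges are all made up for illustration, not my actual setup.

```python
# Random search plus a cheap interaction read-out, as an alternative to a full
# grid. The objective, hyperparameter names, and ranges are hypothetical.
import itertools
import numpy as np

rng = np.random.default_rng(0)

def run_experiment(lr, sparsity_coeff, loss_variant):
    # Placeholder for a real training-and-evaluation job, with a deliberate
    # lr x sparsity interaction baked in so the read-out has something to find.
    base = (np.log10(lr) + 3) ** 2 + (sparsity_coeff - 0.3) ** 2
    interaction = 2.0 * np.log10(lr) * sparsity_coeff
    offset = {"l2": 0.0, "l1": 0.2, "topk": -0.1}[loss_variant]
    return base + interaction + offset + rng.normal(scale=0.05)

# 1) Sample configs from the search space instead of enumerating a grid.
n_trials = 200
lrs = 10 ** rng.uniform(-4, -1, n_trials)
coeffs = rng.uniform(0.0, 1.0, n_trials)
variants = rng.choice(["l2", "l1", "topk"], n_trials)
losses = np.array([run_experiment(l, c, v) for l, c, v in zip(lrs, coeffs, variants)])

# 2) Fit a linear surrogate with pairwise interaction terms; large interaction
#    coefficients flag hyperparameters that can't be tuned independently.
feats = {
    "log_lr": np.log10(lrs),
    "coeff": coeffs,
    "is_l1": (variants == "l1").astype(float),
    "is_topk": (variants == "topk").astype(float),
}
names, cols = list(feats), [feats[n] for n in feats]
for (na, a), (nb, b) in itertools.combinations(feats.items(), 2):
    names.append(f"{na} x {nb}")
    cols.append(a * b)
X = np.column_stack([np.ones(n_trials)] + cols)
beta, *_ = np.linalg.lstsq(X, losses, rcond=None)
for name, b in sorted(zip(["bias"] + names, beta), key=lambda t: -abs(t[1])):
    print(f"{name:>16s} {b:+.3f}")
```

A real version would swap the dummy objective for actual sweep results (or an off-the-shelf Bayesian optimisation library), but the basic point stands: you can recover some interaction information without exhaustively gridding.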
(See my update above, written 2 months later.)
A year or so ago, I thought that 10x speedups in safety research would require AIs so capable that takeover risks would be very high. 2-3x gains seemed plausible, 10x seemed unlikely. I no longer think this.
What changed? I continue to think that AI x-risk research is predominantly bottlenecked on good ideas, but I suffered from a failure of imagination about the speedups that could be gained from AIs that are unable to produce great high-level ideas. I've realised that humans trying to get empirical feedback on their ideas waste a huge number of thought cycles on tasks that could be done by merely moderately capable AI agents.
I don’t expect this to be an unpopular position, but I thought it might be useful to share some details of how I see this speedup happening in my current research.
If we stopped frontier AI progress today but had 6-12 months of tooling and scaffolding progress, I think the research direction I’ve been working on, which contains a mix of conceptual and empirical interpretability work (parameter decomposition), could speed up by 3-5x from the base rate of 1 year ago. The most recent month or two might have been a 1.5-2x speedup.
Where do the speedups come from? The majority of the time my team has spent on the parameter decomposition agenda has gone as follows:
You might think “surely humans without AI can minimise the number of iterations here by doing very wide sweeps initially?”. Unfortunately, the space of possible hyperparameter sweeps is extremely large when you’re testing 20+ different possible loss functions and training process miscellanea. This is a problem because:
How I think we could get a 3-5x speedup with better tooling/scaffolding only:
Going from 3-5x to 10+x with still-safe AI just comes from models that are capable enough to do more iterations of 1-4 on their own, and are able to provide better ideas and analysis to the humans. I don’t know what level of capabilities is required to achieve this, but I don’t think it’s too far from the current level. Provided these slightly more capable models are not integrated absolutely everywhere in society with minimal controls, I don’t expect them to have large x-risk.
If they tried to do more than a few iterations, I expect today's models would get too far off-track.