Changing selection pressures to align with intended behaviors: This might involve making training objectives more robust, iterating against held-out evaluation signals, or trying to overwrite the AI's motivations at the end of training with high-quality, aligned training data.
Is increasing the intelligence of the reward models another broad direction for this? A fun hypothetical I've heard is to imagine replacing the reward model used during training with Redwood Research (the organization).
So we imagine that during training, whenever an RL reward must be given or not given, a virtual copy of all Redwood Research employees is spun up on the training servers and has one subjective week to decide whether/how to give the RL reward, with their existing knowledge, access to the transcript of the model completing the task, access to data from previous such decisions, the ability to use whatever internals-based probes are available, etc.
My instinct is this would help a lot.
To the extent trusted or controlled AI systems can approximate this thought experiment (or do even better!), that seems great for safety, and we should use them to do so. People use the term "automated alignment research", but this sort of thing seems like an example of automated safety work that isn't "research".
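Here's a rough sketch, in code, of what a trusted-AI approximation of that reward decision could look like. This is purely illustrative: the interface, names, and median-style aggregation are my own assumptions, not anything specified in the thought experiment.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Illustrative sketch only: a "deliberative reward decision" interface in which
# trusted/controlled AI judges approximate the Redwood-as-reward-model idea.
# All names and the aggregation rule are assumptions, not a real API.

@dataclass
class RewardEvidence:
    transcript: str                          # transcript of the model completing the task
    prior_decisions: list = field(default_factory=list)  # data from previous such decisions
    probe_scores: Optional[dict] = None      # whatever internals-based probes are available

def deliberative_reward(evidence: RewardEvidence,
                        judges: list[Callable[[RewardEvidence], float]]) -> float:
    """Each trusted judge independently scores the episode from the same evidence;
    the final RL reward is the middle judgment, a crude stand-in for the panel
    deliberating for 'one subjective week' and reaching a decision."""
    scores = sorted(judge(evidence) for judge in judges)
    return scores[len(scores) // 2]
```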
[cross-posted from EAF]
Thanks for writing this!!
This risk seems equal to or greater than AI takeover risk to me. Historically the EA & AIS communities focused more on misalignment, but I'm not sure that choice has held up.
Come 2027, I'd love for it to be the case that an order of magnitude more people are usefully working on this risk. I think it will be rough going for the first 50 people in this area; I expect there's a bunch more clarificatory and scoping work to do; this is uncharted territory. We need some pioneers.
People with plans in this area should feel free to apply for career transition funding from my team at Coefficient (fka Open Phil) if they think that would be helpful to them.
Thanks for writing this.
One question I have about this and other work in this area concerns the training/deployment distinction. If AIs are doing continual learning once deployed, I'm not quite sure what that does to this model.
Thanks Tom! Appreciate the clear response. This feels like it significantly limits how much I update on the model.
We simulate AI progress after the deployment of ASARA.
We assume that half of recent AI progress comes from using more compute in AI development and the other half comes from improved software. (“Software” here refers to AI algorithms, data, fine-tuning, scaffolding, inference-time techniques like o1 — all the sources of AI progress other than additional compute.) We assume compute is constant and only simulate software progress.
We assume that software progress is driven by two inputs: 1) cognitive labour for designing better AI algorithms, and 2) compute for experiments to test new algorithms. Compute for experiments is assumed to be constant. Cognitive labour is proportional to the level of software, reflecting the fact AI has automated AI research.
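To make that feedback loop concrete, here's a minimal simulation sketch. It is not the report's actual model; the functional form and every parameter value below (especially the "ideas get harder to find" exponent `beta`) are my own illustrative assumptions.

```python
import math

# Minimal sketch of the feedback loop described above: software level S,
# cognitive labour proportional to S, constant experiment compute C.
# NOT the report's model; functional form and parameters are assumptions.

def simulate_software_progress(
    S0: float = 1.0,     # initial software level (normalized)
    C: float = 1.0,      # compute for experiments (held constant)
    alpha: float = 0.5,  # weight on cognitive labour vs. experiment compute
    lam: float = 0.5,    # returns to research input
    beta: float = 0.6,   # "ideas get harder to find": progress slows as S rises
    dt: float = 0.01,
    years: float = 10.0,
) -> float:
    S = S0
    for _ in range(int(years / dt)):
        labour = S                                 # cognitive labour proportional to software level
        research = labour**alpha * C**(1 - alpha)  # combined research input
        growth = research**lam / S**beta           # instantaneous rate of software progress
        S *= math.exp(growth * dt)
    return S

# Growth accelerates when alpha * lam > beta and levels off when it is smaller.
print("decelerating case:", simulate_software_progress(beta=0.6))
print("accelerating case:", simulate_software_progress(beta=0.1, years=5.0))
```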
Your definition of software includes all data, which strikes me as an unusual use of the term, so I'll put it in scare quotes.
You say half of recent AI progress came from "software" and half from compute. Then in your diagram, the cognitive labor gained from better AI goes toward improving "software."
To me it seems like a ton of recent AI progress came from using up a data overhang, in the sense of scaling up compute enough to take advantage of an existing wealth of data (the internet, most or all books, etc.).
I don't see how more AI researchers, automated or not, could find more of this data. The model has their cognitive labor being used to increase "software." Does the model assume that they are finding or generating more of this data, in addition to doing R&D for new algorithms, or other "software" bucket activities?
These methods may be too aggressive. Before we have ASARA, less capable AI systems may still accelerate software progress by a more moderate amount, plucking the low-hanging fruit. As a result, ASARA has less impact than we might naively have anticipated.
I'm confused.
My default assumption is that prior to ASARA, less-capable AIs will have accelerated software progress a lot — so I'm interested in working that into the model.
It looks like your "gradual boost" section is for people like me; you simulate the gradual emergence of the ASARA boost over a period of five years. But in the gradual boost section, you conclude that using this model results in a higher chance of >10 years of progress being compressed into one year. (I'm not currently following the logic there, just treating it as a black box.)
Why, then, is the sentence "As a result, ASARA has less impact than we might naively have anticipated" true? It seems like this consideration actually ends up meaning ASARA has more impact.
Just wanted to say I really enjoyed this post, especially your statement of the problem in the last paragraph.
The guy next to me, who introduced himself as "Blake, Series B, stealth mode,"
I don't think it makes sense to have a startup that is in stealth mode but is also raising a Series B (a later round of funding for scaling once you've found a proven business model).
Thanks for the reply!
When I say "future updates" I'm referring to stuff like the EM finetuning you do in the paper; I interpreted your hypothesis as being that, for inoculated models, updates from the EM finetuning are in some sense less "global" and more "local".
Maybe that's a more specific hypothesis than what you intended, though.
More speculative thoughts:
Even more speculative thoughts: