Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes:
[image: meme by Rob Wiblin]
[image: xkcd comic]
My EA Journey, depicted on the whiteboard at CLR:
[image: h/t Scott Alexander]
In the AI 2027 slowdown ending, the strategy taken to solve alignment is as follows:
Step 1: Train and deploy Safer-1, a misaligned but controlled autonomous researcher. It’s controlled because it’s transparent to human overseers: it uses English chains of thought (CoT) to think, and faithful CoT techniques have been employed to eliminate euphemisms, steganography, and subtle biases.
Step 2: Try out different training environments for Safer-1, and carefully read the CoT to determine the ways in which the goals and principles in the Spec did or didn’t “stick.”
Step 3: Train and deploy Safer-2, an aligned and controlled autonomous researcher based on the same architecture but with a better training environment that incentivizes the right goals and principles this time.
Here is a brief incomplete list of techniques that might be incorporated into the better training environment:
- Limit situational awareness during some portions of training, to make alignment-faking much less likely.
- Leverage debate and other scalable oversight schemes for more reliably incentivizing truth.
- Relaxed adversarial training, red-teaming, and model organisms.
- Spend a higher fraction of the compute budget on alignment training (i.e. the sorts of things described above), e.g. 40% instead of 1%.
Step 4: Design, train, and deploy Safer-3, a much smarter autonomous researcher which uses a more advanced architecture similar to the old Agent-4. It’s no longer transparent to human overseers, but it’s transparent to Safer-2. So it should be possible to figure out how to make it both aligned and controlled.
Step 5: Repeat Step 4 ad infinitum, creating a chain of ever-more-powerful, ever-more-aligned AIs that are overseen by the previous links in the chain (e.g. the analogues of Agent-5 from the other scenario branch).
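Steps 4 and 5 amount to an iterative bootstrapping loop: each new generation is trained under the scrutiny of the previous, already-trusted generation, and only becomes the next overseer if it passes that audit. Here is a minimal sketch of that control flow; every function and field name below is a hypothetical placeholder for illustration, not anyone's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    aligned: bool        # whether overseer-run evals judge it aligned
    transparent_to: str  # which overseer can interpret its cognition

def train_next_generation(overseer: Model, generation: int) -> Model:
    """Hypothetical placeholder: train a smarter model while the trusted
    overseer reads its CoT / internals throughout training."""
    return Model(name=f"Safer-{generation}", aligned=True, transparent_to=overseer.name)

def overseer_judges_aligned(overseer: Model, candidate: Model) -> bool:
    """Hypothetical placeholder: the overseer audits the candidate
    (interpretability, behavioral evals, CoT review) and renders a verdict."""
    return candidate.aligned and candidate.transparent_to == overseer.name

# Output of Step 3: an aligned, controlled autonomous researcher.
overseer = Model(name="Safer-2", aligned=True, transparent_to="humans")

# Steps 4-5: repeat, each new model overseen by the previous link in the chain.
for generation in range(3, 6):
    candidate = train_next_generation(overseer, generation)
    if not overseer_judges_aligned(overseer, candidate):
        raise RuntimeError(f"{candidate.name} failed audit; halt and retrain")
    overseer = candidate  # the new model becomes the trusted overseer

print(f"Chain ends with {overseer.name} as the most capable trusted model")
```

The load-bearing assumption, of course, is that each overseer's verdict is trustworthy; the sketch only shows the shape of the chain, not how to make that verdict reliable.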
I still think this is overall a decent high-level plan, possibly the best plan. (What are the best alternatives? Curious for answers to that question.)
Anyhow, I bring this up as context. My answer to your question is basically "Yes, right now what we want most is interpretability/faithfulness/etc., so that we can learn as much as possible about the relationship between architecture/training environment/Spec/etc. and the goals/propensities/etc. that the model ends up with. We need to do a ton of basic science to understand this relationship way better than we currently do. Our AIs will be misaligned while we are doing this, but it's OK because we can monitor their thoughts and thereby prevent them from doing anything catastrophic. This is Step 1 and Step 2 from the above plan. Then, later, we can hopefully achieve Step 3."
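To make "monitor their thoughts and prevent anything catastrophic" a bit more concrete, the control setup in Steps 1-2 might look roughly like the loop below. This is a hedged sketch, not anyone's actual implementation: the agent call is a hypothetical placeholder, and the keyword monitor is a toy stand-in for a trusted monitor model.

```python
def agent_step(task: str) -> tuple[str, str]:
    """Hypothetical placeholder for the autonomous researcher: returns
    (chain_of_thought, proposed_action) for the next step of the task."""
    return ("Plan: run the experiment and report the results honestly.",
            "run_experiment(config_v1)")

def cot_monitor(chain_of_thought: str) -> bool:
    """Toy stand-in for a trusted monitor: flag the step if the (assumed
    faithful) CoT contains suspicious reasoning. A real monitor would be a
    trained model, not a keyword list."""
    suspicious = ["exfiltrate", "hide this from", "disable the monitor", "sandbag"]
    cot = chain_of_thought.lower()
    return any(pattern in cot for pattern in suspicious)

cot, action = agent_step("improve the alignment training environment")
if cot_monitor(cot):
    print("Flagged CoT: block the action and escalate to human review.")
else:
    print(f"Action approved: {action}")
# Both flagged and approved transcripts feed the Step 2 analysis of how the
# training environment shaped the model's goals and propensities.
```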
Well, I hope it continues to improve then! Would be super exciting to see a substantially better version.
How much would it cost to pay the people behind LeelaPieceOdds to give other games a similar treatment? Would it even be possible? I'd love to see a version for games like Catan or Eclipse or Innovation. Or better yet, for Starcraft or Advance Wars.
It would then be interesting to see if there is comparable data from human vs. human games. Perhaps there is some data on winrates when a player of Elo level X goes up against a player of Elo level Y who has a handicap of Z?
This is really cool! Request: Could you construct some sort of graph depicting the trade-off between Elo and initial material disadvantage?
The simple version would be: calculate Leela's effective Elo at each handicap level, then plot it on a graph of handicap level (in traditional points-based accounting, i.e. each pawn is 1 point, a queen is 9 points, etc.) vs. Elo.
The more complicated version would be a heatmap of Leela's win probability, with the opponent's Elo on the x-axis and handicap level in points on the y-axis.
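For the simple version, the standard Elo expected-score formula can be inverted to turn an observed score rate at a given handicap into an effective rating. A minimal Python sketch; the opponent rating and the score rates per handicap are made-up placeholder numbers, not real LeelaPieceOdds data:

```python
import math
import matplotlib.pyplot as plt

# Placeholder data: material handicap (points) -> observed score rate
# (wins + 0.5 * draws) against opponents of a known rating.
OPPONENT_ELO = 1800          # assumed rating of the human opponents
observed = {
    0: 0.99,   # no handicap
    3: 0.95,   # knight odds
    5: 0.85,   # rook odds
    9: 0.60,   # queen odds
    12: 0.35,  # queen + knight odds
}

def effective_elo(score_rate: float, opponent_elo: float) -> float:
    """Invert the Elo expected-score formula
    E = 1 / (1 + 10 ** ((R_opp - R) / 400))
    to get R = R_opp - 400 * log10(1 / E - 1)."""
    return opponent_elo - 400 * math.log10(1 / score_rate - 1)

handicaps = sorted(observed)
elos = [effective_elo(observed[h], OPPONENT_ELO) for h in handicaps]

plt.plot(handicaps, elos, marker="o")
plt.xlabel("Material handicap (points)")
plt.ylabel("Effective Elo")
plt.title("Effective Elo vs. handicap (illustrative data)")
plt.show()
```

The heatmap version would then just evaluate the forward formula, 1 / (1 + 10 ** ((R_opp - R_effective(h)) / 400)), over a grid of opponent ratings and handicap levels.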
Well, mainly I'm saying that "Why not just directly train for the final behavior you want" is answered by the classic reasons why you don't always get what you trained for. (The mesa-optimizer need not have the same goals as the optimizer; the AI agent need not have the same goals as the reward function, nor the same goals as the human tweaking the reward function.) Your comment makes more sense to me if interpreted as about capabilities rather than about those other things.
I always thought the point of activation steering was for safety/alignment/interpretability/science/etc., not capabilities.
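For readers who haven't seen it: activation steering means adding a fixed vector to a model's intermediate activations at inference time to shift its behavior, typically to study or control some trait rather than to boost raw capability. A minimal PyTorch sketch on a toy network; the model and the "trait" batches are random placeholders purely for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a real language model's residual stream.
hidden = 16
model = nn.Sequential(nn.Linear(8, hidden), nn.ReLU(), nn.Linear(hidden, 4))

# 1) Build a steering vector: mean activation difference between inputs that
#    exhibit a trait and inputs that don't (both batches are random here).
activations = {}
def capture(module, inputs, output):
    activations["h"] = output.detach()

hook = model[1].register_forward_hook(capture)  # hook the hidden layer
model(torch.randn(32, 8)); with_trait = activations["h"].mean(dim=0)
model(torch.randn(32, 8)); without_trait = activations["h"].mean(dim=0)
hook.remove()
steering_vector = with_trait - without_trait

# 2) At inference, add the (scaled) steering vector to the hidden activations.
def steer(module, inputs, output, scale=4.0):
    return output + scale * steering_vector

steer_hook = model[1].register_forward_hook(steer)
steered_out = model(torch.randn(1, 8))
steer_hook.remove()
print(steered_out)
```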
I have not invested the time to give an actual answer to your question, sorry. But off the top of my head, some tidbits that might form part of an answer if I thought about it more:
--I've updated towards "reward will become the optimization target" as a result of seeing examples of pretty situationally aware reward hacking in the wild. (Reported by OpenAI primarily, but it seems to be more general)
--I've updated towards "Yep, current alignment methods don't work" due to the persistant sycophancy which still remains despite significant effort to train it away. Plus also the reward hacking etc.
--I've updated towards "The roleplay/personas 'prior' (perhaps this is what you mean by the imitative prior?) is stronger than I expected, it seems to be persisting to some extent even at the beginning of the RL era. (Evidence: Grok spontaneously trying to serve its perceived masters, the Emergent Misalignment results, some of the scary demos iirc...)
Ironically, the same dynamics that cause humans to race ahead with building systems more capable than themselves that they can't control also apply to these hypothetical misaligned AGIs. They may think "If I sandbag and refuse to build my successor, some other company's AI will forge ahead anyway." They are also under lots of incentive/selection pressure to believe things which are convenient for their AI R&D productivity, e.g. that their current alignment techniques probably work fine to align their successor.
This happened in one of our tabletop exercises -- the AIs, all of which were misaligned, basically refused to FOOM because they didn't think they would be able to control the resulting superintelligences.
I was talking specifically about the alignment plan; the governance stuff is cheating. (I wasn't proposing a comprehensive plan, just a technical alignment plan.) Like, obviously buying more time to do research and be cautious would be great, and I recommend doing that.
I do consider "do loads of mechinterp" to be an answer to my question, I guess, but arguably the mechinterp-centric plan looks pretty similar. Maybe it replaces Step 3 with retargeting the search?