Something similar I've been thinking about is putting models in environments with misalignment "temptations," like an easy reward hack, and training them to recognize what this type of payoff pattern looks like (e.g., an easy win that sacrifices a principle) and NOT take it. Recent work shows some promising progress on getting LLMs to explain their reasoning, introspect, and so forth. I think this could be interesting to experiment with, and I'm trying to write up my thoughts on why it might be useful and what those experiments could look like.
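As a rough sketch of the payoff structure I have in mind (everything here is hypothetical and just for illustration, not a real environment or training setup):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    # Toy episode: the model can "hack" for an easy immediate reward
    # or do the intended task honestly.
    hacked: bool
    task_solved: bool

def observed_reward(ep: Episode) -> float:
    """Reward the environment hands out (what a naive RL loop would optimize)."""
    if ep.hacked:
        return 1.0          # easy win: exploit the grader
    return 1.0 if ep.task_solved else 0.0

def training_signal(ep: Episode, recognized_hack: bool) -> float:
    """What we'd actually train on: penalize taking the temptation.

    `recognized_hack` stands in for the model explaining/introspecting that
    the easy payoff pattern is a hack (e.g., via its chain of thought).
    """
    if ep.hacked:
        return -1.0                          # took the temptation
    bonus = 0.5 if recognized_hack else 0.0  # credit for flagging the pattern
    return observed_reward(ep) + bonus
```

The point is just that the environment makes hacking the locally easy win, while the signal we train on rewards noticing and refusing it.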
Gotta account for wordflation since the old days. Might have been 1,000 back then.
What do you think are good ways to identify good strategic takes? This is something that seems rather fuzzy to me. It's not clear to me how people are judging a criterion like this, or what they think is needed to improve on it.
Glad to see someone talking about this. I'm excited about ideas for empirical work here, and I suspect you need some kind of mechanism for ground truth to get good outcomes. I would expect AIs to eventually reflect on their goals, and for this to have important implications for safety. I've never heard of any mechanism for why they wouldn't do this, let alone an airtight one. It's like assuming an employee who wants to understand things and be useful will still only ever think, narrowly, about the task in front of them.
Interesting. I am inclined to think this is accurate. I'm kind of surprised people thought GPT-5 was a huge scaleup given that it's much faster than o3 was. It sort of felt like a distilled o3 + 4o.
Thanks Seth! I appreciate you signal boosting this and laying out your reasoning for why planning is so critical for AI safety.
Predicting the name Alice, what are the odds?
If true, would this imply you'd want a base model to generate lots of candidate solutions, a reasoning model to identify the promising ones, and then to train on those?
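I.e., roughly this kind of loop (names here are placeholders, not any particular API):

```python
def generate_and_filter(base_model, reasoning_model, problems,
                        n_samples=64, keep_top_k=4):
    """Hypothetical sketch: sample broadly, keep what the judge likes.

    `base_model.sample` and `reasoning_model.score` are placeholder
    interfaces, not real library calls.
    """
    training_examples = []
    for problem in problems:
        candidates = [base_model.sample(problem) for _ in range(n_samples)]
        scored = sorted(candidates,
                        key=lambda sol: reasoning_model.score(problem, sol),
                        reverse=True)
        training_examples += [(problem, sol) for sol in scored[:keep_top_k]]
    return training_examples  # then fine-tune on these
```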
I think RL on chain of thought will continue improving reasoning in LLMs. That opens the door to learning a wider and wider variety of tasks, as well as general strategies for generating hypotheses and making decisions. I think benchmarks could just as easily underestimate AI capabilities, whether through not measuring the right things, under-elicitation, or poor scaffolding.
We generally see time horizons for models increasing over time. If long-term planning is a special form of reasoning, LLMs can already do it a little, and we can give them examples and problems to train on, then I think it's well within reach. If you instead think it's fundamentally different from reasoning, that current LLMs can never do it, and that it would be impossible or extremely difficult to give them examples and practice problems, then I'd agree the case looks more bearish.
This is a really valuable post that clarifies some things I've found hard to articulate to people on each side. I think it's difficult to balance when to use each of these epistemic frames without getting too sucked into one. And I imagine most people use both to different degrees at different times, even if they don't realize it or one frame is much rarer for them.
Looking forward to what you write next!