Thank you for taking the time to comment and for pointing out some errors in the post! Your attention to detail is impressive. I updated the post to reflect your feedback:
Good luck with the rest of the ARENA curriculum! Let me know if you come across anything else.
While I disagree with a lot of this post, I thought it was interesting and I don't think it should have negative karma.
I haven't heard anything about RULER on LessWrong yet:
RULER (Relative Universal LLM-Elicited Rewards) eliminates the need for hand-crafted reward functions by using an LLM-as-judge to automatically score agent trajectories. Simply define your task in the system prompt, and RULER handles the rest—no labeled data, expert feedback, or reward engineering required.
✨ Key Benefits:
- 2-3x faster development - Skip reward function engineering entirely
- General-purpose - Works across any task without modification
- Strong performance - Matches or exceeds hand-crafted rewards in 3/4 benchmarks
- Easy integration - Drop-in replacement for manual reward functions
Apparently it allows LLM agents to learn from experience and significantly improves reliability.
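For anyone curious what "LLM-as-judge relative rewards" might look like in practice, here is a minimal sketch of the general idea. This is my own illustration, not the actual RULER/ART API: the prompt, the `relative_rewards` helper, and the choice of judge model are all assumptions.

```python
# Illustrative sketch of LLM-as-judge relative rewards (NOT the actual RULER API).
import json
from openai import OpenAI  # any chat-completion backend would do; this is an assumption

client = OpenAI()

def relative_rewards(task_description: str, trajectories: list[str]) -> list[float]:
    """Ask a judge LLM to score a group of trajectories against each other (0 to 1)."""
    prompt = (
        f"Task: {task_description}\n\n"
        + "\n\n".join(f"Trajectory {i}:\n{t}" for i, t in enumerate(trajectories))
        + "\n\nScore each trajectory from 0 to 1 on how well it completes the task. "
          "Return only a JSON list of scores, in order."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# The group-relative scores can then be fed to a policy-gradient update,
# e.g. advantage_i = score_i - mean(scores), with no hand-written reward function.
```

The key design point is that the judge only has to rank trajectories within a group against each other, which is an easier elicitation problem than assigning calibrated absolute rewards.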
These talks are fascinating. Thanks for sharing.
Great post. It explained some of the economics of job automation in simple terms and clarified my thinking on the subject, which is not easy to do. This post has fewer upvotes than it deserves.
An alternative idea is to put annual quotas on GPU production. The oil and dairy industries already do this to control prices and the fishing industry does it to avoid overfishing.
Thank you for the reply!
OK, but I still feel somewhat more optimistic about reward learning working. Here are some reasons:
That said, from what I've read, researchers doing RL with verifiable rewards on LLMs (e.g. see the DeepSeek-R1 paper) have so far only had success with rule-based rewards rather than learned reward functions. Quote from the DeepSeek-R1 paper:
We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.
So I think we'll have to wait and see whether people can successfully train LLMs to solve hard problems using learned RL reward functions, in a way similar to RL with verifiable rewards.
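To make the rule-based/learned distinction concrete, a rule-based verifiable reward is just a deterministic check the programmer can write down and that cannot be reward-hacked in the way a neural reward model can. Here is a toy sketch; the `\boxed{}` answer convention and the `math_reward` helper are illustrative assumptions, not code from the DeepSeek-R1 paper.

```python
# Toy sketch of a rule-based (verifiable) reward for a math task.
import re

def math_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer inside \\boxed{...} matches the known answer, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(math_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
```

A learned reward function would replace this check with a trained model's score, which is where the reward-hacking worry in the quote comes from.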
In the post you say that human programmers will write the AI's reward function and there will be one step of indirection (and that the focus is on the outer alignment problem).
But it seems likely to me that programmers won't know what code to write for the reward function, since it would be hard to encode complex human values. In Superintelligence, Nick Bostrom calls this manual approach "direct specification" of values and argues that it's naive. Instead, it seems likely to me that programmers will continue to use reward learning algorithms like RLHF, where the reward function is learned from human feedback (e.g. preference comparisons over model outputs) rather than written by hand.
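As a concrete illustration of "learned rather than hand-written", here is a minimal sketch of the standard preference-based reward-model training step (Bradley-Terry loss). This is my own toy example with random embeddings standing in for real model outputs; the `RewardModel` class and all shapes are assumptions.

```python
# Minimal sketch of learning a reward function from human preference comparisons.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a trajectory/output embedding to a scalar reward."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# One training step on a batch of comparisons: `chosen` stands in for embeddings of
# the human-preferred outputs, `rejected` for the dispreferred ones.
chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The learned reward model, not a hand-written function, then provides the reward
# signal for the RL step (e.g. PPO).
```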
If this happens, then I think the evolution analogy would apply: there is some outer optimizer, like natural selection, that chooses the reward function, and the reward function is the inner objective that shapes the AI's behavior directly.
Edit: see "AGI will have learnt reward functions" for an in-depth post on the subject.
I think it depends on the context. It's the norm for employees in companies to have managers, though as @Steven Byrnes said, this is partially for motivational purposes, since the incentives of employees are often not fully aligned with those of the company. So this example is arguably more of an alignment problem than a capability problem.
I can think of some other examples of humans acting in highly autonomous ways:
Thanks for the post. It covers an important debate: whether mechanistic interpretability is worth pursuing as a path towards safer AI. The post is logical and makes several good points, but I find its style too formal for LessWrong; it could be rewritten to be more readable.