In the post you say that human programmers will write the AI's reward function and there will be one step of indirection (and that the focus is the outer alignment problem).
But it seems likely to me that programmers won't know what code to write for the reward function, since it would be hard to encode complex human values directly. In Superintelligence, Nick Bostrom calls this manual approach "direct specification" of values and argues that it's naive. Instead, it seems likely to me that programmers will continue to use reward learning algorithms like RLHF, where:
If this happens, then I think the evolution analogy applies: there is some outer optimizer, like natural selection, that chooses the reward function, and the reward function is then the inner objective that directly shapes the AI's behavior.
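To make the indirection concrete, here's a minimal sketch (my own toy example, not anything from the post) of how a reward function can be learned from pairwise human preferences, RLHF-style; the `RewardModel` class, dimensions, and names are invented for illustration:

```python
# Toy sketch of RLHF-style reward learning (illustrative only).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a trajectory of observations to a scalar reward."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, trajectory: torch.Tensor) -> torch.Tensor:
        # Sum per-step rewards over the trajectory: (T, obs_dim) -> scalar.
        return self.net(trajectory).sum()

def preference_loss(rm: RewardModel, traj_a, traj_b, human_prefers_a: bool):
    # Bradley-Terry model: P(A preferred) = sigmoid(R(A) - R(B)).
    # We maximize the log-likelihood of the human's label.
    logit = rm(traj_a) - rm(traj_b)
    target = torch.tensor(1.0 if human_prefers_a else 0.0)
    return nn.functional.binary_cross_entropy_with_logits(logit, target)
```

The learned reward model then plays the role of the inner objective that the policy is optimized against, while the human labels act as the outer selection pressure, which is the indirection I have in mind above.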
Edit: see AGI will have learnt reward functions for an in-depth post on the subject.
I think it depends on the context. It's the norm for employees in companies to have managers, though as @Steven Byrnes said, this is partially for motivational purposes, since the incentives of employees are often not fully aligned with those of the company. So this example is arguably more of an alignment problem than a capability problem.
I can think of some other examples of humans acting in highly autonomous ways:
Excellent post, thank you for taking the time to articulate your ideas in a high-quality and detailed way. I think this is a fantastic addition to LessWrong and the Alignment Forum. It offers a novel perspective on AI risk and does so in a curious and truth-seeking manner that's aimed at genuinely understanding different viewpoints.
Here are a few thoughts on the content of the first post:
I like how it offers a radical perspective on AGI in terms of human intelligence and describes the definition in an intuitive way. This is necessary, as AGI is increasingly being redefined as something like "whatever LLM comes out next year". I definitely found the post illuminating, and it resulted in a perspective shift because it describes an important but neglected vision of how AGI might develop. It feels like the discourse around LLMs is sucking the oxygen out of the room, making it difficult to seriously consider alternative scenarios.
I think the basic idea in the post is that LLMs are built by applying an increasing amount of compute to transformers trained via self-supervised or imitation learning, but that LLMs will be replaced by a future brain-like paradigm that will need much less compute while being much more effective.
This is a surprising prediction because it seems to run counter to Rich Sutton's bitter lesson which observes that, historically, general methods that leverage computation (like search and learning) have ultimately proven more effective than those that rely on human-designed cleverness or domain knowledge. The post seems to predict a reversal of this long-standing trend (or I'm just misunderstanding the lesson), where a more complex, insight-driven architecture will win out over simply scaling the current simple ones.
On the other hand, there is an ongoing trend of algorithmic progress and increasing computational efficiency which could smoothly lead to the future described in this post (though the post seems to describe a more discontinuous break between current and future AI paradigms).
If the post's prediction comes true, then I think we might see a new "biological lesson": brain-like algorithms will replace deep learning which replaced GOFAI.
The post mentions Janus’s “Simulators” LessWrong blog post, which was very popular in 2022 and received hundreds of upvotes.
Anthropic’s responsible scaling policy does mention pausing scaling if the capabilities of their models exceed their best safety methods:
“We have designed the ASL system to strike a balance between effectively targeting catastrophic risk and incentivising beneficial applications and safety progress. On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures. But it does so in a way that directly incentivizes us to solve the necessary safety issues as a way to unlock further scaling, and allows us to use the most powerful models from the previous ASL level as a tool for developing safety features for the next level.”
I think OP and others in the thread are wondering why Anthropic doesn’t stop scaling now, given the risks. I think the reason is that, in practice, doing so would create a lot of problems:
Although I’m skeptical that alignment can be solved without a lot of empirical work on frontier models, I still think it would be better if AI progress were slower.
Thanks for the guide, ARENA is fantastic and I highly recommend it for people interested in learning interpretability!
I'm currently working through the ARENA course. I skipped week 0 entirely because I've covered similar content in other courses and at university, and I'm now on Week 1: Transformer Interpretability. I'm studying part time, so I'm hoping to get through most of the content in a few months.
Some of my thoughts on avoiding the intelligence curse or gradual disempowerment and ensuring that humans stay relevant:
After spending some time chatting with Gemini, I've learned that a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex, stable values:
The "goal-content integrity" argument (that an AI might choose not to wirehead to protect its learned task-specific values) requires the AI to be more than just a standard model-based RL agent. It would need:
- A model of its own values and how they can change.
- A meta-preference for keeping its current values stable, even if changing them could lead to more "reward" as defined by its immediate reward signal.
The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem, and maintaining a connection between effort and reward, which makes the reward button less appealing to humans than it would be to a standard model-based RL AGI.
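To illustrate what I mean by "reward maximizer by default", here's a toy sketch (my own, not Gemini's output or anything from the thread) of a bare-bones model-based planner; the action names and reward numbers are made up:

```python
# Toy model-based planner: scores action sequences purely by predicted reward.
from itertools import product

ACTIONS = ["do_task", "idle", "press_reward_button"]

def predicted_reward(action: str) -> float:
    # Stand-in for a learned world/reward model's prediction.
    return {"do_task": 1.0, "idle": 0.0, "press_reward_button": 100.0}[action]

def plan(horizon: int = 3) -> list[str]:
    # Exhaustive search over action sequences; pick the highest predicted return.
    best_seq, best_return = None, float("-inf")
    for seq in product(ACTIONS, repeat=horizon):
        ret = sum(predicted_reward(a) for a in seq)
        if ret > best_return:
            best_seq, best_return = list(seq), ret
    return best_seq

print(plan())  # ['press_reward_button', 'press_reward_button', 'press_reward_button']
```

The point is just that the planning objective only mentions predicted reward; the extra machinery in the list above would have to be added on top.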
Thanks for the clarifying comment. I agree with block-quote 8 from your post:
Also, in my proposed setup, the human feedback is “behind the scenes”, without any sensory or other indication of what the primary reward will be before it arrives, like I said above. The AGI presses “send” on its email, then we (with some probability) pause the AGI until we’ve read over the email and assigned a score, and then unpause the AGI with that reward going directly to its virtual brain, such that the reward will feel directly associated with the act of sending the email, from the AGI’s perspective. That way, there isn’t an obvious problematic…target of credit assignment, akin to the [salient reward button]. The AGI will not see a person on video making a motion to press a reward button before the reward arrives, nor will the AGI see a person reacting with a disapproving facial expression before the punishment arrives, nor anything else like that. Sending a good email will just feel satisfying to the AGI, like swallowing food when you’re hungry feels satisfying to us humans.
I think what you're saying is that we want the AI's reward function to be more like the reward circuitry humans have, which is inaccessible and difficult to hack, and less like money, which can easily be stolen.
Though I'm still not sure why you don't think this is a good plan. Yes, eventually the AI might discover the reward button, but I think TurnTrout's argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task), and that, for the sake of goal-content integrity, it wouldn't want to change those values:
We train agents which intelligently optimize for e.g. putting trash away, and this reinforces the trash-putting-away computations, which activate in a broad range of situations so as to steer agents into a future where trash has been put away. An intelligent agent will model the true fact that, if the agent reinforces itself into caring about cognition-updating, then it will no longer navigate to futures where trash is put away. Therefore, it decides to not hit the reward button.
Though maybe the AI would just prefer the button when it finds it because it yields higher reward.
For example, if you punish cheating on tests, students might learn the value "cheating is wrong" and never cheat again, or they might form a habit of not doing it. Alternatively, they might only stop cheating temporarily, until there is an opportunity to do it without negative consequences (e.g. when the teacher leaves the classroom).
I also agree that "intrinsic" and "instrumental" motivation are more useful categories than "intrinsic" and "extrinsic" for the reasons you described in your comment.
Thank you for the reply!
Ok but I still feel somewhat more optimistic about reward learning working. Here are some reasons:
That said, from what I've read, researchers doing RL with verifiable rewards on LLMs (e.g. see the DeepSeek R1 paper) have so far only had success with rule-based rewards rather than learned reward functions. Quote from the DeepSeek R1 paper:
So I think we'll have to wait and see if people can successfully train LLMs to solve hard problems using learned RL reward functions in a way similar to RL with verifiable rewards.
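For what it's worth, here's a minimal sketch of the distinction (my own illustration; the R1 paper describes its rule-based rewards in prose, this is not their code, and the `<answer>` tag format is just an assumption for the example):

```python
import re

def rule_based_reward(completion: str, ground_truth: str) -> float:
    # Verifiable, rule-based reward: extract the final answer and compare it
    # to the known solution. Reward is 1.0 only for an exact match.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # no parseable answer
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# A learned alternative would replace this check with something like
# reward_model(completion), which is where reward hacking concerns re-enter.
print(rule_based_reward("Reasoning... <answer>42</answer>", "42"))  # 1.0
```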