An optimistic way to frame inner alignment is that gradient descent already hits a very narrow target in goal-space, and we just need one last push.
A pessimistic way to frame inner misalignment is that gradient descent already hits a very narrow target in goal-space, and a near miss on human values (a system that cares about humans, but in a subtly wrong way) could be worse than missing entirely, so S-risk could be large.
This community has developed a bunch of good tools for resolving disagreements, such as double cruxing. It's a waste that they weren't systematically deployed in the MIRI conversations. Those conversations could have been more productive, and we could have walked away with a succinct and precise understanding of where the disagreements lie and why.
We should run Paul Christiano's debate game with alignment researchers as the debaters instead of ML systems.
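A minimal sketch of what that could look like, with humans typing the arguments. This follows the round structure from the original debate proposal (Irving, Christiano & Amodei, 2018); all names in the code are mine, not from any existing tool:

```python
# Minimal sketch of a human-vs-human debate game, loosely following the
# structure of "AI Safety via Debate". Illustrative only.

ROUNDS = 3  # number of alternating statements per debater

def run_debate(question: str) -> str:
    """Run a fixed-round debate between two human debaters, then ask a judge."""
    print(f"Question: {question}\n")
    transcript = []
    for round_num in range(ROUNDS):
        for debater in ("A", "B"):
            statement = input(f"[Round {round_num + 1}] Debater {debater}: ")
            transcript.append((debater, statement))
    # The judge only sees the transcript, mirroring the original setup where
    # the judge is weaker than the debaters but can check their arguments.
    print("\n--- Transcript ---")
    for debater, statement in transcript:
        print(f"{debater}: {statement}")
    return input("\nJudge, who won (A/B)? ").strip().upper()

if __name__ == "__main__":
    winner = run_debate("Is deceptive alignment likely under standard training?")
    print(f"Judge rules for debater {winner}.")
```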
If you try to write a reward function, or a loss function, that captures human values, that seems hopeless.
But if you have some interpretability techniques that let you find human values in some simulacrum of a large language model, maybe that's less hopeless.
The difference between constructing something and recognizing it, or between proving and checking, or between producing and criticizing, and so on: in each pair, the second task is usually much easier.
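To make the asymmetry concrete, here's a toy illustration (my example, not specific to alignment): checking a candidate SAT assignment takes time linear in the formula, while producing one by naive search takes time exponential in the number of variables.

```python
from itertools import product

# A formula in CNF: each clause is a list of literals, where literal k refers
# to variable |k| and its sign gives the polarity. [1, -2] means (x1 OR NOT x2).
FORMULA = [[1, -2], [-1, 3], [2, 3]]
NUM_VARS = 3

def check(assignment: dict[int, bool]) -> bool:
    """Recognizing: verify a candidate assignment in time linear in the formula."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in FORMULA
    )

def produce() -> dict[int, bool] | None:
    """Constructing: naive search over all 2^n assignments."""
    for bits in product([False, True], repeat=NUM_VARS):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        if check(assignment):
            return assignment
    return None

print(produce())  # e.g. {1: False, 2: False, 3: True}
```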
As a failure mode of specification gaming, agents might modify their own goals.
As a convergent instrumental goal, agents want to prevent their goals from being modified.
I think I know how to resolve this apparent contradiction, but I'd like to see other people's opinions about it.