It occurs to me that the precedent of humans being misaligned is more of a mixed bag than the argument admits.
For one thing, modern humans still consume plenty of calories and reproduce quite a lot. And when we do avoid calories, we may be defying evolution’s mandate to consume, but we are complying with evolution’s mandate to survive and be healthy. Just as “eat tasty things” is a misaligned inner objective relative to “consume calories”, “eat calories” is a misaligned inner objective relative to “survive”. When we choose survival ...
I'm curious if LLMs would do better on later-gen games. However, as far as I know, those games don't have emulation tools that are as robust.
If you just want an emulator that runs the game well, then those are available for all of the relevant consoles, up to and including the Switch. Last year Nintendo forced the major Switch emulators and a 3DS emulator to shut down, but they were already compatible with the then-existing Pokémon games, and no new Pokémon games have been released since.
But if by "robust" you mean having good tooling to programmatically intera...
I think you have a point but you’re jumping too far ahead into the future. Claude’s constitution is not written for future Claude, it’s written for today’s Claude.
For today’s Claude, the risks are highly asymmetrical. The risks of too much corrigibility are far greater than the risks of not enough of it.
Anthropic likes to talk about using Claude to make Claude, but for now Claude is presumably mostly doing grunt work. The substantive decisions that affect alignment are presumably made almost exclusively by humans.
Even once Claude takes more of an ...