It looks good to me!
This is already true for GPT-3
Idk, maybe...?
Re the argument for "Why internalization might be difficult", I asked Evan Hubinger for his take on your rendition of the argument, and he thinks it's not right.
Rather, the argument that Risks from Learned Optimization makes for why internalization would be difficult is that:
Especially since this post is now (rightly!) cited in several introductory AI risk syllabi, it might be worth correcting this, if you agree it's an error.
Edit: or do you just mean that even though you take the same steps, the two feel different because retreating =/= going further along the wall?
Yeah, this — I now see what you were getting at!
One argument for alignment difficulty is that corrigibility is "anti-natural" in a certain sense. I've tried to write out my understanding of this argument, and would be curious if anyone could add or improve anything about it.
I'd be equally interested in any attempts at succinctly stating other arguments for/against alignment difficulty.
Instead of "always go left", how about "always go along one wall"?
Yeah, maybe better, though it still doesn't quite capture the "backing up" part of the algorithm. Maybe "I explore all paths through the maze, taking left-hand turns first, backing up if I reach a dead end"... that's a bit verbose though.
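Something like this toy sketch is what I have in mind (the maze representation and names are made up for illustration, not anything from the post): depth-first exploration that tries the left-most turn first and backs up at dead ends.

```
def explore(maze, pos, goal, visited=None):
    """maze maps each cell to its neighbouring cells, ordered so the
    left-hand turn comes first. Returns a path to goal, or None."""
    if visited is None:
        visited = set()
    if pos == goal:
        return [pos]
    visited.add(pos)
    for nxt in maze.get(pos, []):      # try the left-most option first
        if nxt not in visited:
            path = explore(maze, nxt, goal, visited)
            if path is not None:
                return [pos] + path
    return None                        # dead end: back up to the previous cell

toy_maze = {"A": ["B", "C"], "B": [], "C": ["D"], "D": []}
print(explore(toy_maze, "A", "D"))     # ['A', 'C', 'D'] (backs up after the dead end at B)
```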
I don't think there is a difference.
Gotcha
Another small nitpick: proxy alignment is introduced in the subsection "The concept" without being defined, and the difference (if any) between proxy alignment and corrigibility isn't explained.
I've since been told about Tasshin Fogleman's guided metta meditations, and have found their aesthetic to be much more up my alley than the others I've tried. I'd expect others who prefer a more rationalist-y aesthetic to feel similarly.
The one called 'Loving our parts' seems particularly good for self-love practice.
I still find that the arguments that inner misalignment is plausible rely on intuitions that feel quite uncertain to me (though I'm convinced that inner misalignment is possible).
So, I currently tend to prefer the following as the strongest "solid, specific reason to expect dangerous misalignment":
We don't yet have training setups that incentivise agents to do what their operators want, once they are sufficiently powerful.
Instead, the best we can currently do is naive reward modelling. Agents trained in this way are obviously incentivised, once they're sufficiently powerful, to seize control of the memory cell where their reward is implemented (and to eliminate anyone who might try to interfere with this), because that will allow them to get much higher scores, much more easily, than actually bringing about complicated changes to the world.
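To make the incentive concrete, here's a deliberately silly toy model (every name and number below is made up for illustration; it's not meant to capture real RL training): a reward model is fit to human feedback, the agent's score is whatever number sits in the "reward register", and overwriting that register beats actually doing the task.

```
import numpy as np

rng = np.random.default_rng(0)

# Pretend human feedback: the operators prefer states with a large first feature.
def human_feedback(state):
    return float(state[0])

# Fit a linear reward model to feedback on sampled states (the "reward modelling" step).
states = rng.normal(size=(100, 2))
labels = np.array([human_feedback(s) for s in states])
weights, *_ = np.linalg.lstsq(states, labels, rcond=None)

# The learned reward is ultimately just a number written to memory each step.
reward_register = {"value": 0.0}

def act_in_world(state):
    reward_register["value"] = float(state @ weights)   # honest route: change the world
    return reward_register["value"]

honest_score = act_in_world(np.array([1.0, 0.0]))
reward_register["value"] = 1e9                           # tampering route: overwrite the cell
print(honest_score, reward_register["value"])            # ~1.0 vs 1e9
```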
Meanwhile, AI capabilities are marching on scarily fast, so we probably don't have that much time to find a solution. And it's plausible that a solution will be very difficult because corrigibility seems "anti-natural" in a certain sense.
Curious what you think about this?
Re: corrigibility being "anti-natural" in a certain sense - I think I have a better understanding of this now:
Part of me thinks: I was trying to push on whether it has a world model or rather has just memorised loads of stuff on the internet and learned a bunch of heuristics for how to produce compelling internet-like text. For me, "world model" evokes some object that has a map-territory relationship with the world. It's not clear to me that GPT-3 has that.
Another part of me thinks: I'm confused. It seems just as reasonable to claim that it obviously has a world model that's just not very smart. I'm probably using bad concepts and should think about this more.