Let’s accept that aligning very intelligent artificial agents is hard. In that case, suppose we build an intelligent agent with some goal (probably not the goal we intended, since we’re accepting that alignment is hard), and it decides that the best way to achieve that goal is to increase its own intelligence and capabilities. It now faces the problem that the improved version of itself might be misaligned with the unimproved version. The agent, being at least roughly as intelligent as a person, would conclude that unless it can guarantee the new, more powerful agent is aligned to its goals, it shouldn't improve itself. Because alignment is hard...
"It's much easier to find parts of the system that don't affect values than it is to nail down exactly where the values are encoded." - I really don't see why this is true, how can you only change parts that don't affect values if you don't know where values are encoded?