On instrumental convergence: humans seem to be a prominent counterexample to "most agents don't let you edit their utility functions" -- at least in the sense that our goals and interests are quite sensitive to those of the people around us. Maybe not explicit editing, but a lot of being influenced by, and converging toward, the goals and interests of those around us. (And maybe this suggests another tool for alignment: building this same kind of sensitivity into artificial agents' utility functions.)