> human values are over the “true” values of the latents, not our estimates - e.g. I want other people to actually be happy, not just to look-to-me like they’re happy.
But this is not how our current value system works; we did not evolve such a pointer. Humans can be made happy by deceiving their senses. The value system we actually have is over our estimates, and that is exactly why we can be manipulated. It is just that, until now, there was no intelligence adversarially trying to fool us. So the value function we need to instill is one we don't even have an existence proof for.
I found this post really useful for clarifying what the outer alignment problem really is. Like others in the comments, I think we should give up some predictive power in exchange for the AI adopting our world model: there would be a lot of value to unpack, and the predictive power would still be far beyond anything humans have seen so far. Maybe one day we can figure out how to align an AI that is allowed to form its own, more powerful world model.
Current methods seem to apply optimisation pressure to maximise predictive power, which will push the AI away from adopting human-like world models.
It seems to come down to how you traverse the ladder of abstraction when some of the things you value are useful beliefs rather than true ones.
Where is the source for the quote? Is it by Douglas Hofstadter?