I agree that the agent should be able to make a decent effort at telling us which of its drives are biases (/addictions) versus values. One complicating factor is that agents change their opinions about these matters over time. Imagine a philosopher who uses the drug heroin. They may very well vacillate on whether heroin satisfies their full-preferences, even if the experience of taking heroin is not changing. This could happen via introspection, via philosophical investigation, via examining fMRI scans, et cetera. It's tricky for the human to state their biases with confidence because they may never know when they are done updating on the matter.

Intuitively, an agent might want the AI system to do this examination and then to maximize whatever turns out to be valuable. That is, you might want the bias-model to be the one that you would settle on if you thought for a long time, similar to enlightened self-interest / extrapolated volition models. Similar problems ensue: e.g., this process may diverge. Or it may be fundamentally indeterminate whether some drives are values or biases.


Stuart_Armstrong (8y)

>One complicating factor is that agents change their opinions about these matters over time.

Yep! This is one of the major issues, and one that I'll try to model in an upcoming post. The whole issue of rigged and influenceable learning processes is connected with trying to learn the preferences of such an agent.

>Or it may be fundamentally indeterminate whether some drives are values or biases.

I think it's fundamentally indeterminate in principle, but we can make some good judgements in practice.


Gordon Seidoh Worley (8y)

Ooooh, I like where this is going. I realize you still have more to develop on this idea, but is your thought that this could replace the use of objective reward functions that exist outside the agent?


