Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Much concern about AI comes down to the scariness of goal-oriented behavior. A common response to such concerns is "why would we give an AI goals anyway?" I think there are good reasons to expect goal-oriented behavior, and I've been on that side of a lot of arguments. But I don't think the issue is settled, and it might be possible to get better outcomes by directly specifying what actions are good. I flesh out one possible alternative here.
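To make the contrast concrete, here is a minimal sketch of the difference between a goal-directed agent, which picks the action whose predicted outcome scores best on a fixed objective, and an approval-directed agent, which picks whichever action its overseer would rate most highly. This is my own illustration rather than anything from the post, and every name in it (predict_outcome, overseer_approval, the toy dictionaries) is a hypothetical placeholder.

```python
"""Illustrative sketch (not from the post): goal-directed vs. approval-directed
action selection. All names here are hypothetical placeholders."""
from typing import Callable, Dict, Sequence


def goal_directed_act(actions: Sequence[str],
                      predict_outcome: Callable[[str], str],
                      utility: Callable[[str], float]) -> str:
    """Pick the action whose predicted outcome scores highest on a fixed goal."""
    return max(actions, key=lambda a: utility(predict_outcome(a)))


def approval_directed_act(actions: Sequence[str],
                          overseer_approval: Callable[[str], float]) -> str:
    """Pick the action the overseer would rate most highly; the agent itself
    does no further optimization over downstream consequences."""
    return max(actions, key=overseer_approval)


if __name__ == "__main__":
    actions = ["answer the question", "seize more compute", "do nothing"]

    # Toy stand-ins for a world model, a goal, and an overseer's judgment.
    outcomes: Dict[str, str] = {
        "answer the question": "question answered",
        "seize more compute": "more resources acquired",
        "do nothing": "status quo",
    }
    utility = {"question answered": 1.0,
               "more resources acquired": 10.0,
               "status quo": 0.0}
    approval = {"answer the question": 1.0,
                "seize more compute": -10.0,
                "do nothing": 0.0}

    print(goal_directed_act(actions, outcomes.get, utility.get))   # "seize more compute"
    print(approval_directed_act(actions, approval.get))            # "answer the question"
```

The structural point is that the two selection rules look alike; what changes is what gets scored. The goal-directed agent optimizes over predicted consequences, while the approval-directed agent defers to the overseer's judgment of the action itself.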

(As an experiment I wrote the post on Medium, so that it is easier to leave sentence-level feedback, especially on the writing and other low-level points. Big-picture discussion should probably stay here.)

4 comments

Despite this essay's age, I was linked here by Structural Risk Minimization and felt I had to address some points you made. I think you may have dismissed a strawman of consequence-approval direction, and then later used a more robust version of it in your reasoning while avoiding those terms. See my comments on the essay.

It seems that, if desired, the overseer could also set their behaviour and intentions so that the approval-directed agent acts as we would want an oracle or tool to act. This is a nice feature.
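One way to picture this claim, as a sketch of my own rather than anything from the post or the comment (oracle_style_approval and the example actions are hypothetical): if the overseer's approval is concentrated entirely on directly answering the question asked, the approval-directed agent reduces to oracle-like behaviour.

```python
from typing import Callable, Sequence


def oracle_style_approval(action: str) -> float:
    """A hypothetical overseer whose approval is concentrated entirely on
    actions that simply answer the question asked, and on nothing else."""
    return 1.0 if action.startswith("reply:") else -1.0


def approval_directed_act(actions: Sequence[str],
                          overseer_approval: Callable[[str], float]) -> str:
    # Take whichever action the overseer rates most highly.
    return max(actions, key=overseer_approval)


actions = ["reply: 4", "ask for more resources", "act directly in the world"]
print(approval_directed_act(actions, oracle_style_approval))  # -> "reply: 4"
```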

I think Nick Bostrom and Stuart Armstrong would also be interested in this, and might have good feedback for you.

High-level feedback: this is a really interesting proposal, and looks like a promising direction to me! Most of my inline comments on Medium are more critical, but that doesn't reflect my overall assessment.