Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A putative new idea for AI control; index here.

It seems that stratification is more flexible than I initially thought.

That's because the default action or policy, which I was envisioning as a null action (or maybe the AI turning itself off), can actually be more than that. For instance, the default could be an obsessive learning policy, aimed at learning human values, and those human values can then form the core of the AI's value function.
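To make that a bit more concrete (the post keeps the notation implicit, so the symbols below, $\pi_0$ for the default policy, $v$ for a candidate human value function and $U$ for the AI's value function, are my own labels), one natural reading is:

$$
\pi_0 \;=\; \text{an obsessive value-learning policy},
\qquad
U \;=\; \sum_{v \in \mathcal{V}} \hat P(v)\, v,
$$

where $\mathcal{V}$ is a set of candidate human value functions and $\hat P$ is the estimate over them that the learning policy would produce.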

Then, stratification means that the AI will act to maximise human values, while estimating those values in accordance with what it would have calculated, had it been a pure value-estimator. This avoids the tension between value-learning and value-maximising that bedevils most value-learners.
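As a toy illustration of how the estimation and the maximisation decouple, here is a minimal sketch in Python. Everything in it (the function names, the hypothesis set, the toy posterior) is my own scaffolding rather than anything from the stratification formalism; the point is only the structure: the value estimate is computed exactly as a pure value-estimator following the default policy would compute it, and the acting step maximises against that fixed estimate.

```python
# Minimal sketch (illustrative assumptions, not the post's own formalism):
# the value estimate is produced as a pure value-estimator would produce it,
# and the acting policy maximises against that estimate rather than against
# anything it can influence.

from typing import Callable, Dict, List

Action = str
ValueFn = Callable[[Action], float]


def estimate_values_as_pure_estimator(evidence: List[str]) -> Dict[str, float]:
    """Posterior over candidate human-value hypotheses, computed as if the
    agent's only job were value learning (i.e. under the default policy)."""
    # Placeholder: a fixed posterior; a real estimator would update on evidence.
    return {"v_comfort": 0.7, "v_autonomy": 0.3}


def stratified_choice(candidate_actions: List[Action],
                      value_hypotheses: Dict[str, ValueFn],
                      evidence: List[str]) -> Action:
    """Pick the action maximising expected value, where the expectation uses
    the pure-estimator posterior: the acting step cannot gain by steering
    the estimate, because the estimate is computed counterfactually."""
    posterior = estimate_values_as_pure_estimator(evidence)

    def expected_value(a: Action) -> float:
        return sum(p * value_hypotheses[name](a) for name, p in posterior.items())

    return max(candidate_actions, key=expected_value)


if __name__ == "__main__":
    hypotheses = {
        "v_comfort": lambda a: {"rest": 1.0, "work": 0.2}[a],
        "v_autonomy": lambda a: {"rest": 0.1, "work": 0.9}[a],
    }
    print(stratified_choice(["rest", "work"], hypotheses, evidence=[]))
```

Because the acting step cannot change how the pure estimator arrives at its posterior, the agent has no incentive to manipulate its own value learning, which is the tension the post says this setup avoids.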
