Generalised models
Concept Extrapolation
AI Safety Subprojects
Practical Guide to Anthropics
Anthropic Decision Theory
Subagents and impact measures
If I were a well-intentioned AI...


This post is on a very important topic: how could we scale ideas about value extrapolation or avoiding goal misgeneralisation... all the way up to superintelligence? As such, its ideas are very worth exploring and getting to grips to. It's a very important idea.

However, the post itself is not brilliantly written, and is more of "idea of a potential approach" than a well crafted theory post. I hope to be able to revisit it at some point soon, but haven't been able to find or make the time, yet.

It was good that this post was written and seen.

I also agree with some of the comments that it wasn't up to usual EA/LessWrong standards. But those standards could be used as excuses to downvote uncomfortable topics. I'd like to see a well-crafted women in EA post, and see whether it gets downvoted or not.

Not at all what I'm angling at. There's a mechanistic generator for why humans navigate ontology shifts well (on my view). Learn about the generators, don't copy the algorithm.

I agree that humans navigate "model splinterings" quite well. But I actually think the algorithm might be more important than the generators. The generators comes from evolution and human experience in our actual world; this doesn't seem like it would generalise. The algorithm itself, though, may very generalisable (potential analogy: humans have instinctive grasp of all numbers under five, due to various evolutionary pressures, but we produced the addition algorithm that is far more generalisable).

I'm not sure that we disagree much. We may just have different emphases and slightly different takes on the same question?

Do you predict that if I had access to a range of pills which changed my values to whatever I wanted, and I could somehow understand the consequences of each pill (the paperclip pill, the yay-killing pill, ...), I would choose a pill such that my new values would be almost completely unaligned with my old values?

This is the wrong angle, I feel (though it's the angle I introduced, so apologies!). The following should better articulate my thoughts:

We have an AI-CEO money maximiser, rewarded by the stock price ticker as a reward function. As long as the AI is constrained and weak, it continues to increase the value of the company; when it becomes powerful, it wireheads and takes over the stock price ticker.

Now that wireheading is a perfectly correct extrapolation of its reward function; it hasn't "changed" its reward function, it simply has gained the ability to control its environment well, so that it now can decorrelate the stock ticker from the company value.

Notice the similarity with humans who develop contraception so they can enjoy sex without risking childbirth. Their previous "values" seemed to be a bundle of "have children, enjoy sex" and this has now been wireheaded into "enjoy sex".

Is this a correct extrapolation of prior values? In retrospect, according to our current values, it seems to mainly be the case. But some people strongly disagree even today, and, if you'd done a survey of people before contraception, you'd have got a lot of mixed responses (especially if you'd got effective childbirth medicine long before contraceptives). And if we want to say that the "true" values have been maintained, we'd have to parse the survey data in specific ways, that others may argue with.

So we like to think that we've maintained our "true values" across these various "model splinterings", but it seems more that what we've maintained has been retrospectively designated as "true values". I won't go the whole hog of saying "humans are rationalising beings, rather than rational ones", but there is at least some truth to that, so it's never fully clear what our "true values" really were in the past.

So if you see humans as examples of entities that maintain their values across ontology changes and model splinterings, I would strongly disagree. If you see them as entities that sorta-kinda maintain and adjust their values, preserving something of what happened before, then I agree. That to me is value extrapolation, for which humans have shown a certain skill (and many failings). And I'm very interested in automating that, though I'm sceptical that the purely human version of it can extrapolate all the way up to superintelligence.

It is not that human values are particularly stable. It's that humans themselves are pretty limited. Within that context, we identify the stable parts of ourselves as "our human values".

If we lift that stability - if we allow humans arbitrary self-modification and intelligence increase - the parts of us that are stable will change, and will likely not include much of our current values. New entities, new attractors.

Hey, thanks for posting this!

And I apologise - I seem to have again failed to communicate what we're doing here :-(

"Get the AI to ask for labels on ambiguous data"

Having the AI ask is a minor aspect of our current methods, that I've repeatedly tried to de-emphasise (though it does turn it to have an unexpected connection with interpretability). What we're trying to do is:

  1. Get the AI to generate candidate extrapolations of its reward data, that include human-survivable candidates.
  2. Select among these candidates to get a human-survivable ultimate reward functions.

Possible selection processes include being conservative (see here for how that might work: ), asking humans and then extrapolating the process of what human-answering should idealise to (some initial thoughts on this here:, removing some of the candidates on syntactic ground (e.g. wireheading, which I've written quite a bit on how it might be syntactically defined). There are some other approaches we've been considering, but they're currently under-developed.

But all those methods will fail if the AI can't generate human-survivable extrapolations of its reward training data. That is what we are currently most focused on. And, given our current results on toy models and a recent literature review, my impression is that there has been almost no decent applicable research done in this area to date. Our current results on HappyFaces are a bit simplistic, but, depressingly, they seem to be the best in the world in reward-function-extrapolation (and not just for image classification) :-(

We ask them to not cheat in that way? That would be using their own implicit knowledge of what the features are.

I'd say do two challenges: one at a mix rate of 0.5, one at a mix rate of 0.1.

I was putting all those under "It would help the economy, by redirecting taxes from inefficient sources. It would help governments raise revenues and hence provide services without distorting the economy.".

And we have to be careful about a citizen's dividend; with everyone richer, they can afford higher rents, so rents will rise. Not by the same amount, but it's not as simple as "everyone is X richer".

Load More