Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I'm leaving the Future of Humanity Institute, the best and most impactful job I've ever had, to co-found Aligned AI. For the first time in my research career, I feel the problem of AI alignment is solvable.


Alignment research: a history of partial failures

The history of AI safety is littered with failures and partial successes. The most common failures are ideas that work well in typical situations but fail in extreme ones - and a superintelligent AI is perfectly capable of creating such situations.

  • Low-impact AIs were supposed to allow smart machines that interacted with humans without causing huge disruptions. They had some success at 'almost no impact'. But everyone - including me - failed at developing algorithms that had reliably low impact. If the AI is allowed even a little bit of impact, it can make these low-impact restrictions irrelevant.
  • Corrigibility and interruptibility were designed to allow AIs to be reprogrammed even when active and powerful. They have good narrow uses, but aren't a general solution: though the AI would not interfere with the interruptibility process, it also has no incentive to preserve it or to ensure that its subagents are also interruptible.
  • Oracles, question answering AIs (and their close relatives, tool AIs) are perennial suggestions, the idea being to limit the power of the AI by limiting it to answering questions or giving suggestions. But that fails, for instance when the AI is incentivised to manipulate humans through the contents of its answers or suggestions.
  • There were some interesting examples on limiting AI power, but these were ultimately vulnerable to the AI creating subagents.
  • The different forms of value learning confronted a surprising obstacle: values could not be learnt without making strong assumptions about human rationality, and human rationality could not be learnt without making strong assumptions about human values.
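
The last obstacle in the list can be shown with a toy sketch. Everything in it (the action names, the two planners, the reward numbers) is a hypothetical illustration, not taken from any particular value-learning algorithm: it only shows that one observed choice can fit two opposite pairs of (rationality assumption, values).

```python
from typing import Callable, Dict, List

Reward = Dict[str, float]
Planner = Callable[[Reward], str]

# Hypothetical action set, for illustration only.
ACTIONS: List[str] = ["tea", "coffee", "water"]


def rational_planner(reward: Reward) -> str:
    """Assumed model of a rational agent: picks the highest-reward action."""
    return max(ACTIONS, key=lambda a: reward[a])


def anti_rational_planner(reward: Reward) -> str:
    """Assumed model of an anti-rational agent: picks the lowest-reward action."""
    return min(ACTIONS, key=lambda a: reward[a])


# Two opposite value hypotheses about the same human.
likes_tea: Reward = {"tea": 1.0, "coffee": 0.2, "water": 0.0}
hates_tea: Reward = {a: -r for a, r in likes_tea.items()}

observed_choice = "tea"

# Each explanation pairs a rationality assumption with a value assumption.
explanations = {
    "rational agent that likes tea": rational_planner(likes_tea),
    "anti-rational agent that hates tea": anti_rational_planner(hates_tea),
}

# Both explanations reproduce the observed behaviour, so behaviour alone
# cannot tell them apart without a further assumption.
consistent = [name for name, choice in explanations.items()
              if choice == observed_choice]
print(consistent)
```

Behaviour alone does not decide between the two explanations here; some assumption about either the planner or the reward has to come from elsewhere.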

A litany of partial failures suggests that the next approach tried will be a failure as well - unless we can identify why the approaches above failed. Is there a common failure mode for all of them?

The common thread: lack of value extrapolations

It is easy to point at current examples of agents with low (or high) impact, at safe (or dangerous) suggestions, at low (or high) powered behaviours. So we have in a sense the 'training sets' for defining low-impact/Oracles/low-powered AIs.

It's extending these examples to the general situation that fails: definitions which cleanly divide the training set (whether produced by algorithms or humans) fail to extend to the general situation. Call this the 'value extrapolation problem'[1], with 'value' interpreted broadly as a categorisation of situations into desirable and undesirable.

Humans turn out to face similar problems. We have broadly defined preferences in familiar situations we have encountered in the world or in fiction. Yet, when confronted with situations far from these, we have to stop and figure out how our values might possibly extend[2]. Since these human values aren't - yet - defined, we can't directly input them into an algorithm, so AIs that can't solve value extrapolation can't be aligned with human values.

Value extrapolation is thus necessary for AI alignment. It is also almost sufficient, since it allows AIs to draw correct conclusions from imperfectly defined human data. Combined with well grounded basic human values, it will allow the algorithm to extrapolate as well as humans can - better, in fact, using its superhuman abilities.

If that's successful, AIs that value extrapolate and that start aligned will remain aligned even as they dramatically change the world and confront the unexpected, re-assessing their reward functions when their world-models change.


We think that once humanity builds its first AGI, superintelligence is likely near, leaving little time to develop AI safety at that point. Indeed, it may be necessary that the first AGI start off aligned: we may not have the time or resources to convince its developers to retrofit alignment to it. So we need a way to have alignment deployed throughout the algorithmic world before anyone develops AGI.

To do this, we'll start by offering alignment as a service for more limited AIs. Value extrapolation scales down as well as up: companies value algorithms that won't immediately misbehave in new situations, algorithms that will become conservative and ask for guidance when facing ambiguity.

We will get this service into widespread use (a process that may take some time), and gradually upgrade it to a full alignment process. That will involve drawing on our research and that of others - we will remain strongly engaged with other research groups, providing tools that they can use and incorporating their own results into our service.

We will refine and develop this deployment plan, depending on research results, commercial opportunities, feedback, and suggestions. Contact us in the comments of this post or from our website.

Thanks to LessWrong

I want to thank LessWrong, as a collective entity, for getting us to the point where such a plan seems doable. We'll be posting a lot here, putting out ideas, asking for feedback - if you can continue giving the same quality of response that you always have (and checking that we ourselves haven't gone misaligned!), that's all we can ask from you :-)

  1. Formerly called the 'model splintering' problem. ↩︎

  2. Humans have demonstrated a skill with value extrapolation, during their childhoods and adolescences, when encountering new stories and thought-experiments, and when their situation changes dramatically. Though human value extrapolation can be contingent, it rarely falls into the extreme failure modes of AIs. ↩︎



Can you describe what changed / what made you start feeling that the problem is solvable / what your new attack is, in short?

This feels like a key detail that's lacking from this post. I actually downvoted this post because I have no idea if I should be excited about this development or not. I'm pretty familiar with Stuart's work over the years, so I'm fairly surprised if there's something big here.

Might help if I put this another way. I'd be purely +1 on this project if it was just "hey, I think I've got some good ideas AND I have an idea about why it's valuable to operationalize them as a business, so I'm going to do that". Sounds great. However, the bit about "AND I think I know how to build aligned AI for real this time guys and the answer is [a thing folks have been disagreeing about whether or not it works for years]" makes me -1 unless there's some explanation of how it's different this time.

Sorry if this is a bit harsh. I don't want to be too down on this project, but I feel like a core chunk of the post is that there's some exciting development that leads Stuart to think something new is possible but then doesn't really tell us what that something new is, and I feel that by the standards of LW/AF that's good reason to complain and ask for more info.

Firstly, because the problem feels central to AI alignment, in the way that other approaches didn't. So making progress in this is making general AI alignment progress; there won't be such a "one error detected and all the work is useless" problem. Secondly, we've had success generating some key concepts, implying the problem is ripe for further progress.

Value extrapolation is thus necessary for AI alignment. It is also almost sufficient, since it allows AIs to draw correct conclusions from imperfectly defined human data.

I am missing something... The idea of correctly extrapolating human values is basically the definition of the Eliezer's original proposal, CEV. In fact, it's right there in the name. What is the progress over the last decade?

CEV is based on extrapolating the person; the values are what the person would have had, had they been smarter, known more, had more self-control, etc... Once you have defined the idealised person, the values emerge as a consequence. I've criticised this idea in the past, mainly because the process to generate the idealised person seems vulnerable to negative attractors (Eliezer's most recent version of CEV has less of this problem).

Value extrapolation and model splintering are based on extrapolating features and concepts in models, to other models. This can be done without knowing human psychology or (initially) knowing anything about humans at all, including their existence. See for example the value extrapolation partially resolves symbol grounding post; I would never write "CEV partially resolves symbol grounding". On the contrary, CEV needs symbol grounding.

I don't really understand the symbol grounding issue, but I can see that "value extrapolation" just happened to sound very similar to CEV and hence my confusion.

I wanted to look up CEV after reading this comment. Here's a link for anyone else looking:

That acronym stands for "Coherent Extrapolated Volition" not "Coherent Extrapolated Values". But from skimming the paper just now, I think I agree with shminux that it's basically the same idea.

A more recent explanation of CEV by Eliezer:  

Aligned AI is a benefit corporation dedicated to solving the alignment problem

Is this a UK or US public-benefit corporation?

Who are the other founders?

Who has capitalized you, and for how much?

Rebecca Gorman (who authored with Stuart Armstrong) is another co-founder according to her LinkedIn page.

This page says "We are located in Oxford, England." So I think they are a UK public-benefit corporation, but I could be mistaken.

UK based currently, Rebecca Gorman other co-founder.

Hmm, the only overlap I can see between your recent work and this description (including optimism about very-near-term applications) is the idea of training an ensemble of models on the same data, and then if the models disagree with each other on a new sample, then we're probably out of distribution (kinda like the Yarin Gal dropout ensemble thing and much related work).

And if we discover that we are in fact out of distribution, then … I don't know. Ask a human for help?

If that guess is at all on the right track (very big "if"!), I endorse it as a promising approach well worth fleshing out further (and I myself put a lot of hope on things in that vein working out). I do, however, think there are AGI-specific issues to think through, and I'm slightly worried that y'all will get distracted by the immediate deployment issues and not make as much progress on AGI-specific stuff. But I'm inclined to trust your judgment :)
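
To make the guess above concrete, here is a minimal sketch of ensemble disagreement as an out-of-distribution signal. It is an illustration under stated assumptions, not a description of Aligned AI's method: the bootstrap-resampled linear models, the `threshold` value, and the function names are all hypothetical choices made for the example.

```python
import random
import statistics
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Model = Callable[[float], float]


def fit_line(points: List[Point]) -> Model:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def train_ensemble(data: List[Point], n_members: int,
                   rng: random.Random) -> List[Model]:
    """Each member is fitted on a bootstrap resample of the same data."""
    return [fit_line([rng.choice(data) for _ in data])
            for _ in range(n_members)]


def disagreement(ensemble: List[Model], x: float) -> float:
    """Spread of the members' predictions at input x."""
    return statistics.pstdev([model(x) for model in ensemble])


def decide(ensemble: List[Model], x: float, threshold: float) -> str:
    """Act when the members agree; otherwise defer to a human."""
    return "ask a human" if disagreement(ensemble, x) > threshold else "act"


rng = random.Random(0)
# Training inputs all lie in [0, 1]; that range is the 'distribution'.
xs = [rng.random() for _ in range(30)]
data = [(x, 2 * x + rng.gauss(0, 0.1)) for x in xs]
ensemble = train_ensemble(data, n_members=10, rng=rng)

for x in (0.5, 10.0):
    print(x, round(disagreement(ensemble, x), 3), decide(ensemble, x, 0.1))
```

The members agree closely near the training data and diverge far away from it, which is the signal being described. Choosing the threshold, and deciding what to do once it is crossed, are the parts this sketch leaves open.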

Unmentioned but large comparative advantage of this: it's not based in the Bay Area.

The typical alignment pitch of "Come and work on this super-difficult problem you may or may not be well suited for at all" is a hard enough sell for already-successful people (which intelligent people often are) without adding: "Also you have to move to this one specific area of California which has a bit of a housing and crime problem and very particular culture".

Unmentioned but large comparative advantage of this: it's not based in the Bay Area.

It's based in the Bay Area of England (Oxford), though, with no mention of remote. So, all the same pathologies: extreme liberal politics, high taxes and cost of living, Dutch disease being captured by NIMBYs with a lock on ever escalating real estate prices and banning density, persistent blatant crime and homelessness (in some ways, worse: I was never yelled at by the homeless in SF like I was in Oxford, and one woman tried to scam me twice. I was there for all of 2 weeks!).

Taxes in Oxford are more-or-less the same as anywhere else in the UK. These are lower than many European countries but higher than the US (especially states with no income tax). 

Rent in SF is more than 2x Oxford (seems roughly right to me) but I agree with what you say on housing. 

Having lived in SF and Oxford, the claim about crime and homelessness doesn't match my experience at all (nor any anecdotes I've heard). I'd be very surprised if stats showed more crime in Oxford vs the central parts of SF. 

The homeless in Oxford talked to me or followed me more than in Berkeley. (I haven’t spent much time in SF.)

Another difference is the geographic location! As someone who grew up in Germany, living in England is a lot more attractive to me since it will allow me to be closer to my family. Others might feel similarly.

If you think this is financially viable, then I'm fairly keen on this, especially if you provide internships and development opportunities for aspiring safety researchers.

Yes, those are important to provide, and we will.

So we need a way to have alignment deployed throughout the algorithmic world before anyone develops AGI. To do this, we'll start by offering alignment as a service for more limited AIs.

I'm tentatively fairly excited about some version of this, so I'll suggest some tweaks that can hopefully be helpful for your success (or for the brainstorming of anyone else who's thinking about doing something similar in the future).

We will refine and develop this deployment plan, depending on research results, commercial opportunities, feedback, and suggestions.

I suspect there'd be much better commercial/scaling opportunities for a somewhat similar org that offered a more comprehensive, high-quality package of "trustworthy AI services"--e.g., addressing bias, privacy issues, and other more mainstream concerns along with safety/alignment concerns. Then there'd be less of a need to convince companies about paying for some new service--you would mostly just need to convince them that you're the best provider of services that they're already interested in. (Cf. ethical AI consulting companies that already exist.)

(One could ask: But wouldn't the extra price be the same, whether you're offering alignment in a package or separately? Not necessarily--IP concerns and transaction costs incentivize AI companies to reduce the number of third parties they share their algorithms with.)

As an additional benefit, a more comprehensive package of "trustworthy AI services" would be directly competing for consumers with companies like the AI consulting company mentioned above. This might pressure those companies to start offering safety/alignment services--a mechanism for broadening adoption that isn't available to an org that only provides alignment services.

[From the website] We are hiring AI safety researchers, ML engineers and other staff.

Relatedly to the earlier point, given that commercial opportunities are a big potential bottleneck (in other words, given that selling limited alignment services might be as much of a communications and persuasion challenge as it is a technical challenge), my intuition would be to also put significant emphasis into hiring people who will kill it at the persuasion: people who are closely familiar with the market and regulatory incentives faced by relevant companies, people with sales and marketing experience, people with otherwise strong communications skills, etc. (in addition to the researchers and engineers).

Adding on to Mauricio's idea: Also explore partnering with companies that offer a well-recognized, high-quality package of mainstream "trustworthy AI services" -- e.g., addressing bias, privacy issues, and other more mainstream concerns -- where you have comparative advantage on safety/alignment concerns and they have comparative advantage on the more mainstream concerns. Together with a partner, you could provide a more comprehensive offering. (That's part of the value proposition for them. Also, of course, be sure to highlight the growing importance of safety/alignment issues, and the expertise you'd bring.) Then you wouldn't have to compete in the areas where they have comparative advantage. 

We agree with this.

Thanks for the ideas! We'll think on them.

Given that there's a lot of variation in how humans extrapolate values, whose extrapolation process do you intend to use?

If that will turn out to be the only problem then we'll be in an amazing situation

Near future AGI might be aligned to the meta-preferences of MTurkers more than anyone else :P

We're aiming to solve the problem in a way that is acceptable to one given human, and then generalise from that.

This seems fragile in ways that make me less optimistic about the approach overall. We have strong reasons to think that value aggregation is intractable, and (by analogy,) in some ways the problem of coherence in CEV is the tricky part. That is, the problem of making sure that we're not Dutch book-able is, IIRC, NP-complete, and even worse, the problem of aggregating preferences has several impossibility results.

Edit: To clarify, I'm excited about the approach overall, and think it's likely to be valuable, but this part seems like a big problem.

I've posted on the theoretical difficulties of aggregating the utilities of different agents. But doing it in practice is much more feasible (scale the utilities to some not-too-unreasonable scale, add them, maximise sum).

But value extrapolation is different from human value aggregation; for example, low power (or low impact) AIs can be defined with value extrapolation, and that doesn't need human value aggregation.
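
The practical recipe above (scale, add, maximise) could look something like the following sketch. The min-max scaling to [0, 1] is an assumption picked for the example, one of many "not-too-unreasonable" scales, and the agents and options are invented.

```python
from typing import Dict, List

Utilities = Dict[str, float]


def normalise(utilities: Utilities) -> Utilities:
    """Min-max scale one agent's utilities to [0, 1] (an assumed choice of scale)."""
    lo, hi = min(utilities.values()), max(utilities.values())
    if hi == lo:
        # An indifferent agent contributes nothing to the comparison.
        return {option: 0.0 for option in utilities}
    return {option: (value - lo) / (hi - lo)
            for option, value in utilities.items()}


def aggregate(agents: List[Utilities]) -> str:
    """Scale each agent's utilities, add them, and return the option with the highest sum."""
    scaled = [normalise(u) for u in agents]
    totals = {option: sum(s[option] for s in scaled) for option in scaled[0]}
    return max(totals, key=lambda option: totals[option])


# Hypothetical agents whose raw utilities are on very different scales.
agents = [
    {"park": 10.0, "road": 0.0, "library": 6.0},
    {"park": 0.0, "road": 1.0, "library": 0.9},
    {"park": -5.0, "road": -4.0, "library": -1.0},
]
print(aggregate(agents))
```

The theoretical difficulty is hidden in `normalise`: a different choice of scale can change which option wins, which is why the sketch labels it as an assumption.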

I'm skeptical that many of the problems with aggregation don't both apply to actual individual human values once extrapolated, and generalize to AIs with closely related values, but I'd need to lay out the case for that more clearly. (I did discuss the difficulty of cooperation even given compatible goals a bit in this paper, but it's nowhere near complete in addressing this issue.)

It's worth writing up your point and posting it - that tends to clarify the issue, for yourself as well as for others.

Hi David,

As Stuart referenced in his comment to your post here, value extrapolation can be the key to AI alignment *without* using it to deduce the set of human values. See the 'List of partial failures' in the original post: With value extrapolation, these approaches become viable.

To do this, we'll start by offering alignment as a service for more limited AIs. Value extrapolation scales down as well as up: companies value algorithms that won't immediately misbehave in new situations, algorithms that will become conservative and ask for guidance when facing ambiguity.

What are examples of AIs you think you can currently align and how much (order of magnitude, say) would it cost to have you align one for me? If I have a 20B parameter language model, can you align it for me?

Reach out to my co-founder (Rebecca Gorman) on LinkedIn.

For some time, I have planned to make a post calling for more people to actually try to solve the problem of alignment. I haven't studied Stuart's work in detail (something to be rectified soon), and I always say June Ku's is the most advanced scheme we have, but as an unapologetic fan of CEV, I think this talk of value extrapolation seems on the right track. I do wonder to what extent a solution to alignment for autonomous superhuman AI can lead (in advance) to spinoff for narrower and less powerful systems - superhuman alignment seems to require a determination of the full "human utility function", or something similar; I suppose the extrapolation part might be relevant for lesser AI, even if the full set of human values is not - but we shall learn more as Stuart's scheme unfolds.

I will add that I am personally interested in contributing to this kind of research (paid work would be most empowering, but absent that, I will still keep doing what I can, when I can, until we run out of time), but my circumstances are a little unusual, and might be incompatible with what some organizations require. So for now I'll just mention my interest.

Thanks. Would you want to send me a message explaining your interest and your unusual circumstances (if relevant)?

I'm encouraged by your optimism, and wish you the best of luck (British, and otherwise), but I hope you're not getting much of your intuition from the "Humans have demonstrated a skill with value extrapolation..." part. I don't think we have good evidence for this in a broad enough range of circumstances for it to apply well to the AGI case.

We know humans do pretty 'well' at this - when surrounded by dozens of other similar agents, in a game-theoretical context where it pays to cooperate, it pays to share values with others, and where extreme failure modes usually lead to loss of any significant power before they can lead to terrible abuse of that power.

Absent such game-theoretic constraints, I don't think we know much at all about how well humans do at this.

Further, I don't think I know what it means to do value extrapolation well - beyond something like "you're doing it well if you're winning" (what would it look like for almost all humans to do it badly?). That's fine for situations where cooperation with humans is the best way to win. Not so much where it isn't.

But with luck I'm missing something!

I do not put too much weight on that intuition, except as an avenue to investigate (how do humans do it, exactly? If it depends on the social environment, can the conditions of that be replicated?).

I’m really glad to see this. I can’t say I fully grasp your particular approach, but what you’ve written about model fragments has really resonated.

My intuition around value extrapolation is that if we extrapolate the topic itself it’ll eventually turn into creating fine models of nervous system dynamics. Will be curious to see how your work intersects and what it assumes about neuroscience, and also what sort of neuroscience progress you think might make your work easier.

Good luck!

All the best on this new venture!

Regarding 'value extrapolation', I wrote a little on grounding the acquisition of the right priors by 'learning to value learn' last year in Motivations, Natural Selection, and Curriculum Engineering (section Transmissible Accumulation). It basically just has seeds of ideas, but you may be interested.

I think value extrapolation is more tractable than many assume, even for very powerful systems. I think this because I expect AI systems to strongly prefer a small number of general explanations over many shallow explanations. I expect such general explanations for human values are more likely to extend to unusual situations than more shallow explanations.

One approach that seems really underexplored is to directly generate data on how human preferences extend to extreme situations or very capable AIs. OpenAI was able to greatly improve the alignment of current language models by learning a reward model from text examples of current language models following human instructions, ranked by how well the AI’s output followed the human’s instruction. We should be able to generate a similar values data set, but for AIs much stronger than current language models. See here for a more extended discussion.
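
As a rough sketch of what "learning a reward model from ranked examples" can mean, here is a tiny pairwise-preference learner. It assumes a linear reward and a logistic (Bradley-Terry style) loss on pairs; that is a common way to learn from rankings, but whether it matches the details of OpenAI's setup is an assumption, and the two-number feature vectors are invented for the example.

```python
import math
from typing import List, Tuple

Vector = List[float]
Pair = Tuple[Vector, Vector]  # (preferred output, rejected output)


def reward(weights: Vector, features: Vector) -> float:
    """Linear reward model: a weighted sum of the output's features."""
    return sum(w * f for w, f in zip(weights, features))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs: List[Pair], dim: int, lr: float = 0.5,
          epochs: int = 200) -> Vector:
    """Maximise log sigmoid(reward(preferred) - reward(rejected)) by SGD."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            p = sigmoid(reward(weights, better) - reward(weights, worse))
            # Gradient of log p with respect to the weights is
            # (1 - p) * (better - worse), so step in that direction.
            for i in range(dim):
                weights[i] += lr * (1.0 - p) * (better[i] - worse[i])
    return weights


# Hypothetical features: [how well the instruction was followed, length].
pairs: List[Pair] = [
    ([1.0, 0.2], [0.0, 0.9]),
    ([1.0, 0.8], [0.0, 0.1]),
    ([0.9, 0.5], [0.2, 0.5]),
]
weights = train(pairs, dim=2)

# The learned reward should rank every preferred output above its rejected one.
print(all(reward(weights, b) > reward(weights, w) for b, w in pairs))
```

A values data set for much stronger AIs would plug into the same shape: pairs of behaviours with a human judgement of which is better. The open question is whether such judgements can be collected for situations far from current experience.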

To do this, we'll start by offering alignment as a service for more limited AIs.

Interesting move! Will be interesting to see how you will end up packaging and positioning this alignment as a service, compared to the services offered by more general IT consulting companies. Good luck!

How do you know when you have solved the value extrapolation problem? 

One hypothesis I have for what you might say is something like "a training scheme solves the value extrapolation problem when the sequence of inputs that will be seen in deployment by the AI produced by that training scheme leads to outputs which lead to positive outcomes by human lights" though from what I can tell, that's basically the same as having a training scheme that leads to an "impact aligned" AI*.

If it isn't this, how is your answer different?

*[ETA: the definition of impact alignment that Evan gives in the linked post technically only refers to an AI "which doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic," but in my comment above, I meant to refer to what I think is the more relevant property for an AI to have, which I'll call (impact aligned)_Jack: an agent is (impact aligned)_Jack to the degree that, by human lights, it doesn't take bad actions and does take good actions. I think that this is more relevant because Evan's definition doesn't distinguish between a rock and an intuitively aligned AI.]


Knowing that we've solved the problem relies on knowing the innards of the algorithm we've designed, and proving theorems about it, rather than looking solely at its behaviour.

Oh I see -- could you say more about what characteristics you want the innards to have?

Ping about my other comment -- FYI, because I am currently concerned that you don't have criteria for the innards in mind, I'm less excited about your agenda than other alignment theory agendas (though this lack of excitement is somewhat weak, e.g. since I haven't tried to digest your work much yet).

Let me develop the idea a bit more. It is somewhat akin to answering, in 1968, the question "how do you know you've solved the moon landing problem?" In that case, NASA could point to them having solved a host of related problems (getting into space, getting to the moon, module separation, module reconnection), knowing that their lander could theoretically land on the moon (via knowledge of the laws of physics and of their lander design), estimating that the pilots are capable of dealing with likely contingencies, trusting that their model of the lunar landing problem is correct and has covered various likely contingencies, etc... and then putting it all together into a plan where they could say "successful lunar landing is likely".

Note that various parts of the assumptions could be tested; engineers could probe at the plan and say things like "what if the conductivity of the lunar surface is unusual", and try and see if their plan could cope with that.

Back to value extrapolation. We'd be confident that it is likely to work if we had, for example:

  1. It works well in all situations where we can completely test it (eg we have a list of human moral principles, and we can have an AI successfully run a school using those as input).
  2. It works well on testable subproblems of more complicated situations (eg we inspect the AI's behaviour in specific situations).
  3. We have models of how value extrapolation works in extreme situations, and strong theoretical arguments that those models are correct.
  4. We have developed a much better theoretical understanding of value extrapolation, and are confident that it works.
  5. We've studied the problem adversarially and failed to break the approach.
  6. We have deployed interpretability methods to look inside the AI at certain places, and what we've seen is what we expect to see.

These are the sort of things that could make us confident that a new approach could work. Is this what you are thinking?

Thanks for this list!

Though the list still doesn't strike me as very novel -- it feels that most of these conditions are conditions we've been shooting for anyways.

E.g. conditions 1, 2, and 5 are about selecting for behavior we approve of and condition 6 is just inspection with interpretability tools.

If you feel you have traction on conditions 3 and 4 though, that does seem novel (side-note that condition 4 seems to be a subset of condition 3). I feel skeptical though, since value extrapolation seems like about as hard of a problem as understanding machine generalization in general + the way a thing behaves in a large class of cases seems to be so complicated of a concept that you won't be able to have confident beliefs about it or understand it. I don't have a concrete argument about this though.

Anyways, thanks for responding, and if you have any thoughts about the tractability of conditions 3/4, I'm pretty curious.

Yes, the list isn't very novel - I was trying to think of the mix of theoretical and practical results that convince us, in the current world, that a new approach will work. Obviously we want a lot more rigour for something like AI alignment! But there is an urgency to get it fast, too :-(
