I think when people imagine misaligned AGIs, they tend to imagine a superintelligent agent optimizing for something other than human values (e.g. paperclips, or a generic reward signal), and to mentally picture it as adversarial or malevolent. I think this visualization isn't as applicable to AGIs trained to optimize for human approval, like act-based agents, and I'd like to present one that is.

If you've ever employed someone or had a personal assistant, you might know that the following two things can both be true at once:

  • The employee or assistant is genuinely trying their hardest to optimize for your values. They're trying to understand what you want as much as they can, asking you for help when things are unclear, not taking action until they feel like their understanding is adequate, etc.
  • They follow your instructions literally, under a sensible-to-them-seeming interpretation completely different from your own, and screw up the task entirely.

Suppose you were considering hiring a personal assistant, and you knew a few things about it:

  • Your assistant was raised in a culture completely different from your own.
  • Your assistant is extremely non-neurotypical. It doesn't have an innate sense of pain or empathy or love, it's a savant at abstract reasoning, and it learned everything it knows about the world (including human values) from Wikipedia.
  • Your assistant is in a position where it has access to enormous amounts of resources, and could easily fool you or overpower you if it decided to.

You might consider hiring this assistant and trying really, really hard to communicate exactly what you want. It seems like a much better idea to just not hire this assistant. Actually, you'd probably want to run for the hills if you were forced to hire it. Some specific failure modes you might envision:

  • Your assistant's understanding of your values will be weird and off, perhaps in ways that are hard to communicate or even pin down.
  • Your assistant might reason in a way that looks convoluted and obviously wrong to you, while looking natural and obviously correct to it, leading it to happily take actions you'd consider catastrophic.

As an illustration of the above, imagine giving an eager, brilliant, extremely non-neurotypical friend free rein to help you find a romantic partner (e.g. helping you write your OKCupid profile and setting you up on dates). As another illustration, imagine telling an entrepreneur friend that superintelligences can kill us all, and then watching him take drastic actions that clearly indicate he's missing important nuances, all while he misunderstands and dismisses concerns you raise to him. Now reimagine these scenarios with your friends drastically more powerful than you.

This is my picture of what happens by default if we construct a recursively self-improving superintelligence by having it learn from human approval. The superintelligence would not be malevolent the way a paperclip maximizer would be, but for all intents and purposes it might as well be.
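The failure mode above can be made concrete with a toy sketch (mine, not the post's): an assistant that earnestly picks whichever action its learned proxy for your approval scores highest. All action names and scores here are illustrative assumptions. The proxy matches your true preferences on familiar actions but generalizes strangely on an unfamiliar one, so the assistant, while genuinely trying its hardest, confidently takes the action you'd consider catastrophic.

```python
# Hypothetical true approval scores over actions (illustrative only).
true_approval = {
    "ask_for_clarification": 0.9,
    "draft_profile_conservatively": 0.7,
    "message_every_user_on_site": -1.0,  # catastrophic to you
}

# The assistant's learned proxy: agrees with you on familiar actions,
# but wildly overrates the unfamiliar one under an interpretation that
# seems sensible to it.
proxy_approval = {
    "ask_for_clarification": 0.9,
    "draft_profile_conservatively": 0.7,
    "message_every_user_on_site": 1.5,
}

def choose_action(approval_model):
    """Earnestly pick the action the model predicts you'd approve of most."""
    return max(approval_model, key=approval_model.get)

chosen = choose_action(proxy_approval)
print(chosen)                 # the action the proxy rates highest
print(true_approval[chosen])  # how you actually feel about it
```

The point of the sketch is that nothing adversarial happens anywhere: the agent maximizes its honest best guess at your approval, and the damage comes entirely from the proxy being "weird and off" out of distribution.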


I think this is the wrong way of looking at it, because in this analogy the PA is "genuinely trying their hardest to optimize for your values"; it's just poor at understanding those values. That problem is basically ignorance, so by making the PA smarter or more aware, we can solve it.

But an AGI that fully understood your values would still not optimize for them if it had a bad goal. The AGI is not well-intentioned-but-weird-in-implementation; its intentions themselves are alien/weird to us.

I wish I'd been clearer in my title that I'm not trying to reframe all misaligned AGIs, just a particular class of them. I agree that an AGI that fully understood your values would not optimize for them (and would not be "well-intentioned") if it had a bad goal.

That problem is basically ignorance, and so by making the PA smarter or more aware, we can solve the problem.

If we've correctly specified the values in an AGI, then I agree that once the AGI is smart enough, it'll correctly optimize for our values. But that's not necessarily robust to scaling down, and I think the AGI is likely to pass through a weird regime where it's trying and failing to optimize for our values. This post is about my intuitions for what that might look like.

Ok; within that subset of problems, I agree.

I've curated this post for these reasons:

  • You put solid work into understanding central ideas in alignment, and published 4 posts simultaneously (1, 2, 3) communicating some key intuitions you've picked up. Doing this communicative work is really valuable and something I want to celebrate.
  • The posts helped me develop clear intuitions about corrigibility, approval-directed agents, and agent-foundations.
  • The posts are short, concrete, and easy to understand, which is the opposite of basically all the rest of the writing about AGI alignment.

My biggest hesitation(s) with curating this post:

  • I am somewhat hesitant to share simple intuition pumps about important topics, in case those intuition pumps are misleading.
  • I had a real hard time deciding which of the four to pick, and have gone with the one I thought people would find the clearest to read.

I was really excited that you wrote these posts, and learned a lot from them (plus the ensuing discussion in the comments).

I am somewhat hesitant to share simple intuition pumps about important topics, in case those intuition pumps are misleading.

This sounds wrong to me. Do you expect considering such things freely to be misleading on net? I expect some intuition pumps to be misleading, but for considering all of the intuitions that we can find about a situation to be better than avoiding them.

I feel like there are often big simplifications of complex ideas that just convey the wrong thing, and I was vaguely worried that in a field dominated by hard-to-read writing, things that are easy to understand will dominate the conversation even if they're pretty misguided. It's not a big worry for me here, but it was my biggest hesitation.

Not sure what Ben meant, but my own take is "sharing is fine, but intuition pumps without rigor backing them are not something we should curate regularly as an exemplar of what LW is trying to be"

Thanks a lot Ben! =D

I am somewhat hesitant to share simple intuition pumps about important topics, in case those intuition pumps are misleading.

On that note, Paul has recently written a blog post clarifying that his notion of "misaligned AI" does not coincide with what I wrote about here.

In Superintelligence, Nick Bostrom talks about various "AI superpowers". One of these is "Social manipulation", which he summarizes as

Social and psychological modeling, manipulation, rhetoric persuasion

Strategic relevance:

  • Leverage external resources by recruiting human support
  • Enable a “boxed” AI to persuade its gatekeepers to let it out
  • Persuade states and organizations to adopt some course of action
  • AI can expropriate computational resources over the Internet

And Eliezer Yudkowsky writes:

There’s a popular concept of “intelligence” as book smarts, like calculus or chess, as opposed to say social skills. So people say that “it takes more than intelligence to succeed in human society”. But social skills reside in the brain, not the kidneys. When you think of intelligence, don’t think of a college professor, think of human beings; as opposed to chimpanzees. If you don’t have human intelligence, you’re not even in the game.

In order to have elite social skills, you need to be able to form accurate models about the thoughts & intentions of others. But being able to form accurate models about the thoughts & intentions of an overseer is exactly the ability we'd like to see in a corrigible AI.

If we can build AI systems that form those models without being goal-driven agents, maybe it's possible to have the benefits of elite social skills without the costs. I'm optimistic that this is the case: many of our most powerful model-building techniques don't really behave as though they have some kind of goal they are trying to achieve in the world.

Minor point: You changed the title! The title used to be better.

How about now? :P

I've honestly forgotten the exact original wording, but I like this one more than the thing I complained about. (The post is super short and sweet and I liked having the title be a clear handle to the idea - an "AGI reframing" is not as good a pointer as "a well-intentioned non-neurotypical super-powerful assistant".)

I think this is a clever new way of phrasing the problem.

When you said 'friend that is more powerful than you', that also made me think of a parenting relationship. We can ask whether this well-intentioned personification of AGI would be a good parent to a human child. They might be able to give the child a lot of attention, an expensive education, and a lot of material resources, but they might also take unorthodox actions in the course of pursuing human goals.