Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Over the course of the AI Alignment Prize I've sent out lots of feedback emails. Some of the threads were really exciting and taught me a lot. But mostly it was me saying pretty much the same thing over and over with small variations. I've gotten pretty good at saying that thing, so it makes sense to post it here.

Working on AI alignment is like building a bridge across a river. Our building blocks are ideas like mathematics, physics or computer science. We understand how they work and how they can be used to build things.

Meanwhile on the far side of the river, we can glimpse other building blocks that we imagine we understand. Desire, empathy, comprehension, respect... Unfortunately we don't know how these work inside, so from the distance they look like black boxes. They would be very useful for building the bridge, but to reach them we must build the bridge first, starting from our side of the river.

What about machine learning? Perhaps the math of neural networks could free us from the need to understand the building blocks we use? If we could create a behavioral imitation of that black box over there (like "human values"), then building the bridge would be easy! Unfortunately, ideas like adversarial examples, treacherous turns or nonsentient uploads show that we shouldn't bet our future on something that imitates a particular black box, even if the imitation passes many tests. We need to understand how the black box works inside, to make sure our version's behavior is not only similar but based on the right reasons.
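As a toy illustration of the gap between passing tests and behaving well for the right reasons, here is a minimal Python sketch. Everything in it is hypothetical and not from any real system: `black_box` stands in for the thing we would like to copy, and `imitation` is a stand-in for a learned model that only memorizes observed behavior.

```python
# Toy illustration only: both functions are hypothetical stand-ins.

def black_box(x: int) -> bool:
    """The 'true' rule we cannot inspect: approve even numbers."""
    return x % 2 == 0

# Behavioral data gathered by querying the black box on a narrow range.
observations = {x: black_box(x) for x in range(0, 10)}

def imitation(x: int) -> bool:
    """Answer with the nearest memorized observation; no notion of parity."""
    nearest = min(observations, key=lambda seen: abs(seen - x))
    return observations[nearest]

# The imitation passes every test drawn from the range it was built on...
passes_tests = all(imitation(x) == black_box(x) for x in range(0, 10))

# ...but it disagrees with the black box outside that range.
disagreements = [x for x in range(10, 20) if imitation(x) != black_box(x)]

print(passes_tests)
print(disagreements)
```

The imitation agrees with the black box on every input it was tested on, yet it never represented the actual rule, so the agreement tells us little about new inputs.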

(Eliezer made the same point more eloquently a decade ago in Artificial Mysterious Intelligence. Still, with the second round of our prize now open, I feel it's worth saying again.)


I really like posts that report the main update / thing noticed after getting a bunch of data - in your case, reading a lot of people's alignment proposals.

The fact that the insight has previously been had (thanks for linking to the original post in the sequences!) isn't a problem - this feels like a successful replication study. So for these reasons, I've curated this post.

Added: And as per usual with curated posts, the length + readability of the post was great.

Added2: Also want to mention that this is an important topic, which was a further reason for curation.

Thanks for this post!

>adversarial examples

Consider two strategies for developing software: top-down and bottom-up. Each strategy has its own risks. The risk of the bottom-up approach is that you build components that don't end up being useful in the final design. The risk of the top-down approach is that you make assumptions about the sort of components it's possible to build that don't end up being true. Because each approach has advantages and disadvantages, I think it makes sense for people to take both approaches when planning a large, important software project such as an FAI.

OpenAI is working on how to solve the problem of adversarial examples, which is great. But I also think it's useful for people to ask the question: suppose we had a solution to the problem of adversarial examples. Would it then be possible to build an FAI? If so, how? By answering this question, we can know things like: How useful is it for us to work on adversarial examples, relative to other bottom-up AI safety work? Are there other problems such that if we were able to solve them, and combine the solution to them with our solution to adversarial examples, we'd be able to build an FAI?

BTW, I would guess that top-down work is a comparative advantage of competitions like this one, since bottom-up work is the kind of thing that academic journals will publish.

>treacherous turns

Suppose we have an algorithm that's known to be a behavior-executor but is otherwise a black box. If it's truly an accurate imitation of human values, in the sense that the box agrees exactly with us about what kind of worlds are valuable, then maybe you'd still have the problem of a treacherous turn in the sense that once an AI using this box as its values gets powerful enough, it grabs hold of civilization's steering wheel and doesn't let go. But that wouldn't be a problem if the AI's values were determined by this hypothetical perfect black box, because it would steer us exactly where we wanted to go anyway.

>nonsentient uploads

This seems to point to a deleted post. Did you mean this post? Or did you mean to refer to your conversation with Wei Dai in the thread you linked?

>Artificial Mysterious Intelligence

I largely agree with that post, but there's a subtle distinction I want to make.

When you develop software top-down, you start with a vague idea of what you want to build, and then you break it down into several components. For each component, you have a vague idea of how it will work--hopefully at least a little less vague than your idea about how the thing as a whole will work. The process proceeds from gathering requirements, to writing a spec, to architecture diagrams, to pseudocode, until finally you have working code. Software architecture is kinda like philosophy in a certain way.

To a software architect who's early in the design stage, "this component has not been fully specified" is not a very interesting objection. At this point, none of the components have been fully specified! More interesting objections:

  • "Based on the sketch you've provided, it doesn't seem like this software would actually do what we need if it was developed"
  • "I don't see how this component could be made secure"
  • "I don't think we can build this component" / "This component seems at least as difficult to create as the project as a whole"

Note that the last two objections do not necessarily mean the design as a whole is dead on arrival. If the software architect knows what they're doing, they are describing a subcomponent because they think it's at least a little easier than the entire project. In this case, there's a difference of opinion which indicates that it could be useful to discuss that subcomponent more.

If FAI is rated as impossibly difficult, and we're able to reduce it to subcomponents rated at most as extremely difficult, that seems like progress. If we're able to reduce it to components that we think might become available before AGI as a whole becomes available, that creates the possibility of FAI through differential technological development.

Software development is a poor metaphor for AI alignment. It reminds me of Ron Jeffries' sad attempt to write a Sudoku solver by using "extreme programming" without knowing about backtracking search. He kinda blunders around for a few weeks and then stops. Another nice mockery is this, which (unbelievably) comes from an actual AI design by someone.

A better metaphor for AI alignment is fundamental science, where we're supposed to understand things step by step. Your post makes several such steps at the start. But then you spend many pages on sketching a software design, which even after careful reading didn't advance my understanding of AI alignment in any way. That's the criterion I use.

Maybe we should explain to contestants that they should try to advance the frontier of understanding by one inch, not solve all of AI alignment in one go. The latter seems to lead people in unproductive directions.

It may be worth mentioning that the "someone" who produced the "actual AI design" is a known crank. (Whose handle must not be mentioned, for ancient legend says that if you speak his name then he will appear and be a big nuisance.)

>Software development is a poor metaphor for AI alignment.

I think I disagree, but let's ignore the metaphor aspect and focus on the model. The same causal model can also be communicated using science & engineering as a metaphor. If you want to know what scientific insights to work towards to create some breakthrough technology, it's valuable to periodically put on your engineer hat. Without it, you'll do basic research that could end up leading anywhere. In search terms, an engineer hat offers an improved heuristic. If your scientist hat allows you to forward chain, your engineer hat allows you to backward chain.

I'd argue the engineer hat is critical for effective differential technological development.
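To make the forward versus backward chaining distinction concrete, here is a minimal Python sketch. The rule names (`theory_a`, `component`, `technology` and so on) are made up for illustration; the point is only that forward chaining derives everything reachable from what is known, while backward chaining visits only what the goal depends on.

```python
# Hypothetical rules: each result maps to the prerequisites needed to derive it.
RULES = {
    "theory_a": {"observation_1"},
    "theory_b": {"observation_2"},
    "theory_c": {"observation_1", "observation_2"},
    "technology": {"theory_a", "component"},
    "component": {"observation_1"},
}

def forward_chain(facts):
    """Scientist hat: derive everything reachable from what is known."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for result, needs in RULES.items():
            if result not in known and needs <= known:
                known.add(result)
                changed = True
    return known

def backward_chain(goal, facts):
    """Engineer hat: collect only the results the goal depends on."""
    needed = set()
    stack = [goal]
    while stack:
        item = stack.pop()
        if item in facts or item in needed:
            continue
        needed.add(item)
        stack.extend(RULES.get(item, ()))
    return needed

facts = {"observation_1", "observation_2"}
derived = forward_chain(facts) - facts
needed = backward_chain("technology", facts)

print(sorted(derived))
print(sorted(needed))
```

In this toy setup, forward chaining also derives `theory_b` and `theory_c`, which the target technology never uses, whereas backward chaining from `technology` touches only `theory_a` and `component`.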

When I saw the title, I thought, 'But we want to decompose problems in FAI theory to isolate questions we can answer. This suggests heavy use of black boxes.' I wondered if perhaps he was trying to help people who were getting everything wrong (in which case I think a positive suggestion has more chance of helping than telling people what to avoid). I was pleased to see the post actually addressed a more intelligent perspective, and has much less to do with your point or mine.

>We need to understand how the black box works inside, to make sure our version's behavior is not just similar but based on the right reasons.

I think "black box" here can refer to two different things: things in philosophy or science that we do not yet fully understand, and machine learning models like neural networks that seem to capture their knowledge in ways that are uninterpretable to humans.

We will almost certainly require the use of machine learning or AI to model systems that are beyond our capabilities to understand. This may include physics, complex economic systems, the invention of new technology, or yes, even human values. There is no guarantee that a theory that describes our own values can be written down and understood fully by us.

Have you ruled out any kind of theory which would allow you to know for certain that a "black-box" model is learning what you want it to learn, without understanding everything that it has learned exactly? I might not be able to actually formally verify that my neural network has learned exactly what I want it to (i.e. by extracting the knowledge out of it and comparing it to what I already know), but maybe I have formal proofs of the algorithm it is using and so I know its knowledge will be fairly robust under certain conditions. It's basically the latter we need to be aiming for.

I agree! There's a distinction between "we know exactly what knowledge is represented in this complicated black box" and "we have formal guarantees about properties of the black box". It's indeed very different to say "the AI will have a black box representing a model of human preferences" and "we will train the AI to build a model of human preferences using a bootstrapping scheme such as HCH, which we believe works because of these strong arguments".

Perhaps more crisply, we should distinguish between black boxes where we have a good grasp of why the box will behave as expected, and black boxes whose behavior we have little ability to reason about at all. I believe that both cousin_it and Eliezer (in the Artificial Mysterious Intelligence post) are referring to the folly of using the second type of black box in AI designs.

Perhaps related: Jessica Taylor's discussion on top-level vs subsystem reasoning.

I think what you're describing is possible, but very hard. Any progress in that direction would be much appreciated, of course.

"Meanwhile on the far side of the river, we can glimpse other building blocks that we imagine we understand. Desire, empathy, comprehension, respect... Unfortunately we don't know how these work inside, so from the distance they look like black boxes. They would be very useful for building the bridge, but to reach them we must build the bridge first, starting from our side of the river."

If you wish to understand desire, it is the simple syntax of wanting something to fulfill a goal, such as desiring a spoon to stir your coffee. Empathy is putting yourself in the other person's place and applying the actions they are going through to yourself. Comprehension is a larger process involving previously stored knowledge combined with prediction, imagination, and various other processes like our pattern detector. Respect is simply an object that you attach various actions or attributes to which you think are good.

My basic point is that 3 of these are simple patterns, or what would be electrical signals, and comprehension is the basic process we use to process the inputted data. So all but comprehension we can easily understand, and they can be ignored since they are patterns created through the comprehension process. For example, you need to desire things to fulfill goals, which is a basic pattern created within us to accomplish the rest.

Human level AI implies a choice that will always be in the robot's, or AI's, own self-interest to keep or achieve their desired state of the world. So basically we only need to understand that choice process, and that will be the same as in humans: actions leading to consequences, which lead to more actions and consequences, where the robot chooses the best option for itself. For instance, given option A or B, it picks the one that will benefit it the most, like we do. So to change this we simply need to add the good consequences of making moral choices to align them to our goals, and the bad consequences if they do not.

For example, if a robot wanted to break into a computer hardware store to get a faster processor for itself, it would do so unless it had reasons, in the form of actions and consequences, as to why it should not. To align the choice to be moral you need to explain the bad consequences if it does make that choice, and the good consequences if it does not. And at the core that is what AI alignment is all about, since it always relies on a choice that is in the robot's own interests. For instance, you could explain that if it steals the processor, the owner or the police will probably come to hurt it, which will lead to bad consequences. Just like us.

If you are talking about aligning simple task-based AI to our morals and goals, then by simple definition those task-based robots are going to be guided by the morals and goals of humans, and in that case it will be the same process to align their morals, since all intelligence leading to choices will use the same critic process, whether it is human or robot. Otherwise they cannot choose what they will do in complex situations.

For instance, when you make choices, is it based on mathematical equations and formulas, or on actions leading to consequences, leading to more actions and consequences, from which you then choose the best option for you at the time? Any robot with human level intelligence will use the same process for the simple fact that they must. So if they can choose their own goals and actions we must align them, and that is what AI alignment is all about. Basically, you can ignore most, if not all, of the concepts called black boxes, because that core is the only thing you need to concentrate on, the same as with humans.

Basically, the only real math you need is to recreate the input process. From there it is all basic pattern manipulation based on psychology. And that is the oldest science of them all.