AI safety without goal-directed behavior

[-]Wei Dai7yΩ11210

I'm curious if you're more optimistic about non-goal-directed approaches to AI safety than goal-directed approaches, or if you're about equally optimistic (or rather equally pessimistic). The latter would still justify your conclusion that we ought to look into non-goal-directed approaches, but if that's the case I think it would be good to be explicit about it so as to not unintentionally give people false hope (ETA: since so far in this sequence you've mostly talked about the problems associated with goal-directed agents and not so much about problems associated with the alternatives). I think I'm about equally pessimistic, because while goal-directed agents have a bunch of safety problems, they also have a number of advantages that may be pretty hard to replicate in the alternative approaches.

1. We have an existing body of theory about goal-directed agents (which MIRI is working on refining and expanding) which plausibly makes it possible to one day reason rigorously about the kinds of goal-directed agents we might build and determine their safety properties. Paul and others working on his approach are (as I understand it) trying to invent a theory of corrigibility, but I don't know if such a thing even exists in platonic theory space. And if it did, we're starting from scratch so it might take a long time to reach parity with the theory of goal-directed agents.
2. Goal-directed agents give you economic efficiency "for free". Alternative approaches have to simultaneously solve efficiency and safety, and may end up approximating goal-directed agents anyway due to competitive pressures.
3. Goal-directed agents can more easily avoid a bunch of human safety problems that are inherited by alternative approaches which all roughly follow the human-in-the-loop paradigm. These include value drift (including vulnerability to corruption/manipulation), problems with cooperation/coordination, lack of transparency/interpretability, and general untrustworthiness of humans.

[-]Rohin Shah7yΩ470

While I mostly agree with all three of your advantages, I am more optimistic about non-goal-directed approaches to AI safety. I think this is primarily because I'm generally optimistic about AI safety, and the well-documented problems with goal-directed agents make me pessimistic about that particular approach.

If I had to guess at what drives my optimism that you don't have, it would be that we can aim for an adequate, not-formalized solution, and this will very likely be okay. All else equal, I would prefer a more formal solution, but I don't think we have the time for that. I would guess that while this lack of formality makes me only a little more worried, it is a big source of worry for you and MIRI researchers. This means that argument 1 isn't a big update for me.

Re: argument 2, it's worth noting that a system that has some chance of causing catastrophe is going to be less economically efficient. Now people might build it anyway because they underestimate the chance of catastrophe, or because of race dynamics, but I'm hopeful that (assuming it's true) we can convince all the relevant actors that goal-directed agents have a significant chance of causing catastrophe. In that case, non-goal-directed agents have a lower bar to meet. But overall this is a significant update.

Re: argument 3, I don't really see why goal-directed agents are more likely to avoid human safety problems. It seems intuitively plausible -- if you get the right goal, then you don't have to rely on humans, and so you avoid their safety problems. However, even with goal-directed agents, the goal has to come from somewhere, which means it comes from humans. (If not, we almost certainly get catastrophe.) So wouldn't the goal have all of the human safety problems anyway?

I'm also optimistic about our ability to solve human safety problems in non-goal-directed approaches -- see for example the reply I just wrote on your CAIS comment.

[-]Wei Dai7yΩ590

>All else equal, I would prefer a more formal solution, but I don’t think we have the time for that.

I should have added that having a theory isn't just so we can have a more formal solution (which as you mention we might not have the time for) but it also helps us be less confused (e.g., have better intuitions) in our less formal thinking. (In other words I agree with what MIRI calls "deconfusion".) For example currently I find it really confusing to think about corrigible agents relative to goal-directed agents.

>However, even with goal-directed agents, the goal has to come from somewhere, which means it comes from humans. (If not, we almost certainly get catastrophe.) So wouldn’t the goal have all of the human safety problems anyway?

The goal could come from idealized humans, or from a metaphilosophical algorithm, or be an explicit set of values that we manually specify. All of these have their own problems, of course, but they do avoid a lot of the human safety problems that the non-goal-directed approaches would have to address some other way.

[-]Rohin Shah7yΩ120

>For example currently I find it really confusing to think about corrigible agents relative to goal-directed agents.

Strong agree, and I do think it's the biggest downside of trying to build non-goal-directed agents.

>The goal could come from idealized humans, or from a metaphilosophical algorithm, or be an explicit set of values that we manually specify.

For the case of idealized humans, couldn't real humans defer to idealized humans if they thought that was better?

Similarly, it seems like a non-goal-directed agent could be instructed to use the metaphilosophical algorithm. I guess I could imagine a metaphilosophical algorithm such that following it requires you to be goal-directed, but it doesn't seem very likely to me.

For an explicit set of values, those values come from humans, so wouldn't they be subject to human safety problems? It seems like you would need to claim that humans are better at stating their values than acting in accordance with them, which seems true in some settings and false in others.

[-]Wei Dai7yΩ240

>For the case of idealized humans, couldn’t real humans defer to idealized humans if they thought that was better?

Real humans could be corrupted or suffer some other kind of safety failure before the choice to defer to idealized humans becomes a feasible option. I don't see how to recover from this, except by making an AI with a terminal goal of deferring to idealized humans (as soon as it becomes powerful enough to compute what idealized humans would want).

>Similarly, it seems like a non-goal-directed agent could be instructed to use the metaphilosophical algorithm. I guess I could imagine a metaphilosophical algorithm such that following it requires you to be goal-directed, but it doesn’t seem very likely to me.

That's a good point. Solving metaphilosophy does seem to have the potential to help both approaches about equally.

>For an explicit set of values, those values come from humans, so wouldn’t they be subject to human safety problems? It seems like you would need to claim that humans are better at stating their values than acting in accordance with them, which seems true in some settings and false in others.

Well I'm not arguing that goal-directed approaches are more promising than non-goal-directed approaches, just that they seem roughly equally (un)promising in aggregate.

[-]Rohin Shah7yΩ120

>Well I'm not arguing that goal-directed approaches are more promising than non-goal-directed approaches, just that they seem roughly equally (un)promising in aggregate.

Your first comment was about advantages of goal-directed agents over non-goal-directed ones. Your next comment talked about explicit value specification as a solution to human safety problems; it sounded like you were arguing that this was an example of an advantage of goal-directed agents over non-goal-directed ones. If you don't think it's an advantage, then I don't think we disagree here.

>Real humans could be corrupted or suffer some other kind of safety failure before the choice to defer to idealized humans becomes a feasible option. I don't see how to recover from this, except by making an AI with a terminal goal of deferring to idealized humans (as soon as it becomes powerful enough to compute what idealized humans would want).

That makes sense, I agree that goal-directed AI pointed at idealized humans could solve human safety problems, and it's not clear whether non-goal-directed AI could do something similar.

[-]Steven Byrnes6y90

Rohin, I really like the distinction you draw between "build[ing] an AI system that could maximize an arbitrary function, and then [trying] to program in the utility function we care about" versus "build[ing] systems in such a way that these properties are inherent in the way that they reason." That was helpful.

However, it seems to me—and please correct me if I'm wrong!—that most or all CIRL papers are framing the problem in terms of understanding a generic goal-seeking system whose goal is "the human gets what they want". Then papers like The Off-Switch Game show that the goal of "the human gets what they want" leads to nice instrumental goals like not disabling off-switches. Do you agree?

So when I was reading CIRL papers, or reading Stuart Russell's new book, I did in fact keep thinking to myself "How do we make sure that the AI really has the goal of "The human gets what they want.", as opposed to a proxy to it that will diverge out-of-distribution?"

IDA / "act-based corrigibility" seems like more of an attempt to break out of the goal-seeking paradigm altogether, although I still haven't convinced myself that it succeeds.
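The Off-Switch Game intuition mentioned above can be illustrated with a small numeric sketch. This is a simplified toy, not the paper's full model: it assumes the robot holds a belief over the utility U of its proposed action, that switching off is worth 0, and that the human is perfectly rational and allows the action only when U > 0. The function name `off_switch_values` and the Gaussian belief are hypothetical choices made for illustration.

```python
import random


def off_switch_values(utility_samples):
    """Compare three robot policies under uncertainty about the action's utility U.

    utility_samples: samples from the robot's belief about U (hypothetical input).
    Returns the expected value of (a) acting immediately, (b) switching itself
    off, and (c) deferring to a human who, by assumption, lets the action
    proceed only when U > 0.
    """
    n = len(utility_samples)
    act = sum(utility_samples) / n                           # E[U]
    switch_off = 0.0                                         # shutting down is worth 0
    defer = sum(max(u, 0.0) for u in utility_samples) / n    # E[max(U, 0)]
    return act, switch_off, defer


# Assumed belief: U is roughly standard normal (an illustrative choice).
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
act, switch_off, defer = off_switch_values(samples)
print(f"act={act:.3f} switch_off={switch_off:.3f} defer={defer:.3f}")
```

Under these assumptions, deferring is never worse than acting or switching off, because E[max(U, 0)] is at least max(E[U], 0). The advantage disappears when the robot is certain about U, and the sketch says nothing about whether a real system actually has the intended goal rather than a proxy.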

[-]Rohin Shah6y40

To be clear, this post was not arguing that CIRL is not goal-directed -- you'll notice that CIRL is not on my list of potential non-goal-directed models above.

I think CIRL is in this weird in-between place where it is kind of sort of goal-directed. You can think of three different kinds of AI systems:

1. An agent optimizing a known, definite utility function
2. An agent optimizing a utility function that it is uncertain about, that it gets information about from humans
3. A system that isn't maximizing any simple utility function at all

I claim the first is clearly goal-directed, and the last is not goal-directed. CIRL is in the second set, where it's not totally clear: its actions are driven by a goal, but that goal comes from another agent (a human). (This is also the case with imitation learning, and that case is also not clear -- see this thread.)
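To make the second kind of system concrete, here is a minimal sketch of an agent that is uncertain about the utility function and updates on human choices. It is not CIRL itself: the two candidate utility functions, the Boltzmann-rational human model, and the `beta` parameter are all assumptions introduced for illustration.

```python
import math

# Hypothetical candidate utility functions over two outcomes.
CANDIDATES = {
    "likes_tea": {"tea": 1.0, "coffee": 0.0},
    "likes_coffee": {"tea": 0.0, "coffee": 1.0},
}


def update(prior, chosen, rejected, beta=2.0):
    """Bayesian update after the human picks `chosen` over `rejected`.

    Assumes a Boltzmann-rational human: the probability of a choice is
    proportional to exp(beta * utility of that choice).
    """
    posterior = {}
    for name, p in prior.items():
        u = CANDIDATES[name]
        e_chosen = math.exp(beta * u[chosen])
        e_rejected = math.exp(beta * u[rejected])
        posterior[name] = p * e_chosen / (e_chosen + e_rejected)
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}


def best_action(belief, actions=("tea", "coffee")):
    """Pick the action with the highest expected utility under the belief."""
    def expected_utility(action):
        return sum(p * CANDIDATES[name][action] for name, p in belief.items())
    return max(actions, key=expected_utility)


belief = {"likes_tea": 0.5, "likes_coffee": 0.5}
belief = update(belief, chosen="coffee", rejected="tea")
print(belief, best_action(belief))
```

The agent's behavior is still driven by maximizing expected utility, but which utility it maximizes depends on what it observes the human doing, which is why this case sits between the clearly goal-directed and the clearly non-goal-directed ones.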

>I did in fact keep thinking to myself "How do we make sure that the AI really has the goal of "The human gets what they want.", as opposed to a proxy to it that will diverge out-of-distribution?"

I think this is a reasonable critique to have. In the context of Stuart's book, this is essentially a quibble with principle 3:

>3. The ultimate source of information about human preferences is human behavior.

The goal learned by the AI system depends on how it maps human behavior (or sensory data) into (beliefs about) human preferences. If that mapping is not accurate (quite likely), then it will in fact learn some other goal, which could be catastrophic.

[-]Steven Byrnes6y30

Thanks! Pulling on that thread a bit more, compare:

My goal is that the human overseer achieves her goals. To accomplish this, I need to observe and interact with the human to understand her better—what kind of food she likes, how she responds to different experiences, etc. etc.

My goal is to maximize the speed of this racecar. To accomplish this, I need to observe and interact with the racecar to understand it better—how its engine responds to different octane fuels, how its tires respond to different weather conditions, etc. etc.

To me, they don't seem that different on a fundamental level. But they do have the super-important practical difference that the first one doesn't seem to have problematic instrumental subgoals.

(I think I'm just agreeing with your comment here?)

[-]Rohin Shah6y30

>(I think I'm just agreeing with your comment here?)

Yeah, I think that's basically right.

[-]Steven Byrnes6y50

If we're making a list of models for non-goal-directed AI (and we should!!), I would propose two more:

1. Non-consequentialist oracle AI: An oracle with the property that the algorithm will not think through the consequences of its own outputs. You ask it a question, it digs through its world-model for a fixed number of computation steps and spits out its best-guess answer, but crucially it does not try to model the causal effects of that output. (Contrast with Eliezer's side-comment about an oracle here, which he of course assumes will be goal-directed, with the goal of "increase the correspondence between the user's belief about relevant consequences and reality".) A non-consequentialist oracle could never be deceptive or manipulative, because deception and manipulation require modeling the causal effects of outputs. I've speculated a bit about how to build something like this but it's still definitely an open question.
2. Interpretable-world-model as AI: Kinda related to the first. Imagine you take an AGI that deeply understands the world, you extract its world-model, and you have a way to browse it, like the world-model is somehow 100% super-easily interpretable. What causes Alzheimer's? Well, you would go to the Alzheimer's entry of the world-model, and you'll find a beautiful way of thinking about Alzheimer's in terms of these other three concepts, which in turn refer to other concepts etc. What would happen if we started a political movement against squirrels? Well, through the world-model interface, we can throw that hypothetical scenario at other entities in the world-model (people, journalists, politicians) and see what the predicted effects are. My intuition here is: (1) It's nice to have a map when you're traveling, (2) It's nice to have Wikipedia when you're learning, (3) It would be nice to have a crystal ball when you're planning ... Maybe there's some way to build a system that combines all those things and more, but is still fundamentally tool-ish? My impression is that something like this is at the core of the Kurzweil-ish vision of how brain-computer interfaces are going to solve the problem of AGI safety (see also waitbutwhy on Neuralink). (Needless to say, it's possible to try to implement this vision without brain-computer interfaces, and vice-versa.)

[-]Adrià Garriga-alonso7y40

I usually think that logic-based reasoning systems are the canonical example of an AI without goal-directed behaviour. They just try to prove or disprove a statement, given a database of atoms and relationships. (Usually they're restricted to statements that are decidable by construction, so that this is always possible.)

You can also frame their behaviour as a utility function: U(time, state) = 1 if you have correctly decided the statement at t ≤ time, 0 otherwise. But your statement that

>It seems possible to build systems in such a way that these properties are inherent in the way that they reason, such that it’s not even coherent to ask what happens if we “get the utility function slightly wrong”.

very much applies. I'm fairly sure you can specify the behaviour of _anything_, including "dumb" things like trousers, screwdrivers, rocks and saucepans, as a utility function + perfect optimization, even though for most things this is a very unhelpful way of thinking. Or at least human artifacts. E.g. a screwdriver optimizes "transmit the rotational force that is applied to you", a rock optimizes "keep these molecules bound and respond to forces according to the laws of physics".
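As a rough illustration of the logic-based system and its utility-function framing, here is a minimal sketch. It assumes propositional Horn rules and a closed-world reading (a statement that cannot be derived counts as false), so every query is decidable by construction. The names `forward_chain` and `utility`, and the example facts, are hypothetical.

```python
def forward_chain(facts, rules, query):
    """Decide `query` by forward chaining over propositional Horn rules.

    facts: iterable of atoms known to be true.
    rules: list of (premises, conclusion) pairs.
    Returns (entailed, steps), where steps counts rule checks performed.
    """
    known = set(facts)
    steps = 0
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            steps += 1
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return query in known, steps


def utility(time, decided_at):
    """U = 1 if the statement was decided at some t <= time, 0 otherwise."""
    return 1 if decided_at <= time else 0


facts = {"socrates_is_human"}
rules = [(("socrates_is_human",), "socrates_is_mortal")]
entailed, steps = forward_chain(facts, rules, "socrates_is_mortal")
print(entailed, steps, utility(time=10, decided_at=steps))
```

The prover contains no representation of `utility` at all. The function only describes its behaviour from the outside, which is the sense in which the framing is available but not very informative.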

[-]Rohin Shah7y30

>I usually think that logic-based reasoning systems are the canonical example of an AI without goal-directed behaviour.

Yeah, that seems right to me. Though it's not clear how you'd use a logic-based reasoning system to act in the world -- if you do that by asking the question "what action would lead to the maximum value of this function", which it then computes using logic-based reasoning, then the resulting behavior would be goal-directed.

>I'm fairly sure you can specify the behaviour of _anything_

Yup. I actually made this argument two posts ago.

[-]Adrià Garriga-alonso7y30

>Yup. I actually made this argument two posts ago.

Ah, that's good. I should probably read the rest of the sequence too.

>Though it's not clear how you'd use a logic-based reasoning system to act in the world

The easy way to use them would be as they are intended: oracles that will answer questions about factual statements. Humans would still do the questioning and implementing here. It's unclear how exactly you'd ask really complicated, natural-language-based questions (obviously, otherwise we'd have solved AI), but I think it serves as an example of the paradigm.

[-]avturchin7y10

Maybe this article is related to the topic:

"A plurality of values" https://www.academia.edu/173502/A_plurality_of_values

"Abstract: Many maximizing normative theories are monistic in resting upon one core value. But such theories generate highly counter-intuitive implications. This is especially clear in the case of hedonistic utilitarianism. But an analysis of why we find those implications counter-intuitive implies that we ought to subscribe to a plurality of values. For example, the Repugnant Conclusion implies that we should value a high level of average happiness, while the Problem of the Ecstatic Psychopath implies that we should value either a large quantity of total happiness or a large number of worthwhile lives. The problems posed by pleasure-wizards, on the other hand, imply that we should include a non-utilitarian value: namely, equality. And only when such values are kept in play simultaneously can the Repugnant Conclusion, the Problem of the Ecstatic Psychopath and the problems posed by pleasure-wizards all be avoided, thereby demonstrating the superiority of pluralist over monistic normative theories."

The interesting part starts on page 10, after the quote: "Brian Barry argues in his early work that we could model trade-offs between principles such as equity and efficiency in a manner that parallels the way in which micro-economists employ indifference-curves to model how we might swap grapes for potatoes".
