Goal retention discussion with Eliezer

[-]Wei Dai11y410

Max, as you can see from Eliezer's reply, MIRI people (and other FAI proponents) are largely already aware of the problems you brought up in your paper. (Personally I think they are still underestimating the difficulty of solving those problems. For example, Peter de Blanc and Eliezer both suggest that humans can already solve ontological crises, implying that the problem is merely one of understanding how we do so. However I think humans actually do not already have such an ability, at least not in a general form that would be suitable for implementing in a Friendly AI, so this is really a hard philosophical problem rather than just one of reverse engineering.)

Also, you may have misunderstood why Nick Bostrom talks about "goal retention" in his book. I think it's not meant to be an argument in favor of building FAI (as you suggest in the paper), but rather an argument for AIs being dangerous in general, since they will resist attempts to change their goals by humans if we realize that we built AIs with the wrong final goals.

[-]Max Tegmark11y150

Thanks Wei for these interesting comments. Whether humans can "solve" ontological crises clearly depends on one's definition of "solve". Although there's arguably a clear best solution for de Blanc's corridor example, it's far from clear that there is any behavior that deserves being called a "solution" if the ontological update causes the entire worldview of the rational agent to crumble, revealing the goal to have been fundamentally confused and undefined beyond repair. That's what I was getting at with my souls example.

As to what Nick's views are, I plan to ask him about this when I see him tomorrow.

[-]torekp11y20

In the link you suggest that ontological crises might lead to nihilism, but I think a much more likely prospect is that they lead to relativism, with respect to the original utility function. That is, there are solutions to the re-interpretation problem, which, for example, allow us to talk of "myself" and "others" despite the underlying particle physics. But there is more than one such solution, and none of them is forced. Thus the original "utility function" fails to be such, strictly speaking. It does not really specify which actions are preferred. It only does so modulo a choice of interpretation.
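To make "modulo a choice of interpretation" concrete, here is a toy Python sketch. Everything in it is a hypothetical illustration, not something taken from the paper or this thread: the `utility` function, the two interpretation functions, and the `actions` table are all invented for the example. It shows a single utility function, written in terms of "persons", ranking the same two actions differently depending on which mapping from the new ontology back to "persons" is chosen.

```python
# Old ontology: the utility function only knows about "persons".
def utility(persons_helped):
    return float(persons_helped)


# New ontology: outcomes are described as particle clusters.
# Two equally defensible (hypothetical) ways to recover "persons" from them.
def count_by_body(clusters):
    """Interpretation 1: a person is any cluster that forms a body."""
    return sum(1 for c in clusters if c["body"])


def count_by_mind(clusters):
    """Interpretation 2: a person is any mind a cluster hosts."""
    return sum(c["minds"] for c in clusters)


# Outcomes of two actions, described only in the new ontology.
actions = {
    "A": [{"body": True, "minds": 1}, {"body": True, "minds": 1}],
    "B": [{"body": False, "minds": 3}],
}


def best_action(interpret):
    """Pick the action with the highest utility under one interpretation."""
    return max(actions, key=lambda name: utility(interpret(actions[name])))
```

Under `count_by_body` action A comes out ahead, while under `count_by_mind` action B does, so in this toy case the original function fixes a preference only once an interpretation has been picked.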

So, all we need to do is figure out each of the possible ways physics might develop, and map out utility functions in terms of that possible physics! Or, we could admit that talk of utility functions needs to be recognized as neither descriptive nor truly normative, but rather as a crude, mathematically simplified, approximation to human values. (Which may be congruent to your conclusions - I just arrive by a different route.)

[-]cousin_it11y280

It's very nice to see you on LW! I think both your essay and Eliezer's comments are very on point.

There are non-obvious ways to define a utility function for an AI. For example, you could "pass the buck" by giving the AI a mathematical description of a human upload, and telling it to maximize the value of the function that the upload would define, given enough time and resources to think. That's Paul Christiano's indirect normativity proposal. I think it fails for subtle reasons, but there might be other ways of defining what humans want by looking at the computational content of human brains and extrapolating it somehow (CEV), while keeping a guarantee that the extrapolation will talk about whatever world we actually live in. Basically it's a huge research problem.

[-]Max Tegmark11y190

Thanks Eliezer for your encouraging words and for all these interesting comments! I agree with your points, and we clearly agree on the bottom line as well: 1) Building FAI is hard and we’re far from there yet. Sorting out “final goal” issues is part of the challenge. 2) It’s therefore important to further research these questions now, before it’s too late. :-)

[-]So8res11y170

Hey Max, this is Nate. Thanks for posting this publicly! Are there any particular points that still seem confusing or wrong, and/or concerns that you don't feel have been addressed?

[-]Matthew_Opitz11y40

Okay, wow, I don't know if I quite understand any of this, but this part caught my attention:

The Omohundrian/Yudkowskian argument is not that we can take an arbitrary stupid young AI and it will be smart enough to self-modify in a way that preserves its values, but rather that most AIs that don't self-destruct will eventually end up at a stable fixed-point of coherent consequentialist values. This could easily involve a step where, e.g., an AI that started out with a neural-style delta-rule policy-reinforcement learning algorithm, or an AI that started out as a big soup of self-modifying heuristics, is "taken over" by whatever part of the AI first learns to do consequentialist reasoning about code.

I have sometimes wondered whether the best way to teach an AI a human's utility function would not be to program it into the AI directly (because that will require that we figure out what we really want in a really precisely-defined way, which seems like a gargantuan task), but rather, perhaps the best way would be to "raise" the AI like a kid at a stage where the AI would have minimal and restricted ways of interacting with human society (to minimize harm...much like a toddler thankfully does not have the muscles of Arnold Schwarzenegger to use during its temper tantrums), and where we would then "reward" or "punish" the AI for seeming to demonstrate better or worse understanding of our utility function.

It always seemed to me that this strategy had the fatal flaw that we would not be able to tell if the AI was really already superintelligent and was just playing dumb and telling us what we wanted to hear so that we would let it loose, or if the AI really was just learning.

In addition to that fatal flaw, it seems to me that the above quote suggests another fatal flaw to the "raising an AI" strategy—that there would be a limited time window in which the AI's utility function would still be malleable. It would appear that, as soon as part of the AI figures out how to do consequentialist reasoning about code, then its "critical period" in which we could still mould its utility function would be over. Is this the right way of thinking about this, or is this line of thought waaaay too amateurish?

[-]Adele Lopez11y80

It always seemed to me that this strategy had the fatal flaw that we would not be able to tell if the AI was really already superintelligent and was just playing dumb and telling us what we wanted to hear so that we would let it loose, or if the AI really was just learning.

In addition to that fatal flaw, it seems to me that the above quote suggests another fatal flaw to the "raising an AI" strategy—that there would be a limited time window in which the AI's utility function would still be malleable. It would appear that, as soon as part of the AI figures out how to do consequentialist reasoning about code, then its "critical period" in which we could still mould its utility function would be over. Is this the right way of thinking about this, or is this line of thought waaaay too amateurish?

This problem is essentially what MIRI has been calling corrigibility. A corrigible AI is one that understands and accepts that it or its utility function is not yet complete.

[-]lfghjkl11y60

Very relevant article from the sequences: Detached Lever Fallacy.

Not saying you're committing this fallacy, but it does explain some of the bigger problems with "raising an AI like a child" that you might not have thought of.

[-]Paul Crowley11y40

I completely made this mistake right up until the point I read that article.

[-]TheAncientGeek11y-10

Hardly dispositive. A utility function that says "learn and care what your parents care about" looks relatively simple on paper. And we know the minimum intelligence required is that of a human toddler.

[-]VAuroch11y10

A utility function that says "learn and care what your parents care about" looks relatively simple on paper.

Citation needed. That sounds extremely complex to specify.

[-]TheAncientGeek11y-10

relatively

[-]VAuroch11y10

I don't think "learn and care about what your parents care about" is noticeably simpler than abstractly trying to determine what an arbitrary person cares about or CEV.

[-][anonymous]11y-20

It always seemed to me that this strategy had the fatal flaw that we would not be able to tell if the AI was really already superintelligent and was just playing dumb and telling us what we wanted to hear so that we would let it loose, or if the AI really was just learning.

You could, you know, look inside the machine and see what makes it tick. It's not a black box.

[-]Paul Crowley11y40

That seems desirable and perhaps possible, but extremely difficult, especially when you have a superintelligent mind anticipating that you'll do it and trying to work out how to ensure you come away with the wrong impression.

[-]Lumifer11y10

You could, you know, look inside the machine and see what makes it tick.

After a certain level of complexity you can look but you wouldn't understand what you are seeing.

[-][anonymous]11y20

Why? That's not a necessary logical consequence. These aren't (or don't have to be) chaotic systems, so there is no reason that scaling up the size of the computation results in an unauditable mess. The techniques used depend very much on the AGI architecture, but there are designs which allow for tracing thought patterns and answering questions about its operation in ways which are computationally tractable.

[-]Lumifer11y10

Why?

Because there are limits on complexity that a human brain can process. There are tools and techniques which help, sure, but at some point you will hit the limit anyway.

there is no reason that scaling up the size of the computation results in an unauditable mess.

I am pretty sure there is. Imagine a few million threads running different code asynchronously on distributed hardware, all interacting. And if the scenario involves an AI which bootstrapped itself through self-modification, your chances of understanding what's happening by looking at pieces of code are... very low.

[-][anonymous]11y20

Sorry, some clarification is necessary. I didn't mean to imply that a human being would manually inspect the AI and understand its workings, but rather that humans assisted by non-recursively self-improving technology (e.g. theorem provers, probabilistic code checkers, sandbox environments etc.) check during execution or afterwards that the code behaves in respect of whatever properties the human cares about.

For example, the input/output channels which handle a combination of long-term memory, self-modifications, and actions may require that complete computational traces be provided. Now obviously the traces must be pruned in order to keep the auditing workload manageable, but using cryptographic techniques this can be done such that the AI cannot know ahead of time which branches will be checked, and therefore cannot cheat by creating fraudulent traces while escaping probabilistic or energetic detection.
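One way to read the pruned-trace idea is as a commit-then-challenge spot check. The Python sketch below is only an illustration under stated assumptions, not a design from this thread: the per-step SHA-256 commitment, the `replay` rule (each step just adds `delta` to `before`), and every function name are hypothetical stand-ins. The prover first publishes a digest for every trace step, and only afterwards does the auditor draw a random subset of steps to reveal and re-execute.

```python
import hashlib
import json
import secrets


def step_digest(step):
    """Hash one trace step using a canonical JSON encoding."""
    encoded = json.dumps(step, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def commit(trace):
    """Prover side: publish one digest per step before the audit is chosen."""
    return [step_digest(step) for step in trace]


def replay(step):
    """Toy validity rule (an assumption): a step adds `delta` to `before`."""
    return step["before"] + step["delta"] == step["after"]


def audit(commitment, reveal, sample_size):
    """Auditor side: check randomly chosen steps against the commitment.

    `reveal(i)` is a callback returning the prover's claimed step i.
    Indices are drawn from the OS random source after the commitment
    exists, so the prover cannot tailor the trace to the audit.
    """
    rng = secrets.SystemRandom()
    for i in rng.sample(range(len(commitment)), sample_size):
        step = reveal(i)
        if step_digest(step) != commitment[i]:
            return False  # revealed step differs from what was committed
        if not replay(step):
            return False  # committed step does not re-execute correctly
    return True
```

If a fraction f of the committed steps is invalid and k distinct steps are audited, the chance that every audited step happens to be valid is at most (1 - f)^k, so the workload can be traded against detection probability. Note that an unsalted digest like this one binds the prover to the trace but does not hide low-entropy steps.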

The AI itself is executing on a framework that includes Bayesian probability graphs as its fundamental unit of computation, like a list in Lisp. The auditor's job then is to make sure that the updates or actions make sense in terms of the data provided, and that choices were made in alignment with its claimed utility function (remember the AI can't make up data without that fraud being energetically detectable).

The above basically acts as a lie detector to the question "what's your highest-level utility function?" and followup questions like "give examples of why this clause of the utility function is necessary".

Actually, maybe that's a simpler way of framing the issue: assume that you have access to a lie detector. I posit that an accurate lie detector is sufficient to create boxable AI, or maybe even enable some procedural processes for the creation of true friendly AI (e.g. some extension of indirect normativity). The above is basically a technical outline of how one would create such a lie detector for certain AGI architectures.

[-]Lumifer11y10

humans assisted by non-recursively self-improving technology (e.g. theorem provers, probabilistic code checkers, sandbox environments etc.) check during execution or afterwards that the code behaves in respect of whatever properties the human cares about.

I reiterate my point: regardless of tools used, you still have a human brain at the end of the line and that brain's complexity limit is not high.

complete computational traces be provided

Going back to my example, what does a "complete computational trace" mean in the context of a few million threads running different code asynchronously on distributed hardware, all interacting?

Bayesian probability graphs as its fundamental unit of computation, like a list in Lisp

A list in Lisp is just a data structure. What is a Bayesian probability graph as a unit of computation? I don't even understand what it means.

assume that you have access to a lie detector

Sure. Give that lie detector to someone with an IQ of 60 and see how well they together will figure out something that a guy with an IQ of 150 wants to keep hidden from them, say, his true intentions and capabilities.

[-]TheAncientGeek11y10

I reiterate my point: regardless of tools used, you still have a human brain at the end of the line and that brain's complexity limit is not high.

A human brain is at the end of all the alternative strategies as well.

[-][anonymous]11y00

Just as an example of something a human couldn't understand that a sufficiently smart computer might - writing code directly to binary, without the intermediate step of a programming language.

[This comment is no longer endorsed by its author]

[-]Lumifer11y30

That would be read as decompiled assembler which humans can understand, though not in large quantities.

[-][anonymous]11y10

Interesting. Consider me corrected.

[-]Baughn11y10

For anything nontrivial, we need software support to do that—and it still won't work very well. You might not be absolutely correct, but you're close.

IDA is a wonderful piece of software, though. A heck of a lot better than working manually.
