Daniel Dewey, 'Learning What to Value'

Abstract: I.J. Good's theory of an "intelligence explosion" predicts that ultraintelligent agents will undergo a process of repeated self-improvement. In the wake of such an event, how well our values are fulfilled will depend on whether these ultraintelligent agents continue to act desirably and as intended. We examine several design approaches, based on AIXI, that could be used to create ultraintelligent agents. In each case, we analyze the design conditions required for a successful, well-behaved ultraintelligent agent to be created. Our main contribution is an examination of value-learners, agents that learn a utility function from experience. We conclude that the design conditions on value-learners are in some ways less demanding than those on other design approaches.


Goodness: an attempt at doing something useful! My first impressions:

The definition of an O-maximizer is pretty central, and it is presented in a needlessly confusing way, IMO.

I did something remarkably similar a while back, with my agent cybernetic diagrams and my wirehead analysis.

As far as I can see, with O-maximizers Daniel Dewey is doing the exact same thing that I did.

Daniel's "value learners" goes beyond my own work. I agree that learning an observation utility function has some benefits - though also some problems associated with going back to something more like a R-L agent.

I am not sure that the "O-maximizer" terminology is desirable. There is too much overlap with expected utility maximiser. Also, the stress on observation appears to me to be unwarranted. The utility function is (or should be permitted to be) a function of the agent's expected external sensory inputs and its own internal state. Using "observation" emphasizes the former - and suggests that there might be some way of maximising the unobserved - which there isn't.

A few brief words about whether all this is positive or negative - I think being able to manually choose a utility function is pretty good, and it was not quite so obvious that we were going to be able to do that before. I also think the "forecasting first" scenario implied by this kind of architecture is positive (wisdom before action). A more modular intelligence allows a divide-and-conquer strategy - which makes machine intelligence look closer. That has some plus points (possible slower takeoff), but also some negative ones (less time to prepare).

Anyway: cool! It is interesting to see some SingInst folks arriving on practically the exact same page as me. I've been looking at these kinds of architectures for quite a while now - and have a whole web site all about them.

Goodness: an attempt at doing something useful!

Thanks :)

And thanks for pointing out your site on architectures-- I'll have to take a look at that.

Apparently the paper has been submitted to: http://agi-conf.org/2011/

Unless I am mistaken, the first actual SIAI paper/presentation at an AGI conference - unless you count Roko. So much for my "Ben's conference" hypothesis.

It's been submitted, but I haven't gotten any word on whether it's accepted yet.

EDIT: Accepted!

How was it received at the conference?

Well, overall.

I think most people understood the basic argument: powerful reinforcement learners would behave badly, and we need to look for other frameworks. Pushing that idea was my biggest goal at the conference. I didn't get much further than that with most people I talked to.

Unfortunately, almost nobody seemed convinced that it was an urgent issue, or one that could be solved, so I don't expect many people to start working on FAI because of me. Hopefully repeated exposure to SI's ideas will convince people gradually.

Common responses I got when I failed to convince someone included:

  • "I don't care what AGIs do, I just want to solve the riddle of intelligence."
  • "Why would you want to control an AGI, instead of letting it do what it wants to?"
  • "Our system has many different reward signals, not just one. It has hunger, boredom, loneliness, etc."

Common responses I got when I failed to convince someone included:

Ouch.

But keep up the good work regardless! Hopefully we'll still convince them.

Thanks!

This paper has been updated. New, shorter version here.

This paper does a nice job of formalizing some matters relevant to FAI.

In the AIXI interaction model, the reward input is exogenous, and so there was a gap to fill in. In Appendix B, this paper points out the danger with reward functions -- that the agent will hijack the rewarder.

Yet with an internally calculated utility function over inputs (observations), the danger is that the agent will hijack the input channel - placing, as it were, a video screen in front of its cameras to show itself whatever maximizes the function. (This is not wireheading, because it is not the direct tweaking of a utility register.)

How can we address this problem?

If we are not going to include problems with counterfeiting utility as being part of The Wirehead Problem, then I propose the term The Pornography Problem be used to refer to them.

I think the umbrella category which includes both types of problem is the main one, though. Unless a better term can be found, The Wirehead Problem seems pretty appropriate as an umbrella term. Counterfeiting utility is very close to direct self-stimulation.

The classical proposed way of dealing with this is to make sure the agent has some knowledge and understanding of what its goals actually are. The best way of doing that is an open problem.

Another good thing about this paper is that it claims to prove that a utility maximiser can mimic any computable agent. That is an idea I have been banging on about for years, now, when people claim the utility maximiser framework sucks, or that it can't describe humans, or whatever. The proof looks essentially the same as the one I gave.

Unless there are mistakes, this looks like a useful place to refer doubters to in the future.

Disappointingly, this paper is still pre-UDT thinking. An AIXI-like agent doesn't understand that it lives within the universe that it's trying to affect, so it can unwittingly destroy its own hardware with its mining claws (to borrow a phrase from Tim Tyler).

This statement bugs me because I don't see that any solution to that problem has been developed within LW's avant-garde decision theories. In fact, they often introduce self-referential statements, so this problem should in a way be more pressing for them. The "AIXI-like agent" just doesn't have a self-representation; but a LWDT (LessWrong decision theory) agent, insofar as its decision theory revolves around self-referential propositions, does need a capacity for self-representation, and yet I don't remember this problem being discussed very much. It's more as if the problem has been overlooked because so much of the theory is discussed in natural language, so the capacity for self-referential semantics that natural-language statements provide has been taken for granted.

There is actually a decades-old body of work in computer science on self-representation, for example under the name of "computational reflection"; it was the subject of the thesis of Pattie Maes at MIT. But there's no magic bootstrap here, whereby ordinary reference turns into self-reference. It's just a matter of coding up, by hand, structures that happen to represent aspects of themselves, and then giving them causal connections so that those representations co-vary appropriately with the things they represent. This would appear to be an adequate solution to the problem, and I don't see that any alternative solutions have been invented locally.

A UDT agent does not need to be explicitly told how to represent itself, besides knowing its own code.

This is because a UDT agent does not make an attempt to explicitly control a particular instantiation of itself. A UDT agent looks at the world and tries to control the entire thing by controlling logical facts about its own behavior. If the world happens to contain patterns similar to the agent, then the agent will recognize that controlling its own output will control the behavior of those patterns. The agent could also infer that by destroying those patterns, it will lose its ability to control the world.

I think this is a nice idea, and it does deal with (at least this particular) problem with self-representation.

The bigger issue is that (as far as I know) no one has yet found a precisely specified version of ADT/UDT/TDT which satisfies our intuitions about what TDT should do.

Right.

To be more precise about the current state of the art: we don't know any algorithm that can maximize its utility in a UDT-ish setting, but we do know algorithms that can hit a specified utility value, or ensure that utility is no less than some predefined value, if the underlying decision problem allows that at all. (Paul, I think you have already figured this out independently, right?)

That sounds pretty wild. Do you think it would help any with the wirehead problem?

Yes, I think it resolves it completely and this is part of what makes it interesting.

An ADT agent cares about some utility function which is independent of its experiences; for example, the number of paperclips that actually exist (viewing the universe as a mathematically well-defined, but uncertain, object).

If so, and if it could be made at all practical, I think that would be a major breakthrough. The current stories about wirehead-avoidance are not terribly convincing, IMO. Which is not to say that there's not a solution - just that we do not yet really know how to implement one.

An ADT agent cares about some utility function which is independent of its experiences

That is kind-of impossible, though. All our knowledge of the world necessarily comes to us through our senses.

I had a brief look at it again. It seems very expensive. When making a decision, it is painful to start by integrating over all possible copies of agents who might be "like you". In short, it doesn't look remotely like what is most likely to come first.

Update 2011-06-28. OK, I finally figured out what you were talking about above - and it turns out that I don't agree with it at all. The "LessWrong"-style decision theories that I am aware of so far don't have any impact on the wirehead problem at all - as far as I can see.

All our knowledge of the world necessarily comes to us through our senses.

Yes, but an agent can understand that its fixed utility function, which refers to the state of the entire universe, is not maximized by allowing itself to be deceived.

Well, possibly. I certainly have an idea about what "the state of the universe" refers to aside from my sensory perceptions of it. What we need math for is to see whether it is possible to build an agent whose belief that it is maximising such a quantity survives extensive self-knowledge about its own operation. Without supporting math, we don't have much more than a story.

What we need math for is to see whether it is possible to build an agent whose belief that it is maximising such a quantity survives extensive self-knowledge about its own operation.

Well, I am an example of an agent who does not want to wirehead, for the reasons explained in the posts I linked to. I have some self-knowledge about my own operation, though not nearly as much as I would like (I don't know how to program a computer to be me), but I doubt that more self-knowledge, barring valley effects, would do anything other than increase my ability to avoid wireheading.

Actually, our current concept of UDT should handle this problem automatically, at least in theory. I'll try to explain how it works.

First, assume that the world is a computer program with known source code W. (The general case is a prior distribution over possible world-programs, the solution will generalize to that case easily.) Further imagine that the agent is also a computer program that knows its own source code A. The way the agent works is by investigating the logical consequences of its decisions; that is, it tries to find plausible mathematical statements of the form "A() == a logically implies W() == w" for different values of a and w.

One way of finding such statements is by inspecting the source code of W and noticing that there's a copy of A (or its logical equivalent) embedded somewhere within it, and the return value of that embedded copy can be used to compute the return value of W itself. Note that this happens "implicitly", we don't need to tell the agent "where" it is within the world, it just needs to search for mathematical statements of the specified form. Also note that if W contains multiple logically equivalent copies of A (e.g. if they're playing a symmetric PD, or someone somewhere is running a predictor simulation of A, etc.), then the approach handles that automatically too.
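Here is a toy rendering of that search, in case it helps - my own simplification, not code from the paper or from any UDT write-up. The `implies` callback stands in for the hard part (a proof search, or "mathematical intuition", over statements about the programs A and W), and `utility` for the agent's preferences over world-outcomes; both are assumed rather than implemented:

```python
def udt_choose(possible_actions, possible_outcomes, implies, utility):
    """Pick the action a for which some provable statement
    'A() == a logically implies W() == w' yields the highest utility(w)."""
    best_action, best_value = None, float("-inf")
    for a in possible_actions:
        for w in possible_outcomes:
            # The agent is never told "where" it sits inside W; it only looks
            # for logical consequences of its own return value.
            if implies(f"A() == {a!r}", f"W() == {w!r}") and utility(w) > best_value:
                best_action, best_value = a, utility(w)
    return best_action
```

In the general case, `possible_outcomes` would be replaced by a prior over world-programs, and the brute-force check by something cleverer.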

See Wei Dai's original UDT post for another explanation. I've made many posts along these lines too, for example, this one describes "embodied" thingies that can dismantle their own hardware for spare parts and still achieve their values.


It sounds pretty wild. Do you think it would help any with the wirehead problem?


Yeah, it solves it.

It's probably not even our problem. ISTM that we could easily get to beyond-human level using agents that have walled-off brains and can't self-modify or hack into themselves.

You can normally stop such an agent from bashing its own brains in with a bit of operant conditioning.


An AIXI-like agent doesn't understand that it lives within the universe that it's trying to affect, so it can unwittingly destroy its own hardware with its mining claws (to borrow a phrase from Tim Tyler).

It looks like the newer version of the paper tries to deal with this explicitly, by introducing the concept of "agent implementation". But I can't verify whether the solution actually works, since the probability function P is left undefined.

I think the paper suffers from what may be a common failure mode in AI design: problems with the overall approach/framework can always be hidden with additional degrees of freedom "to be filled in later". (The paper actually introduces two unspecified probability functions.) In a way UDT does the same thing: the only version that doesn't have obvious problems uses an unspecified "math intuition module".

Of course I still think UDT is the better approach, but it seems hard to make an argument beyond "it just makes more intuitive sense". Well, I'm glad I'm a hobbyist who can work on whatever I want and not have to justify myself to a funding committee. :)


I really have to read EY's UDT paper. Can you tell me what math prerequisites I need to understand UDT?

Have you read the links from http://wiki.lesswrong.com/wiki/Updateless_decision_theory and http://wiki.lesswrong.com/wiki/ADT ? They're fairly short and cover most of what is known.

Eliezer's theory is TDT (Timeless Decision Theory). I believe Wei Dai is the main creator of UDT (Updateless Decision Theory).

To understand TDT you probably need to understand causal graphs, and you need to be in general comfortable with reasoning by abstraction. Apart from that, it's not particularly advanced.

I requested feedback about this paper here.

One of my conclusions was that you could, in theory, train a Solomonoff Induction-based reinforcement learning agent to produce arbitrary finite sequences of actions (non-self-destructive ones anyway) in response to specified sets of finite sense data - assuming you are allowed to program its reward function and give it fake memories dating back from before it was born.

This is essentially the same result as is claimed for O-Maximisers in the paper. This undermines the thesis that O-Maximisers somehow exhibit different dynamics from reinforcement learning agents.

Update on 2011-04-30: Bill Hibbard makes an almost identical point to the observations I made in this comment. You can see it in his post - on the AGI mailing list - here.

Thanks for posting this around! It's great to see it creating discussion.

I'm working on replies to the points you, Bill Hibbard, and Curt Welch have made. It looks like I have some explaining to do if I want to convince you that O-maximizers aren't a subset of reward maximizers-- in particular, that my argument in appendix B doesn't apply to O-maximizers.

It looks like I have some explaining to do if I want to convince you that O-maximizers aren't a subset of reward maximizers-- in particular, that my argument in appendix B doesn't apply to O-maximizers.

To recap, my position is that both expected reward maximisers and expected utility maximisers are universal learners - and so can perform practically any series of non-self-destructive actions in a configurable manner in response to inputs. So, I don't think either system necessarily exhibits the "characteristic behaviour" you describe.

Response to Curt Welch:

Sadly, what he seems to have failed to realize is that any actual implementation of an O-Maximizer or his value-learners must also be a reward maximizer. Is he really that stupid so as not to understand they are all reward maximizers?

Zing! I guess he didn't think I was going to be reading that. To be fair, it may seem to him that I've made a stupid error, thinking that O-maximizers behave differently than reward maximizers. I'll try to explain why he's mistaken.

A reward maximizer acts so as to bring about universes in which the rewards it receives are maximized. For this reason, it will predict and may manipulate the future actions of its rewarder.

An O-maximizer with utility function U acts so as to bring about universes which score highly according to U. For this reason, it is quite unlikely to manipulate or alter its utility function, unless its utility function directly values universes in which it self-alters.

In particular, note that an O-maximizer does not act so as to bring about universes in which the utility it assigns to the universe is maximized. Where the reward maximizer predicts and "cares about" what the rewarder will say tomorrow, an O-maximizer uses its current utility function to evaluate futures and choose actions.

O-maximizers and reward maximizers have different relationships with their "motivators" (utility function vs. rewarder), and they behave differently when given the option to alter their motivators. It seems clear to me that they are distinct.
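To make the contrast concrete, here is a rough sketch of the two decision rules - my own toy formulation, not the equations from the paper. `predict_futures(a)` is an assumed model that returns (future, probability) pairs for an action a:

```python
def reward_maximizer_choice(actions, predict_futures, predicted_reward):
    # Scores a future by the reward signal the rewarder is predicted to emit
    # in it - so futures in which the rewarder has been manipulated into
    # emitting high rewards score well.
    return max(actions,
               key=lambda a: sum(p * predicted_reward(f)
                                 for f, p in predict_futures(a)))

def o_maximizer_choice(actions, predict_futures, U):
    # Scores a future with the agent's *current* utility function U. A future
    # in which U has been overwritten is still judged by the current U, so
    # self-alteration only looks attractive if the current U happens to value it.
    return max(actions,
               key=lambda a: sum(p * U(f) for f, p in predict_futures(a)))
```

The two loops look alike, which I suspect is why they are so easily conflated; the difference is where the scored quantity comes from - the rewarder's predicted output in one case, a fixed function the agent applies itself in the other.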

The only difference is in the algorithm it uses to calculate the "expected value". Does he not understand that if you build a machine to do this, there must be hardware in the machine that calculates that expected value? And that such a machine can then be seen as two machines, one which is calculating the expected value, and the other which is picking actions to maximize the output of that calculation? And once you have that machine, his argument of appendix B once again applies?

Actually trying to apply the argument in Appendix B to an O-maximizer, implemented or in the abstract, using the definitions given in the paper instead of reasoning by analogy, is sufficient to show that this is also incorrect.

An agent of unbounded intelligence will always reach a point of understanding that he has the option to try to modify the reward function, which means the wirehead problem is always on the table.

It may have the option, but will it be motivated to alter its "reward function"? Consider an O-maximizer with utility function U. It acts to maximize the universe's utility as measured by U. How would the agent's alteration of its own utility function bring about universes that score highly according to U?

OK, some responses from me:

A reward maximizer acts so as to bring about universes in which the rewards it receives are maximized. For this reason, it will predict and may manipulate the future actions of its rewarder.

An O-maximizer with utility function U acts so as to bring about universes which score highly according to U. For this reason, it is quite unlikely to manipulate or alter its utility function

The more obvious problem for utility maximisers is fake utility.

Actually trying to apply the argument in Appendix B to an O-maximizer [...] is sufficient to show that this is also incorrect.

My position here is a bit different from Curt's. Curt will argue that both systems are likely to wirehead (and I don't necessarily disagree - the set-up in the paper is not sufficient to prevent wireheading, IMO). My angle is more that both types of systems can be made into universal agents - producing arbitrary finite action sequences in response to whatever inputs you like.

The more obvious problem for utility maximisers is fake utility.

...but your characterisation of the behaviour of reward maximisers and utility maximisers seems rather like a projection to me. IMO, actual behaviour will depend on what the systems believe their purpose is when they come to adjusting their brains. Since they both lack knowledge of the design purpose of their own goal systems, ISTM that the outcome could potentially vary. Maybe they will wirehead, maybe they won't.

Ah, I see. Thanks for taking the time to discuss this-- you've raised some helpful points about how my argument will need to be strengthened ("universal action" is good food for thought) and clarified (clearly, my account of wireheading is unconvincing).

The paper's been accepted, and I have a ton of editing to do (need to cut four pages!), so I may not be very quick to respond for the time being. I didn't want to disappear without warning, and without saying thanks for your time!

OK. I am skeptical that the wirehead problem can be solved simply by invoking expected utility maximisation. IMO, there are at least two problems that go beyond that:

  • How do you tell the system to maximise (say) temperature - and not some kind of proxy or perception of temperature?

  • How do you construct a practical inductive inference engine without using reinforcement learning?

FWIW, my current position is that this probably isn't our problem. The wirehead problem doesn't become serious until relatively late on - leaving plenty of scope for transforming the world into a smarter place in the mean time.

Response to Bill Hibbard:

It seems to me that every O-maximizer can be expressed as a reward maximizer. Specifically, comparing equations (2) and (3), given an O-maximizer we can define reward r sub(m) (by this notation I mean "r subscript m") as:

r sub(m) = SUM(r in R) U(r)P(r|yx sub(<=m))

and r sub(i) = 0 for i<m, where the paper sets m to the final time step, following Nick Hay. The reward maximizer so defined will behave identically with the O-maximizer.

In the reward-maximization framework, rewards are part of observations and come from the environment. You cannot define "r sub(m)" to be equal to something mathematically, then call the result a reward-maximizer; therefore, Hibbard's formulation of an O-maximizer as a reward-maximizer doesn't work.
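A small sketch of what I mean, in my own toy framing rather than the paper's notation:

```python
# In the reward-maximization framework, the reward is part of the observation
# the environment sends; the agent does not get to compute it for itself.
def reward_from_environment(observation):
    return observation["reward"]  # decided by the environment/rewarder

# The proposed r_m is instead a quantity the agent computes internally from
# its utility function U and its predictions P (transcribing the quoted
# formula directly):
def r_m(U, P, R, history):
    return sum(U(r) * P(r, history) for r in R)
```

An agent maximizing the second quantity is maximizing something it computes from U and P, not something the environment supplies as a reward.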

If this is correct, doesn't the "characteristic behavior pattern" shown for reward maximizers in Appendix B, as stated in Section 3.1, also apply to O-maximizers?

Since the construction was incorrect, this argument does not hold.

My way of putting much the same idea was:

you could, in theory, train a Solomonoff Induction-based reinforcement learning agent to produce arbitrary finite sequences of actions (non-self-destructive ones anyway) in response to specified sets of finite sense data - assuming you are allowed to program its reward function and give it fake memories dating back from before it was born.

Basically Solomonoff Induction is a powerful learning mechanism, and with sufficient time and test cases, you could configure an agent based on it to behave in an arbitrary way[*] in response to any finite sense-stream after its "birth" - by giving it sufficient pre-birth training "memories" - which laboriously say: "if you see this, do this, and don't do this or this or this" - for every possible bunch of observations, up to some finite length limit.

I call this sort of thing universal action - and I think reinforcement learning systems are capable of it.
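To make the "pre-birth memories" idea concrete, here is a toy illustration of the kind of mapping they amount to - my own example, nothing to do with Solomonoff induction proper, which would merely be one way of learning such a mapping:

```python
def build_policy(training_memories):
    """training_memories: iterable of (observation_sequence, action) pairs,
    i.e. the "if you see this, do this" entries described above."""
    return {tuple(obs): action for obs, action in training_memories}

def act(policy, observations_so_far, default_action=None):
    # Reproduce whatever finite behaviour the "memories" encode; anything
    # outside the table falls back to a default.
    return policy.get(tuple(observations_so_far), default_action)

# Example: the memories say that after seeing ("red", "green") the agent
# should press button B - regardless of what any reward signal suggests.
policy = build_policy([(("red", "green"), "press B")])
assert act(policy, ["red", "green"]) == "press B"
```

That is all "universal action" requires: an arbitrary finite mapping from observation histories to actions, which a sufficiently powerful learner can be trained to reproduce.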

Bill responds here. It is pretty much what I expected him to say.


This paper should be added to singinst.org's research publications page.