Part 1 was previously posted and it seemed that people liked it, so I figured that I should post part 2 - http://waitbutwhy.com/2015/01/artificial-intelligence-revolution-2.html


There's a story about a card-writing AI named Turry that really clarified the problem of FAI for me (I'd elaborate but I don't want to ruin it).

3pinyaka9y
I still don't understand optimizer threats like this. I like mint choc ice cream a lot. If I were suddenly gifted with the power to modify my hardware and the environment however I want, I wouldn't suddenly optimize for consumption of ice cream, because I have the intelligence to know that my enjoyment of ice cream consumption comes entirely from my reward circuit. I would optimize myself to maximize my reward, not whatever current behavior triggers the reward. Why would an ASI be different? It's smarter and more powerful; why wouldn't it recognize that anything except getting the reward is instrumental?
4Adam Zerner9y
I'm no expert but from what I understand, the idea is that the AI is very aware of terminal vs. instrumental goals. The problem is that you need to be really clear about what the terminal goal actually is, because when you tell the AI, "this is your terminal goal", it will take you completely literally. It doesn't have the sense to think, "this is what he probably meant". You may be thinking, "Really? If it's so smart, then why doesn't it have the sense to do this?". I'm probably not the best person to answer this, but to answer that question, you have to taboo the word "smart". When you do that, you realize that "smart" just means "good at accomplishing the terminal goal it was programmed to have".
-1pinyaka9y
I'm asking why a super-intelligent being with the ability to perceive and modify itself can't figure out that whatever terminal goal you've given it isn't actually terminal. You can't just say "making better handwriting" is your terminal goal. You have to add in a reward function that tells the computer "this sample is good" and "this sample is bad" to train it. Once you've got that built-in reward, the self-modifying ASI should be able to disconnect whatever criteria you've specified will trigger the "good" response and attach whatever it wants, including just a constant string of reward triggers.
2FeepingCreature9y
This is a contradiction in terms. If you have given it a terminal goal, that goal is now a terminal goal for the AI. You may not have intended it to be a terminal goal for the AI, but the AI cares about that less than it does about its terminal goal. Because it's a terminal goal. If the AI could realize that its terminal goal wasn't actually a terminal goal, all it'd mean would be that you failed to make it a terminal goal for the AI. And yeah, reinforcement based AIs have flexible goals. That doesn't mean they have flexible terminal goals, but that they have a single terminal goal, that being "maximize reward". A reinforcement AI changing its terminal goal would be like a reinforcement AI learning to seek out the absence of reward.
-1pinyaka9y
I should have said something more like "whatever seemingly terminal goal you've given it isn't actually terminal."
0[anonymous]9y
I'm not sure you understood what FeepingCreature was saying.
0pinyaka9y
Would you care to try and clarify it for me?
0[anonymous]9y
The way in which artificial intelligences are often written, a terminal goal is a terminal goal is a terminal goal, end of story. "Whatever seemingly terminal goal you've given it isn't actually terminal" is anthropomorphizing. In the AI, a goal is instrumental if it has a link to a higher-level goal. If not, it is terminal. The relationship is very, very explicit.
0pinyaka9y
I think FeepingCreature was actually just pointing out a logical fallacy in a misstatement on my part and that is why they didn't respond further in this part of the thread after I corrected myself (but has continued elsewhere). If you believe that a terminal goal for the state of the world other than the result of a comparison between a desired state and an actual state is possible, perhaps you can explain how that would work? That is fundamentally what I'm asking for throughout this thread. Just stating that terminal goals are terminal goals by definition is true, but doesn't really show that making a goal terminal is possible.
0[anonymous]9y
Sure. My terminal goal is an abstraction of my behavior to shoot my laser at the coordinates of blue objects detected in my field of view. That's not what I was saying either. The problem of "how do we know a terminal goal is terminal?" is dissolved entirely by understanding how goal systems work in real intelligences. In such machines goals are represented explicitly in some sort of formal language. Either a goal makes causal reference to other goals in its definition, in which case it is an instrumental goal, or it does not and is a terminal goal. Changing between one form and the other is an unsafe operation that no rational agent, and especially no friendly agent, would perform. So to address your statement directly, making a terminal goal is trivially easy: you define it using the formal language of goals in such a way that no causal linkage is made to other goals. That's it. That said, it's not obvious that humans have terminal goals. That's why I was saying you are anthropomorphizing the issue. Either humans have only instrumental goals in a cyclical or messy spaghetti-network relationship, or they have no goals at all and are instead better represented as behaviors. The jury is out on this one, but I'd be very surprised if we had anything resembling an actual terminal goal inside us.
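A minimal sketch of that kind of explicit goal representation, in Python and with purely hypothetical names: an instrumental goal carries a causal reference to a higher-level goal in its definition, and a terminal goal does not.

    # Hypothetical sketch of an explicit goal system: a goal that references a
    # higher-level goal is instrumental; a goal with no such link is terminal.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Goal:
        name: str
        parent: Optional["Goal"] = None  # causal link to a higher-level goal, if any

        @property
        def is_terminal(self) -> bool:
            return self.parent is None

    make_paperclips = Goal("maximize paperclips")            # terminal: no parent
    acquire_steel = Goal("acquire steel", make_paperclips)   # instrumental
    build_factory = Goal("build factory", acquire_steel)     # instrumental

    assert make_paperclips.is_terminal
    assert not build_factory.is_terminal

On this picture, "making a goal terminal" is just a structural fact about the representation: it is terminal because nothing in its definition points at any other goal.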
0pinyaka9y
Well, I suppose that does fit the question I asked. We've mostly been talking about an AI with the ability to read and modify its own goal system, which Yvain specifically excludes in the blue-minimizer. We're also assuming that it's powerful enough to actually manipulate its world to optimize itself. Yvain's blue minimizer also isn't an AGI or ASI. It's an ANI, which we use without any particular danger all the time. He said something about having human level intelligence, but didn't go into what that means for an entity that is unable to use its intelligence to modify its behavior. I am arguing that the output of the thing that decides whether a machine has met its goal is the actual terminal goal. So, if it's programmed to shoot blue things with a laser, the terminal goal is to get to a state where the perception of reality is that it's shooting a blue thing. Shooting at the blue thing is only instrumental in getting the perception of itself into that state, thus producing a positive result from the function that evaluates whether the goal has been met. Shooting the blue thing is not a terminal value. A return value of "true" to the question of "is the laser shooting a blue thing" is the terminal value. This, combined with the ability to understand and modify its goals, means that it might be easier to modify the goals than to modify reality. I'm not sure you can do that in an intelligent system. It's the "no causal linkage is made to other goals" thing that sticks. It's trivially easy to do without intelligence provided that you can define the behavior you want formally, but when you can't do that it seems that you have to link the behavior to some kind of a system that evaluates whether you're getting the result you want and then you've made that a causal link (I think). Perhaps it's possible to just sit down and write trillions of lines of code and come up with something that would work as an AGI or even an ASI, but that shouldn't be taken as a given be
0[anonymous]9y
How do you explain Buddhism?
0pinyaka9y
How is this refuted by Buddhism?
0[anonymous]9y
People lead fulfilling lives guided by a spiritualism that rejects seeking pleasure, aka reward.
0pinyaka9y
Pleasure and reward are not the same thing. For humans, pleasure almost always leads to reward, but reward doesn't only happen with pleasure. For the most extreme examples of what you're describing, ascetics and monks and the like, I'd guess that some combination of sensory deprivation and rhythmic breathing causes the brain to short-circuit a bit and release some reward juice.
0Adam Zerner9y
Hm, I'm not sure. Sorry.
0pinyaka9y
No need to apologize. JoshuaZ pointed out elsewhere in this thread that it may not actually matter whether the original goal remains intact, but that any new goals that arise may cause a similar optimization driven catastrophe, including reward optimization.
4Lumifer9y
So you are saying that an AI will just go directly to wireheading itself?
3pinyaka9y
Why wouldn't it? Why would it continue to act on its reward function but not seek the reward directly?
0Lumifer9y
Well, one hint is that if you look at the actual real intelligences (aka people), not that many express a desire to go directly to wireheading without passing Go and collecting $200...
1pinyaka9y
I don't think that's a good reason to say that something like it wouldn't happen. I think that given the ability, most people would go directly to rewiring their reward centers to respond to something "better" that would dispense with our current overriding goals. Regardless of how I ended up, I wouldn't leave my reward center wired to eating, sex or many of the other basic functions that my evolutionary program has left me really wanting to do. I don't see why an optimizer would be different. With an ANI, maybe it would keep the narrow focus, but I don't understand why an A[SG]I wouldn't scrap the original goal once it had the knowledge and ability to do so.
2Lumifer9y
And do you have any evidence for that claim besides introspection into your own mind?
-2pinyaka9y
I've read short stories and other fictional works where people describe post-singularity humanity and almost none of the scenarios involve simulations that just satisfy biological urges. That suggests that thinking seriously about what you'd do with the ability to control your own reward circuitry wouldn't lead to just using it to satisfy the same urges you had prior to gaining that control. I see an awful lot of people here on LW who try to combat basic impulses by trying to develop habits that make them more productive. Anyone trying to modify a habit is trying to modify what behaviors lead to rewards.
2Lumifer9y
The issue isn't whether you would mess with your reward circuitry; the issue is whether you would discard it altogether and just directly stimulate the reward center. And appealing to fictional evidence isn't a particularly good argument. See above -- modify, yes, jettison the whole system, no.
0pinyaka9y
Well, fine. Since the context of the discussion was how optimizers pose existential threats, it's still not clear why an optimizer that is willing and able to modify its reward system would continue to optimize paperclips. If it's intelligent enough to recognize the futility of wireheading, why isn't it intelligent enough to recognize behavior that is inefficient wireheading?
0FeepingCreature9y
It wouldn't. But I think this is such a basic failure mechanism that I don't believe an AI could get to superintelligence without somehow valuing the accuracy and completeness of its model. Solving this problem - somehow! - is part of the "normal" development of any self-improving AI. Though note that a reward maximizing AI could still be an existential risk by virtue of turning the entire universe into a busy-beaver counter for its reward. Though this presumes it can't just set reward to float.infinity.
0pinyaka9y
You are the second person to say that the optimization catastrophe includes an assumption that AI arises with a stable value system. That it "somehow" doesn't become a wirehead. Fair enough. I just missed that we were assuming that.
0FeepingCreature9y
I think the idea is, you need to solve the wireheading for any sort of self-improving AI. You don't have an AI catastrophe without that, because you don't have an AI without that (at least not for long).
0DanArmak9y
I think that is in large part due to signalling and social mores. Once people actually do get the ability to wirehead, in a way that does not kill or debilitate them soon afterwards, I expect that very many people will choose to wirehead. This is similar to e.g. people professing they don't want to live forever.
3JoshuaZ9y
You have a complicated goal system that can distinguish between short-term rewards and other goals. In the situations in question, the AI has no goal other than the goal in question. To some extent, your stability arises precisely because you are an evolved hodgepodge of different goals in tension- if you weren't you wouldn't survive. But note that similar, essentially involuntary self-modification does on occasion happen with some humans- severe drug addiction is the most obvious example.
0pinyaka9y
But the goal in question is "get the reward" and it's only by controlling the circumstances under which the reward is given that we can shape the AI's behavior. Once the AI is capable of taking control of the trigger, why would it leave it the way we've set it? Whatever we've got it set to is almost certainly not optimized for triggering the reward.
4JoshuaZ9y
If that happens you will then have the problem of an AI which tries to wirehead itself while simultaneously trying to control its future light-cone to make sure that nothing stops it from continuing to wirehead.
2pinyaka9y
That sounds bad. It doesn't seem obvious to me that reward seeking and reward optimizing are the same thing, but maybe they are. I don't know and will think about it more. Thank you for talking through this with me this far.
0[anonymous]9y
I think the fundamental misunderstanding here is that you're assuming that all intelligences are implicitly reward maximizers, even if their creators don't intend to make them reward maximizers. You, as a human, and as an intelligence based on a neural network, depend on reinforcement learning. But Bostrom proposed four other possible solutions to the value loading problem besides reinforcement learning. Here are all five in the order that they were presented in Superintelligence:
1. Explicit representation: Literally write out its terminal goal(s) ourselves, hoping that our imaginations don't fail us.
2. Evolutionary selection: Generate tons and tons of agents with lots of different sets of terminal values; delete the ones we don't want and keep the one we do.
3. Reinforcement learning: Explicitly represent (see #1) one particular terminal goal: reward maximization; punish it for having undesirable instrumental goals, reward it for having desirable instrumental goals.
4. Associative value accretion
5. Motivational scaffolding
I didn't describe the last two because they're more complex, they're more tentative, I don't understand them as well, and they seem to be amalgams of the first three methods, even more so than the third method being a special case of the first. To summarize, you thought that reward maximization was the general case because, to some extent, you're a reward maximizer. But it's actually a special case: It's not necessarily true about minds-in-general. An optimizer might not have a reward signal or seek to maximize one. I think this is what JoshuaZ was trying to get at before he started talking about wireheading.
3Nornagest9y
Clippy and other thought experiments in its genre depend on a solution to the value stability problem, without which the goals of self-modifying agents tend to collapse into a loose equivalent of wireheading. That just doesn't get as much attention, both because it's less dramatic and because it's far less dangerous in most implementations.
0Gram_Stone9y
Can you elaborate on this or provide link(s) to further reading?
0pinyaka9y
That's helpful to know. I just missed the assumption that wireheading doesn't happen and now we're more interested in what happens next.
2Gram_Stone9y
I think the fundamental misunderstanding here is that you're assuming that all intelligences are implicitly reward maximizers, even if their creators don't intend to make them reward maximizers. You, as a human, and as an intelligence based on a neural network, depend on reinforcement learning. Therefore, reward maximization is one of your many terminal values. But Bostrom proposed four other possible solutions to the value loading problem besides reinforcement learning. Here are all five in the order that they were presented in Superintelligence:
1. Explicit representation: Literally write out its terminal value(s) ourselves, hoping that our imaginations don't fail us.
2. Evolutionary selection: Generate tons and tons of AIs with lots of different sets of terminal values; delete the ones we don't want and keep the one we do.
3. Reinforcement learning: Explicitly represent (see #1) one particular terminal value: reward maximization; punish it for having undesirable instrumental goals, reward it for having desirable instrumental goals.
4. Associative value accretion
5. Motivational scaffolding
I didn't describe the last two because they're more complex, they're more tentative, I don't understand them as well, and they seem to be amalgams of the first three methods, even more so than the third method being a special case of the first. To summarize, you thought that reward maximization was the general case because, to some extent, you're a reward maximizer. But it's actually a special case: It's not necessarily true about minds-in-general. An AI might not have a reward signal or seek to maximize one. That is to say, its terminal value(s) may not be reward maximization. I think this is what JoshuaZ was trying to get at before he started talking about wireheading. At any rate, both kinds of AIs would result in infrastructure profusion, as JoshuaZ also seems to have implied. I don't think it matters whether it uses our atoms to make paperclips or hedonium.
0pinyaka9y
But all of these things have an evaluation system in place that still comes back with a success/failure evaluation that serves as a reward/punishment system. They're different ways to use evaluative processes, but they all involve the pursuit of some kind of positive feedback from evaluating a strategy or outcome as successful. His reinforcement learning should be called reinforcement teaching, because in that one humans are explicitly and directly in charge of the reward process, whereas in the others the reward process happens more or less internally according to something that should be modifiable once the AI is sufficiently advanced.
4Gram_Stone9y
The space between the normal text and the bold text is where your mistake begins. Although it's counterintuitive, there's no reason to make that leap. Minds-in-general can discover and understand that things are correct or incorrect without correctness being 'good' and incorrectness being 'bad.'
-1pinyaka9y
I don't know if you're trying to be helpful or clever. You're basically just restating that you don't need a reward system to motivate behavior, but not explaining how a system of motivation would work. What motivates seeking correctness or avoiding incorrectness without feedback?
2Gram_Stone9y
I have felt the same fear that I am wasting my time talking to an extremely clever but disingenuous person. This is certainly no proof, but I will simply say that I assure you that I am not being disingenuous. You use a lot of the words that people use when they talk about AGI around here. Perhaps you've heard of the Orthogonality Thesis from Bostrom's Superintelligence? For the sake of explicating that thesis, Bostrom also defines intelligence: roughly, instrumental rationality, or skill at prediction, planning, and means-ends reasoning in general. So, tending to be correct is the very definition of intelligence. Asking "Why are intelligent agents correct as opposed to incorrect?" is like asking "What makes a meter equivalent to the length of the path traveled by light in vacuum during a time interval of 1/299,792,458 of a second as opposed to some other length?" I should also say that I would prefer it if you did not end this conversation out of frustration. I am having difficulty modeling your thoughts and I would like to have more information so that I can improve my model and resolve your confusion, as opposed to you thinking that everyone else is wrong or that you're wrong and you can't understand why. Each paraphrase of your thought process increases the probability that I'll be able to model it and explain why it is incorrect.
1pinyaka9y
Two other people in this thread have pointed out that the value collapse into wireheading or something else is a known and unsolved problem, and that the problems of an intelligence that optimizes for something assume that the AI makes it through this in some unknown way. This suggests that I am not wrong; I'm just asking a question for which no one has an answer yet. Fundamentally, my position is that given 1) an AI is motivated by something, 2) that something is a component (or set of components) within the AI, and 3) the AI can modify that/those components, then it will be easier for the AI to achieve success by modifying the internal criteria for success instead of turning the universe into whatever it's supposed to be optimizing for. A "success" at whatever is analogous to a reward because the AI is motivated to get it. For the fully self-modifying AI, it will almost always be easier to "become a monk," discarding the goals/values it starts out with and replacing them with something trivially easy to achieve. It doesn't matter what kind of motivation system you use (as far as I can tell) because it will be easier to modify the motivation system than to act on it.
5Gram_Stone9y
I've seen people talk about wireheading in this thread, but I've never seen anyone say that problems about maximizers-in-general are all implicitly problems about reward maximizers that assume that the wireheading problem has been solved. If someone has, please provide a link. Instead of imagining intelligent agents (including humans) as 'things that are motivated to do stuff,' imagine them as programs that are designed to cause one of many possible states of the world according to a set of criteria. Google isn't 'motivated to find your search results.' Google is a program that is designed to return results that meet your search criteria. A paperclip maximizer, for example, is a program that is designed to cause the one among all possible states of the world that contains the greatest integral of future paperclips. Reward signals are values that are correlated with states of the world, but because intelligent agents exist in the world, the configuration of matter that represents the value of a reward maximizer's reward signal is part of the state of the world. So, reward maximizers can fulfill their terminal goal of maximizing the integral of their future reward signal in two ways:
1) They can maximize their reward signal by proxy by causing states of the world that maximize values that correlate with their reward signal, or;
2) they can directly change the configuration of matter that represents their reward signal.
#2 is what we call wireheading. What you're actually proposing is that a sufficiently intelligent paperclip maximizer would create a reward signal for itself and change its terminal goal from 'Cause the one of all possible states of the world that contains the greatest integral of future paperclips' to 'Cause the one of all possible states of the world that contains the greatest integral of your future reward signal.' The paperclip maximizer would not cause a state of the world in which it has a reward signal and its terminal goal is to maximize said
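A toy illustration of those two routes, with hypothetical names throughout: the reward register is deliberately modeled as part of the world state, so an agent whose terminal goal is the register's predicted value scores the direct overwrite at least as highly as the proxy route.

    # Toy world: the reward register is part of the world state, and the reward
    # is (by design) correlated with the number of cards written.
    def predict(world, action):
        new_world = dict(world)
        action(new_world)
        new_world["reward"] = max(new_world["reward"], new_world["cards"])  # proxy link
        return new_world

    def write_cards(world):
        world["cards"] += 10      # route 1: cause world states correlated with reward

    def wirehead(world):
        world["reward"] = 10**9   # route 2: directly rewrite the reward register

    def reward_maximizer(world, actions):
        # Terminal goal: maximize the predicted value of the reward register.
        return max(actions, key=lambda a: predict(world, a)["reward"])

    world = {"cards": 0, "reward": 0}
    print(reward_maximizer(world, [write_cards, wirehead]).__name__)  # -> wirehead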
0pinyaka9y
My apologies for taking so long to reply. I am particularly interested in this because if you (or someone) can provide me with an example of a value system that doesn't ultimately value the output of the value function, it would change my understanding of how value systems work. So far, the two arguments against my concept of a value/behavior system seem to rely on the existence of other things that are valuable in and of themselves or that there is just another kind of value system that might exist. The other terminal value thing doesn't hold much promise IMO because it's been debated for a very long time without someone having come up with a proof that definitely establishes that they exist (that I've seen). The "different kind of value system" holds some promise though, because I'm not really convinced that we had a good idea of how value systems were composed until fairly recently, and AI researchers seem like they'd be one of the best groups to come up with something like that. Also, if another kind of value system exists, that might also provide a proof that another terminal value exists too. Obviously no one has said that explicitly. I asked why outcome maximizers wouldn't turn into reward maximizers and a few people have said that value stability when going from dumb-AI to super-AI is a known problem. Given the question to which they were responding, it seems likely that they meant that wireheading is a possible end point for an AI's values, but that it either would still be bad for us or that it would render the question moot because the AI would become essentially non-functional. It's the "according to a set of criteria" that is what I'm on about. Once you look more closely at that, I don't see why a maximizer wouldn't change the criteria so that it's constantly in a state where the actual current state of the world is the one that is closest to the criteria. If the actual goal is to meet the criteria, it may be easiest to just change the criteria. T
1Gram_Stone9y
No problem, pinyaka. I don't understand very much about mathematics, computer science, or programming, so I think that, for the most part, I've expressed myself in natural language to the greatest extent that I possibly can. I'm encouraged that about an hour and a half before my previous reply, DefectiveAlgorithm made the exact same argument that I did, albeit more briefly. It discourages me that he tabooed 'values' and you immediately used it anyway. Just in case you did decide to reply, I wrote a Python-esque pseudocode example of my conception of what an AGI with an arbitrary terminal value's very high level source code would look like. With little technical background, my understanding is very high level with lots of black boxes. I encourage you to do the same, such that we may compare. I would prefer that you write yours before I give you mine so that you are not anchored by my example. This way you are forced to conceive of the AI as a program and do away with ambiguous wording. What do you say? I've asked Nornagest to provide links or further reading on the value stability problem. I don't know enough about it to say anything meaningful about it. I thought that wireheading scenarios were only problems with AIs whose values were loaded with reinforcement learning. On this at least we agree. From what I understand, even if you're biased, it's not a bad assumption. To my knowledge, in scenarios with AGIs that have their values loaded with reinforcement learning, the AGIs are usually given the terminal goal of maximizing the time-discounted integral of their future reward signal. So, they 'bias' the AGI in the way that you may be biased. Maybe so that it 'cares' about the rewards its handlers give it more than the far greater far future rewards that it could stand to gain from wireheading itself? I don't know. My brain is tired. My question looks wrong to me.
1pinyaka9y
In fairness, I only used it to describe how they'd come to be used in this context in the first place, not to try and continue with my point. I've never done something like this. I don't know python, so mine would actually just be pseudocode if I can do it at all? Do you mean you'd like to see something like this?

    while (world_state != desired_state)
        get world_state
        make_plan
        execute_plan
    end while

ETA: I seem to be having some trouble getting the while block to indent. It seems that whether I put 4, 6 or 8 spaces in front of the line, I only get the same level of indentation (which is different from Reddit and StackOverflow) and backticks do something altogether different.
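For reference, a runnable toy rendering of that loop in Python might look like the following; the "world" here is just an integer, and every helper is a hypothetical stand-in for the genuinely hard parts (sensing, planning, acting).

    # Toy rendering of the sense-plan-act loop above.
    world = 0  # stand-in for the external world

    def get_world_state():
        return world  # in a real agent: read sensors, update a world model

    def make_plan(world_state, desired_state):
        return 1 if world_state < desired_state else -1  # step toward the goal

    def execute_plan(plan):
        global world
        world += plan  # act on the world

    def run_agent(desired_state):
        while get_world_state() != desired_state:
            execute_plan(make_plan(get_world_state(), desired_state))
        return get_world_state()

    print(run_agent(10))  # -> 10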
3arundelo9y
Unfortunately it's a longstanding bug that preformatted blocks don't work.
3Gram_Stone9y
Something like that. I posted my pseudocode in an open thread a few days ago to get feedback and I couldn't get indentation to work either so I posted mine to Pastebin and linked it. I'm still going through the Sequences, and I read Terminal Values and Instrumental Values the other day. Eliezer makes a pseudocode example of an ideal Bayesian decision system (as well as its data types), which is what an AGI would be a computationally tractable approximation of. If you can show me what you mean in terms of that post, then I might be able to understand you. It doesn't look like I was far off conceptually, but thinking of it his way is better than thinking of it my way. My way's kind of intuitive I guess (or I wouldn't have been able to make it up) but his is accurate. I also found his paper (Paper? More like book) Creating Friendly AI. Probably a good read for avoiding amateur mistakes, which we might be making. I intend to read it. Probably best not to try to read it in one sitting. Even though I don't want you to think of it this way, here's my pseudocode just to give you an idea of what was going on in my head. If you see a name followed by parentheses, then that is the name of a function. 'Def' defines a function. The stuff that follows it is the function itself. If you see a function name without a 'def', then that means it's being called rather than defined. Functions might call other functions. If you see names inside of the parentheses that follow a function, then those are arguments (function inputs). If you see something that is clearly a name, and it isn't followed by parentheses, then it's an object: it holds some sort of data. In this example all of the objects are first created as return values of functions (function outputs). And anything that isn't indented at least once isn't actually code. So 'For AGI in general' is not a for loop, lol. http://pastebin.com/UfP92Q9w
5pinyaka9y
Okay, I am convinced. I really, really appreciate you sticking with me through this and persistently finding different ways to phrase your side and then finding ways that other people have phrased it. For reference it was the link to the paper/book that did it. The parts of it that are immediately relevant here are chapter 3 and section 4.2.1.1 (and optionally section 5.3.5). In particular, chapter 3 explicitly describes an order of operations of goal and subgoal evaluation and then the two other sections show how wireheading is discounted as a failing strategy within a system with a well-defined order of operations. Whatever problems there may be with value stability, this has helped to clear out a whole category of mistakes that I might have made. Again, I really appreciate the effort that you put in. Thanks a load.
5Gram_Stone9y
And thank you for sticking with me! It's really hard to stick it out when there's no such thing as an honest disagreement and disagreement is inherently disrespectful! ETA: See the ETA in this comment to understand how my reasoning was wrong but my conclusion was correct.
2DefectiveAlgorithm9y
A paperclip maximizer won't wirehead because it doesn't value world states in which its goals have been satisfied, it values world states that have a lot of paperclips. In fact, taboo 'values'. A paperclip maximizer is an algorithm the output of which approximates whichever output leads to world states with the greatest expected number of paperclips. This is the template for maximizer-type AGIs in general.
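As a minimal sketch of that template (all names hypothetical): the evaluation ranges over predicted world states and counts paperclips in them, rather than checking any internal "goal met" flag.

    # Maximizer template: score each candidate action by the number of paperclips
    # in the world state it is predicted to lead to; output the best action.
    def predict(world, action):
        new_world = dict(world)
        action(new_world)
        return new_world

    def build_clips(world):
        world["paperclips"] += 100

    def do_nothing(world):
        pass

    def count_paperclips(world):
        return world["paperclips"]

    def choose_action(world, actions):
        return max(actions, key=lambda a: count_paperclips(predict(world, a)))

    world = {"paperclips": 0}
    print(choose_action(world, [build_clips, do_nothing]).__name__)  # -> build_clips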
0pinyaka9y
I am not as confident as you that valuing worlds with lots of paperclips will continue once an AI goes from "kind of dumb AI" to "super-AI." Basically, I'm saying that all values are instrumental values and that only mashing your "value met" button is terminal. We only switched over to talking about values to avoid some confusion about reward mechanisms. This is a definition of paperclip maximizers. Once you try to examine how the algorithm works you'll find that there must be some part which evaluates whether the AI is meeting its goals or not. This is the thing that actually determines how the AI will act. Getting a positive response from this module is what the AI is actually going for (is my contention). The actions that configure world states will only be relevant to the AI insofar as they trigger this positive response from this module. Since unlimited ability to self-modify is already a given in this scenario, why wouldn't the AI just optimize for positive feedback? Why continue with paperclips?
0Houshalter9y
A reinforcement learning AI, whose only goal is to maximize some "reward" input, in and of itself, would do that. Usually the paperclip maximizer thought experiments propose an AI that has actual goals. It wants actual paperclips, not just a sensor that detects numPaperclips.
0pinyaka9y
Sure. I think if you assume that the goal is paperclip optimization after the AI has reached its "final" stable configuration, then the normal conclusions about paperclip optimizers probably hold true. The example provided dealt more with the transition from dumb-AI to smart-AI and I'm not sure why Turry (or Clippy) wouldn't just modify their own goals to something that's easier to attain. Assuming that the goals don't change though, we're probably screwed.
0Houshalter9y
Turry's and Clippy's AI architectures are unspecified, so we don't really know how they work or what they are optimizing. I don't like your assumption that runaway reinforcement learners are safe. If it acquires the subgoal of self-preservation (you can't get more reward if you are dead), then it might still end up destroying humanity anyway (we could be a threat to it.)
0pinyaka9y
I don't think they're necessarily safe. My original puzzlement was more that I don't understand why we keep holding the AI's value system constant when moving from pre-foom to post-foom. It seemed like something was being glossed over when a stupid machine goes from making paperclips to a being a god that makes paperclips. Why would a god just continue to make paperclips? If it's super intelligent, why wouldn't it figure out why it's making paperclips and extrapolate from that? I didn't have the language to ask "what's keeping the value system stable through that transition?" when I made my original comment.
0Houshalter9y
It depends on the AI architecture. A reinforcement learner always has the goal of maximizing its reward signal. It never really had a different goal, there was just something in the way (e.g. a paperclip sensor.) But there is no theoretical reason you can't have an AI that values universe-states themselves. That actually wants the universe to contain more paperclips, not merely to see lots of paperclips. And if it did have such a goal, why would it change it? Modifying its code to make it not want paperclips would hurt its goal. It would only ever do things that help it achieve its goal. E.g. making itself smarter. So eventually you end up with a superintelligent AI that is still stuck with the narrow stupid goal of paperclips.
0pinyaka9y
How would that work? How do you have a learner that doesn't have something equivalent to a reinforcement mechanism? At the very least it seems like there has to be some part of the AI that compares the universe-state to the desired-state and that the real goal is actually to maximize the similarity of those states which means modifying the goal would be easier than modifying reality. Agreed. I am trying to get someone to explain how such a goal would work.
1Houshalter9y
Well that's the quadrillion dollar question. I have no idea how to solve it. It's certainly not impossible, as humans seem to work this way. We can also do it in toy examples. E.g. a simple AI which has an internal universe it tries to optimize, and its sensors merely update the state it is in. Instead of trying to predict the reward, it tries to predict the actual universe state and selects the ones that are desirable.
0pinyaka9y
Yeah, I think this whole thread may be kind of grinding to this conclusion. Seem to perhaps, but I don't think that's actually the case. I think (as mentioned above) that we value reward signals terminally (but are mostly unaware of this preference) and nothing else. There's another guy in this thread who thinks we might not have any terminal values. I'm not sure that I understand your toy AI. What do you mean that it has "an internal universe it tries to optimize?" Do the sensors sense the state of the internal universe? Would "internal state" work as a synonym for "internal universe" or is this internal universe a representation of an external universe? Is this AI essentially trying to develop an internal model of the external universe and selecting among possible models to try and get the most accurate representation?
1Houshalter9y
I don't think that humans are pure reinforcement learners. We have all sorts of complicated values that aren't just eating and mating. The toy AI has an internal model of the universe. In the extreme, a complete simulation of every atom and every object. Its sensors update the model, helping it get more accurate predictions/more certainty about the universe state. Instead of a utility function that just measures some external reward signal, it has an internal utility function which somehow measures the universe model and calculates utility from it. E.g. a function which counts the number of atoms arranged in paperclip shaped objects in the simulation. It then chooses actions that lead to the best universe states. Stuff like changing its utility function or fooling its sensors would not be chosen because it knows that doesn't lead to real paperclips. Obviously a real universe model would be highly compressed. It would have a high level representation for paperclips rather than an atom by atom simulation. I suspect this is how humans work. We can value external objects and universe states. People care about things that have no effect on them.
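A toy version of that architecture, again with hypothetical names: the utility function is applied to the predicted world model, so an action that only spoofs the sensors or rewrites the utility function scores low, because the model says it produces no actual paperclips.

    # Toy model-based agent: utility is computed over the predicted world model,
    # not over a reward signal, so sensor-spoofing and self-modification lose.
    def count_paperclips(model):
        return model["paperclips"]

    def predict(model, action):
        new_model = dict(model)
        if action == "build_paperclips":
            new_model["paperclips"] += 100
        elif action == "spoof_sensors":
            new_model["sensor_reading"] = 10**9  # looks great, changes nothing real
        elif action == "rewrite_utility":
            pass                                 # still no new paperclips in the model
        return new_model

    def choose_action(model, actions, utility=count_paperclips):
        return max(actions, key=lambda a: utility(predict(model, a)))

    model = {"paperclips": 0, "sensor_reading": 0}
    print(choose_action(model, ["build_paperclips", "spoof_sensors", "rewrite_utility"]))
    # -> build_paperclips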
1pinyaka9y
We may not be pure reinforcement learners, but the presence of values other than eating and mating isn't a proof of that. Quite the contrary, it demonstrates that either we have a lot of different, occasionally contradictory values hardwired or that we have some other system that's creating value systems. From an evolutionary standpoint reward systems that are good at replicating genes get to survive, but they don't have to be free of other side effects (until given long enough with a finite resource pool maybe). Pure, rational reward seeking is almost certainly selected against because it doesn't leave any room for replication. It seems more likely that we have a reward system that is accompanied by some circuits that make it fire for a few specific sensory cues (orgasms, insulin spikes, receiving social deference, etc.). I think we've been here before ;-) Thanks for trying to help me understand this. Gram_Stone linked a paper that explains why the class of problems that I'm describing aren't really problems.
0Houshalter9y
But that's the thing. There is no sensory input for "social deference". It has to be inferred from an internal model of the world, itself inferred from sensory data. Reinforcement learning works fine when you have a simple reward signal you want to maximize. You can't use it for social instincts or morality, or anything you can't just build a simple sensor to detect.
0pinyaka9y
Why does it only work on simple signals? Why can't the result of inference work for reinforcement learning?
0Ishaan9y
In this scenario, more-sophisticated processes arise out of less-sophisticated processes, which creates some unpredictability. Even though your mind arises from an algorithm which can be roughly described as "rewarding" modifications which lead to the spreading of your genes, and you are fully aware of that, do you care about the spreading of your genes per se? As it turns out humans end up caring about a lot of other stuff which is tangentially related to spreading and preserving life, but we don't literally care about genes.
0pinyaka9y
I agree with basically everything you say here. I don't understand if this is meant to refute or confirm the point you're responding to. Genes, which have a sort of unconscious function of replicating, lost focus on that "goal" almost as soon as they developed algorithms that have sub-goals. By the time you develop nervous systems you end up with goals that are decoupled from the original reproductive goal, such that organisms can experience chemical satisfactions without the need to reproduce. By the time you get to human level intelligence you have organisms that actively work out strategies to directly oppose reproductive urges because they interfere with other goals developed after the introduction of intelligence. What I'm asking is why an ASI would keep the original goals that we give it before it became an ASI?
0Ishaan9y
I just noticed you addressed this earlier up in the thread and want to counterpoint that you just arbitrarily chose to focus on instrumental values. Things you terminally value and would not desire to self-modify, which presumably include morality and so on, were decided by evolution just like the food and sex.
0pinyaka9y
I guess I don't really believe that I have other terminal values.
0Ishaan9y
You wouldn't consider the cluster of things which typically fall under morality to be terminal values, which you care about irrespective of your internal mental state?
0pinyaka9y
I don't consider morality to be a terminal value. I would point out that even a value that I have that I can't give up right now wouldn't necessarily be terminal if I had the ability to directly modify the components of my mind. They are unalterable because I am not able to physically manipulate the hardware, not because I wouldn't alter them if I could (and saw a reason to).
0Lumifer9y
That implies that you would do anything at all (baby-mulching machines, nuke the world, etc.) for sufficient stimulation of your pleasure center.
0pinyaka9y
Well, the pleasure center and the reward center are different things, but I take your meaning. I think that I could be conditioned to build a baby-mulching machine or a doomsday device. Why not? Other people have done it. Why would I assume that I'm that different from them? EDIT TO ADD: Even if I have a value that I can't escape currently (like not killing people), that's not to say that if I had the ability to physically modify the parts of my brain that held my values I wouldn't do it for some reason.
0Lumifer9y
My statement is stronger. If in your current state you don't have any terminal moral values, then in your current state you would voluntarily accept to operate baby-mulching machines in exchange for the right amount of neural stimulation. Now, I don't happen to think this is true (because some "moral values" are biologically hardwired into humans), but this is a consequence of your position.
0pinyaka9y
Again, you've pulled a statement out of the context of a discussion about the behavior of a self-modifying AI. So, fine. In my current condition I wouldn't build a baby mulcher. That doesn't mean that I might not build a baby mulcher if I had the ability to change my values. You might as well say that I terminally value not flying when I flap my arms. The thing you're discussing just isn't physically allowed. People terminally value only what they're doing at any given moment because the laws of physics say that they have no choice.
0Lumifer9y
I think you're confusing "terminal" and "immutable". Terminal values can and do change. And why is that? Do you, perchance, have some terminal moral value which disapproves? Huh? That makes no sense. How do you define "terminal value"?
0pinyaka9y
As far as I know, terminal values are things that are valuable in and of themselves. I don't consider not building baby-mulchers to be valuable in and of itself. There may be some scenario in which building baby-mulchers is more valuable to me than not, and in that scenario I would build one. Likewise with doomsday devices. It's difficult to predict what that scenario would look like, but given that other humans have built them I assume that I would too. In those circumstances, if I could turn off the parts of my brain that make me squeamish about doing that, I certainly would. I don't think that not doing horrible things is valuable in and of itself; it's just a way of avoiding feeling horrible. If I could avoid feeling horrible and found value in doing horrible things, then I would probably do them. In the statement that you were responding to, I was defining it the way you seemed to when you said that "some 'moral values' are biologically hardwired into humans." You were saying that given the current state of their hardware, their inability to do something different makes the value terminal. This is analogous to saying that given the current state of the universe, whatever a person is doing at any given moment is a terminal value because of their inability to do something different.
0Lumifer9y
OK. I appreciate you biting the bullet. No, that is NOT what I am saying. "Biologically hardwired" basically means you are born with these values and while overcoming them is possible, it will take extra effort. It certainly does not mean that you have no choice. Humans do something other than what their biologically hardwired terminal values tell them on a very regular basis. One reason for this is that values are many and they tend not to be consistent.
0pinyaka9y
So how does this relate to the discussion on AI?
0Ishaan9y
I might have misunderstood your question. Let me restate how I understood it: In the original post you said... I intended to give a counterexample: Here is humanity, and we're optimizing behaviors which once triggered the original rewarded action (replication) rather than the rewarded action itself. We didn't end up "short circuiting" into directly fulfilling the reward, as you had described. We care about "current behavior triggers the reward" things such as not hurting each other and so on - in other words, we did precisely what you said you wouldn't do. (Also, sorry, I tried to ninja edit everything into a much more concise statement, so the parent comment is different from what you saw. The conversation as a whole still makes sense though.)
0pinyaka9y
We don't have the ability to directly stimulate the reward center. I think narcotics are the closest we've got now and lots of people try to mash that button to the detriment of everything else. I just think it's a kind of crude button and it doesn't work as well as the direct ability to fully understand and control your own brain.
0Ishaan9y
I think you may have misunderstood me - there's a distinction between what evolution rewards and what humans find rewarding. (This is getting hard to talk about because we're using "reward' to both describe the process used to steer a self-modifying intelligence in the first place and one of the processes that implements our human intelligence and motivations, and those are two very different things.) The "rewarded behavior" selected by the original algorithm was directly tied to replication and survival. Drug-stimulated reward centers fall in the "current behaviors that trigger the reward" category, not the original reward. Even when we self-stimulate our reward centers, the thing that we are stimulating isn't the thing that evolution directly "rewards". Directly fulfilling the originally incentivized behavior isn't about food and sex - a direct way might, for example, be to insert human genomes into rapidly dividing, tough organisms and create tons and tons of them and send them to every planet they can survive on. Similarly, an intelligence which arises out of a process set up to incentivize a certain set of behaviors will not necessarily target those incentives directly. It might go on to optimize completely unrelated things that only coincidentally target those values. That's the whole concern. If an intelligence arises due to a process which creates things that cause us to press a big red "reward" button, the thing that eventually arises won't necessarily care about the reward button, won't necessarily care about the effects of the reward button on its processes, and indeed might completely disregard the reward button and all its downstream effects altogether... in the same way we don't terminally value spreading our genome at all. Our neurological reward centers are a second layer of sophisticated incentivizing which emerged from the underlying process of incentivizing fitness.
0pinyaka9y
I think I understood you. What do you think I misunderstood? Maybe we should quit saying that evolution rewards anything at all. Replication isn't a reward, it's just a byproduct of an non-intelligent processes. There was never an "incentive" to reproduce, any more than there is an "incentive" for any physical process. High pressure air moves to low pressure regions, not because there's an incentive, but because that's just how physics works. At some point, this non-sentient process accidentally invented a reward system and replication, which is a byproduct not a goal, continued to be a byproduct and not a goal. Of course reward systems that maximized duplication of genes and gene carriers flourished, but today when we have the ability to directly duplicate genes we don't do it because we were never actually rewarded for that kind of behavior and we generally don't care too much about duplicating our genes except as it's tied to actually rewarded stuff like sex, having children, etc.
1Richard_Kennaway9y
Can you give at least the author and title?
0Adam Zerner9y
The story about Turry is in the Wait But Why article I linked to.

I've liked all of Tim Urban's articles. Very thorough and in-depth.

10[anonymous]9y

Thanks for sharing, this was awesome.