Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

If you haven't read the prior posts, please do so now. This sequence can be spoiled.

¯\_(ツ)_/¯


Can you give other conceptions of "impact" that people have proposed, and compare/contrast them with "How does this change my ability to get what I want?"

Also, there's a bunch of different things that "want" could mean. Is that something you've thought about and if so, is it important to pick the right sense of "want"?

(BTW, in these kinds of sequences I never know whether to ask a question midway through or to wait and see if it will be resolved later. Maybe it would help to have a table of contents at the start? Or should I just ask and let the author say that they'll be answered later in the sequence?)

Can you give other conceptions of "impact" that people have proposed, and compare/contrast them with "How does this change my ability to get what I want?"

This is not quite what you're asking for, but I have a post on ways people have thought AIs that minimise 'impact' should behave in certain situations, and you can go through and see what the notion of 'impact' given in this post would advise. [ETA: although that's somewhat tricky, since this post only defines 'impact' and doesn't say how an agent should behave to minimise it]

Can you give other conceptions of "impact" that people have proposed, and compare/contrast them with "How does this change my ability to get what I want?"

The next post will cover this.

there's a bunch of different things that "want" could mean. Is that something you've thought about and if so, is it important to pick the right sense of "want"?

I haven't considered this at length yet. Since we're only thinking descriptively right now, and in light of where the sequence is headed, I don't know that it's important to nail down the right sense. That said, I'm still quite interested in doing so.

In terms of the want/like distinction (keeping in mind that want is being used in its neuroscientific that-which-motivates sense, and not the sense I've been using in the post), consider the following:

A University of Michigan study analyzed the brains of rats eating a favorite food. They found separate circuits for "wanting" and "liking", and were able to knock out either circuit without affecting the other... When they knocked out the "liking" system, the rats would eat exactly as much of the food without making any of the satisfied lip-licking expression, and areas of the brain thought to be correlated with pleasure wouldn't show up in the MRI. Knock out "wanting", and the rats seem to enjoy the food as much when they get it but not be especially motivated to seek it out. Are wireheads happy?

Imagining my "liking" system being forever disabled feels pretty terrible, but not maximally negatively impactful (because I also have preferences about the world, not just how much I enjoy my life). Imagining my "wanting" system being disabled feels similar to imagining losing significant executive function - it's not that I wouldn't be able to find value in life, but my future actions now seem unlikely to be pushing my life and the world towards outcomes I prefer. Good things still might happen, and I'd like that, but they seem less likely to come about.

The above is still cheating, because I'm using "preferences" in my speculation, but I think it helps pin down things a bit. It seems like there's some combination of liking/endorsing for "how good things are", while "wanting" comes into play when I'm predicting how I'll act (more on that in two posts, along with other embedded agentic considerations re: "ability to get").

Or should I just ask and let the author say that they'll be answered later in the sequence?

Doing this is fine! We're basically past the point where I wanted to avoid past framings, so people can talk about whatever (although I reserve the right to reply "this will be much easier to discuss later").


Can you give other conceptions of "impact" that people have proposed, and compare/contrast them with "How does this change my ability to get what I want?"
The next post will cover this.

(no way to double quote it seems...maybe nested BBCode?)

Anyhow, looking forward to that, as I was struggling a bit with how the claim "it cannot be a big deal if it doesn't impact my getting what I want" avoids being tautological.

Well, the claim is tautological, after all! The problem with the first part of this sequence is that it can seem... obvious... until you realize that almost all prior writing about impact has not even acknowledged that we want the AI to leave us able to get what we want (to preserve our attainable utility). By default, one considers what "big deals" have in common, and then thinks about not breaking vases / not changing too much stuff in the world state. This attractor is so strong that when I said, "wait, maybe it's not primarily about vases or objects", it didn't make sense to people.

The point of the first portion of the sequence isn't to amaze people with the crazy surprising insane twists I've discovered in what impact really is about - it's to show how things add up to normalcy, so as to set the stage for a straightforward discussion about one promising direction I have in mind for averting instrumental incentives.

The problem with the first part of this sequence is that it can seem... obvious... until you realize that almost all prior writing about impact has not even acknowledged that we want the AI to leave us able to get what we want (to preserve our attainable utility).

Agreed. This has been my impression from reading previous work on impact.

Let me substantiate my claim a bit with a random sampling; I just pulled up a relative reachability blogpost. From the first paragraph (emphasis mine):

An incorrect or incomplete specification of the objective can result in undesirable behavior like specification gaming or causing negative side effects. There are various ways to make the notion of a “side effect” more precise – I think of it as a disruption of the agent’s environment that is unnecessary for achieving its objective. For example, if a robot is carrying boxes and bumps into a vase in its path, breaking the vase is a side effect, because the robot could have easily gone around the vase. On the other hand, a cooking robot that’s making an omelette has to break some eggs, so breaking eggs is not a side effect.

But notice that now we're talking about "disruption of the agent's environment". Relative reachability is indeed tackling the impact measure problem, so using what we now understand, we might prefer to reframe this as:

We think about "side effects" when they change our attainable utilities, so they're really just a conceptual discretization of "things which negatively affect us". We want the robot to prefer policies which avoid overly changing our attainable utilities. For example, if a robot is carrying boxes and bumps into a vase in its path, breaking the vase is a side effect, because it's not that easy for us to repair the vase...
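To make the contrast concrete, here's a toy sketch (the world, its states, and the utility function are all invented for illustration; this is not relative reachability's actual algorithm): a reachability-style framing asks which states the robot can no longer reach, while the attainable-utility framing asks how much the best value we can still achieve has dropped.

```python
# Toy deterministic world: states are labeled strings, edges are possible moves.
WORLD = {
    "start":        {"carry_box", "bump_vase"},
    "carry_box":    {"done"},
    "bump_vase":    {"vase_broken"},
    "vase_broken":  {"done_broken"},
    "done":         set(),
    "done_broken":  set(),
}

def reachable(state):
    """All states reachable from `state` (including itself)."""
    seen, frontier = {state}, [state]
    while frontier:
        s = frontier.pop()
        for nxt in WORLD[s]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# Reachability-style framing: which states become unreachable?
def lost_states(before, after):
    return reachable(before) - reachable(after)

# Attainable-utility framing: how much does the best achievable value drop,
# for some (here, made-up) utility function over states?
def attainable_utility(state, utility):
    return max(utility(s) for s in reachable(state))

vase_intact = lambda s: 0.0 if "broken" in s else 1.0

print(lost_states("start", "vase_broken"))   # states we can no longer reach
print(attainable_utility("start", vase_intact)
      - attainable_utility("vase_broken", vase_intact))   # 1.0
```

In this toy world the two framings agree on the verdict; they differ in what they measure, which is the point of the reframing above.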

Kind of weird question: what are you using to write these? A tablet of some sort presumably?

Surface Pro 6 with Leonardo from the Windows app store.

In the blowing-up-sun scenario I imagined myself being helpless, as I normally can't command sun-altering lasers or anything like that. In a world that has a slow apocalypse, there would be a period of altered living. But in a world that suddenly turns off, it is business as usual up to the cutoff point. It doesn't feel impactful.

Also, being stuck in the abstract doesn't feel that bad. Is being stuck bad? Is being stuck better or worse than being killed by unpredictable natural forces? Does being stuck come with immortality?

My answer to the previous post's challenge question was pretty close, but I wonder whether I have a slightly different thought in my head. In a world that goes from a state of high expected utility to a state of low expected utility while my strategy stays unchanged, I don't think this is impactful, but under a different framing I get to "win less" in this new state and have "lost access to utility". That is, news can be "bad" without being "impactful". In the same way, news can be "impactful" without being "good". If I am a taxi driver, when the customer announces their destination it is very impactful for my driving, but addresses are not better or worse amongst each other.

In the blowing-up-sun scenario I imagined myself being helpless, as I normally can't command sun-altering lasers or anything like that. In a world that has a slow apocalypse, there would be a period of altered living. But in a world that suddenly turns off, it is business as usual up to the cutoff point. It doesn't feel impactful.

So learning that you and your family will die in like a week doesn't feel like a big deal? I feel confused by this, and think maybe you meant something else?

being stuck in the abstract doesn't feel that bad. Is being stuck bad? Is being stuck better or worse than being killed by unpredictable natural forces? Does being stuck come with immortality?

Well, your goal is to reach the gray tile. So, if you imagine yourself existing in this strange maze-reality, having that goal be your one and only imperative, and then no longer being able to reach the tile at all... that feels like a huge setback. But crucially, it only feels like a setback once you comprehend the rules of the maze enough to realize what happened.

If I am a taxi driver, when the customer announces their destination it is very impactful for my driving, but addresses are not better or worse amongst each other.

My framing of impact is something that only agents experience and consider. I'm not talking about how your strategies themselves are "impacted" or change as the result of new information. (I feel like we're using different words for the same things, so I wouldn't be surprised if just this reply didn't clarify what I mean.)

ETA: I'm saying that "getting to win less and losing access to utility" is impact to you, in my conception.

The blowing-up scenario might be a bit too fantastical for me to properly apply intuitions to. It did specify that I grew up on such an earth, which would mean my family's expectations have not really changed up to this point, and I have a hard time imagining what they would be. If a doomsday cult suddenly lives through the expected date, they do not go "omg profit" but "huh? what now?".

Then there is the case of knowing that this solar system has only a finite lifespan. It doesn't automatically feel like everything one has lived for melts to nothing, even if before such a realization one might have thought that all improvements were for perpetuity. Cassandra might be frustrated, but it is because she has so little impact, not because she has received demoralising information.

Yes, I was using a little ambiguous shorthand. The address announcement is impactful to the driver, but there is no utility change. I think "losing access to utility" does not apply well to the taxi driver, and the kind of conception I have that does apply seems attractive in comparison.

It seems like you're considering the changes in actions or information-theoretic surprisal, and I'm considering impact to the taxi driver. It's valid to consider how substantially plans change, it's just not the focus of the sequence.

I thought that "impact" was the word for that. What is there left of the focus of the sequence if you take "life-changes" away from that?

Do you think, or would you say, that there is no impact for the taxi driver?

I assert that we feel impacted when we change our beliefs about how well we can get what we want. Learning the address does not affect their attainable utility, so (when I simulate this experience) it doesn't feel impactful in this specific way. It just feels like learning something.

Is this engaging with what you have in mind by "life-changes"?
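If a toy calculation helps pin this down (all the numbers and action names here are made up): treat attainable utility as the best value over the actions you believe are open to you, and impact as the change in that best value.

```python
# Toy sketch: "impact" as the change in attainable utility, i.e. the best
# value an agent believes it can still achieve. Numbers are illustrative only.

def attainable_utility(beliefs):
    """Best value over the actions the agent believes are available."""
    return max(beliefs.values())

def impact(beliefs_before, beliefs_after):
    return abs(attainable_utility(beliefs_after) - attainable_utility(beliefs_before))

# Taxi driver, indifferent between destinations: each fare pays the same.
before = {"drive_to_A": 10.0, "drive_to_B": 10.0}
after  = {"drive_to_A": 10.0}          # client announces destination A

print(impact(before, after))           # 0.0 -- informative, but not impactful

# Learning the car is broken *does* change what is attainable.
print(impact(before, {"walk_home": 1.0}))   # 9.0 -- impactful
```

Learning the address collapses the option set without moving the maximum, so it registers as zero; learning the car is broken moves the maximum, so it registers as impact.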

I would have agreed with "how we can get what we want", but "how well we can get what we want" kind of specifies that it is a scalar quantity.

Utility functions can be constructed from, or are translatable to and from, choice rankings. There can be no meaningful utility change without it being understandable in terms of choices.

Impact as a primitive feeling feels super weird. I get that it has something to do with the idiom "fuck my life". However, there is another idiom, "this is my life now", which better captures a quality change that is not necessarily a move up or down.

There is a "so" word that would suggest theoretical implication, but the references to simulated experience and feeling seem like callbacks to imagined emotions. Do either or both apply?

I am also confused about what the relationship between expected utility and attainable utility is supposed to be. If you expect to maximise utility, they should be pretty close.

I think I might be experiencing goal-directed behaviour very differently on the inside, and I am unsure how much of the terminology is supposed to be abstract math concepts and how much of it is supposed to be emotional language. It might be that for other people there is a more natural link between being in a low- or high-utility state and feeling low or high.

I am now suspecting it has less to do with "objective life" and more to do with "subjective life", or life-as-experienced, which suggests the approach uses a different kind of ontology.

I think I might be experiencing goal-directed behaviour very differently on the inside, and I am unsure how much of the terminology is supposed to be abstract math concepts and how much of it is supposed to be emotional language. It might be that for other people there is a more natural link between being in a low- or high-utility state and feeling low or high.

The sequence uses emotional language (so far), as it's written to be widely accessible. I'm extensionally defining what I'm thinking of and how that works for me. These intuitions translated for the 20 or so people I showed the first part of the sequence, but minds are different and it's possible it doesn't feel the same for you. As long as the idea of "how well the agent can achieve their goals" makes sense and you see why I'm pointing to these properties, that's probably fine.

I am also confused about what the relationship between expected utility and attainable utility is supposed to be. If you expect to maximise utility, they should be pretty close.

Great catch, covered two posts from now.

In order to be upset, I would need an expectation that the tile was reachable before. If I have zero clue how nature works, I don't have an expectation that it was possible beforehand, so I am not losing any ability.

Then there is the technicality that even if I know that I can't move, I don't know anything about the nature of the world, so maybe I think that the grey square can teleport to me? The framing seems to assume a lot of basic assumptions about gridworlds. So which parts can I assume, and which parts do I genuinely not know?

But yeah, I did fail to read that there was a specification of the wanting.

I would need an expectation that the tile was reachable before. If I have zero clue how nature works, I don't have an expectation that it was possible beforehand, so I am not losing any ability.

Whoops, yeah, I forgot to specify that by the rules of this maze (which is actually generated from the labyrinth in Undertale), you can initially reach the goal. These are really fair critiques. I initially included the rules, but there were a lot of rules and it was distracting. I might add something more.
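For concreteness, here's a toy version of the maze point (the grid, positions, and the 0/1 "goal utility" are all made up, and this is not the actual maze from the post): attainable utility is 1 if the agent's model says the goal is reachable, 0 otherwise, so the setback only registers once the model reflects the sealed-off passage.

```python
# Toy sketch: the agent only *feels* the setback once its model of the rules
# implies the goal is no longer reachable. 0 = open cell, 1 = wall.
from collections import deque

def goal_reachable(grid, start, goal):
    """Breadth-first search over open cells."""
    rows, cols = len(grid), len(grid[0])
    seen, frontier = {start}, deque([start])
    while frontier:
        r, c = frontier.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return False

open_maze   = [[0, 0, 0],
               [0, 1, 0],
               [0, 0, 0]]
sealed_maze = [[0, 1, 0],
               [1, 1, 0],
               [0, 0, 0]]   # the start is now walled off

start, goal = (0, 0), (2, 2)
au = lambda grid: 1.0 if goal_reachable(grid, start, goal) else 0.0

print(au(open_maze) - au(sealed_maze))   # 1.0 -- the drop in attainable utility
```

An agent that hasn't yet learned where the walls are would evaluate `au` on its believed grid, not the sealed one, so it wouldn't feel the drop until its model updates, which is the "only once you comprehend the rules" point.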

Imagine a planet with aliens living on it. Some of those aliens are having what we would consider morally valuable experiences. Some are suffering a lot. Suppose we now find that their planet has been vaporized. By tuning the relative amounts of joy and suffering, we can make it so that the vaporization is exactly neutral under our morality. This feels like a big deal, even if the aliens were in an alternate reality that we could watch but not affect.

Our intuitive feeling of impact is a proxy for how much something affects our values and our ability to achieve them. You can set up contrived situations where an event doesn't actually affect our ability to achieve our values, but still triggers the proxy.

Would the technical definition that you are looking for be value of information? Feeling something to be impactful would then mean that a bunch of mental heuristics think it has a large value of info?

Can you elaborate on the situation further? I’m not sure I follow where the proxy comes apart, but I’m interested in hearing more.

An alien planet contains joy and suffering in a ratio that makes them exactly cancel out according to your morality. You are exactly indifferent about the alien planet blowing up. The alien planet can't be changed by your actions, so you don't need to cancel plans to go there and reduce the suffering when you find out that the planet blew up. Say that the aliens existed long ago. In general, we are setting up the situation so that the planet blowing up doesn't change your expected utility, or the best action for you to take. We set this up by a pile of contrivances. This still feels impactful.

That doesn't feel at all impactful to me, under those assumptions. It feels like I've learned a new fact about the world, which isn't the same feeling. ETA: Another example of this was mentioned by Slider: if you're a taxi driver indifferent between destinations, and the client announces where they want to go, it feels like you've learned something, but it doesn't feel impactful (in the way I'm trying to point at).

I think an issue we might run into here is that I don't exist in your mind, and I've tried to extensionally define for you what I'm pointing at. So if you try to find edge cases according to your understanding of exactly which emotion I'm pointing to, then you'll probably be able to, and it could be difficult for me to clarify without access to your emotions. That said, I'm still happy to try, and I welcome this exploration of how what I've claimed lines up with others' experiences.