Rob Bensinger

Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.


2022 MIRI Alignment Discussion
2021 MIRI Conversations
Naturalized Induction


Sounds like a lot of political alliances! (And "these two political actors are aligned" is arguably an even weaker condition than "these two political actors are allies".)

At the end of the day, of course, all of these analogies are going to be flawed. AI is genuinely a different beast.

It's pretty sad to call all of these end states you describe "alignment", as "alignment" is an extremely natural word for "actually terminally has good intentions".

Aren't there a lot of clearer words for this? "Well-intentioned", "nice", "benevolent", etc.

(And a lot of terms, like "value loading" and "value learning", that are pointing at the research project of getting good intentions into the AI.)

To my ear, "aligned person" sounds less like "this person wishes the best for me", and more like "this person will behave in the right ways".

If I hear that Russia and China are "aligned", I do assume that their intentions play a big role in that, but I also assume that their circumstances, capabilities, etc. matter too. Alignment in geopolitics can be temporary or situational, and it almost never means that Russia cares about China as much as China cares about itself, or vice versa.

And if we step back from the human realm, an engineered system can be "aligned" in contexts that have nothing to do with goal-oriented behavior, but are just about ensuring components are in the right place relative to each other.

Cf. the history of the term "AI alignment". From my perspective, a big part of why MIRI coordinated with Stuart Russell to introduce the term "AI alignment" was that we wanted to switch away from "Friendly AI" to a term that sounded more neutral. "Friendly AI research" had always been intended to subsume the full technical problem of making powerful AI systems safe and aimable; but emphasizing "Friendliness" made it sound like the problem was purely about value loading, so a more generic and low-content word seemed desirable.

But Stuart Russell (and later Paul Christiano) had a different vision in mind for what they wanted "alignment" to be, and MIRI apparently failed to communicate and coordinate with Russell to avoid a namespace collision. So we ended up with a messy patchwork of different definitions.

I've basically given up on trying to achieve uniformity on what "AI alignment" is; the best we can do, I think, is clarify whether we're talking about "intent alignment" vs. "outcome alignment" when the distinction matters.

But I do want to push back against those who think outcome alignment is just an unhelpful concept — on the contrary, if we didn't have a word for this idea I think it would be very important to invent one. 

IMO it matters more that we keep our eye on the ball (i.e., think about the actual outcomes we want and keep researchers' focus on how to achieve those outcomes) than that we define an extremely crisp, easily-packaged technical concept (that is at best a loose proxy for what we actually want). Especially now that ASI seems nearer at hand (so the need for this "keep our eye on the ball" skill is becoming less and less theoretical), and especially now that ASI disaster concerns have hit the mainstream (so the need to "sell" AI risk has diminished somewhat, and the need to direct research talent at the most important problems has increased).

And I also want to push back against the idea that a priori, before we got stuck with the current terminology mess, it should have been obvious that "alignment" is about AI systems' goals and/or intentions, rather than about their behavior or overall designs. I think intent alignment took off because Stuart Russell and Paul Christiano advocated for that usage and encouraged others to use the term that way, not because this was the only option available to us.

"Should" in order to achieve a certain end? To meet some criterion? To boost a term in your utility function?

In the OP: "Should" in order to have more accurate beliefs/expectations. E.g., I should anticipate (with high probability) that the Sun will rise tomorrow in my part of the world, rather than it remaining night.

Why would the laws of physics conspire to vindicate a random human intuition that arose for unrelated reasons?

We do agree that the intuition arose for unrelated reasons, right? There's nothing in our evolutionary history, and no empirical observation, that causally connects the mechanism you're positing and the widespread human hunch "you can't copy me".

If the intuition is right, we agree that it's only right by coincidence. So why are we desperately searching for ways to try to make the intuition right?

It also doesn't force us to believe that a bunch of water pipes or gears functioning as a classical computer can ever have our own first person experience.

Why is this an advantage of a theory? Are you under the misapprehension that "hypothesis H allows humans to hold on to assumption A" is a Bayesian update in favor of H even when we already know that humans had no reason to believe A? This is another case where your theory seems to require that we only be coincidentally correct about A ("sufficiently complex arrangements of water pipes can't ever be conscious"), if we're correct about A at all.

One way to rescue this argument is by adding in an anthropic claim, like: "If water pipes could be conscious, then nearly all conscious minds would be instantiated in random dust clouds and the like, not in biological brains. So given that we're not Boltzmann brains briefly coalescing from space dust, we should update that giant clouds of space dust can't be conscious."

But is this argument actually correct? There's an awful lot of complex machinery in a human brain. (And the same anthropic argument seems to suggest that some of the human-specific machinery is essential, else we'd expect to be some far-more-numerous observer, like an insect.) Is it actually that common for a random brew of space dust to coalesce into exactly the right shape, even briefly?

Yeah, at some point we'll need a proper theory of consciousness regardless, since many humans will want to radically self-improve and it's important to know which cognitive enhancements preserve consciousness.

You can easily clear this confusion if you rephrase it as "You should anticipate having any of these experiences". Then it's immediately clear that we are talking about two separate screens.

This introduces some other ambiguities. E.g., "you should anticipate having any of these experiences" may make it sound like you have a choice as to which experience to rationally expect.

And it's also clear that our curiosity isn't actually satisfied. That the question "which one of these two will actually be the case" is still very much on the table.

... And the answer is "both of these will actually be the case (but not in a split-screen sort of way)".

Your rephrase hasn't shown that there was a question left unanswered in the original post; it's just shown that there isn't a super short way to crisply express what happens in English, so you do actually have to add the clarification.

Still, as soon as we have Rob-y and Rob-z, they are not "metaphysically the same person". When Rob-y says "I" he is referring to Rob-y, not Rob-z, and vice versa. More specifically, Rob-y is referring to some causal curve through time and Rob-z is referring to another causal curve through time. These two curves are the same up to some point, but then they are not.

Yep, I think this is a perfectly fine way to think about the thing.

My first issue with your post is that this initial ontological assumption is neither mentioned explicitly nor motivated. Nothing in your post can be used as proof of this initial assumption.

There are always going to be many different ways someone could object to a view. If you were a Christian, you'd perhaps be objecting that the existence of incorporeal God-given Souls is the real crux of the matter, and if I were intellectually honest I'd be devoting the first half of the post to arguing against the Christian Soul.

Rather than trying to anticipate these objections, I'd rather just hear them stated out loud by their proponents and then hash them out in the comments. This also makes the post less boring for the sorts of people who are most likely to be on LW: physicalists and their ilk.

Now, what would be the experience of getting copied, seen from a first-person, "internal", perspective? I am pretty sure it would be something like: you walk into the room, you sit there, you hear, say, the scanner working for some time, it stops, you walk out. From my agnostic perspective, if I were the one to be scanned it seems like nothing special would have happened to me in this procedure. I didn't feel anything weird, I didn't feel my "consciousness split into two" or something.

Why do you assume that you wouldn't experience the copy's version of events?

The un-copied version of you experiences walking into the room, sitting there, hearing the scanner working, and hearing it stop; then that version of you experiences walking out. It seems like nothing special happened in this procedure; this version of you doesn't feel anything weird, and doesn't feel like their "consciousness split into two" or anything.

The copied version of you experiences walking into the room, sitting there, hearing the scanner working, and then an instantaneous experience of (let's say) feeling like you've been teleported into another room -- you're now inside the simulation. Assuming the simulation feels like a normal room, it could well seem like nothing special happened in this procedure -- it may feel like blinking and seeing the room suddenly change during the blink, while you yourself remain unchanged. This version of you doesn't necessarily feel anything weird either, and they don't feel like their "consciousness split into two" or anything.

It's a bit weird that there are two futures, here, but only one past -- that the first part of the story is the same for both versions of you. But so it goes; that just comes with the territory of copying people.

If you disagree with anything I've said above, what do you disagree with? And, again, what do you mean by saying you're "pretty sure" that you would experience the future of the non-copied version?

Namely, if I consider this procedure as an empirical experiment, from my first person perspective I don't get any new / unexpected observation compared to say just sitting in an ordinary room. Even if I were to go and find my copy, my experience would again be like meeting a different person which just happens to look like me and which claims to have similar memories up to the point when I entered the copying room. There would be no way to verify or to view things from their first person perspective.

Sure. But is any of this Bayesian evidence against the view I've outlined above? What would it feel like, if the copy were another version of yourself? Would you expect that you could telepathically communicate with your copy and see things from both perspectives at once, if your copies were equally "you"? If so, why?

On the contrary, I would be wary to, say, kill myself or to be destroyed after the copying procedure, since no change will have occurred to my first person perspective, and it would thus seem less likely that my "experience" would somehow survive because of my copy.

Shall we make a million copies and then take a vote? :)

I agree that "I made a non-destructive software copy of myself and then experienced the future of my physical self rather than the future of my digital copy" is nonzero Bayesian evidence that physical brains have a Cartesian Soul that is responsible for the brain's phenomenal consciousness; the Cartesian Soul hypothesis does predict that data. But the prior probability of Cartesian Souls is low enough that I don't think it should matter.

You need some prior reason to believe in this Soul in the first place; the same as if you flipped a coin, it came up heads, and you said "aha, this is perfectly predicted by the existence of an invisible leprechaun who wanted that coin to come up heads!". Losing a coinflip isn't a surprising enough outcome to overcome the prior against invisible leprechauns.
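In odds form, the coin-flip point above can be sketched roughly as follows. Here L stands for the invisible-leprechaun hypothesis. The prior odds of 10^-9 are a made-up number chosen purely for illustration; the only figure taken from the example is the likelihood ratio of 2, since the leprechaun hypothesis predicts heads with certainty and a fair coin predicts heads with probability 1/2.

```latex
\frac{P(L \mid \text{heads})}{P(\neg L \mid \text{heads})}
  = \frac{P(\text{heads} \mid L)}{P(\text{heads} \mid \neg L)}
    \times \frac{P(L)}{P(\neg L)}
  = \frac{1}{1/2} \times 10^{-9}
  = 2 \times 10^{-9}
```

An observation with a likelihood ratio of 2 only doubles the odds, so a hypothesis that started out extremely improbable stays extremely improbable.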

and it would also force me to accept that even a copy where the "circuit" is made of water pipes and pumps, or gears and levers, also has an actual, first person experience as "me", as long as the appropriate computations are being carried out.

Why wouldn't it? What do you have against water pipes?

Wouldn't it follow that in the same way you anticipate the future experiences of the brain that you "find yourself in" (i.e. the person reading this) you should anticipate all experiences, i.e. that all brain states occur with the same kind of me-ness/vivid immediacy?

What's the empirical or physical content of this belief?

I worry that this may be another case of the Cartesian Ghost rearing its ugly head. We notice that there's no physical thingie that makes the Ghost more connected to one experience or the other; so rather than exorcising the Ghost entirely, we imagine that the Ghost is connected to every experience simultaneously.

But in fact there is no Ghost. There's just a bunch of experience-moments implemented in brain-moments.

Some of those brain-moments resemble other brain-moments, either by coincidence or because of some (direct or indirect) causal link between the brain-moments. When we talk about Brain-1 "anticipating" or "becoming" a future brain-state Brain-2, we normally mean things like:

  • There's a lawful physical connection between Brain-1 and Brain-2, such that the choices and experiences of Brain-1 influence the state of Brain-2 in a bunch of specific ways.
  • Brain-2 retains ~all of the memories, personality traits, goals, etc. of Brain-1.
  • If Brain-2 is a direct successor to Brain-1, then typically Brain-2 can remember a bunch of things about the experience Brain-1 was undergoing.

These are all fuzzy, high-level properties, which admit of edge cases. But I'm not seeing what's gained by therefore concluding "I should anticipate every experience, even ones that have no causal connection to mine and no shared memories and no shared personality traits". Tables are a fuzzy and high-level concept, but that doesn't mean that every object in existence is a table. It doesn't even mean that every object is slightly table-ish. A photon isn't "slightly table-ish", it's just plain not a table.

Which just means, all brain states exist in the same vivid, for-me way, since there is nothing further to distinguish between them that makes them this vivid, i.e. they all exist HERE-NOW.

But they don't have the anticipation-related properties I listed above; so what hypotheses are we distinguishing by updating from "these experiences aren't mine" to "these experiences are mine"?

Maybe the update that's happening is something like: "Previously it felt to me like other people's experiences weren't fully real. I was unduly selfish and self-centered, because my experiences seemed to me like they were the center of the universe; I abstractly and theoretically knew that other people have their own point of view, but that fact didn't really hit home for me. Then something happened, and I had a sudden realization that no, it's all real."

If so, then that seems totally fine to me. But I worry that the view in question might instead be something tacitly Cartesian, insofar as it's trying to say "all experiences are for me" -- something that doesn't make a lot of sense to say if there are two brain states on opposite sides of the universe with nothing in common and nothing connecting them, but that does make sense if there's a Ghost the experiences are all "for".

As a test, I asked a non-philosopher friend of mine what their view is. Here's a transcript of our short conversation: 

I was a bit annoyingly repetitive with trying to confirm and re-confirm what their view is, but I think it's clear from the exchange that my interpretation is correct at least for this person.

Is there even anybody claiming there is an experiential difference?

Yep! Ask someone with this view whether the current stream of consciousness continues from their pre-uploaded self to their post-uploaded self, like it continues when they pass through a doorway. The typical claim is some version of "this stream of consciousness will end, what comes next is only oblivion", not "oh sure, the stream of consciousness is going to continue in the same way it always does, but I prefer not to use the English word 'me' to refer to the later parts of that stream of consciousness".

This is why the disagreement here has policy implications: people with different views of personal identity have different beliefs about the desirability of mind uploading. They aren't just disagreeing about how to use words, and if they were, you'd be forced into the equally "uncharitable" perspective that someone here is very confused about how relevant word choice is to the desirability of uploading.

The alternative to this is that there is a disagreement about the appropriate semantic interpretation/analysis of the question. E.g. about what we mean when we say "I will (not) experience such and such". That seems more charitable than hypothesizing beliefs in "ghosts" or "magic".

I didn't say that the relevant people endorse a belief in ghosts or magic. (Some may do so, but many explicitly don't!)

It's a bit darkly funny that you've reached for a clearly false and super-uncharitable interpretation of what I said, in the same sentence you're chastising me for being uncharitable! But also, "charity" is a bad approach to trying to understand other people, and bad epistemology can get in the way of a lot of stuff.
