pjeby

Software developer and mindhacking instructor. Interested in intelligent feedback (especially of the empirical testing variety) on my new (temporarily free) ebook, A Minute To Unlimit You.

Wiki Contributions

Comments

we don’t think that shutdown-seeking avoids every possible problem involved with reward misspecification

Seems like this is basically the alignment problem all over again, with the complexity just moved to "what does it mean to 'shut down' in the AI's inner model".

For example, if the inner-aligned goal is to prevent its own future operation, it might choose to say, start a nuclear war so nobody is around to start it back up, repair it, provide power, etc.

it doesn't have the kind of insight into its motives that we do

Wait, human beings have insight into their own motives that's better than GPTs have into theirs? When was the update released, and will it run on my brain? ;-)

Joking aside, though, I'd say the average person's insight into their own motives is most of the time not much better than that of a GPT, because it's usually generated in the same way: i.e. making up plausible stories.

What I was pointing out is that the barrier is asymmetrical: it's biased towards AIs with more-easily-aligned utility functions. A paperclipper is more likely to be able to create an improved paperclipper that it's certain enough will massively increase its utility, while a more human-aligned AI would have to be more conservative.

In other words, this paper seems to say, "if we can create human-aligned AI, it will be cautious about self-improvement, but dangerously unaligned AIs will probably have no issues."

The first and most obvious issue here is that an AI that "solves alignment" sufficiently well to not fear self-improvement is not the same as an AI that's actually aligned with humans. So there's actually no protection there at all.

In fact, the phenomenon described here seems to make it more likely that an unaligned AI will be fine with self-improving, because the simpler the utility function the easier time it has guaranteeing the alignment of the improved version!

Last, but far from least, self-improvement of the form "get faster and run on more processors" is hardly challenging from an alignment perspective. And it's far from unlikely an AI could find straightforward algorithmic improvements that it could mathematically prove safe relative to its own utility function.

In short, the overall approach seems like wishful thinking of the form, "maybe if it's smart enough it won't want to kill us."

Nope - expression of feelings of friendship isn't part of the explicit structure of friendship either. Lots of people are friends without saying anything about it.

All I've really said here is that the difference between VCFWB and a "romantic" relationship is difficult to discern, especially from the outside, and given that the nature of "romance" is both internal and optional to the relationship. If a pair of VCFWB's stop having sex or hanging out or cuddling, it's hard to say they're still in a VCFWB relationship. But if people in a "romantic" relationship stop acting romantic with one another, they can still be said to be in a "romantic" relationship.

The overall point here is that describing "romantic" as if it is a property of a relationship rather than a property of people's feelings is not a good carving of reality at the joints. People can have romantic feelings (or expression thereof) without having any relationship at all, let alone one with reciprocal romantic feelings.

(Indeed, romantic feelings are quite orthogonal to the type and nature of the relationship itself: the term "friend zone" highlights this point.)

So, from an epistemic view, my take is that it's not only useless but confusing to describe a relationship as being romantic, since it's not meaningfully a property of the relationship, but rather a set of feelings that come and go for (and about) parties in the relationship. How many feelings must happen? How often? Must they be reciprocal? Is it still romantic if neither party feels that way any more? What if they didn't start out that way but are now?

I think that the bundle of things called "romantic relationship" are much better described structurally in terms of behavior, in order to avoid cultural projections and mismatched expectations between partners. One person might use it to mean "marriage for life", while another might mean "passionate weekend affair", after all! These more structurally-defined relationships can both be labeled a "romantic relationship" but this does not do a good job of defining a shared vision and expectations for the parties in said relationship.

IOW, I believe that everyone is better off taboo-ing the phrase "romantic relationship" in any serious discussion about relationships -- especially a relationship they'll personally be involved in!

I'm assuming "relationship" here means something like "the explicit structure and boundaries of behavior as agreed upon by the parties" - friends, friends with benefits, marriage, polycule etc. People's romantic feelings and expressions are rarely something that's part of a relationship's explicit structure, even if people often have a lot of implicit expectations about them. (And any of those named structures can include romantic feelings, or a lack thereof.)

Direct quotes:

Which seems to give me just as much control[4] over the past as I have over the future.

And the footnote:

whatever I can do to make my world the one with FA in it, I can do to make my world the one with HA in it.

This is only trivially true in the sense of saying "whatever I can do to arrive at McDonalds, I can do to make my world the one where I walked in the direction of McDonalds". This is ordinary reality and nothing to be "bothered" by -- which obviates the original question's apparent presupposition that something weird is going on.

If there's something incoherent or contradictory about "either the propositions 'HA happened, A is the current state, I will choose CA, FA will happen' are all true, or the propositions 'HB happened, B is the current state, I will choose CB, FB will happen' are all true; the ones that aren't all true are all false", can you be specific about what it is?

It's fine so long as HA/A and HB/B are understood to be the events and states during the actual decision-making process, and not referencing anything before that point, i.e.:

  • H -> S -> (HA ->A) -> CA -> FA
  • H -> S -> (HB ->B) -> CB -> FB

Think of H as events happening in the world, then written onto a read-only SD card labeled "S". At this moment, the contents of S are already fixed. S is then fed into a device which will then operate upon the data and reveal its interpretation of the data by outputting the text "A" or "B". The history of events occurring inside the device will be different according to whatever the content of the SD card was, but the content of the card isn't "revealed" or "chosen" or "controlled" by this process.

How is 'revealing something about the past' retrocausal?

It isn't; but neither is it actually revealing anything about the past that couldn't have been ascertained prior to executing the decision procedure or in parallel with it. The decision procedure can only "reveal" the process and results of the decision procedure itself, since that process and result were not present in the history and state of the world before the procedure began.

I don't know how to clarify this, because I don't understand why you think I am. I do think we can narrow down a 'moment of decision' if we want to, meaning e.g. the point in time where the agent becomes conscious of which action they will take, or when something that looks to us like a point of no return is reached. But obviously the decision process is a process, and I don't get why you think I don't understand or have failed to account for this.

Here is the relevant text from your original post:

State A: Past events HA have happened, current state of the world is A, I will choose CA, future FA will happen.

State B: Past events HB have happened, current state of the world is B, I will choose CB, future FB will happen.

These definitions clearly state "I will choose" -- i.e., the decision process has not yet begun. But if the decision process hasn't yet begun, then there is only one world-state, and thus it is meaningless to give that single state two names (HA/A and HB/B).

Before you choose, you can literally examine any aspect of the current world-state that you like and confirm it to your heart's content. You already know which events have happened and what the state of the world is, so there can't be two such states, and your choice does not "reveal" anything about the world-state that existed prior to the start of the decision process.

This is why I'm describing HA/A and HB/B in your post as incoherent, and assuming that this description must be based on an instantaneous, outside-reality concept of "choice", which seems to be the only way the stated model can make any sense (even in its own terms).

In contrast, if you label every point of the timeline as to what is happening, the only logically coherent timeline is H -> S -> ( H[A/B] -> A/B ) -> C[A/B] -> F[A/B] -- where it's obvious that this is just reality as normal, where the decision procedure neither "chooses" nor "reveals" anything about the history of the world prior to its beginning execution. (IOW, it can only "reveal" or "choose" or "control" the present and future, not the past.)

But if you were using that interpretation, then your original question appears to have no meaning: what would it mean to be bothered that the restaurant you eat at today will "reveal" which way you flipped the coin you used to decide?

I'm not sure how much that rephrasing would change the rest of your answer

Well, it makes the confusion more obvious, because now it's clearer that HA/A and HB/B are complete balderdash. This will be apparent if you try to unpack exactly what the difference between them is, other than your choice. (Specifically, the algorithm used to compute your choice.)

Let's say I give you a read-only SD card containing some data. You will insert this card into a device that will run some algorithm and output "A" or "B". The data on the card will not change as a result of the device's output, nor will the device's output retroactively cause different data to have been entered on the card! All that will be revealed is the device's interpretation of that data. To the extent there is any uncertainty about the entire process, it's simply that the device is a black box - we don't know what algorithm it uses to make the decision.

So, tl;dr: the choice you make does not reveal anything about the state or history of the world (SD card), except for the part that is your decision algorithm's implementation. If we draw a box around "the parts of your brain that are involved in this decision", then you could say that the output choice tells you something about the state and history of those parts of your brain. But even there, there's no backward causality -- it's again simply resolving your uncertainty about the box, not doing anything to the actual contents, except to the extent that running the decision procedure makes changes to the device's state.

broadly I'm not sure whether you are defending compatibilism or hard determinism

As other people have mentioned, rationalists don't typically think in those terms. There isn't actually any difference between those two ideas, and there's really nothing to "defend". As with a myriad other philosophical questions, the question itself is just map-territory confusion or a problem with word definitions.

Human brains have lots of places where it's easy to slip on logical levels and end up with things that feel like questions or paradoxes when in fact what's going on is really simple once you put back in the missing terms or expand the definitions properly. (This is often because brains don't tend to include themselves as part of reality, so this is where the missing definitions can usually be found!)

In the particular case you've presented, that tendency manifests in the part where no part of your problem specification explicitly calls out the brain or its decision procedures as components of the process. Once you include those missing pieces, it's straightforward to see that the only place where hypothetical alternative choices exist is in the decider's brain, and that no retrocausality is involved.

In the parts of reality that do not include your brain, they are already in some state and already have some history. When you make a decision, you already know what state and history exist for those parts of reality, at least to the extent that state and history is decision-relevant. What you don't know is which choice you will make.

You then can imagine CA and CB -- i.e., picture them in your brain -- as part of running your decision algorithm. Running this algorithm then makes changes to the history and state of your brain -- but not to any of the inputs that your brain took in.

Suppose I follow the following decision procedure:

  1. Make a list of alternatives
  2. Give them a score from 1-10 and sort the list
  3. Flip a coin
  4. If it comes up heads, choose the first item
  5. If it comes up tails, cross off that item and go back to step 3

None of these steps is retrocausal, in the sense of "revealing" or "choosing" anything about the past. As I perform these steps, I am altering H and S of my brain (and workspace) until a decision is arrived at. At no point is there an "A" or "B" here, except in the contents of the list.

Since there is a random element I don't even know what choice I will make, but the only thing that was "revealed" is my scoring and which way the coin flips went -- all of which happened as I went through the process. When I get to the "choice" part, it's the result of the steps that went before, not something that determines the steps.

This is just an example, of course, but it literally doesn't matter what your decision procedure is, because it's still not changing the original inputs of the process. Nothing is retroactively chosen or revealed. Instead, the world-state is being changed by the process of making the decision, in normal forward causality.

As soon as you fully expand your terms to any specific decision procedure, and include your brain as part of the definition of "history" and "state", the illusion of retrocausality vanishes.

A pair of timelines, showing two possible outcomes, with the decision procedure parenthesized:

  • H -> S -> (HA -> SA) -> CA
  • H -> S -> (HB -> SB) -> CB

The decision procedure operates on history H, state S as its initial input. During the process it will produce a new history and final state, following some path that will result in CA or CB. But CA and CB do not reveal or "choose" anything about the H or S that existed prior to beginning the decision procedure. Instead, the steps go forward in time creating HA or HB as they go along.

It's as if you said, "isn't it weird, how if I flip a coin and then go down street A or B accordingly, coming to whichever restaurant is on that street, that the cuisine of the restaurant I arrive at reveals which way my coin flip went?"

No. No. It's not weird at all! That's what you should expect to happen! The restaurant you arrived at does not determine the coin flip, the coin flip determines the restaurant.

As soon as you make the decision procedure a concrete procedure -- be it flipping a coin or otherwise -- it should hopefully become clear that the choice is the output of the steps taken; the steps taken are not retroactively caused by the output of the process.

The confusion in your original post is that you're not treating "choice" as a process with steps that produce an output, but rather as something mysterious that happens instantaneously while somehow being outside of reality. If you properly place "choice" as a series of events in normal spacetime, there is no paradox or retrocausality to be had. It's just normal things happening in the normal order.

LW compatibilism isn't believing that choice magically happens outside of spacetime while everything else happens deterministically, but rather including your decision procedure as part of "things happening deterministically".

The confusion is resolved if you realize that both A and B here are mental simulations. When you observe the ball moving, it allows you to discard some of your simulations, but this doesn't affect the past or future, which already were whatever they were.

To view the ball as affecting the past is to confuse the territory (which already was in some definite state) with your map (which was in a state of uncertainty re: the territory).

Answer by pjebyMay 06, 20233-3

It seems to me that your confusion is contending there are two past/present states (HA+A / HB+B) when in fact reality is simply H -> S -> C. There is one history, one state, and one choice that you will end up making. The idea that there is a HA and HB and so on is wrong, since that history H has already happened and produced state S.

Further, C is simply the output of your decision algorithm, which result we don't know until the algorithm is run. Your choice could perhaps be said to reveal something previously not known about H and S, but it doesn't distinguish between two histories or states, only your state of information about the single history/state that already existed. (It also doesn't determine anything about H and S that isn't "this decision algorithm outputs C under condition S".)

Indeed, even presenting it as if there is actually a CA and CB from which you will choose is itself inaccurate: you're already going to choose whatever you're going to choose, and that output is already determined even if you have yet to run the algorithm that will let you find out what that choice is. The future states CA and CB never actually exist either -- they are simulations you create in your mind as part of the decision algorithm.

Or to put it another way, since the future state C is a complex mix of your choice and other events taking place in the world, it will not actually match whatever simulated option you thought about. So the entire A/B disjunction throughout is about distinctions that only exist in your mental map, not in the territory outside your head.

So, the real world is H->S->C, and in your mind, you consider simulated or hypothetical A's and B's. Your decision process resolves which of A and B you feel accurately reflects H/S/C, but cannot affect anything but C. (And even there, the output was already determinable-in-principle before you started -- you just didn't know what the output was going to be.)

Load More