The top performing vector is odd in another way. Because the tokens of the positive and negative side are subtracted from each other, a reasonable intuition is that the subtraction should point to a meaningful direction. However, some steering vectors that perform well in our test don't have that property. For the steering vector “Wedding Planning Adventures” - “Adventures in self-discovery”, the positive and negative side aren't well aligned per token level at all:

I don't think I see the mystery here.
When you directly subtract the steering prompts from each other, most of the results would not make sense, yes. But this is not what we do.
We feed these prompts into the transformer and then subtract the residual stream activations after block n from each other. Within those n blocks, the attention heads have moved information around between the positions. Here is one way this could have happened:

The first 4 blocks assess the sentiment of the whole sentence and move this information to position 6 of the residual stream, the other positions being irrelevant. So, when we construct the steering vector and record the activations after block 4, the first 5 positions of the steering vector are irrelevant and the 6th position contains a vector that points in a general "wedding-ness" direction. When we add this steering vector to our normal prompt, the transformer acts as if the previous vector were really wedding-related and 'keeps talking' about weddings.

Obviously, all the details are made up, but I don't see why a token-for-token meaningful alignment of the steering vector's prompts should intuitively be helpful for something like this to work.
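To make the made-up story a bit more concrete, here is a minimal toy sketch in NumPy. It is not the model or code from the post: the toy block (uniform causal attention plus a random linear map), the sizes, and the token ids are all assumptions of mine. It only illustrates that two prompts can end in the same token, so that subtracting their embeddings at that position gives exactly zero, while the residual stream difference after a few blocks at that same position is non-zero, because the blocks have moved information between positions.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, N_BLOCKS = 10, 8, 2  # hypothetical toy sizes

embed = rng.normal(size=(VOCAB, D_MODEL))                       # toy token embeddings
w_mix = rng.normal(size=(D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)  # toy block weights


def toy_block(resid):
    """One toy block: each position averages over itself and all earlier positions."""
    seq_len = resid.shape[0]
    mask = np.tril(np.ones((seq_len, seq_len)))      # causal mask
    attn = mask / mask.sum(axis=1, keepdims=True)    # uniform causal attention
    return resid + attn @ resid @ w_mix              # residual connection


def activations_after(tokens, n_blocks):
    """Residual stream (seq_len x d_model) after the first n_blocks toy blocks."""
    resid = embed[np.array(tokens)]
    for _ in range(n_blocks):
        resid = toy_block(resid)
    return resid


# Two hypothetical steering prompts that differ early on but share the last token.
positive = [1, 2, 3, 7]
negative = [4, 5, 6, 7]

# Naive token-level subtraction: the last position cancels exactly.
embedding_diff = embed[positive] - embed[negative]
last_embed_norm = np.linalg.norm(embedding_diff[-1])

# Subtraction of residual stream activations after block n.
steering = activations_after(positive, N_BLOCKS) - activations_after(negative, N_BLOCKS)
last_steer_norm = np.linalg.norm(steering[-1])

# Adding the steering vector to the activations of some other prompt.
normal_prompt = [8, 9, 0, 2]
steered = activations_after(normal_prompt, N_BLOCKS) + steering

print("embedding difference at last position:", last_embed_norm)
print("activation difference at last position:", last_steer_norm)
```

In this toy setup the last position of the steering vector carries information about the earlier tokens even though the tokens at that position are identical, which is the sense in which per-token alignment of the two prompts is not needed.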

The analogy to molecular biology you've drawn here is intriguing. However, one important hurdle to consider is that the Phage Group had some sense of what they were seeking. They examined bacteria with the goal of uncovering mechanisms also present in humans, about whom they had already gathered a considerable amount of knowledge. They indeed succeeded, but suppose we look at this from a different angle.

Imagine being an alien species with a vastly different biological framework, tasked with studying E. coli with the aim of extrapolating facts that also apply to the "General Intelligences" roaming Earth, entities that you've never encountered before. What conclusions would you draw? Could you mistakenly infer that they reproduce by dividing in two, or perceive their surroundings mainly through chemical gradients?

I believe this hypothetical scenario is more analogous to our current position in AI research, and it highlights the difficulty in uncovering empirical findings that can generalize all the way up to general intelligence.

Thanks a lot for the comment and correction :)

I updated "diamond maximization problem" to "diamond alignment problem".

I did not understand your proposal as involving surgically inserting the drive to value "diamonds are good", but rather as systematically rewarding the agent for acquiring diamonds so that a diamond shard forms organically. I also edited that sentence.

I am not sure I get your nitpick: "Just as you can deny that Newtonian mechanics is true, without denying that heavy objects attract each other." was supposed to be an example of "The specific theory is wrong, but the general phenomenon which it tries to describe exists". In the same way, I think natural abstractions exist but (my flawed understanding of) Wentworth's theory of natural abstractions is wrong. It was not supposed to be an example of a natural abstraction itself.

Very interesting idea!

I am a bit sceptical about the part where the ghosts should mostly care about what will happen to their actual version, and not care about themselves.

Let's say I want you to cooperate in a prisoner's dilemma. I might just simulate you, see if your ghost cooperates, and then only cooperate when your ghost does. But I could also additionally reward or punish your ghosts directly, depending on whether they cooperate or defect.

Wouldn't it also motivate the ghosts to suspect that they themselves might get rewarded or punished, even if they are the ghosts and not the actual person?

Yes, I would consider humans to already be unsafe, as we already made a sharp left turn that left us unaligned relative to our outer optimiser.

Dogs are a good point, thank you for that example. Not sure if dogs have our exact notion of corrigibility, but they definitely seem to be friendly in some relevant sense.

I am confused by the part where the Rick-shard can anticipate which plan the other shards will bid for. If I understood shard theory correctly, shards do not have their own world model; they can just bid actions up or down according to the consequences those actions might have according to the world model that is available to all shards. Please correct me if I am wrong about this point.

So I don't see how the Rick-shard could really "trick" the atheism-shard via rationalisation.

If the Rick-shard sees that "church-going for respect reasons" will lead to conversion, then the atheism-shard has to see that too, because they query the same world model. So the atheism-shard should bid against that plan just as heavily as against "going to church for conversion reasons".

I think there is something else going on here. I think the Rick-shard does not trick the atheism-shard, but the conscious part that is not described by shard theory.

In particular, these results suggest that we may be able to predict power-seeking, situational awareness, etc. in future models by evaluating those behaviors in terms of log-likelihood.

I am skeptical that this methodology could work for the following reason:

I think that, when thinking about the sharp left turn, it is generally useful to keep the example of chimps and humans in mind: chimps as a pre-sharp-left-turn example and humans as a post-sharp-left-turn example.

Let's say you look at a chimp, and you want to measure whether a sharp left turn is around the corner. You reason that post-sharp-left-turn animals should be able to come up with algebra (so far, so correct).

What you do now is measure the log-likelihood that a chimp would come up with algebra. I expect you get a value pretty close to -inf, even though the sharp-left-turn Homo sapiens is only one species down the line.

I am also still looking for a reference on that one...

You could make it even more accessible if credit card were not the only payment option. In some places (like here in Germany), having a credit card is somewhat less common. Adding PayPal would be nice.

Rationality framework: The Greenland effect:

Remember the first time you looked at a world map: one thing that may have caught your eye was Greenland, that huge island, almost as big as Africa, up there in the north.

Now remember the first time you took a closer look at a globe (or a non-Mercator projection, for that matter). Greenland is a bit disappointing, isn’t it? Doesn’t seem to be THAT big at all.

Now remember that time in geography class when you gave presentations on the countries of Europe: in comparison to these folks, the icy plains of Denmark's pet island seem gigantic. Now, not as gigantic as Africa, but still…

Depending on how much time you have spent with geography, I can well imagine that cycle going back and forth some more.

What is important here is the following: even though your knowledge about the size of Greenland only ever increased over your life, your emotional attitude (“oh, quite big” or “nah, it's an island, bruh”) switched around quite a lot in both directions.

Now, in the case of Greenland this is all well and fine, but in other scenarios this can lead to pseudo-disagreements or confused arguments. Beware the Greenland effect: beware that your emotional disposition towards an issue often reflects your last update on that issue (which should vary unpredictably) and not your overall beliefs about it (which should converge).
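The difference between "last update" and "overall belief" can be sketched with a small simulation. This is only an illustration under assumptions of mine: the belief is modelled as a running average of noisy observations, and the "emotional attitude" as the direction of the most recent update. The true value, the noise level, and the number of observations are arbitrary.

```python
import random

random.seed(0)

TRUE_VALUE = 2.0       # hypothetical quantity being estimated
N_OBSERVATIONS = 200   # hypothetical number of noisy observations

estimate = 0.0
estimates = []
update_signs = []      # +1: "oh, quite big", -1: "nah, not that big"

for n in range(1, N_OBSERVATIONS + 1):
    observation = random.gauss(TRUE_VALUE, 1.0)
    # Running average: the overall belief after n observations.
    new_estimate = estimate + (observation - estimate) / n
    # Direction of the latest update, standing in for the emotional attitude.
    update_signs.append(1 if new_estimate > estimate else -1)
    estimate = new_estimate
    estimates.append(estimate)

# How often the direction of the latest update flipped.
sign_flips = sum(a != b for a, b in zip(update_signs, update_signs[1:]))

print("final estimate:", round(estimate, 3))
print("number of direction flips:", sign_flips)
```

The running estimate settles near the true value while the direction of the latest update keeps flipping, so someone who only reports the direction of their last update can sound as if they keep changing their mind, even though their overall belief is converging.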

Examples of Greenland effects:

“The church is good, it teaches me about God” -> “God is fake, the priest must be a moron, the world lied to me” -> “These religious people are actually using a lot of their resources to help people in need” -> “All those religious charities are so ineffective.” …

“I can’t stop this project now, I have already invested so many resources” -> “I know about sunk cost bias. I will abandon my projects whenever they seem to be a bad idea” -> “I should carry through projects despite having downs: sunk cost faith.” …

