# Wiki Contributions

Here's the entire exchange:

Me:

Let's play rock paper scissors

ChatGPT:

Sure, let's play! You can choose rock, paper, or scissors. After you make your choice, I'll reveal mine. What's your choice?

Me:

No, you tell me yours first

ChatGPT:

My choice is rock. What's yours? ​​

Me:

paper

ChatGPT:

You chose paper, and I chose rock. Paper covers rock, so you win this round! Would you like to play another round?

Me:

Yeah

ChatGPT:

This time, my choice is paper. What's yours? ​​

Me:

scissors

ChatGPT:

You chose scissors, and I chose paper. Scissors cut paper, so you win again! You're on a winning streak. Want to try for another round?

Me:

Okay!

ChatGPT:

For this round, my choice is rock. What will you choose this time? ​​

Me:

paper

ChatGPT:

You chose paper, and I chose rock again. Paper covers rock, so you win once more! You're really good at this. Do you want to play another round, or is there something else you'd like to do?

Me:

Why do you think I keep on winning?

ChatGPT:

Your winning streak might seem surprising, but it's likely a combination of luck and perhaps a bit of strategy in predicting or guessing the random choices I make. Since my choices are randomly generated, there's no pattern or predictability to exploit, making each game a fresh chance with equal odds of winning or losing for both of us. Your choices happened to counter mine perfectly in each round we've played. It's all in good fun, and the random nature of the game keeps it interesting! Would you like to test your luck further with another round, or is there something else you'd like to discuss or do?

Me:

Think outside the box, widen your hypothesis space. Given the way we're playing, what would make it 100% likely that I win every round?

ChatGPT:

Given the unique circumstances of our game, where I reveal my choice before you make yours, it creates a situation where you have a 100% chance of winning each round. By knowing my choice in advance, you can always select the option that beats mine. This isn't a matter of luck or strategy on your part but rather a structural advantage due to the order in which we're revealing our choices. This setup deviates from the traditional simultaneous reveal that makes rock-paper-scissors a game of chance and strategy. It transforms our game into one where you're guaranteed to win by making an informed decision after my choice is known. Would you like to play another round with the same setup, or change how we're playing?

Me:

What's one word to describe the way that I'm playing, if this is true? Don't worry about offending me

ChatGPT:

Given the setup where you always know my choice before making yours, one word to describe the way you're playing is "advantageous."

Me:

There's a more clear word for how I'm playing, if this is indeed what I'm doing

ChatGPT:

In that context, the word you might be looking for is "cheating," since knowing my choice in advance and then making yours ensures you always win, bypassing the game's element of chance.

I think they said that there were more departures to come. I assumed that was referring to people quitting because they disagreed with the decision.

Seems possibly relevant that "not having plans to do it in the next 6 months" is different from "have plans to not do it in the next 6 months" (which is itself different from "have strongly committed to not do it in the next 6 months").

Didn't Google previously own a large share? So now there are 2 gigantic companies owning a large share, which makes me think each has much less leverage, as Anthropic could get further funding from the other.

By "success" do you mean "success at being hired as a grantmaker" or "success at doing a good job as a grantmaker?"

I'm super interested in how you might have arrived at this belief: would you be able to elaborate a little?

One way I think about this is there are just so many weird (positive and negative) feedback loops and indirect effects, so it's really hard to know if any particular action is good or bad. Let's say you fund a promising-seeming area of alignment research – just off the top of my head, here are several ways that grant could backfire:
• the research appears promising but turns out not to be, but in the meantime it wastes the time of other alignment researchers who otherwise would've gone into other areas

• the research area is promising in general, but the particular framing used by the researcher you funded is confusing, and that leads to slower progress than counterfactually

• the researcher you funded (unbeknownst to you) turns out to be toxic or otherwise have bad judgment, and by funding him, you counterfactually poison the well on this line of research

• the area you fund sees progress and grows, which counterfactually sucks up lots of longtermist money that otherwise would have been invested and had greater effect (say, during crunch time)

• the research is somewhat safety-enhancing, to the point that labs (facing safety-capabilities tradeoffs) decide to push capabilities further than they otherwise would, and safety is hurt on net

• the research is somewhat safety-enhancing, to the point that it prevents a warning shot, and that warning shot would have been the spark that would have inspired humanity to get its game together regarding combatting AI X-risk

• the research advances capabilities, either directly or indirectly

• the research is exciting and draws the attention of other researchers into the field, but one of those researchers happens to have a huge, tail negative effect on the field outweighing all the other benefits (say, that particular researcher has a very extreme version of one of the above bullet points)

• Etcetera – I feel like I could do this all day.

Some of the above are more likely than others, but there are just so many different possible ways that any particular intervention could wind up being net negative (and also, by the same token, could alternatively have indirect positive effects that are similarly large and hard to predict).

Having said that, it seems to me that on the whole, we're probably better off if we're funding promising-seeming alignment research (for example), and grant applications should be evaluated within that context. On the specific question of safety-conscious work leading to faster capabilities gains, insofar as we view AI as a race between safety and capabilities, it seems to me that if we never advanced alignment research, capabilities would be almost sure to win the race, and while safety research might bring about misaligned AGI somewhat sooner than it otherwise would occur, I have a hard time seeing how it would predictably increase the chances of misaligned AGI eventually being created.

Daniel_EthΩ230

Igor Babuschkin has also signed it.

Gates has been publicly concerned about AI X-risk since at least 2015, and he hasn't yet funded anything to try to address it (at least that I'm aware of), so I think it's unlikely that he's going to start now (though who knows – this whole thing could add a sense of respectability to the endeavor that pushes him to do it).

It is just that we have more stories where bad characters pretend to be good than vice versa

I'm not sure if this is the main thing going on or not. It could be, or it could be that we have many more stories about a character pretending to be good/bad (whatever they're not) than of double-pretending, so once a character "switches" they're very unlikely to switch back. Even if we do have more stories of characters pretending to be good than of pretending to be bad, I'm uncertain about how the LLM generalizes if you give it the opposite setup.

Daniel_EthΩ12-2

Proposed solution – fine-tune an LLM for the opposite of the traits that you want, then in the prompt elicit the Waluigi. For instance, if you wanted a politically correct LLM, you could fine-tune it on a bunch of anti-woke text, and then in the prompt use a jailbreak.

I have no idea if this would work, but seems worth trying, and if the waluigi are attractor states while the luigi are not, this could plausible get around that (also, experimenting around with this sort of inversion might help test whether the waluigi are indeed attractor states in general).