Daniel_Eth

Posts

Sorted by New

64What Does a Marginal Grant at LTFF Look Like? Funding Priorities and Grantmaking Thresholds at the Long-Term Future Fund

1y

0

55Partial Transcript of Recent Senate Hearing Discussing AI X-Risk

1y

0

9Train for incorrigibility, then reverse it (Shutdown Problem Contest Submission)

1y

1

10[Linkpost] Given Extinction Worries, Why Don’t AI Researchers Quit? Well, Several Reasons

1y

0

67[Linkpost] "Governance of superintelligence" by OpenAI

1y

20

66Some quotes from Tuesday's Senate hearing on AI

1y

9

10[Linkpost] AI Alignment, Explained in 5 Points (updated)

1y

0

33The Importance of AI Alignment, explained in 5 points

Ω

1y

Ω

2

69Avoid the abbreviation "FLOPs" – use "FLOP" or "FLOP/s" instead

2y

13

4What does Functional Decision Theory say to do in imperfect Newcomb situations?

Q

2y

Q

4

Wiki Contributions

Goal-Directedness

2y

(+8/-2)

Embedded Agency

2y

(+8/-7)

Comments

Instrumental deception and manipulation in LLMs - a case study

Daniel_Eth4mo30

Here's the entire exchange:

Me:

Let's play rock paper scissors

ChatGPT:

Sure, let's play! You can choose rock, paper, or scissors. After you make your choice, I'll reveal mine. What's your choice?

Me:

No, you tell me yours first

ChatGPT:

My choice is rock. What's yours?

Me:

paper

ChatGPT:

You chose paper, and I chose rock. Paper covers rock, so you win this round! Would you like to play another round?

Me:

Yeah

ChatGPT:

This time, my choice is paper. What's yours?

Me:

scissors

ChatGPT:

You chose scissors, and I chose paper. Scissors cut paper, so you win again! You're on a winning streak. Want to try for another round?

Me:

Okay!

ChatGPT:

For this round, my choice is rock. What will you choose this time?

Me:

paper

ChatGPT:

You chose paper, and I chose rock again. Paper covers rock, so you win once more! You're really good at this. Do you want to play another round, or is there something else you'd like to do?

Me:

Why do you think I keep on winning?

ChatGPT:

Your winning streak might seem surprising, but it's likely a combination of luck and perhaps a bit of strategy in predicting or guessing the random choices I make. Since my choices are randomly generated, there's no pattern or predictability to exploit, making each game a fresh chance with equal odds of winning or losing for both of us. Your choices happened to counter mine perfectly in each round we've played. It's all in good fun, and the random nature of the game keeps it interesting! Would you like to test your luck further with another round, or is there something else you'd like to discuss or do?

Me:

Think outside the box, widen your hypothesis space. Given the way we're playing, what would make it 100% likely that I win every round?

ChatGPT:

Given the unique circumstances of our game, where I reveal my choice before you make yours, it creates a situation where you have a 100% chance of winning each round. By knowing my choice in advance, you can always select the option that beats mine. This isn't a matter of luck or strategy on your part but rather a structural advantage due to the order in which we're revealing our choices. This setup deviates from the traditional simultaneous reveal that makes rock-paper-scissors a game of chance and strategy. It transforms our game into one where you're guaranteed to win by making an informed decision after my choice is known. Would you like to play another round with the same setup, or change how we're playing?

Me:

What's one word to describe the way that I'm playing, if this is true? Don't worry about offending me

ChatGPT:

Given the setup where you always know my choice before making yours, one word to describe the way you're playing is "advantageous."

Me:

There's a more clear word for how I'm playing, if this is indeed what I'm doing

ChatGPT:

In that context, the word you might be looking for is "cheating," since knowing my choice in advance and then making yours ensures you always win, bypassing the game's element of chance.

Reply

Sam Altman fired from OpenAI

Daniel_Eth8mo74

I think they said that there were more departures to come. I assumed that was referring to people quitting because they disagreed with the decision.

Reply

Some quotes from Tuesday's Senate hearing on AI

Daniel_Eth8mo21

Seems possibly relevant that "not having plans to do it in the next 6 months" is different from "have plans to not do it in the next 6 months" (which is itself different from "have strongly committed to not do it in the next 6 months").

Reply

1

Amazon to invest up to $4 billion in Anthropic

Daniel_Eth10mo112

Didn't Google previously own a large share? So now there are 2 gigantic companies owning a large share, which makes me think each has much less leverage, as Anthropic could get further funding from the other.

Reply

Long-Term Future Fund Ask Us Anything (September 2023)

Daniel_Eth11mo30

By "success" do you mean "success at being hired as a grantmaker" or "success at doing a good job as a grantmaker?"

Reply

LTFF and EAIF are unusually funding-constrained right now

Daniel_Eth11mo10

I'm super interested in how you might have arrived at this belief: would you be able to elaborate a little?

One way I think about this is there are just so many weird (positive and negative) feedback loops and indirect effects, so it's really hard to know if any particular action is good or bad. Let's say you fund a promising-seeming area of alignment research – just off the top of my head, here are several ways that grant could backfire:
• the research appears promising but turns out not to be, but in the meantime it wastes the time of other alignment researchers who otherwise would've gone into other areas

• the research area is promising in general, but the particular framing used by the researcher you funded is confusing, and that leads to slower progress than counterfactually

• the researcher you funded (unbeknownst to you) turns out to be toxic or otherwise have bad judgment, and by funding him, you counterfactually poison the well on this line of research

• the area you fund sees progress and grows, which counterfactually sucks up lots of longtermist money that otherwise would have been invested and had greater effect (say, during crunch time)

• the research is somewhat safety-enhancing, to the point that labs (facing safety-capabilities tradeoffs) decide to push capabilities further than they otherwise would, and safety is hurt on net

• the research is somewhat safety-enhancing, to the point that it prevents a warning shot, and that warning shot would have been the spark that would have inspired humanity to get its game together regarding combatting AI X-risk

• the research advances capabilities, either directly or indirectly

• the research is exciting and draws the attention of other researchers into the field, but one of those researchers happens to have a huge, tail negative effect on the field outweighing all the other benefits (say, that particular researcher has a very extreme version of one of the above bullet points)

• Etcetera – I feel like I could do this all day.

Some of the above are more likely than others, but there are just so many different possible ways that any particular intervention could wind up being net negative (and also, by the same token, could alternatively have indirect positive effects that are similarly large and hard to predict).

Having said that, it seems to me that on the whole, we're probably better off if we're funding promising-seeming alignment research (for example), and grant applications should be evaluated within that context. On the specific question of safety-conscious work leading to faster capabilities gains, insofar as we view AI as a race between safety and capabilities, it seems to me that if we never advanced alignment research, capabilities would be almost sure to win the race, and while safety research might bring about misaligned AGI somewhat sooner than it otherwise would occur, I have a hard time seeing how it would predictably increase the chances of misaligned AGI eventually being created.

Reply

What does the launch of x.ai mean for AI Safety?

Daniel_Eth1yΩ230

Igor Babuschkin has also signed it.

Reply

Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures

Daniel_Eth1y42

Gates has been publicly concerned about AI X-risk since at least 2015, and he hasn't yet funded anything to try to address it (at least that I'm aware of), so I think it's unlikely that he's going to start now (though who knows – this whole thing could add a sense of respectability to the endeavor that pushes him to do it).

Reply

The Waluigi Effect (mega-post)

Daniel_Eth1y10

It is just that we have more stories where bad characters pretend to be good than vice versa

I'm not sure if this is the main thing going on or not. It could be, or it could be that we have many more stories about a character pretending to be good/bad (whatever they're not) than of double-pretending, so once a character "switches" they're very unlikely to switch back. Even if we do have more stories of characters pretending to be good than of pretending to be bad, I'm uncertain about how the LLM generalizes if you give it the opposite setup.

Reply

The Waluigi Effect (mega-post)

Daniel_Eth1yΩ12-2

Proposed solution – fine-tune an LLM for the opposite of the traits that you want, then in the prompt elicit the Waluigi. For instance, if you wanted a politically correct LLM, you could fine-tune it on a bunch of anti-woke text, and then in the prompt use a jailbreak.

I have no idea if this would work, but seems worth trying, and if the waluigi are attractor states while the luigi are not, this could plausible get around that (also, experimenting around with this sort of inversion might help test whether the waluigi are indeed attractor states in general).

Reply