Evan R. Murphy

Software engineer since 2010. Left Google in fall '21, now getting into independent AI alignment research.

Currently working on interpretability, trying to apply Circuits thread-style analysis to acoustic models.

Wiki Contributions


Action: Help expand funding for AI Safety by coordinating on NSF response

Judging by the voting and comments so far (both here and on the EA Forum crosspost), my sense is that many here support this effort, but some definitely have concerns. A few of the concerns are based on hardcore skepticism about academic research, and I'm not sure those are compatible with responding to the RfI. Many concerns, though, seem to be about this generating vague NSF grants that are in the name of AI safety but don't actually contribute to the field.

For these latter concerns, I wonder if there's a way we could resolve them by limiting the scope of topics in our NSF responses or giving them enough specificity. For example, what if we convinced the NSF to make grants only for mechanistic interpretability projects like the Circuits Thread? This is an area that most researchers in the alignment community seem to agree is useful; we just need a lot more people doing it to make substantial progress. And maybe there is less room to go adrift or mess up this kind of concrete, empirical research compared to some of the more theoretical research directions.

It doesn't have to be just mechanistic interpretability, but my point is, are there ways we could shape or constrain our responses to the NSF like this that would help address your concerns?

Action: Help expand funding for AI Safety by coordinating on NSF response

This is a conceivable universe, but do you really think it's likely? It seems to me much more likely that additional funding opportunities would help AI safety research move at least a little bit faster.

Promising posts on AF that have fallen through the cracks

Great question. IMO it's probably worth it to try leaving what you think may be a bad comment rather than no comment at all. Sometimes what we assume is obvious or a bad comment may actually be very useful or just the feedback/fresh perspective that someone else needed. 

You've made me realize that simply counting the comments and the other criteria in my post probably don't provide enough signal, though. The first comment may still be bad from the perspective of the original author/researcher, or just not the kind of feedback they really needed, or they may need more of it.

But we could promote this "just go for it" attitude for initial comments as a community norm in combination with this suggestion from Jon Garcia:  

What if posts could be flagged by authors and/or the community as needing feedback or discussion? This could work something like pinning to the front page, except that the "pinnedness" could decay over time to make room for other posts while getting periodically refreshed.

Then in case the initial rushed comment didn't satisfy the author's need for feedback, they could simply continue to leave the post flagged as still needing feedback/discussion.

It would be easy to forget to remove this tag from your posts. So every time a new comment is made, AF/LessWrong should probably prompt the author whether the tag is still necessary, or automatically remove the tag and force the author to re-add it if they still need it.

Additional incentives may be useful to make sure people don't clutter up the tag with comments that just-kinda-sorta-would-be-nice to have more comments on. Maybe an author is limited to having 2~3 posts with this tag at a time. Or it uses some kind of bounty mechanism where comments submitted on a post with the "needs feedback" tag earn 2x karma, with the extra karma donated from the author of the post.


Several of the Circuits articles provide colab notebooks reproducing the results in the article, which may be helpful references if one wants to do Circuits research on vision models.


I'm starting to reproduce some results from the Circuits thread. It took me longer than expected just to find these colab notebooks so I wanted to share more specifically in case it saves anyone else some time.

The text "colab" isn't really turned up in a targeted Google search on Distill and the Circuits thread.  Also if you open a post like Visualizing Weights and do a Ctrl+F or Cmd+F search for "colab", you won't turn up any results either.

But if you open that same post and scroll down, you'll see button links to open the colab notebooks that look like this. These will also turn up in a Ctrl+F or Cmd+F search for "notebook" in case you want to jump around to different colab examples in the Circuits thread.
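For anyone who'd rather script this search than Ctrl+F each article, here's a minimal stdlib-only Python sketch that scans an article's saved HTML for links to Colab notebooks. (The sample HTML below is a made-up illustration, not taken from an actual Circuits article.)

```python
from html.parser import HTMLParser

class ColabLinkFinder(HTMLParser):
    """Collect href values that point at Colab notebooks."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Keep any anchor whose href points at the Colab domain.
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "colab.research.google.com" in href:
                self.links.append(href)

# Example: feed in a saved copy of an article's HTML.
sample_html = """
<p>Reproduce the figures in this
<a href="https://colab.research.google.com/github/example/notebook.ipynb">notebook</a>.</p>
"""
finder = ColabLinkFinder()
finder.feed(sample_html)
print(finder.links)
```

You could point this at each Circuits article's HTML (e.g. downloaded with your browser's "Save Page As") to pull out every notebook link in one pass.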

Beware using words off the probability distribution that generated them.

Nice post, so many hidden assumptions behind the words we use.

I wonder what are some concrete examples of this in alignment discussions, examples like your one about the probability that god exists.

One that comes to mind is a recent comment thread on one of the Late 2021 MIRI Conversations posts where we were assigning probabilities to "soft takeoff" and "hard takeoff" scenarios. Then Daniel Kokotajlo realized that "soft takeoff" had to be disambiguated, because in that context some people were using it to mean any kind of gradual advancement in AI capabilities, whereas others meant it to mean specifically "GDP doubling in 4 years, then doubling in 1 year". 

Solving Interpretability Week

I'm interested in trying a co-work call sometime but won't have time for it this week.

Thanks for sharing about Shay in this post. I hadn't heard of her before; what a valuable resource she is, and what a valuable way she's helping the cause of AI safety.

(As for contact, I check my LessWrong/Alignment Forum inbox for messages regularly.)

More Christiano, Cotra, and Yudkowsky on AI progress

Well said! This resonates with my Eliezer-model too.

Taking this into account I'd update my guess of Eliezer's position to:

  • Eliezer: 5% soft takeoff, 80% hard takeoff, 15% something else

This last "something else" bucket added because "the Future is notoriously difficult to predict" (paraphrasing Eliezer).

More Christiano, Cotra, and Yudkowsky on AI progress

So here y'all have given your sense of the likelihoods as follows:

  • Paul: 70% soft takeoff, 30% hard takeoff
  • Daniel: 30% soft takeoff, 70% hard takeoff

How would Eliezer's position be stated in these terms? Similar to Daniel's?

[AN #61] AI policy and governance, from two people in the field

This work on learning with constraints seems interesting.

Looks like the paper "Bridging Hamilton-Jacobi Safety Analysis and Reinforcement Learning" has moved so that link is currently broken. Here's a working URL: https://ieeexplore.ieee.org/document/8794107 Also one more where the full paper is more easily accessible: http://files.davidqiu.com/research/papers/2019_fisac_Bridging%20Hamilton-Jacobi%20Safety%20Analysis%20and%20Reinforcement%20Learning%20[RL][Constraints].pdf

Interpreting Yudkowsky on Deep vs Shallow Knowledge

Great investigation/clarification of this recurring idea from the ongoing Late 2021 MIRI Conversations.

  • outside vs. inside view - I've thought about this before but hadn't read as clear a description of the differences and tradeoffs (still catching up on Eliezer's old writings)
  • "deep knowledge is far better at saying what won’t work than at precisely predicting the correct hypothesis." - very useful takeaway

You might not like his tone in the recent discussions, but if someone has been saying the same thing for 13 years, nobody seems to get it, and their model predicts that this will lead to the end of the world, maybe they can get some slack for talking smack.

Good point and we should. Eliezer is a valuable source of ideas and experience around alignment, and it seems like he's contributed immensely to this whole enterprise.

I just hope all his smack talking doesn't turn away talented people coming to lend a hand on alignment. I expect a lot of people on this (AF) forum found it, like me, after reading all of Open Phil's and 80,000 Hours' convincing writing about the urgency of solving the AI alignment problem. It seems silly to have those orgs working hard to recruit people to help out, only to have them come over here and find one of the leading thinkers in the community going on frequent tirades about how much EAs suck, even though he doesn't know most of us. Not to mention folks like Paul and Richard who have been taking his heat directly in these marathon discussions!
