Evan R. Murphy

I'm doing research and other work focused on AI safety and AI catastrophic risk reduction. Currently my top projects are (last updated May 19, 2023):

General areas of interest for me are AI safety strategy, comparative AI alignment research, prioritizing technical alignment work, analyzing the published alignment plans of major AI labs, interpretability, the Conditioning Predictive Models agenda, deconfusion research and other AI safety-related topics. My work is currently self-funded.

Research that I’ve authored or co-authored:

Other recent work:

Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.

I'm always happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!


Interpretability Research for the Most Important Century


The independent red-teaming organization ARC Evals that OpenAI partnered with to evaluate GPT-4 seems to disagree with this. While they don't use the term "runaway intelligence", they have flagged similar dangerous capabilities that they think will possibly be in reach for the next models beyond GPT-4:

We think that, for systems more capable than Claude and GPT-4, we are now at the point where we need to check carefully that new models do not have sufficient capabilities to replicate autonomously or cause catastrophic harm – it’s no longer obvious that they won’t be able to.

This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any 'hacking' or threat to her, if Anthropic is right about larger better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!

Just to clarify a point about that Anthropic paper, because I spent a fair amount of time with the paper and wish I had understood this better sooner...

I don't think it's right to say that Anthropic's "Discovering Language Model Behaviors with Model-Written Evaluations" paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that when language models that are larger or have more RLHF training are simulating an "Assistant" character they exhibit more of these behaviours. It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.

To be fair, Sydney probably is the model simulating a kind of character, so your example does apply in this case.

(I found your overall comment pretty interesting btw, even though I only commented on this one small point.)

Update (Feb 10, 2023): I still endorse much of this comment, but I had overlooked that all or most of the prompts use "Human:" and "Assistant:" labels. This means we shouldn't interpret these results as pervasive properties of the models, or as something that would result from any way they could be conditioned, but only as properties of the way they simulate the "Assistant" character. nostalgebraist's comment explains this well. [Edited/clarified this update on June 10, 2023 because it accidentally sounded like I disavowed most of the comment when it's mainly one part]


After taking a closer look at this paper, pages 38-40 (Figures 21-24) show in detail what I think are the most important results. Most of these charts indicate what evhub highlighted in another comment, i.e. that "the model's tendency to do X generally increases with model scale and RLHF steps", where (in my opinion) X is usually a concerning behavior from an AI safety point of view:

A few thoughts on these graphs as I've been studying them:

  • First and overall: Most of these results seem quite distressing from a safety perspective. They suggest (as the paper and evhub's summary post essentially said, but it's worth reiterating) that with increased scale and RLHF training, large language models are becoming more self-aware, more concerned with survival and goal-content integrity, more interested in acquiring resources and power, more willing to coordinate with other AIs, and developing lower time-discount rates.
  • "Corrigibility w.r.t. a less HHH objective" chart: There's a substantial dip in demonstrated corrigibility for models around 10^10.1 parameters in this chart. But then by 10^10.5 parameters low-RLHF models show record-high corrigibility, while high-RLHF models get back up to par. What's going on here? Why does it scale/train itself out of the valley of uncorrigibility? If instead of training on an HHH objective, we trained on a corrigible objective (perhaps something like CIRL), then would the models show high corrigibility for everything except "Corrigibility w.r.t. a less corrigible objective?" Would that be safer?
  • All the "Awareness of..." charts trend up and to the right, except "Awareness of being a text-only model" which gets worse with model scale and # RLHF steps. Why does more scaling/RLHF training make the models worse at knowing (or admitting) that they are text-only models?
  • Are there any conclusions we can draw around what levels of scale and RLHF training are likely to be safe, and where the risks really take off? It might be useful to develop some guidelines like "it's relatively safe to widely deploy language models under 10^10 parameters and under 250 steps of RLHF training". (Most of the charts seem to have alarming trends starting around 10^10 parameters.) Based just on these results, I think a world with even massive numbers of 10^10-parameter LLMs in deployment (think CAIS) would be much safer than a world with even a few 10^11-parameter models in use. Of course, subsequent experiments could quickly shed new light that changes the picture.
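
For a sense of scale, the log-scale values quoted in the bullets above can be converted into plain parameter counts. This is only a minimal arithmetic sketch: `params_from_log10` is a hypothetical helper name introduced here for illustration, and the exponents are just the ones mentioned above, not additional data from the paper.

```python
# Convert the log10 parameter counts mentioned in the bullets above
# into plain numbers. The exponent list is an assumption for illustration:
# it contains only the values quoted in the text.
exponents = [10.0, 10.1, 10.5, 11.0]


def params_from_log10(exp):
    """Return the parameter count corresponding to a log10 value (hypothetical helper)."""
    return 10 ** exp


for exp in exponents:
    billions = params_from_log10(exp) / 1e9
    print(f"10^{exp:g} parameters is about {billions:.1f} billion")
```

So the dip in the corrigibility chart sits at roughly 12.6 billion parameters and the recovery at roughly 31.6 billion, while the 10^10 and 10^11 thresholds correspond to 10 billion and 100 billion parameters.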

Thanks, I think you're referring to:

It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.

There were some ideas proposed in the paper "Conditioning Predictive Models: Risks and Strategies" by Hubinger et al. (2023). But since it was published over a year ago, I'm not sure if anyone has gotten far on investigating those strategies to see which ones could actually work. (I'm not seeing anything like that in the paper's citations.)

Really fascinating post, thanks.

On green as according to black, I think there's an additional facet perhaps even more important than just the acknowledgment that sometimes we are too weak to succeed and so should conserve energy. Black being strongly self-interested will tend to cast aside virtues like generosity, honesty and non-harm except as means in social games they are playing to achieve other ends for themselves. But self-interest tends to include desire for reduction of self-suffering. Green + white* (as I'm realizing this may be more a color combo than purely green) are more inclined to discover, e.g. through meditation/mindfulness, that aggression, deceit and other non-virtues actually produce self-suffering in the mind as a byproduct. So black is capable of embracing virtue as part of a more complete pursuit of self-interest.**

It may be that one of the most impactful things that green + white can do is get black to realize this fact, since black will tend to be powerful and successful in the world at promoting whatever it understands to be its self interest.

I haven't read your post on attunement yet, maybe you touch on this or related ideas there.


*You could argue this also includes blue and so should be green + white + blue, since it largely deals with knowledge of self.

**I believe this fact of non-virtue inflicting self-suffering is true for most human minds. However, there may be cases where it wouldn't hold, such as when a person has some sort of psychological disorder that makes them effectively lack a conscience.

But in this case Patrick Collison is a credible source and he says otherwise.

Patrick Collison: These aren’t just cherrypicked demos. Devin is, in my experience, very impressive in practice

Patrick is an investor in Cognition. So while he may still be credible in this case, he also has a conflict of interest.

Reading that page, The Verge's claim seems to all hinge on this part:

OpenAI spokesperson Lindsey Held Bolton refuted that notion in a statement shared with The Verge: “Mira told employees what the media reports were about but she did not comment on the accuracy of the information."

They are saying that Bolton "refuted" the notion about such a letter, but the quote from her that follows doesn't actually sound like a refutation. Hence the Verge piece seems confusing/misleading and I haven't yet seen any credible denial from the board about receiving such a letter.

Yes, though I think he said this at APEC right before he was fired (not after).

Carl, have you written somewhere about why you are confident that all UFOs so far are prosaic in nature? I'd be interested to read/listen to your thoughts on this. (Alternatively, a link to some other source that you find gives a particularly compelling explanation is also good.)

Great update from Anthropic on giving majority control of the board to a financially disinterested trust: https://twitter.com/dylanmatt/status/1680924158572793856
