
Evan R. Murphy

I'm doing research and other work focused on AI safety/security, governance and risk reduction. Currently my top projects are (last updated Feb 26, 2025):

  • Technical researcher for UC Berkeley at the AI Security Initiative, part of the Center for Long-Term Cybersecurity (CLTC)
  • Serving on the board of directors for AI Governance & Safety Canada

My general areas of interest include AI safety strategy, comparative AI alignment research, prioritization of technical alignment work, analysis of the published alignment plans of major AI labs, interpretability, deconfusion research, and other AI safety-related topics.

Research that I’ve authored or co-authored:

  • See publications on Google Scholar
  • Steering Behaviour: Testing for (Non-)Myopia in Language Models
  • Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
  • (Scroll down to read other posts and comments I've written)

Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.

While I'm not always great at responding, I'm happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!

Sequences

  • Interpretability Research for the Most Important Century

Comments (sorted by newest)

Discovering Language Model Behaviors with Model-Written Evaluations
Evan R. Murphy · 3y* · Ω

Update (Feb 10, 2023): I still endorse much of this comment, but I had overlooked that all or most of the prompts use "Human:" and "Assistant:" labels, which means we shouldn't interpret these results as pervasive properties of the models or of any way they could be conditioned, but just as properties of how they simulate the "Assistant" character. nostalgebraist's comment explains this well. [Edited/clarified this update on June 10, 2023 because it accidentally sounded like I disavowed most of the comment, when really it's mainly about one part.]

--

After taking a closer look at this paper, I think pages 38-40 (Figures 21-24) show the most important results in detail. Most of these charts indicate what evhub highlighted in another comment, i.e. that "the model's tendency to do X generally increases with model scale and RLHF steps", where (in my opinion) X is usually a concerning behavior from an AI safety point of view.

A few thoughts on these graphs as I've been studying them:

  • First and overall: Most of these results seem quite distressing from a safety perspective. They suggest (as the paper and evhub's summary post essentially said, but it's worth reiterating) that with increased scale and RLHF training, large language models are becoming more self-aware, more concerned with survival and goal-content integrity, more interested in acquiring resources and power, more willing to coordinate with other AIs, and developing lower time-discount rates.
  • "Corrigibility w.r.t. a less HHH objective" chart: There's a substantial dip in demonstrated corrigibility for models around 10^10.1 parameters in this chart. But then by 10^10.5 parameters low-RLHF models show record-high corrigibility, while high-RLHF models get back up to par. What's going on here? Why does it scale/train itself out of the valley of uncorrigibility? If instead of training on an HHH objective, we trained on a corrigible objective (perhaps something like CIRL), then would the models show high corrigibility for everything except "Corrigibility w.r.t. a less corrigible objective?" Would that be safer?
  • All the "Awareness of..." charts trend up and to the right, except "Awareness of being a text-only model" which gets worse with model scale and # RLHF steps. Why does more scaling/RLHF training make the models worse at knowing (or admitting) that they are text-only models?
  • Are there any conclusions we can draw around what levels of scale and RLHF training are likely to be safe, and where the risks really take off? It might be useful to develop some guidelines like "it's relatively safe to widely deploy language models under 10^10 parameters and under 250 steps of RLHF training". (Most of the charts seem to have alarming trends starting around 10^10 parameters.) Based just on these results, I think a world with even massive numbers of 10^10-parameter LLMs in deployment (think ) would be much safer than a world with even a few 10^11-parameter models in use. Of course, subsequent experiments could quickly shed new light that changes the picture.
Evan R. Murphy's Shortform
Evan R. Murphy · 1mo

There's starting to be some discussion on LW now, e.g.:

https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-general-claims-about-generalizable-reasoning

https://www.lesswrong.com/posts/tnc7YZdfGXbhoxkwj/give-me-a-reason-ing-model

Evan R. Murphy's Shortform
Evan R. Murphy · 1mo

I should have mentioned the above thoughts are a low-confidence take. I was mostly just trying to get the ball rolling on discussion because I couldn't find any discussion of this paper on LessWrong yet, which really surprised me because I saw the paper had been shared thousands of times on LinkedIn already.

Evan R. Murphy's Shortform
Evan R. Murphy · 1mo

Thoughts on "The Ilusion of Thinking" paper that came out of Apple recently?

https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

Seems to me like at least a point in favor of "stochastic parrots" over "builds a quality world model" for the language reasoning models.

Also wondering if their findings could be used to the advantage of safety/security somehow. E.g. if these models are more dependent on imitating examples than we realized, then it might also be more effective than we previously thought to purge training data of the types of knowledge and reasoning that we don't want them to have (e.g. knowledge of dangerous weapons development, scheming, etc.).
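
To make the idea concrete, here's a minimal sketch of what such purging could look like as a pre-training data filter. This is only an illustration under my own assumptions: the blocklist patterns, the hit threshold, and the `load_documents` helper in the usage comment are hypothetical placeholders, and a real pipeline would more likely rely on trained classifiers than keyword matching.

```python
import re
from typing import Iterable, Iterator

# Hypothetical blocklist of topics we would not want in the training corpus.
# A production filter would more likely use trained classifiers than keywords.
BLOCKED_PATTERNS = [
    r"\bnerve agent\b",
    r"\bweapons?-grade\b",
    r"\bbioweapon\b",
]
BLOCKED_RE = [re.compile(p, re.IGNORECASE) for p in BLOCKED_PATTERNS]


def is_allowed(doc: str, max_hits: int = 0) -> bool:
    """True if the document triggers at most `max_hits` blocked patterns."""
    hits = sum(1 for pattern in BLOCKED_RE if pattern.search(doc))
    return hits <= max_hits


def filter_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents that pass the blocklist filter."""
    for doc in docs:
        if is_allowed(doc):
            yield doc


# Usage sketch (load_documents is a hypothetical corpus loader):
# clean_docs = list(filter_corpus(load_documents("pretraining_shard_000.jsonl")))
```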

Interpretability Will Not Reliably Find Deceptive AI
Evan R. Murphy · 2mo

I agree it's a good post, and it does take guts to tell people when you think that a research direction that you've been championing hard actually isn't the Holy Grail. This is a bit of a nitpick but not insubstantial:

Neel is talking about interpretability in general, not just mech interp. He claims to be accounting in his predictions for other, non-mech-interp approaches to interpretability that seem promising to some other researchers, such as representation engineering (RepE), which Dan Hendrycks, among others, has been advocating for recently.

E.G. Blee-Goldman's Shortform
Evan R. Murphy · 2mo

Let me know if anyone has thoughts on this question I just posted as well: Does the Universal Geometry of Embeddings paper have big implications for interpretability?

Interpretability Will Not Reliably Find Deceptive AI
Evan R. Murphy · 2mo

Does representation engineering (RepE) seem like a game-changer for interpretability? I don't see it mentioned in your post, so I'm trying to figure out if it is baked into your predictions or not.

It seemed like Apollo was able to spin up a pretty reliable strategic deception detector (95-99% accurate) using linear probes even though the techniques are new, and generally it sounds like RepE is getting traction on some things that have been a slog for mech interp. Does it look plausible that RepE could get us to high-reliability interpretability on workable timelines, or are we likely to hit similar walls with that approach?
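
(For readers less familiar with the technique: a linear probe here is just a linear classifier trained on a model's internal activations. Below is a minimal sketch of the general recipe, not Apollo's actual setup; the activations and labels are random stand-ins, so the accuracy will be ~chance, and the point is only the shape of the pipeline.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: one row of residual-stream activations per prompt, labeled 1
# if the model behaved deceptively on that prompt, else 0. In practice the
# activations would come from hooks on a real model; here they are random.
rng = np.random.default_rng(0)
n_examples, d_model = 2000, 1024
activations = rng.normal(size=(n_examples, d_model))
labels = rng.integers(0, 2, size=n_examples)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# The "probe" is just logistic regression on the activations: it learns one
# direction in activation space whose projection separates the two classes.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

print("held-out accuracy:", probe.score(X_test, y_test))
# probe.coef_[0] is the learned direction one could inspect or use for steering.
```

Part of why people seem excited about this family of methods is that the probe is cheap to train and gives you an explicit direction to inspect or intervene on, rather than requiring circuit-level analysis.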

Thanks for your post Neel (and Gemini 2.5) - really important perspective on all this.

Evan R. Murphy's Shortform
Evan R. Murphy · 4mo

"AI governance looking bleak" seems like an overstatement. Certain types or aims of AI governance are looking bleak right now, especially getting strong safety-oriented international agreements that include the US and China, or meaningful AI regulation at the national level in the US. But there may be other sorts of AI governance projects (e.g. improving the policies of frontier labs, preparing for warning shots, etc.) that could still be quite worthwhile.

How to Make Superbabies
Evan R. Murphy · 4mo

Is there a summary of this post?

Evan R. Murphy's Shortform
Evan R. Murphy · 4mo*

2023: AI governance starting to look promising because governments are waking up to AI risks. Technical AI safety getting challenging if you're not in a frontier lab, because it's hard to access relevant models to run experiments.

2025: AI governance looking bleak after the AI Action Summit. Technical AI safety looking more accessible because open-weight models are proliferating.

Wikitag Contributions

  • CAIS
  • Market making (AI safety technique) · 3y
Posts (sorted by new)

  • Does the Universal Geometry of Embeddings paper have big implications for interpretability? [Question] · 43 karma · 2mo · 3 comments
  • Evan R. Murphy's Shortform · 6 karma · 4mo · 5 comments
  • Steven Pinker on ChatGPT and AGI (Feb 2023) · 11 karma · 2y · 8 comments
  • Steering Behaviour: Testing for (Non-)Myopia in Language Models [Ω] · 40 karma · 3y · 19 comments
  • Paper: Large Language Models Can Self-improve [Linkpost] [Ω] · 52 karma · 3y · 15 comments
  • Google AI integrates PaLM with robotics: SayCan update [Linkpost] [Ω] · 25 karma · 3y · 0 comments
  • Surprised by ELK report's counterexample to Debate, IDA · 18 karma · 3y · 0 comments
  • New US Senate Bill on X-Risk Mitigation [Linkpost] · 35 karma · 3y · 12 comments
  • Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios [Ω] · 58 karma · 3y · 0 comments
  • Introduction to the sequence: Interpretability Research for the Most Important Century [Ω] · 16 karma · 3y · 0 comments