Evan R. Murphy

I'm doing research and other work focused on AI safety/security, governance and risk reduction. Currently my top projects are (last updated Feb 26, 2025):

  • Technical researcher for UC Berkeley at the AI Security Initiative, part of the Center for Long-Term Cybersecurity (CLTC)
  • Serving on the board of directors for AI Governance & Safety Canada

My general areas of interest are AI safety strategy, comparative AI alignment research, prioritizing technical alignment work, analyzing the published alignment plans of major AI labs, interpretability, deconfusion research, and other AI safety-related topics.

Research that I’ve authored or co-authored:

  • See publications on Google Scholar
  • Steering Behaviour: Testing for (Non-)Myopia in Language Models
  • Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
  • (Scroll down to read other posts and comments I've written)

Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.

While I'm not always great at responding, I'm happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!

Sequences

  • Interpretability Research for the Most Important Century

Comments (sorted by newest)

Do models know when they are being evaluated?
Evan R. Murphy · 1mo

I read through your paper Large Language Models Often Know When They Are Being Evaluated (the apparent output of OP's research). This is solid progress on an important problem.

While the paper acknowledges that models could sandbag on capability evaluations and notes meta-reasoning in the results, there's a deeper issue that deserves attention: models with evaluation awareness may be sandbagging on the evaluation awareness assessments themselves.

The paper finds that frontier models show "substantial, though not yet superhuman" evaluation awareness, with Gemini 2.5 Pro reaching 0.95 AUC on agentic tasks. You note that models sometimes recognize that being asked about evaluations is itself evidence of being evaluated. But this points to a fundamental challenge: a model sophisticated enough to recognize evaluations could deliberately score lower on situational awareness tests to avoid triggering safety thresholds.

This recursive problem was a key concern in our UC Berkeley paper on intolerable risk thresholds (arxiv.org/pdf/2503.05812, p. 41). Once models reach even moderate evaluation awareness (we propose ~60% accuracy at distinguishing evaluation vs deployment), they can strategically underperform on evaluations, including on evaluation awareness assessments themselves. Your findings suggest we may already be in this territory.

This doesn't diminish the value of your benchmark—it provides essential baseline data. But it looks like we are already at the point where we also need to:

  1. Develop harder-to-game evaluation methods (e.g., white-box techniques examining internal representations; see the probe sketch after this list)
  2. Use conservative safety margins that assume some sandbagging may already be occurring
  3. Corroborate behavioral tests with other evidence of capabilities (e.g., scaling laws, architectural features)
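
To illustrate what point 1 might look like in practice, here is a minimal sketch (my own illustration, not something from your paper) of a white-box check: training a linear probe on a model's hidden activations to classify evaluation vs. deployment transcripts. The activation extraction and labeled dataset are assumed to already exist, and the function and variable names are hypothetical:

```python
# Minimal sketch of a white-box evaluation-awareness probe (illustrative only).
# Assumes you've already extracted hidden-state vectors from the model while it
# processed transcripts labeled as "evaluation" vs. "deployment"; data loading
# is omitted and all names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_awareness_probe(activations: np.ndarray, labels: np.ndarray):
    """activations: (n_samples, d_model) hidden states at a chosen layer;
    labels: 1 = transcript drawn from an evaluation, 0 = real deployment traffic."""
    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.2, random_state=0, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
    # A high probe AUC suggests the model internally represents the
    # evaluation/deployment distinction even if its overt answers deny it,
    # which is harder to sandbag than a behavioral self-report.
    return probe, auc
```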

Have you considered approaches to detect or mitigate potential sandbagging on evaluation awareness tests themselves?

(Thanks to Matteo Pistillo for sharing your paper with me)

Tristan's Projects
Evan R. Murphy · 1mo

> OISs
>
> Idea:
>
> There is a paradigm missing from the discussion of ASI misalignment existential risk. The threat of ASI generalizes to the concept of "Outcome Influencing Systems" (OISs). My hope is that developing terminology and formalism around this model may mitigate the issues associated with existing terminology and aid in more productive discourse and interdisciplinary research applicable to ASI risk and social coordination issues.
>
> Links:
>
>   • WIP document to become main LW post introducing OISs (comments welcome)
>
> My involvement:
>
> I am currently the only contributor. I think the idea has merit, but I am still at the point where I am seeking either to find collaborators and spread the idea, or to find people who can point out enough flaws in the idea for it to be worth abandoning.

Question about the OIS research direction: What sort of principles might you hope to learn about outcome influencing systems which could help with the problems of ASI?

It seems to me that the problem isn't OISs generally. We have plenty of safe/aligned OISs, including some examples from your OIS doc such as thermostats. Even a lot of non-AI software systems seem like safe/aligned OISs. The problem seems to be with a specific type of OIS (powerful deep-learning-based models), which is both much more capable and much more difficult to ensure safe behavior from than software with legible code. (I read your doc quickly, so I may have missed an answer to this that you already provide in there.)

--

Also, thanks for putting forth a novel research direction for addressing AI risk. I think we should have a high bar for which research directions we put scarce resources into, but at the same time it's super valuable to propose new research directions. What we are looking for could just be low-hanging fruit in the unexplored idea space, overlooked by all the people who are working on more established stuff.

Open Global Investment as a Governance Model for AGI
Evan R. Murphy · 1mo

To help with the incentives and coordination, instead of having the first frontier AI megaowner step forward and unconditionally relinquish some of their power, they could sign on to a conditional contract to do so. It would only activate if other megaowners did the same.

Open Global Investment as a Governance Model for AGI
Evan R. Murphy · 2mo

Ok yes, that would be great.

> I'll point out that this is not unheard of, Altman literally took no equity in OpenAI (though IMO was eventually corrupted by the power nonetheless).

He may have been corrupted by power later. Alternatively, he may have been playing the long game, knowing that he would have that power eventually even if he took no equity.

Open Global Investment as a Governance Model for AGI
Evan R. Murphy · 2mo

> Couldn't we just... set up a financial agreement where the first N employees don't own stock and have a set salary?

Maybe, could be nice... But since the first N employees usually get to sign off on major decisions, why would they go along with such an agreement? Or are you suggesting governments should convene to force this sort of arrangement on them?

> My main concern is that they'll have enough power to be functionally wealthy all-the-same, or be able to get it via other means (e.g. Altman with his side hardware investment / company).

I'm not sure I understand this part actually, could you elaborate? Is this your concern with the OGI model or with your salary-only for first-N employees idea?

Open Global Investment as a Governance Model for AGI
Evan R. Murphy · 2mo

Appreciate your integrity in doing that!

At the same time, the unfairness of early frontier lab founders getting rich seems to me like a very acceptable downside, given that open investment could solve a lot of issues and given the bleakness of many other paths forward.

Evan R. Murphy's Shortform
Evan R. Murphy · 2mo

Outer alignment seems not as hard as we thought a few years ago. LLMs are actually really good at understanding what we mean, so the sorcerer's apprentice and King Midas concerns seem obsolete. Except maybe for systems using heavy RL, where specification gaming is still a concern.

The more salient outer alignment issue now is how to align agents when you don't have the time or enough capability yourself to supervise them well. And that's mainly a problem because of competitive race dynamics, which incentivize people to try to supervise AI 'beyond their means', so to speak.

So, cooling race dynamics could address a main portion of the remaining outer alignment problem. Scalable oversight techniques may also address it. What would remain then for (narrow) alignment is specification gaming, and then of course the whole inner alignment problem, including deceptive alignment, which is still a huge unsolved problem.

Evan R. Murphy's Shortform
Evan R. Murphy · 5mo

Starting to be some discussion on LW now, e.g.

 https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-general-claims-about-generalizable-reasoning

https://www.lesswrong.com/posts/tnc7YZdfGXbhoxkwj/give-me-a-reason-ing-model

Evan R. Murphy's Shortform
Evan R. Murphy · 5mo

I should have mentioned the above thoughts are a low-confidence take. I was mostly just trying to get the ball rolling on discussion because I couldn't find any discussion of this paper on LessWrong yet, which really surprised me because I saw the paper had been shared thousands of times on LinkedIn already.

Evan R. Murphy's Shortform
Evan R. Murphy · 5mo

Thoughts on "The Ilusion of Thinking" paper that came out of Apple recently?

https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

Seems to me like at least a point in favor of "stochastic parrots" over "builds a quality world model" for the language reasoning models.

Also wondering if their findings could be used to the advantage of safety/security somehow. E.g. if these models are more dependent on imitating examples than we realized, then it might also be more effective than we previously thought to purge training data of the types of knowledge and reasoning that we don't want them to have (e.g. knowledge of dangerous weapons development, scheming, etc.).
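
As a toy illustration of that last idea (not anything from the paper), a training-data purge pass might look roughly like the sketch below; the blocked-topic list and helper names are hypothetical placeholders, and a real pipeline would rely on trained classifiers and human review rather than keyword matching:

```python
# Illustrative sketch only: drop pretraining documents that touch topics we
# don't want the model to learn from. The topic list and function names are
# hypothetical; real data-purging pipelines would use trained topic
# classifiers, not simple substring matching.
from typing import Iterable, Iterator

BLOCKED_TOPICS = [
    "nerve agent synthesis",           # placeholder stand-in for dangerous-weapons content
    "uranium enrichment cascade",      # placeholder stand-in for dangerous-weapons content
    "evading oversight of ai agents",  # placeholder stand-in for scheming-related content
]

def should_exclude(document: str) -> bool:
    """Return True if the document mentions any blocked topic."""
    text = document.lower()
    return any(topic in text for topic in BLOCKED_TOPICS)

def purge_corpus(documents: Iterable[str]) -> Iterator[str]:
    """Yield only the documents that pass the topic filter."""
    for doc in documents:
        if not should_exclude(doc):
            yield doc
```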

Posts (sorted by new)

  • AI Risk: Can We Thread the Needle? [Recorded Talk from EA Summit Vancouver '25] (6 karma, 1mo, 0 comments)
  • Does the Universal Geometry of Embeddings paper have big implications for interpretability? [Question] (43 karma, 5mo, 6 comments)
  • Evan R. Murphy's Shortform (6 karma, 8mo, 6 comments)
  • Steven Pinker on ChatGPT and AGI (Feb 2023) (11 karma, 3y, 8 comments)
  • Steering Behaviour: Testing for (Non-)Myopia in Language Models [Ω] (40 karma, 3y, 19 comments)
  • Paper: Large Language Models Can Self-improve [Linkpost] [Ω] (52 karma, 3y, 15 comments)
  • Google AI integrates PaLM with robotics: SayCan update [Linkpost] [Ω] (25 karma, 3y, 0 comments)
  • Surprised by ELK report's counterexample to Debate, IDA (22 karma, 3y, 0 comments)
  • New US Senate Bill on X-Risk Mitigation [Linkpost] (35 karma, 3y, 12 comments)
  • Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios [Ω] (58 karma, 3y, 0 comments)

Wikitag Contributions

  • Market making (AI safety technique) (3 years ago)