Evan R. Murphy — LessWrong

I read through your paper Large Language Models Often Know When They Are Being Evaluated (the apparent output of OP's research). This is solid progress on an important problem.

While the paper acknowledges that models could sandbag on capability evaluations and notes meta-reasoning in the results, there's a deeper issue that deserves attention: models with evaluation awareness may be sandbagging on the evaluation awareness assessments themselves.

The paper finds that frontier models show "substantial, though not yet superhuman" evaluation awareness, with Gemini 2.5 Pro reaching 0.95 AUC on agentic tasks. You note that models sometimes recognize that being asked about evaluations is itself evidence of being evaluated. But this points to a fundamental challenge: a model sophisticated enough to recognize evaluations could deliberately score lower on situational awareness tests to avoid triggering safety thresholds.

This recursive problem was a key concern in our UC Berkeley paper on intolerable risk thresholds (arxiv.org/pdf/2503.05812, p. 41). Once models reach even moderate evaluation awareness (we propose ~60% accuracy at distinguishing evaluation vs deployment), they can strategically underperform on evaluations, including on evaluation awareness assessments themselves. Your findings suggest we may already be in this territory.

This doesn't diminish the value of your benchmark—it provides essential baseline data. But it looks like we are already at the point where we need to also do things like:

Develop harder-to-game evaluation methods (e.g., white-box techniques examining internal representations)
Use conservative safety margins that assume some sandbagging may already be occurring
Corroborate behavioral tests with other evidence of capabilities (e.g., scaling laws, architectural features)

Have you considered approaches to detect or mitigate potential sandbagging on evaluation awareness tests themselves?

(Thanks to Matteo Pistillo for sharing your paper with me)

Tristan's Projects

Evan R. Murphy1mo30

OISs
Idea:
There is a paradigm missing from the discussion of ASI misalignment existential risk. The threat of ASI generalizes to the concept of "Outcome Influencing Systems" (OISs). My hope is that developing terminology and formalism around this model may mitigate the issues associated with existing terminology and aid in more productive discourse and interdisciplinary research applicable to ASI risk and social coordination issues.
Links:
WIP document to become main LW post introducing OISs (comments welcome)
My involvement:
I am currently the only contributor. I think the idea has merit, but I am still at the point where I am seeking either to find collaborators and spread the idea, or to find people who can point out enough flaws in the idea for it to be worth abandoning.

Question about the OIS research direction: What sort of principles might you hope to learn about outcome influencing systems which could help with the problems of ASI?

It seems to me that the problem isn't OIS generally. We have plenty of safe/aligned OISs. This includes some examples from your OIS doc such as thermostats. Even a lot of non-AI software systems seem like safe/aligned OISs. The problem seems to be more with this specific type of OIS (powerful deep-learning based models) which are both much more capable and much more difficult to ensure the safe behavior of, compared to software with legible code. (I read your doc quickly so may have missed an answer about this which you already provide in there.)

Also, thanks for putting forth a novel research direction for addressing AI risk. I think we should have a high bar for which research directions we put scarce resources into, but at the same time it's super valuable to propose new research directions. What we are looking for could just be low-hanging fruit in the unexplored idea space, overlooked by all the people who are working on more established stuff.

Open Global Investment as a Governance Model for AGI

Evan R. Murphy1mo20

To help with the incentives and coordination, instead having the first frontier AI megaowner step forward and unconditionally relinquish some of their power, they could sign on to a conditional contract to do so. It would only activate if other megaowners did the same.

Open Global Investment as a Governance Model for AGI

Evan R. Murphy2mo*20

Ok yes, that would be great.

I'll point out that this is not unheard of, Altman literally took no equity in OpenAI (though IMO was eventually corrupted by the power nonetheless).

He may have been corrupted later by power later. Alternatively, he may have been playing the long game, knowing that he would have that power eventually even if he took no equity.

Open Global Investment as a Governance Model for AGI

Evan R. Murphy2mo20

Couldn't we just... set up a financial agreement where the first N employees don't own stock and have a set salary?

Maybe, could be nice... But since the first N employees usually get to sign off on major decisions, why would they go along with such an agreement? Or are you suggesting governments should convene to force this sort of arrangement on them?

My main concern is that they'll have enough power to be functionally wealthy all-the-same, or be able to get it via other means (e.g. Altman with his side hardware investment / company).

I'm not sure I understand this part actually, could you elaborate? Is this your concern with the OGI model or with your salary-only for first-N employees idea?

Open Global Investment as a Governance Model for AGI

Evan R. Murphy2mo20

Appreciate your integrity in doing that!

At the same time, the unfairness of early frontier lab founders getting rich seems to me like a very acceptable downside given that open investment could solve a lot of issues and the bleakness of many other paths forward.

Evan R. Murphy's Shortform

Evan R. Murphy2mo*3-2

Outer alignment seems not as hard as we thought a few years ago. Llms are actually really good at understanding what we mean, so the sorcerer's apprentice and King Midas concerns seem obsolete. Except maybe for systems using heavy RL, where specification gaming is still a concern.

The more salient outer alignment issue now how to align agents when you don't have time or enough capability yourself to supervise them well. And that's mainly only a problem because of competitive race dynamics, which incentivize people to try and supervise AI 'beyond their means' so to speak.

So, cooling race dynamics could address a main portion of the remaining outer alignment problem. Scalable oversight techniques may also address it. What would remain then for (narrow) alignment is specification gaming, and then of course the whole inner alignment problem including deceptive alignment which is still a huge unsolved problem.

Evan R. Murphy's Shortform

Evan R. Murphy5mo20

Starting to be some discussion on LW now, e.g.

https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-general-claims-about-generalizable-reasoning

https://www.lesswrong.com/posts/tnc7YZdfGXbhoxkwj/give-me-a-reason-ing-model

Evan R. Murphy's Shortform

Evan R. Murphy5mo20

I should have mentioned the above thoughts are a low-confidence take. I was mostly just trying to get the ball rolling on discussion because I couldn't find any discussion of this paper on LessWrong yet, which really surprised me because I saw the paper had been shared thousands of times on LinkedIn already.

Evan R. Murphy's Shortform

Evan R. Murphy5mo2-4

Thoughts on "The Ilusion of Thinking" paper that came out of Apple recently?

https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

Seems to me like at least a point in favor of "stochastic parrots" over "builds a quality world model" for the language reasoning models.

Also wondering if their findings could be used to the advantage of safety/security somehow. E.g. if these models are more dependent on imitating examples than we relaized, then it might also be more effective than we previously thought to purge training data of the types of knowledge and reasoning that we don't want them to have (e.g. knowledge of dangerous weapons development, scheming, etc.)

LESSWRONG
LW

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments

OISs