Evan R. Murphy

I’m an AI alignment researcher currently focused on myopia and language models. I’m also interested in interpretability and other AI safety-related topics. My research is independent and currently supported by a grant from the Future Fund regranting program*.

Research that I’ve authored or co-authored:

Other recent work:

Before getting into AI alignment, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.

I'm always happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message! 


*In light of the FTX crisis, I’ve set aside the grant funds I received from Future Fund and am evaluating whether/how this money can be returned to customers of FTX who lost their savings in the debacle. In the meantime, I continue to work on AI alignment research using my personal savings. If you’re interested in funding my research or hiring me for related work, please reach out.


Interpretability Research for the Most Important Century


Fair point.

If the issue with "accident" is that it sounds minor*, then one could say "catastrophic accident risk" or similar.

*I'm not fully bought into this as the main issue, but supposing that it is...

Instead of "accident", we could say "gross negligence" or "recklessness" for catastrophic risk from AI misalignment.

I think you have a pretty good argument against the term "accident" for misalignment risk.

Misuse risk still seems like a good description for the class of risks where, once you have AI that is aligned with its operators, those operators may try to do unsavory things with their AI, or have goals that are quite at odds with the broad values of humans and other sentient beings.

Thanks, 'scary thing always on the right' would be a nice bonus. But evhub cleared up that particular confusion I had by saying that further to the right is always 'model agrees with that more'.

I'm not sure if the core NIST standards go into catastrophic misalignment risk, but Barrett et al.'s supplemental guidance on the NIST standards does. I was a reviewer on that work, and I think they have more coming (see link in my first comment on this post for their first part).

I would check out the 200 Concrete Open Problems in Mechanistic Interpretability post series by Neel Nanda. Mechanistic interpretability has been considered a promising research direction by many in the alignment community for years. But it's only in the past couple months that we have an experienced researcher in this area laying out specific concrete problems and providing detailed guidance for newcomers.

Caveat: I haven't myself looked closely at this post series yet, as in recent months I have been more focused on investigating language model behaviour than on interpretability. So I don't have direct knowledge that these posts are as useful as they look.

There is a teaching in Buddhism called "the eight worldly winds". The eight worldly winds refer to: praise and blame, success and failure, pleasure and pain, and fame and disrepute.

I don't know how faithful that verbiage is to the original ancient Indian text it was translated from. But I always found the term "worldly winds" really helpful and evocative. When I find myself chasing praise or reputation, recalling that phrase immediately reminds me that these things are like the wind, blowing around and changing direction from day to day. So it's foolish to worry about them too much or to try to control them, and it reminds me that I should focus on more important things.

Glad to see both the OP as well as the parent comment. 

I wanted to clarify something I disagreed with in the parent comment as well as in a sibling comment from Sam Marks about the Anthropic paper "Discovering Language Model Behaviors with Model-Written Evaluations" (paper, post):

Another reason for not liking RLHF that's somewhat related to the Anthropic paper you linked: because most contexts RLHF is used involve agentic simulacra, RLHF focuses the model's computation on agency in some sense. My guess is that this explains to an extent the results in that paper - RLHF'd models are better at focusing on simulating agency, agency is correlated with self-preservation desires, and so on.


1) My best guess about why Anthropic's model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires.

Both of these points seem to suggest that the main takeaway from the Anthropic paper was to uncover concerning behaviours in RLHF language models. That's true, but I think it's just as important that the paper also found pretty much the same concerning behaviours in plain pre-trained LLMs that did not undergo RLHF training, once those models were scaled up to a large enough size. 

What do you mean when you say the model is or is not "fighting you"?
