Evan R. Murphy

Software engineer since 2010. Left Google in fall '21, now getting into independent AI alignment research.

Currently working on interpretability, trying to apply Circuits thread-style analysis to acoustic models.

Wiki Contributions


Action: Help expand funding for AI Safety by coordinating on NSF response

Judging by the voting and comments so far (both here and on the EA Forum crosspost), my sense is that many here support this effort, but some definitely have concerns. A few of the concerns are based on hardcore skepticism about academic research, and I'm not sure those are compatible with responding to the RfI. Many concerns, though, seem to be about this generating vague NSF grants that are in the name of AI safety but don't actually contribute to the field.

For these latter concerns, I wonder if there's a way we could resolve them by limiting the scope of topics in our NSF responses or giving them enough specificity. For example, what if we convinced the NSF to make grants only for mechanistic interpretability projects like the Circuits Thread? This is an area that most researchers in the alignment community seem to agree is useful; we just need a lot more people doing it to make substantial progress. And maybe there is less room to go adrift or mess up this kind of concrete, empirical research compared to some of the more theoretical research directions.

It doesn't have to be just mechanistic interpretability, but my point is, are there ways we could shape or constrain our responses to the NSF like this that would help address your concerns?

Action: Help expand funding for AI Safety by coordinating on NSF response

This is a conceivable universe, but do you really think it's likely? It seems to me much more likely that additional funding opportunities would help AI safety research move at least a little bit faster.

Promising posts on AF that have fallen through the cracks

Great question. IMO it's probably worth it to try leaving what you think may be a bad comment rather than no comment at all. Sometimes what we assume is obvious or a bad comment may actually be very useful or just the feedback/fresh perspective that someone else needed. 

You've made me realize that simply counting the comments and the other criteria in my post probably don't provide enough signal, though. The first comment may still be bad from the perspective of the original author/researcher, or just not the kind of feedback they really needed, or they may need more of it.

But we could promote this "just go for it" attitude for initial comments as a community norm in combination with this suggestion from Jon Garcia:  

What if posts could be flagged by authors and/or the community as needing feedback or discussion? This could work something like pinning to the front page, except that the "pinnedness" could decay over time to make room for other posts while getting periodically refreshed.

Then in case the initial rushed comment didn't satisfy the author's need for feedback, they could simply continue to leave the post flagged as still needing feedback/discussion.

It would be easy to forget to remove this tag from your posts. So every time a new comment is made, AF/LessWrong should probably prompt the author whether the tag is still necessary, or automatically remove the tag and force the author to re-add it if they still need it.

Additional incentives may be useful to make sure people don't clutter up the tag with comments that just-kinda-sorta-would-be-nice to have more comments on. Maybe an author is limited to having 2~3 posts with this tag at a time. Or it uses some kind of bounty mechanism where comments submitted on a post with the "needs feedback" tag earn 2x karma, with the extra karma donated from the author of the post.


Several of the Circuits articles provide colab notebooks reproducing the results in the article, which may be helpful references if one wants to do Circuits research on vision models.


I'm starting to reproduce some results from the Circuits thread. It took me longer than expected just to find these colab notebooks so I wanted to share more specifically in case it saves anyone else some time.

The text "colab" isn't really turned up in a targeted Google search on Distill and the Circuits thread.  Also if you open a post like Visualizing Weights and do a Ctrl+F or Cmd+F search for "colab", you won't turn up any results either.

But if you open that same post and scroll down, you'll see button links to open the colab notebooks that look like this. These will also turn up in a Ctrl+F or Cmd+F search for "notebook" in case you want to jump around to different colab examples in the Circuits thread.
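For anyone who'd rather script this search than Ctrl+F each article, here's a minimal stdlib-only Python sketch that scans an article's saved HTML for links to Colab notebooks. (The sample HTML below is a made-up illustration, not taken from an actual Circuits article.)

```python
from html.parser import HTMLParser

class ColabLinkFinder(HTMLParser):
    """Collect href values that point at Colab notebooks."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Keep any anchor whose href points at the Colab domain.
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "colab.research.google.com" in href:
                self.links.append(href)

# Example: feed in a saved copy of an article's HTML.
sample_html = """
<p>Reproduce the figures in this
<a href="https://colab.research.google.com/github/example/notebook.ipynb">notebook</a>.</p>
"""
finder = ColabLinkFinder()
finder.feed(sample_html)
print(finder.links)
```

You could point this at each Circuits article's HTML (e.g. downloaded with your browser's "Save Page As") to pull out every notebook link in one pass.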

Beware using words off the probability distribution that generated them.

Nice post, so many hidden assumptions behind the words we use.

I wonder what are some concrete examples of this in alignment discussions, examples like your one about the probability that god exists.

One that comes to mind is a recent comment thread on one of the Late 2021 MIRI Conversations posts where we were assigning probabilities to "soft takeoff" and "hard takeoff" scenarios. Then Daniel Kokotajlo realized that "soft takeoff" had to be disambiguated, because in that context some people were using it to mean any kind of gradual advancement in AI capabilities, whereas others meant it to mean specifically "GDP doubling in 4 years, then doubling in 1 year". 

Solving Interpretability Week

I'm interested in trying a co-work call sometime but won't have time for it this week.

Thanks for sharing about Shay in this post. I hadn't heard of her before; what a valuable resource she is, and what a valuable way she's helping the cause of AI safety.

(As for contact, I check my LessWrong/Alignment Forum inbox for messages regularly.)

More Christiano, Cotra, and Yudkowsky on AI progress

Well said! This resonates with my Eliezer-model too.

Taking this into account I'd update my guess of Eliezer's position to:

  • Eliezer: 5% soft takeoff, 80% hard takeoff, 15% something else

This last "something else" bucket added because "the Future is notoriously difficult to predict" (paraphrasing Eliezer).

More Christiano, Cotra, and Yudkowsky on AI progress

So here y'all have given your sense of the likelihoods as follows:

  • Paul: 70% soft takeoff, 30% hard takeoff
  • Daniel: 30% soft takeoff, 70% hard takeoff

How would Eliezer's position be stated in these terms? Similar to Daniel's?

[AN #61] AI policy and governance, from two people in the field

This work on learning with constraints seems interesting.

Looks like the paper "Bridging Hamilton-Jacobi Safety Analysis and Reinforcement Learning" has moved so that link is currently broken. Here's a working URL: https://ieeexplore.ieee.org/document/8794107 Also one more where the full paper is more easily accessible: http://files.davidqiu.com/research/papers/2019_fisac_Bridging%20Hamilton-Jacobi%20Safety%20Analysis%20and%20Reinforcement%20Learning%20[RL][Constraints].pdf

Interpreting Yudkowsky on Deep vs Shallow Knowledge

Great investigation/clarification of this recurring idea from the ongoing Late 2021 MIRI Conversations.

  • outside vs. inside view - I've thought about this before but hadn't read as clear a description of the differences and tradeoffs (still catching up on Eliezer's old writings)
  • "deep knowledge is far better at saying what won’t work than at precisely predicting the correct hypothesis." - very useful takeaway

You might not like his tone in the recent discussions, but if someone has been saying the same thing for 13 years, nobody seems to get it, and their model predicts that this will lead to the end of the world, maybe they can get some slack for talking smack.

Good point and we should. Eliezer is a valuable source of ideas and experience around alignment, and it seems like he's contributed immensely to this whole enterprise.

I just hope all his smack talking doesn't turn away talented people coming to lend a hand on alignment. I expect a lot of people on this (AF) forum found it, like me, after reading all of Open Phil's and 80,000 Hours' convincing writing about the urgency of solving the AI alignment problem. It seems silly to have those orgs working hard to recruit people to help out, only to have them come over here and find one of the leading thinkers in the community going on frequent tirades about how much EAs suck, even though he doesn't know most of us. Not to mention folks like Paul and Richard who have been taking his heat directly in these marathon discussions!
