Evan R. Murphy's Shortform

Evan R. Murphy

28th Feb 2025

This is a special post for quick takes by Evan R. Murphy. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

[-]Evan R. Murphy1y*92

2023: AI governance starting to look promising because governments are waking up to AI risks. Technical AI safety getting challenging if you're not in a frontier lab because it's hard to access relevant models to run experiments.

2025: AI governance looking bleak after the AI Action Summit. Technical AI safety looking more accessible because open-weight models are proliferating.

[-]Evan R. Murphy1y30

"AI governance looking bleak" seems like an overstatement. Certain types or aims of AI governance are looking bleak right now, especially getting strong safety-oriented international agreements that include the US and China, or meaningful AI regulation at the national level in the US. But there may be other sorts of AI governance projects (e.g. improving the policies of frontier labs, preparing for warning shots, etc.) that could still be quite worthwhile.

[-]Evan R. Murphy7mo*3-2

Outer alignment seems not as hard as we thought a few years ago. LLMs are actually really good at understanding what we mean, so the sorcerer's apprentice and King Midas concerns seem obsolete, except maybe for systems using heavy RL, where specification gaming is still a concern.

The more salient outer alignment issue now is how to align agents when you don't have the time or enough capability yourself to supervise them well. And that's mainly a problem only because of competitive race dynamics, which incentivize people to try to supervise AI 'beyond their means', so to speak.

So, cooling race dynamics could address a large portion of the remaining outer alignment problem. Scalable oversight techniques may also address it. What would then remain for (narrow) alignment is specification gaming, and of course the whole inner alignment problem, including deceptive alignment, which is still a huge unsolved problem.

[-]Evan R. Murphy10mo2-4

Thoughts on "The Illusion of Thinking" paper that came out of Apple recently?

https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

Seems to me like at least a point in favor of "stochastic parrots" over "builds a quality world model" for the language reasoning models.

Also wondering if their findings could be used to the advantage of safety/security somehow. E.g. if these models are more dependent on imitating examples than we realized, then it might also be more effective than we previously thought to purge training data of the types of knowledge and reasoning that we don't want them to have (e.g. knowledge of dangerous weapons development, scheming, etc.).

[-]Evan R. Murphy10mo20

I should have mentioned the above thoughts are a low-confidence take. I was mostly just trying to get the ball rolling on discussion because I couldn't find any discussion of this paper on LessWrong yet, which really surprised me because I saw the paper had been shared thousands of times on LinkedIn already.

[-]Evan R. Murphy10mo20

Starting to be some discussion on LW now, e.g.

https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-general-claims-about-generalizable-reasoning

https://www.lesswrong.com/posts/tnc7YZdfGXbhoxkwj/give-me-a-reason-ing-model
