Anthropic just released[1] a new AI model, Mythos. Mythos can take a browser crash and turn it into a working exploit that takes over your computer 72% of the time.[2] Anthropic is the least bad AI lab. The people on their alignment team are doing some of the best AI...
Nobody has ever done an in-person, door-to-door survey about AI risks[1]. What do people really think about AI? Like, really? There have been some surveys on the risks from AI. But there’s a real difference between looking at numbers on a page vs. the feeling of talking to...
How can we actually minimize the odds that AI leads to catastrophic outcomes for all of us humans? This question has been rattling around my head for the last two months. The world might be ending. Nobody seems to care. The incentives are steaming us ahead. When I ask strangers...
This is not a condensed post with only my best final ideas[1]; this post is me writing across multiple days[2] as I try to work through a problem. Enjoy. I did something recently that I regret. I did something that I suspect hurt someone[3]. If I had asked myself in...
Over the last month I have been trying to see just how much I can learn and do from a cold start[1] in the world of AI safety. A large part of this has been frantically learning mech interp, but I've picked up two projects that I think are worth...
Update: I am currently working on an approach to get the extended LW/Alignment Forum/blogosphere included in a smarter way[1]. I'm using https://github.com/StampyAI/alignment-research-dataset as a jumping-off point. Click here if you just want to see the database I made of all[2] AI safety papers written since 2020 and not...
With all of the discussion about changes to Anthropic's Responsible Scaling Policy, I figured actually reading through all the versions in one go would be helpful. I wanted to easily compare sections side by side, so I made a quick website, which you can find here. It took me a...