How can we actually minimize the odds that AI leads to catastrophic outcomes for all of us humans? This question has been rattling around my head for the last two months. The world might be ending. Nobody seems to care. The incentives are steaming us ahead. When I ask strangers...
This is not a condensed post with only my best final ideas[1]; it's me writing across multiple days[2] as I try to work through a problem. Enjoy. I did something recently that I regret. I did something that I suspect hurt someone[3]. If I had asked myself in...
Over the last month I have been trying to see just how much I can learn and do from a cold start[1] in the world of AI safety. A large part of this has been frantically learning mech interp, but I've picked up two projects that I think are worth...
Update: I am currently working on an approach to get the extended LW/Alignment Forum/blog sphere included in a smarter way[1]. I'm using https://github.com/StampyAI/alignment-research-dataset as a jumping-off point. Click here if you just want to see the database I made of all[2] AI safety papers written since 2020 and not...
With all of the discussion about changes to Anthropic's Responsible Scaling Policy, I figured actually reading through all of them in one go would be helpful. I wanted to easily compare sections side by side, so I made a quick website which you can find here. It took me a...