Anthropic just released[1] a new AI model, Mythos. Mythos can take a browser crash and turn it into a working exploit that takes over your computer 72% of the time.[2] Anthropic is the least bad AI lab. The people on their alignment team are doing some of the best AI...
Nobody has ever done an in-person, door-to-door survey about AI risks[1]. What do people really think about AI? Like, really? There have been some surveys on the risks from AI. But there’s a real difference between looking at numbers on a page vs. the feeling of talking to...
How can we actually minimize the odds that AI leads to catastrophic outcomes for all of us humans? This question has been rattling around my head for the last two months. The world might be ending. Nobody seems to care. The incentives are steaming us ahead. When I ask strangers...
This is not a condensed post with only my best final ideas[1]; this post is me writing across multiple days[2] as I try to work through a problem. Enjoy. I did something recently that I regret. I did something that I suspect hurt someone[3]. If I had asked myself in...
Over the last month I have been trying to see just how much I can learn and do from a cold start[1] in the world of AI safety. A large part of this has been frantically learning mech interp, but I've picked up two projects that I think are worth...
Update: I am currently working on an approach to get the extended LW/Alignment Forum/blogosphere included in a smarter way[1]. I'm using https://github.com/StampyAI/alignment-research-dataset as a jumping-off point. Click here if you just want to see the database I made of all[2] AI safety papers written since 2020 and not...
With all of the discussion about changes to Anthropic's Responsible Scaling Policy, I figured actually reading through all the versions in one go would be helpful. I wanted to easily compare sections side by side, so I made a quick website, which you can find here. It took me a...