I'm glad to see this. Some initial thoughts about the control safety case:
Am I correct in interpreting that your definition of "found system" would apply to nearly all useful AI systems today, such as ChatGPT, since these are algorithms that run on weights found via optimization methods such as gradient descent? If so, it still seems fairly onerous.
Thanks for writing this and proposing a plan. Coincidentally, I drafted a short take here yesterday explaining one complaint I currently have with the safety conditions of this plan. In short, I suspect the “No AIs improving other AIs” criterion isn't worth including in a safety plan: it (i) addresses few marginal threat models (or addresses them ineffectively), and (ii) would be too unpopular to implement (or, alternatively, too weak to be useful).
I think there is a version of this plan, with a lower safety tax and more focus on reactive policy and the other three criteria, that I would be more excited about.
Another reason why layernorm is weird (and a shameless plug): the final layernorm also contributes to self-repair in language models.
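For intuition, here is a toy sketch (my own illustration, not code from the linked post) of one way the final layernorm can produce self-repair: when several components write a shared direction into the residual stream, zero-ablating one of them shrinks the stream's norm, the final layernorm then divides by a smaller scale, and the surviving components' contributions get boosted, so the logit drop is smaller than the ablated component's direct contribution. All shapes and weights below are made-up stand-ins for a real model.

```python
import torch

torch.manual_seed(0)
d_model, n_components = 64, 10

# Unembedding direction of the "correct" token; components share this direction plus noise.
u = torch.randn(d_model)
u = u / u.norm()
components = 3.0 * u + torch.randn(n_components, d_model)

def final_ln(x, stats_from=None):
    """LayerNorm without a learned affine; optionally reuse another vector's statistics."""
    ref = x if stats_from is None else stats_from
    return (x - ref.mean()) / ref.std()

clean = components.sum(0)
ablated = clean - components[3]  # zero-ablate one component's contribution

clean_logit   = final_ln(clean) @ u
frozen_logit  = final_ln(ablated, stats_from=clean) @ u  # LN scale held at its clean value
adapted_logit = final_ln(ablated) @ u                    # LN rescales the ablated stream

print(f"logit drop with LN frozen:     {(clean_logit - frozen_logit).item():.3f}")
print(f"logit drop with LN recomputed: {(clean_logit - adapted_logit).item():.3f}")  # typically much smaller
```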
Hmm, this transcript just seems like an example of blatant misalignment? I guess I have a definition of scheming that would imply deceptive alignment - for example, for me to classify Sydney as 'obviously scheming', I would need to see examples of Sydney 1) realizing it is in deployment and thus acting 'misaligned' or 2) realizing it is in training and thus acting 'aligned'.
In what manner was Sydney 'pretty obviously scheming'? The misalignment displayed by Sydney feels fairly different from the other forms of scheming I would be concerned about.
(if this is a joke, whoops sorry)
I'm surprised by this reaction. It feels like the intersection between people who have a decent shot at getting hired at OpenAI to do safety research and people who are unaware of the safety-related events at OpenAI is quite small.
What Comes Next
Coding got another big leap, both for professionals and amateurs.
Claude is now clearly best. I thought for my own purposes Claude Opus was already best even after GPT-4o, but not for everyone, and it was close. Now it is not so close.
Claude’s market share has always been tiny. Will it start to rapidly expand? To what extent does the market care, when most people in the past didn’t even realize they were using GPT-3.5 instead of GPT-4? With Anthropic not doing major marketing? Presumably adoption will be slow even if they remain on top, especially in the consumer market.
Yet with what is reportedly a big jump, we could see a lot of wrappers and apps start switching over rapidly. Developers have to be more on the ball.
How long should we expect Claude 3.5 Sonnet to remain on top?
I do not expect anyone except Google or OpenAI to pose a threat any time soon.
Was surprised by the one-shot Hanson lecture result. It prompted me to see whether Claude could identify your text, so I plugged in the above after asking "Who do you think the author behind this paragraph is?" Response:
Based on the style, content, and context of this paragraph, I believe the author is likely Zvi Mowshowitz.
Here's my reasoning:
1. The writing style is analytical and forward-looking, characteristic of someone who frequently discusses AI developments and their implications.
2. The author mentions Claude specifically and compares it to GPT models, showing familiarity with different AI models and their capabilities.
3. The discussion of market share, adoption rates, and speculation about future developments is typical of someone who closely follows the AI industry.
4. The mention of "wrappers and apps" and consideration of developer behavior suggests someone with knowledge of the AI ecosystem.
5. Zvi Mowshowitz is known for writing about AI developments, particularly on his blog "Don't Worry About the Vase" and other platforms. He often provides analysis of new AI models and their potential impacts.
6. The structure and tone of the writing is similar to Zvi's typical style, with a mix of personal opinion and broader industry analysis.
However, I should note that while this is my best guess based on the available information, I can't be absolutely certain. If you're looking for a definitive answer, it would be best to verify the source directly.
Would you like me to elaborate on any part of this analysis?
Wow
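For anyone who wants to rerun the experiment above, here is a minimal sketch of the query via the Anthropic Python SDK; the model identifier, max_tokens, and the `paragraph` placeholder are my assumptions, not details from the original comment.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

paragraph = "..."  # paste the excerpt being tested here

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumed model identifier
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": f"Who do you think the author behind this paragraph is?\n\n{paragraph}",
    }],
)
print(response.content[0].text)
```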
It also seems to have led to at least one claim in a policy memo that advocates of AI safety are being silly because mechanistic interpretability was solved.
Small nitpick (I agree with almost everything else in the post and am glad you wrote it up). This feels like an unfair criticism - I assume you are referring specifically to the statement in their paper that:
Although advocates for AI safety guidelines often allude to the "black box" nature of AI models, where the logic behind their conclusions is not transparent, recent advancements in the AI sector have resolved this issue, thereby ensuring the integrity of open-source code models.
I think Anthropic's interpretability team, while making perhaps dubious claims about the impact of their work on safety, has been clear that mechanistic interpretability is far from 'solved.' For instance, Chris Olah in the linked NYT article from today:
“There are lots of other challenges ahead of us, but the thing that seemed scariest no longer seems like a roadblock,” he said.
Also, in the paper's section on Inability to Evaluate:
it's unclear that they're really getting at the fundamental thing we care about
I think they are overstating how far along mechanistic interpretability currently is and how useful it is. However, I don't think this messaging comes close to 'mechanistic interpretability solves AI interpretability' - this error is on a16z, not Anthropic.
This might not work well for others, but a thing that's worked well for me has been to (basically) block cheap access to it with anticharities. Introducing friction in general is good.