Is Claude "more aligned" than Llama?
Anthropic seems to be the AI company that cares the most about AI risk, and Meta cares the least. If Anthropic is doing more alignment research than Meta, do the results of that research visibly show up in the behavior of Claude vs. Llama?
I am not sure how you would test this. The first thing that comes to mind is to test how easily different LLMs can be tricked into doing things they were trained not to do, but I don't know if that's a great example of an "alignment failure". You could test model deception but you'd need some objective standard to compare different models on.
And I am not sure how much you should even expect the results of alignment research to show up in present-day LLMs.
Hmm I wonder if this is why so many April Fools posts have >200 upvotes. April Fools Day in cahoots with itself?
isn't your squiggle model talking about whether racing is good, rather than whether unilaterally pausing is good?
Yes, the model is more about racing than about pausing, but I thought it was applicable here. My thinking was that there is a spectrum of development speed with "completely pause" on one end and "race as fast as possible" on the other. Pushing toward the "pause" side of the spectrum has the ~opposite effect of pushing toward the "race" side.
I wish you'd try modeling this with more granularity than "is alignment hard" or whatever
I had a quick look at the linked post and it seems to be making some implicit assumptions, such as
I would like to see some quantification of the form "we think there is a 30% chance that we can bootstrap AI alignment using AI; a unilateral pause will only increase the probability of a global pause by 3 percentage points; and there's only a 50% chance that the 2nd-leading company will attempt to align AI in a way we'd find satisfactory; therefore we think the least-risky plan is to stay at the front of the race and then bootstrap AI alignment." (Or a more detailed version of that.)
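To show the shape of the quantification I have in mind, here is a hypothetical back-of-envelope sketch in Python. The three headline numbers are just the illustrative ones from the quote above; the model structure, the baseline probability of a global pause, and the assumption that a global pause leads to a good outcome are placeholders I made up, not claims from any linked post.

```python
# Hypothetical back-of-envelope comparison of two plans.
# The three headline numbers come from the illustrative quote above;
# everything else (model structure, baseline chance of a global pause,
# treating a global pause as a good outcome) is a made-up placeholder.

p_bootstrap = 0.30           # chance the leading lab can bootstrap AI alignment using AI
pause_boost = 0.03           # how much a unilateral pause raises P(global pause)
p_runnerup_tries = 0.50      # chance the 2nd-leading company attempts alignment satisfactorily

p_global_pause_baseline = 0.05  # assumed baseline chance of a global pause (placeholder)

# Plan A: stay at the front of the race, then try to bootstrap alignment.
p_good_race = p_bootstrap

# Plan B: unilaterally pause. Either a global pause happens (assumed good here),
# or the runner-up takes the lead and must both attempt alignment and succeed.
p_global_pause = p_global_pause_baseline + pause_boost
p_good_pause = p_global_pause + (1 - p_global_pause) * p_runnerup_tries * p_bootstrap

print(f"P(good outcome | race and bootstrap): {p_good_race:.2f}")   # 0.30
print(f"P(good outcome | unilateral pause):   {p_good_pause:.2f}")  # ~0.22
```

Under these particular placeholder numbers the conclusion in the quote follows; the point is that I want to see the numbers written down so they can be argued with.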
I think it would probably be bad for the US to unilaterally force all US AI developers to pause if they didn't simultaneously somehow slow down non-US development.
It seems to me that to believe this, you have to believe all of these four things are true:
I think #3 is sort-of true and the others are probably false, so the probability of all four being simultaneously true is quite low.
(Statements I've seen from Chinese developers lead me to believe that they are less interested in racing and more concerned about safety.)
I made a quick Squiggle model on racing vs. slowing down. Based on my first-guess parameters, it suggests that racing to build AI destroys ~half the expected value of the future compared to not racing. Parameter values are rough, of course.
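For a sense of the kind of calculation involved, here is a rough Python stand-in (the actual model is written in Squiggle and has a different structure). All of the ranges below are placeholders chosen only to illustrate how "racing destroys ~half the expected value" can fall out of a model like this; they are not the model's real parameters.

```python
# Rough Python analogue of a racing-vs-slowing expected-value model.
# NOT the actual Squiggle model; structure and all parameter ranges are
# placeholder assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Placeholder parameter ranges (uniform here for simplicity; Squiggle would
# typically use distributions like `0.2 to 0.6`).
p_doom_race = rng.uniform(0.4, 0.8, n)   # P(existential catastrophe | race)
p_doom_slow = rng.uniform(0.1, 0.3, n)   # P(existential catastrophe | slow down)
value_if_ok = 1.0                        # normalize the value of a good future to 1

ev_race = np.mean((1 - p_doom_race) * value_if_ok)
ev_slow = np.mean((1 - p_doom_slow) * value_if_ok)

print(f"EV(race) / EV(slow) = {ev_race / ev_slow:.2f}")  # ~0.5 with these placeholder ranges
```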
That's kind-of what happened with the anti-nuclear movement, but it ended up doing lots of harm because the things that could be stopped were the good ones!
The global stockpile of nuclear weapons is down 6x since its peak in 1986. Hard to attribute causality, but if the anti-nuclear movement played a part in that, then I'd say it was net positive.
(My guess is it's more attributable to the collapse of the Soviet Union than to anything else, but the anti-nuclear movement probably still played some nonzero role)
Yeah, I actually agree with that. I don't think it was sufficient, just that it was pretty good. I wrote the comment too quickly without thinking about my wording.
I feel kind of silly about supporting PauseAI. Doing ML research or writing long, fancy policy reports feels high-status; public protests feel low-status. I would rather not be seen publicly advocating for doing something low-status, and I suspect a good number of other people feel the same way.
(I do in fact support PauseAI US, and I have defended it publicly because I think it's important to do so, but it makes me feel silly whenever I do.)
That's not the only reason why people don't endorse PauseAI, but I think it's an important reason that should be mentioned.
Well -- I'm gonna speak broadly -- if you look at the history of PauseAI, they are marked by the belief that the measures proposed by others are insufficient for Actually Stopping AI: for instance, that the kind of policy measures proposed by people working at AI companies aren't enough; that the kind of measures proposed by people funded by OpenPhil are often not enough; and so on.
They are correct as far as I can tell. Can you identify a policy measure proposed by an AI company or an OpenPhil-funded org that you think would be sufficient to stop unsafe AI development?
I think there is indeed exactly one such policy measure: SB 1047, which was supported by the Center for AI Safety (which is OpenPhil-funded, IIRC). Most big AI companies lobbied against it, and Anthropic opposed the original, stronger version and got it reduced to a weaker and probably less-safe one.
When I wrote about where I was donating in 2024, I went through a bunch of orgs' policy proposals and explained why they appeared deeply inadequate. Some specific relevant parts: 1, 2, 3, 4
Edit: Adding some color so you don't have to click through: when I say the proposals I reviewed were inadequate, I mean they said things like (paraphrasing) "safety should be done on a completely voluntary basis with no government regulations" and "companies should have safety officers, but those officers should not have final say on anything", and would simply not address x-risk at all, or would make harmful proposals like "the US Department of Defense should integrate more AI into its weapon systems" or "we need to stop worrying about x-risk because it's distracting from the real issues".
If you look at the kind of claims that PauseAI makes on their risks page, you might believe that some of them seem exaggerated, or that PauseAI is simply throwing all the negative things they can find about AI into a big list to make it seem bad. If you think that credibility is important to the effort to pause AI, then PauseAI might seem very careless about the truth in a way that could backfire.
A couple notes on this:
Thanks, that's useful info!
I thought you could post images by dragging and dropping files into the comment box (I seem to recall doing that in the past), but now it doesn't seem to work for me. Maybe that only works for top-level posts?