MichaelDickens

Comments

Thanks, that's useful info!

I thought you could post images by dragging and dropping files into the comment box (I seem to recall doing that in the past), but now it doesn't seem to work for me. Maybe that only works for top-level posts?

Is Claude "more aligned" than Llama?

Anthropic seems to be the AI company that cares the most about AI risk, and Meta cares the least. If Anthropic is doing more alignment research than Meta, do the results of that research visibly show up in the behavior of Claude vs. Llama?

I am not sure how you would test this. The first thing that comes to mind is to test how easily different LLMs can be tricked into doing things they were trained not to do, but I don't know if that's a great example of an "alignment failure". You could test model deception, but you'd need some objective standard on which to compare different models.

And I am not sure how much you should even expect the results of alignment research to show up in present-day LLMs.

Hmm I wonder if this is why so many April Fools posts have >200 upvotes. April Fools Day in cahoots with itself?

isn't your squiggle model talking about whether racing is good, rather than whether unilaterally pausing is good?

Yes, the model is more about racing than about pausing, but I thought it was applicable here. My thinking was that there is a spectrum of development speed with "completely pause" on one end and "race as fast as possible" on the other. Pushing more toward the "pause" side of the spectrum has the ~opposite effect as pushing toward the "race" side.

I wish you'd try modeling this with more granularity than "is alignment hard" or whatever

  1. I've never seen anyone else try to quantitatively model it. As far as I know, my model is the most granular quantitative model ever made. Which isn't to say it's particularly granular (I spent less than an hour on it) but this feels like an unfair criticism.
  2. In general I am not a fan of criticisms of the form "this model is too simple". All models are too simple. What, specifically, is wrong with it?

I had a quick look at the linked post and it seems to be making some implicit assumptions, such as

  1. the plan of "use AI to make AI safe" has a ~100% chance of working (the post explicitly says this is false, but then proceeds as if it's true)
  2. there is a ~100% chance of slow takeoff
  3. if you unilaterally pause, this doesn't increase the probability that anyone else pauses, doesn't make it easier to get regulations passed, etc.

I would like to see some quantification of the form "we think there is a 30% chance that we can bootstrap AI alignment using AI; a unilateral pause will only increase the probability of a global pause by 3 percentage points; and there's only a 50% chance that the 2nd-leading company will attempt to align AI in a way we'd find satisfactory; therefore we think the least-risky plan is to stay at the front of the race and then bootstrap AI alignment." (Or a more detailed version of that.)
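
To illustrate the kind of quantification I mean, here is a minimal Python sketch. The three input numbers are the hypothetical ones from the paragraph above; everything else (the variable names, treating a global pause as fully safe, and giving the 2nd-place lab the same 30% bootstrap odds) is an assumption I'm adding purely for illustration, not anything from the linked post.

```python
# Hypothetical sketch of the kind of quantification described above.
# The three input numbers come from the example in the comment; how they
# combine is made up for illustration.

p_bootstrap = 0.30        # chance "use AI to align AI" works for a safety-focused leader
p_pause_spreads = 0.03    # added chance a unilateral pause leads to a global pause
p_runner_up_tries = 0.50  # chance the 2nd-leading company attempts alignment satisfactorily

p_good_if_race = p_bootstrap

p_good_if_pause = (
    p_pause_spreads * 1.0                                      # global pause: assume safe
    + (1 - p_pause_spreads) * p_runner_up_tries * p_bootstrap  # runner-up leads instead
)

print(f"P(good outcome | race):  {p_good_if_race:.2f}")   # 0.30
print(f"P(good outcome | pause): {p_good_if_pause:.2f}")  # ~0.18
```

Under that (very crude) way of combining the numbers, racing comes out ahead, which is the shape of argument I'd like to see made explicit.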

I think it would probably be bad for the US to unilaterally force all US AI developers to pause if they didn't simultaneously somehow slow down non-US development.

It seems to me that to believe this, you have to believe that all four of the following are true:

  1. Solving AI alignment is basically easy
  2. Non-US frontier AI developers are not interested in safety
  3. Non-US frontier AI developers will quickly catch up to the US
  4. If US developers slow down, then non-US developers are very unlikely to also slow down—either voluntarily, or because the US strong-arms them into signing a non-proliferation treaty, or whatever

I think #3 is sort-of true and the others are probably false, so the probability of all four being simultaneously true is quite low.

(Statements I've seen from Chinese developers lead me to believe that they are less interested in racing and more concerned about safety.)
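
As a toy illustration of why the conjunction matters: assign each claim an individually plausible credence (the numbers below are made up, and the claims surely aren't fully independent), and the probability that all four hold at once gets small quickly.

```python
# Illustrative only: made-up credences for the four claims above, treated as
# independent, to show how a conjunction of individually plausible claims
# shrinks when multiplied together.
p_alignment_easy   = 0.3
p_others_unsafe    = 0.4
p_others_catch_up  = 0.7   # the "sort-of true" one
p_no_slowdown_else = 0.4

p_all_four = (p_alignment_easy * p_others_unsafe
              * p_others_catch_up * p_no_slowdown_else)
print(f"P(all four hold): {p_all_four:.2f}")  # ~0.03
```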

I made a quick Squiggle model on racing vs. slowing down. Based on my first-guess parameters, it suggests that racing to build AI destroys ~half the expected value of the future compared to not racing. Parameter values are rough, of course.
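
For anyone who doesn't want to click through, here is a stripped-down Python analogue of the shape of that model. The actual model is written in Squiggle with different (and uncertain) parameters; the placeholder values below are chosen only so the output matches the "~half the expected value" headline, and the variable names are mine.

```python
# A minimal analogue of the racing-vs-not-racing expected-value comparison.
# Placeholder point estimates, not the parameters of the actual Squiggle model.

p_doom_if_race = 0.65   # placeholder: chance of existential catastrophe if labs race
p_doom_if_slow = 0.30   # placeholder: chance of catastrophe if labs slow down

ev_race = 1 - p_doom_if_race   # expected value of the future, good future normalized to 1
ev_slow = 1 - p_doom_if_slow

print(f"Fraction of expected value lost by racing: {1 - ev_race / ev_slow:.0%}")  # 50%
```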

That's kind-of what happened with the anti-nuclear movement, but it ended up doing lots of harm because the things that could be stopped were the good ones!

The global stockpile of nuclear weapons is down 6x since its peak in 1986. It's hard to attribute causality, but if the anti-nuclear movement played a part in that, then I'd say it was net positive.

(My guess is that it's more attributable to the collapse of the Soviet Union than to anything else, but the anti-nuclear movement probably still played some nonzero role.)

Yeah, I actually agree with that. I don't think it was sufficient, just that it was pretty good. I wrote the comment too quickly without thinking about my wording.

Answer by MichaelDickens

I feel kind of silly about supporting PauseAI. Doing ML research or writing long, fancy policy reports feels high-status. Public protests feel low-status. I would rather not be seen publicly advocating for doing something low-status. I suspect a good number of other people feel the same way.

(I do in fact support PauseAI US, and I have defended it publicly because I think it's important to do so, but it makes me feel silly whenever I do.)

That's not the only reason why people don't endorse PauseAI, but I think it's an important reason that should be mentioned.

Well -- I'm gonna speak broadly -- if you look at the history of PauseAI, they are marked by belief that the measures proposed by others are insufficient for Actually Stopping AI -- for instance the kind of policy measures proposed by people working at AI companies isn't enough; that the kind of measures proposed by people funded by OpenPhil are often not enough; and so on.

They are correct as far as I can tell. Can you identify a policy measure proposed by an AI company or an OpenPhil-funded org that you think would be sufficient to stop unsafe AI development?

I think there is indeed exactly one such policy measure: SB 1047, which was supported by the Center for AI Safety (an OpenPhil-funded org, IIRC). Most big AI companies lobbied against it, and Anthropic opposed the original, stronger version and got it reduced to a weaker and probably less safe one.

When I wrote about where I was donating in 2024, I went through a bunch of orgs' policy proposals and explained why they appeared deeply inadequate. Some specific relevant parts: 1, 2, 3, 4

Edit: Adding some color so you don't have to click through: when I say the proposals I reviewed were inadequate, I mean they said things like (paraphrasing) "safety should be done on a completely voluntary basis with no government regulations" and "companies should have safety officers, but those officers should not have final say on anything"; or they simply did not address x-risk at all; or they made harmful proposals like "the US Department of Defense should integrate more AI into its weapon systems" or "we need to stop worrying about x-risk because it's distracting from the real issues".

If you look at the kind of claims that PauseAI makes on their risks page, you might believe that some of them seem exaggerated, or that PauseAI is simply throwing every negative thing they can find about AI into a big list to make it seem bad. If you think that credibility is important to the effort to pause AI, then PauseAI might seem very careless about truth in a way that could backfire.

A couple notes on this:

  • AFAICT PauseAI US does not do the thing you describe.
  • I've looked at a good amount of research on protest effectiveness. There are many observational studies showing that nonviolent protests are associated with preferred policy changes / voting patterns, and ~four natural experiments. If protests backfired for fairly minor reasons like "their website makes some hard-to-defend claims" (contrasted with major reasons like "the protesters are setting buildings on fire"), I think that would show up in the literature, and it doesn't.