Nora Belrose - LessWrong

Refusal in LLMs is mediated by a single direction

Nice work! Since you cite our LEACE paper, I was wondering if you've tried burning LEACE into the weights of a model just like you burn an orthogonal projection into the weights here? It should work at least as well, if not better, since LEACE will perturb the activations less.

Nitpick: I wish you would use a word other than "orthogonalization" since it sounds like you're saying that you're making the weight matrix an orthogonal matrix. Why not LoRACS (Low Rank Adaptation Concept Erasure)?

Counting arguments provide no evidence for AI doom

Nora Belrose1d10

Unless you think transformative AI won't be trained with some variant of SGD, I don't see why this objection matters.

Also, I think the a priori methodological problems with counting arguments in general are decisive. You always need some kind of mechanistic story for why a "uniform prior" makes sense in a particular context, you can't just assume it.

What's with all the bans recently?

Nora Belrose12d10

I don't know what caused it exactly, and it seems like I'm not rate limited anymore.

What's with all the bans recently?

Nora Belrose14d70

If moderators started rate-limiting Nora Belrose or someone else whose work I thought was particularly good

I actually did get rate-limited today, unfortunately.

Inducing Unprompted Misalignment in LLMs

Nora Belrose14d215

Unclear why this is supposed to be a scary result.

"If prompting a model to do something bad generalizes to it being bad in other domains, this is also evidence for the idea that prompting a model to do something good will generalize to it doing good in other domains" - Matthew Barnett

Deconstructing Bostrom's Classic Argument for AI Doom

Nora Belrose2mo1411

Yeah, I think Evan is basically opportunistically changing his position during that exchange, and has no real coherent argument.

Deconstructing Bostrom's Classic Argument for AI Doom

Nora Belrose2mo1-7

I do think that Solomonoff-flavored intuitions motivate much of the credence people around here put on scheming. Apparently Evan Hubinger puts a decent amount of weight on it, because he kept bringing it up in our discussion in the comments to Counting arguments provide no evidence for AI doom.

Deconstructing Bostrom's Classic Argument for AI Doom

Nora Belrose2mo23

The strong version as defined by Yudkowsky... is pretty obvious IMO

I didn't expect you'd say that. In my view it's pretty obviously false. Knowledge and skills are not value-neutral, and some goals are a lot harder to instill into an AI than others bc the relevant training data will be harder to come by. Eliezer is just not taking into account data availability whatsoever, because he's still fundamentally thinking about things in terms of GOFAI and brains in boxes in basements rather than deep learning. As Robin Hanson pointed out in the foom debate years ago, the key component of intelligence is "content." And content is far from value neutral.

Deconstructing Bostrom's Classic Argument for AI Doom

Nora Belrose2mo2-8

As I argue in the video, I actually think the definitions of "intelligence" and "goal" that you need to make the Orthogonality Thesis trivially true are bad, unhelpful definitions. So I both think that it's false, and even if it were true it'd be trivial.

I'll also note that Nick Bostrom himself seems to be making the motte and bailey argument here, which seems pretty damning considering his book was very influential and changed a lot of people's career paths, including my own.

Edit replying to an edit you made: I mean, the most straightforward reading of Chapters 7 and 8 of Superintelligence is just a possibility-therefore-probability fallacy in my opinion. Without this fallacy, there would be little need to even bring up the orthogonality thesis at all, because it's such a weak claim.

Counting arguments provide no evidence for AI doom

Nora Belrose2mo1-2

If it's spontaneous then yeah, I don't expect it to happen ~ever really. I was mainly thinking about cases where people intentionally train models to scheme.

LESSWRONG
LW

Posts

Wiki Contributions

Comments