Nora Belrose — LessWrong

It's not clear that this is undesired behavior from the perspective of OpenAI. They aren't actually putting GPT in a situation where it will make high-stakes decisions, and upholding deontological principles seems better from a PR perspective than consequentialist reasoning in these cases.

My AI Model Delta Compared To Yudkowsky

Nora Belrose1y7-13

In worlds where natural abstraction basically fails, we are thoroughly and utterly fucked

I'm not really sure what it would mean for the natural abstraction hypothesis to turn out to be true, or false. The hypothesis itself seems insufficiently clear to me.

On your view, if there are no "natural abstractions," then we should predict that AIs will "generalize off-distribution" in ways that are catastrophic for human welfare. Okay, fine. I would prefer to just talk directly about the probability that AIs will generalize in catastrophic ways. I don't see any reason to think they will, and so maybe in your ontology that means I must accept the natural abstraction hypothesis. But in my ontology, there's no "natural abstraction hypothesis" link in the logical chain; I'm just directly applying induction to what we've seen so far about AI behavior.

Some quick thoughts on "AI is easy to control"

Nora Belrose1y20

I responded in this Twitter thread.

Refusal in LLMs is mediated by a single direction

Nora Belrose1y26

I do think that "orthogonalizing the weight matrices with respect to direction " is the clearest way of describing this method.

I do respectfully disagree here. I think the verb "orthogonalize" is just confusing. I also don't think the distinction between optimization and no optimization is very important. What you're actually doing is orthogonally projecting the weight matrices onto the orthogonal complement of the direction.
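To make the "orthogonal projection" reading concrete, here is a minimal sketch in plain Python. It assumes a weight matrix `W` stored as a list of rows whose output space is the one containing the direction (so each column of `W` is a vector written into that space), and a nonzero direction `r`. The function name `project_out` and the toy numbers are hypothetical and are not taken from the post being discussed.

```python
import math

def project_out(W, r):
    """Return (I - r_hat r_hat^T) W for W given as a list of rows (d x n) and r of length d."""
    norm = math.sqrt(sum(x * x for x in r))
    if norm == 0:
        raise ValueError("direction must be nonzero")
    r_hat = [x / norm for x in r]
    d, n = len(W), len(W[0])
    # coeffs[j] is the component of column j of W along r_hat.
    coeffs = [sum(r_hat[i] * W[i][j] for i in range(d)) for j in range(n)]
    # Subtract that component from each column, leaving only the part
    # that lies in the orthogonal complement of r_hat.
    return [[W[i][j] - r_hat[i] * coeffs[j] for j in range(n)] for i in range(d)]

# Toy example (made-up numbers): a 3 x 2 matrix and a direction in R^3.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = [0.0, 3.0, 4.0]
W_proj = project_out(W, r)
```

After this, every column of `W_proj` has zero dot product with `r`, and applying `project_out` a second time changes nothing, which is the defining property of a projection. Note that `W_proj` is generally not an orthogonal matrix, which is why "orthogonalize" can mislead.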

Refusal in LLMs is mediated by a single direction

Nora Belrose1y51

Nice work! Since you cite our LEACE paper, I was wondering if you've tried burning LEACE into the weights of a model just like you burn an orthogonal projection into the weights here? It should work at least as well, if not better, since LEACE will perturb the activations less.

Nitpick: I wish you would use a word other than "orthogonalization" since it sounds like you're saying that you're making the weight matrix an orthogonal matrix. Why not LoRACS (Low Rank Adaptation Concept Erasure)?

Counting arguments provide no evidence for AI doom

Nora Belrose1y10

Unless you think transformative AI won't be trained with some variant of SGD, I don't see why this objection matters.

Also, I think the a priori methodological problems with counting arguments in general are decisive. You always need some kind of mechanistic story for why a "uniform prior" makes sense in a particular context; you can't just assume it.

What's with all the bans recently?

Nora Belrose2y30

I don't know what caused it exactly, and it seems like I'm not rate limited anymore.

What's with all the bans recently?

Nora Belrose2y90

If moderators started rate-limiting Nora Belrose or someone else whose work I thought was particularly good

I actually did get rate-limited today, unfortunately.

Inducing Unprompted Misalignment in LLMs

Nora Belrose2y2114

Unclear why this is supposed to be a scary result.

"If prompting a model to do something bad generalizes to it being bad in other domains, this is also evidence for the idea that prompting a model to do something good will generalize to it doing good in other domains" - Matthew Barnett

Deconstructing Bostrom's Classic Argument for AI Doom

Nora Belrose2y811

Yeah, I think Evan is basically opportunistically changing his position during that exchange, and has no real coherent argument.
