Nina Panickssery — LessWrong

Views purely my own unless clearly stated otherwise

Perhaps. I see where you are coming from. Though I think it’s possible contrastive-prompt-based vectors (e.g. CAA) also approximate “natural” features better than training on those same prompts (fewer degrees of freedom with the correct inductive bias). I should check whether there has been new research on this…

Not sure what distinction you're making. I'm talking about steering for controlling behavior in production, not for red-teaming at eval time or to test interp hypotheses via causal interventions. However, this still covers both safety (e.g. "be truthful") and "capabilities" (e.g. "write in X style") interventions.

Whenever I read yet another paper or discussion of activation steering to modify model behavior, my instinctive reaction is to slightly cringe at the naiveté of the idea. Training a model to do some task only to then manually tweak some of the activations or weights using a heuristic-guided process seems quite un-bitter-lesson-pilled. Why not just directly train for the final behavior you want—find better data, tweak the reward function, etc.?

But actually there may be a good reason to continue working on model-internals control (i.e. ways of influencing model behavior outside of modifying the text input or training process, by directly changing internal state). For some applications, you may want to express something in terms of the model’s own abstractions, something that you won’t know a priori how to do in text or via training data in fine-tuning. Throughout the training process, a model naturally learns a rich semantic activation space. And in some cases, the “cleanest” way to modify its behavior is by expressing the change in terms of its learned concepts, whose representations are sculpted by exaflops of compute.
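For concreteness, here is a toy sketch of the kind of intervention meant by "directly changing internal state": a mean-difference steering vector in the spirit of contrastive-prompt methods such as CAA. Everything in it is an assumption for illustration. The activations are made-up numbers rather than real model outputs, the function names are hypothetical, and a real setup would read and write activations at a chosen layer of an actual model.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Mean-difference direction: mean(positive) - mean(negative).

    positive_acts / negative_acts stand in for activations collected on
    prompts that do / do not exhibit the target behavior (hypothetical data).
    """
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden_state, direction, scale=1.0):
    """Add the scaled direction to a single hidden-state vector."""
    return [h + scale * d for h, d in zip(hidden_state, direction)]


# Toy 3-dimensional "activations"; the numbers are invented for illustration.
positive_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative_acts = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = steering_vector(positive_acts, negative_acts)
steered = steer([0.5, 0.5, 0.5], direction, scale=2.0)
```

The point of the sketch is only that the change is specified in the model's own activation space (a direction and a scale) rather than through text input or training data.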

Maybe some future version of humanity will want to do some handover, but we are very far from the limits of human potential

I think this is conceding too much. Many successionists will jump on this and say "Well, that's what I'm talking about! I'm not saying AI should take over now, but just that it likely will one day and so we should prepare for that."

Furthermore, people who don't want to be succeeded by AI are often not saying this just because they think human potential can be advanced further; that we can become much smarter and wiser. I'd guess that even if we proved somehow that human IQ could never exceed n and n was reached, most would not desire that their lineage of biological descendants gradually dwindle away to zero while AI prospers.

You can say "maybe some future version of humanity will want to X" for any X because it's hard to prove anything about humanity in the far future. But such reasoning should not play into our current decision-making process unless we think it's particularly likely that future humanity will want X.

This. LLMs make it much easier to write one-off scripts and custom tools for a bunch of stuff like plotting and visualizations. Because the cost is lower you produce much more of this stuff. But this doesn't mean that AI is contributing 90% of the value. The ideal metric would be something like "how much longer would it take to ship Claude n+1 if we didn't use Claude n for coding"?

I see. In that case the phrasing makes more sense but then I think it’s more false.

I think context and inability to handle certain specific tasks (especially around computer use details) are holding things back right now more than intelligence.

What does this mean? Insofar as context is like working memory, that seems like a crucial component of "intelligence" in humans. It seems hard to define "intelligence" as distinct from working memory and the ability to handle a wide variety of tasks.

Criticism quality-valence bias

Something I've noticed from posting more of my thoughts online:

People who disagree with your conclusion to begin with are more likely to carefully read and point out errors in your reasoning/argumentation, or instances where you've made incorrect factual claims. By contrast, people who agree with your conclusion before reading are more likely to consciously or subconsciously gloss over any flaws in your writing because they are on board with the "broad strokes".

So your best criticism ends up coming with a negative valence, i.e. from people who disagree with your conclusion to begin with.

(LessWrong has much less of this bias than other places, though I still see some of it.)

The mini website with animations is super cool! The LessWrong team's design chops never fail to impress :)

I'm asking if you're up for at least being willing to entertain the structure of "maybe, Ray will be right that there is a large-but-finite set of claims, and it's possible to get enough certainty on each claim to at least put pretty significant bounds on how unaligned AI may play out"

Certainly, I could be wrong! I don’t mean to:

- Dismiss the possibility of misaligned-AI-related X-risk
- Dismiss the possibility that your particular lines of argument make sense and I’m missing some things

And I think caution with AI development is warranted for a number of reasons beyond pure misalignment risk.

But it’s a little worrying when a community widely shares a strong belief in doom while implying that the required arguments are esoteric and require lots of subtle claims, each of which might have counterarguments, but which overall will eventually convince you. 1a3orn has a good essay about this: https://1a3orn.com/sub/essays-ai-doom-invincible.html.

I think having intuitions around general intelligences being dangerous is perfectly reasonable; I have them too. As a very risk-averse and pro-humanity person, I’d almost be tempted to press a button to peacefully prevent AI advancement purely on the basis of a tiny potential risk (for I think everyone dying is very, very, very bad; I am not disagreeing with that point at all). But no such button exists, and attempts to stop AI development have their own side-effects that could add up to more risks on net. And though that’s unfortunate, it doesn’t mean that we should spread a message of “we are definitely doomed unless we stop”. A large number of people believing they are doomed is not a free way to increase the chances of an AI slowdown or pause. It has a lot of negative side-effects. Many smart and caring people I know have put their lives on pause and made serious (in my opinion, bad) decisions on the basis that superintelligence will probably kill us, or, if not, there’ll be a guaranteed utopia. To be clear, I am not saying that we should believe or spread false things about AI risk being lower than it actually is so that people’s personal lives temporarily improve. Rather, I am saying that exaggerating claims of doom, or making arguments sound more certain than they are, for consequentialist purposes is not free.

LESSWRONG
LW