HoldenKarnofsky - LessWrong

We're Not Ready: thoughts on "pausing" and responsible scaling policies

Thanks for the thoughts!

#1: METR made some edits to the post in this direction (in particular see footnote 3).

On #2, Malo’s read is what I intended. I think compromising with people who want "less caution" is most likely to result in progress (given the current state of things), so it seems appropriate to focus on that direction of disagreement when making pragmatic calls like this.

On #3: I endorse the “That’s a V 1” view. While industry-wide standards often take years to revise, I think individual company policies often (maybe usually) update more quickly and frequently.

We're Not Ready: thoughts on "pausing" and responsible scaling policies

HoldenKarnofsky4mo20

Thanks for the thoughts!

I don’t think the communications you’re referring to “take for granted that the best path forward is compromising.” I would simply say that they point out the compromise aspect as a positive consideration, which seems fair to me - “X is a compromise” does seem like a point in favor of X all else equal (implying that it can unite a broader tent), though not a dispositive point.

I address the point about improvements on the status quo in my response to Akash above.

We're Not Ready: thoughts on "pausing" and responsible scaling policies

HoldenKarnofsky4mo142

Thanks for the thoughts! Some brief (and belated) responses:

I disagree with you on #1 and think the thread below your comment addresses this.
Re: #2, I think we have different expectations. We can just see what happens, but I’ll note that the RSP you refer to is quite explicit about the need for further iterations (not just “revisions” but also the need to define further-out, more severe risks).
I’m not sure what you mean by “an evals regime in which the burden of proof is on labs to show that scaling is safe.” How high is the burden you’re hoping for? If they need to definitively rule out risks like “The weights leak, then the science of post-training enhancements moves forward to the point where the leaked weights are catastrophically dangerous” in order to do any further scaling, my sense is that nobody (certainly not me) has any idea how to do this, and so this proposal seems pretty much equivalent to “immediate pause,” which I’ve shared my thoughts on. If you have a lower burden of proof in mind, I think that’s potentially consistent with the work on RSPs that is happening (it depends on exactly what you are hoping for).
I agree with the conceptual point that improvements on the status quo can be net negative for the reasons you say. When I said “Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that’s hard to articulate,” I didn’t mean to say that there’s no way this can make intellectual sense. To take a quick stab at what bugs me: I think to the extent a measure is an improvement but insufficient, the strategy should be to say “This is an improvement but insufficient,” accept the improvement and count on oneself to win the “insufficient” argument. This kind of behavior seems to generalize to a world in which everyone is clear about their preferred order of object-level states of the world, and incentive gradients consistently point in the right direction (especially important if - as I believe - getting some progress generally makes it easier rather than harder to get more). I worry that the behavior of opposing object-level improvements on the grounds that others might find them sufficient seems to generalize to a world with choppier incentive gradients, more confusing discourse, and a lot of difficulty building coalitions generally (it’s a lot harder to get agreement on “X vs. the best otherwise achievable outcome” than on “X vs. the status quo”).
I think nearly all proponents of RSPs do not see them as a substitute for regulation. Early communications could have emphasized this point more (including the METR post, which has been updated). I think communications since then have been clearer about it.
I lean toward agreeing that another name would be better. I don’t feel very strongly, and am not sure it matters at this point anyway with different parties using different names.
I don’t agree that "we can't do anything until we literally have proof that the model is imminently dangerous" is the frame of RSPs, although I do agree that the frame is distinct from a “pause now” frame. I’m excited about conditional pauses as something that can reduce risk a lot while having high tractability and bringing together a big coalition; the developments you mention are great, but I think we’re still a long way from where “advocate for an immediate pause” looks better to me than working within this framework. I also disagree with your implication that the RSP framework has sucked up a lot of talent; while evals have drawn in a lot of people and momentum, hammering out conditional pause related frameworks seems to be something that only a handful of people were working on as of the date of your comment. (Since then the number has gone up due to AI companies forming teams dedicated to this; this seems like a good thing to me.) Overall, it seems to me that most of the people working in this area would otherwise be working on evals and other things short of advocating for immediate pauses.

A Playbook for AI Risk Reduction (focused on misaligned AI)

HoldenKarnofsky7mo20

(Apologies for slow reply!)

I see, I guess where we might disagree is I think that IMO a productive social movement could want to apply the Henry Spira's playbook (overall pretty adversarial) oriented mostly towards slowing things down until labs have a clue of what they're doing on the alignment front. I would guess you wouldn't agree with that, but I'm not sure.

I think an adversarial social movement could have a positive impact. I have tended to think of the impact as mostly being about getting risks taken more seriously and thus creating more political will for “standards and monitoring,” but you’re right that there could also be benefits simply from buying time generically for other stuff.

I'm not saying that it would be a force against regulation in general but that it would be a force against any regulation which slows down substantially the current capabilities progress rate of labs. And empirics don't demonstrate the opposite as far as I can tell.

I said it’s “far from obvious” empirically what’s going on. I agree that discussion of slowing down has focused on the future rather than now, but I don’t think it has been pointing to a specific time horizon (the vibe looks to me more like “slow down at a certain capabilities level”).

Finally, on your conceptual part, as some argued, it's in fact probably not possible to affect all players equally without a drastic regime of control (which is a true downside of slowing down now, but IMO still much less worse than slowing down once a leak or a jailbreak of an advanced system can cause a large-scale engineered pandemic) bc smaller actors will use the time to try to catch up as close as possible from the frontier.

It’s true that no regulation will affect everyone precisely the same way. But there is plenty of precedent for major industry players supporting regulation that generally slows things down (even when the dynamic you’re describing applies).

I agree, but if anything, my sense is that due to various compound effects (due to AI accelerating AI, to investment, to increased compute demand, and to more talent earlier), an earlier product release of N months just gives a lower bound for TAI timelines shortening (hence greater than N). Moreover, I think that the ChatGPT product release is, ex-post at least, not in the typical product release reference class. It was clearly a massive game changer for OpenAI and the entire ecosystem.

I don’t agree that we are looking at a lower bound here, bearing in mind that (I think) we are just talking about when ChatGPT was released (not when the underlying technology was developed), and that (I think) we should be holding fixed the release timing of GPT-4. (What I’ve seen in the NYT seems to imply that they rushed out functionality they’d otherwise have bundled with GPT-4.)

If ChatGPT had been held for longer, then:

Scaling and research would have continued in the meantime. And even with investment and talent flooding in, I expect that there’s very disproportionate impact from players who were already active before ChatGPT came out, who were easily well capitalized enough to go at ~max speed for the few months between ChatGPT and GPT-4.
GPT-4 would have had the same absolute level of impressiveness, revenue potential, etc. (as would some other things that I think have been important factors in bringing in investment, such as Midjourney). You could have a model like “ChatGPT maxed out hype such that the bottleneck on money and talent rushing into the field became calendar time alone,” which would support your picture; but you could also have other models, e.g. where the level of investment is more like a function of the absolute level of visible AI capabilities such that the timing of ChatGPT mattered little, holding fixed the timing of GPT-4. I’d guess the right model is somewhere in between those two; in particular, I'd guess that it matters a lot how high revenue is from various sources, and revenue seems to behave somewhere in between these two things (there are calendar-time bottlenecks, but absolute capabilities matter a lot too; and parallel progress on image generation seems important here as well.)
Attention from policymakers would’ve been more delayed; the more hopeful you are about slowing things via regulation, the more you should think of this as an offsetting factor, especially since regulation may be more of a pure “calendar-time-bottlenecked response to hype” model than research and scaling progress.
(I also am not sure I understand your point for why it could be more than 3 months of speedup. All the factors you name seem like they will nearly-inevitably happen somewhere between here and TAI - e.g., there will be releases and demos that galvanize investment, talent, etc. - so it’s not clear how speeding a bunch of these things up 3 months speeds the whole thing up more than 3 months, assuming that there will be enough time for these things to matter either way.)

But more important than any of these points is that circumstances have (unfortunately, IMO) changed. My take on the “successful, careful AI lab” intervention was quite a bit more negative in mid-2022 (when I worried about exactly the kind of acceleration effects you point to) than when I did my writing on this topic in 2023 (at which point ChatGPT had already been released and the marginal further speedup of this kind of thing seemed a lot lower). Since I wrote this post, it seems like the marginal downsides have continued to fall, although I do remain ambivalent.

3 levels of threat obfuscation

HoldenKarnofsky7moΩ120

Just noting that these seem like valid points! (Apologies for slow reply!)

A Playbook for AI Risk Reduction (focused on misaligned AI)

HoldenKarnofsky9mo20

Thanks for the response!

Re: your other interventions - I meant for these to be part of the "Standards and monitoring" category of interventions (my discussion of that mentions advocacy and external pressure as important factors).

I think it's far from obvious that an AI company needs to be a force against regulation, both conceptually (if it affects all players, it doesn't necessarily hurt the company) and empirically.

Thanks for giving your take on the size of speedup effects. I disagree on a number of fronts. I don't want to get into the details of most of them, but will comment that it seems like a big leap from "X product was released N months earlier than otherwise" to "Transformative AI will now arrive N months earlier than otherwise." (I think this feels fairly clear when looking at other technological breakthroughs and how much they would've been affected by differently timed product releases.)

A Playbook for AI Risk Reduction (focused on misaligned AI)

HoldenKarnofsky10mo20

I'm not convinced it requires a huge compute tax to reliably avoid being caught. (If I were, I would in fact probably be feeling a lot more chill than I am.)

The analogy to humans seems important. Humans are capable of things like going undercover, and pulling off coups, and also things like "working every day with people they'd fire if they could, without clearly revealing this." I think they mostly pull this off with:

Simple heuristics like "Be nice, unless you're in the very unusual situation where hostile action would work well." (I think the analogy to how AIs might behave is straightforward.)
The fact that they don't need to be perfect - lots of behaviors and situations are just ambiguous. (I think this also applies to AI, and maybe more so: there could be lots of weird little incidents that are chalked up to bugs without causing concern sufficient to address the problem reliably.)
Also, humans generally need to do a lot of reasoning along the lines of "X usually works, but I do need to notice the rare situations when something radically different is called for." So if this is expensive, they just need to be doing that expensive thing a lot.

A Playbook for AI Risk Reduction (focused on misaligned AI)

HoldenKarnofsky10mo20

I think training exclusively on objective measures has a couple of other issues:

For sufficiently open-ended training, objective performance metrics could incentivize manipulating and deceiving humans to accomplish the objective. A simple example would be training an AI to make money, which might incentivize illegal/unethical behavior.
For less open-ended training, I basically just think you can only get so much done this way, and people will want to use fuzzier "approval" measures to get help from AIs with fuzzier goals (this seems to be how things are now with LLMs).

I think your point about the footprint is a good one and means we could potentially be very well-placed to track "escaped" AIs if a big effort were put in to do so. But I don't see signs of that effort today and don't feel at all confident that it will happen in time to stop an "escape."

A Playbook for AI Risk Reduction (focused on misaligned AI)

HoldenKarnofsky10mo31

That's interesting, thanks!

In addition to some generalized concern about "unknown unknowns" leading to faster progress on reliability than expected by default (especially in the presence of commercial incentives for reliability), I also want to point out that there may be some level of capabilities where AIs become good at doing things like:

Assessing the reliability of their own thoughts, and putting more effort into things that have the right combination of uncertainty and importance.
Being able to use that effort productively, via things like "trying multiple angles on a question" and "setting up systems for error checking."

I think that in some sense humans are quite unreliable, and use a lot of scaffolding - variable effort at reliability, consulting with each other and trying to catch things each other missed, using systems and procedures, etc. - to achieve high reliability, when we do so. Because of this, I think AIs could be have pretty low baseline reliability (like humans) while finding ways to be effectively highly reliable (like humans). And I think this applies to deception as much as anything else (if a human thinks it's really important to deceive someone, they're going to make a lot of use of things like this).

A Playbook for AI Risk Reduction (focused on misaligned AI)

HoldenKarnofsky10mo30

I agree with these points! But:

Getting the capabilities to be used by other agents to do good things could still be tricky and/or risky, when reinforcement is vulnerable to deception and manipulation.
I still don't think this adds up to a case for being confident that there aren't going to be "escapes" anytime soon.

LESSWRONG
LW

Sequences

Posts

Wiki Contributions

Comments