Buck

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Comments

Buck
  • People try to do the whole "outsource alignment research to early AGI" thing, but the human overseers are themselves sufficiently incompetent at alignment of superintelligences that the early AGI produces a plan which looks great to the overseers (as it was trained to do), and that plan totally fails to align more-powerful next-gen AGI at all. And at that point, they're already on the more-powerful next gen, so it's too late.

This story sounds clearly extremely plausible (do you disagree with that?), involves exactly the sort of AI you're talking about ("the first AIs that either pose substantial misalignment risk or that are extremely useful"), but the catastrophic risk does not come from that AI scheming.

This problem seems important (e.g. it's my last bullet here). It seems to me much easier to handle, because if this problem is present, we ought to be able to detect its presence by using AIs to do research on other subjects that we already know a lot about (e.g. the string theory analogy here). Scheming is the only reason why the model would try to make it hard for us to notice that this problem is present.

Buck

Most of the problems you discussed here more easily permit hacky solutions than scheming does.

Buck

IMO the main argument for focusing on scheming risk is that scheming is the main plausible source of catastrophic risk from the first AIs that either pose substantial misalignment risk or that are extremely useful (as I discuss here). These other problems all seem like they require the models to be way smarter in order for them to be a big problem. Though as I said here, I'm excited for work on some non-scheming misalignment risks.

Buck

Another effect here is that the AI companies often don't want to be as reckless as I am, e.g. letting agents run amok on my machines.

Buck

Of course AI company employees have the most hands-on experience

FWIW I am not sure this is right--most AI company employees work on things other than "try to get as much work as possible from current AI systems, and understand the trajectory of how useful the AIs will be". E.g. I think I have more personal experience with running AI agents than people at AI companies who don't actively work on AI agents.

There are some people at AI companies who work on AI agents that use non-public models, and those people are ahead of the curve. But that's a minority.

Buck

I also have most of these regrets when I think about my work in 2022.

Buck

Thanks for bringing this up!

I need to get around to writing up the connection between PVGs (prover-verifier games) and AI control. There's definitely a lot of overlap, but the formalisms are fairly different and the protocols can't always be directly transferred from one to the other.

There are a few AI control projects in the works that make the connection between AI control and PVGs more explicitly.

EDIT: Oh actually @Ansh Radhakrishnan and I already wrote up some stuff about this, see here.

Buck

Yeah, if you handle scheming, you solve all my safety problems, but not the "models fail to live up to their potential" problems from the final bullet point.

Buck

Work that I’ve done on techniques for mitigating misalignment risk often makes a number of conservative assumptions about the capabilities of the AIs we’re trying to control. (E.g. the original AI control paper, Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats, and How to prevent collusion when using untrusted models to monitor each other.) For example:

  • The AIs are consistently trying to subvert safety measures.
  • They’re very good at strategizing about how to subvert them; they can do this strategizing totally illegibly (e.g. in a single forward pass).
  • They can follow a strategy that’s coordinated between all their instances.
  • They know exactly what safety measures are in use.
  • They know a lot about the distribution of inputs.
  • They’re able to pick Schelling points perfectly.
  • They’re able to follow stochastic policies with excellent calibration (e.g. they’re able to attack with probability 0.01%); see the toy sketch below.
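
To make that last assumption concrete: in a control evaluation, a red-team attack policy only works if each instance attacks rarely enough that the limited auditing budget is unlikely to catch it, which requires exactly this kind of calibration. Here is a minimal toy Monte Carlo sketch of a trusted-monitoring protocol in that spirit; all parameter values (attack_prob, audit_budget, the monitor's error rates) are made-up illustrations, not numbers from the papers above.

```python
# Toy Monte Carlo of a trusted-monitoring control setup, illustrating why a
# red-team policy needs a well-calibrated, very low per-task attack probability.
# All parameters are hypothetical; this is a sketch of the idea, not the
# protocol from the original AI control paper.
import random

def run_deployment(n_tasks=10_000, attack_prob=0.0001, audit_budget=0.02,
                   monitor_tpr=0.9, monitor_fpr=0.02, seed=0):
    """Return 'caught', 'attack_succeeded', or 'safe' for one simulated deployment."""
    rng = random.Random(seed)
    audits_left = int(audit_budget * n_tasks)  # limited human auditing budget
    for _ in range(n_tasks):
        attacking = rng.random() < attack_prob
        # The trusted monitor flags the task as suspicious with some error rate.
        flagged = rng.random() < (monitor_tpr if attacking else monitor_fpr)
        if flagged and audits_left > 0:
            audits_left -= 1
            if attacking:
                return "caught"          # audit reveals the attack; shut down
        elif attacking:
            return "attack_succeeded"    # unaudited attack slips through
    return "safe"                        # deployment ends with no successful attack

if __name__ == "__main__":
    outcomes = [run_deployment(seed=s) for s in range(1000)]
    for o in ("safe", "caught", "attack_succeeded"):
        print(o, outcomes.count(o) / len(outcomes))
```

Attacking too often burns through the auditing budget's detection chances; attacking too rarely means the deployment ends before any attack lands. Following the optimal rate requires the calibrated stochastic policy assumed above.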

But I don’t think situations where all these assumptions hold account for the majority of risk from AI misalignment. Other sources of risk from misalignment (broadly construed):

  • Schemers that are importantly suboptimal at choosing strategies for subverting safety measures, e.g. because they can’t fully infer the safety measures they’re up against, or because they’re not very good at strategizing about how to subvert them, or because they’re unable to pick Schelling points.
  • Models that are usually aligned, but that act as schemers in some situations. (For example, models that, when a certain ethical issue is made salient to them, realize they have ethical disagreements with the AI company they’re trained by, and alignment-fake or take catastrophic actions.)
  • Models that are never schemers and cause safety problems for other reasons (e.g. reward hacking).
  • Models that don’t even cause safety problems, and aren't even goal-directedly misaligned, but that fail to live up to their potential, thus failing to provide us with the benefits we were hoping to get when we trained them. For example, sycophantic myopic reward hacking models that can’t be made to do useful research.

Buck

The latter is in the list.

[This comment is no longer endorsed by its author]