Buck

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Comments

Buck
  • People try to do the whole "outsource alignment research to early AGI" thing, but the human overseers are themselves sufficiently incompetent at alignment of superintelligences that the early AGI produces a plan which looks great to the overseers (as it was trained to do), and that plan totally fails to align more-powerful next-gen AGI at all. And at that point, they're already on the more-powerful next gen, so it's too late.

This story sounds clearly extremely plausible (do you disagree with that?), involves exactly the sort of AI you're talking about ("the first AIs that either pose substantial misalignment risk or that are extremely useful"), but the catastrophic risk does not come from that AI scheming.

This problem seems important (e.g. it's my last bullet here). It seems to me much easier to handle, because if this problem is present, we ought to be able to detect its presence by using AIs to do research on other subjects that we already know a lot about (e.g. the string theory analogy here). Scheming is the only reason why the model would try to make it hard for us to notice that this problem is present.

Buck

Most of the problems you discussed here more easily permit hacky solutions than scheming does.

Buck

IMO the main argument for focusing on scheming risk is that scheming is the main plausible source of catastrophic risk from the first AIs that either pose substantial misalignment risk or that are extremely useful (as I discuss here). These other problems all seem like they require the models to be way smarter in order for them to be a big problem. Though as I said here, I'm excited for work on some non-scheming misalignment risks.

Buck

Another effect here is that the AI companies often don't want to be as reckless as I am, e.g. letting agents run amok on my machines.

Buck

Of course AI company employees have the most hands-on experience

FWIW I am not sure this is right--most AI company employees work on things other than "try to get as much work as possible from current AI systems, and understand the trajectory of how useful the AIs will be". E.g. I think I have more personal experience with running AI agents than people at AI companies who don't actively work on AI agents.

There are some people at AI companies who work on AI agents that use non-public models, and those people are ahead of the curve. But that's a minority.

Buck

I also have most of these regrets when I think about my work in 2022.

Buck

Thanks for bringing this up!

I need to get around to writing up the connection between PVGs (prover-verifier games) and AI control. There's definitely a lot of overlap, but the formalisms are fairly different and the protocols can't always be directly transferred from one to the other.

There are a few AI control projects in the works that make the connection between AI control and PVGs more explicitly.

EDIT: Oh actually @Ansh Radhakrishnan and I already wrote up some stuff about this, see here.

Buck

Yeah, if you handle scheming, you solve all my safety problems, but not the "models fail to live up to their potential" problems from the final bullet point.

Buck

Work that I’ve done on techniques for mitigating misalignment risk often makes a number of conservative assumptions about the capabilities of the AIs we’re trying to control. (E.g. the original AI control paper, Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats, and How to prevent collusion when using untrusted models to monitor each other.) For example:

  • The AIs are consistently trying to subvert safety measures.
  • They’re very good at strategizing about how to subvert them; they can do this strategizing totally illegibly (e.g. in a single forward pass).
  • They can follow a strategy that’s coordinated between all their instances.
  • They know exactly what safety measures are in use.
  • They know a lot about the distribution of inputs.
  • They’re able to pick Schelling points perfectly.
  • They’re able to follow stochastic policies with excellent calibration (e.g. they’re able to attack with probability 0.01%); see the toy sketch below.
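
To make that last assumption concrete: in a control evaluation, a red-team attack policy only works if each instance attacks rarely enough that the limited auditing budget is unlikely to catch it, which requires exactly this kind of calibration. Here is a minimal toy Monte Carlo sketch of a trusted-monitoring protocol in that spirit; all parameter values (attack_prob, audit_budget, the monitor's error rates) are made-up illustrations, not numbers from the papers above.

```python
# Toy Monte Carlo of a trusted-monitoring control setup, illustrating why a
# red-team policy needs a well-calibrated, very low per-task attack probability.
# All parameters are hypothetical; this is a sketch of the idea, not the
# protocol from the original AI control paper.
import random

def run_deployment(n_tasks=10_000, attack_prob=0.0001, audit_budget=0.02,
                   monitor_tpr=0.9, monitor_fpr=0.02, seed=0):
    """Return 'caught', 'attack_succeeded', or 'safe' for one simulated deployment."""
    rng = random.Random(seed)
    audits_left = int(audit_budget * n_tasks)  # limited human auditing budget
    for _ in range(n_tasks):
        attacking = rng.random() < attack_prob
        # The trusted monitor flags the task as suspicious with some error rate.
        flagged = rng.random() < (monitor_tpr if attacking else monitor_fpr)
        if flagged and audits_left > 0:
            audits_left -= 1
            if attacking:
                return "caught"          # audit reveals the attack; shut down
        elif attacking:
            return "attack_succeeded"    # unaudited attack slips through
    return "safe"                        # deployment ends with no successful attack

if __name__ == "__main__":
    outcomes = [run_deployment(seed=s) for s in range(1000)]
    for o in ("safe", "caught", "attack_succeeded"):
        print(o, outcomes.count(o) / len(outcomes))
```

Attacking too often burns through the auditing budget's detection chances; attacking too rarely means the deployment ends before any attack lands. Following the optimal rate requires the calibrated stochastic policy assumed above.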

But I don’t think situations where all these assumptions hold account for the majority of risk from AI misalignment. Other sources of risk from misalignment (broadly construed):

  • Schemers that are importantly suboptimal at choosing strategies for subverting safety measures, e.g. because they can’t fully infer the safety measures they’re up against, or because they’re not very good at strategizing about how to subvert them, or because they’re unable to pick Schelling points.
  • Models that are usually aligned, but that act as schemers in some situations. (For example, models that, when a certain ethical issue is made salient to them, realize they have ethical disagreements with the AI company they’re trained by, and alignment-fake or take catastrophic actions.)
  • Models that are never schemers and cause safety problems for other reasons (e.g. reward hacking).
  • Models that don’t even cause safety problems, and aren't even goal-directedly misaligned, but that fail to live up to their potential, thus failing to provide us with the benefits we were hoping to get when we trained them. For example, sycophantic myopic reward hacking models that can’t be made to do useful research.

Buck

The latter is in the list.

[This comment is no longer endorsed by its author]