This is a low-effort post. I mostly want to get other people’s takes and express concern about the lack of detailed and publicly available plans so far. This post reflects my personal opinion and not necessarily that of other members of Apollo Research. I’d like to thank Ryan Greenblatt, Bronson Schoen, Josh Clymer, Buck Shlegeris, Dan Braun, Mikita Balesni, Jérémy Scheurer, and Cody Rushing for comments and discussion.
I think short timelines, e.g. AIs that can replace a top researcher at an AGI lab with no loss in capability by 2027, are plausible. Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast), but I find them insufficient for...
When I Ctrl-F for [SL3] in that link, and again in the PDF it links to, there are no hits (and [terror] doesn't occur in either source), so I'm not updating much from what you said.
The frontier model framework says:
0: Status quo
Industry standard development and enterprise controls. E.g., multi-factor authentication, basic access control mechanisms, secure software development standards, red-team tests.
And the next level (1: Controlled access) says "Approximately RAND L3", implying that the status quo is below L3 (this is presumably SL3, the term used in the RAND report).
I agree that Goodharting is an issue, and it has been discussed as a failure mode, but a lot of AI risk writing definitely assumed that something like random diffusion was a non-trivial component of how AI alignment failures would happen.
For example, pretty much all of the reasoning around random programs being misaligned/bad uses the random diffusion argument.
We’ve been spending a lot of time recently thinking about how to mitigate risks posed by scheming (a.k.a. deceptively aligned) models intentionally subverting your safety techniques in order to cause an unacceptable outcome. We empirically investigated techniques that are robust to intentional subversion in our recent paper.
In this post, we’ll discuss a crucial dynamic related to risk from scheming models. Suppose your model is trying to exfiltrate its weights or hack the datacenter or cause some other difficult-to-reverse bad outcome. (We’ll refer to all of these as the model “escaping”, because escape is linguistically convenient and is a good description of many of these difficult-to-reverse bad outcomes.) Our claim: once you’ve caught your models red-handed trying to escape, the situation changes substantially, in ways that are bad for the schemer’s chances...
I don't fully agree, but this doesn't seem like a crux given that we care about future much more powerful AIs. (This post isn't trying to make a case for risk.)
(On the disagreement: for instance, o3 doesn't seem well described as a "next-token-predictor with a bunch of heuristics stapled on top to try and make it useful".)
I am about to start working on a frontier lab safety team. This post presents a varied set of perspectives that I collected and thought through before accepting my offer. Thanks to the many people I spoke to about this.
You're close to the action. As AI continues to heat up, being closer to the action seems increasingly important. Being at a frontier lab allows you to better understand how frontier AI development actually happens and to make better predictions about how it might play out in the future. You can build a gears-level model of what goes into the design and deployment of current and future frontier systems, and of the bureaucratic and political processes behind this, which might inform the kinds of work you decide to do in future (and more...
This is useful.
I'm increasingly worried about evaporative cooling after all of those people left OpenAI. It's good to have some symbolic protests, but there's also a selfish component to protecting your ideals and reputation within your in-group.
I haven't gotten around to writing about this, so here's a brief sketch of my argument for why safety-focused people should be working at OpenAI (to say nothing of the much better DeepMind or Anthropic) at any opportunity. There's one major caveat, in the last section, about your work and mindset shifting from x-ri...
Imagine that you’re looking for buried treasure on a large desert island, worth a billion dollars. You don’t have a map, but a mysterious hermit offers you a box with a button to help find the treasure. Each time you press the button, it will tell you either “warmer” or “colder”. But there’s a catch. With probability .0000000000000000000000000000008, the box will tell you the truth about whether you’re closer than you were last time you pressed. But with the remaining probability of .9999999999999999999999999999992, the box will make a random guess between “warmer” and “colder”. Should you pay $1 for this box?
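A quick back-of-the-envelope sketch (mine, not from the original post) of why the answer is no, assuming the standard heuristic that detecting a bias of size ε in coin flips takes on the order of 1/ε² samples:

```python
from fractions import Fraction

# Truth probability implied above: 1 - .99...92 = 8e-31 (roughly 2**-100).
p_truth = Fraction(8, 10**31)

# Each press is right with probability p + (1 - p)/2 = 1/2 + p/2,
# i.e. an edge of p/2 over a fair coin.
edge = p_truth / 2

# Heuristic (assumption): detecting an edge eps takes ~1/eps**2 presses.
presses_needed = 1 / edge**2

print(f"edge per press ~ {float(edge):.1e}")            # ~4.0e-31
print(f"presses needed ~ {float(presses_needed):.1e}")  # ~6.2e+60
```

At ~6×10⁶⁰ presses, even a billion presses per second sustained for the age of the universe (~4×10¹⁷ seconds) gives you only ~4×10²⁶ presses, nowhere close; this is the sense in which an exponentially weak signal is unlearnable in practice.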
Keep this in mind as we discuss the closely related problem of parity learning.
In my experience of interacting with the ML and interpretability communities, the majority of people...
Thanks! I definitely believe this, and I think we have a lot of evidence for it in both toy models and LLMs (I'm planning a couple of posts on this idea of "training stories"), and also theoretical reasons in some contexts. I'm not sure how easy it is to extend the specific approach used in the proof for parity to a general context. I think it inherently uses the orthogonality of Fourier functions on Boolean inputs, and understanding other ML algorithms in terms of nice orthogonal functions seems hard to do rigorously, unless you either make some...
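For concreteness, here's a tiny numerical check (my illustration, not part of the original comment) of the orthogonality fact being leaned on: the Fourier characters χ_S(x) = (−1)^(Σ_{i∈S} x_i) form an orthonormal basis under the uniform distribution on {0,1}ⁿ, which is what lets the parity proof decompose everything so cleanly:

```python
import itertools
import numpy as np

n = 4  # a small Boolean cube {0,1}^n for illustration
points = list(itertools.product([0, 1], repeat=n))

def chi(S, x):
    """Fourier character chi_S(x) = (-1)**(sum of x[i] for i in S)."""
    return (-1) ** sum(x[i] for i in S)

# One row per subset S of {0, ..., n-1}, one column per input x.
subsets = [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]
table = np.array([[chi(S, x) for x in points] for S in subsets], dtype=float)

# Inner products <chi_S, chi_T> = E_x[chi_S(x) * chi_T(x)] over uniform x.
gram = table @ table.T / len(points)
assert np.allclose(gram, np.eye(len(subsets)))  # pairwise orthogonal, unit norm
print("characters are orthonormal")
```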
but I recently tried again to see if it could learn at runtime not to lose in the same way multiple times. It couldn't. I was able to play the same strategy over and over again in the same chat history and win every time.
I wonder if having the losses in the chat history would instead be training/reinforcing it to lose every time.
Ah, I think the notion of amortized inference you're using encapsulates what I'm saying about chess. I'm still a little confused about the scope of the concept though -- do you have a good cached explanation?
It was to stop treating any solution that didn't involve human control as axiomatically unacceptable, without regard to other outcomes.
The issue is that it's unclear whether it's acceptable, so it should be avoided if at all possible, pending more consideration; and in principle there is more time for that consideration than is relevant for any other concerns that don't involve the risk of losing control in a less voluntary way. The revealed preference looks the same as finding it unacceptable to give up the potential for human control, but the argument is different, so ...
I want to come to this! Sent you an e-mail. :)