Former safety researcher & TPM at OpenAI, 2020-24
https://www.linkedin.com/in/sjgadler
stevenadler.substack.com
Weight can be such an extreme determinative factor in combat sports that an untrained 250-pound couch potato could walk into any boxing gym and absolutely demolish a 100-pound opponent with decades of training.
I think this is kind of beside the point, but is this really true?
I buy that it conceptually could be the case for some small number of people, but I would have expected most 100-pound opponents with decades of training to beat untrained 250-lb couch potatoes (all it seems to take is one or two good punches against someone who doesn't know how to defend themself). Maybe I'm mistaken?
(PS - I laughed at the "Classic ostrich-and-egg problem" line!)
(Appreciate the correction re my nit, edited mine as well)
Thanks for taking the time to write up your reflections. I agree that the before/after distinction seems especially important (‘only one shot to get it right’), and a crux that I expect many non-readers not to know about the EY/NS worldview.
I’m wondering about your take in this passage:
In the book they make an analogy to a ladder where every time you climb it you get more rewards but once you reach the top rung then the ladder explodes and kills everyone. However, our experience so far with AI does not suggest that this is a correct world view.
I’m curious what about the world’s experience with AI seems to falsify it, or cast doubt upon it, from your POV? Is it about believing that systems have become safer and more controlled over time?
(Nit, but the book doesn’t posit that the explosion happens at the top rung; in that case, we could just avoid ever reaching the top rung. It posits that the explosion happens at a not-yet-known rung, and so each successive rung climb carries some risk of blow-up. I don’t expect this distinction is load-bearing for you though)
(Edit: my nit is wrong as written! Thanks Boaz - he’s right that the book’s argument is actually about the top of the ladder, I was mistaken - though with the distinction I was trying to point at, of not knowing where the top is, so from a climber’s perspective there’s no way of just avoiding that particular rung)
This was really interesting, thanks for putting yourself in that situation and for writing it up
I was curious what the examples of therapy speak in the conversation were, if you’re down to elaborate
FWIW, my experience was that the utility of user data was always much higher in promise than in actual outcomes. This might have changed over time though.
An ask that works is, e.g., “tell the government they need to stop everyone, including us”.)
For sure, I think that would be a reasonable ask too. FWIW, if multiple leading AI companies did make a statement like the one outlined, I think that would increase the chance of non-complying ones being made to halt by the government, even though they hadn’t made a statement themselves. That is, even one prominent AI company making this statement starts to widen the Overton window.
Yeah fair, I think we just read that passage differently - I agree it’s a very important one though and quoted it in my own (favorable) review
But I read, e.g., the “because it would succeed” as a claim that they are arguing for, not something definitionally inseparable from superintelligence
Regardless, thanks for engaging on this, and hope it’s helped to clarify some of the objections EY/NS are hearing
FWIW that definition of “it” wasn’t clear to me from the book. I took IABIED as arguing that superintelligence is capable of killing everyone if it wants to, not taking “superintelligence can kill everyone if it wants to” as an assumption of its argument
That is, I’d have expected “superintelligence would not be capable enough to kill us all” to be a refutation of their argument, not to be sidestepping its conditional
Nit, but I think some safety-ish evals do run periodically in the training loop at some AI companies, and sometimes fuller sets of evals get run on checkpoints that are far along but not yet the version that’ll be shipped. I agree this isn’t sufficient of course
(I think it would be cool if someone wrote up a “how to evaluate your model in a reasonable way during its training loop” piece, which accounted for the different types of safety evals people do. I also wish that task-specific fine-tuning were more of a thing for evals, because it seems like one way of perhaps reducing sandbagging)
Thanks for writing this up! I really liked this related podcast episode with Patrick McKenzie: https://open.spotify.com/episode/1QqFw5hlHKRrjRUTVLfKRV?si=ptVmFvXQRKaPwRNTg1Ollg
I think the biggest update for me was how the rewards programs are inseparable in some sense from the airlines. I think your language of ordinary flights being a loss leader helps to describe it too; the airlines couldn’t just have the valuable rewards program, because the underlying less-profitable flights are what make it possible!