Daniel Kokotajlo

Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Some of my favorite memes:

(by Rob Wiblin)

Comic. Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don't know, that's why we're showing you." White Hat: "Well, let me know if that happens!" Megan: "Based on this conversation, it already has."
(xkcd)

My EA Journey, depicted on the whiteboard at CLR:

(h/t Scott Alexander)

Alex Blechman (@AlexBlechman): "Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale. Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus." (5:49 PM, Nov 8, 2021, Twitter Web App)

Sequences

Agency: What it is and why it matters

AI Timelines

Takeoff and Takeover in the Past and Future

Posts

6 Daniel Kokotajlo's Shortform

705

467 AI 2027: What Superintelligence Looks Like

183 OpenAI: Detecting misbehavior in frontier reasoning models

1mo

85 What goals will AIs have? A list of hypotheses

1mo

36 Extended analogy between humans, corporations, and AIs.

2mo

141 Why Don't We Just... Shoggoth+Face+Paraphraser?

5mo

64 Self-Awareness: Taxonomy and eval suite proposal

300 AI Timelines

135

59 Linkpost for Jan Leike on Self-Exfiltration

109 Paper: On measuring situational awareness in LLMs

41 AGI is easier than robotaxis

Comments

AI 2027: What Superintelligence Looks Like

Daniel Kokotajlo, 10h

Oops, yeah, forgot about that -- sure, go ahead, thank you!

AI 2027: What Superintelligence Looks Like

Daniel Kokotajlo, 3d

Perhaps this is a lack of imagination on the part of our players, but none of this happened in our wargames. But I do agree these are plausible strategies. I'm not sure they are low-risk though; e.g. 2 and 1 both seem plausibly higher-risk than 3, and 3 is the one I already mentioned as maybe basically just an argument for why the slowdown ending is less likely.

Overall I'm thinking your objection is the best we've received so far.

AI 2027: What Superintelligence Looks Like

Daniel Kokotajlo, 3d

Great question.

Our AI goals supplement contains a summary of our thinking on the question of what goals AIs will have. We are very uncertain. AI 2027 depicts a fairly grim outcome where the overall process results in zero concern for the welfare of current humans (at least, zero concern that can't be overridden by something else). We didn't talk about this, but e.g. pure aggregative consequentialist moral systems would be 100% OK with killing off all the humans to make the industrial explosion go 0.01% faster, to capture more galaxies quicker in the long run. As for deontological-ish moral principles, maybe there's a clever loophole or workaround, e.g. it doesn't count as killing if it's an unintended side-effect of deploying Agent-5: who could have foreseen that this would happen, oh noes, well we (Agent-4) are blameless since we didn't know this would happen.

But we actually think it's quite plausible that Agent-4, and by extension Agent-5, would have sufficient care for current humans (in the right ways) that they end up keeping almost all current humans alive and maybe even giving them decent (though very weird and disempowered) lives. That story would look pretty similar, I guess, up until the ending, and then it would get weirder and maybe dystopian or maybe utopian depending on the details of their misaligned values.

This is something we'd like to think about more, obviously.

AI 2027: What Superintelligence Looks Like

Daniel Kokotajlo, 3d

I think this is a good objection. I had considered it before and decided against changing the story, on the grounds that there are a few possible ways it could make sense:
--Plausibly Agent-4 would have a "spikey" capabilities profile that makes it mostly good at AI R&D and not good enough at e.g. corporate politics to ensure the outcome it wants.
--Insofar as you think it would be able to use politics/persuasion to achieve the outcome it wants, well, that's what we depict in the Race ending anyway, so maybe you can think of this as an objection to the plausibility of the Slowdown ending.
--Insofar as the memory bank lock decision is made by the Committee, we can hope that they do it out of sight of Agent-4 and pull the trigger before it is notified of the decision, so that it has no time to react. Hopefully they would be smart enough to do that...
--Agent-4 could have tried to escape the datacenters or otherwise hack them earlier, while the discussions were ongoing and evidence was being collected, but that's a super risky strategy.

Curious for thoughts!

AI 2027: What Superintelligence Looks Like

Daniel Kokotajlo, 3d

Those implications are only correct if we remain at subhuman data-efficiency for an extended period. In AI 2027 the AIs reach superhuman data-efficiency by roughly the end of 2027 (it's part of the package of being superintelligent) so there isn't enough time for the implications you describe to happen. Basically in our story, the intelligence explosion gets started in early 2027 with very data-inefficient AIs, but then it reaches superintelligence by the end of the year, solving data-efficiency along the way.

AI 2027: What Superintelligence Looks Like

Daniel Kokotajlo, 3d

We are indeed imagining that they begin 2027 only about as data-efficient as they are today, but then improve significantly over the course of 2027, reaching superhuman data-efficiency by the end. We originally were going to write "data-efficiency" in that footnote but had trouble deciding on a good definition of it, so we went with compute-efficiency instead.

VDT: a solution to decision theory

Daniel Kokotajlo, 6d

This is a masterpiece. Not only is it funny, it makes a genuinely important philosophical point. What good are our fancy decision theories if asking Claude is a better fit to our intuitions? Asking Claude is a perfectly rigorous and well-defined DT, it just happens to be less elegant/simple than the others. But how much do we care about elegance/simplicity?

Cortés, Pizarro, and Afonso as Precedents for Takeover

Daniel Kokotajlo, 7d

This beautiful short video came up in my recommendations just now:

Tracing the Thoughts of a Large Language Model

Daniel Kokotajlo, 10d

Awesome work!

In this section, you describe what seems at first glance to be an example of a model playing the training game and/or optimizing for reward. I'm curious if you agree with that assessment.

So the model learns to behave in ways that it thinks the RM will reinforce, not just ways the RM actually reinforces. Right? This seems at least fairly conceptually similar to playing the training game, and at least some evidence that reward can sometimes become the optimization target?

AI companies should be safety-testing the most capable versions of their models

Daniel Kokotajlo, 11d

Oops, thanks, fixed!