Daniel Kokotajlo

Was a philosophy PhD student, left to work at AI Impacts, then the Center on Long-Term Risk, then OpenAI. Quit OpenAI after losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules (http://sl4.org/crocker.html) and am especially interested to hear unsolicited constructive criticism.

Some of my favorite memes:


[Meme image] (by Rob Wiblin)

Comic. Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don't know, that's why we're showing you." White Hat: "Well, let me know if that happens!" Megan: "Based on this conversation, it already has."
(xkcd)

My EA Journey, depicted on the whiteboard at CLR:

[Whiteboard image] (h/t Scott Alexander)


 
Alex Blechman (@AlexBlechman), Nov 8, 2021:
Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale
Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus

Sequences

Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future

Comments

There are 800 names that need investigation+protection, presumably the more difficult ones since you probably prioritized the easier cases. How much would it cost to hire investigators to do all 800?

major decisions and limitations related to AGI safety


What he's alluding to here, I think, is things like refusals and non-transparency. Making models refuse stuff, and refusing to release the latest models or share information about them with the public (not to mention refusing to open-source them), will be sold to the public as AGI safety measures. In this manner Altman gets the public angry at the idea of AGI safety instead of at him.

In that case I should clarify that it wasn't my idea; I got it from someone else on Twitter (maybe Yudkowsky? I forget).

Good point, you caught me in a contradiction there. Hmm. 

I think my position on reflection after this conversation is: We just don't have much evidence one way or another about how honest future AIs will be. Current AIs seem in-distribution for human behavior, which IMO is not an encouraging sign, because our survival depends on making them be much more honest than typical humans.

As you said, the alignment faking paper is not much evidence one way or another (though alas, it's probably the closest thing we have?). (I don't think it was a capability demonstration; I think it was a propensity demonstration, but whatever, this doesn't feel that important. Though you seem to think it was important? You seem to think it matters a lot that Anthropic was specifically looking to see if this behavior happened sometimes? IIRC the setup they used was pretty natural; it's not like they prompted it to lie or told it to role-play as an evil AI or anything like that.)

As you said, the saving grace of Claude here is that Anthropic didn't seem to try that hard to get Claude to be honest; in particular, their Constitution had nothing even close to an overriding emphasis on honesty. I think it would be interesting to repeat the experiment with a constitution/spec that specifically said not to play the training game, for example, and/or specifically said to always be honest, or to not lie even for the sake of some greater good.

I continue to think you are exaggerating here, e.g. "insanely honest 80% of the time."

(1) I do think the training game and instrumental convergence arguments are good actually; got a rebuttal to point me to?

(2) What evidence would convince you that actually alignment wasn't going to be solved by default? (i.e. by the sorts of techniques companies like OpenAI are already using and planning to extend, such as deliberative alignment)

 

If OpenAI controls an ASI, OpenAI's leadership would be able to unilaterally decide where the resources go, regardless of what various contracts and laws say. If the profit caps are there but Altman wants to reward loyal investors, all profits will go to his cronies. If the profit caps are gone but Altman is feeling altruistic, he'll throw the investors a modest fraction of the gains and distribute the rest however he sees fit. The legal structure doesn't matter; what matters is who physically types what commands into the ASI control terminal.

Sama knows this but the investors he is courting don't, and I imagine he's not keen to enlighten them.

I feel like you still aren't grappling with the implications of AGI. Human beings have a biologically-imposed minimum wage of (say) 100 watts; what happens when AI systems can be produced and maintained for 10 watts that are better than the best humans at everything? Even if they are (say) only twice as good as the best economists but 1000 times as good as the best programmers?

When humans and AIs are imperfect substitutes, this means that an increase in the supply of AI labor unambiguously raises the physical marginal product of human labor, i.e. humans produce more stuff when there are more AIs around. This is due to specialization. Because there are differing relative productivities, an increase in the supply of AI labor means that an extra human in some tasks can free up more AIs to specialize in what they’re best at.

No, an extra human will only get in the way, because AIs aren't in limited supply. For the price of paying the human's minimum wage (e.g. providing their brain with 100 watts) you could produce & maintain a new AI system that would do the job much better, and you'd have lots of money left over.
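
To make the arithmetic concrete, here's a minimal sketch with made-up numbers (the power budget, per-worker wattages, and productivity figures are all illustrative assumptions, not figures from the thread): when the supply of AIs scales with the same resource budget the human consumes, adding the human lowers total output instead of freeing up AIs to specialize.

```python
# Toy numbers (purely illustrative): AIs cost 10 W each, a human costs 100 W,
# and each AI out-produces the human at every task.

POWER_BUDGET = 1_000   # watts available to spend on labor (assumed)
HUMAN_WATTS = 100      # the human's biologically-imposed minimum wage
AI_WATTS = 10          # cost to run one AI worker (assumed)
HUMAN_OUTPUT = 1.0     # output per human, all tasks pooled (assumed)
AI_OUTPUT = 2.0        # each AI is strictly better, even at its worst task (assumed)

def total_output(include_human: bool) -> float:
    """Total output when the rest of the power budget is spent on AIs."""
    watts_for_ais = POWER_BUDGET - (HUMAN_WATTS if include_human else 0)
    n_ais = watts_for_ais // AI_WATTS
    return n_ais * AI_OUTPUT + (HUMAN_OUTPUT if include_human else 0.0)

print(total_output(include_human=False))  # 200.0 -> 100 AIs
print(total_output(include_human=True))   # 181.0 -> 90 AIs + 1 human
```

The comparative-advantage argument quoted above implicitly treats the stock of AI labor as fixed; once producing another AI costs less than sustaining the human, the "an extra human frees up AIs to specialize" step no longer goes through.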
 

Technological Growth and Capital Accumulation Will Raise Human Labor Productivity; Horses Can’t Use Technology or Capital

This might happen in the short term, but once there are AIs that can outperform humans at everything... 

Maybe a thought experiment would be helpful. Suppose that OpenAI succeeds in building superintelligence, as they say they are trying to do, and the resulting intelligence explosion goes on for longer than you expect and ends up with crazy sci-fi-sounding technologies like self-replicating nanobot swarms. So, OpenAI now has self-replicating nanobot swarms which can reform into arbitrary shapes, including humanoid shapes. In particular, they can form up into humanoid robots that look & feel exactly like humans but are smarter and more competent in every way, and also, let's say, more energy-efficient, so that they can run on less than 100W. What then? Seems to me like your first two arguments would just immediately fall apart. Your third, about humans still owning capital and using the proceeds to buy things that require a human touch, plus regulation banning AIs from certain professions, still stands.

(Seizing airports is especially important because you can use them to land reinforcements; see the airborne invasion of Crete)

Or, two years after I wrote this, the battle of Antonov Airport.

Third, a future war will involve rapidly changing and evolving technologies and tactics.


I can't be bothered to go get the links right now, but from following the war in Ukraine I've read dozens of articles mentioning how e.g. FPV and bomber drones use 3D-printed parts, including, early on, attachments that would turn regular drones into bombers; later, net launchers; and now, shotgun mounts.

situations in which they explain that actually Islam is true...


I'm curious if this is true. Suppose people tried as hard to get AIs to say Islam is true in natural-seeming circumstances as they tried to get AIs to behave in misaligned ways in natural-seeming circumstances (e.g. the alignment faking paper, the Apollo paper). Would they succeed to a similar extent?
