Daniel Kokotajlo

Philosophy PhD student, worked at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Not sure what I'll do next yet. Views are my own & do not represent those of my current or former employer(s). I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Some of my favorite memes:

[meme, by Rob Wiblin]

My EA Journey, depicted on the whiteboard at CLR:

[whiteboard photo, h/t Scott Alexander]

Sequences

Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future

Comments

I think this is also what I was confused about -- TurnTrout says that AIXI is not a shard-theoretic agent because it just has one utility function, but typically we imagine that the utility function itself decomposes into parts, e.g. +10 utility for ice cream, +5 for cookies, etc. So the difference must not be about the decomposition into parts, but about the possibility of independent activation? But what does that mean? Perhaps it means: the shards aren't always applied; rather, only in some circumstances does the circuitry fire at all, and there are circumstances in which shard A fires without B and vice versa. (Whereas the utility function always adds up cookies and ice cream, even if there are no cookies or ice cream around?) I still feel like I don't understand this.
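A toy sketch of the distinction I'm gesturing at, in case it helps (everything here is a made-up illustration of mine, not anything from TurnTrout's post):

```python
# Toy illustration: a decomposed utility function vs. contextually-activated
# shards. All objects, contexts, and numbers are made up.

def utility(world: list[str]) -> float:
    # A utility-maximizer evaluates every term on every decision, even when
    # the relevant objects are absent (the term just contributes 0).
    return 10 * world.count("ice_cream") + 5 * world.count("cookie")

def shard_value(world: list[str], context: set[str]) -> float:
    # Shards are circuits that only fire in the kinds of situations that
    # historically reinforced them; outside those situations they contribute
    # nothing, so shard A can fire without shard B and vice versa.
    total = 0
    if "dessert_menu_visible" in context:   # ice-cream shard activates
        total += 10 * world.count("ice_cream")
    if "smell_of_baking" in context:        # cookie shard activates
        total += 5 * world.count("cookie")
    return total

# Both agents see two cookies, but only the shard agent's valuation depends
# on which shards the current context activates.
world = ["cookie", "cookie"]
print(utility(world))                                # 10
print(shard_value(world, {"smell_of_baking"}))       # 10
print(shard_value(world, {"dessert_menu_visible"}))  # 0
```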

Did Shane leave a way for people who didn't attend the talk, such as myself, to contact him? Do you think he would welcome such a thing?

Thanks for sharing!

So, it seems like he is proposing something like AutoGPT, except with a carefully thought-through prompt laying out what goals the system should pursue and what constraints it should obey. Probably he thinks it shouldn't just use a base model but should include instruction-tuning at least, and maybe fancier training methods as well.
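If I'm reading the proposal right, a minimal sketch might look something like the following (the constitution text, function names, and `call_model` placeholder are all hypothetical, not anything from the talk):

```python
# Hypothetical sketch of an AutoGPT-style agent loop wrapped around an
# instruction-tuned model, with goals and constraints spelled out in a
# fixed top-level prompt. call_model is a stand-in, not a real API.

CONSTITUTION = """\
Goals: assist the operators with the tasks they assign you.
Constraints: do not deceive the operators; ask for approval before any
irreversible action; stop immediately if instructed to stop.
"""

def call_model(prompt: str) -> str:
    """Placeholder for a call to an instruction-tuned language model."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 10) -> str:
    history: list[str] = []
    for _ in range(max_steps):
        prompt = (
            CONSTITUTION
            + f"\nTask: {task}"
            + f"\nSteps so far: {history}"
            + "\nNext action (or 'FINISH: <result>'):"
        )
        action = call_model(prompt)
        if action.startswith("FINISH"):
            return action
        # A real system would also execute the action in some environment
        # and append the resulting observation to the history.
        history.append(action)
    return "FINISH: step limit reached"
```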

This is what labs are already doing? This is the standard story for what kind of system will be built, right? Probably 80%+ of the alignment literature is responding at least in part to this proposal. Some problems:

  • How do we get a model that is genuinely robustly trying to obey the instruction text, instead of e.g. choosing actions on the basis of a bunch of shards of desire/drives that were historically reinforced, and which are not well-modelled as "trying to obey the instruction text"? Or e.g. trying to play the training game well? Or e.g. nonrobustly trying to obey the instruction text, such that in some future situation it might cease obeying and do something else? (See e.g. Ajeya Cotra's training game report.)
  • Suppose we do get a model that is genuinely robustly trying to obey the instruction text. How do we ensure that its concepts are sufficiently similar to ours that it interprets the instructions in the ways we would have wanted them to be interpreted? (I think this part is fairly easy, but probably not trivial, and it at least deserves mention conceptually.)
  • Suppose we solve both of the above problems: what exactly should the instruction text be? Normal computers 'obey' their 'instructions' (the code) perfectly, yet unintended catastrophic side-effects (bugs) are typical. Another analogy is to legal systems -- laws generally end up with loopholes, bugs, unintended consequences, etc. (As above, this part seems to me to be probably fairly easy but nontrivial.)

I think most people pushing for a pause are trying to push against a 'selective pause' and for an actual pause that would apply to the big labs at the forefront of progress. I agree with you, however, that the current Overton window seems unfortunately centered around some combination of evals-and-mitigations that is, IMO, at high risk of regulatory capture (i.e. resulting in a selective pause that doesn't apply to the big corporations that most need to pause!). My disillusionment about this is part of why I left OpenAI.

Good point; I guess I was thinking in that case about people who care a bunch about a smaller group of humans, e.g. their family and friends.

I agree that 0.7% is the number to beat for people who mostly focus on helping present humans and who don't take acausal or simulation argument stuff or cryonics seriously. I think that even if I were much more optimistic about AI alignment, I'd still think that number would fairly plausibly be beaten by a 1-year pause that begins right around the time of AGI.
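To spell out the arithmetic as I understand it (my own framing; I'm assuming the 0.7% is roughly the fraction of currently-living people who die each year, and crudely treating "AGI goes well" as saving everyone and "goes badly" as saving no one):

```latex
% Back-of-the-envelope: a 1-year pause costs roughly one extra year of
% ordinary deaths (~0.7% of the N people alive today), and buys an
% increase \Delta p in the probability that AGI goes well.
% On a present-people-only view, the pause is worth it roughly when
\[
\Delta p \cdot N \;>\; 0.007 \cdot N
\quad\Longleftrightarrow\quad
\Delta p \;>\; 0.007 .
\]
% So "0.7% is the number to beat": the pause just has to raise the odds
% of a good outcome by more than about 0.7 percentage points.
```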

What are the mechanisms people have given and why are you skeptical of them?

Perhaps. I don't know much about the yields and so forth at the time, nor about the specific plans, if any, that were made for nuclear combat.

But I'd speculate that dozens of kiloton-range fission bombs would have enabled the US and its allies to win a war against the USSR. Perhaps by destroying dozens of cities, perhaps by preventing concentrations of defensive force sufficient to stop an armored thrust.

OK, thanks for clarifying.

Personally I think a 1-year pause right around the time of AGI would give us something like 50% of the benefits of a 10-year pause. That's just an initial guess, not super stable. And quantitatively I think it would improve the overall chances of AGI going well by double-digit percentage points at least, such that it makes sense to do a 1-year pause even for the sake of an elderly relative avoiding death from cancer, not to mention all the younger people alive today.

Big +1 to that. Part of why I support (some kinds of) AI regulation is that I think they'll reduce the risk of totalitarianism, not increase it.

So, it sounds like you'd be in favor of a 1-year pause or slowdown then, but not a 10-year?

(Also, I object to your side-swipe at longtermism. According to Wikipedia, longtermism is "the ethical view that positively influencing the long-term future is a key moral priority of our time." "A key moral priority" doesn't mean "the only thing that has substantial moral value." If you had instead dunked on classic utilitarianism, I would have agreed.)
