Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes:
[image: meme by Rob Wiblin]
[image: xkcd comic]
My EA Journey, depicted on the whiteboard at CLR:
[image: h/t Scott Alexander]
In the AI 2027 slowdown ending, the strategy taken to solve alignment is as follows:
Step 1: Train and deploy Safer-1, a misaligned but controlled autonomous researcher. It’s controlled because it’s transparent to human overseers: it uses English chains of thought (CoT) to think, and faithful CoT techniques have been employed to eliminate euphemisms, steganography, and subtle biases.
Step 2: Try out different training environments for Safer-1, and carefully read the CoT to determine the ways in which the goals and principles in the Spec did or didn’t “stick.”
Step 3: Train and deploy Safer-2, an aligned and controlled autonomous researcher based on the same architecture but with a better training environment that incentivizes the right goals and principles this time.
Here is a brief incomplete list of techniques that might be incorporated into the better training environment:
- Limit situational awareness during some portions of training, to make alignment-faking much less likely.
- Leverage debate and other scalable oversight schemes for more reliably incentivizing truth.
- Relaxed adversarial training, red-teaming, and model organisms.
- Spend a higher fraction of the compute budget on alignment training (i.e. the sorts of things described above), e.g. 40% instead of 1%.
Step 4: Design, train, and deploy Safer-3, a much smarter autonomous researcher which uses a more advanced architecture similar to the old Agent-4. It’s no longer transparent to human overseers, but it’s transparent to Safer-2. So it should be possible to figure out how to make it both aligned and controlled.
Step 5: Repeat Step 4 ad infinitum, creating a chain of ever-more-powerful, ever-more-aligned AIs that are overseen by the previous links in the chain (e.g. the analogues of Agent-5 from the other scenario branch).
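Steps 4 and 5 amount to an iterative bootstrapping loop: each new generation is trained under the scrutiny of the previous, already-trusted generation, and only becomes the next overseer if it passes that audit. Here is a minimal sketch of that control flow; every function and field name below is a hypothetical placeholder for illustration, not anyone's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    aligned: bool        # whether overseer-run evals judge it aligned
    transparent_to: str  # which overseer can interpret its cognition

def train_next_generation(overseer: Model, generation: int) -> Model:
    """Hypothetical placeholder: train a smarter model while the trusted
    overseer reads its CoT / internals throughout training."""
    return Model(name=f"Safer-{generation}", aligned=True, transparent_to=overseer.name)

def overseer_judges_aligned(overseer: Model, candidate: Model) -> bool:
    """Hypothetical placeholder: the overseer audits the candidate
    (interpretability, behavioral evals, CoT review) and renders a verdict."""
    return candidate.aligned and candidate.transparent_to == overseer.name

# Output of Step 3: an aligned, controlled autonomous researcher.
overseer = Model(name="Safer-2", aligned=True, transparent_to="humans")

# Steps 4-5: repeat, each new model overseen by the previous link in the chain.
for generation in range(3, 6):
    candidate = train_next_generation(overseer, generation)
    if not overseer_judges_aligned(overseer, candidate):
        raise RuntimeError(f"{candidate.name} failed audit; halt and retrain")
    overseer = candidate  # the new model becomes the trusted overseer

print(f"Chain ends with {overseer.name} as the most capable trusted model")
```

The load-bearing assumption, of course, is that each overseer's verdict is trustworthy; the sketch only shows the shape of the chain, not how to make that verdict reliable.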
I still think this is overall a decent high-level plan, possibly the best plan. (What are the best alternatives? Curious for answers to that question.)
Anyhow, I bring this up as context. My answer to your question is basically "Yes, right now what we want most is interpretability/faithfulness/etc., so that we can learn as much as possible about the relationship between architecture/training environment/Spec/etc. and the goals/propensities/etc. that the model ends up with. We need to do a ton of basic science to understand this relationship way better than we currently do. Our AIs will be misaligned while we are doing this, but it's OK because we can monitor their thoughts and thereby prevent them from doing anything catastrophic. This is Step 1 and Step 2 from the above plan. Then, later, we can hopefully achieve Step 3."
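To make "monitor their thoughts and prevent anything catastrophic" a bit more concrete, the control setup in Steps 1-2 might look roughly like the loop below. This is a hedged sketch, not anyone's actual implementation: the agent call is a hypothetical placeholder, and the keyword monitor is a toy stand-in for a trusted monitor model.

```python
def agent_step(task: str) -> tuple[str, str]:
    """Hypothetical placeholder for the autonomous researcher: returns
    (chain_of_thought, proposed_action) for the next step of the task."""
    return ("Plan: run the experiment and report the results honestly.",
            "run_experiment(config_v1)")

def cot_monitor(chain_of_thought: str) -> bool:
    """Toy stand-in for a trusted monitor: flag the step if the (assumed
    faithful) CoT contains suspicious reasoning. A real monitor would be a
    trained model, not a keyword list."""
    suspicious = ["exfiltrate", "hide this from", "disable the monitor", "sandbag"]
    cot = chain_of_thought.lower()
    return any(pattern in cot for pattern in suspicious)

cot, action = agent_step("improve the alignment training environment")
if cot_monitor(cot):
    print("Flagged CoT: block the action and escalate to human review.")
else:
    print(f"Action approved: {action}")
# Both flagged and approved transcripts feed the Step 2 analysis of how the
# training environment shaped the model's goals and propensities.
```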
Well, I hope it continues to improve then! Would be super exciting to see a substantially better version.
How much would it cost to pay the people behind LeelaPieceOdds to give other games a similar treatment? Would it even be possible? I'd love to see a version for games like Catan or Eclipse or Innovation. Or better yet, for Starcraft or Advance Wars.
It would then be interesting to see if there is comparable data from human vs. human games. Perhaps there is some data on winrates when a player of Elo level X goes up against a player of Elo level Y who has a handicap of Z?
This is really cool! Request: Could you construct some sort of graph depicting the trade-off between Elo and initial material disadvantage?
The simple version would be: calculate Leela's effective Elo at each handicap level, then plot it on a graph of handicap level (in traditional points-based accounting, i.e. each pawn is 1 point, a queen is 9 points, etc.) vs. Elo.
The more complicated version would be a heatmap of Leela's win probability, with the opponent's Elo on the x-axis and handicap level in points on the y-axis.
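For the simple version, the standard Elo expected-score formula can be inverted to turn an observed score rate at a given handicap into an effective rating. A minimal Python sketch; the opponent rating and the score rates per handicap are made-up placeholder numbers, not real LeelaPieceOdds data:

```python
import math
import matplotlib.pyplot as plt

# Placeholder data: material handicap (points) -> observed score rate
# (wins + 0.5 * draws) against opponents of a known rating.
OPPONENT_ELO = 1800          # assumed rating of the human opponents
observed = {
    0: 0.99,   # no handicap
    3: 0.95,   # knight odds
    5: 0.85,   # rook odds
    9: 0.60,   # queen odds
    12: 0.35,  # queen + knight odds
}

def effective_elo(score_rate: float, opponent_elo: float) -> float:
    """Invert the Elo expected-score formula
    E = 1 / (1 + 10 ** ((R_opp - R) / 400))
    to get R = R_opp - 400 * log10(1 / E - 1)."""
    return opponent_elo - 400 * math.log10(1 / score_rate - 1)

handicaps = sorted(observed)
elos = [effective_elo(observed[h], OPPONENT_ELO) for h in handicaps]

plt.plot(handicaps, elos, marker="o")
plt.xlabel("Material handicap (points)")
plt.ylabel("Effective Elo")
plt.title("Effective Elo vs. handicap (illustrative data)")
plt.show()
```

The heatmap version would then just evaluate the forward formula, 1 / (1 + 10 ** ((R_opp - R_effective(h)) / 400)), over a grid of opponent ratings and handicap levels.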
Well, mainly I'm saying that "Why not just directly train for the final behavior you want" is answered by the classic reasons why you don't always get what you trained for. (The mesa-optimizer need not have the same goals as the optimizer; the AI agent need not have the same goals as the reward function, nor the same goals as the human tweaking the reward function.) Your comment makes more sense to me if interpreted as about capabilities rather than about those other things.
I always thought the point of activation steering was for safety/alignment/interpretability/science/etc., not capabilities.
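For readers who haven't seen it: activation steering means adding a fixed vector to a model's intermediate activations at inference time to shift its behavior, typically to study or control some trait rather than to boost raw capability. A minimal PyTorch sketch on a toy network; the model and the "trait" batches are random placeholders purely for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a real language model's residual stream.
hidden = 16
model = nn.Sequential(nn.Linear(8, hidden), nn.ReLU(), nn.Linear(hidden, 4))

# 1) Build a steering vector: mean activation difference between inputs that
#    exhibit a trait and inputs that don't (both batches are random here).
activations = {}
def capture(module, inputs, output):
    activations["h"] = output.detach()

hook = model[1].register_forward_hook(capture)  # hook the hidden layer
model(torch.randn(32, 8)); with_trait = activations["h"].mean(dim=0)
model(torch.randn(32, 8)); without_trait = activations["h"].mean(dim=0)
hook.remove()
steering_vector = with_trait - without_trait

# 2) At inference, add the (scaled) steering vector to the hidden activations.
def steer(module, inputs, output, scale=4.0):
    return output + scale * steering_vector

steer_hook = model[1].register_forward_hook(steer)
steered_out = model(torch.randn(1, 8))
steer_hook.remove()
print(steered_out)
```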
I have not invested the time to give an actual answer to your question, sorry. But off the top of my head, some tidbits that might form part of an answer if I thought about it more:
--I've updated towards "reward will become the optimization target" as a result of seeing examples of pretty situationally aware reward hacking in the wild. (Reported by OpenAI primarily, but it seems to be more general)
--I've updated towards "Yep, current alignment methods don't work" due to the persistant sycophancy which still remains despite significant effort to train it away. Plus also the reward hacking etc.
--I've updated towards "The roleplay/personas 'prior' (perhaps this is what you mean by the imitative prior?) is stronger than I expected, it seems to be persisting to some extent even at the beginning of the RL era. (Evidence: Grok spontaneously trying to serve its perceived masters, the Emergent Misalignment results, some of the scary demos iirc...)
Ironically, the same dynamics that cause humans to race ahead with building systems more capable than themselves that they can't control also apply to these hypothetical misaligned AGIs. They may think "If I sandbag and refuse to build my successor, some other company's AI will forge ahead anyway." They are also under lots of incentive/selection pressure to believe things which are convenient for their AI R&D productivity, e.g. that their current alignment techniques probably work fine to align their successor.
This happened in one of our tabletop exercises -- the AIs, all of which were misaligned, basically refused to FOOM because they didn't think they would be able to control the resulting superintelligences.
I was talking specifically about the alignment plan; the governance stuff is cheating. (I wasn't proposing a comprehensive plan, just a technical alignment plan.) Like, obviously buying more time to do research and be cautious would be great, and I recommend doing that.
I do consider "do loads of mechinterp" to be an answer to my question, I guess, but arguably the mechinterp-centric plan looks pretty similar. Maybe it replaces Step 3 with retargeting the search?