Daniel Kokotajlo

Philosophy PhD student, worked at AI Impacts, now works at Center on Long-Term Risk. Research interests include acausal trade, timelines, takeoff speeds & scenarios, decision theory, history, and a bunch of other stuff. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html


AI Timelines
Takeoff and Takeover in the Past and Future


How's it going with the Universal Cultural Takeover? Part I

TMM IIRC has an a priori argument that we should expect horizontal transmission to beat vertical transmission often: It happens faster. Replication rate of memes is orders of magnitude faster than replication rate of genes. It is surprisingly analogous to viruses/parasites in that regard. Genes do evolve over time to be more resistant to viruses/parasites; this is why Native Americans were disproportionately wiped out by disease compared to Europeans, who were disproportionately wiped out compared to Africans. However, despite this, viruses/parasites remain prevalent in the world, even (especially) in Africa. The solution to the puzzle is that the diseases are evolving too.

How's it going with the Universal Cultural Takeover? Part I
Culture survives by reproducing itself. Ultimately it can only do this by aiding its bearers in reproducing themselves.

Not so. Part of the point of The Meme Machine was that this isn't true; memes (which are a kind of culture?) can be harmful to their bearers in every way and yet still successful.

Consider viruses and parasites. They typically harm their bearer's reproductive chances, and harm their bearers in other ways as well. Just not enough that they die before passing on the virus/parasite.

Forecasting Transformative AI, Part 1: What Kind of AI?

This is not very important, but: What was your thought process behind the acronym PASTA? It sounds kinda silly, and while I don't mind that myself I feel like that makes it harder to pitch to other people new to the topic. You could have said something like "R&D Automation."

[AN #164]: How well can language models write code?
I agree that humans would do poorly in the experiment you outline. I think this shows that, like the language model, humans-with-one-second do not "understand" the code.

Haha, good point -- yes. I guess what I should say is: Since humans would have performed just as poorly on this experiment, it doesn't count as evidence that e.g. "current methods are fundamentally limited" or "artificial neural nets can't truly understand concepts in the ways humans can" or "what goes on inside ANNs is fundamentally a different kind of cognition from what goes on inside biological neural nets" or whatnot.

[AN #164]: How well can language models write code?

Thanks again for these newsletters and summaries! I'm excited about the flagship paper.

First comment: I don't think their experiment about code execution is much evidence re "true understanding."

Recall that GPT-3 has 96 layers and the biggest model used in this paper was smaller than GPT-3. Each pass through the network is therefore loosely equivalent to less than one second of subjective time, by comparison to the human brain which typically goes through something like 100 serial operations per second I think? Could be a lot more, I'm not sure. https://aiimpacts.org/rate-of-neuron-firing/#Maximum_neural_firing_rates

So, the relevant comparison should be: Give a human the same test. Show them some code and give them 1 second to respond with an answer (or the first token of an answer, and then 1 second for the second token, and so forth). See how well they do at predicting the code output. I predict that they'd also do poorly, probably <50% accuracy. I claim that this passage from the paper inadvertently supports my hypothesis:

Including test cases and natural language descriptions in the prompt lead to the highest overall performance—higher than using the code itself. Because the code unambiguously describes the semantics, whereas test cases do not, this suggests that models are in some sense not really “reading” the source code and using it to execute. Models trained on general text corpora may be better at inducing patterns from as few as two input-output examples than they are at predicting the execution of code.

Second comment: Speculation about scaling trends:

Extrapolating from Figure 3, it seems that an AI which can solve (via at least one sample) approximately 100% of the coding tasks in this set, without even needing fine-tuning, will require +2 OOMs of parameters, which would probably cost about $5B to train when you factor in the extra data required but also the lower prices and algorithmic improvements since GPT-3. Being almost 2 OOMs bigger than GPT-3, it might be expected to cost $6 per 1000 tokens, which would make it pretty expensive to use (especially if you wanted to use it at full strength, where it makes multiple samples and then picks the best one), though I think it might still find an economic niche. You could have a system where a smaller model first attempts a solution and you only call up the big model if that fails; then you keep generating samples until you get one that works. On average the number of samples you need to generate will be small, and it will only cost you multiple dollars for the toughest few percent of cases. Then this service could be used by well-paid programmers for whom the time savings are worth it.

Does this extrapolation/speculation seem right?
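To make the cascade idea above concrete, here is a rough sketch of the expected cost per task. Only the $6 per 1000 tokens figure comes from the comment; the function name, the small model's price and success rate, the big model's per-sample success rate, and the tokens per sample are all made-up illustrative assumptions. It also assumes samples succeed independently with a fixed probability (so the expected number of big-model samples is 1/p) and that there is some way, such as test cases, to tell when a sample works.

```python
def expected_cascade_cost(
    p_small: float,
    p_big_per_sample: float,
    tokens_per_sample: int,
    small_price_per_1k: float,
    big_price_per_1k: float = 6.0,  # the $6 per 1000 tokens guess from the comment
) -> float:
    """Expected dollar cost per task for a small-model-first cascade.

    Hypothetical helper: every parameter except big_price_per_1k is an
    illustrative assumption, not a number from the paper or the comment.
    """
    if not 0.0 <= p_small <= 1.0:
        raise ValueError("p_small must be in [0, 1]")
    if not 0.0 < p_big_per_sample <= 1.0:
        raise ValueError("p_big_per_sample must be in (0, 1]")

    # The small model is always tried once.
    small_cost = tokens_per_sample / 1000 * small_price_per_1k
    # Each big-model sample costs the same amount.
    big_cost_per_sample = tokens_per_sample / 1000 * big_price_per_1k
    # Sampling until the first success is geometric: mean 1/p samples.
    expected_big_samples = 1.0 / p_big_per_sample
    # The big model is only called when the small model fails.
    return small_cost + (1.0 - p_small) * expected_big_samples * big_cost_per_sample


# Illustrative numbers only: small model solves 80% of tasks, the big model
# succeeds on half of its samples, and a sample is 500 tokens long.
cost = expected_cascade_cost(
    p_small=0.8,
    p_big_per_sample=0.5,
    tokens_per_sample=500,
    small_price_per_1k=0.06,
)
print(f"expected cost per task: ${cost:.2f}")
```

In reality the tasks that reach the big model are the hard ones, so a single fixed per-sample success rate probably understates the cost of the toughest cases.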

Review of A Map that Reflects the Territory

IDA stands for iterated distillation and amplification. The idea is to start with a system M which is aligned (such as a literal human), amplify it into a smarter system Amp(M) (such as by letting it think longer or spin off copies of itself), then distill the amplified system into a new system M+ which is smarter than M but dumber than Amp(M), and then repeat indefinitely to scale up the capabilities of the system while preserving alignment.

The important thing is to ensure that the amplification and distillation steps both preserve alignment. That way, we start with an aligned system and continue having aligned systems every step of the way even as they get arbitrarily more powerful. How does the amplification step preserve alignment? Well, it depends on the details of the proposal, but intuitively this shouldn't be too hard--letting an aligned agent think longer shouldn't make it cease being aligned. How does the distillation step preserve alignment? Well, it depends on the details of the proposal, but intuitively this should be possible -- the distilled agent M+ is dumber than Amp(M) and Amp(M) is aligned, so hopefully Amp(M) can "oversee" the training/creation of M+ in a way that results in M+ being aligned also. Intuitively, M+ shouldn't be able to fool or deceive Amp(M) because it's not as smart as Amp(M).
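The loop described above can be sketched as a toy program. This is only a schematic, not anyone's actual proposal: `System`, `amplify`, `distill`, and the `boost` and `retention` numbers are hypothetical stand-ins, capability is reduced to a single number, and alignment preservation is simply assumed at each step (which is of course the hard part in reality).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    capability: float  # toy scalar stand-in for "how smart"
    aligned: bool


def amplify(m: System, boost: float = 2.0) -> System:
    # Amp(M): e.g. let M think longer or spin off copies of itself.
    # ASSUMPTION: amplification preserves alignment.
    return System(m.capability * boost, m.aligned)


def distill(amp: System, retention: float = 0.75) -> System:
    # M+: a cheaper system trained under Amp(M)'s oversight.
    # ASSUMPTION: oversight works, so alignment carries over.
    return System(amp.capability * retention, amp.aligned)


def ida(m: System, rounds: int) -> list[System]:
    """Run amplify-then-distill for a number of rounds, recording each M."""
    history = [m]
    for _ in range(rounds):
        amp = amplify(m)
        m_plus = distill(amp)
        # M+ is smarter than M but dumber than Amp(M).
        assert m.capability < m_plus.capability < amp.capability
        m = m_plus
        history.append(m)
    return history


for step, system in enumerate(ida(System(capability=1.0, aligned=True), rounds=3)):
    print(step, system)
```

With these toy numbers each round multiplies capability by 1.5, so the system keeps getting more capable while the `aligned` flag is carried along unchanged.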

Review of A Map that Reflects the Territory

Great review! One comment:

But the collection bogs down with a series of essays (about 1/3 of the book) on Paul Christano's research on Iterated Amplification. This technique seems to be that the AI has to tell you at regular intervals Why it is doing what it is doing, to avoid the AI gaming the system. That sounds interesting! But then the essays seem to debate whether its a good idea or not. I kept shouting at the book JUST TRY IT OUT AND SEE IF IT WORKS! Did Paul Christano DO this? If so what were the results? We never find out.

First of all, that's not how I would describe the core idea of IDA. Secondly, and much more importantly, we can't try it out yet because our AIs aren't smart enough. For example, we could ask GPT-3 to tell us why it is doing what it is doing... but as far as we can tell it doesn't even know! And if it did know, maybe it could be trained to tell us... but maybe that only works because it's too dumb to know how to convincingly lie to us, and if it were smarter, the training methods would stop working.

Our situation is analogous to someone in medieval Europe who has a dragon egg and is trying to figure out how to train and domesticate the dragon so it doesn't hurt anyone. You can't "just try out" ideas because your egg hasn't hatched yet. The best you can do is (a) practice on non-dragons (things like chickens, lizards, horses...) and hope that the lessons generalize, (b) theorize about domestication in general, so that you have a firm foundation on which to stand when "crunch time" happens and you are actually dealing with a live dragon and trying to figure out how to train it, (c) theorize about dragon domestication in particular by imagining what dragons might be like, e.g. "We probably won't be able to put it in a cage like we do with chickens and lizards and horses because it will be able to melt steel with its fiery breath..."

Currently people in the AI risk community are pursuing the analogues of a, b, and c. Did I leave out any option d? I probably did, I'd be interested to hear it!

Forecasting Thread: AI Timelines

It's been a year, what do my timelines look like now?

My median has shifted to the left a bit; it's now 2030. However, I have somewhat less probability in the 2020-2025 range I think, because I've become more aware of the difficulties in scaling up compute. You can't just spend more money. You have to do lots of software engineering, and for 4+ OOMs you literally need to build more chip fabs to produce more chips. (Also because 2020 has passed without TAI/AGI/etc., so obviously I won't put as much mass there...)

So if I were to draw a distribution it would look pretty similar, just a bit more extreme of a spike and the tip of the spike might be a bit to the right.

Thoughts on gradient hacking

Even in the simple case no. 1, I don't quite see why Evan isn't right yet.

It's true that deterministically failing will create a sort of wall in the landscape that the ball will bounce off of and then roll right back into as you said. However, wouldn't it also perhaps roll in other directions, such as perpendicular to the wall? Instead of getting stuck bouncing into the wall forever, the ball would bounce against the wall while also rolling in some other direction along it. (Maybe the analogy to balls and walls is leading me astray here?)

What 2026 looks like (Daniel's Median Future)

I disagree about compressibility; Elon said "AI is summoning the demon" and that's a five-word phrase that seems to have been somewhat memorable and memetically fit. I think if we had a good longer piece of content that expressed the idea that lots of people could read/watch/play then that would probably be enough.
