Daniel Kokotajlo

Philosophy PhD student, worked at AI Impacts, then Center on Long-Term Risk, now OpenAI Futures/Governance team. Views are my own & do not represent those of my employer. Research interests include acausal trade, timelines, takeoff speeds & scenarios, decision theory, history, and a bunch of other stuff. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Sequences

Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future

Comments

Language models seem to be much better than humans at next-token prediction

OK, cool. Well, I don't buy that argument. There are other ways to do math besides being really really ridiculously good at internet text prediction. Humans are better at math than GPT-3, but that's probably because they do it in a different way, not merely as a side-effect of being good at text prediction.

Language models seem to be much better than humans at next-token prediction

"However, it's possible though not by any means proven, that deep within the human brain there are subnetworks that are as good or better than AIs at predicting text,"

Actually, this is basically just a known fact from neuroscience and general AI knowledge at this point - about as proven as such things can be. Much of the brain (and nearly all sensorimotor cortex) learns through unsupervised sensory prediction. I haven't even looked yet, but I'm also near-certain there are neuroscience papers that probe this for linguistic token-prediction ability: I'm reasonably familiar with the research on how the vision system works, and it is all essentially transformer-style unsupervised prediction of pixel streams, and it can't possibly be different for the linguistic centers - there are no hard-coded linguistic centers, there is just generic cortex.

So from this we already know a way to estimate human-equivalent perplexity: measure human ability on a battery of actually important linguistic tasks (writing, reading, math, etc.), then train a predictor that maps downstream benchmark performance to perplexity, and read off the perplexity of an LM with similar benchmark performance. The difficulty, if anything, is that even the best LMs (last I checked) haven't learned all of the emergent downstream tasks yet, so you'd have to bias the benchmark toward the tasks the LMs can currently handle.
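For concreteness, here is a minimal sketch of the calibration proposed above: fit the observed relation between LM benchmark performance and perplexity across several models, then read off the perplexity corresponding to human-level benchmark scores. Every model score and perplexity below is an invented placeholder, not real data.

```python
# Sketch of the proposed calibration: regress log-perplexity on downstream
# benchmark accuracy across several LMs, then predict the perplexity that
# would correspond to human-level benchmark performance.
# All numbers below are made up purely for illustration.
import numpy as np

# (benchmark accuracy, perplexity) pairs for hypothetical LMs of increasing size
lm_accuracy   = np.array([0.45, 0.55, 0.62, 0.68])
lm_perplexity = np.array([35.0, 24.0, 18.0, 14.0])

# Fit log-perplexity as a linear function of benchmark accuracy
slope, intercept = np.polyfit(lm_accuracy, np.log(lm_perplexity), deg=1)

# Plug in a hypothetical human benchmark score to get a "human-equivalent" perplexity
human_accuracy = 0.80
human_equiv_perplexity = np.exp(slope * human_accuracy + intercept)
print(f"Estimated human-equivalent perplexity: {human_equiv_perplexity:.1f}")
```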

 

Whoa, hold up. It's one thing to say that the literature proves that the human brain is doing text prediction. It's another thing entirely to say that it's doing it better than GPT-3. What's the argument for that claim, exactly? I don't follow the reasoning you give above. It sounds like you are saying something like this:

"Both the brain and language models work the same way: Primarily they just predict stuff, but then as a result of that they develop downstream abilities like writing, answering questions, doing math, etc. So since the humans are better than GPT-3 at math etc., they must also be better than GPT-3 at predicting text. QED."
 

Language models seem to be much better than humans at next-token prediction

An even cheaper version would be to have the humans spend a few days practicing, measure how much performance improvement came from that, and then extrapolate. Plot performance against a log scale of practice time and see how much improvement each doubling of practice time gets you...
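A rough sketch of that extrapolation (all the practice times and accuracies below are invented placeholders):

```python
# Measure human next-token prediction accuracy after increasing amounts of
# practice, fit accuracy against log2(practice time), and extrapolate.
# The measurements here are hypothetical, for illustration only.
import numpy as np

practice_hours = np.array([1, 2, 4, 8, 16])
accuracy       = np.array([0.28, 0.31, 0.33, 0.36, 0.38])  # hypothetical measured accuracy

# Gain per doubling of practice time
slope, intercept = np.polyfit(np.log2(practice_hours), accuracy, deg=1)
print(f"~{slope:.3f} accuracy gained per doubling of practice time")

# Naive linear-in-log extrapolation to, say, 1000 hours of practice
print(f"Extrapolated accuracy at 1000h: {slope * np.log2(1000) + intercept:.2f}")
```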

Language models seem to be much better than humans at next-token prediction

Wait a minute, hold on. I disagree that this is a problem at all... isn't it a fully general counterargument to any human-vs.-AI comparison?

Like, for example, consider AlphaGo vs. humans. It's now well-established that AlphaGo is superhuman at Go. But it had the same "huge advantage" over humans that GPT does in the text prediction game! If AlphaGo had been hooked up to a robot body that had to image-recognize the board and then move pieces around, it too would have struggled, to put it mildly.

Here's another way of putting it:

There's this game, the "predict the next token" game. AIs are already vastly superhuman at this game, in exactly the same way that they are vastly superhuman at chess, go, etc. However, it's possible, though not by any means proven, that deep within the human brain there are subnetworks that are as good or better than AIs at predicting text, and those subnetworks just aren't hooked up in the right way to the actual human behavior / motor outputs to get humans to be good at this game yet.

Language models seem to be much better than humans at next-token prediction

I think this problem would be solved with additional training of the humans. If a human spent years (months? days?) practicing, they would learn to "use the force" and make use of their abilities. (Analogous to how we learn to ride bikes intuitively, or type, or read words effortlessly instead of carefully looking at the letters and sounding them out.) Literally, the neurons in your brain would rewire to connect the text-prediction parts to the playing-this-game parts.

Seriously, what goes wrong with "reward the agent when it makes you smile"?

Quoting Rob Bensinger quoting Eliezer:

So what actually happens as near as I can figure (predicting future = hard) is that somebody is trying to teach their research AI to, god knows what, maybe just obey human orders in a safe way, and it seems to be doing that, and a mix of things goes wrong like:

The preferences not being really readable because it's a system of neural nets acting on a world-representation built up by other neural nets, parts of the system are self-modifying and the self-modifiers are being trained by gradient descent in Tensorflow, there's a bunch of people in the company trying to work on a safer version but it's way less powerful than the one that does unrestricted self-modification, they're really excited when the system seems to be substantially improving multiple components, there's a social and cognitive conflict I find hard to empathize with because I personally would be running screaming in the other direction two years earlier, there's a lot of false alarms and suggested or attempted misbehavior that the creators all patch successfully, some instrumental strategies pass this filter because they arose in places that were harder to see and less transparent, the system at some point seems to finally "get it" and lock in to good behavior which is the point at which it has a good enough human model to predict what gets the supervised rewards and what the humans don't want to hear, they scale the system further, it goes past the point of real strategic understanding and having a little agent inside plotting, the programmers shut down six visibly formulated goals to develop cognitive steganography and the seventh one slips through, somebody says "slow down" and somebody else observes that China and Russia both managed to steal a copy of the code from six months ago and while China might proceed cautiously Russia probably won't, the agent starts to conceal some capability gains, it builds an environmental subagent, the environmental agent begins self-improving more freely, undefined things happen as a sensory-supervision ML-based architecture shakes out into the convergent shape of expected utility with a utility function over the environmental model, the main result is driven by whatever the self-modifying decision systems happen to see as locally optimal in their supervised system locally acting on a different domain than the domain of data on which it was trained, the light cone is transformed to the optimum of a utility function that grew out of the stable version of a criterion that originally happened to be about a reward signal counter on a GPU or God knows what.

Perhaps the optimal configuration for utility per unit of matter, under this utility function, happens to be a tiny molecular structure shaped roughly like a paperclip.

That is what a paperclip maximizer is. It does not come from a paperclip factory AI. That would be a silly idea and is a distortion of the original example.

The Main Sources of AI Risk?

Thanks for your willingness to help! At this point, years later, I think the next step would be to turn this into a different format than a list. Maybe a gigantic Venn diagram. Yes, a website would maybe be a good place for that. But the first thing to do is think about how to reorganize it conceptually--would a Venn diagram work? If not, what should we do?

On a smaller scale, if you have edits you want to make to this existing list please gimme them & I'll implement them and credit you.

Two-year update on my personal AI timelines

As opposed to what, linear? Or s-curvy? S-curves look exponential until you get close to the theoretical limit. I doubt we are close to the theoretical limits.

Ajeya bases her estimate on empirical data, so if you want to see whether it's exponential, go look at that, I guess.
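A quick numerical illustration of the point about s-curves above, with arbitrary parameters: while it is still far below its ceiling, a logistic (s-shaped) curve is nearly indistinguishable from an exponential with the same growth rate.

```python
# Compare an exponential with a logistic curve far below its ceiling.
# The rate and ceiling are arbitrary values chosen for illustration.
import numpy as np

t = np.arange(0, 10)
rate = 0.5
ceiling = 1e6  # theoretical limit, assumed to be far away

exponential = np.exp(rate * t)
# Logistic curve starting at 1 and saturating at `ceiling`
logistic = ceiling / (1 + (ceiling - 1) * np.exp(-rate * t))

# While exp(rate*t) << ceiling, the two curves differ by roughly exp(rate*t)/ceiling
rel_diff = np.abs(logistic - exponential) / exponential
print(rel_diff.max())  # tiny here (~1e-4), even though the logistic eventually flattens out
```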

Two-year update on my personal AI timelines

And in particular, we should update towards below-human-level FLOPS.

Two-year update on my personal AI timelines

Thanks so much for this update! Some quick questions:

  1. Are you still estimating that the transformative model probably uses about 1e16 parameters & 1e16 FLOP? IMO something more like 1e13 is more reasonable.
  2. Are you still estimating that algorithmic efficiency doubles every 2.5 years (for now at least, until R&D acceleration kicks in)? I've heard from others (e.g. Jaime Sevilla) that more recent data suggests it's currently doubling every 1 year (see the back-of-the-envelope sketch after this list for why the difference matters).
  3. Do you still update against the lower end of training FLOP requirements, on the grounds that if we were 1-4 OOMs away right now the world would look very different?
  4. Is there an updated spreadsheet we can play around with?
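Regarding question 2, a back-of-the-envelope sketch of how much the assumed doubling time matters over a fixed horizon (the 10-year horizon here is arbitrary, chosen only for illustration; neither figure is anyone's official estimate):

```python
# Cumulative algorithmic-efficiency gain over a fixed horizon under the two
# doubling times mentioned above (2.5 years vs 1 year).
import math

horizon_years = 10
for doubling_time in (2.5, 1.0):
    gain = 2 ** (horizon_years / doubling_time)
    ooms = math.log10(gain)
    print(f"doubling every {doubling_time} yr -> {gain:,.0f}x (~{ooms:.1f} OOMs) over {horizon_years} yr")
# 2.5-yr doubling gives ~16x (~1.2 OOMs); 1-yr doubling gives ~1024x (~3 OOMs).
```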