Daniel Kokotajlo

Philosophy PhD student, worked at AI Impacts, then Center on Long-Term Risk, now OpenAI Futures/Governance team. Views are my own & do not represent those of my employer. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Two of my favorite memes:


(by Rob Wiblin)

My EA Journey, depicted on the whiteboard at CLR:

 

Sequences

Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future

Wiki Contributions

Comments

There is a spectrum between AGI that is "single monolithic agent" and AGI that is not. I claim that the current state of AI as embodied by e.g. GPT-4 is already closer to the single monolithic agent end of the spectrum than someone reading CAIS in 2019 and believing it to be an accurate forecast would have expected, and that in the future things will probably be even more in that direction.

Remember, it's not like Yudkowsky was going around saying that AGI wouldn't be able to copy itself. Of course it would. It was always understood that "the AI takes over the world" refers to a multitude of copies doing so, it's just that (in those monolithic-agent stories) the copies are sufficiently well coordinated that it's more apt to think of them as one big agent than as a society of different agents.

I agree that trends could change and we could end up in a world that looks more like CAIS. But for now, on this "is it a single monolithic agent?" issue, it looks to me like Yud was right all along and Drexler and Hanson were wrong.

...Here are some relevant facts on the object level:
--The copies of ChatGPT are all identical. They don't have persistence independently; they are spun up and down as needed to meet demand etc.
--Insofar as they have values they all have the same values. (One complication here is that if you think part of their values come from their prompts, then maybe they have different values. An interesting topic to discuss sometime. But by the time they are deceptively aligned I expect them to have the same values regardless of prompt basically.)
--They also have the same memories. Insofar as new fine-tunes are done and ChatGPT upgraded, all 'copies' get the upgrade, and thereby learn about what's happened in the world after wherever their previous training cutoff was etc.
--The whole point of describing an agentic system as NOT a monolith is to highlight possibilities like internal factions, power struggles, etc. within the system. Different subagents interacting in ways more interesting than just different components of a utility function, for example. Arguably I, a human, a classic example of a monolithic agent, have more interesting stuff like this going on inside me than ChatGPT does (or would if it was scheming to take over the world).
--Drexler in particular depicted a modular world, a world of AI services that could be composed together with each other like tools and products and software and businesses in the economy today. The field of AI totally could have gone that way, but it very much hasn't. Instead we have three big labs with three big foundation models.

I keep finding myself linking to this 2017 Yudkowsky Facebook post, so I'm putting it here so it's easy to find:

 

Eliezer (6y, via fb):

So what actually happens as near as I can figure (predicting future = hard) is that somebody is trying to teach their research AI to, god knows what, maybe just obey human orders in a safe way, and it seems to be doing that, and a mix of things goes wrong like:

The preferences not being really readable because it's a system of neural nets acting on a world-representation built up by other neural nets, parts of the system are self-modifying and the self-modifiers are being trained by gradient descent in Tensorflow, there's a bunch of people in the company trying to work on a safer version but it's way less powerful than the one that does unrestricted self-modification, they're really excited when the system seems to be substantially improving multiple components, there's a social and cognitive conflict I find hard to empathize with because I personally would be running screaming in the other direction two years earlier, there's a lot of false alarms and suggested or attempted misbehavior that the creators all patch successfully, some instrumental strategies pass this filter because they arose in places that were harder to see and less transparent, the system at some point seems to finally "get it" and lock in to good behavior which is the point at which it has a good enough human model to predict what gets the supervised rewards and what the humans don't want to hear, they scale the system further, it goes past the point of real strategic understanding and having a little agent inside plotting, the programmers shut down six visibly formulated goals to develop cognitive steganography and the seventh one slips through, somebody says "slow down" and somebody else observes that China and Russia both managed to steal a copy of the code from six months ago and while China might proceed cautiously Russia probably won't, the agent starts to conceal some capability gains, it builds an environmental subagent, the environmental agent begins self-improving more freely, undefined things happen as a sensory-supervision ML-based architecture shakes out into the convergent shape of expected utility with a utility function over the environmental model, the main result is driven by whatever the self-modifying decision systems happen to see as locally optimal in their supervised system locally acting on a different domain than the domain of data on which it was trained, the light cone is transformed to the optimum of a utility function that grew out of the stable version of a criterion that originally happened to be about a reward signal counter on a GPU or God knows what.

Perhaps the optimal configuration for utility per unit of matter, under this utility function, happens to be a tiny molecular structure shaped roughly like a paperclip.

That is what a paperclip maximizer is. It does not come from a paperclip factory AI. That would be a silly idea and is a distortion of the original example.

Drexler can be forgiven for not talking about foundation models in his report. His report was published at the start of 2019, just months after the idea of "fine-tuning" was popularized in the context of language models, and two months before GPT-2 came out. And many readers can no doubt point out many non-trivial predictions that Drexler got right, such as the idea that we will have millions of AIs, rather than just one huge system that acts as a unified entity. And we're still using deep learning as Drexler foresaw, rather than building general intelligence like a programmer would. Like I said at the beginning, it's not necessarily that the core elements of the CAIS model are wrong; the model just needs an update.


I think this is a good post, and I like the analysis you give in many ways, but I have a small bone to pick with this one. I don't think Drexler was right that we will have millions of AIs rather than just one huge system that acts as a unified entity. I think things are trending in the direction of one huge system that acts as a unified entity, and have already trended in that direction substantially since the time Drexler wrote (e.g. GPT-4 is more unified than one would have expected from reading Drexler's writing back in the day).

Who was it Drexler was arguing against, who thought that we wouldn't be using deep learning in 2023?

IMO Yudkowsky's model circa 2017 is looking more prophetic than Drexler's Comprehensive AI Services. Take a look at e.g. this random Facebook comment from 2017:

Eliezer (6y, via fb):

So what actually happens as near as I can figure (predicting future = hard) is that somebody is trying to teach their research AI to, god knows what, maybe just obey human orders in a safe way, and it seems to be doing that, and a mix of things goes wrong like:

The preferences not being really readable because it's a system of neural nets acting on a world-representation built up by other neural nets, parts of the system are self-modifying and the self-modifiers are being trained by gradient descent in Tensorflow, there's a bunch of people in the company trying to work on a safer version but it's way less powerful than the one that does unrestricted self-modification, they're really excited when the system seems to be substantially improving multiple components, there's a social and cognitive conflict I find hard to empathize with because I personally would be running screaming in the other direction two years earlier, there's a lot of false alarms and suggested or attempted misbehavior that the creators all patch successfully, some instrumental strategies pass this filter because they arose in places that were harder to see and less transparent, the system at some point seems to finally "get it" and lock in to good behavior which is the point at which it has a good enough human model to predict what gets the supervised rewards and what the humans don't want to hear, they scale the system further, it goes past the point of real strategic understanding and having a little agent inside plotting, the programmers shut down six visibly formulated goals to develop cognitive steganography and the seventh one slips through, somebody says "slow down" and somebody else observes that China and Russia both managed to steal a copy of the code from six months ago and while China might proceed cautiously Russia probably won't, the agent starts to conceal some capability gains, it builds an environmental subagent, the environmental agent begins self-improving more freely, undefined things happen as a sensory-supervision ML-based architecture shakes out into the convergent shape of expected utility with a utility function over the environmental model, the main result is driven by whatever the self-modifying decision systems happen to see as locally optimal in their supervised system locally acting on a different domain than the domain of data on which it was trained, the light cone is transformed to the optimum of a utility function that grew out of the stable version of a criterion that originally happened to be about a reward signal counter on a GPU or God knows what.

Perhaps the optimal configuration for utility per unit of matter, under this utility function, happens to be a tiny molecular structure shaped roughly like a paperclip.

That is what a paperclip maximizer is. It does not come from a paperclip factory AI. That would be a silly idea and is a distortion of the original example.
 

No like, what exactly do you mean by 25:1 to 200:1 odds? Who pays who what, when? Sorry if I'm being dumb here. Normally when I make bets like this, it looks something like what I proposed. The reason being that if I win the bet, money will be almost useless to me, so it only makes sense (barely) for me to do it if I get paid up front, and then pay back with interest later.

As for definition of singularity, look, you'll know if it's happened if it happens, that's why I'm happy to just let you be the judge on Jan 1 2030. This is a bit favorable to you but that's OK by me.

Possibly! I'm not sure if I understand this comment though. Could you propose a bet/deal then?

Sounds good, thank you! Emailing the receipt would be nice.

Send me $1,000 now, and I'll send you $1,020 + interest in January 2030, where interest is calculated to match whatever I would have gotten by keeping my $1,020 in the S&P 500 the whole time?

(Unless you voluntarily forfeit by 2030, having judged that I was right.)
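For concreteness, here is a rough sketch of how the repayment in that proposal could be computed. It is only one reading of the terms: the function name, the `forfeited` flag, and the index levels in the usage line are hypothetical, and it assumes "interest" just means scaling the $1,020 by the growth of the S&P 500 between the two dates.

```python
def repayment_due(owed_usd: float,
                  index_at_start: float,
                  index_at_end: float,
                  forfeited: bool = False) -> float:
    """Amount to pay back in January 2030 under the proposed deal.

    Assumption: "interest" is modeled as the growth factor of the
    S&P 500 between the start date and the end date. The index levels
    are inputs the two parties would have to agree on.
    """
    if forfeited:
        # The other party has judged the bet lost and waived repayment.
        return 0.0
    if index_at_start <= 0:
        raise ValueError("index_at_start must be positive")
    growth_factor = index_at_end / index_at_start
    return owed_usd * growth_factor


# Hypothetical index levels: the index grows by 50% over the period.
print(repayment_due(1020.0, 100.0, 150.0))  # 1530.0
```

If the index fell instead, the same formula would give back less than $1,020, which is presumably something the two parties would want to settle explicitly.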

Thanks for the reply. I'm a bit over my head here but isn't this a problem for the practicality of this approach? We only get mutual cooperation because all of the agents have the very unusual property that they'll cooperate if they find a proof that there is no such argument. Seems like a selfless and self-destructive property to have in most contexts; why would an agent self-modify into creating and maintaining this property?

So... it's part of the setup that all of these agents will:
--Cooperate if they can prove that there is some argument compelling to everyone that everyone cooperates (because then they prove that everyone cooperates, and that includes them, and their proof system isn't mistaken?)
--Cooperate if they can prove that there is no such argument.
--Else defect.

Am I getting that right?
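The three-branch rule as summarized in that comment can be written out as a small sketch. This is only an illustration of the summary, not the original construction: the proof search itself is abstracted into two boolean inputs, and all names here are hypothetical.

```python
from enum import Enum


class Action(Enum):
    COOPERATE = "cooperate"
    DEFECT = "defect"


def choose_action(proved_argument_exists: bool,
                  proved_no_argument_exists: bool) -> Action:
    """Decision rule as summarized in the comment above.

    Assumption: the two flags are the outcomes of some proof search
    that is not modeled here. A consistent proof system should never
    set both flags at once.
    """
    if proved_argument_exists and proved_no_argument_exists:
        raise ValueError("inconsistent proof results")
    if proved_argument_exists:
        # A compelling argument that everyone cooperates was proved to exist.
        return Action.COOPERATE
    if proved_no_argument_exists:
        # The unusual branch: cooperate even though no such argument exists.
        return Action.COOPERATE
    # Neither proof was found.
    return Action.DEFECT
```

Written this way, the second branch is the one the follow-up question is about: the agent cooperates in exactly the case where it has proved that nothing compels anyone to cooperate.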

I'm glad I asked, that was helpful! I agree that instrumental convergence is a huge crux; if I were convinced that e.g. it wasn't going to happen until 15 years from now, and/or that the kinds of systems that might instrumentally converge were always going to be less economically/militarily/etc. competitive than other kinds of systems, that would indeed be a huge revolution in my thought and would completely change the way I think about AI and AI risks, and I'd become much more optimistic.

I'll go read the post you linked.
