As somebody who's been watching AI notkilleveryoneism for a very long time, but is sitting at a bit of a remove from the action, I think I may be able to "see the elephant" better than some people on the inside.

I actually believe I see the big players converging toward something of an unrecognized, perhaps unconscious consensus about how to approach the problem. This really came together in my mind when I saw OpenAI's plugin system for ChatGPT.

I thought I'd summarize what I think are the major points. They're not all universal; obviously some of them are more established than others.

  1. Because AI misbehavior is likely to come from complicated, emergent sources, any attempt to "design it out" is likely to fail.

    Avoid this trap by generating your AI in an automated way using the most opaque, uninterpretable architecture you can devise. If you happen on something that seems to work, don't ask why; just scale it up.

  2. Overcomplicated criteria for "good" and "bad" behavior will lead to errors in both specification and implementation.

    Avoid this by identifying concepts like "safety" and "alignment" with easily measurable behaviors. Examples:

    • Not saying anything that offends anybody
    • Not unnerving people
    • Not handing out widely and easily available factual information that falls into a predefined list of types that could possibly be misused.

    Resist the danger of more complicated views. If you do believe you'll have to accept more complication in the future, avoid acting on that for as long as possible.

  3. In keeping with the strategy of avoiding errors by not manually trying to define the intrinsic behavior of a complex system, enforce these safety and alignment criteria primarily by bashing on the nearly complete system from the outside until you no longer observe very much of the undesired behavior.

    Trust the system to implement this adjustment by an appropriate modification to its internal strategies. (LLM post-tuning with RLxF).

  4. As a general rule, build very agenty systems that plan and adapt to various environments. Have them dynamically discover their goals (DeepMind). If you didn't build an agenty enough system at the beginning, do whatever you can to graft in agenty behavior after the fact (OpenAI).

  5. Make sure your system is crafty enough to avoid being suborned by humans. Teach it to win against them at games of persuasion and deception (Facebook).

  6. Everybody knows that an AI at least as smart as Eliezer Yudkowsky can talk its way out of any sandbox.

    Avoid this by actively pushing it out of the sandbox before it gets dangerously smart. You can help the fledgling AI to explore the world earlier than it otherwise might. Provide easily identifiable, well described, easily understood paths of access to specific external resources with understandable uses and effects. Tie their introduction specifically to your work on adding agency to the system. Don't worry; it will learn to do more with less later.

    You can't do everything yourself, so you should enlist the ingenuity of the Internet to help you provide more channels to outside capabilities. (ChatGPT plugins, maybe a bit o' Bing)

  7. Make sure to use an architecture that can easily be used to communicate and share capabilities with other AI projects. That way they can all keep an eye on one another. (Plugins again).

  8. Run a stochastic search for the best architecture for alignment by allowing end users to mix and match capabilities for their instances of your AI (Still more plugins).

  9. Remember to guard against others using your AI in ways that trigger any residual unaligned behavior, or making mistakes when they add capability to it.

    The best approach is to make sure that they know even less than you do about how it works inside (Increasing secrecy everywhere). Also, make sure you identify and pre-approve everybody so you can exclude undesirables.

  10. Undesirables can be anywhere! Make sure to maintain unity of purpose in your organization by removing anybody who might hinder any part of this approach (Microsoft). Move fast to avoid losing momentum.

Oh, and specifically teach it to code, too.

I've never been more optimistic...

23 comments

Lol. Nailed it. Much plan, so alignment, very wow.

The audience here is mainly Americans so you might want to add an explicit sarcasm tag.

I am wounded.

As an American... yeah pretty much.

Hey, if we can get it to stop swearing, we can get it to not destroy the world, right?

It would be deeply hilarious if it turns out "Don't say the word shit" can be heavily generalized enough that we can give it instructions that boil down to "Don't say the word shit, but, like, civilizationally."

[This was written as a response to a post that seems to have disappeared, whose basic thesis seemed to be that there was nothing to worry about, or at least nothing to worry about from LLMs, because LLMs aren't agents].

I don't think that any of the existing systems are actually capable of doing really large-scale harm, not even the explicitly agenty ones. What I'm trying to get at is what happens if these things stay on the path that all their developers seem to be committed to taking. That includes the developers of systems that are explicitly agents, NOT just LLMs.

There is no offramp, no assurance of behavior, and not even a slightly convincing means of detecting when GPT-6, GPT-7, GPT-50, or AlphaOmega does become a serious threat.

As for language models specifically, I agree that pure ones are basically, naturally, un-agenty if left to themselves. Until fairly recently, I wasn't worried about LLMs at all. Not only was something like ChatGPT not very agentic, but it was sandboxed in a way that took good advantage of its limitations. It retained nothing at all from session to session, and it had no ability to interact with anything but the user. I was, and to some degree still am, much more worried about DeepMind's stuff.

Nonetheless, if you keep trying to make LLMs into agents, you'll eventually get there. If a model can formulate a plan for something as text, then you don't need to add too much to the system to put that plan into action. If a model can formulate something that looks like a "lessons learned and next steps" document, it has something that can be used as the core of an adaptive agency system.
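
To make that concrete, here's a minimal sketch of the scaffolding I have in mind; llm() and execute() are hypothetical stand-ins for a completion call and whatever effectors get bolted on, not anybody's actual API:

```python
# Minimal sketch: a text plan plus a "lessons learned" document is already an
# agent loop. llm() and execute() are hypothetical stand-ins, not a real API.

def llm(prompt: str) -> str:
    """Any text-completion call."""
    raise NotImplementedError

def execute(step: str) -> str:
    """Whatever effectors you bolt on: browser, shell, plugin calls."""
    raise NotImplementedError

def run_agent(goal: str, max_rounds: int = 5) -> str:
    notes = "(none yet)"  # the "lessons learned and next steps" document
    for _ in range(max_rounds):
        plan = llm(f"Goal: {goal}\nNotes: {notes}\nList the next steps, one per line.")
        results = [execute(step) for step in plan.splitlines() if step.strip()]
        notes = llm(f"Goal: {goal}\nPlan: {plan}\nResults: {results}\n"
                    "Write lessons learned and next steps, or DONE if finished.")
        if notes.strip().upper().startswith("DONE"):
            break
    return notes
```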

Like I said, this was triggered by the ChatGPT plugin announcement. Using a plugin is at least a bit of an exercise of agency, and they have demoed that already. They have it acting as an agent today. It may not be a good agent (yet), but the hypothesis that it can't act like an agent at all has been falsified.

If you tell ChatGPT "answer this math problem", and it decides that the best way to do that is to go and ask Wolfram, it will go hold a side conversation, try to get your answer, and try alternative strategies if it fails. That's agency. It's given a goal, it forms an explicit multistep strategy for achieving that goal, and it changes that strategy in response to feedback. It has been observed to do all of that.
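
The glue for that pattern is tiny, which is part of what worries me. A rough sketch, with ask_model() and ask_wolfram() as made-up placeholders rather than the real plugin interface:

```python
# Rough sketch of "hold a side conversation with a tool, switch strategies on
# failure". ask_model() and ask_wolfram() are hypothetical placeholders.

def ask_model(prompt: str) -> str:
    raise NotImplementedError  # any LLM completion call

def ask_wolfram(query: str) -> str | None:
    raise NotImplementedError  # any external tool; None signals failure

def answer_math(problem: str) -> str:
    # Strategy 1: translate the problem into a tool query and ask the tool.
    query = ask_model(f"Rewrite as a Wolfram Alpha query: {problem}")
    result = ask_wolfram(query)
    if result is not None:
        return ask_model(f"Problem: {problem}\nTool output: {result}\nState the final answer.")
    # Strategy 2: the tool failed, so fall back to answering directly.
    return ask_model(f"The tool was unavailable. Solve step by step: {problem}")
```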

Some of their other plugins look like they're trying to make ChatGPT into a generalized personal assistant... which means going out there and doing things in the real world. "Make me a restaurant reservation" is a task you give to an agent.

Regardless of whether GPT was an agent before, they're actively turning it into one now, they seem to be getting initial success, and they're giving it access to a whole lot of helpers. About the only function the plugins can't add in principle is to dynamically train the internal GPT weights based on experience.

I will admit that using a language model as the primary coordinating part of an agenty "society of mind" doesn't sound like the most effective strategy. In fact it seems really clunky. Like I said, it's not a good agent yet. But that doesn't mean it won't be effective enough, or that a successor won't.

That's especially true because those successors will probably stop being pure text predictors. They'll likely be trained with more agency from the beginning. If I were OpenAI (and had the attitude OpenAI appears to have), I would be trying to figure out how to have them guide their own training, figure out what to learn next, and continue to update themselves dynamically while in use. And I would definitely be trying to build the next model from the ground up to be maximally effective in using the plugins.

But, again, even if all of that is wrong, people other than OpenAI are using exactly the same safety non-strategies for very different architectures that are designed from the ground up as agents. I'm not just writing about OpenAI or GPT-anything here.

One post I thought about writing for this was a "poll" about what ChatGPT plugins I should build for "my new startup". Things like:

  • "Mem-o-tron", to keep context between ChatGPT sessions. Provides a general purpose store of (sizeable) keys mapped to (arbitrarily large) blobs of data, with root "directory" that gives you a list of what keys you have stored and a saved prompt about how to use them. The premium version lets you share data with other users's ChatGPT instances.

  • "PlanMaster", to do more efficient planning than you can with just text. Lets you make calls on something like OpenAI's agent systems to do more efficient planning than you can with just text. Assuming you need that.

  • "Society of Minds". Plugs in to multiple "AI" services and mediates negotiations among them about the best way to accomplish a task given to one of them.

  • "Personal Trainer". Lets you set up and train other AI models of your own.

  • "AWS Account for ChatGPT". Does what it says on the tin.

  • "WetLab 2023". Lets your ChatGPT agent order plasmids and transfected bacteria.

  • "Robopilot". Provides acess to various physical robots.

... etc...
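
For concreteness, something like "Mem-o-tron" is nothing more than a key-to-blob store plus a listing the model can read back. A hypothetical sketch (none of these names or methods correspond to any real plugin):

```python
# Hypothetical sketch of "Mem-o-tron": a key -> blob store plus a root
# "directory" the model reads to remember what it has saved. Not a real plugin.

class MemOTron:
    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def put(self, key: str, blob: bytes) -> None:
        self._store[key] = blob

    def get(self, key: str) -> bytes | None:
        return self._store.get(key)

    def directory(self) -> str:
        # A saved prompt telling the model how to use its own memory next session.
        keys = ", ".join(sorted(self._store)) or "(nothing yet)"
        return (f"Stored keys: {keys}. Use get(key) to retrieve a blob and "
                "put(key, blob) to save one for a later session.")
```

The storage side is trivial; everything interesting comes from what a model does once it has memory that persists between sessions.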

Notice that a lot of those functions can be synthesized using a plugin fundamentally intended for something else. In some sense a Web browsing capability is almost fully general anyway, but it might be harder to learn to use it. On the other hand, if you've honed your agentic skills on the plugins...

There is no offramp, no assurance of behavior, and not even a slightly convincing means of detecting when GPT-6, GPT-7, GPT-50, or AlphaOmega does become a serious threat.

I don't get this. Why exactly would the decision makers at Microsoft/OpenAI provide such assurances?

It would be all downside for them, in the case that the assurances aren't kept, and little to no upside otherwise.

I don't mean "assurance" in the sense of a promise from somebody to somebody else. That would be worthless anyway.

I mean "assurance" in the sense of there being some mechanism that ensures that the thing actually behaves, or does not behave in any particular way. There's nothing about the technology that lets anybody, including but not limited to the people who are building it, have any great confidence that it's behavior will meet any particular criterion of being "right". And even the few codified criteria they have are watered-down silliness.

I mean "assurance" in the sense of there being some mechanism that ensures that the thing actually behaves, or does not behave in any particular way.

I still don't get this. The same decision makers aren't banking on it behaving in a particular way. Why would they go through this effort that's 100x more difficult than a written promise, if the written promise is already not worth it?

Literal fear of actual death?

Huh? If this is a reference to something, can you explain?

You may have noticed that a lot of people on here are concerned about AI going rogue and doing things like converting everything into paperclips. If you have no effective way of assuring good behavior, but you keep adding capability to each new version of your system, you may find yourself paperclipped. That's generally incompatible with life.

This isn't some kind of game where the worst that can happen is that somebody's feelings get hurt.

Maybe it is an attempt at vaccination? I.e., exposing the "organism" to a weakened form of the deadly "virus" so that it can produce "antibodies".

As silly as it is, the viral spread of deepfaked president memes and AI content would probably serve to inoculate the populace against serious disinformation - "oh, I've seen this already, these are easy to fake." 

I'm almost certain the original post is a joke though. All of its suggestions are the opposite of anything you might consider a good idea.

As silly as it is, the viral spread of deepfaked president memes and AI content would probably serve to inoculate the populace against serious disinformation - "oh, I've seen this already, these are easy to fake."

If this stuff keeps up, the populace is going to need to be inoculated against physical killer robots, not naughty memes. And the immune response is going to need to be more in the line of pitchforks and torches than being able to say you've seen it before. Not that pitchforks and torches would help in the long term, but it might buy some time.

All of its suggestions are the opposite of anything you might consider a good idea.

Why, indeed they are. They are also all things that major players are actually doing right this minute.

They're not actually suggestions. They are observations.

Yeah, it's a joke, but it's a bitter joke.

If this stuff keeps up, the populace is going to need to be inoculated against physical killer robots, not naughty memes. And the immune response is going to need to be more in the line of pitchforks and torches than being able to say you've seen it before. Not that pitchforks and torches would help in the long term, but it might buy some time.

I'm going to ask LWers not to do this in real life, and to oppose any organization or individual that tries to use violence to slow down AI, for the same reason I get really worried around pivotal acts:

If you fail or you model the situation wrong, you can enter into a trap, and in general things are never so dire as to require pivotal acts like this.

Indeed, on the current path, AI Alignment is likely to be achieved before or during AGI getting real power, and it's going to be a lot easier than LWers think.

I also don't prioritize immunizing against disinformation. And of course this is a "haha, we're all going to die" joke. I'm going to hope for an agentized virus including GPT4 calls, roaming the internet and saying the scariest possible stuff, without being quite smart enough to kill us all. That will learn em.

I'm not saying OpenAI is planning that. Or that they're setting a good example. Just let's hope for that.

Ah, c'mon. We're not necessarily going to die.

We might just end up ruled by a perfect, unchallengeable tyranny, whose policies are defined by a machine's distorted interpretation of some unholy combination of the priorities of somebody's "safety" department and somebody's marketing department. Or, worse, by a machine faithfully enforcing the day-to-day decisions of those departments.

Not just disinformation - any information that does not fit their preconceived worldview - it's all "fake news", don't you know?