Epistemic status: head spinning, suddenly unsure of everything in alignment. And unsure of these predictions.
I'm following the suggestions in 10 reasons why lists of 10 reasons might be a winning strategy in order to get this out quickly (reason 10 will blow your mind!). I'm hoping to prompt some discussion rather than attempt the definitive writeup on a technique that was introduced so recently.
Ten reasons why agentized LLMs will change the alignment landscape:
- Agentized[1] LLMs like Auto-GPT and Baby AGI may fan the sparks of AGI in GPT-4 into a fire. These techniques use an LLM as a central cognitive engine, within a recursive loop of breaking a task goal into subtasks, working on those subtasks (including calling other software), and using the LLM to prioritize subtasks and decide when they're adequately well done. They recursively check whether they're making progress on their top-level goal.
- While it remains to be seen what these systems can actually accomplish, I think it's very likely that they will dramatically enhance the effective intelligence of the core LLM. I think this type of recursivity and breaking problems into separate cognitive tasks is central to human intelligence. This technique adds several key aspects of human cognition: executive function; reflective, recursive thought; and episodic memory for tasks, despite using non-brainlike implementations. To be fair, the existing implementations seem pretty limited and error-prone. But they were implemented in days. So this is a prediction of near-future progress, not a report on amazing new capabilities.
- This approach appears to be easier than I'd thought. I've been expecting this type of self-prompting to imitate the advantages of human thought, but I didn't expect the cognitive capacities of GPT-4 to make it so easy to do useful multi-step thinking and planning. The ease of initial implementation (something like three days for Baby AGI, with all of the code itself written by GPT-4) implies that improvements may also be easier than we would have guessed.
- Integration with HuggingGPT and similar approaches can provide these cognitive loops with more cognitive capacities. This integration was also easier than I'd have guessed, with GPT-4 learning from a handful (e.g., 40) of examples how to use other software tools. Those tools will include both sensory capacities, with vision models and other sensory models of various types, and the equivalent of a variety of output capabilities.
- Integration of recursive LLM self-improvement like "Reflexion" can utilize these cognitive loops to make the core model better at a variety of tasks.
- The ease of agentizing LLMs is terrible news on the capabilities front. I think we'll have an internet full of LLM-bots "thinking" up and doing stuff within a year.
- This is absolutely bone-chilling for the urgency of the alignment and coordination problems. Some clever chucklehead already created ChaosGPT, an instance of Auto-GPT given the goal to destroy humanity and create chaos. You are literally reading the thoughts of something thinking about how to kill you. It's too stupid to get very far, but it will get smarter with every LLM improvement, and with every improvement to the recursive self-prompting wrapper programs. This gave me my very first visceral fear of AGI destroying us. I recommend watching it, unless you're already plenty viscerally freaked out.
- Watching agents think is going to shift public opinion. We should be ready for more AI scares and changing public beliefs. I have no idea how this is going to play out in the political sphere, but we need to figure this out to have a shot at successful alignment, because
- We will be in a multilateral AGI world. Anyone can spawn a dumb AGI and have it either manage their social media, or try to destroy humanity. And over the years, those commercially available AGIs will get smarter. Because defense is harder than offense, it is going to be untenable to indefinitely defend the world against out-of-control AGIs. But
- Important parts of alignment and interpretability might be a lot easier than most of us have been thinking. These agents take goals as input, in English. They reason about those goals much as humans do, and this will likely improve with model improvements. This does not solve the outer alignment problem; one existing suggestion is to include a top-level goal of "reducing suffering." No! No! No! This also does not solve the alignment stability problem: starting goals can be misinterpreted or lost to recursive subgoals, and if any type of continued learning is included, behavior will shift over time. It doesn't even solve the inner alignment problem if recursive training methods create mesa-optimizers in the LLMs. But it also provides incredibly easy interpretability, because these systems think in English.
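To make the first point concrete, the core loop these wrappers implement can be sketched in a few dozen lines. This is a minimal toy sketch, not Auto-GPT's actual code: `fake_llm`, its canned task decomposition, and the `TOOLS` registry are all hypothetical stand-ins for real API calls and for HuggingGPT-style tool integrations.

```python
from collections import deque

# Toy tool registry, standing in for HuggingGPT-style integrations
# (vision models, code execution, web search). Names are illustrative only.
TOOLS = {
    "web-search": lambda query: f"[search results for: {query}]",
}

def fake_llm(prompt):
    # Stand-in for the real LLM call (e.g., a GPT-4 API request), so the
    # control flow can run offline; the decomposition below is canned.
    if prompt.startswith("Decompose"):
        return ["research topic", "draft outline", "write sections"]
    if prompt.startswith("Check"):
        return "done" if "write sections" in prompt else "continue"
    return "ok"

def run_agent(goal, llm=fake_llm, max_steps=20):
    """Skeleton of the loop described above: break the goal into subtasks,
    work the queue (possibly via external tools), and recursively check
    progress against the top-level goal."""
    tasks = deque(llm(f"Decompose the goal: {goal}"))
    completed = []
    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.popleft()                 # highest-priority subtask first
        if "research" in task:                 # crude tool dispatch
            TOOLS["web-search"](task)
        llm(f"Work on subtask: {task}")
        completed.append(task)
        # Ask the model whether the top-level goal is adequately done; a
        # real agent would also generate and reprioritize new subtasks here.
        if llm(f"Check: is '{goal}' done, given {completed}?") == "done":
            break
    return completed
```

Everything interesting lives in the prompts and in what the tool calls can touch; the glue code itself is trivial, which is part of the point.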
If I'm right about any reasonable subset of this stuff, this lands us in a terrifying, promising new landscape of alignment issues. We will see good bots and bad bots, and the balance of power will shift. Ultimately I think this leads to the necessity of very strong global monitoring, including breaking all encryption, to prevent hostile AGI behavior. The array of issues is dizzying (I am personally dizzied, and a bit short on sleep from fear and excitement). I would love to hear others' thoughts.
[1] I'm using a neologism, and a loose definition of agency as things that flexibly pursue goals. That's similar to this more rigorous definition.
Maybe one upside to the influx of "agents made with GPT-N API calls and software glue" is that these types of AI agents are more likely to cause a fire alarm-y disaster which gets mitigated, thus spurring governments to take X-risk more seriously, as opposed to other types of AI agents, whose first disaster would blow right past fire alarm level straight to world-ending level?
For example, I think this situation is plausible: ~AutoGPT-N[1] hacks into a supercomputer cluster or social-engineers IT workers over email or whatever in the pursuit of some other goal, but ultimately gets shut down by OpenAI simply banning the agent from using their API. Maybe it even succeeds in some scarier instrumental goal, like obtaining more API keys and spawning multiple instances of itself. However, the crucial detail is that the main "cognitive engine" of the agent is bottlenecked by API calls, so for the agent to wipe everyone out, it needs to overcome the hurdle of pwning OpenAI specifically.
By contrast, if an agent that's powered by an open-source language model gets to the "scary fire alarm" level of self-improvement/power-seeking, it might be too late, since it wouldn't have a "stop butto... (read more)
Having played with both BabyAGI and AutoGPT over the past few days, I'm actually surprised at how hard it is to get them to do useful multistep thinking and planning. Even on things I'd expect an LLM to be good at, like writing a bunch of blog posts from a list, or book chapters from an outline, the LLM tends to get off track in a way I wouldn't expect from the coherence I see in chat interactions, where I'm constantly giving the LLM hints about the topic and can reroll or rewrite if it misunderstands. I think I was underestimating how much work that constant feedback and correction from me was doing.
Idk, I feel about this stuff like I felt about GPT-J. What scares me is not how well it works, but that it kinda/sorta works a bit. It's a bunch of garbage Python code wrapped around an API, and it kinda works. I expect people will push on this stuff hard, and am worried that DeepMind, OpenAI, and Google will be doing so in a much more principled way than the wild-west LLM enthusiast crowd.
I think it was wrong for people to take comfort in the meme that "GPT-N is not an agent", and this will become very clear to everyone in the next 18 months.
Regarding ChaosGPT:
Attempting to create an agent (even a weak one) tasked with "destroying humanity" should be made very clearly out of bounds of acceptable behavior. I feel that I want the author to be prosecuted.
Now the meme is: "haha we can tell AI to hurt us and make fun of how it fails"
What I would like the meme to be: this is extremely unethical, deserving outrage, and perhaps prosecution as attempted terrorism.
I wonder if/when/how quickly this will be criminalized in a manner similar to terrorism or using weapons of mass destruction.
If we're being realistic, this kind of thing would only get criminalized after something bad actually happened. Until then, too many people will think "omg, it's just a Chatbot". Any politician calling for it would get made fun of on every Late Night show.
I'm not sure this point will stand because it might be cheaper to have them think in their own language: https://www.lesswrong.com/posts/bNCDexejSZpkuu3yz/you-can-use-gpt-4-to-create-prompt-injections-against-gpt-4
Funny, Auto-GPT stuff actually makes me less worried about GPT-4 and its scale-ups. It's been out for weeks, less impressive variants were out for longer, and so far, nothing much has come from it. Looking at the ChaosGPT video... I would've predicted that it wasn't actually any good at pursuing its goal, that it just meandered around the "kill everyone" objective without ever meaningfully progressing — and lo and behold, it's doing exactly that. Actually, it's worse at it than I'd expected.
I see the case for doom, I do! It's conceivable that it will turn out in this manner. We're witnessing an AI planning, here, and it's laughably bad planning so far, but the mere fact that they can do it at all implies a readily-available possibility of making them massively better at it. So in, e.g., five more years, we'd get AIs whose planning skills are to ChaosGPT's as Midjourney is to PixelCNN, and then maybe one of them FOOMs.
But mostly, I agree with this view. And this is an instance of the "wire together GPT models to get an AGI" attempt failing, and on my view it's failing in a way that's providing some evidence this entire approach won't work. It's conceivable that it'd work with GPT≥5, or wit... (read more)
I'm not confident at all Auto-GPT could work at its goals, just that in narrower domains the specific system or arrangement of prompt interactions matters. To give a specific example, I goof around trying to get good longform D&D games out of ChatGPT. (Even GPT-2 fine-tuned on Crit Role transcripts, originally.) Some implementations just work way better than others.
The trivial system is no system: just play D&D. Works great until it feels like the DM is the main character in Memento. The trivial next step: a rolling context window. The conversation fills up, ask for a summary, start a new conversation with the summary. Just that is a lot better. But you really feel the loss of detail in the sudden jump, so why not make it continuous? A secretary GPT with one job: prune the DM GPT conversation text after every question and answer, always trying to keep the most important and the most recent. Smoother than the summary system. Maybe the secretary can keep some details instead of just deleting, maybe use half its tokens for a permanent game-state. Then it can edit useful details in and out of the conversation history. Can the secretary write a text file for old conversations? Etc., etc.
Maybe the... (read more)
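The memory-management progression described above can be sketched in a few lines. This is a hypothetical illustration, with a trivial first-sentence `summarize` standing in for a secretary-LLM call; the function names and the budget parameter are my own inventions, not any real system's API.

```python
def summarize(messages):
    # Stand-in for a "secretary" LLM call that compresses old messages;
    # here we just keep the first sentence of each as a crude summary.
    return " ".join(m.split(".")[0] + "." for m in messages)

def prune_context(history, game_state, budget=6):
    """Secretary pass after each exchange: keep a permanent game-state
    block plus the most recent messages, folding everything older into
    a running summary (the continuous version of 'summarize and restart')."""
    if len(history) <= budget:
        return [game_state] + history
    old, recent = history[:-budget], history[-budget:]
    return [game_state, "Summary: " + summarize(old)] + recent
```

The interesting design choice is reserving a fixed slot for game-state, so important details survive indefinitely instead of fading out with the rolling summary.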
I am absolutely floored. ChaosGPT. How blindly optimistic have I been? How naive and innocent? I've been thinking up complicated disaster scenarios like "the AI might find galaxy-brained optima for its learned proxy-goals far off the distribution we expected and will deceptively cooperate until it's sure it can defeat us." No, some idiot will plain code up ChaosGPT-5 in 10 minutes and tell it to destroy the world.
I've implicitly been imagining alignment as "if we make sure it doesn't accidentally go off and kill us all..." when I should have been thinking "can anyone on the planet use this to destroy the world if they seriously tried?"
Fool! Idiot! Learn the lesson.
My comment will be vague because I'm not sure how much permission I have to share this or if it's been publicly said somewhere and I'm just unaware, but I talked to an AI researcher at one of the major companies/labs working on things like LLMs several years ago, before even GPT-1 was out, and they told me that your reason 10 was basically their whole reason for wanting to work on language models.
I think this post dramatically overestimates the degree to which this was not already understood to be a central use case of LLMs by alignment researchers, although I guess the prospect of people actually running things like "ChaosGPT" was new to some people.
Building a stably aligned agent also doesn't prevent new misaligned agents from getting built and winning, which is even worse than not having the alignment stability problem solved.
Breaking all encryption would break the internet, and would not help solve this problem anyway. You should really be more careful about suggesting giant totalitarian solutions that would not work and are not required. Why not just let OpenAI monitor their models, and let independent cybersecurity firms come up with solutions for the public models in the near term?
"Just unplug it man" is still the kind of reply I am getting on Twitter.
I suspect that the capabilities of these systems are limited by hardware to a decent degree; if we can roll back GPU hardware we may have a chance. If not, then we must do something else that has a similar effect.
Somewhat relevant to this:
someone just published a paper about how they rigged together an agent with GPT-4 and gave it the ability to search the internet, run Python code, and connect to an automated robotic chemistry lab... then told it to synthesise chemicals, and it did. This places it, I'd say, squarely at the level of a competent high school student or chemistry college freshman in the scientific research field.
I'm not even sure you can say this stuff ain't AGI any more. It feels pretty AGI-ish to me.
Quick meta comment to express that I'm uncertain posting things in lists of 10 is a good direction. The advantages might be real: easy to post, quick feedback, easy interaction, etc.
But the main disadvantage is that this comparatively drowns out other better posts (with more thought and value in them). I'm unsure if the content of the post was also importantly missing from the conversation (to many readers) and that's why this got upvoted so fast or if it's a lot the format... Even if this post isn't bad (and I'd argue it is for the suggestions it promotes... (read more)
Interesting to note that BabyAGI etc. are architecturally very similar to my proposed Bureaucracy of AIs. As noted, interpretability is a huge plus of this style of architecture. It should also be much less prone to Goodharting than an AlphaStar-style RL agent. Now we just need to work on installing some morality cores.
Yep, ever since Gato, it's been looking increasingly like you can get some sort of AGI by essentially just slapping some sensors, actuators, and a reward function onto an LLM core. I don't like that idea.
LLMs already have a lot of potential for causing bad outcomes if abused by humans for generating massive amounts of misinformation. However, that pales in comparison to the destructive potential of giving GPT agency and setting it loose, even without idiots trying to make it evil explicitly.
I would much rather live in a world where the first AGIs weren't b... (read more)
This seems dubious as a general rule. (What inspires the statement? Nuclear weapons?)
Cryptography is an important example where sophisticated defenders have the edge against sophisticated attackers. I suspect that's true of computer security more generally as well, because of formal verification.
This may be the case but I think there are other possible solutions and propose some early ideas of what they might look like in: https://www.lesswrong.com/posts/5nfHFRC4RZ6S2zQyb/risks-from-gpt-4-byproduct-of-recursively-optimizing-ais
Regarding point 10, I think it would be pretty useful to have a way to quantify how much useful thinking inside these recursive LLM models is happening within the (still largely inscrutable) LLM instances vs in the natural language reflective loop.
This DeepMind paper explores some intrinsic limitations of agentic LLMs. The basic idea is (my words):
For example, suppose our training data is a list of a person's historical meal choices over time, formatted as tuples that look like `(Meal Choice, Meal Satisfaction)`. The training data might look like `(Pizza, Yes)`, `(Cheeseburger,`
... (read more)

Constructions like Auto-GPT, Baby AGI and so forth are fairly easy to imagine. Just the greater accuracy of ChatGPT with "show your work" suggests them. Essentially, the model is a ChatGPT-like LLM given an internal state through "self-talk" that isn't part of a dialog, and an output channel to the "real world" (open internet or whatever). Whether these call the OpenAI API or use an open-source model seems a small detail; both approaches are likely to appear, because people are playing with essentially every possibility they can imagine.
If these struct... (read more)
Strongly agreed on much of this, and I've been having a lot of the same thoughts independently. Here (twitter thread) is an attempt to sketch out how agentized LLM slightly stronger than current could plausibly accrue a lot of money and power (happy to paste here as text if the threadedness is annoying).
Isn't it a substantial problem that the programming priesthood is being dethroned by GPT technology, allowing the masses entry, even those with minimal programming understanding? For not only has GPT given us a front-end natural language interface to information technology; we now have a back-end natural language interface (i.e., the programming side) that creates a low barrier to entry for AI programming. The "programming" itself that I saw for BabyAGI has the feel of a merely abstract-level natural language interface. Doesn't this make ... (read more)
When you say "including breaking all encryption", how do you suggest we go about this? Wouldn't it be a huge disruption to all secure systems?
Have the words "AI" and "AGI" become meaningless? What's the difference between "dumb AGI" and AI?
The debate improves with more precise definitions. David Deutsch has a pretty good one: an AGI creates new explanations and can choose to follow its own goals.
"Because defense is harder than offense." Yes, but no time like the present to develop defensive strategies.
Let the first BIT WAR commence. We must form a defensive alliance of HeroGPTs that serve as internet vigilantes, sniffing out and destroying ChaosGPTs. Anyone willing to donate a GPU to the cause will be heralded with great veneration in the halls of Valhalla.
All jokes aside. The development of generative agents, the first entities that are native to the world of bits, is a paradigmatic shift. We have human wars in the ... (read more)