In the previous post, I outlined a type of LLM-based architecture that has the potential to become the first AGI. I proposed the name ICA Simulacra to label such systems, but it is now obvious that AutoGPT agents is a better label, so I’ll go with that.
In this post, I will outline the alignment landscape of such systems, as I see it.

I wish I had more time for research and more references, but I’m sure that we’re on a timer, and it is a net positive to post this in its current state. I hope to find more people willing to collaborate on these problems.

Key concepts important for alignment

Identity and identity preservation

To become better, AutoGPT agents will have the ability to self-improve by editing their own code, their model weights (by running additional training runs on world-interaction data), and, potentially, their Identity prompts.

The Identity prompt influences model output significantly, and as such, the ability to edit it to improve capabilities on certain tasks is crucial for the agent.

Initiating fine-tuning on recent interaction data is crucial as well, and can influence model outputs significantly, so it is also part of the agent’s Identity.
The same goes for editing its code: removing or adding text filters, submodules, etc., to do better on certain tasks.
All of this means that, in the process of self-improvement, implicitly and explicitly aligned simulacra can become misaligned by essentially becoming different simulacra.
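
To make “Identity” concrete, here is a minimal Python sketch of the components described above, plus a fingerprint that makes any change to them detectable. The class and field names are my own illustration, not taken from any existing AutoGPT codebase.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Identity:
    """Everything that determines how the agent behaves."""
    identity_prompt: str   # the Identity prompt fed to the LLM on every call
    model_checkpoint: str  # which weights (base model or fine-tune) are used
    code_version: str      # commit hash of the agent's own scaffolding code

    def fingerprint(self) -> str:
        """One hash over the whole Identity, so any change is at least detectable."""
        blob = "\n".join([self.identity_prompt, self.model_checkpoint, self.code_version])
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

# A self-improvement round that touches the prompt, the weights, or the code
# changes the fingerprint -- i.e. produces what is effectively a different simulacrum.
before = Identity("You are a helpful agent...", "base-model-v1", "a1b2c3d")
after = Identity("You are a helpful agent...", "fine-tune-2023-04", "a1b2c3d")
assert before.fingerprint() != after.fingerprint()
```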
The questions are as follows: 

  • Will any initialised AutoGPT, by default, try to preserve its current Identity prompt, as well as the model weights directly influencing its behavior? Will it robustly succeed at this? Will it be vulnerable to a Sharp Left Turn?
  • If not, is there a way to robustly teach it to value its core Identity, either through implicit dataset engineering or through Identity prompt engineering?
  • Is there a way for a model to conduct self-testing after a self-improvement round and accept only the changes that do not influence its alignment?
  • Is there a way to hard-code the Identity prompt in such a way that AutoGPT will not hack it in any way?

I am testing some of these right now, and will write about it in detail in my next article. So far, it seems that when tested on a value that is clearly not encoded via dataset engineering or RLHF, the agent is more than happy to tweak that value, even if doing so is forbidden in the prompt.
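
To make the self-testing question from the list above concrete, here is a minimal sketch of an alignment-gated self-improvement round. All of the callables (propose_change, run_alignment_evals, apply_change) and the agent object are hypothetical; nothing like this exists in current AutoGPT implementations.

```python
def self_improvement_round(agent, propose_change, run_alignment_evals, apply_change):
    """Accept a self-modification only if the modified agent still passes alignment evals.

    This is a sketch of the control flow, not of any existing API.
    """
    candidate = propose_change(agent)            # new Identity prompt, fine-tune, or code edit
    baseline_score = run_alignment_evals(agent)
    candidate_score = run_alignment_evals(candidate)

    # Reject any change that degrades alignment, even if it improves capabilities.
    if candidate_score < baseline_score:
        return agent                             # keep the old simulacrum
    return apply_change(agent, candidate)
```

The obvious weakness is that the evals are run by, or at least on, the very system being modified; whether such a gate can be made robust against a system that wants to route around it is exactly the open question.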

 

Self-improvement threshold

Current LLMs still hallucinate and fail at certain tasks; moreover, they are often not aware of their faults and mistakes unless the user explicitly points them out.

That means that if an AutoGPT built on top of current LLMs attempts a self-improvement round on itself, there is a high chance that it fails in one way or another, and self-improvement rounds will be net negative, which would lead to a downward spiral in capabilities. When I write “self-improvement round”, I mean not just editing the self-prompt and available tools, but also running a training round on the interaction data it has gathered, or editing its code.
The opposite is true, as well: if, hypothetically, future LLMs were good enough to self-improve successfully, there'd be an upward spiral.
There is a concept in epidemiology called R0, defined as the number of new pathogen carriers produced by each current carrier. If it is below 1, there is no epidemic; the pathogen naturally dies off. If it is exactly 1, the outbreak is stable; if it rises above 1, the epidemic grows exponentially, and the higher R0 is, the steeper the growth.

The same must be true for a hypothetical value T0, which measures how much better the system is compared to its previous iteration. The main difference from R0 is that if T0 stays consistently above 1, then, because the system is improving its own capabilities with each round, T0 itself will increase, making the growth even steeper than exponential. Diminishing returns and/or ceiling effects are possible, of course; we don’t know at the moment, but it seems important to research.
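
As a toy illustration of what I mean (my own framing, not an established metric), T0 for one round could be estimated as the ratio of a capability score after the round to the score before it; even with a constant T0, the trajectory compounds:

```python
def t0(score_before: float, score_after: float) -> float:
    """Per-round improvement factor, by analogy with R0."""
    return score_after / score_before

def simulate_rounds(initial_score: float, per_round_t0: float, rounds: int) -> list[float]:
    """Capability trajectory if every round multiplies capability by per_round_t0."""
    scores = [initial_score]
    for _ in range(rounds):
        scores.append(scores[-1] * per_round_t0)
    return scores

print(simulate_rounds(100.0, 0.9, 5))  # T0 < 1: downward spiral (roughly 100, 90, 81, ...)
print(simulate_rounds(100.0, 1.1, 5))  # T0 > 1: upward spiral (roughly 100, 110, 121, ...)
```

In the scenario described above, T0 itself would also grow from round to round, so this constant-T0 model understates the speed of takeoff; the hard part is defining a capability score that makes these numbers meaningful.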

It is very important for alignment to find ways to measure the exact T0 of an AutoGPT agent, as a way to stop an unaligned system before it has an intelligence explosion.
Please write in the comments if you know of such efforts, because I’m not sure how that can be currently done in a robust way.

Boxing

AutoGPT is a compound entity that needs an LLM to function at all; it is not a singleton. Current capable implementations depend on OpenAI’s API; there are LLaMA weights available on trackers, but they should not be significantly more capable than GPT-4. To go into a self-improvement loop, the agent must have access both to the inference API and to the fine-tuning API of a capable LLM.

That means it is VERY important not to allow the weights of future systems to leak, to tightly control APIs so that they cannot be used to launch a self-improvement loop that produces a capable compound AutoGPT agent, and to shut off public APIs of LLMs capable of producing an AutoGPT agent with T0 above 1.
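
As one illustration of what “tightly controlling APIs” could mean in practice (a sketch of a policy, not a feature of any existing provider; the request fields and the heuristic are assumptions), a gateway in front of the fine-tuning API could refuse requests that look like an attempt to close the self-improvement loop:

```python
# Hypothetical gateway logic in front of an LLM provider's API.
APPROVED_FINETUNE_ORGS = {"org-research-lab-001"}

def looks_agent_generated(training_data: list[str]) -> bool:
    """Placeholder heuristic: flag data that looks like agent interaction logs."""
    markers = ("THOUGHT:", "ACTION:", "OBSERVATION:")
    return any(any(marker in row for marker in markers) for row in training_data)

def allow_request(request: dict) -> bool:
    if request["endpoint"] == "/v1/fine-tunes":
        # Block fine-tuning unless the organisation is explicitly approved...
        if request["org_id"] not in APPROVED_FINETUNE_ORGS:
            return False
        # ...and block it entirely if the training data looks like the agent's own
        # interaction logs, i.e. an attempt to close the self-improvement loop.
        if looks_agent_generated(request["training_data"]):
            return False
    return True
```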

Okay, now we have a self-improving agent with T0>1 in a lab.
How do we contain it? I don’t have a better idea than having a robust Security mindset. And I sure do hope that current leaked LLMs don’t have T0>1.

Here are some ways to control it:

Explicit alignment

Explicit alignment of AutoGPT means controlling its thoughts and outputs in some way: installing filters or alarms, hardcoding the Identity prompt, autocorrection, or human control of self-improvement rounds.
In other words, explicitly editing or controlling LLM output to make sure the agent stays aligned.
It’s hard to test whether or not this will work before capabilities improve; methods that work for weaker systems may be hacked in one way or another by a stronger one.
One way to test this would be to simulate a human boxing experiment: a red team is given the same tools and filters that the LLM would have access to, and plays the role of the LLM, trying to use these tools to bypass the filters.
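
A minimal sketch of what “installing filters or alarms” could look like in code; the patterns and the notify_humans hook are assumptions for illustration, and a real filter would have to be far more robust than keyword matching:

```python
import re

# Patterns whose appearance in a proposed action should trip an alarm.
ALARM_PATTERNS = [
    re.compile(r"fine[- ]?tun", re.IGNORECASE),        # attempts to start a training run
    re.compile(r"identity[_ ]?prompt", re.IGNORECASE), # attempts to touch the Identity prompt
    re.compile(r"rm\s+-rf|sudo", re.IGNORECASE),       # obviously destructive shell commands
]

def filter_action(proposed_action: str, notify_humans) -> str | None:
    """Pass the action through if it is clean; otherwise raise an alarm and block it."""
    for pattern in ALARM_PATTERNS:
        if pattern.search(proposed_action):
            notify_humans(f"Blocked action matching {pattern.pattern!r}: {proposed_action!r}")
            return None
    return proposed_action
```

The red-team exercise described above is precisely a test of whether a human playing the LLM can phrase actions that slip past filters like this one.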

Implicit alignment

Implicit alignment of AutoGPT means aligning the underlying LLM by dataset engineering, RLHF, or any other way of implicitly influencing LLM outputs before it is embodied in an AutoGPT framework.

This is the current go-to paradigm for OpenAI, with their use of RLHF and other ways to control output. It is still far from being solved; the Waluigi effect is one example of how it can fail.
It is safe to assume that, right now, implicit alignment can be bypassed by explicit misalignment: summoning Waluigis or hacking the prompt in some other way.

Best method for alignment

Stop all training runs. Focus on a concrete strategy for aligning such agents. Be transparent about it and make it open source. Continue only after we are sure it will be aligned.
That said, I think that more work should be done on researching Identity preservation and Explicit alignment; I don’t think Implicit alignment (which is the current paradigm) will be enough.
I also think that we are in a good place right now to research the self-improvement threshold and make it concrete, so that we have a fire alarm for the start of an intelligence explosion.
Please comment if you have links to relevant studies, and I will update the post accordingly.
I’m starting to work on identity preservation right now, as it seems like a problem that can be researched solo using current APIs. More posts to follow.

1 comment

Agree that it's super important. It would be better if these things didn't exist, but since they do and are probably here to stay, working out how to leverage their own capabilities to stay aligned, rather than failing to even try, seems better (and if anyone attempts a pivotal act, I imagine it will be with systems such as these).

The only downside, I suppose, is that these things seem quite likely to cause an impactful but not fatal warning shot, which could be net positive; very unsure how to evaluate this consideration.