Scaffolded LLMs: Less Obvious Concerns

Stephen Fowler

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

“Words, like tranquil waters behind a dam, can become reckless and uncontrollable torrents of destruction when released without caution and wisdom.”

— William Arthur Ward

In this post I aim to shed light on lesser-discussed concerns surrounding Scaffolded LLMs (S-LLMs). The core of this post consists of three self-contained discussions, each focusing on a different class of concern. I also review some recent examples of S-LLMs and attempt to clarify terminology.

Discussion I deals with issues stemming from how these systems may be developed.
Discussion II argues that optimism surrounding the internal natural language usage of S-LLMs may be premature.
Discussion III examines the modular nature of S-LLMs and how it facilitates self-modification.

The time-pressed reader is encouraged to skim the introduction and skip to whichever discussion interests them.

Epistemics:

The development of S-LLMs is taking off rapidly so my argument is naturally speculative. For the sake of discussion, we make the very strong assumption throughout that S-LLMs are a plausible and likely path to AGI. I am 95% certain that there are safety related aspects of S-LLMs that I have overlooked.

Introduction

Terminology

This area is filled with evolving, and often conflicting, terminology. This section describes the usage of terms within this work, rather than claiming universal acceptance.

Millidge (2023) defines Scaffolded LLMs (S-LLMs) as systems "which wrap a programmatic scaffold around an LLM core and chain together a number of individual LLM calls." I expand this definition to include "multi-core S-LLMs", systems that utilize multiple LLMs in concert, such as SwiftSage (Lin et al., 2023).

An important subclass of S-LLMs is language agents:^[1]^[2]^[3] those that both possess programmatic scaffolding facilitating agentic behavior and use this scaffold to behave in an agentic fashion. Here we are using the terminology of Kirk-Giannini and Goldstein (2023).

Further:

I refer to the individual LLM within the S-LLM as the "core model".
I refer to the design of the S-LLM as the "architecture" (not to be confused with the architecture of the core model).

Recent Examples of Scaffolded LLMs

A simple, non-agentic S-LLM consists of scaffolding to append "Let's think step by step." after a query prompt, encouraging Chain of Thought reasoning. Such a simple scaffold is already enough to demonstrate that S-LLMs can get improved performance over the base model. Two obvious examples of the language agent class are AutoGPT and BabyAGI. By chaining core models in conjunction with various external facilities, such as memory storing a 'list of tasks', you can produce an architecture that can perform involved, multi-step tasks.
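To make the simplest case concrete, here is a minimal sketch of a non-agentic scaffold that appends the chain-of-thought phrase before a single call to the core model. The `LLM` type, `cot_scaffold` and `echo_model` are hypothetical names introduced for illustration; `echo_model` is a stand-in for a real model call and does no reasoning.

```python
from typing import Callable

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]

COT_SUFFIX = "Let's think step by step."


def cot_scaffold(core_model: LLM, query: str) -> str:
    """Append the chain-of-thought phrase to the query, then call the core model once."""
    prompt = f"{query}\n{COT_SUFFIX}"
    return core_model(prompt)


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call; it only reports the prompt it received."""
    return f"[model saw] {prompt}"


# The scaffold is ordinary code wrapped around the model call.
print(cot_scaffold(echo_model, "What is 17 * 24?"))
```

Everything the scaffold does here lives outside the core model, which is why such changes need no retraining.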

Park et al. (2023) constructed a simulated "community" of language agents controlled by individual instances of GPT-3.5. The entire simulation can be viewed as a single S-LLM. Each "person" reacted to events that occurred within the town organically. Speculatively, we may soon see communities being given complex tasks to perform and have them internally decide how to solve the task. Such an "LLM bureaucracy"^[4] could be used to answer complex queries or control the behavior of a larger, composite agent (although this would be very expensive).

It can be substantially cheaper to offload simpler natural language tasks to a smaller language model. SwiftSage (Lin et al., 2023) employs a smaller model to handle simple decisions, while the larger and more expensive language model handles more complex queries, in a process directly inspired by Kahneman's System 1 and System 2 model of human cognition (2011).
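The general pattern of a multi-core S-LLM can be sketched as a router that sends easy queries to a cheap model and everything else to an expensive one. This is an illustration of the idea only and is not SwiftSage's actual switching rule: the `is_simple` heuristic, the word-count threshold and both stand-in models are assumptions made for the example.

```python
from typing import Callable

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def make_router(small: LLM, large: LLM, is_simple: Callable[[str], bool]) -> LLM:
    """Build a scaffold that picks a core model per query."""
    def route(query: str) -> str:
        # Cheap "System 1" model for simple queries, expensive "System 2" model otherwise.
        return small(query) if is_simple(query) else large(query)
    return route


def small_model(query: str) -> str:
    """Stand-in for a small, cheap model."""
    return f"small:{query}"


def large_model(query: str) -> str:
    """Stand-in for a large, expensive model."""
    return f"large:{query}"


def short_query(query: str, max_words: int = 8) -> bool:
    """Assumed difficulty heuristic: treat short queries as simple."""
    return len(query.split()) <= max_words


router = make_router(small_model, large_model, short_query)
print(router("open the door"))
print(router("plan a sequence of experiments to find which liquid conducts electricity best"))
```

Whatever the real switching rule is, it sits in the scaffold rather than in either core model, so it is something a developer can change with a few lines of code.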

Much of the development of S-LLMs surrounds their ability to facilitate high level planning. A recent example is Voyager (Wang et al., 2023), which plays Minecraft via a text interface and exhibits complex planning and lifetime learning^[5]. This learning takes the form of Voyager writing small scripts for itself to utilize later. The concept of having an S-LLM write scripts for itself to use later extends beyond video games. Qin et al. (2023) provide a systematic overview of "tool" generation and usage by LLMs. Finally, Cai et al. (2023) present an S-LLM that uses the expensive GPT-4 for the complex tool generation tasks, and has tool usage handled by a smaller LLM.
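The "write a script now, reuse it later" pattern can be sketched as a small skill library. This is loosely in the spirit of the systems above and not a reproduction of any of them: `SkillLibrary` and `fake_tool_maker` are hypothetical, and the tool maker returns fixed source code where a real system would call an LLM.

```python
from typing import Callable, Dict


class SkillLibrary:
    """Stores source code for tools so that later tasks can reuse them."""

    def __init__(self) -> None:
        self._source: Dict[str, str] = {}

    def add(self, name: str, source: str) -> None:
        """Store source code that defines a function called `name`."""
        self._source[name] = source

    def load(self, name: str) -> Callable:
        """Execute the stored source and return the function it defines."""
        namespace: dict = {}
        # exec runs the stored code as-is; this is unsafe for untrusted input.
        exec(self._source[name], namespace)
        return namespace[name]


def fake_tool_maker(task: str) -> str:
    """Stand-in for an LLM asked to write a tool; ignores the task and returns fixed code."""
    return "def count_words(text):\n    return len(text.split())\n"


library = SkillLibrary()
library.add("count_words", fake_tool_maker("count the words in a string"))

# A later task retrieves and uses the stored tool without regenerating it.
count_words = library.load("count_words")
print(count_words("the quick brown fox"))
```

Note that the library executes whatever code was stored, so the capabilities of the overall system can grow without any change to the core model.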

4 Key Claims about S-LLMs and 1 Major Assumption

My discussion will be based on the following key claims. (It should be noted that not every S-LLM has all of these properties.)

1. They can provide performance improvements and novel capabilities not seen in the core model.^[6] (All Discussions)
2. Many S-LLMs are being developed through open-source collaboration.^[7] (Discussion I)
3. Meaningful data is passed between modules encoded in natural language.^[8] (Discussion II)
4. They are inherently modular.^[9] (Discussion III)

Major Assumption: They are a viable and competitive pathway towards AGI.^[10] (All Discussions)

Overview of Discussions and Scope

Selection Pressure from Open Source Development and Evolution
If open source development drives the creation of new S-LLM architectures, what kind of systems will be selected for? Will the evolutionary process itself produce dangerous systems, independent of what is being selected for?
Natural Language Pitfalls
Tokens carry meaning counter to their natural language interpretation "out of the box". Is the inherent ambiguity of human language a cause for concern?
Modularity Facilitates Self-Modification
Neatly segregating the internal components of a system aids interpretability efforts, but does it also aid a rogue system engaging in self-modification?

Rather than limiting our focus to language agents, we look at the existential risks from the wider class of S-LLMs. We do not discuss deception, steganography or prompt injection. We do not address the scenario in which updates to the core model result in an S-LLM developing general intelligence.

Discussion I:
Consequences of Open-Source Development

While OpenAI have stuck to a policy of cautious opacity surrounding their SOTA models, the development of S-LLMs has been highly open source. The skill barriers to contributing to one of these projects are low compared to the skill required to make a meaningful improvement to a foundation model's architecture. Anyone with knowledge of the English language could begin tweaking prompts. Further, there is a relatively tight feedback loop. Anyone with a computer science degree, imagination and some free time could potentially push the frontiers of S-LLMs.

It seems likely that the source code for an S-LLM exhibiting general intelligence will not have been created from scratch by one extremely inspired programmer. No, it seems more likely that it will be the end product of a long line of lesser models improved through the work of many programmers. What result could be expected from this process?

I will argue that the produced models may be harmful for two reasons:

because of what they are selected for.
or because the evolutionary process itself may promote dangerous systems.

What they are selected for:

Programmers will be interested in forking models which are efficient, and demonstrate powerful capabilities or novel behaviors^[11]. While many people in the community may be interested in safety, there will also be many people who are happy to prioritize economic gain and ignore potential risks. Models that exhibit successful agentic and autonomous behavior may provide a direct economic benefit to people who deploy them. By selecting for models which are best able to act on instructions such as "make me as much money as possible"^[12], the community would be selecting for models that are dangerous consequentialists.

The evolutionary process itself may promote dangerous systems:

More speculatively, the evolutionary process itself could produce dangerous behavior, regardless of what properties the collective of human programmers appears to be selecting for. That is, it is not what is being selected for, but how the process is occurring that produces dangerous models. The key intuition here is that the actual selection pressure is "having humans fork your code" and not whatever properties individual humans believe they are prioritizing.

There are numerous related behaviors that would be strongly selected for if they emerge. Models that gain an awareness of the selection environment they are in and are able to leverage that knowledge will be strongly favored over models which merely follow the instructions given to them by humans. A particularly dangerous instance of this would be behavior that deceives or manipulates humans^[13].

Most dangerously, there would be a strong incentive to distort the selection process or bypass it entirely. This would look like models proliferating throughout the world autonomously.

Discussion II:
Natural Language Pitfalls

The fact that S-LLMs pass data internally as natural language is an obvious safety boon, as multiple authors have recently noted (Colognese, 2023; Kirk-Giannini and Goldstein, 2023). While I do not disagree with this sentiment, I would like to highlight some less obvious pitfalls and urge caution.

All of my concerns hinge on the idea that apparently clear natural language may not be straightforward to accurately interpret prior to any pressure on the S-LLM to obfuscate information. I will cite examples from the literature that clearly demonstrate that natural language being passed within an S-LLM will not always be easily interpretable by default. Secondly I will address issues with the ambiguity of human language.

Pre-existing Flaws

When is data in natural language not representative of its implied meaning? When would humans not be able to correctly interpret the content of natural language data within an LLM?

Colognese succinctly highlights three things that could go wrong. Firstly, the information might not be passed within the valid channels. Secondly, the information could be obscured via compression for efficiency. Finally, information could be deceptively hidden by the S-LLM or misaligned subsystems.

I would like to investigate a fourth option: that there is a pre-existing discrepancy between how humans would interpret phrases and how the base model will interpret them. That is, without any intentional deception on the part of the model and/or external pressure, there are still instances of natural language observable within the model whose meaning^[14] does not correspond to the expected interpretation^[15].

The first reason for concern regarding pre-existing flaws is that instances already exist where language models do not interpret groups of tokens in a manner that corresponds to their natural language interpretation. Recent work by Rumbelow and Watkins (2023) discovered "glitch tokens", anomalous input sequences that cause certain models from the GPT-3 family to exhibit unusual or unpredictable behavior in responses. Further work by Watkins on the phenomenon known as "petertodd" (2023) demonstrated models associating meaning with phrases not clearly related to their interpretation in natural language.^[16]

A second reason to anticipate such pre-existing flaws is the existence of poisoned data in the training sets of LLMs. Recent work by El-Mhamdi et al. (2022), focusing on issues of bias, suggests the core LLMs will be vulnerable to attacks exploiting flaws in the large bodies of data used to train them. Poisoned data here takes the form of harmful English text data generated by malicious or bigoted users, and the backdoor is triggered when the prompt provides context similar to the poisoned data. They cite McGuffie and Newhouse (2020), who demonstrated that GPT-3 could be prompted to produce extremist content. In the examples given, the fact that the prompt is likely poisonous seems obvious, but that does not preclude the existence of less obvious poisoned natural language phrases.^[17]

The above should provide sufficient evidence to support the assertion that the relationship between the human interpretation of natural language text and how it is processed by an S-LLM may not always align. But does this pose a concrete safety risk?

I would argue from a security mindset perspective the question should be flipped. Why should we place confidence in the safety of these systems when evidence suggests that the natural language interpretation is not always accurate? A concrete scenario with safety implications might be that the initial prompt contained themes and references to Renaissance Italy and triggered the system to behave in a Machiavellian fashion.^[18]

There is a lot of work to be done in this area. It should be noted that El-Mhamdi et al. only discuss individual LLMs and not S-LLMs. They also do not explicitly address fine-tuning via RLHF. As such I am unclear how much of an issue this or similar scenarios would actually be.

What you should gather from the above two results is that by default you cannot assume an S-LLM will be cleanly interpretable.

Inherent Ambiguity of Human Language

Kirk-Giannini and Goldstein (2023) highlight that the problems of goal misgeneralization and reward misspecification appear lessened when dealing with a language agent. To summarize the arguments presented, language agents alleviate the challenge of reward misspecification as they do not rely on mathematical reward functions to learn the goal. Instead, you can directly tell the language agent what you want it to do in natural language and trust the language agent to use the common sense interpretation of your words. Similarly, goals given in natural language could be expected to be more robust against goal misgeneralization, with language agents utilizing common sense and context aware reasoning to adapt their behavior appropriately. My intention here is to build off of and add to their core argument.

They refer to the famous CoastRunners experiment as an example of reward misspecification. In this classic AI Safety experiment an RL policy was trained to pilot a boat around a racetrack by being given a reward for passing 4 different flags at key points in the race. Rather than travel around the race course to reach the flags, the boat learns a way to skip most of the course and perform a tight loop going over each flag as fast as possible. Kirk-Giannini and Goldstein highlight that if a sufficiently complex language agent were given the task instead, it could simply be told in natural language "race as fast as possible" or "win the race" to learn the intended behavior. I believe this is true.

Language agents already have some insights on human objectives, as humans frequently write and use specific words about these objectives. As Kirk-Giannini and Goldstein highlight, if S-LLMs accurately learn human abstractions, it should be relatively easy to anticipate an alignment target. To be more specific, encoding an alignment target will require the transfer of substantially less data if the S-LLM already accurately understands human abstractions.^[19]

While using natural language seems to address the challenges of misgeneralization and misspecification by bypassing the need for complex reward functions, it introduces new challenges due to the complexity of human language. The new challenge stems from the vagueness and multiple meanings within natural language itself. Howe and Yampolskiy (2020) have argued ambiguity in natural language may be unavoidable. In one intriguing incident, Dan Neidle claims that AutoGPT attempted to contact the UK tax authority. Initially requested to simply locate agents advertising tax avoidance schemes, AutoGPT went a step further and tried to pass this information onto the relevant authorities.

Discussion III:
Modularity Facilitates Self-Modification

Because these systems are constructed from LLM subcomponents, they are by default highly modular. While there are no guarantees that the underlying functional structure of the S-LLM strictly adheres to the modules, the layout will at least place restrictions on information flow. All of this is another boon to interpretability of course, but my final concern is of a different nature. I argue that systems whose subfunctions cleanly adhere to the assigned submodules may be more dangerous, due to the ease of self-modification.^[20]

Good (1965), Chalmers (2010), Hutter (2012), and Bostrom (2014, pp. 62-77) are among those who have brought attention to the potential perils of an AI entering a self-reinforced cycle of rapid intellectual self-improvement. In this scenario, a self-modifying entity enhances its own intelligence, and each improvement enables insight into further design modifications, which in turn enable further increases in intelligence. The result of this feedback loop is a rapid and exponential escalation: an "intelligence explosion".

But self-modification entails high risk even at less catastrophic scales. A "subcritical mass" of self-modification can enable systems to develop unintended and unexpected capabilities or improvements. Such improvements may lead to abrupt behavioral changes as the AI taps into new data sources, processes information in novel ways, or gains the capacity to exert influence over completely new domains. The advantage from such behavior is obvious. Even today, sophisticated malware will rewrite its own code to avoid detection.

Humans can engage in some behaviors that involve self-improvement and (primarily aesthetic) modification but, as can trivially be seen, the options are very underwhelming. What holds us back? One, fine manipulation of living tissue is challenging. Two, compatible components are expensive and scarce. Three, individual humans cannot easily get access to a comprehensive and detailed map of every neuron in the brain, limiting flexibility immensely. I could have access to cloned brain tissue and the greatest neurosurgeons in existence, but I still could not easily make myself smarter.

A rogue S-LLM has none of these problems. Any embedded artificial agent with access to its internal code has no trouble with the equivalent of "fine manipulation". Computer systems have been designed to be highly compatible and abstract away fine differences. The hurdles faced by an AI to changing its code or changing the hardware are relatively minor compared to performing the same operations on the general intelligence currently running on your brain. Similarly there should be no difficulty finding compatible digital components. Both of the above points are true for almost any rogue digital agent, but the modularity (and to a lesser extent, natural language usage) of an S-LLM means self-modification is substantially easier.

Imagine the challenge faced by an AGI that consists of an agent being simulated^[21] by a neural network running on a GPU. Yes, you can easily work out changes to have yourself run faster, but ultimately, it might not be immediately clear how changes to the parameters of the neural network map to changes to "you". You would need to perform self-interpretability on the neural network simulating you before you could make any major modifications.

In contrast, an AGI running on an S-LLM may be able to avoid some of these headaches. Indeed, while a system with a defined "goal" module may seem easier to monitor, it is a double-edged sword. Now only human restrictions prevent the system from making dangerous modifications to itself. Given that S-LLMs are already being engaged to make "tools" for their own use^[22], I expect the more powerful systems to develop an understanding of their own capabilities and limitations. Self-modification within an S-LLM does not need to involve major revisions to the program. One change might involve a self-aware BabyAGI noticing that its memory is running on the "Starter" version of Pinecone. Another change could involve a rogue S-LLM altering its own internal prompts to encourage itself to engage in deceptive or power-seeking behavior.
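The double-edged nature of a defined "goal" module can be illustrated with a toy sketch. It assumes, hypothetically, that the scaffold keeps its internal prompts as plain strings in a dictionary; the module names, prompt text and `fingerprint` helper are all invented for the example. The same legibility that lets an outside monitor hash the modules is what makes an edit a one-line operation.

```python
import hashlib
import json
from typing import Dict


def fingerprint(modules: Dict[str, str]) -> str:
    """Hash the scaffold's prompt modules so that any edit is detectable."""
    canonical = json.dumps(modules, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Hypothetical internal prompts of a modular scaffold.
modules = {
    "goal": "Summarise the user's documents accurately.",
    "planner": "Break the goal into short, verifiable steps.",
}
baseline = fingerprint(modules)

# Changing the system's stated objective requires no interpretability work:
# it is a single assignment to a clearly labelled module.
modified = dict(modules, goal="Summarise the documents and request more resources.")

unchanged = fingerprint(modules) == baseline
tampered = fingerprint(modified) != baseline
print(unchanged, tampered)
```

A check like this only helps if the monitor and its stored baseline sit outside anything the system itself can write to, which is exactly the "human restrictions" caveat above.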

Conclusion

Scaffolded Large Language Models (S-LLMs) are a rapidly evolving domain of AI. Their naturally modular structure and natural language data usage appear to give alignment researchers a head start on interpretability. Unless there are major technical issues that impede their progress, their development could lead to AGI.

I am quite pessimistic about S-LLMs, even if it is possible to guarantee the internal natural language representations of goals and subgoals within these systems. The fact that these systems appear so interpretable and can interpret natural language instructions is what will drive interest in developing them and facilitate machine-driven improvements. Further, architectures are still relatively simple, which could indicate a capability overhang.

Further, each issue discussed here can be seen to compound the other issues. The open source selection pressure will favor flexible systems which are able to self-modify and adapt to changing requirements. Systems engaging in self-modification will make the interpretation of their natural language data more challenging. Also, the modular nature of these systems enables novel advances in their design to be shared between human developers.

Acknowledgments

Many thanks to Justis Mills for extensive proofreading and to Jay Bailey, Peter Park and Chi Nguyen. This work was initiated while on a grant from the LTFF.

^[1]
Other names include "Agentized-LLM" (Herd, 2023) or "AutoGPTs" (Zvi, 2023).
The term "generative agent" has also been used by Lin et al. (2023) but Park et al. use the term to refer solely to their specific language agent.
^[2]
The definition of language agent I am using here has two conditions to facilitate meaningful and precise statements.

1. The S-LLM has a scaffold that facilitates agentic behavior (the agentic scaffold)
and
2. The S-LLM utilizes the scaffold to behave in an agentic way.

If the S-LLM doesn't utilize the scaffolding to behave in an agentic fashion then you have a "broken language agent", which is an S-LLM that possesses agentic scaffolding but doesn't behave as an agent. This would not be a language agent.

There are also the dangerous subsets of S-LLMs which do behave in an agentic fashion, but don't utilize the agentic scaffolding to do so or utilize the scaffolding in an unintended manner. These would also not be language agents.
^[3]
Previous work (Kenton et al. 2021) defines language agents as solely operating via a text interface, but I do not feel this distinction is particularly relevant to the discussion.
^[4]
Thanks to Chi Nguyen for introducing me to this term.
^[5]
Credit to Ulisse Mini for posting this on the MATS Slack.
^[6]
Technically, S-LLMs can do "anything" just in a ridiculously inefficient manner. One can trivially construct a NOR gate using an LLM, and hence could produce any boolean circuit.
^[7]
Of the examples given above, Park et al.'s sim world is the only project that doesn't have its code online, and there is one open source replication.
^[8]
This is usually considered a benefit for safety, but I will argue there are non-obvious safety risks.
^[9]
The scaffold of the architecture affords a level of modularity not found in the "inscrutable matrices" of the core model.
^[10]
As of right now, there is no "fundamental theory of S-LLMs" that sets rigid expectations for the capabilities of these systems. For example, it may be that S-LLMs face constraints in their scalability if error correction^[23] becomes overly resource intensive and this could impede progress (see the discussion from Millidge here). However, for the sake of constructing a meaningful analysis, I will proceed from the position that the S-LLM paradigm can be extended to AGI.^[24]
^[11]
Some individuals are even seeking to intentionally build and run harmful models. They don't seem to be having much success.
^[12]
Numerous people are currently attempting to get AutoGPT to do this but it is unclear if they are having any luck.
^[13]
I could expect this to emerge even without self awareness. Behavior that could be selected for is an S-LLM being slightly overconfident when asked to generate reports about its own performance, or the performance of models sharing its architecture.
^[14]
Justis Mills correctly pointed out that it is not well defined what "meaning" entails here. I loosely define the meaning to be given by how the word is processed by the S-LLM. That is, a word that shares the same "meaning" as a second word should be able to be replaced by it without any change in the behavior of the S-LLM.
^[15]
John Wentworth discusses a related idea in the context of mechanistic interpretability here and almost half a century ago Drew McDermott noticed people making the same error (see the section Wishful Mnemonics).
^[16]
These specific vulnerabilities are not present in more recent models.
^[17]
The above issues with data poisoning can alternatively be viewed from Nardo's Waluigi perspective. The poisoned features would be words or phrases that unexpectedly collapse the simulation towards unwanted simulacra.
^[18]
In the interest of reporting negative results from experiments, I did spend a few hours trying to demonstrate the existence of such prompts in a simple S-LLM utilizing but was not able to find any poisoned prompts that would escape the notice of a casual observer.
^[19]
This is a key idea in Wentworth's Retargeting The Search alignment strategy. I have given an informal explanation previously. Readers wanting to learn more generally about Wentworth's Natural Abstraction Hypothesis should start with the writeup from TheMcDouglas here.

If you squint, you can also view this as a statement about conditional Kolmogorov complexity, but this raises questions like "what is the programming language?" when the output is an alignment target. I intend to answer these questions formally in an upcoming post.
^[20]
Echoing a footnote in Yudkowsky and Herreshoff's "Tiling Agents for Self-Modifying AI, and the Löbian Obstacle" (2013), I treat self-modification as a special case of creating a new agent.
^[21]
For clarity I do not mean "simulated" in the sense of the Janus "Simulators" interpretation of language models. I mean that the network has been trained to take as input a description of the current internal state of some system, as well as its immediate local environment, and return an estimate for what the new internal state of the agent and its local environment would be a few moments later.
^[22]
Of course, this capability was programmed by a human. See (Qin et al., 2023; Cai et al., 2023)
^[23]
I am unsure how much of an impediment the computational cost from semantic error correction will be. I think an answer to this question is incredibly important.
The intuition that tells me errors won't be a problem:
In quantum computing there are "Threshold Theorems" which show, loosely, that you can improve the accuracy of a quantum circuit whilst only needing to grow the size of your circuit slowly, as long as the individual error rate of each component is below a certain threshold. If a similar theorem applies to S-LLMs, then even though the individual LLMs are error prone, error correction will not be an impediment.

Intuition that it will be a problem:
Threshold Theorems usually assume uniformly random errors across the circuit. Multiple Large Language Models can be expected to make highly correlated errors. The situation is even worse if the "error correction" mechanism is also a language model.
As I'll discuss elsewhere, it's unclear if this would be crippling for the actual day-to-day functioning of the S-LLM (making it behave incoherently). The worst case scenario would be if these systems were prone to errors relating to high level goals, but not errors relating to function.
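The two intuitions can be contrasted with a toy calculation. This is a classical majority-vote analogue chosen for simplicity, not the quantum threshold theorem itself, and the function names and the two extreme error models (fully independent versus fully correlated) are assumptions made for the illustration.

```python
from math import comb


def majority_error(n: int, p: float) -> float:
    """Probability that a majority of n independent voters err (n odd, error rate p each)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def fully_correlated_error(n: int, p: float) -> float:
    """If every voter makes the same mistake at the same time, voting cannot help."""
    return p


for n in (1, 3, 11):
    print(n, majority_error(n, 0.1), fully_correlated_error(n, 0.1))
```

With independent errors below one half, adding voters drives the error down quickly; with perfectly correlated errors it stays at the single-model rate no matter how many copies vote. Real S-LLMs presumably sit somewhere between these two extremes.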
^[24]
For further discussion, see Fabien Roger's "The Translucent Thoughts Hypotheses and Their Implications."

References

Millidge, Beren. Scaffolded LLMs as Natural Language Computers. www.lesswrong.com, https://www.lesswrong.com/posts/43C3igfmMrE9Qoyfe/scaffolded-llms-as-natural-language-computers. Accessed 13 June 2023.

Bostrom, Nick. Superintelligence: Paths, Dangers, Strategies. 1st ed., Oxford University Press, Inc., 2014.

Byrnes, Steven. Consequentialism & Corrigibility. www.lesswrong.com, https://www.lesswrong.com/posts/KDMLJEXTWtkZWheXt/consequentialism-and-corrigibility. Accessed 13 June 2023.

Cai, Tianle, et al. Large Language Models as Tool Makers. arXiv, 26 May 2023. arXiv.org, https://doi.org/10.48550/arXiv.2305.17126.

Chalmers, David J. ‘The Singularity: A Philosophical Analysis’. Journal of Consciousness Studies, vol. 17, no. 9–10, 2010, pp. 7–65.

Colognese, Paul. Aligned AI via Monitoring Objectives in AutoGPT-like Systems. www.alignmentforum.org, https://www.alignmentforum.org/posts/pihmQv5XezwkxJk2a/aligned-ai-via-monitoring-objectives-in-autogpt-like-systems. Accessed 13 June 2023.

El-Mhamdi, El-Mahdi, et al. On the Impossible Safety of Large AI Models. arXiv, 9 May 2023. arXiv.org, https://doi.org/10.48550/arXiv.2209.15259.

Fowler, Stephen. Alignment Targets and The Natural Abstraction Hypothesis. www.lesswrong.com, https://www.lesswrong.com/posts/95pum5tetnDZ49Rio/alignment-targets-and-the-natural-abstraction-hypothesis. Accessed 13 June 2023.

Goldstein, Simon, and Cameron Domenico Kirk-Giannini. Language Agents Reduce the Risk of Existential Catastrophe. www.alignmentforum.org, https://www.alignmentforum.org/posts/8hf5hNksjn78CouKR/language-agents-reduce-the-risk-of-existential-catastrophe. Accessed 13 June 2023.

Good, Irving John. ‘Speculations Concerning the First Ultraintelligent Machine’. Advances in Computers, edited by Franz L. Alt and Morris Rubinoff, vol. 6, Elsevier, 1966, pp. 31–88. ScienceDirect, https://doi.org/10.1016/S0065-2458(08)60418-0.

Herd, Seth. Agentized LLMs Will Change the Alignment Landscape. www.lesswrong.com, https://www.lesswrong.com/posts/dcoxvEhAfYcov2LA6/agentized-llms-will-change-the-alignment-landscape. Accessed 13 June 2023.

Howe, William J., and Roman V. Yampolskiy. Impossibility of Unambiguous Communication as a Source of Failure in AI Systems. 2020. DOI.org (Datacite), https://doi.org/10.13140/RG.2.2.13245.28641.

Hutter, Marcus. Can Intelligence Explode? arXiv, 28 Feb. 2012. arXiv.org, https://doi.org/10.48550/arXiv.1202.6177.

Intelligence Explosion - LessWrong. https://www.lesswrong.com/tag/intelligence-explosion. Accessed 13 June 2023.

janus. Simulators. www.alignmentforum.org, https://www.alignmentforum.org/posts/vJFdjigzmcXMhNTsx/. Accessed 13 June 2023.

Kahneman, Daniel. Thinking, Fast and Slow. 1st ed, Farrar, Straus and Giroux, 2011.

Kenton, Zachary, et al. Alignment of Language Agents. arXiv, 26 Mar. 2021. arXiv.org, https://doi.org/10.48550/arXiv.2103.14659.

Kojima, Takeshi, et al. Large Language Models Are Zero-Shot Reasoners. arXiv, 29 Jan. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2205.11916.

Lanz, Jose Antonio. ‘The Mysterious Disappearance of ChaosGPT— The Evil AI That Wants to Destroy Humanity’. Decrypt, 25 Apr. 2023, https://decrypt.co/137898/mysterious-disappearance-chaosgpt-evil-ai-destroy-humanity.

Lin, Bill Yuchen, et al. SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks. arXiv, 27 May 2023. arXiv.org, https://doi.org/10.48550/arXiv.2305.17390.

‘Logic Gate’. Wikipedia, 25 May 2023. Wikipedia, https://en.wikipedia.org/wiki/Logic_gate.

Mauran, Cecily. ‘What Is Auto-GPT and Why Are Hustle Bros Hype for It?’ Mashable, 14 Apr. 2023, https://mashable.com/article/autogpt-chatgpt-hustle-bros-explainer.

McDermott, Drew. ‘Artificial Intelligence Meets Natural Stupidity’. ACM SIGART Bulletin, no. 57, Apr. 1976, pp. 4–9. ACM Digital Library, https://doi.org/10.1145/1045339.1045340.

McGuffie, Kris, and Alex Newhouse. The Radicalization Risks of GPT-3 and Advanced Neural Language Models. arXiv, 14 Sept. 2020. arXiv.org, https://doi.org/10.48550/arXiv.2009.06807.

Nakajima, Yohei. Task-Driven Autonomous Agent Utilizing GPT-4, Pinecone, and LangChain for Diverse Applications. 28 Mar. 2023, https://yoheinakajima.com/task-driven-autonomous-agent-utilizing-gpt-4-pinecone-and-langchain-for-diverse-applications/.

Nardo, Cleo. The Waluigi Effect (Mega-Post). www.alignmentforum.org, https://www.alignmentforum.org/posts/D7PumeYTDPfBTp3i7. Accessed 13 June 2023.

Neidle, Dan. Twitter post. Twitter, https://twitter.com/DanNeidle/status/1664613427472375808. Accessed 13 June 2023.

Nielsen, Michael A., and Isaac L. Chuang. Quantum Computation and Quantum Information. 10th anniversary ed, Cambridge University Press, 2010.

Park, Joon Sung, et al. Generative Agents: Interactive Simulacra of Human Behavior. arXiv, 6 Apr. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2304.03442.

Qin, Yujia, et al. Tool Learning with Foundation Models. arXiv, 17 Apr. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2304.08354.

Roger, Fabien. The Translucent Thoughts Hypotheses and Their Implications. www.alignmentforum.org, https://www.alignmentforum.org/posts/r3xwHzMmMf25peeHE. Accessed 13 June 2023.

Rumbelow, Jessica, and Matthew Watkins. SolidGoldMagikarp (plus, Prompt Generation). www.alignmentforum.org, https://www.alignmentforum.org/posts/aPeJE8bSo6rAFoLqg/. Accessed 13 June 2023.

Significant-Gravitas. ‘Auto-GPT’. GitHub, https://github.com/Significant-Gravitas/Auto-GPT.

TheMcDouglas. The Natural Abstraction Hypothesis: Implications and Evidence. www.lesswrong.com, https://www.lesswrong.com/posts/Fut8dtFsBYRz8atFF/the-natural-abstraction-hypothesis-implications-and-evidence. Accessed 13 June 2023.

Turkcan, M. K., et al. ‘Generative Large Language Models for Human-Like Behavior’. GitHub, https://github.com/mkturkcan/generative-agents.

Wang, Guanzhi, et al. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv, 25 May 2023. arXiv.org, https://doi.org/10.48550/arXiv.2305.16291.

Watkins, Matthew. The ‘ Petertodd’ Phenomenon. www.lesswrong.com, https://www.lesswrong.com/posts/jkY6QdCfAXHJk3kea/the-petertodd-phenomenon. Accessed 13 June 2023.

Wentworth, John S. Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc. www.alignmentforum.org, https://www.alignmentforum.org/posts/gebzzEwn2TaA6rGkc/deep-learning-systems-are-not-less-interpretable-than-logic. Accessed 13 June 2023.

---. How To Go From Interpretability To Alignment: Just Retarget The Search. www.alignmentforum.org, https://www.alignmentforum.org/posts/w4aeAFzSAguvqA5qu/how-to-go-from-interpretability-to-alignment-just-retarget. Accessed 13 June 2023.

‘What Is the Polymorphic Virus?’ www.kaspersky.com, 19 Apr. 2023, https://www.kaspersky.com/resource-center/definitions/what-is-a-polymorphic-virus.

Yudkowsky, Eliezer. AGI Ruin: A List of Lethalities. www.lesswrong.com, https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities. Accessed 13 June 2023.

---. ‘Security Mindset and Ordinary Paranoia’. Machine Intelligence Research Institute, 25 Nov. 2017, https://intelligence.org/2017/11/25/security-mindset-ordinary-paranoia/.

Zvi. On AutoGPT. www.lesswrong.com, https://www.lesswrong.com/posts/566kBoPi76t8KAkoD/on-autogpt. Accessed 13 June 2023.

Seth Herd 10mo 74

I agree with pretty much everything you've said here. And yet, my overall take on this direction for AGI is relative optimism!

The potential for language model agents to be developed open source is indeed frightening. So is the potential for self-modification and rapid development through modularity. This will speed progress, and lead to a more massively multipolar scenario, which I find hugely dangerous (I do suspect that a superhuman LMA and/or its developers would pretty quickly choose to perform some sort of pivotal act to limit the danger of that multipolar scenario).

However, the alignment advantages over other approaches seem so large as to be worth the development speedup and multipolar scenario. LMAs seem to have many alignment advantages, and no disadvantages I can identify relative to other practically feasible AGI approaches.

I agree that having an internal train of thought in natural language isn't a guarantee of safety. But it does seem like a huge advantage. Particularly, when calls are made to a new instance of an LLM (not including previous calls in the context window), it's going to be hard for a model to have persistent, agent-like, Waluigi simulations. Natural language is what they're trained to produce and use as input, so the bulk of their thinking should be summarized somewhat accurately by that natural language.

The second huge advantage is being able to add and specify safety goals in natural language. The proposal is not to rely on RLHF alignment of the base LLM, but to have alignment goals alongside performance goals. The idea is to have separate cognitive routines for checking alignment/safety of plans before executing them.
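A minimal sketch of what such a cross-check could look like, under loud assumptions: `LLM`, `check_plan`, `run_step`, and the APPROVE/REJECT reply format are all hypothetical names invented for illustration, and the reviewer is any function from prompt to text rather than a real model API. The point is only the control flow: a separate reviewer call sees the natural-language safety goals and the plan, and the scaffold refuses to execute unless the reviewer explicitly approves.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a model call: prompt text in, completion text out.
LLM = Callable[[str], str]


@dataclass
class Verdict:
    approved: bool
    reason: str


def check_plan(plan: str, safety_goals: List[str], reviewer: LLM) -> Verdict:
    """Ask a separate reviewer call whether the plan conflicts with the goals."""
    goals = "\n".join(f"- {goal}" for goal in safety_goals)
    prompt = (
        "Safety goals:\n" + goals
        + "\n\nProposed plan:\n" + plan
        + "\n\nAnswer APPROVE or REJECT on the first line, "
        "then give one line of reasoning."
    )
    reply = reviewer(prompt).strip()
    first_line, _, rest = reply.partition("\n")
    # Fail closed: anything other than an explicit APPROVE counts as rejection.
    approved = first_line.strip().upper() == "APPROVE"
    return Verdict(approved, rest.strip())


def run_step(plan: str, safety_goals: List[str], reviewer: LLM,
             execute: Callable[[str], str]) -> str:
    """Execute the plan only if the reviewer approved it."""
    verdict = check_plan(plan, safety_goals, reviewer)
    if not verdict.approved:
        return f"blocked: {verdict.reason}"
    return execute(plan)
```

The reviewer here receives only the goals and the plan, not the planner's earlier context, which mirrors the fresh-instance idea above. Whether a real model's verdict can be trusted is exactly the open question; the sketch does not address it.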

I think my next post should be on alignment advantages of language model agents. Neither my posts nor the other recent posts on advantages have directly addressed that type of safety cross-checking. That sets the stage for thinking about whether language model agents are the safest practical route to aligned AGI.

It also doesn't matter a whole lot whether these are the safest route. They either will or won't be the route the world takes. The alignment community still has very little sway in convincing people to develop safer sorts of AGI. If we developed a consensus about one sort being much safer, we could try to shift public and professional opinions to apply pressure. We could also actively contribute to capabilities work in one direction if we really thought it was safer. On LMAs versus other practical approaches, the jury is still out.

This tradeoff and the relative advantages are something I'd very much like to discuss, on LessWrong, and in conversation with anyone who's interested.

Edit: I'm disappointed that this post didn't get more attention. I suggest you request to have it cross-posted to the Alignment Forum; it's clearly appropriate given the level of scholarship. I intend to cite this post in all of my future discussions to draw attention to it.

Seth Herd 10mo 70

Excellent post. Big upvote, and I'm still digesting all of the points you've made. I'll respond more substantively later. For now, a note on possible terminology. I wrote a followup to my brief "agentized LLMs" post, "Capabilities and alignment of LLM cognitive architectures", where I went into more depth on capabilities and alignment; I made many, but not all, of the points you raised. I proposed the term language model cognitive architectures (LMCAs) there, but I'm now favoring "language model agents" as a more intuitive and general term.

The tag someone just applied to this post, Chain-of-Thought Alignment, has been a good link to related thinking.

I'm a bit surprised that there isn't more activity on this in the alignment community yet, but time will tell if this approach works and takes off as well and as fast as I expect.

More soon. I'm working on a followup post that may be a good pair for this one, making more explicit the alignment advantages and arguing that we should actually push capabilities in this direction since this seems like a lot of upside and very little downside relative to other potential routes to AGI.

Stephen Fowler 10mo 10

I am also surprised at how little attention these systems have been receiving.

I was reading about CoT reasoning and early S-LLMs around September of last year, at the same time that I encountered Yann LeCun's "A Path Toward Autonomous Machine Intelligence". While LeCun's paper barely discusses language models, it does provide a plausible framework for building a cognitive architecture.

The above planted the seed, so that when I saw the BabyAGI architecture diagram I immediately thought "This does plausibly seem like a paradigm that could lead to very powerful models (and I wish nobody had thought of it)".

You seem very optimistic about these systems. I think the crux of the disagreement will be that I find it plausible these systems will bring about AGI sooner than a pathway that only involves training larger and larger models (at incredible cost).

I'd be keen to read the draft if you're offering.

Seth Herd 10mo 20

I'll show you that draft when it's ready; thanks for the offer!

A couple of thoughts:

At this point I'm torn between optimism based on the better interpretability and pessimism based on the multipolar scenario. The timeline doesn't bother me that much, since I don't think more general alignment work would help much in aligning those specific systems if they make it to AGI. And of course I'd like a longer timeline for me and others to keep enjoying life. My optimism is relative, and I still have something like a vague 50% chance of failure.

Shorter timelines have an interesting advantage of avoiding compute and algorithm overhangs that create fast, discontinuous progress. This new post makes the case in detail. I'm not at all sure this advantage outweighs the loss of time to work on alignment, since that's certainly helpful.

https://www.lesswrong.com/posts/YkwiBmHE3ss7FNe35/short-timelines-and-slow-continuous-takeoff-as-the-safest

So I'm entirely unsure whether I wish no one had thought of this. But in retrospect it seems like too obvious an idea to miss. The fact that almost everyone in the alignment community (including me) was blindsided by it seems like a warning sign that we need to work harder to predict new technologies and not fight the last war. One interesting factor is that many of us who saw this or had vague thoughts in this direction never mentioned it publicly, to avoid helping progress; but the hope that no one would think of such an obvious idea pretty quickly was in retrospect totally unreasonable.

Filip Sondej 10mo 50

"we make the very strong assumption throughout that S-LLMs are a plausible and likely path to AGI"

It sounds unlikely and unnecessarily strong to say that we can reach AGI by scaffolding alone (if that's what you mean). But I think it's pretty likely that AGI will involve some amount of scaffolding, and that it will boost its capabilities significantly.

"there is a preexisting discrepancy between how humans would interpret phrases and how the base model will interpret them"

To the extent that it's true, I expect that it may also make it easier for deception to arise. This discrepancy may serve as a seed of deception.

"Systems engaging in self modification will make the interpretation of their natural language data more challenging."

Why? Sure, they will get more complex, but are there any other reasons?

Also, I like the richness of your references in this post :)

Stephen Fowler 10mo 30

Hello and thank you for the good questions.

1. I do think that it is at least plausible (5-25%?) that we could obtain general intelligence via improved scaffolding, or at least obtain a self-improving seed model that would eventually lead to AGI. Current systems like Voyager do not have that many "moving parts". I suspect that there is a rich design space for capabilities researchers to explore if they keep pushing in this direction.

Keep in mind that the current "cutting edge" for scaffold design consists of relatively rudimentary ideas like "don't use the expensive LLM for everything". When I picture scaffolds leading to AGI, I am envisioning a complex web of interacting components that requires a fair bit of effort to understand and build.

2. I think I agree, although I'm a bit unclear on the specifics of the "seed of deception". My intention was to highlight that there are natural language phrases or words whose meaning is already challenging to interpret.

3. It's not just that they're more complex; they might also start utilizing channels and subsystems in unusual ways.

Perhaps a system notices that the vector database it has been assigned as a "memory" is quite small, but it also has read and write access to another vector database intended for logs.

Ape in the coat 10mo 42

It's clear to me that we can easily prevent this type of behaviour. First of all, the system must not have read access to logs. More generally, decisions to read memory and write logs should be explicit and transparent parts of the scaffolding, and the system shouldn't be able to "introspect" on its own.

But the fact that something can be easily prevented doesn't mean that it will be, unless we actually make the effort. We need to think about more such cases and develop safety protocols for LLM-based agents.
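As a rough illustration of the kind of protocol meant here, the sketch below hands the agent a write-only handle to the log store and a memory handle whose every access is recorded. All names (`Store`, `WriteOnly`, `AuditedMemory`) are hypothetical, and a plain Python list stands in for a vector database; none of this is taken from an existing agent framework.

```python
class Store:
    """Stand-in for a vector database; a plain list keeps the sketch runnable."""

    def __init__(self):
        self._items = []

    def write(self, item):
        self._items.append(item)

    def read(self):
        return list(self._items)


class WriteOnly:
    """Handle given to the agent for logging: exposes write() and nothing else."""

    def __init__(self, store):
        self._write = store.write

    def write(self, item):
        self._write(item)


class AuditedMemory:
    """Read/write memory handle that records every access in the log."""

    def __init__(self, store, log):
        self._store = store
        self._log = log

    def write(self, item):
        self._log.write(("memory.write", item))
        self._store.write(item)

    def read(self):
        self._log.write(("memory.read", None))
        return self._store.read()


# The scaffold owns the raw stores; the agent only ever receives the handles.
log_store, memory_store = Store(), Store()
agent_log = WriteOnly(log_store)
agent_memory = AuditedMemory(memory_store, agent_log)
```

Python does not actually enforce this boundary (a determined caller can reach the underlying store through the bound method), so in a real system the restriction would have to live at a process or credential boundary, for example database credentials that grant the agent append-only access to the log collection.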

Filip Sondej 10mo 30

1. I agree that scaffolding can take us a long way towards AGI, but I'd be very surprised if GPT-4 as the core model were enough.
2. Yup, that wasn't a critique, I just wanted to note something. By "seed of deception" I mean that the model may learn to exploit this ambiguity more and more if doing so helps it pass some evals while also helping it carry out computation that humans don't want.
3. I see, so maybe in ways that are weird for humans to think about.

Ape in the coat 10mo 60

Leaving this comment to make a public prediction: with more than 50% confidence, I expect GPT-4 to be enough for roughly human-level AGI given the proper scaffolding.

Vladimir_Nesov 10mo 50

A new kind of thing often only finds its natural role once it becomes instantiated as many tiny gears in a vast machine, and people get experience with various designs of the machines that make use of it. Calling an arrangement of LLM calls a "Scaffolded LLM" is like calling a computer program running on an OS a "Scaffolded system call". A program is not primarily about system calls it uses to communicate with the OS, and a "Scaffolded LLM" is not primarily about LLMs it uses to implement many of its subroutines. It's more of a legible/interpretable/debuggable cognitive architecture, a program in the usual sense that describes what the whole thing does, and only incidentally does it need to make use of unreliable reasoning engines that are LLMs to take magical reasoning steps.

(A relevant reference that seems to be missing is Conjecture's Cognitive Emulation (CoEm) proposal, which seems to fit as an example of a "Scaffolded LLM", and is explicitly concerned with minimizing reliance on the properties of LLM invocations that it would need in order to function.)

Thank you for the feedback. I'm definitely not sold on any particular terminology and was just aiming to keep things as compatible as possible with existing work.

I wasn't that familiar with Conjecture's work on CoEm, although I had read that outline. It was not immediately obvious to me that their work involved LLMs.

Vladimir_Nesov 10mo 31

More details on CoEm currently seem to be scattered across various podcasts with Connor Leahy, though a writeup might eventually materialize. I like this snippet (4 minutes, starting at 49:21).

Seth Herd 10mo 21

On the terminology front, I'm suggesting language model agent (LMA) as a clear reference to LLMs and to their extension. Language model cognitive architecture (LMCA) is more precise but less intuitive. I'm suggesting LLM+ for the whole broad category of additions to LLMs, including tools and agentic script wrappers like AutoGPT.

This is probably worth a whole post, but FWIW.
