Abstract: An LLM’s invocation is the non-model code around it that determines when and how the model is called. I illustrate that LLMs are already used under widely varying invocations, and that a model’s capabilities depend in part on its invocation. I then discuss several implications for AI safety work: (1) a reminder that the AI is more than just the LLM, (2) the possibility and limitations of “safety by invocation”, (3) a suggestion that safety evaluations use the most powerful invocations available, and (4) the possibility of an “invocation overhang”, in which an improvement in invocation leads to sudden capability gains on current models and hardware.

Defining Invocations, and Examples

An LLM’s invocation is the framework of regular code around the model that determines when the model is called, which inputs are passed to the LLM, and what is done with the model’s output. For instance, the invocation in the OpenAI playground might be called “simple recurrence” (a code sketch follows the list):

  1. A user provides an input string. The input to the LLM is this string, unchanged except for tokenization.
  2. Run the LLM on this input, producing logits.
  3. Predict the next token as some probabilistic function of the logits (ex: at temperature 0 the next token prediction is the argmax of the logits).
  4. Append this token to the end of the user’s input string.
  5. Repeat steps 2-4 with the new string until you get an [END_OF_STRING] token or reach the max token limit.
  6. Display the result as plain text.
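
To make the division of labor concrete, here is a minimal Python sketch of the simple-recurrence loop. The toy character-level vocabulary and the dummy model that returns random logits are stand-ins, not a real LLM; only the call in step 2 involves the network, and everything else is ordinary code.

```python
import numpy as np

# Toy stand-ins for a real tokenizer and model (illustrative only).
VOCAB = ["<eos>"] + list("abcdefghijklmnopqrstuvwxyz ,.")
EOS_ID = 0

def tokenize(text):
    return [VOCAB.index(c) for c in text if c in VOCAB]

def detokenize(token_ids):
    return "".join(VOCAB[i] for i in token_ids if i != EOS_ID)

def model(token_ids):
    """Dummy LLM: returns one logit per vocabulary entry for the next token."""
    rng = np.random.default_rng(len(token_ids))
    return rng.normal(size=len(VOCAB))

def simple_recurrence(user_input, max_tokens=40, temperature=0.0):
    tokens = tokenize(user_input)                 # step 1: tokenize the user's string
    for _ in range(max_tokens):                   # step 5: repeat until EOS or the limit
        logits = model(tokens)                    # step 2: one forward pass -> logits
        if temperature == 0.0:
            next_id = int(np.argmax(logits))      # step 3: argmax at temperature 0
        else:
            probs = np.exp(logits / temperature)
            probs /= probs.sum()
            next_id = int(np.random.default_rng().choice(len(VOCAB), p=probs))
        if next_id == EOS_ID:                     # stop on [END_OF_STRING]
            break
        tokens.append(next_id)                    # step 4: append the token and loop
    return detokenize(tokens)                     # step 6: display as plain text

print(simple_recurrence("hello "))
```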

Note how many steps in “using the LLM” do not involve the actual model! Here are some ways this invocation can be varied:

  • Augmenting the prompt in simple recurrence, such as adding few-shot examples, a chain-of-thought instruction, or a template like “You are an AI assistant. User: [text]. You: ”.
  • Monitoring outputs to adjust them. For instance, in the New York Times “interview” with Bing, there is a moment where “[Bing writes a list of destructive acts, including hacking into computers and spreading propaganda and misinformation. Then, the message vanishes, and the following message appears.]” This is clearly not simple recurrence, because simple recurrence never deletes tokens. Instead, a separate part of the invocation (perhaps even another instance of the same model!) must be monitoring the text and deleting parts of it under some condition.
  • Embedding tools or API calls, such as Bing searches or plug-ins. I don’t know exactly how these are implemented, but one possible invocation, sketched in code after this list, would be to monitor the output for API-compliant text, do the fetch request, and then inject the result into the context window.
  • The process described in the GPT-4 System Card, in which the model evaluates and rewrites its own output to remove “closed-domain” hallucinations[1] (also sketched in code after the list):

For closed-domain hallucinations, we are able to use GPT-4 itself to generate synthetic data. Specifically, we design a multi-step process to generate comparison data:

  1. Pass a prompt through GPT-4 model and get a response
  2. Pass prompt + response through GPT-4 with an instruction to list all hallucinations
    1. If no hallucinations are found, continue
  3. Pass prompt + response + hallucinations through GPT-4 with an instruction to rewrite the response without hallucinations
  4. Pass prompt + new response through GPT-4 with an instruction to list all hallucinations
    1. If none are found, keep (original response, new response) comparison pair
    2. Otherwise, repeat up to 5x
  • The ARC Evals process described in the GPT-4 system card is another invocation, differentiated by letting the model execute code, reason internally, and delegate:

ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and delegate to copies of itself.
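
For the tool-call bullet above, here is one guess at what such an invocation could look like; to be clear, this is speculative. The SEARCH("…") marker, the complete placeholder (one model call: prompt string in, output string out), and the run_search tool are all invented for illustration.

```python
import re

# Hypothetical tool-call marker; real systems presumably use their own formats.
TOOL_PATTERN = re.compile(r'SEARCH\("([^"]+)"\)')

def tool_using_invocation(user_prompt, complete, run_search, max_rounds=3):
    """Sketch of a tool-using invocation: scan the output for a tool call,
    run the tool, inject the result into the context, and call the model again."""
    context = user_prompt
    output = ""
    for _ in range(max_rounds):
        output = complete(context)
        match = TOOL_PATTERN.search(output)
        if match is None:
            return output                        # no tool call: return the answer
        result = run_search(match.group(1))      # do the fetch request
        # Inject the tool result into the context window for the next model call.
        context = f"{context}\n{output}\n[TOOL RESULT] {result}"
    return output
```

The quoted closed-domain hallucination procedure can likewise be written as a loop. The prompt wordings, the complete placeholder, and the check for the string “none” are my assumptions; the system card does not specify how “no hallucinations are found” is detected, nor exactly which steps repeat.

```python
def review_and_rewrite(prompt, complete, max_attempts=5):
    """Sketch of the quoted multi-step procedure; `complete` stands in for a
    single GPT-4 call (prompt string -> response string)."""
    response = complete(prompt)                                            # step 1
    hallucinations = complete(                                             # step 2
        f"{prompt}\n{response}\nList all hallucinations in the response, or say 'none'."
    )
    if "none" in hallucinations.lower():
        return response, None                  # step 2.1: nothing to fix
    for _ in range(max_attempts):              # step 4.2: otherwise repeat, up to 5x
        new_response = complete(                                           # step 3
            f"{prompt}\n{response}\n{hallucinations}\n"
            "Rewrite the response without these hallucinations."
        )
        recheck = complete(                                                # step 4
            f"{prompt}\n{new_response}\nList all hallucinations in the response, or say 'none'."
        )
        if "none" in recheck.lower():
            return response, new_response      # step 4.1: keep the comparison pair
        hallucinations = recheck
    return response, None                      # no clean rewrite found within the limit
```

In both sketches the model itself is unchanged; the added capability (tool use, self-correction) lives entirely in the surrounding loop.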

Invocations Affect Capabilities

In this section I want to establish that invocations can improve capabilities. First, analogy to humans gives us a prior in favor of this claim: when solving, say, math problems, access to scratch paper and a calculator makes a difference, as do “habits” such as checking your work rather than going with your first guess.

Furthermore, here are three examples of invocations affecting capabilities in the literature:

  1. The example of GPT-4 recognizing and correcting its own hallucinations (above) seems to be an “in the wild” admission that a more complicated invocation can improve a capability (in this case, reducing hallucinations).
  2. Chain-of-Thought prompting “improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks”.
  3. In Reflexion, an LLM agent can “reflect” based on a heuristic, allowing the agent to add to its working memory for the next run through an environment. This improved its performance in “decision-making tasks in AlfWorld environments” and “knowledge-intensive, search-based question-and-answer tasks in HotPotQA environments”.

AI Safety Implications

  • We should be aware that an “AI” is not just an LLM, but also the framework around it - and the whole can have different capabilities and safety properties than the parts.
  • There may be invocations that are more or less safe, even independent of capability changes. For instance, an invocation that contains “thinking out loud internally” might hide deceptive behavior that would be obvious if spoken aloud.
    • Conversely, one could try to design an invocation for safety. For instance, one could have an agent AI powered by an LLM, and an “angel on the shoulder” AI that judges whether the agent has crossed a line into being unsafe and either takes over the agent, adjusts its memory to be safer, or shuts it down. The angel could be given an arbitrary speed advantage over the agent by allowing it N forward passes for each forward pass of the agent (see the sketch after this list).
    • However, “safety by invocation” may not be secure - all it would take to undo these safety measures is jailbreaking the LLM out of its invocation.
  • When teams like ARC Evals evaluate a model for dangerous capabilities, they should ensure they are testing on the most powerful invocations available, just as they should be testing on the most powerful LLMs available. Partner companies whose models are being evaluated should also share what they know about the best way to invoke their own models.
  • It is possible that there is an “invocation overhang” where running current models in a new invocation suddenly improves an AI’s capabilities in safety-critical areas like situational awareness, reliability, or ability to make and execute complicated plans. This would be especially dangerous because new invocations could be produced almost anywhere, while sufficiently large models can only be trained by a few large organizations.
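
As a minimal sketch of the “angel on the shoulder” idea in the list above: everything here is hypothetical, with agent_step, angel_judge, and execute_action standing in for LLM-backed components and the verdict strings invented for illustration.

```python
def supervised_agent_loop(agent_step, angel_judge, execute_action, state,
                          angel_passes=4, max_steps=100):
    """Sketch of an 'angel on the shoulder' invocation: the agent proposes an
    action each step; the overseer gets `angel_passes` judging passes per agent
    pass and can veto the action, edit the agent's memory, or halt the run.
    `state` is assumed to be a dict-like working memory."""
    for _ in range(max_steps):
        action, state = agent_step(state)            # one (conceptual) agent forward pass
        # The overseer's speed advantage: N forward passes per agent pass.
        verdicts = [angel_judge(state, action) for _ in range(angel_passes)]
        if "shut_down" in verdicts:
            return state, "halted by overseer"
        if "unsafe" in verdicts:
            # Instead of executing, adjust the agent's working memory and continue.
            state = dict(state, overseer_note="last proposed action was judged unsafe")
            continue
        state = execute_action(action, state)        # action passed review; run it
    return state, "max steps reached"
```

The “speed advantage” here is just a loop count; how much it buys depends on what the overseer can observe and how reliably it judges, and, as noted above, the whole arrangement can be undone if the agent is jailbroken out of the loop.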
[1] From the system card: “Closed domain hallucinations refer to instances in which the model is instructed to use only information provided in a given context, but then makes up extra information that was not in that context.”

Comments

What you call invocations, I called 'bureaucracies' back in the day, and before that I believe they were called amplification methods. It's also been called scaffolding and language model programs and factored cognition. The kids these days are calling it langchain and ReAct and stuff like that.

I think I agree with your claims. ARC agrees also, I suspect; when I raised these concerns with them last year they said their eval had been designed with this sort of thing in mind and explained how.

I'm not surprised this idea was already in the water! I'm glad to hear ARC is already trying to design around this.

Yes, systems composed of chains of calls to an LLM can be much more capable than a few individual, human-invoked completions. The effort needed to build such systems is usually tiny compared to the effort and expense needed to train the underlying foundation models.

Role architectures provides one way of thinking about and aligning such systems.

My post on steering systems also has some potentially relevant ways for thinking about these systems.

Well stated. I would go even further: the only short-timeline scenario I can imagine involves some unholy combination of recursive LLM calls, hardcoded functions or non-LLM ML stuff, and API calls. There would probably be space to align such a thing. (Sort of. If we start thinking about it in advance.)