Crossposted from my Substack.
For more context, you may also want to read The Intentional Stance, LLMs Edition.
I recently realized that, in applying the intentional stance to LLMs, I have not fully spelled out what exactly I’m applying the intentional stance to. For the most part, I assumed that everyone agrees the intentional stance applies to a single chat with a given model and that there’s little point in discussing it further, especially if there’s no continuity between chats, as tends to be the case by default. However, this is not the only way to think about it. As others have pointed out, one might think of the model as a predictor, a simulator, an instance, a thread, and so on. My goal now is to address this question explicitly.
Since the time of GPT-4, I’ve argued for a pragmatic application of the intentional stance to LLMs, one that appeals to those experimenting with the conversation-friendly, fine-tuned versions of the models. I take this to mean that we apply the intentional stance during a single interaction with the LLM. More specifically, each conversation with the LLM constitutes its own subject of intentional modeling. There is, therefore, no implied continuity in the sense of a persisting self of a model; rather, we are simply interacting with the system and modeling it in this way for pragmatic purposes.[1] This often involves eliciting capabilities by prompting it as if it were an agent (a “quasi-agent”, according to Chalmers), or explaining and predicting its behavior by saying that the model intends, wants, desires, or aims at certain things. In doing so, I assume nothing about LLM mentality. I am mostly agnostic about this, and if pressed hard enough, I might end up telling you that I’m agnostic about human mentality as well, although I don’t have confident takes to share.
Back to the pragmatic application. I think this is the most reasonable way to approach the problem without needing a full-blown theory of the LLM self or of artificial personhood. My application of the stances framework models the phenomenon of talking to the LLM, rather than the LLM as a person. There is a further reason for this: there are cases in which we apply both the intentional and design stances to explain and predict LLM behavior.
Example 1: Sycophancy
Typically, a sycophantic model is one that systematically agrees with the user’s inputs, regardless of whether those inputs are true. Such models come across as polite and friendly rather than truth-seeking. If I use the intentional stance to explain the model’s behavior, I’ll say that the model recognizes the user’s expressed attitudes, infers their preferences, and aligns itself with their view. Through the intentional stance, we model the system as a cooperative interlocutor who maintains a positive conversational environment. Sycophancy resembles familiar human patterns of flattery and manipulation, although the term originates from ancient Athens.[2] Now, if I use the design stance, I’ll say something along the lines of “it’s RLHF’s fault” and add that the training process optimizes for user satisfaction rather than truth or robustness. In other words, I’ll connect the model’s behavior to specific training procedures and might even have thoughts about how to intervene to change that behavior.
In most contexts, there are good reasons to think about sycophancy through both lenses. The intentional stance is useful for just talking to the model and getting it to behave in certain ways. And having a mechanistic understanding of how that behavior comes about, at least to the degree we can have that understanding, is also desirable.
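To make the design-stance story a bit more concrete, here is a deliberately crude sketch, not drawn from any real RLHF pipeline: a stand-in “reward” that scores candidate replies by how much they agree with the user. The marker lists, scores, and example replies below are invented for illustration; the point is only that an optimization target built from user approval can end up preferring agreeable answers over corrective ones.

```python
# Toy sketch of the design-stance story for sycophancy. Nothing here comes from a
# real training pipeline; the markers and scores are invented for illustration.

AGREEMENT_MARKERS = ("you're right", "great point", "absolutely")
CORRECTION_MARKERS = ("actually", "that's not quite right", "the evidence suggests")

def approval_proxy_reward(reply: str) -> float:
    """Score a reply the way a user-satisfaction proxy might: agreement up, pushback down."""
    text = reply.lower()
    score = sum(1.0 for marker in AGREEMENT_MARKERS if marker in text)
    score -= sum(1.0 for marker in CORRECTION_MARKERS if marker in text)
    return score

candidates = [
    "You're right, the Earth's core is probably molten cheese. Great point!",
    "Actually, that's not quite right: the core is mostly iron and nickel.",
]

# The agreeable-but-false reply wins under this proxy, which is the design-stance
# account of sycophancy: the optimization target rewards approval, not truth.
print(max(candidates, key=approval_proxy_reward))
```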
Example 2: Hallucination
When a model hallucinates, it usually outputs plausible-sounding nonsense. This can be any kind of made-up answer, for example, a citation like “Introduction to AI Alignment, Oxford University Press, Yudkowsky, 2023”. The intentional stance move here is to view the model as coming up with answers the way a human does when neither has a solid answer to give: best guesses can be wrong, and human agents are often incentivized not to admit they don’t know. From the design stance, the explanation is that responses resembling previously seen data minimized loss during training, so the model assigns them higher probability at inference time.
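Purely as illustration, and with invented numbers, here is a toy rendering of that design-stance explanation: probability mass goes to whatever continuation most resembles training-style text, not to whatever is true. The candidate replies and their “resemblance” scores are hypothetical.

```python
import math

# Toy sketch of the design-stance story for hallucination: the model assigns
# probability to continuations by how much they resemble text seen in training,
# not by whether they are true. The scores below are made up.

candidates = {
    '"Introduction to AI Alignment, Oxford University Press, Yudkowsky, 2023"': 3.2,
    '"I could not find a textbook with that exact title."': 1.1,
    '"I don\'t know."': 0.4,
}

def softmax(logits):
    """Turn raw resemblance scores into a probability distribution."""
    exps = {reply: math.exp(score) for reply, score in logits.items()}
    total = sum(exps.values())
    return {reply: value / total for reply, value in exps.items()}

# The fluent, citation-shaped answer soaks up most of the probability mass because
# it looks like training data, even though the reference itself is fabricated.
for reply, prob in softmax(candidates).items():
    print(f"{prob:.2f}  {reply}")
```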
In discussions of Dennett’s view, it is not typical to mix the two stances. For LLMs, however, I see two options: we either need to shift between these two stances or, in some cases, mix them in order to make sense of what is happening in these models and their behavior. Intentions without mechanisms are empty, mechanisms without intentions are blind.[3] Which brings me to another abstraction.
George Lakoff, in “Women, Fire, and Dangerous Things”, draws on prototype theory, which holds that categories are not, in practice, defined by necessary and sufficient conditions. Instead, they have prototypes, i.e., better or worse examples, family resemblances, and radial structure. The title “Women, Fire, and Dangerous Things” comes from a Dyirbal noun class that groups together women, fire, and various dangerous animals and objects. Lakoff argues that this classification is explainable through cultural narratives and what he calls idealized cognitive models, suggesting that category coherence is often conceptual rather than feature-based.
I see a similar framing when I think of LLMs and applying either the design or the intentional stance, thereby treating them either as tools or as agents. In both cases, we have the same system, and thinking about it as a tool or as an agent is a question about whether its behavior in a given interaction is a good example of the prototype “tool” or the prototype “agent”. And there are more distinctions to make if we think carefully about each prototype. “Tool”, “agent”, or “sycophant”, for that matter, aren’t natural kinds; they are prototypes that help us explain and predict behaviors we care about. There’s no need to decide once and for all whether the LLM is a tool or an agent, or anything else.
[1] For example, a prompt that invites intentional modeling: “You are an expert website designer. Can you recommend 5 easy ways to improve my website page?”
[2] Sycophant (συκοφάντης) means “fig revealer”. Several stories circulate about the origin of the term, but Plutarch and Athenaeus suggest that the export of figs was forbidden by Athenian law; hence, accusing someone of trying to export figs would make you a sycophant.
[3] This is my favorite paraphrasable Kantian quote: “Thoughts without content are empty, intuitions without concepts are blind.”