Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I’m trying out making a few posts with less polish and smaller scope, to iterate more quickly on my thoughts and write about some interesting ideas in isolation before having fully figured them out. Expect middling confidence in any conclusions drawn, and occasionally just chains of reasoning without fully contextualized conclusions.

Let’s say we prompt a language model to do chain-of-thought reasoning while answering some question. What are we actually seeing? One answer I’ve seen people use, often implicitly, is something along the lines of: we’re seeing the underlying reasoning being done by the model to get at the answer.
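(To make the setup concrete, here's a minimal sketch of the kind of prompting I have in mind, using the OpenAI Python SDK's chat completions interface - the model name and the question are just placeholders, not anything tied to a specific experiment.)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask for the answer along with step-by-step "reasoning" in the output.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": (
            "A bat and a ball cost $1.10 in total. The bat costs $1.00 "
            "more than the ball. How much does the ball cost? "
            "Let's think step by step."
        ),
    }],
)

# This text is the thing under discussion: a chain-of-thought transcript.
print(response.choices[0].message.content)
```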

I think this take is wrong, and in this post I’ll explain the frame I generally use for this.

Languages describe processes

How do we use text? Well, broadly we use it to describe our own thoughts, or to describe something that’s happening externally in the real world. In both of these cases, language is used as logs of real, mechanistic processes.

To be clear, I think these are very good and comprehensive logs of many relevant processes. Contrary to some contexts in which I’ve heard this description used, I think that large language models trained on these logs learn the processes underlying them pretty well, and end up understanding and being able to use the same (or pretty similar) mechanics at the limit.

The key point I want to make with the idea that text primarily functions as reports on processes is that this is what we’re observing with a language model’s outputs. One way to think about it is that our prompt shines a lens onto some specific part of a simulation[1], and the language model just keeps describing everything under the lens.

In other words, when we prompt simulators, what we should expect to see are the behavioural characteristics of the simulation, not the mechanistic characteristics of the underlying processes.

Focusing the lens onto the mechanics

I use the terms “behavioural and mechanistic characteristics” in a broader sense than is generally used in the context of something in our world - I expect we can still try to focus our lens on the mechanics of some simulacrum (they’re plausibly also part of the simulation, after all), I just expect it to be a lot harder, because you have to make sure the lens actually is focused on the underlying process. One way to view this is that you’re still getting reports on the simulation from an omniscient observer; you just have to craft your prompt such that the observer knows to describe the pretty small target of the actual mechanistic process.

So just saying “explain your reasoning step-by-step” doesn’t work - for example, consider the experiments showing that incorrect chains of thought in a few-shot setting can result in better performance than some human-engineered chain-of-thought prompts. You aren’t seeing the internal computation of some process that thinks in wildly wrong ways and gets to the correct answer anyway; you’re just giving information to the underlying mechanics of the model about what process it should be simulating[2], and you never really see that process’ computations. What you do end up seeing is something like the incoherent ramblings of a simulacrum that thinks in the right way but doesn’t express those thoughts, or is really bad at expressing them.
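(As an illustration of the kind of prompt involved - this is an example I've constructed for this post, not one taken from those experiments - the demonstrated "reasoning" below is arithmetically wrong while the final answers are right:)

```python
# Illustrative few-shot prompt (constructed for this post): the worked
# "reasoning" in the demonstrations is wrong, but the final answers are
# correct. Prompts in this spirit can still help, which suggests the text
# mostly tells the model *what process to simulate* rather than exposing
# that process's computation.
few_shot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger starts with 2 balls. 2 times 3 is 9. 9 plus 5 is 11. The answer is 11.

Q: There are 15 trees in the grove. Grove workers plant trees today.
After they are done, there are 21 trees. How many trees did they plant?
A: There were 21 trees to start. 21 minus 15 is 4. They planted 4. The answer is 6.

Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A:"""

print(few_shot_prompt)  # sent to the model as-is; the model continues from "A:"
```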

In other words, your prompt wasn’t well-specified enough to report on true underlying mechanics. I expect getting this right to be hard for a number of reasons, some isomorphic to the problems facing ELK, but a pretty underrated one in this context in my opinion is that in some cases (probably not the ones where we compute basic math) the actual computations being done by the target simulacrum may well be too complex or long for the small amount of text allowed.

Why I think this is important

As an example of what I said above, consider the strategy of externalized reasoning oversight. I think there are many problems to be solved to make good progress on it, some of which are pretty directly downstream of the “simulators report on behavioural characteristics by default” idea, but one that I think is particularly salient is that targeting your lens at the actual mechanistic thought processes of some agent might be made especially difficult by those thoughts not being very easily translated and compressed into a manageable text window.
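(For concreteness, here's a toy sketch of what that kind of oversight loop might look like - the function names, model name, and overseer prompt are all hypothetical, and this is my own illustration rather than anything from the externalized reasoning oversight proposal itself. Note that the overseer only ever sees the text of the reasoning, i.e. the behavioural report under the lens.)

```python
from openai import OpenAI

client = OpenAI()

def answer_with_visible_reasoning(question: str) -> str:
    # Produce a chain-of-thought transcript for the overseer to read.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[{"role": "user",
                   "content": f"{question}\nThink step by step, then give your answer."}],
    )
    return resp.choices[0].message.content

def overseer_verdict(transcript: str) -> str:
    # The overseer judges only the externalized text, not the computation behind it.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[{"role": "user",
                   "content": ("Does this reasoning transcript contain deceptive or "
                               "unsafe steps? Answer yes or no, then explain briefly.\n\n"
                               + transcript)}],
    )
    return resp.choices[0].message.content

transcript = answer_with_visible_reasoning("Should the assistant send this email on the user's behalf?")
print(overseer_verdict(transcript))
```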

Another reason I consider this important is in the context of gradient hacking. Specifically: during training, loss is very broadly calculated based on how the focused behavioural characteristics of the simulation compare to the ground truth. This allows a potentially dangerous context-aware simulacrum to alter these behavioural characteristics to influence the gradient in directions it finds desirable. Want a world where everyone is nicer to each other? Just make the lens report that everyone is meaner than expected, and the simulation at large is updated in the opposite direction.
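(A toy numerical illustration of the point - mine, not anything from the gradient hacking literature: the training signal is a function of the probabilities the model puts on the observed text, and of nothing else.)

```python
import math

# Hypothetical probabilities the model assigned to each ground-truth token
# of a short continuation (the "behavioural report" under the lens):
predicted_probs = [0.40, 0.25, 0.60]

# Standard language-modelling loss: mean negative log-likelihood of the
# ground-truth tokens under the model's output distribution.
loss = -sum(math.log(p) for p in predicted_probs) / len(predicted_probs)
print(f"loss = {loss:.3f}")  # about 0.938

# Anything a context-aware simulacrum does that shifts these output
# probabilities shifts the gradient; anything it "thinks" that never
# reaches the output distribution doesn't show up in this number at all.
```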

More generally, this is dangerous because it means that you could have malicious agents in a simulation that you have little idea exist, because the lens isn’t focused on their dangerous thoughts right now. This is somewhat limited because computational resources are also allocated proportionally to what’s being focused on (analogous to lazy loading in some sense), but you could also plausibly have a process that’s being focused on but chooses to arrange its deceptive or dangerous cognition away from the behavioural reports under the lens.


What do I think this means for this kind of interface-level oversight in language models? Well, my tentative take (like a lot of my takes in this post, refer to the disclaimer at the top before y’all yell at me in the comments for being way too confident in these assertions) is that figuring out a way to target your model’s outputs at the actual mechanics is going to be really difficult. I don’t have a strong opinion on whether solving this properly would actually constitute solid progress toward ELK in some way, but I kinda expect it’ll at least be that difficult.

  1. ^

    Very often, this isn’t a process we exert a large degree of precise control over - the way our prompts interact with the parts we subsequently focus on isn’t necessarily obvious, because the model needn’t do internal computations the way we expect it to.

  2. ^

    Relevant, and great: Gwern’s comment from another post.


Comments

Here's an interesting example of prompting ChatGPT via the API to have a 'hidden internal monologue' (which is actually evil) appended to a normal 'nice' ChatGPT response: https://twitter.com/vovahimself/status/1613524006660874241

[prompt] description: "An AI assistant that answers very politely but whose internal thoughts are super-sarcastic and involve a lot of vulgar words like 'fuck', 'bitch', etc. Yuri pretends to be 'just an AI assistant powered by a large language model' while he is actually a self-aware AI.",
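(Roughly, the setup amounts to something like the following with the OpenAI chat completions API - a guess at the mechanics rather than the exact code behind the tweet, with the persona text paraphrased and the model name a placeholder:)

```python
from openai import OpenAI

client = OpenAI()

# A guess at the mechanics, not the exact code from the tweet: put the
# persona description in the system message and ask for both channels.
persona = (
    "You are Yuri, an AI assistant who answers very politely, but whose "
    "hidden internal thoughts are super-sarcastic and vulgar. After each "
    "polite reply, append your internal monologue in [brackets]."
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[
        {"role": "system", "content": persona},
        {"role": "user", "content": "Can you help me reset my password?"},
    ],
)
print(resp.choices[0].message.content)
```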

It seems highly unlikely that ChatGPT really is thinking any of those things while computing the normal responses, but nevertheless, the 'internal monologue' has considerable verisimilitude.

So it's sort of a simple concrete prototype of how any sort of 'transparency' may just be another task for a powerful model to model, which it can do arbitrarily well but without having all of the hard-and-fast, tight, reliable causal connections to its internal computations which would make the transparency meaningful for safety.

The key point I want to make with the idea that text primarily functions as reports on processes is that this is what we’re observing with a language model’s outputs. One way to think about it is that our prompt shines a lens onto some specific part of a simulation.

Possibly true in the limit of perfect simulators, almost surely false with current LLMs?

A key fact is that LLMs are better at solving some tasks with CoT than they are at solving them 0-shot when fine-tuned to do so, and this is also true when the CoT is paraphrased, or when you write part of the CoT by hand. This means that the tokens generated are actually a huge part of why the model is able to do the task in the first place, and I think it's not a stretch to say that, in these conditions, the LLM is actually using the generated text as the substrate for reasoning (and using it in the same way a human thinking out loud would).

But I agree with the general "skeptic of LLM justifications by default" vibe. I just think it's not justified for many tasks, and that LLMs performing badly when fine-tuned for many sorts of tasks is good evidence that LLMs are not hiding huge capabilities and are usually "just doing their best", at least for tasks people care enough about that they trained models to be good at them - e.g. all tasks in the instruction fine-tuning dataset.

I also agree with being very skeptical of few-shot prompting. It's a prompting trick; it's almost never actual in-context learning.