Testing which LLM architectures can do hidden serial reasoning
Summary

* Recurrence enables hidden serial reasoning.
* Not every kind of recurrence, though - connections between channels are needed. Notably, the Mamba architecture isn't capable of hidden reasoning.
* Non-linearity isn't needed for hidden reasoning.
* It's hard for transformers to learn to use all their layers for serial computation. For example, in my toy setup, increasing the serial computation length by 1 requires 3 additional layers.
* If we expect recurrent architectures may ever become SOTA, it would be wise to preemptively ban them. (Preferably before they become SOTA, while it's easier.)

Here's a follow-up to this post, which looks at serial computation hidden in the chain-of-thought, rather than internally in the model.

Motivation

There are many examples of unfaithful LLM reasoning - where the answer doesn't follow from the reasoning, but rather the reasoning is just a rationalization for the answer. E.g. Turpin et al. 2023 show LLMs rationalizing for sycophantic and stereotypical answers. However, these are cases of rather simple hidden reasoning. What would be most worrying is LLMs doing complex hidden planning, involving multiple steps.

Fortunately, transformers have limited serial depth, meaning that internally they can do at most #layers serial steps before externalizing their thoughts as tokens. (See figure below.)

Image taken from By Default, GPTs Think In Plain Sight

There are other architectures which don't have this limitation, though. For example, Mamba has a recurrent module (SSM) which passes information between token positions within the same layer. (So in the figure above, those would be arrows going horizontally to the right - see the code sketch at the end of this section.)

Luckily, the transformer architecture is SOTA by far. But if we expect that could ever change, it may be wise to preemptively ban the architectures which enable hidden complex planning[1]. Especially since it's easier to ban things which aren't currently used.

Toy task for hidden serial reasoning

To study this, I made
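To make the recurrence point concrete, here is a minimal sketch (my own illustration with assumed names and shapes, not the actual experiment code) of a single recurrent layer. The hidden state is carried from one token position to the next within the layer, so serial computation grows with sequence length rather than depth; the state-transition matrix is either diagonal (Mamba-style SSM, where each channel only talks to itself across time) or dense (channels connected, the kind of recurrence the summary claims is needed for hidden reasoning).

```python
# Minimal sketch of within-layer recurrence; names and shapes are assumptions.
import numpy as np

T, D = 6, 4                      # sequence length, number of channels
x = np.random.randn(T, D)        # per-token inputs to the recurrent layer

def scan(A, x):
    """Run h_t = A @ h_{t-1} + x_t across token positions within one layer.

    The hidden state flows token -> token (the "horizontal arrows"), so the
    number of serial steps grows with sequence length, not with depth.
    """
    h = np.zeros(x.shape[1])
    states = []
    for t in range(x.shape[0]):
        h = A @ h + x[t]
        states.append(h)
    return np.stack(states)

# Mamba-style (SSM) recurrence: a diagonal transition, so each channel only
# interacts with itself across time - no mixing between channels.
A_diagonal = np.diag(np.random.rand(D))

# Dense recurrence: the transition mixes channels at every step - the
# "connections between channels" that the summary says hidden reasoning needs.
A_dense = np.random.randn(D, D) / np.sqrt(D)

states_diag = scan(A_diagonal, x)
states_dense = scan(A_dense, x)
print(states_diag.shape, states_dense.shape)   # (6, 4) (6, 4)
```

A transformer has no such horizontal state: each token position can only perform as many serial steps as there are layers before its thought has to be externalized as a token.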
