LLMs, on the other hand, are feed-forward networks. Once an LLM commits to a path, it's committed; it can't go back to a previous layer. The entire model runs once to produce a token. That token is then locked in, and the whole model runs again to generate the next one, with its intermediate states ("working memory") completely wiped. This is not a good architecture for deep thinking.
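To make the point concrete, here is a minimal sketch of greedy autoregressive decoding. The names `generate` and `forward` are hypothetical stand-ins, not any particular library's API; the thing to notice is that the only state surviving from one step to the next is the token sequence itself.

```python
def generate(forward, prompt_tokens, n_new_tokens):
    """Greedy decoding loop. `forward(tokens)` is a hypothetical function that
    runs one full pass through the network and returns logits over the vocab."""
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        logits = forward(tokens)  # all intermediate activations exist only inside this call
        next_token = max(range(len(logits)), key=logits.__getitem__)  # pick the argmax
        tokens.append(next_token)  # the token is now locked in
        # When forward() returns, its hidden states are gone; the next step
        # starts from scratch, seeing only the (now longer) token sequence.
    return tokens
```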
It might be the case that LLMs develop different cognitive strategies to cope with this, such as storing their working memory in the CoT tokens, so that the ephemeral intermediate states aren't load-bearing. The effect would be that the LLM+CoT system acts as... whatever part of our brain explores ideas.