What I was pointing to was the fact that the feed forward networks for the new token don't have access to the past feed-forward states of the other tokens [...] When curing cancer the second time, it didn't have access to any of the processing from the first time. Only what previous layers outputted for previous tokens.
That is the misconception. I'll try to explain it in my own words (because, frankly, despite knowing how a transformer works, I can't understand Radford Neal's explanation).
In the GPT architecture each token starts out as an embedding, which is then enriched in each layer with information from previous tokens and with knowledge stored in the nn itself. So you have a vector which is modified in each layer; let's call the output of the $i$-th layer $v_i$.
The computation of $v_{i+1}$ accesses the $v_i$ of all previous tokens! So in your example, if in layer $i$ at some token the cure for cancer is discovered, all following tokens will have access to that information in layer $i+1$. The model cannot forget this information. It might never access it again, but the information will always be there for the taking.
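Schematically (a simplification that ignores layer norms, multiple heads and other details, so take it as a sketch rather than the exact architecture), the update for position $t$ is

$$v_{i+1,\,t} \;=\; \mathrm{FF}_i\Big(v_{i,\,t} + \mathrm{Attn}_i\big(v_{i,\,t};\; v_{i,\,1},\dots,v_{i,\,t}\big)\Big),$$

where $v_{i,t}$ is the layer-$i$ output at position $t$: the attention term ranges over the layer-$i$ vectors of every position up to $t$, and nothing in the architecture ever deletes them.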
This is in contrast to a recurrent neural network, which might actually forget important information if it is unlucky in how it edits its state.
I think that even if AI 2027 is directionally correct (very fast AI progress), the concrete details are likely to be wrong, so I'm not sure how impressed one should be if your predictions turn out to be correct.
About "it's all just vibes": AI 2027 is strongly based on the METR time horizon analysis. I think it would be more fruitful to critique and analyse that. Stuff like the time from SC to SAI seems like epicycles. Though the biggest uncertainty in AI 2027 probably comes from the assumption of recursive improvement.
I am not sure how fruitful the "shallow vs deep thinking" terminology is. What you explain in more detail is what I call "knowledge integration" and "learning while problem solving": both are about humans having more powerful representations, which can be modified while mulling a problem over and improved by integrating data from other domains.
Your algorithmic explanation for LLM shortcomings seems to be wrong and based on a misunderstanding of how LLMs work:
As joseph_c already mentioned, the human brain (as an nn architecture) is much, much wider and shallower than a GPT. One of your examples, coming up with clever jokes, also doesn't leave humans enough time to engage in a lot of recursive thought.
Also, LLMs do actually keep the entire earlier state around; that's what the KV-cache is. The computation of each new token does access the fine-grained vector representations of earlier tokens. There is no memory wiping going on (a minimal sketch is below).
I think the opposite is correct: LLMs are not nearly wide enough. As a consequence, their representation of "the problem" or "the situation" is impoverished.
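To make the KV-cache point concrete, here is a minimal single-head decoding loop (a toy PyTorch sketch, not any particular model's implementation): the keys and values of earlier tokens are only ever appended to the cache, never overwritten or wiped.

```python
import torch

d = 16                                        # toy model width
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(x_t, K, V):
    """One causal attention step for the newest token, reading the cached
    keys/values of all earlier tokens (plus its own)."""
    q = x_t @ Wq                              # query is computed for the new token only
    weights = torch.softmax((K @ q) / d**0.5, dim=0)
    return weights @ V                        # weighted mix over every cached position

# Append-only cache: entries are never overwritten or dropped.
K_cache = torch.empty(0, d)
V_cache = torch.empty(0, d)

for step in range(5):                         # pretend these are five successive tokens
    x_t = torch.randn(d)                      # this layer's input vector for the new token
    K_cache = torch.cat([K_cache, (x_t @ Wk).unsqueeze(0)])
    V_cache = torch.cat([V_cache, (x_t @ Wv).unsqueeze(0)])
    out = attend(x_t, K_cache, V_cache)
    print(step, tuple(K_cache.shape))         # (step+1, 16): all earlier state is still there
```

A real model keeps one such cache per layer and per head, but the structure is the same: it only ever grows.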
I think this insight is really interesting! Especially the potential connection to LLMisms.
But I don't really understand why you chose these experiments. It seems to me the things to check or prove are:
You do significantly more work to show the effect in a toy setting that may or may not bear on the real case. And I think the outcome of your experiments is already clear before you run them, because the effect of top-k sampling on low- and high-probability tokens is not complicated (and is well explained by you in the post).
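For reference, the mechanism is just this (a generic top-k sketch, not the post's code): tokens outside the k most probable get probability exactly zero and the rest is renormalized, so the survivors are sampled slightly more often than the model's raw distribution says.

```python
import numpy as np

def top_k_sample(logits, k, rng=np.random.default_rng()):
    """Generic top-k sampling: keep the k highest logits, renormalize, sample."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    keep = np.argsort(probs)[-k:]        # indices of the k most probable tokens
    masked = np.zeros_like(probs)
    masked[keep] = probs[keep]
    masked /= masked.sum()               # survivors absorb the cut-off probability mass
    return rng.choice(len(probs), p=masked), masked

logits = np.array([3.0, 2.0, 1.0, -1.0, -4.0])   # toy vocabulary of 5 tokens
_, adjusted = top_k_sample(logits, k=2)
print(adjusted)   # low-probability tokens are now at 0, the top 2 are slightly boosted
```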
because tokens are too low bandwidth
That's also my impression: https://www.lesswrong.com/posts/KrgBkqeChtAWuPsLP/what-llms-lack
The 4-month doubling trend implies an 8h+ time horizon by early 2026 and an order of magnitude more by mid-2027. If the best time horizon in mid-2027 were 9h, would you feel like you had won the argument, even if you had won the bet?
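The back-of-the-envelope arithmetic (the ~1h anchor for early 2025 is purely my illustrative assumption, not METR's exact number):

```python
# Horizon doubling every 4 months, starting from an assumed ~1h in early 2025.
start_hours = 1.0
doubling_months = 4

def horizon(months_after_start):
    return start_hours * 2 ** (months_after_start / doubling_months)

print(horizon(12))   # early 2026: ~8h   (3 doublings)
print(horizon(30))   # mid-2027:  ~181h  (7.5 doublings, well over 10x the early-2026 value)
```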
I think it is a cool idea and has its applications, but you are right that it seems very unlikely to contribute to AGI in any way. There was nonetheless excitement about integrating KANs into transformers, which was easy to do but just didn't improve anything.
SSMs are really quite similar to transformers. As with all the "sub-quadratic" transformer variants, the expectation is at best that they will do the same thing, just more efficiently than transformers.
HRMs, continuous thought machines, or KANs, on the other hand, contain new and different ideas that make a discontinuous jump in abilities at least conceivable. So I think one should distinguish between those two types of "promising new architectures".
My view is that these new ideas accumulate and at some point somebody will be able to put them together in a new way to build actual AGI.
But the authors of these papers are not stupid. If there were straightforward applicability to language modelling, they would already have done that. If there were a line of sight to GPT-4-level abilities in six months, they probably wouldn't publish the paper.
Empathy is not: That person acts like this. How would I feel if I acted like this? Oh, absolutely disgusted with myself.
Empathy is: This person acts like this. How must he feel inside to act like this? Have I ever felt like that? Can I understand, or extrapolate from my experiences, how this would be? Maybe from my internal states when I was really exhausted or hangry or drunk or in a rage or depressed? Could I imagine having an internal state such that I would act like this? This also involves asking how the internal state would have to be different so that you would not feel disgusted with yourself.
I think Sailer had it right 30 years ago. It's mostly just behavioral and physical masculinity/femininity. That may be unfair, but it's not racism.
The function of the feedforward components in transformers is mostly to store knowledge and to enrich the token vectors with that knowledge. The wider you make the ff-network, the more knowledge you can store. The network is trained to put the relevant knowledge from the wide hidden layer into the output (i.e. into the token stream).
I fail to see the problem in the fact that the hidden activation is not accessible to future tokens. The ff-nn is just a component to store and inject knowledge. It is wide because it has to store a lot of knowledge, not because the hidden activation has to be wide. The full content of the hidden activation in isolation just is not that relevant.
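As a reference point, this is roughly the GPT-3-style feedforward block (a generic sketch, not any specific model's code): the hidden activation is several times wider than the residual stream, but it exists only inside the block, and only the down-projected output is written back into the token's vector.

```python
import torch
import torch.nn as nn

class ClassicFFN(nn.Module):
    """GPT-3-style MLP block: up-project, nonlinearity, down-project."""
    def __init__(self, d_model=768, expansion=4):
        super().__init__()
        self.up = nn.Linear(d_model, expansion * d_model)    # wide layer: stores "knowledge"
        self.down = nn.Linear(expansion * d_model, d_model)  # squeeze back into the token vector
        self.act = nn.GELU()

    def forward(self, x):
        hidden = self.act(self.up(x))   # wide hidden activation: internal to the block,
                                        # never cached, never seen by other tokens
        return self.down(hidden)        # only this projection enters the residual stream

x = torch.randn(1, 768)
print(ClassicFFN()(x).shape)            # torch.Size([1, 768])
```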
Case in point: Nowadays the ff-nns actually look different from the ones in GPT-3. They have two parallel hidden projections, with one acting as a gate on the other: the design has changed to allow the network to actively erase parts of the hidden state!
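A sketch of that gated variant (in the style of the SwiGLU-type blocks used in many recent models; the exact form varies): one projection produces candidate hidden features, the other produces a gate that can push individual hidden units toward zero, i.e. erase them, before the down-projection.

```python
import torch
import torch.nn as nn

class GatedFFN(nn.Module):
    """SwiGLU-style feedforward block: value branch multiplied by a gate branch."""
    def __init__(self, d_model=768, d_hidden=2048):
        super().__init__()
        self.value = nn.Linear(d_model, d_hidden)   # candidate hidden features
        self.gate = nn.Linear(d_model, d_hidden)    # decides what survives
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        hidden = self.value(x) * torch.nn.functional.silu(self.gate(x))
        # where silu(gate) is near zero, the corresponding hidden units are erased
        return self.down(hidden)

x = torch.randn(1, 768)
print(GatedFFN()(x).shape)              # torch.Size([1, 768])
```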
Also: this seems very different from what you are talking about in the post; it has nothing to do with "the next run". The hidden layer activations aren't even "accessible" in the same run! They are purely internal "gears" of a subcomponent.
It also seems to me like you have retreated from
to "intermediate activations of ff-components are not accessible in subsequent layers and because these are wider than the output not all information therein contained can make it into the output".