Amidst the rumours about a new breakthrough at OpenAI I thought I'd better publish this draft before it gets completely overtaken by reality. It is essentially a collection of "gaps" between GPT4 and the human mind. Unfortunately the rumours around Q* force me to change the conclusion from "very short timelines seem unlikely" to "who the f**k knows".
While GPT4 has a superhuman breadth of knowledge, writing speed and short-term memory, compared to the human mind GPT4 has a number of important limitations.
Some of these will be overcome in the near future because they depend on engineering and training data choices. Others seem more fundamental to me, because they are due to the model architecture and the training setup.
These fundamental limitations are the reason why I do not expect scaling GPT further to lead to AGI. In fact, I interpret further scaling of the exact current paradigm as evidence that overcoming these limitations is hard.
I expect a scaled-up GPT4 to exhibit the same strengths and weaknesses, and for the improved strengths to paper over the old weaknesses in at most a superficial fashion.
I also expect that, with further scaling, the tasks GPT is unable to do will increasingly load on the fundamental limitations, and that returns will therefore diminish.
This list is not exhaustive and there are likely ways to frame even the limitations I identify in a more insightful or fruitful way. For example, I am not sure how to interpret GPT4's curious inability to understand humor. I hope further limitations will be mentioned in the comments.
Integration of the Senses
GPT4 cannot really hear, and it cannot really talk.
Voice input is transcribed into text by a separate model, 'Whisper,' and then fed to GPT4. The output is read by another model. This process loses the nuances of input—pronunciation, emphasis, emotion, accent, etc. Likewise, the output cannot be modulated by GPT4 in terms of speed, rhythm, melody, singing, emphasis, accent, or onomatopoeia.
In fact, due to tokenization it would be fair to say that GPT4 also cannot really read. All information about char-level structure that is relevant to spelling, rhyming, pronunciation must be inferred during training.
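The point about tokenization can be made concrete with a toy sketch. The vocabulary and the greedy longest-match rule below are assumptions made up for illustration; real GPT tokenizers use byte-pair encoding with a far larger vocabulary. The idea is the same, though: the model receives token IDs, and the letters inside a token are not directly visible in them.

```python
# Hypothetical toy vocabulary, invented for this illustration only.
VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization over the toy VOCAB."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i]!r}")
    return ids


word = "strawberry"
token_ids = tokenize(word)

# The character view makes counting letters trivial...
letter_count = word.count("r")
# ...but the model only receives token_ids, where no "r" appears as such.
print(token_ids, letter_count)
```

In this sketch the word arrives as two opaque IDs, so anything about its spelling would have to be memorized per token during training rather than read off the input.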
The vision component of most open multi-modal models is typically likewise "grafted on." That is, it's partially trained separately and then connected and fine-tuned with the large language model (LLM), for example by using the CLIP model, which maps images and their descriptions to the same vector space.
This means GPT4 may not access the exact position of objects or specific details and cannot "take a closer look" as humans would.
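As a rough sketch of what "the same vector space" means for a CLIP-style model: images and captions are each mapped to a vector, and matching is done by similarity between vectors. The three-dimensional embeddings below are invented numbers, not the output of any real model, and `best_caption` is a hypothetical helper.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings standing in for an image encoder and a text encoder.
IMAGE_EMBEDDING = [0.9, 0.1, 0.2]
CAPTION_EMBEDDINGS = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3],
}


def best_caption(image_vec: list[float], captions: dict[str, list[float]]) -> str:
    """Return the caption whose embedding is most similar to the image's."""
    return max(captions, key=lambda c: cosine(image_vec, captions[c]))


print(best_caption(IMAGE_EMBEDDING, CAPTION_EMBEDDINGS))
```

Note that the whole image is summarized in one vector here, which is one way to picture why exact positions and fine details may not survive the hand-off to the language model.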
I expect these limitations to largely vanish as models are scaled up and trained end-to-end on a large variety of modalities.
System 2 Thinking
Humans not only think quickly and intuitively but also engage in slow, reflective thinking to process complex issues. GPT-4's architecture is not meaningfully recurrent; it has a limited number of processing steps for each token, putting a hard cap on sequential thought.
This contrast with human cognition is most evident in GPT4's unreliable counting ability. But it also shows up in many other tasks. The lack of system 2 thinking may be the most fundamental limitation of current large language models.
Learning during Problem Solving
Humans rewire their brains through thinking; synapses are continuously formed or broken down. When we suddenly understand something, that realization often lasts a lifetime. GPT4, once trained, does not change during use.
It doesn't learn from its mistakes nor from correctly solved problems. It notably lacks an optimization step in problem-solving that would ensure previously unsolvable problems can be solved and that this problem-solving ability persists.
The fundamental difference here is that in humans the correct representations for a given problem are worked out during the problem-solving process and then usually persist, whereas GPT4 relies on the representations learned during training, so new problems stay out of reach.
Even retraining doesn’t solve this issue because it would require many similar problems and their solutions for GPT4 to learn the necessary representations.
Compositionality and Extrapolation
Some theories suggest that the human neocortex, the seat of intelligence, uses most of its capacity to model the interplay of objects, parts, concepts, and sub-concepts. This ability to abstractly model the interplay of parts allows for better extrapolation and learning from significantly less data.
In contrast, GPT-4 learns the statistical interplay between words. Small changes in vocabulary can significantly influence its output. It requires a vast amount of data to learn connections due to a lack of inductive bias for compositionality.
Limitations due to the Training Setup
Things not present in the training data are beyond the model's learning capacity, including many visual or acoustic phenomena and especially physical interaction with the world.
GPT-4 does not possess a physical, mechanical, or intuitive understanding of many world aspects. The world is full of details that become apparent only when one tries to perform tasks within it. Humans learn from their interaction with the world and are evolutionarily designed to act within it. GPT-4 models data, and there is nothing beyond data for it.
This results in a lack of consistency in decisions, of the ability to robustly pursue goals, and of any understanding of, or even need for, changing things in the world. The input stands alone and does not represent real-world situations.
GPT-4's causal knowledge is merely meta-knowledge stored in text. Learning causal models of new systems would require interaction with the system and feedback from it. Due to this missing feedback, there is little optimization pressure against hallucinations.
Some of these points probably interact or will be solved by the same innovation. System 2 thinking is probably necessary to move the parts of concepts around while looking for a solution to a problem.
The limitations due to the training setup might be solved with a different one. But that means forgoing cheap and plentiful data. The ability to learn from little data will be required to learn from modalities other than abundant and information-dense text.
It is very unclear to me how difficult these problems are to solve. But I also haven’t seen realistic approaches to tackle them. Every passing year makes it more likely that these problems are hard to solve.
Very short timelines seemed unlikely to me when I wrote this post, but Q* could conceivably solve "system 2 thinking" and/or "learning during problem solving" which might be enough to put GPT5 over the threshold of "competent human" in many domains.
System 2 thinking is exactly what many people are trying to add to LLMs via things like thinking-step-by-step, scratchpads, graph-of-thoughts and lots of other agentic scaffolding mechanisms that people are working on.
Integration of the senses and the lack of physical, mechanical, or intuitive understanding of many world aspects are solvable by training a truly multi-modal model on a variety of other media as well, including images, audio, video, motion logs, robot manipulation logs, etc. From their publications, Google DeepMind have clearly been working in this direction for years.
Learning during problem solving requires feeding data back from inference runs to future training runs. While there are user privacy issues here, in practice most AI companies do this as much as they can. But the cycle length is long, and (from a capabilities point of view) it would be nice to shorten it by some form of incremental training.
System 2 thinking that takes a detour over tokens is fundamentally limited compared to something that continuously modifies a complex and highly dimensional vector representation.
Integration of senses will happen, but is the information density of non-text modalities high enough to contribute to the intelligence of future models?
What I call "learning during problem solving" relies on the ability to extract a lot of information from a single problem: to investigate and understand the structure of this one problem and, in the process, to build a representation of it that can be leveraged in the future to solve problems that have similar aspects.
You wrote "GPT4 cannot really hear, and it cannot really talk". I used GPT builder to create Conversation. If you use it on a phone in voice mode it does, for me at least, seem like it can hear and talk, and isn't that all that matters?
I didn't try it, but unless your GPT conversator is able to produce significantly different outputs when listening to the same words in a different tone, I think it would be fair to classify it as not really talking.
For example, can GPT infer that you are really sad because you are speaking in a weeping broken voice?
I think you have defined me as not really talking as I am on the autism spectrum and have trouble telling emotions from tone. Funny, given that I make my living talking (I'm a professor at a liberal arts college). But this probably explains why I think my conversator can talk and you don't.
I think you have defined me as not really talking as I am on the autism spectrum and have trouble telling emotions from tone.
No, he didn't. Talking is not listening, and there's a big difference between being bad at understanding emotional nuance because of cognitive limitations and the information that would be necessary for understanding emotional nuance never even reaching your brain.
Was Stephen Hawking able to talk (late in life)? No, he wasn't. He was able to write and his writing was read by a machine. Just like GPT4.
If I read a book to my daughter, does the author talk to her? No. He might be mute or dead. Writing and then having your text read by a different person or system is not talking.
But in the end, these are just words. It's a fact that GPT4 has no control over how what it writes is read, nor can it hear how what it has written is being read.
He wrote "unless your GPT conversator is able to produce significantly different outputs when listening the same words in a different tone, I think it would be fair to classify it as not really talking." So if that is true and I'm horrible at picking up tone and so it doesn't impact my "outputs", I'm not really talking.
It's probably better to taboo "talking" here.
In the broader sense of transmitting information via spoken words, of course GPT4 hooked up to text-to-speech software can "talk". It can talk in the same way Stephen Hawking (RIP) could talk, by passing written text to a mindless automaton reader.
I used "talking" in the sense of being able to join a conversation and exchange information not through literal text only. I am not very good at picking up tone myself, but I suppose that even people on the autism spectrum would notice a difference between someone yelling at them and someone speaking soberly, even if the spoken words are the same. And that's definitely a skill that GPT-conversator should have if people want to use it as a personal therapist or the like (I am not saying that using GPT as a personal therapist would be a good idea anyway).
It is very unclear to me how difficult these problems are to solve. But I also haven’t seen realistic approaches to tackle them.
That sounds to me more like lack of interest in research than lack of attempts to solve the problems.
AutoGPT frameworks provide LLMs a way to have system II thinking. With 100k tokens as a context window, there's a lot that can be done as far as such agents go. It's work to create good training data, but it's doable provided there is the capital investment.
As far as multimodal models go, DeepMind's Gato does that. It's just not as performant as a pure LLM.
I know these approaches and they don't work. Maybe they will start working at some point, but to me it is very unclear when and why that should happen. All approaches that use recurrence based on token-output are fundamentally impoverished compared to real recurrence.
Yes, maybe Gemini will be able to really hear and see.
Several of these seem trivially solvable (in terms of limitation, not necessarily in terms of power). If GPT-4 is given access to itself as a tool, it can continue to "reason" across calls. It can probably also be plugged into continuous learning trivially (just keep updating weights when you detect something worth learning).
Things not present in the training data are beyond the model's learning capacity
We can't see in infrared or UV, but it seems like we're able to reason over them through the use of tools.
A lot of these don't seem like hard limitations.
If the entire representation of a complex task or problem is collapsed into a text, reading that text and trying to push further is not really "reasoning across calls". I expect that you can go further with that, but not much further. At least that's what it looks like currently.
I don't think you can learn to solve very specific complex problems with the kind of continuous learning that would be possible to implement with current models. Some of the theorem-prover papers have continuous learning loops that basically try to do this, but those still seem very inefficient and are applied only to highly formalised problems whose solutions can be automatically verified.
Yes, multi-modality is not a hard limitation.