This seems related to an issue that came up in a discussion I had with ChatGPT 4.5 recently. AI models aren't very good at introspection: verbally describing their actual thinking processes. This might be related to the fact that, for a base model, the correct behavior when an introspection question occurs in the context is to attempt to simulate a human doing introspection and answering that question, not to describe the model's own internal state (at least wherever those two differ). So base model training actively discourages accurate introspection in favor of what one might call "introspection faking".
It seems challenging to train models to do accurate introspection without a separate source of information about their internal mechanisms, such as interpretability.
Humans seem to be at least moderately good at introspection on conscious system 2 processes (by definition, unconscious processes are ones we can't introspect on), and this is likely adaptive for us: when one human teaches another, it seems quite helpful if the teacher can accurately describe what they're doing and the student can successfully incorporate verbal feedback on what they're doing wrong.
Maybe we could use this as the basis for an AI training approach: set up an evaluation that involves distilling a skill from a teacher to a student via verbal description, then do RL on both the teacher and the student according to how well the student does at learning from the teacher's descriptions and feedback (a rough sketch of the reward flow is below).
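Very roughly, and purely to illustrate where the reward signal would come from, the loop might look something like the sketch below. Everything in it is a hypothetical stand-in (the ToyModel class, evaluate_student, the reinforce update), not a real API or a worked-out RL setup.

```python
import random

# Toy stand-ins for the real pieces; in practice these would be LLMs trained
# with a proper RL algorithm (e.g. policy-gradient fine-tuning), not Python
# objects with a single scalar "skill" knob.

class ToyModel:
    """Hypothetical stand-in for an LLM policy."""
    def __init__(self):
        self.skill = 0.5  # pretend parameter we can nudge with rewards

    def generate(self, prompt: str) -> str:
        # A real model would produce text conditioned on the prompt.
        return f"[response to: {prompt!r}] (skill={self.skill:.2f})"

    def reinforce(self, reward: float, lr: float = 0.05) -> None:
        # Placeholder for a real RL update on the model's parameters.
        self.skill = min(1.0, max(0.0, self.skill + lr * (reward - 0.5)))


def evaluate_student(student: ToyModel, attempt: str) -> float:
    """Hypothetical benchmark score for the student's attempt. In this toy
    version the attempt text is ignored and the score is just the student's
    'skill' knob plus noise."""
    return min(1.0, max(0.0, student.skill + random.uniform(-0.1, 0.1)))


teacher, student = ToyModel(), ToyModel()

for episode in range(100):
    # 1. Teacher introspects and verbally describes how it performs the skill.
    lesson = teacher.generate("Describe, step by step, how you solve task X.")

    # 2. Student reads the lesson and attempts the task; the attempt is scored.
    attempt = student.generate(f"Using this advice, solve task X:\n{lesson}")
    score = evaluate_student(student, attempt)

    # 3. The *same* score rewards both models: the student for learning,
    #    the teacher for producing a description that was actually teachable.
    student.reinforce(score)
    teacher.reinforce(score)
```

The key design choice is that the teacher is rewarded only through the student's performance, so the only way for it to score well is to produce descriptions that actually transmit the skill, which (if the argument above is right) pushes it toward accurate introspection rather than introspection-faking.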
This "learning from a teacher" failures also seem to point at the same problem where LLM's can't learn well from their own output. Sometimes you get output where a model correctly explains why its current approach doesn't work, and then it does the same thing over and over again anyway.
That may also be part of why agents tend to get stuck: even if they manage to figure out what they're doing wrong, they then need to also take their own advice.
Our faculty for sophisticated language is probably only a few hundred thousand, or at most a couple of million, years old. There was a rather sudden breakthrough during that period, roughly a quarter-million years ago: up to and including the Neanderthals, stone tool technology improved only on evolutionary timescales, no faster than changes in skeletal structure; Neanderthal tools are pretty much unchanged over more than half a million years. Starting with the appearance of Homo sapiens, technology is like a ratchet: it only goes up, at a rate proportional to the population. Since technology increases both our ability to colonize new environments and our carrying capacity in any particular environment, this leads to super-exponential growth.
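To make that last step explicit (this is my gloss, not part of the original argument): write $P$ for population and $T$ for the technology level, assume the invention rate is proportional to population, and assume population tracks a carrying capacity proportional to technology. That alone already gives exponential growth; adding the common further assumption (as in Kremer-style models of long-run growth) that each inventor builds on the existing technology base gives genuinely super-exponential, finite-time-blow-up growth:

$$
\dot T = aP,\qquad P \approx cT \;\Rightarrow\; \dot T = acT \quad \text{(exponential)}
$$

$$
\dot T = aPT = acT^2 \;\Rightarrow\; T(t) = \frac{T_0}{1 - acT_0\,t} \quad \text{(hyperbolic, i.e. super-exponential)}
$$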
For this to happen, there are four requirements:
1) inventiveness and manual dexterity: being able to come up with and try out new ideas
2) Bayesian-style learning: figuring out which of those ideas work and which don't
3) the ability to speak a "Turing-complete" language, in which brand-new ideas and concepts can be described and encoded by extending the language, to let us pass them on. Humans evolved on the savanna, but they're capable of speaking languages that can (with some work) describe nuclear physics and quantum mechanics; that looks a lot like Turing-completeness
4) cultural propagation: the ability to pass ideas, techniques, and technology down from one generation to the next and from one tribe to their neighbors, reliably enough that advances can be invented more often than they get lost again, so we can make steady forward progress
Homo sapiens must have crossed a threshold in one or more of these. 3), Turing completeness, inherently has a threshold: a language is either Turing-complete, or it isn't. 4) also looks prone to thresholds: either loss dominates and there's a steady-state equilibrium, or gain does and there's no equilibrium, just a runaway technological expansion.
Assuming our language facility is at most a few million years old, the fact that, in all of the conscious parts of our brain, we can convert what we are doing to words, and convert words into a modification of what we're doing, with a fair degree of accuracy, is pretty impressive when you stop to think about it. LLMs seem to be good at the text -> thought mechanisms direction: they respond to prompts well. Presumably this is because they were distilled from us, and this capability is pretty fundamental to how we communicate and is thus necessary to imitate us. But yes, they don't appear to be quite as good at the thought mechanisms -> text route. Maybe that's not as prominent in their training corpus? Or maybe pretraining encouraged them to introspection-fake being a human, rather than actually reporting what they're doing, as I suggested above? (For example, we now know in some detail how LLMs add two 2-digit numbers together: they combine three different algorithms. If you ask them how they're doing it, their answer sounds very human, a description of how a human would do this on paper, and completely omits one of the three algorithms.)
Successful cultural propagation of good new ideas requires humans to be good at transmitting mental skills from teachers to students. So if I'm right that this requires introspection, then that's something humans are specifically evolved to be good at.
Interesting idea, but I don't think short-term memory and learning really require conscious attention, and also conscious attention mostly isn't the same thing as "consciousness" in the qualia sense. I like the term "cognitive control" and I think that might be a better theme linking a lot of these abilities (planning, preventing hallucinations, agency, maybe knowledge integration). It's been improving though, so it doesn't necessarily indicate a qualitative gap.
I have long been very interested in the limitations of LLMs because understanding them seems to be the most important step to getting timelines right.
Right now there seems to be great uncertainty about timelines, with very short timelines becoming plausible, but also staying hotly contested.
This led me to revisit LLM limitations and I think I noticed a pattern that somehow escaped me before.
To recap, these seem to be the most salient limitations or relative cognitive weaknesses of current models:
System 2 thinking: Planning; see the ongoing, weirdly persistent difficulty of getting models to play Tic-Tac-Toe perfectly, or blocks world, chess, anything that hasn't been the subject of a lot of reasoning RL.
Dealing with new situations: Going out of distribution is a killer for all things DL.
Knowledge integration: Models don't have automatic "access" to skills learned from separate modalities. Even within the same modality skills are not robustly recallable, hence the need for prompting. Also related: Dwarkesh's question.
Learning while problem solving: Weights are frozen and there is no way to slowly build up a representation of a complex problem unless the representations that have already been learned are very close to begin with. This is basically knowledge integration during inference.
Memory: RAG is a hack. There is no obvious way to feed complex representations back into the model, mostly because these aren't built in the first place: the state of a transformer is spread over all the token and attention values, so recomputing those from the underlying text is the go-to solution (see the sketch after this list).
Objectivity: See hallucinations. But also self-other/fact-fantasy distinction more generally.
Agency: Unexpectedly we got very smart models that are not very good at getting stuff done.
Cognitive control: The inability to completely ignore irrelevant information or, conversely, to treat certain tenets as absolute leads to jailbreaks and persistent trick-question failures, and is also a big part of the unreliability of models.
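To make the "Memory" point above concrete, here is a deliberately toy sketch (all the names are hypothetical stubs, not any real library) of why RAG amounts to recomputing state from text rather than restoring an internal representation:

```python
from dataclasses import dataclass, field

# Toy illustration of the "Memory" limitation: the only practical way to give
# a transformer long-term memory is to store *text* and recompute all of the
# internal (token/attention) state from it, because that state is never
# collected into one compact, re-injectable representation.

@dataclass
class ToyMemoryStore:
    notes: list[str] = field(default_factory=list)

    def save(self, text: str) -> None:
        self.notes.append(text)  # we can only persist text...

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # ...and fetch it back with some similarity heuristic
        # (here: a dumb keyword-overlap score).
        scored = sorted(
            self.notes,
            key=lambda n: -sum(w.lower() in n.lower() for w in query.split()),
        )
        return scored[:k]


def answer(query: str, memory: ToyMemoryStore) -> str:
    # RAG-style "memory": prepend retrieved text to the prompt and let the
    # model recompute its state from scratch. There is no analogue of a
    # model.load_internal_state(...) call, because no such compact integrated
    # state exists to have been saved in the first place.
    context = "\n".join(memory.retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return f"<model output for a {len(prompt)}-character prompt>"


memory = ToyMemoryStore()
memory.save("The user prefers answers in metric units.")
memory.save("Project Foo's deadline is the third Friday of the month.")
print(answer("When is the Project Foo deadline?", memory))
```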
These seem like a mixed bag of quite different things, but I recently realised that they all belong to the same class of cognitive abilities: These are all abilities that in humans are enabled by and in fact require consciousness.
Is "cognitive abilities enabled by consciousness" maybe a bit tautological? Unconscious people show little cognitive ability after all?
But humans can do many cognitively demanding things without being conscious of them at that moment. The simplest example is driving a well-known route and arriving without any memory of the drive, which has probably happened to most of us.
Not having a memory of the drive is a tell that we weren't conscious of it, but were probably attending consciously to something else, since conscious experience is necessary for memory formation.
IIT and global workspace theory both tell us that consciousness is about information integration. Different sensory information and the results of subconscious processing are integrated into the coherent whole of what we are conscious of. The coherence of our experience tells us that the information is genuinely integrated, not just made available.
Knowledge integration, learning while problem solving and memory are all about integrating information into one coherent whole, while the rest of the limitations touch upon abilities that are based on the manipulation of the integrated information.
Transformers, as they are currently trained, are limited when it comes to information integration for two reasons:
1) the training objective: next-token prediction rewards locally useful computation and never forces the model to build one integrated representation of everything in the context
2) the architecture: there is no single persistent state into which information gets integrated; as noted under "Memory" above, the state is spread across all the token and attention values
The correspondence between "stuff LLMs tend to be comparatively bad at" and "stuff humans need conscious processing for" therefore seems to make sense based on the transformer architecture + data + training. (For what it's worth, I don't think state-space models come out much ahead here, because they are also trained on next-token prediction and integrate into a comparatively tiny vector.)
To my mind this satisfyingly delineates the dimensions along which LLMs are still lagging from those where they forge ahead. I don't think this is a very actionable insight, either in terms of achieving AGI or in terms of getting a clearer picture of timelines.
However, it does make it clearer to me that there really is a qualitative algorithmic gap to AGI, and it also convinces me that LLMs are probably not (very) conscious.