What LLMs lack

[-]RogerDearnaley5mo*40

This seems related to an issue that came up In a discussion I had with ChatGPT 4.5 recently. AI models aren't very good at doing introspection: verbally describing their actual thinking processs. This might be related to the fact that, in a base model, correct behavior after an introspection question occurs in the context is to attempt to simulate a human doing introspection and answering that question, not to describe the model's internal state (at least wherever those two differ). So base model training actively discourages accurate introspection in favor of what one might call "introspection faking".

It seems challenging to train models to do accurate introspection without us having a separate source of information about their internal mechanisms such as from interpretability.

Humans seem to be at least moderately good at introspection of conscious system 2 processes (by definition, unconscious processes are ones we can't do introspection on), and this is likely adaptive for us: it seems likely to be quite helpful for teacher-student training another human if you can both accurately describe what you're doing and the student can successfully incorporate verbal feedback from the teacher on what they're doing wrong.

Maybe we could use this as the basis for an AI training approach: do an evaluation that involves distilling skills via a verbal description from a teacher to a student: do RL on both the teacher and the student according to how well the student does learning from the teacher's descriptions and feedback?

[-]Brendan Long5mo42

This "learning from a teacher" failures also seem to point at the same problem where LLM's can't learn well from their own output. Sometimes you get output where a model correctly explains why its current approach doesn't work, and then it does the same thing over and over again anyway.

[-]RogerDearnaley5mo42

That may also be part of why agents tend to get stuck: even if they manage to figure out what they're doing wrong, they then need to also take their own advice.

[-]RogerDearnaley5mo40

Our faculty with sophisticated language is probably only a few hundred thousand or at most a couple of million years old. There was a rather sudden breakthrough during that period, roughly a quarter-million years ago: up-to-and-including neanderthals, the stone tool technology improves on only evolutionary timescales, no faster than changes in the skeletal structure: neanderthal tools are pretty-much unchanged over more than half a million years. Starting with the appearance of Homo sapiens, technology is like a ratchet: it only goes up, at a rate proportional to the population. Since technology increases both our ability to colonize new environments and our carrying capacity in any particular environment, this leads to super-exponential growth.

For this to happen, there are four requirements:

1) inventiveness and manual dexterity: being able to come up with and try out new ideas
2) Bayesian-style learning: figuring out which of those ideas work and which don't
3) the ability to speak a "Turing-complete" language, in which brand-new ideas and concepts can be described and encoded by extending the language, to let us pass them on. Humans evolved on the Savannah, but they're capable to speaking languages that can (with some work) describe nuclear physics and quantum mechanics — that looks a lot like Turing-completeness
4) cultural propagation: the ability to pass ideas, techniques, and technology down from one generation to the next and from one tribe to their neighbors, reliably enough that advances can be invented more often than they get lost again, so we can make steady forward progress

Homo sapiens must have crossed a threshold in one or more of these. 3), Turing completeness, inherently has a threshold: a language is either Turing-complete, or it isn't. 4) also looks prone to thresholds: either loss dominates and there's a steady-state equilibrium, or gain does and there's no equilibrium, just a runaway technological expansion.

Assuming our language facility is at most a few million years old, the fact that, in all of the conscious parts of our brain, we can convert what we are doing to words, and convert words to a modification in what we're doing, with a fair degree of accuracy, is pretty impressive, when you stop to think about it. LLMs seem to be good at the text -> thought mechanisms direction: they respond to prompts well. Presumably this is because they were distilled from us, and this capability is pretty fundamental to how we communicate and is thus necessary to imitate us. But yes, they don't appear to be quite as good at the thought mechanisms -> text route. Maybe that's not as prominent in their training corpus? Or maybe pretraining encouraged them to introspection-fake being a human, rather than actually reporting what they're doing, as I suggested above? (For example, we now know in some detail how LLMs add two 2-digit numbers together: it combines three different algorithms. If you ask them how they're doing it, their answer sounds very human, a description of how a human would do this on paper — and completely omits one of the three algorithms.)

Successful cultural propagation of good new ideas requires humans to be good at transmitting mental skills from teachers to students. So if I'm right that this requires introspection, then that's something humans are specifically evolved to be good at.

[-]ErickBall5mo12

Interesting idea, but I don't think short-term memory and learning really require conscious attention, and also conscious attention mostly isn't the same thing as "consciousness" in the qualia sense. I like the term "cognitive control" and I think that might be a better theme linking a lot of these abilities (planning, preventing hallucinations, agency, maybe knowledge integration). It's been improving though, so it doesn't necessarily indicate a qualitative gap.

LESSWRONG
LW

LESSWRONG
LW

15

15

15

Introduction

Limitations

One category

Does this make sense?

Conclusion