A recent post from Scott Alexander argues in favor of treating intelligence as a coherent and somewhat monolithic concept. Especially when thinking about ML, the post says, it is useful to think of intelligence as a quite general faculty rather than a set of narrow abilities. I encourage you to read the full post if you haven’t already.
In this post, I want to examine how well that framing holds up, especially for AI models.
But first, before I talk about IQ, let me introduce something called FQ. That’s the “Football Quotient” (or the “soccer quotient” for us Americans).
Imagine you wanted to know how good someone was at football, so you took all the important football skills, like dribbling and shooting, and measured how good they were at each. Then you could add up those scores to get a number that roughly represents how good they are at football overall. Some people will have a higher football quotient than others.
But it doesn't really tell the whole story. Maybe it doesn't capture goalies very well, or people with athletic talent but no training. Maybe one player has a high FQ because they're tall and can head the ball really well, but they're weak on free kicks, while another is fast and scores well but is rubbish on defense. Two players might have the same FQ while having quite different skills.
But it's still a pretty good measure of overall football skill, and it probably explains a great deal of the variation between people in how good they are at football. It's probably even applicable outside of football: you could use the FQ to evaluate basketball players and get pretty decent results, even though it wasn't designed for that. It's a useful test even though it isn't perfect.
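To make the composite-score idea concrete, here's a minimal sketch in Python; the skill names and weights are invented for illustration, not taken from any real assessment:

```python
# A toy "football quotient": a weighted sum of sub-skill ratings.
# The skill names and weights here are invented for illustration.
FQ_WEIGHTS = {"dribbling": 3, "shooting": 3, "passing": 2, "defending": 2}

def fq(ratings):
    """Combine per-skill ratings (0-100) into one composite score."""
    return sum(FQ_WEIGHTS[skill] * ratings[skill] for skill in FQ_WEIGHTS) / 10

# Two players with different skill profiles can end up with the same FQ:
tall_header = {"dribbling": 50, "shooting": 90, "passing": 70, "defending": 60}
speedy_winger = {"dribbling": 90, "shooting": 70, "passing": 60, "defending": 40}
print(fq(tall_header), fq(speedy_winger))  # same composite, different players
```

The two profiles above produce the same composite, which is exactly the information the single number throws away.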
IQ is like that.
The examples that Scott chooses are intended to support the idea that intelligence is a broad, general faculty, and that the g factor is real, coherent, and useful. But I think these examples obscure more than they elucidate.
A primary example of intelligence as a general-purpose capability is that scores on the math and verbal sections of the SAT correlate quite strongly, at about 0.72. As Scott notes, maybe some people are just better test takers than others, or have more access to tutoring or a healthier diet in childhood, but nonetheless it does seem like some general "test-taking ability" is captured by a single measure of intelligence in humans.
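For intuition about what a correlation like 0.72 means, here's a small sketch computing a Pearson correlation for paired section scores; the score pairs below are made up for illustration, not real SAT data:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the
    two standard deviations (computed here from raw sums of squares)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up (verbal, math) score pairs: high scorers on one section
# tend, imperfectly, to score high on the other.
verbal = [510, 600, 680, 720, 450, 630]
math_scores = [540, 700, 560, 690, 480, 580]
print(round(pearson(verbal, math_scores), 2))
```

A value near 0.7 means the two scores move together strongly but far from perfectly, which is the pattern the g-factor argument leans on.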
The math and verbal sections of the SAT are quite similar to each other. Both rely to some extent on linguistic reasoning and knowledge, and both use language as the questioning and answering modality. It's true that if a human or a large language model does well on the verbal section, we expect it to do similarly well on the math section, with some correlation.
But what about other pairs of tests that are quite similar?
For example, let's go back to the FQ (football quotient). Imagine it has a written portion and a physical portion. Among American adults, my guess is that there's a moderate correlation between knowledge of the rules of football and the ability to actually play it. A lot of people learn the rules by going out and playing games, and the people who like to watch on TV probably like to go out and kick the ball around sometimes.
But now let's talk about AI models. I'd guess that GPT4 would do quite well on the written portion of the FQ; it might score in the 99th percentile even on a hard test. But it can't participate in the physical portion at all; it gets zero points by disqualification.
On the other hand, a trained monkey would bomb the written portion even if it did okay on the physical section.
Is this just because GPT4 doesn’t have a body?
No. Once we get outside the realm of written tests, GPT4 starts to fail at all sorts of things. For example, GPT4 currently (as of publication) lacks a long term memory store, so it can't handle tasks that require long-running memory. It might be able to ace the bar exam, but it can't represent a client across multiple interactions.
Another example comes from Voyager, a paper in which a coding agent plays Minecraft. The agent is eventually able to achieve impressive goals like mining diamonds, but it can't build a house because it has an impoverished visual system.
Once we’re talking about these quite different tasks, we start to have less expectation of correlated performance among AIs.
A large reason for the existence of the g factor in humans is that humans all have approximately the same architecture. The large majority of humans (although not all) have a working visual system, a functioning long term memory, and a working grasp of language. Furthermore, most humans have mastered basic motor control, and we have decent intuitions about physical objects. Most humans are able to set and pursue at least simple goals and act to try to maintain their personal comfort.
On the other hand, AI models can have vastly different architectures! A large language model doesn’t necessarily have multimodal visual input, or motor output, or a long term memory, or the ability to use tools, or any notion of goal pursuit.
When it comes to AI models, it's necessary to break up the concept of intelligence, because AI models are composed of multiple distinct functions. It's very easy to get an AI that passes some sorts of tests but fails others.
Which is smarter, GPT3 or a self-driving car? This isn’t a good question; their architectures are too different.
It's wrong to think that IQ subtests differ from each other only in the subject matter they cover; they can also be functionally different, drawing on distinct underlying capabilities.
It’s also wrong to take model architecture for granted and assume that an AI model will always have a baseline ability to participate in relevant tests.
Sure, the concept of “intelligence” is useful if you want to think about the differences between GPT2 and GPT4. But it breaks down when you’re thinking about AI models with different capabilities.
What’s the goal for ML researchers at this moment? Is it to make smarter models with a higher IQ? Or is it to make more broadly capable models that can do things that GPT4 can’t?
If you’re excited about continuing to minimize next word prediction error, then you will probably find IQ to be a useful concept.
But if you find it unsatisfactory that AI models don’t have memory or agency or vision or motor skills, then you probably want to use a multi-factor model of intelligence rather than a generalized quotient.
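One way to picture the multi-factor alternative is to represent each system as a vector of capability scores rather than a single number; the axes and scores below are invented for illustration:

```python
# Hypothetical capability profiles, scored 0-10 per axis.
# All axis names and numbers here are invented for illustration.
CAPABILITIES = ["language", "vision", "long_term_memory", "motor_control"]

profiles = {
    "chatbot_llm":      {"language": 9, "vision": 0, "long_term_memory": 1, "motor_control": 0},
    "self_driving_car": {"language": 0, "vision": 6, "long_term_memory": 1, "motor_control": 3},
}

def scalar_iq(profile):
    """A single-number summary: averaging throws away the profile's shape."""
    return sum(profile[c] for c in CAPABILITIES) / len(CAPABILITIES)

def dominates(a, b):
    """True only if `a` is at least as capable as `b` on every axis."""
    return all(a[c] >= b[c] for c in CAPABILITIES)

llm, car = profiles["chatbot_llm"], profiles["self_driving_car"]
print(scalar_iq(llm), scalar_iq(car))            # identical scalar summaries...
print(dominates(llm, car), dominates(car, llm))  # ...yet neither dominates the other
```

The scalar summaries come out identical, yet neither system is uniformly better than the other, which is the sense in which "which is smarter?" is not a well-posed question across architectures.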
I think you're right that there's not a g factor for AI systems (besides raw amount of computation, which I think is a huge caveat); I nevertheless think that arguments that route thru "intelligence is a thing" aren't Platonic (b/c of the difference between 'thing' and 'form'). While I normally give 'intelligence explosion' arguments in a scalar fashion, I think they go thru when considering intelligence as a vector or bundle of cognitive capacities instead. The Twitter OP that Scott is responding to calls the argument with cognitive capacities 'much weaker', but I think it's more like "slightly weaker" instead.
I don't think this is looking at humans at the right level of abstraction; I think that it matters a lot that any individual brain is using basically the same neuron design for all bits of the brain (but that neuron design varies between humans). Like, g correlates with reaction speed and firearm accuracy. If brains were using different neural designs for different regions, then even if everyone had the same broad architecture I would be much more surprised at the strength of g.
This makes me suspect that things like 'raw amount of computation' and 'efficiency of PyTorch' and so on will be, in some respect, the equivalent of g for AI systems. Like, yes, we are getting improvements to the basic AI designs, but I think most of the progress of AI systems in the last twenty years has been driven by underlying increases in data and computation, i.e. the Bitter Lesson. [This is making the same basic point as Scott's section 3.]
I think this was a sensible view five years ago, but is becoming less and less sensible in a foundation model world, where the impressive advances in capabilities have not come from adding in additional domains of experience so much as by increasing the size of model and training data. Like, I think the Voyager results you mention before are pretty impressive given the lack of multimodality (did you think next-token-prediction, with a bit of scaffolding, would do that well on Minecraft tasks?).
If the next generations of transformers are multi-modal, I think it really won't take much effort to have them code football-playing robots and do pretty well on your FQ.