I recently talked with someone about large language models and the risks they pose. They suggested that LLMs would be bounded by human intelligence/knowledge because LLMs learn from the internet text that humans have written.
Now, this was a very casual conversation (small talk at a housewarming party), and they changed their mind a few minutes later. But perhaps this is an assumption people often make about LLMs.
Anyway, I'll briefly explain why LLMs are not bounded by human capacities.
I think the picture they had in their head was something like this:
Let's say that all the internet text has been written by humans and we train GPT-N to model the text. There are many ways that GPT-N might do this, but at the very least, GPT-N can just model the humans writing the internet text. This strategy is overkill — like swatting a fly with a sledgehammer — but it suggests that GPT-N has no incentive for more-than-human intelligence or more-than-human knowledge.
- Concretely, GPT-N doesn't need to "know" things that humans don't know.
For example, humans currently don't know the genome of the next pandemic virus, and therefore the genome of the next pandemic virus won't appear in the internet text, and therefore GPT-N doesn't need to "know" the genome of the next pandemic virus.
- Similarly, GPT-N doesn't need any skills that humans don't possess.
For example, humans currently can't prove really tricky theorems in mathematics, and therefore proofs of really tricky theorems won't appear in the internet text, and therefore GPT-N won't need to prove really tricky theorems.
Here's the problem. If GPT-N wants to perfectly model the internet text, it must model the entire causal process which generates it. And this causal process includes not just the humans, but the entire universe that the humans interact with.
For example, suppose the internet text includes many chess games between Stockfish engines. If GPT-N wants to minimise cross-entropy loss on internet text, then GPT-N would need to learn chess at the level of Stockfish. It doesn't matter that it was a human who physically typed the games onto the internet. A model which could play superhuman chess would achieve lower cross-entropy loss on the current internet text.
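To make the loss claim concrete, here is a toy calculation. The position, the three candidate moves, and all the probabilities are made up for illustration; the only point is that a predictor whose distribution sits closer to the engine's gets a lower cross-entropy on engine-generated moves than a predictor that imitates a typical human.

```python
import math

def cross_entropy(true_dist, model_dist):
    """Expected negative log-likelihood (in nats) of model_dist on samples from true_dist."""
    return -sum(p * math.log(model_dist[move])
                for move, p in true_dist.items() if p > 0)

# Hypothetical next-move distribution of the engine in some position.
engine = {"Nf3": 0.90, "Nc3": 0.08, "h4": 0.02}

# Hypothetical predictor that models a typical human player.
human_like = {"Nf3": 0.40, "Nc3": 0.35, "h4": 0.25}

# Hypothetical predictor that has (nearly) learned the engine's play.
engine_like = {"Nf3": 0.88, "Nc3": 0.09, "h4": 0.03}

loss_human = cross_entropy(engine, human_like)
loss_engine = cross_entropy(engine, engine_like)

print(f"human-like predictor:  {loss_human:.3f} nats")
print(f"engine-like predictor: {loss_engine:.3f} nats")
```

The loss is bounded below by the entropy of the engine's own distribution, so the only way to approach that floor on this slice of the text is to predict the way the engine plays.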
(What exactly do I mean by "learn chess"? Either of the following two definitions will work:
- There is a circuit in the weights that instantiates a chess engine.
- You could make a very short & quick chess engine if you had GPT-N as an oracle. Specifically, you could repeatedly feed GPT-N the prompt "The following is a complete game between two Stockfish engines: 1.e4 c5 2.c3 d5 3.exd5 Qxd5 4.d4 Nf6 5.Nf3 " and play the move that GPT-N predicts comes next.
The first definition is looking inside the weights, and the second definition treats the weights as a black-box.)
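The black-box definition can be sketched in a few lines. This is only an illustration under assumptions: `oracle` is a hypothetical stand-in for GPT-N (any function from a prompt string to a text completion), and the helper names are mine. The sketch does no legality checking; it simply trusts whatever the oracle predicts.

```python
from typing import Callable, List

PROMPT_HEADER = "The following is a complete game between two Stockfish engines: "

def format_moves(moves: List[str]) -> str:
    """Render moves as numbered movetext with a trailing space, e.g. '1.e4 c5 2.c3 '."""
    parts = []
    for i, move in enumerate(moves):
        # White's moves (even indices) carry the move number.
        parts.append(f"{i // 2 + 1}.{move}" if i % 2 == 0 else move)
    return " ".join(parts) + (" " if parts else "")

def build_prompt(moves: List[str]) -> str:
    """Prompt for the game so far, in the format used in the text above."""
    return PROMPT_HEADER + format_moves(moves)

def oracle_engine_move(moves: List[str], oracle: Callable[[str], str]) -> str:
    """Return the next move according to the oracle's predicted continuation."""
    completion = oracle(build_prompt(moves))
    for token in completion.split():
        # Drop a leading move number such as '6.' or '6...' if present.
        move = token.split(".")[-1]
        if move:
            return move
    raise ValueError("oracle returned no move")

# Stub oracle so the sketch runs; a real oracle would be a call to the model.
def stub_oracle(prompt: str) -> str:
    return "Bg4 6.Be2 e6"

game = ["e4", "c5", "c3", "d5", "exd5", "Qxd5", "d4", "Nf6", "Nf3"]
print(oracle_engine_move(game, stub_oracle))
```

All of the chess knowledge lives in the oracle; the wrapper is just string formatting, which is what makes the resulting engine "very short & quick".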
Note: I am not claiming that GPT-N can actually learn superhuman chess; maybe the architecture isn't capable of instantiating a good chess engine. But I am claiming that "GPT-N can't learn superhuman chess" is not implied by "humans wrote all the text on the internet". It is probably better to imagine that the text on the internet was written by the entire universe, and humans are just the bits of the universe that touch the keyboard.