After having my mind blown by the GPT-2 Python coding demo, I realized that all the "skills" that GPT-2 and especially GPT-3 possess fall into a certain category.

Autoregressive language models like the GPT models can write text by recursively predicting the next token. The exciting development of the last two years is the discovery that, if the language model is good enough, a variety of skills are implicit in this functionality. So really good language models do not just write really good text. They can be queried for facts, they can be set up as conversation partners, they can solve SAT-like analogy tasks, they can do arithmetic, they can translate between languages, and they can even generate (often) functioning code from a verbal description.
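To make "recursively predicting the next token" concrete, here is a minimal sketch of the generation loop, using the Hugging Face transformers GPT-2 checkpoint. The prompt, greedy decoding, and 20-token budget are purely illustrative choices on my part, not taken from any of the demos above.

```python
# Minimal sketch of autoregressive generation: the model predicts the next token
# and appends it to the context, over and over. Assumes a recent version of the
# Hugging Face `transformers` library (where outputs expose `.logits`).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Illustrative prompt; the quality of the continuation is beside the point here.
tokens = tokenizer.encode("The cat sat on the", return_tensors="pt")

with torch.no_grad():
    for _ in range(20):
        logits = model(tokens).logits            # (1, seq_len, vocab_size)
        next_token = logits[0, -1].argmax()      # greedy: take the most likely token
        tokens = torch.cat([tokens, next_token.view(1, 1)], dim=1)

print(tokenizer.decode(tokens[0]))
```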

Why does a model that just predicts the very next token develop the ability to translate between languages? Predicting the next token is, in general, very hard. But there are occasions where it becomes a lot easier. These are basically loss-function sinks: places where the language model can suddenly reduce the loss much more than is generally possible. Translation is a very intuitive example, because the original text very strongly constrains the possible continuations of the translation.
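One rough way to see this intuition in numbers: score the same English sentence with GPT-2 once on its own and once after a French source sentence, and compare the average next-token cross-entropy over the English tokens. This is only an illustrative probe I am adding here, using the Hugging Face transformers library; the sentences are made up and the exact numbers depend on the checkpoint.

```python
# Compare how hard the same target sentence is to predict with and without a
# constraining source sentence in the context. Assumes a recent `transformers`
# version where the model output exposes `.loss`.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def target_loss(prefix: str, target: str) -> float:
    """Mean next-token cross-entropy over the target tokens, given the prefix."""
    prefix_ids = tokenizer.encode(prefix)
    target_ids = tokenizer.encode(target)
    ids = torch.tensor([prefix_ids + target_ids])
    labels = ids.clone()
    labels[0, : len(prefix_ids)] = -100   # -100 = ignore: don't score the prefix itself
    with torch.no_grad():
        return model(ids, labels=labels).loss.item()

target = " The cat is sleeping on the mat."
# Unconstrained: many continuations are plausible, so the loss tends to be high.
print(target_loss("", target))
# Constrained by a source sentence: if the model exploits the source, the loss should drop.
print(target_loss("French: Le chat dort sur le tapis. English:", target))
```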

But, on second thought, really all the skills mentioned above fall into this category! Except maybe for holding a conversation, they all translate information from one modality into another: questions into answers, comment strings into code, expressions into results, French into English, etc. And even in a conversation, each remark constrains the possible replies. So what seems to be happening with more powerful models is that they increasingly detect these loss-function sinks and exploit them for what they are worth.

Along these lines, we can divide what GPT models can learn to do into two very different camps:

On the one hand, they can learn a new data distribution from scratch. This is what the training objective is ostensibly about, and it can be very hard. When GPT-2 was published, I was briefly intrigued by the possibility of training it on chess games, but when this idea was realized by at least two different people, the result was quite underwhelming. I don't doubt that GPT-2 would in principle be able to infer the board state from the moves and learn to predict moves accurately, but this would take a lot of compute and data.

On the other hand, they can learn to map from a distribution that they have already modeled to a new distribution. This can be much easier if the distributions are related somehow. I suspect training on the Python repos alone would not work nearly as well as fine-tuning a fully trained GPT-2. (In the recent GPT-f paper, the authors explicitly show that text-pretrained models do better at learning to predict formulas than models without pretraining, and models additionally pretrained on math-heavy text do better still.)
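As a sketch of that contrast (not a reproduction of any actual experiment), the two setups would look roughly like this with the Hugging Face transformers library. The corpus path and hyperparameters are placeholders I made up, and TextDataset is deprecated in newer versions in favour of the datasets library.

```python
# Contrast: fine-tune the pretrained GPT-2 vs. train the same architecture from
# randomly initialized weights on a code corpus. Paths and hyperparameters are
# placeholders for illustration only.
from transformers import (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer,
                          TextDataset, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

pretrained = GPT2LMHeadModel.from_pretrained("gpt2")   # transfer: reuse the text-trained weights
from_scratch = GPT2LMHeadModel(GPT2Config())           # same architecture, random initialization

dataset = TextDataset(tokenizer=tokenizer,
                      file_path="python_corpus.txt",   # hypothetical dump of .py files
                      block_size=256)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="gpt2-python",
                         num_train_epochs=1,
                         per_device_train_batch_size=2)

# Swap in `from_scratch` here to get the from-scratch baseline the paragraph
# compares against.
Trainer(model=pretrained, args=args, data_collator=collator,
        train_dataset=dataset).train()
```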

To me this framework makes the code generation demo less miraculous and more part of a general pattern. It also informs my thinking about what to expect from future iterations of GPT. For example, I suspect that symbol grounding, in the sense of connecting words correctly to images, is something GPT-4 or GPT-5 will be very good at.
