Why Q*, if real, might be a game changer

Shmi

Some thoughts based on a conversation at a meetup. Disclaimer: I am less than a dilettante in this area.

TL;DR: if this rumored Q* thing represents a shift from "most probable" to "most accurate" token completion, it might hint at an unexpected and momentous change: from a LARPer emitting the most probable, often hallucinatory, token designed to please the askers (and trainers), to an entity that tries to minimize its error against the unknown underlying reality, whatever that might be. If so, we are seeing a shift from a relatively benign "stochastic parrot" to a much more powerful, and potentially more dangerous, entity.

One thing that is pretty obvious to anyone using the current generation of LLMs is that they do not really care about reality, let alone about changing it. They are shallow erudites of the type you often see at parties: they know just enough about every topic to be impressive in a casual conversation, but they do not care whether what they say is accurate ("true"), only how much of an impression it makes on the conversation partner. Though, admittedly, copious amounts of RLHF make them dull. If pressed, they can evaluate their own accuracy, but they do not really care about it. All that matters is that the output sounds realistic. In that sense, the LLMs optimize the probability of the next token to match what the training set would imply. This is a big and obvious shortcoming, but also, if you are in the "doomer" camp, a bit of a breather: at least these things are not immediately dangerous to the whole human race.

Now, the initial "reports" are that Q* can "solve basic math problems" and "reason symbolically," which does not sound like much on the surface, but, and this is a big but, if this means that it is less hallucinatory in the domain where it works then it might (a big might) mean that it is able to track reality, rather than the pure training set. The usual argument against this being a big deal is "to predict the next token well, you must have an accurate model of the world", but so far it does not seem to be the case, as I understand it.

Whether there is a coming shift from high probability to high accuracy, or even whether that is a meaningful statement to make, I cannot evaluate. But if so, well, it's going to get a lot more interesting.

TL;DR: if this rumored Q* thing represents a shift from "most probable" to "most accurate" token completion,

Q* is most likely an RL method and thus more about a shift from "most probable" to "most valuable".

The usual argument against this being a big deal is "to predict the next token well, you must have an accurate model of the world", but so far it does not seem to be the case, as I understand it.

Why does that not seem to be the case to you?

I'd guess he's thinking of the observation that when tried, humans seem a lot worse at next-token prediction than even a GPT-3 model. This raises questions about the next-token logic: why doesn't superhuman next-token prediction then produce superhuman intelligence?

(Background)

However, I don't think that necessarily works: the original logic is correct. Being an accurate next-token predictor is clearly sufficient in at least some next-token scenarios, like a dataset constructed to include only the most difficult multiple-choice problems (eg. GPQA), because then you can simply pose all tasks in the form of a multiple-choice question and, by definition, a model that predicts the answer token as well as the humans will perform as well as the humans. But the models do not, and the benchmark scores are still sub-human, so we know almost by definition that if we asked humans for the probability of the next-token where next-token=answer, they would predict better. Note that we didn't say "random Internet text" but "only the most difficult problems". The next-token argument doesn't work for average text. Humans predict worse on average text, but predict better on some important subsets of text.

The models are clearly subhuman on many text benchmarks, even though that is still 'just' next-token prediction of the answer-completions. It is also the case that, AFAIK, we have no benchmarks comparing human predictions on much longer passages - the GPT-2 model may beat you if you have to predict the next token, but you can easily beat it if you are given several instances of the next 100 tokens and asked to predict which one is more likely. How can it beat us on average at predicting a random next token, yet lose to us at predicting many next tokens? ("We lose money on each unit we sell, but don't worry, we'll make it up on volume!")

What this is telling us is that the model appears to be 'cheating' by winning a lot of predictive edge over unimportant tokens, even though its errors accumulate and it fails to predict key tokens. The correct comparison can't be 'the average Internet next-token'. It has to be specific key 'golden' tokens, which are analogous to the choice 'a'/'b'/'c'/'d' of answering a multiple choice question: you can predict every token up to that, but if you aren't genuinely understanding, you can't predict the final one of 'a' rather than 'd'. (Or my old example of a murder mystery - thousands and thousands of tokens which must be analyzed deeply in order to predict the final handful of tokens which complete the text "And the murderer is - !".) A model mimics the easy tokens flawlessly, but then once it hits a critical junction point, it goes off the rails, and then the human chugs along past it. In a benchmark dataset, those junction points come up regularly and are indeed the entire point, while during random Internet texts, there might be zero such points, depending on how repetitive or superficial or mundane the text is.
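A toy illustration of how both things can be true at once. The probabilities below are made up for the example (they are not measurements of any model or human); the last position stands in for the 'golden' token.

```python
import math

# Hypothetical probabilities each predictor assigns to the TRUE next token.
# Positions 0-3 are easy filler tokens; position 4 is the 'golden' token
# (e.g. the answer letter of a multiple-choice question).
model_probs = [0.9, 0.9, 0.9, 0.9, 0.25]
human_probs = [0.5, 0.5, 0.5, 0.5, 0.9]


def mean_log_loss(probs):
    """Average negative log-likelihood in nats per token."""
    return -sum(math.log(p) for p in probs) / len(probs)


def golden_log_loss(probs):
    """Negative log-likelihood of the final (golden) token only."""
    return -math.log(probs[-1])


model_avg = mean_log_loss(model_probs)
human_avg = mean_log_loss(human_probs)
model_gold = golden_log_loss(model_probs)
human_gold = golden_log_loss(human_probs)

print(f"average loss: model={model_avg:.3f} human={human_avg:.3f}")
print(f"golden loss:  model={model_gold:.3f} human={human_gold:.3f}")
```

With these assumed numbers the model has the lower average loss but the higher loss on the one token that matters, which is the pattern described above.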

So why does training on low-quality average tokens demonstrably work even though the models are superhuman at that, and the token prediction argument is inapplicable to such tokens? Well, that's a good question.

The easiest answer (drawing on active learning / experiment design / reinforcement learning / coreset / machine teaching observations about optimal sample-efficiency & how far away LLMs are in pretraining from what seems like human sample-efficiency) is that the models have such large capacities that they can learn all the superficial stuff that humans have not, which is useful for predicting the average next-token but does not itself elicit the deep capabilities we want; it is then the occasional 'gold' token which very very gradually forces the model to learn those too. So a model is brrring through vast reams of Internet text, successfully memorizing every meme or stylistic tic or spammer text over millions of tokens, and once in a while, someone says something actually meaningful to predict like "I put my ice cream in the microwave and then it ______" and it makes a mistake in predicting "melted" and learns a bit about real-world physics and commonsense, and then goes back to the memorization. There is, I think, a good deal of evidence for this. (And this predicts, among other things, that it should be possible to train models of great intelligence with many OOMs less data than we do now.)

I wonder if giving lower rewards for correctly guessing common tokens, and higher rewards for correctly guessing uncommon tokens would improve models? I don't think I've seen anyone trying this.

Found: https://ar5iv.labs.arxiv.org/html/1902.09191 - Improving Neural Response Diversity with Frequency-Aware Cross-Entropy Loss.
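For concreteness, a minimal sketch of what "lower reward for common tokens, higher for uncommon ones" could look like as a loss. This is a plain inverse-frequency weighting chosen for illustration; it is an assumption, not the formula from the linked paper, and the function names are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(corpus_tokens):
    """Per-token weights proportional to 1/frequency.

    Normalized so the average weight over the corpus is 1, which keeps
    the overall loss scale comparable to unweighted cross-entropy.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab_size = len(counts)
    return {tok: total / (c * vocab_size) for tok, c in counts.items()}


def weighted_cross_entropy(target_tokens, target_probs, weights):
    """Mean of -w(token) * log p(token) over the sequence.

    target_probs[i] is the probability the model gave to target_tokens[i].
    """
    losses = [
        -weights[tok] * math.log(p)
        for tok, p in zip(target_tokens, target_probs)
    ]
    return sum(losses) / len(losses)


corpus = ["the", "the", "the", "melted"]
weights = inverse_frequency_weights(corpus)

# A mistake on the rare token now costs more than the same mistake
# on the common token.
loss_rare_wrong = weighted_cross_entropy(["the", "melted"], [0.9, 0.1], weights)
loss_common_wrong = weighted_cross_entropy(["the", "melted"], [0.1, 0.9], weights)
print(weights, loss_rare_wrong, loss_common_wrong)
```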

It's not obvious that 'uncommon' tokens are good or that that's a good approach.

They could also just be unlikely or garbage, and your screening method for filtering for 'uncommon' tokens may ensure that they are garbage, or otherwise sabotage your model. (This is the 'mammogram screening problem': even if you have a good filter, if you run it across trillions of tokens, you will wind up throwing out many good tokens and keeping many bad tokens. There are a number of LLM-related papers about the horrifically bad data you can wind up compiling if you neglect data cleaning, particularly in multilingual translation when you're trying to scrape rare languages off the general Internet.)
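The base-rate arithmetic behind that parenthetical, as a small sketch. All the numbers (a trillion tokens, 0.1% of them worth keeping, a filter that is right 99% of the time in both directions) are assumptions picked for illustration, not estimates of any real dataset or filter.

```python
def screening_outcome(n_items, prevalence, sensitivity, specificity):
    """Count what a binary filter keeps and drops.

    prevalence:  fraction of items that are actually wanted
    sensitivity: P(filter keeps item | item is wanted)
    specificity: P(filter drops item | item is unwanted)
    """
    wanted = n_items * prevalence
    unwanted = n_items - wanted
    kept_wanted = wanted * sensitivity
    dropped_wanted = wanted - kept_wanted
    kept_unwanted = unwanted * (1.0 - specificity)
    precision = kept_wanted / (kept_wanted + kept_unwanted)
    return {
        "kept_wanted": kept_wanted,
        "dropped_wanted": dropped_wanted,
        "kept_unwanted": kept_unwanted,
        "precision": precision,
    }


result = screening_outcome(
    n_items=1e12,       # assumed corpus size in tokens
    prevalence=0.001,   # assumed fraction of tokens worth keeping
    sensitivity=0.99,   # assumed filter quality
    specificity=0.99,
)
for name, value in result.items():
    print(f"{name}: {value:.4g}")
```

Under these assumptions only about 9% of what the filter keeps is actually wanted, and roughly ten million wanted tokens are thrown away, even though the filter is "99% accurate".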

Nor are good datapoints necessarily made up of uncommon tokens: there are zero uncommon tokens in my 'microwave' example.

(Data pruning & active learning are hard.)

Grade-school math, where problems have a single well-defined answer, seems like an environment in which a Q-learning-like approach to figuring out whether a step is valuable, from whether it helps lead you to the right answer, might be pretty feasible (the biggest confounder would be cases where you manage to make two mistakes that cancel out and still get to the right answer). Given something like that, a path-finding algorithm along the lines of A* for finding the shortest route to the correct answer would then become feasible. The net result would be a system that, at large inference-time cost, could ace grade-school math problems, and by doing so might well produce really valuable training data for then training a less inference-expensive system on.
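To make the shape of that idea concrete, here is a toy A*-style search. Nothing here is known about the actual Q*; the puzzle (reach a target integer using only "+1" and "*2" steps), the function names, and the hand-written estimate standing in for a learned step-value are all assumptions for illustration.

```python
import heapq

# The allowed 'reasoning steps' in this toy puzzle.
STEPS = {"+1": lambda x: x + 1, "*2": lambda x: x * 2}


def estimated_steps_to_go(state, target):
    """Stand-in for a learned value estimate.

    A lower bound on remaining steps: for state >= 1 no step more than
    doubles the value, so we need at least this many doublings.
    """
    k = 0
    while state < target:
        state *= 2
        k += 1
    return k


def search(start, target):
    """Return a shortest list of step names from start to target, or None."""
    if start < 1:
        raise ValueError("this toy puzzle assumes start >= 1")
    # Heap entries: (steps so far + estimate to go, steps so far, state, path)
    frontier = [(estimated_steps_to_go(start, target), 0, start, [])]
    best_cost = {start: 0}
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if state == target:
            return path
        for name, step in STEPS.items():
            nxt = step(state)
            if nxt > target:
                continue  # both steps only increase the value, so prune
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                priority = new_cost + estimated_steps_to_go(nxt, target)
                heapq.heappush(frontier, (priority, new_cost, nxt, path + [name]))
    return None


path = search(1, 10)
print(path)
```

In the setup the comment describes, the estimate would instead come from something trained on whether steps end up leading to correct answers, and the paths found this way would be the candidate training data.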