Deep Learning has come to be associated with a general rule: Bigger Is Better. For tasks such as image or text processing, this has seemed to pan out: more parameters have almost always produced more capacity. In the domain I work in, Natural Language Processing (NLP), we use Large Language Models (LLMs) that can have hundreds of billions of parameters and often give fantastic performance. They can remember trillions of facts and regurgitate them at will. However, anyone who has used one of these models, such as ChatGPT, will be familiar with one frustrating phenomenon: they are not good at listening and adapting to new information. This shows up in counterfactual reasoning examples, and it has proven a persistent problem in NLP that is not going away with bigger models.
I believe this is not an accident, and I want to share the theory and evidence with you. We will walk through the evidence suggesting that general reasoning ability degrades with increasing model scale, along with the chain of philosophy, theory, and experiments underpinning these conclusions. Specifically, we will:
Discuss the information-theory tradeoff that occurs between memorization and reasoning. This will consist of defining static abstraction as being based on reflexes and memorization, while dynamic abstraction is built on intelligence and reasoning ability.
Discuss the construction of the novel Two Pass Perplexity Ratio (TPPR) metric, which isolates pure dynamic reasoning from the usual mix of reasoning plus memorization.
Show that larger models in the same series are consistently worse at dynamic abstraction under these metrics, even if they improve on static abstraction.
Show that, as the new theory predicts, models widely thought to be better reasoners have better TPPR scores.
From this, I will argue that bigger is not always better, and that the issues we see with general reasoning in LLMs get worse with increasing scale.
Brief Background
Those well versed in machine learning are encouraged to skip this section
Machine Learning (ML) is the name for a class of algorithms in which, after designing a Model - roughly analogous to a brain - massive quantities of training data are fed through the model so that it gets better at performing a task. These models contain vast quantities of tunable knobs known as Parameters. These are roughly analogous to a slope or a y-intercept: learned constants that control the main equation.
The Deep Learning (DL) explosion, beginning in the late 2000s, occurred as training deep networks with the backpropagation algorithm became computationally tractable. This algorithm, in combination with a procedure called gradient descent, has the model compute a value known as the "loss", which represents how much we dislike the current result - higher is worse. We then take a small step in the direction that lowers the loss. Repeating this over and over is like walking down a hill in the direction that descends fastest. Since GPU acceleration became commonplace, this has become the preferred way of solving most ML problems, because it works even when it is not obvious how to statically compute the best solution from the training data. For an analogy, consider being asked to plan out the fastest way down a mountain versus figuring it out as you go. Planning it out is a lot more work, but simply walking downhill gets the job done, even if it is not always optimal.
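For the curious, here is a minimal sketch of that loop on a toy one-parameter problem. This is my own illustrative example, not anything from a real training pipeline; real models have billions of parameters and far messier loss surfaces, but the loop is conceptually the same.

```python
# Toy gradient descent: one parameter, one quadratic "loss" we want to minimize.
def loss(w):
    # How much we "dislike" the current parameter value; higher is worse.
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of the loss: the direction of steepest increase.
    return 2.0 * (w - 3.0)

w = 0.0              # start from an arbitrary parameter value
learning_rate = 0.1  # size of each downhill step
for step in range(50):
    w -= learning_rate * grad(w)  # take a small step in the direction that lowers the loss

print(w)  # ends up very close to 3.0, the value that minimizes the loss
```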
Intelligence Versus Memorization
In John Searle's Chinese Room thought experiment, from around 1980, a person who knows no Chinese is locked in a room and fed slips of paper with Chinese writing on them. They must return another sentence in Chinese. In the room is a computer program which, given a sequence of Chinese characters, says what other sequence of Chinese characters to return. Searle's original narrow claim was that a computer can thus be programmed to appear intelligent while relying on brute memorization to be effective.
As indeed I plan to demonstrate.
For almost every purpose, the person in the room appears to know fluent Chinese; if you give him a question, he will give you a comprehensive and clear answer. Yet, the fragility of the system is illustrated if you give him a sequence which is, by chance or mishap, not included in the program. The person in the room freezes up, and does not know what to do.
P1: Intelligence requires being able to adapt and learn correctly to novel situations
P2: Memorization and pattern matching cannot adapt correctly to novel situations
C: Memorization alone is not sufficient for intelligence.
Given this, what is intelligence then? This is an entire philosophy in and of itself, but the machine learning field appears to have settled on an answer. That answer is Abstraction.
The relevant Oxford Dictionary definition of abstraction is:
The process of considering something independently of its associations, attributes, or concrete accompaniments
It involves finding the main idea, the core essence, of perhaps a group of phenomena, and then using that to draw conclusions in the future. This, ML has decided, is intelligence. And indeed, LLMs do show that effect. Train them on data and they get better at future predictions. They can chat with you. But do they really fulfill all the necessary criteria? Recall that our conclusion that memorization alone is not sufficient was based on the idea that intelligence requires the ability to adapt and learn in novel situations. Has that been sufficiently explored? Can models learn and adapt as they work?
Because the word abstraction has become so overloaded in the field, and so associated with intelligence without attention to this nuance, I will not try to salvage it. Instead, we define two novel breakdowns of Abstraction:
Static Abstraction: Abstraction based primarily on memorization and reflexive responses. The Bayesian version of the Chinese Room.
Dynamic Abstraction: Abstraction capable of integrating new information and developing new patterns for usage in the future; the ability to become better as we learn.
The latter, I would argue, is the more fundamental form of intelligence. That is, I hypothesize that Dynamic Abstraction is Intelligence. We shall now go seek some philosophical evidence.
The Perfect Memory Thought Experiment
Let's think this through from first principles.
We consider two philosophy students taking a test, Student A and B.
Student A is a standard student. He reads a problem and absorbs the contents. Using Dynamic Abstraction, he distills them into the core pieces of relevant information and updates his priors. He then compares these against his Static Abstractions to retrieve relevant knowledge, and performs more Dynamic Abstraction and reasoning to reach his answer.
We now consider Student B. Student B has two interesting quirks. First, she has a perfect and enormous memory. Second, her eccentric grandfather built her a special helmet that can stream information directly into her brain. The night before the test, she went through the entire internet and absorbed the answers to every philosophy test ever administered. She then takes the test, finds the closest parallel for each question, rewrites the answers a bit, and hands in the results. She gets the top score. She does, however, bomb the last question, which she had never seen before.
Is Student A or Student B the one who exhibited more intelligence? Judging by test scores alone, Student B would appear to be. Yet is that actually the case?
P1: Dynamic Abstraction is associated with true intelligence
P2: Student A showed far more dynamic abstraction than Student B
C1: Despite appearances to the contrary, Student A is exhibiting more intelligence
C2: It is possible to mistake memorization and static abstraction for intelligence if care is not taken to separate static vs dynamic abstraction.
I observe from this that memorization and general reasoning appear to be in tension: the more you have of one, the less you need the other. I believe this is a general principle that shows up in a wide variety of situations, but I acknowledge that a more robust philosophical exploration of the concept is needed, and I call on the community to provide it. Nonetheless, for the remainder of this post we will adopt the prior that Dynamic Abstraction is Intelligence.
P1: Dynamic abstraction is intelligence
P2: Memorization can increase performance while decreasing dynamic abstraction
C: Therefore, Memorization may increase performance while decreasing intelligence
We return now to LLMs. What is actually happening when we scale up a model? Is dynamic abstraction getting better? It is possible reasoning is improving. But from first principles, we know that more memory means less reasoning is needed. This by itself would suggest that for every task there is an ideal number of parameters, and going above it will continue to improve benchmarks, but at the cost of dynamic abstraction.
But is there an underlying mechanism that could cause this effect in the real world, on machine learning models? In short, yes.
Underlying Mechanisms
A simple parameter contention argument is sufficient to provide a possible underlying mechanism for this effect, and to encourage further exploration. This mechanism is, in fact, the issue I first noticed some eight months ago, and it is what drove me into this philosophical and experimental cave and out the other end.
A small model being fed a large amount of data often enters a state known as being Underfit. In this condition, there are not enough parameters to capture all the nuances of the world it is training on. This produces a situation during training known as Parameter Contention. Parameter contention may be thought of, at an abstract level, as a situation in which some pieces of text are telling the model to turn its parameter 'knobs' in one direction, while others are asking for a change in the opposite direction. The model does not have enough knobs to satisfy all objectives. So instead it seeks out more general principles that can be applied in many locations based on the context. At a technical level, the contention forces the model to slow down during gradient descent and hunt for more general patterns that respond to context.
Meanwhile, in a typical modern LLM, billions of parameters are available. Parameter contention does still occur, but when it does there are always uncontested parameters available elsewhere to store information in. The model still learns some context-dependent responses, but it can nonetheless store the majority of its insights in the easier-to-change uncontested parameters. Recall that the small model was forced to slow down and resolve the issue by developing dynamic abstraction capability in the first place; the large model needs to do this much less. More memory means less parameter contention, which means less need to develop dynamic reasoning over brute-force memorization. Performance may indeed improve for in-distribution tasks, but this may not transfer to novel situations, as the model is only learning input-response pattern matching, not deep understanding.
This, I believe, is the heart of what drives the development of Dynamic Abstraction: the balancing of resources under constrained conditions. As models get larger, they may become worse at it. This is not directly reflected in any current metric I know of, though indirect support exists in the form of transformers' failures on hard reasoning tasks. Can we explore this effect more deeply and isolate dynamic abstraction? It should be directly testable if we can only create some better metrics.
A Proper Intelligence Metric
To even begin to explore this, we need better metrics. A Metric in machine learning is a way of measuring information, much like a score or an average, that lets a human grade some capacity of the model. We have already seen one commonly used metric known as the "Loss". We will shortly build a metric known as the Two Pass Perplexity Ratio out of the metric known as Perplexity. First, however, we should familiarize ourselves with the problem space. Let us begin.
Familiarization and Perplexity
To begin, we need to understand the idea of Entropy. Entropy, as used in information theory and in machine learning, is associated with an underlying probability distribution. It can be roughly thought of as how "surprised" we are to see a particular result or distribution. Entropy is a kind of quantity known as a negative log probability, which is a fancy way of saying that if you exponentiate its negative you get a probability back. It is formally defined, in machine learning, as:
$$H(X) = -\sum_{x} P(x) \ln P(x)$$
Where P(x) is the underlying probability distribution and x ranges over the possible outcomes of X. Note the negative sign out front, which ensures the quantity is positive, as expected. The smallest value we can possibly see here is zero, and higher is more surprising. Cross Entropy is a modified form of this that asks how "surprised" we are to see one distribution (one outcome) given that we were expecting another. It has the formal form given below:
$$H(P, Q) = -\sum_{x} P(x) \ln Q(x)$$
Note in this case, P is usually referred to as the Target or Expected distribution. Q is the Predicted distribution. Like entropy, larger is worse and the smallest possible value is zero.
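To make these formulas concrete, here is a small numerical illustration with a made-up three-outcome distribution; the numbers are purely for demonstration.

```python
import math

P = [0.7, 0.2, 0.1]  # "target" / expected distribution
Q = [0.4, 0.4, 0.2]  # "predicted" distribution

entropy = -sum(p * math.log(p) for p in P)                   # H(P)
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))  # H(P, Q)

print(f"H(P)    = {entropy:.3f}")        # surprise of P measured against itself
print(f"H(P, Q) = {cross_entropy:.3f}")  # >= H(P); equal only when Q matches P exactly
```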
With this background, we can talk about how LLMs actually work. They are fed words in a sequence, with each word (or word piece) known as a Token. Given all the words so far, the model is asked to produce a Predicted distribution that tells us what word the model wants to say next, and it is trained with feedback to more closely match the Target distribution. We do this for all the words in a sequence, across billions of sequences, and much like a regression we get a pretty good idea of how to predict upcoming words. This is how cross entropy is used to produce the Loss in the first place, which is then minimized using gradient descent as described before.
Trying to interpret how good a job we have done can be tricky, however. You can check whether the right word had the highest probability or not, but it is difficult to clearly state how wrong the model is, on average, using raw Accuracy or Loss: maybe the model was trying to predict a chemistry textbook, for instance, and it is unfair to expect it to get everything right. Instead, in machine learning we often use Perplexity. Perplexity is related to cross entropy, and can be approximately thought of as "how many reasonable choices the model thinks it had at a particular word". In ML, perplexity usually means the average of these scores across all words, and this is the form we shall use going forward. It is one in the perfect case, indicating the model knew it had exactly one choice. It can in theory go infinitely high, though in practice that never happens, as it would mean the model thought it had infinitely many choices. Formally, it is defined as
$$\mathrm{Perplexity}(P, Q) = e^{\frac{1}{N}\sum_{i} H(P_i, Q_i)}$$
That is, this is the exponentiation of the cross entropy. Note that the index i now represents the position of the word in the sequence, and that, as explained earlier, we divide by the number of words N to get the average before exponentiating.
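In practice, a model's average perplexity on a piece of text takes only a few lines to compute. Here is a sketch using the Hugging Face transformers library, with "gpt2" as an illustrative model choice; when labels are supplied, the library returns the mean cross entropy over tokens, and exponentiating it gives the perplexity.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "The fox jumped over the car"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model report the average cross entropy over the tokens.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)  # e raised to the mean cross entropy
print(perplexity.item())
```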
This tool is what is usually used when deciding whether a model is good at general intelligence tasks. But do you notice an issue here? We are taking an average. Doing well on most of the tokens - most of the words - is straightforward, and that may mask the genuinely hard predictions. For instance, an LLM processing an entire novel may only need to predict a city's name three or four times. Perplexity can thus be gamed by local prediction and memorization. This is not inherently a bad thing - performance is in fact increasing - but it is insufficient if we wish to isolate dynamic abstraction. It is into this void that I introduce the Two Pass Perplexity Ratio.
Two Pass Perplexity Ratio
The key insight behind the Two Pass Perplexity Ratio is to ask not how many choices the model has on average, but what the ratio between initial and final choices is once the model has had a chance to learn about the situation. In theory, if the model is learning by applying dynamic abstraction, this ratio should reflect it.
In more detail, if a model is shown a sequence of text, then shown the same sequence again, it should be able to use the fact that it has seen the sequence once before to make the second set of predictions easier. This would demonstrate dynamic abstraction, and it is exactly the dynamic the Two Pass Perplexity Ratio exploits. Formally, TPPR starts with a sequence of Tokens. We then place special tokens, known as Beginning of Sequence (BOS) and End of Sequence (EOS) tokens, in the following pattern:
"[BOS] The fox jumped over the car [BOS] The fox jumped over the car [EOS]"
This is repeated for all relevant sequences which we wish to test this on. We are now going to label the first set of sequences T1 and the second set T2. The sequence above would be
"[BOS] T1 [BOS] T2 [EOS]"
The Two Pass Perplexity Ratio is defined as follows. Run the model over the full sequence and collect its predictions for T1 and T2, with each token conditioned on everything that came before it. These are Q1 = Predictions(T1) and Q2 = Predictions(T2). Take their perplexities, producing P1 = Perplexity(T1, Q1) and P2 = Perplexity(T2, Q2). Finally, the Two Pass Perplexity Ratio is:
$$\mathrm{TPPR}(P_1, P_2) = \frac{P_2}{P_1}$$
Notably, and critically, static abstraction is cancelled out by this metric, since any static abstraction should show up equally in P1 and P2 and therefore drops out as a roughly constant factor when we take the ratio. Any improvement that remains should be due solely to usage of the context.
A score near zero means essentially perfect learning: the model knew nearly every choice during the second pass because it saw the sequence during the first pass, and it is dynamically adapting on the fly to the new information. It is displaying dynamic abstraction and online learning. A score of one, meanwhile, means no dynamic abstraction at all: the model was just as confused during the second pass as during the first. A score higher than one indicates something is wrong with your computation or your model; it should not get harder to predict tokens you have already seen.
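Below is a minimal sketch of the metric, assuming a Hugging Face causal language model. The function name and implementation details are my own, not a reference implementation, so treat it as illustrative rather than definitive.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def two_pass_perplexity_ratio(model, tokenizer, text, device="cpu"):
    bos = tokenizer.bos_token_id
    eos = tokenizer.eos_token_id
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]

    # Build "[BOS] T1 [BOS] T2 [EOS]", where T1 and T2 are the same token sequence.
    full = [bos] + ids + [bos] + ids + [eos]
    input_ids = torch.tensor([full], device=device)

    with torch.no_grad():
        logits = model(input_ids).logits

    # token_nll[j] is the surprise at predicting position j+1 given everything before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    token_nll = -log_probs[torch.arange(targets.numel()), targets]

    n = len(ids)
    ppl_1 = torch.exp(token_nll[:n].mean())                 # first pass: the tokens of T1
    ppl_2 = torch.exp(token_nll[n + 1 : 2 * n + 1].mean())  # second pass: the tokens of T2, after the second BOS
    return (ppl_2 / ppl_1).item()

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
print(two_pass_perplexity_ratio(lm, tok, "The fox jumped over the car"))
```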
This is the intelligence metric I used to probe these models. The hypothesis? Lower TPPR indicates more dynamic abstraction and thus more general reasoning. The results? The two appear to be correlated, and larger models get worse at it.
Overthrow of a Paradigm?
We shall now begin to question the dominant Scaling Paradigm of NLP. I will not make any claim as ludicrous as that current scaling is useless. For most tasks, bigger models provide better solutions; recall there was nothing inherently wrong with Student B, she simply was not adaptable. But therein is the rub: if we wish for Artificial General Intelligence, scaling may not be the way to get it. It appears bigger models may not only be failing to reliably produce general reasoners, but may in fact be degrading the very general reasoning abilities we are trying to cultivate.
Two core correlations need to be established to ground this claim. These are:
Models within the same series and architecture tend to show worse dynamic reasoning when scaled up, indicating more reliance on memorization. This can be detected using TPPR, and it grounds the claim that TPPR worsens with increasing scale.
Models that are widely considered better reasoners have better TPPRs than models considered worse, given the same number of parameters. This grounds the correlation between TPPR and intelligence.
We shall now set out to establish both. This is done with experiments, and reproduction is of course encouraged. For the interested, the colab used to run the experiments is linked at the end of the post.
Larger Models Exhibit Worse TPPR Ratios
Two model series were examined: the GPT-2 series in its small, medium, and large forms, and the Pythia series at 410 million, 1.4 billion, 2.8 billion, and 6.9 billion parameters. The Pythia series showed an odd anomaly that could not be rectified for the 2.8 billion parameter case, and that case is excluded from the trend line. Maybe early stopping triggered anomalously early? All models were tested on the test split of wikitext2; those replicating the results may feel free to switch the colab to the train split, but these results are still statistically significant. Note as well that extremely short sequences, defined as having fewer than 40 tokens, were filtered out, as they do not have enough tokens to establish a clean perplexity signal. Do take note: this is a necessary stage in any TPPR measurement pipeline, and failing to length-filter will corrupt your results during replication.
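Concretely, the measurement loop looks roughly like the sketch below. It reuses the two_pass_perplexity_ratio helper sketched earlier and assumes the datasets, transformers, and numpy packages; the actual colab may differ in its details.

```python
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # swap in gpt2-medium, gpt2-large, a Pythia checkpoint, ...
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

scores = []
for row in dataset:
    text = row["text"].strip()
    if not text:
        continue
    # Length filter: skip sequences under 40 tokens, which do not give a clean perplexity signal.
    if len(tok(text, add_special_tokens=False)["input_ids"]) < 40:
        continue
    scores.append(two_pass_perplexity_ratio(lm, tok, text))

mean = np.mean(scores)
stderr = np.std(scores) / np.sqrt(len(scores))
print(f"mean TPPR = {mean:.3f} +/- {stderr:.3f} (standard error)")
```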
The GPT-2 series shows the trend quite distinctly, and it is a disturbing one: worsening TPPR with larger scale. This suggests the larger model is relying on and anchoring in its broader memorized context rather than performing dynamic abstraction. Since this is within the same model series, the same training data and tokenization were used, along with the same kinds of layers; all that changed was the number of layers and thus the number of parameters. The error bars are, of course, standard errors. I will be honest, I did not bother to do the t-tests once the trend was this clear.
The Pythia series has the observed anomaly at 2.8B parameters. Nonetheless, even this does not undercut the trend, and the error bars do not overlap. Interestingly, the more modern architecture seems to be more robust against this kind of failure. Or perhaps I made a methodological mistake. Replication, please!
This is just two existing model series, examined using hobbyist levels of resources. Can we have someone with more compute access please check some of the larger ones? A more robust scattering of points may also help us discover whether this is truly linear, or more like a zoomed-in exponential relationship. !!Science!! is called for.
TPPR has a correlation with reasoning
To establish the other needed correlation, we should compare models that have roughly similar parameter counts, or preferably the same, and which are widely regarded as having reasoning abilities superior or inferior to one another. The two cases chosen for this were, first, DeepSeek R1 and GPT-J, both in the 6-7B parameter range, and second, GPT-2, both as its own control and after fine-tuning on a modern Chain of Thought (CoT) dataset.
We begin with the DeepSeek/GPT-J comparison. Our hypothesis is:
Hypothesis: Since DeepSeek R1 is widely acknowledged to punch well above its weight in reasoning, and reasoning is correlated with TPPR, it should have a notably lower TPPR than the GPT-J control.
This hypothesis is, indeed, confirmed. The correlation holds, and together with the broader evidence it makes a pretty good case for the broader theory. However, we must acknowledge this may be due to some unrelated underlying architectural difference. So let's control for that too.
GPT-2 Small was chosen for the task. The GSM8K dataset was used for training, in a question; explanation; answer configuration. Training ran for 10 epochs with a batch size of 16 and the default learning rate of 5e-5.
Hypothesis: The same model will have a larger TPPR score before fine-tuning on Chain of Thought reasoning than after, because Chain of Thought fine-tuning improves reasoning ability.
This was really the smoking gun. Again, there was no particular need for a t-test in this case. When everything, including the architecture itself, is the same, and the only difference is training on chain-of-thought reasoning, the improvement shows up directly in the TPPR.
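For those replicating, the fine-tuning configuration described above (GPT-2 small on GSM8K for 10 epochs, batch size 16, learning rate 5e-5) can be approximated with something like the following Hugging Face Trainer sketch; the exact prompt formatting used in the colab may differ.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

def format_example(row):
    # GSM8K's "answer" field already contains the chain-of-thought explanation
    # followed by the final answer, giving a question; explanation; answer layout.
    return {"text": f"Question: {row['question']}\nAnswer: {row['answer']}{tok.eos_token}"}

dataset = load_dataset("gsm8k", "main", split="train").map(format_example)
tokenized = dataset.map(lambda r: tok(r["text"], truncation=True, max_length=512),
                        remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-gsm8k-cot",
                           num_train_epochs=10,
                           per_device_train_batch_size=16,
                           learning_rate=5e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```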
Again, I would highly encourage those with more compute resources to run these same experiments with other models and report back. Is this general? Do older models still do this? Are newer models better reasoners with more robust TPPR scores?
Overthrow of a Paradigm
We are now in a position to substantiate the hypothesis, and to begin a process that may invalidate a considerable portion of the NLP paradigm as a result. First, let us establish a correlation between intelligence and TPPR.
P1: If a correlation exists between dynamic abstraction and TPPR, then smarter models should see lower TPPR scores.
P2: Smarter models have been observed with lower TPPR scores.
C: Therefore, TPPR is likely correlated with intelligence.
Now, the other half of the equation. Scale.
P1: If scaling models up makes them better at dynamic abstraction, then they should produce lower TPPR scores
P2: Scaling models up does not produce lower TPPR scores
C: Scaling up did not produce lower TPPR scores; therefore, it did not make models better at dynamic abstraction.
In fact, as we now know, it probably made things worse. Before we proceed to the final stage, it is worth again highlighting the distinction between dynamic and static abstraction. In many cases, where raw performance is concerned, we only care about the sum of the two. Overall capacity is still increasing with model scale, as can be seen in the raw perplexity scores. Not all models need dynamic reasoning; a spam filter, for instance, does not need to reason from first principles. Nonetheless, if that capability is needed, and we want to see it in our models, we can draw the following conclusion.
P1: Scaling up models makes them worse at dynamic abstraction
P2: Getting worse at dynamic abstraction shows a reduction in flexibility and reasoning-focused intelligence
C: Therefore, scaling up models is making them worse at reasoning-focused intelligence
This completes the picture. It suggests that scaling up models alone is not the path to Artificial General Intelligence, and that better training incentives and model architectures, ones that encourage dynamic abstraction itself, are likely needed to produce general intelligence. Notably, if any premise is broken at any point, this argument falls apart; falsifiability is part of it. Nonetheless, the fact that these correlations were predicted from first principles, and then indeed found, is damning evidence against the Scaling Paradigm and suggests the hypothesized mechanism deserves further substantiation. Replication is called for to verify these findings and discuss the consequences.
Call for research and replication
I briefly state here some of the core directions and questions that I think this work raises and that require community engagement. Please feel free to bring up additional questions as well.
Does the trend of larger models getting worse TPPR scores scale across other model families? What trends can we see? Are there some architectures that do better?
Does TPPR improving with CoT training hold across other model families? Do other reasoning processes show this effect? Are there any fine-tuning tasks that are improving the TPPR score?
Should we preliminarily accept TPPR as associated with intelligence? If so, what are the metric's falsification conditions?
Resource contention is, I argue, at the heart of improving TPPR. Do other resource contention scenarios create similar effects? For instance, does model distillation improve TPPR under the right conditions? How strong a coupling is needed to show the effect?
Ah, and of course the promised colab is located right here: link
Conclusion
There is almost certainly a middle ground, a proper exchange between dynamic and static abstraction, that gives the best total reward. However, blindly scaling up will not find it. It is time to design training incentives and model structures that encourage intelligence from first principles, rather than blindly embracing empirical metrics that may not correlate with the properties we actually want.
Welcome to the first stage of the Devil's Bargain Hypothesis. This hypothesis revolves around an unreleased paper that systematically deconstructs modern NLP as applied to lifelong reasoning situations, suggesting it is likely unsuitable to support general intelligence. We will find that lifelong learning is being inhibited by encoding decisions, fine-tuning likely erases our alignment information, recurrence needs to make a comeback, and many other insights. Stick around!
Preview of next post and author's commentary
In the next post of this series, we will examine a simple pretraining modification based on the theory in this work, and observe how it can boost GPT2-small's counterfactual performance to nearly 80% of GPT2-large's on an example task with just a few minutes of pretraining. We will also discuss the wider implications, and how it connects from first principles with the theory outlined in this work. Keep an eye out for the ongoing Lets Break NLP series, which should play out over a month or two. I hope to post once a week, but may need to post every other week depending on my schedule.
On a more personal note, I have never had a formal CS job in academia or industry. I just have a physics undergraduate degree and 7 years of self-study. If you find any of this interesting, I would be thrilled to collaborate on moving some of it toward publication. Extra resources would be great too: I am currently paying for colab GPU hours out of my own pocket, and cannot afford to pretrain a model. I would love to have help with that once my stack rebuild from first principles is complete.
If you have ever been frustrated by prompting workflows, you may wish to comment on the Workflow Forge project. It is designed for mass production of synthetic training data and complex workflows - think C++ for prompting workflows - and is currently accepting comments and use cases on the Huggingface forum here: Workflow Forge/SUPS. It is a lot easier to add use cases now if you have any. The associated repo is here: Repo. I will be adding the open source licenses and contribution guidelines sometime this week, and probably finishing up the parsers. Maybe I will get to the state machines this week too, but I am working part-time as a tutor right now, so we shall see.
The Workflow Forge project is needed for mass synthesis of training data in the alignment self-play loop I am building, which basically implements Paul's self-bootstrapping process illustrated here: Recursive Self-Alignment. Do take care not to overinterpret my claim, though: it is likely not stable. I am trying to bootstrap reasoning, metacognition, and alignment concurrently, but I am under no illusions of likely success without a lot of work.