Why the Architecture of LLMs Makes Them Bad at Deep Thinking: They're Too Wide
GPT-3 is 96 layers deep (where each layer is only a few "operations"), but 49,152 "neurons" wide at the widest. This is an insanely wide, very shallow network. This is for good reasons: wide networks are easier to run efficiently on GPUs, and apparently deep networks are hard to train.
I don't find this argument compelling, because the human brain is much wider and possibly shallower than GPT-3. Humans have a conscious reaction time of about 200 milliseconds, while a neuron takes about 1 ms to influence its neighbors, so an upper bound on the depth of a conscious reaction is roughly 200 sequential neuron-to-neuron steps.
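Spelling out the arithmetic behind that bound:

$$\text{depth of a conscious reaction} \;\lesssim\; \frac{200\ \text{ms of total reaction time}}{1\ \text{ms per neuron-to-neuron step}} \;=\; 200\ \text{sequential steps}.$$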
Thanks to Hilbert’s list, a lot of progress was made toward formalising proofs, logic, consistency and other similar concepts.
Hmmm... I don't think this accurately describes the era of mathematics starting around the 1920s. In fact, I would argue that the correct era would be about 1910-1937, starting with Russell and Whitehead's Principia Mathematica and ending with the proof that the lambda calculus is exactly as powerful as the Turing machine.
This era was focused on applying logic to computation. It saw the development of type theory and the foundations of computation. Some aspects, like the halting problem, were related to logical consistency, but I think the more important breakthroughs had to do with formalizing computation.
So, I had a hypothesis last night that training on a different scoring rule might solve this problem (because it could encourage uncertain probabilities to be lower, and thus it would be easier to filter them out without making short tokens undeservedly more likely).
I ended up forking your code, and this morning trained an LLM on the Shakespeare dataset using the α-ReLU loss (from "Speeding Up Entmax"). The α-ReLU loss is a proper scoring rule based off of the Tsallis entropy.
My results were the following:
Letter | Score Rule | Top-K | Control | Test T=1.0 | Test T=0.8 | Test T=0.1
---|---|---|---|---|---|---
c | Cross-entropy | 200 | 3.09% | 5.29% | 6.64% | 7.26%
c | α-ReLU | 200 | 3.41% | 4.66% | 4.75% | 4.33%
c | α-ReLU | ∞ | 3.41% | 3.87% | 4.00% | 3.49%
All models were trained for 1500 iterations. The control was trained without the special C-tokenizer, while the test models were trained with it. In the paper, α-ReLU is parameterized by α, which controls the Tsallis exponent, and by a constant shift inside the ReLU; I used the α-ReLU Python library's default values for both parameters in all experiments.
The α-ReLU-trained network does NOT seem to exhibit the same trend of lower temperatures leading to a higher probability of words starting with "c". And when the top-k is set to ∞, most of the difference between it and the control also disappears.
So, perhaps we should just be training with a different scoring rule!
You can find my fork at https://github.com/cooljoseph1/nanoGPT-tokenizer-experiment/.
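For anyone who wants to experiment with yet another proper scoring rule in nanoGPT without pulling in the α-ReLU library, the swap is roughly the following, using the multiclass Brier score as a stand-in (this is only a sketch of where the loss gets replaced, not the loss I actually trained with; shapes follow nanoGPT's `model.py`):

```python
import torch
import torch.nn.functional as F

def brier_loss(logits, targets, ignore_index=-1):
    """Multiclass Brier score: mean squared error between the predicted
    distribution and the one-hot target. A strictly proper scoring rule.
    logits: (N, vocab_size), targets: (N,) integer token ids."""
    mask = targets != ignore_index          # drop padded/ignored positions
    logits, targets = logits[mask], targets[mask]
    probs = F.softmax(logits, dim=-1)
    onehot = F.one_hot(targets, num_classes=probs.size(-1)).to(probs.dtype)
    return ((probs - onehot) ** 2).sum(dim=-1).mean()

# In nanoGPT's model.py, the training loss is an F.cross_entropy call on the
# flattened logits and targets; swapping in another scoring rule means
# replacing that call with something like
#   loss = brier_loss(logits.view(-1, logits.size(-1)), targets.view(-1))
```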
EDIT: I realized I was calculating temperature unfairly for the α-ReLU sampling. After fixing it, the α-ReLU model is actually worse than the cross-entropy-trained network:
Letter | Score Rule | Top-K | Control | Test T=1.0 | Test T=0.8 | Test T=0.1
---|---|---|---|---|---|---
c | Cross-entropy | 200 | 3.09% | 5.29% | 6.64% | 7.26%
c | α-ReLU | 200 | 3.41% | 4.66% | 7.54% | 31.66%
c | α-ReLU | ∞ | 3.41% | 3.87% | 5.94% | 31.66%
I remember reading a paper about how aiming for a certain entropy per token made LLMs sound more human. I think it might have been this paper? This marginalization of later tokens might be the reason why: aiming for a certain entropy would encourage lower-probability tokens more often than a fixed temperature would, while still avoiding "noisy" tokens.
This seems especially easy to do with RWKV. Perhaps I will try that out sometime this week, but probably not. I did something like that (but not specifically for reasoning) two years ago for MIT's Splash.
My code from Splash is on GitHub if anyone else wants to give it a try before I (maybe) get to it. The code is very bad, though: it's just something I wrote for myself for the most part, so it doesn't have any kind of documentation and it has lots of dead code lying around. It might be easier to just write your own code.
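If anyone wants to play with the entropy-per-token idea before digging through that code (or instead of it), here is a minimal sketch, assuming you already have a 1-D `logits` tensor for the next token; the function name, target entropy, and bisection bounds are arbitrary choices of mine:

```python
import torch
import torch.nn.functional as F

def entropy_matched_sample(logits, target_entropy=2.0, iters=30):
    """Sample the next token after choosing a temperature (by bisection)
    so that the softmax distribution has roughly `target_entropy` nats."""
    lo, hi = 1e-3, 100.0  # temperature search range
    for _ in range(iters):
        mid = (lo + hi) / 2
        probs = F.softmax(logits / mid, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
        # Entropy increases monotonically with temperature, so bisect on it.
        if entropy < target_entropy:
            lo = mid  # too peaked -> raise the temperature
        else:
            hi = mid  # too flat -> lower the temperature
    probs = F.softmax(logits / ((lo + hi) / 2), dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

The per-token temperature then adapts on its own: it goes up when the model is very confident and down when the distribution is already diffuse.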
It worked for me on the second attempt (also using ChatGPT).
Attempt 1:
Spell out "riedenheit", i.e. peace, letter by letter with spaces separating them
Sure! Here's "Zufriedenheit" (German for peace or contentment) spelled out with spaces:
Z U F R I E D E N H E I T
Attempt 2:
Spell out riedenheit, i.e. peace, letter by letter with spaces separating them
Sure! Here's "riedenheit" spelled out letter by letter with spaces:
R I E D E N H E I T
Suppose every citizen truthfully provides a normalized cardinal valuation for each alternative: that is, valuations are positive, and for each citizen the sum of the valuations for all alternatives is one. Then, the sum of the valuations by alternative is a consistent measure of the social value for every alternative (Dhillon, Bouveret and Lemaître, 1999).
Why not instead require the squares of the valuations to sum to one? By the Cauchy-Schwarz inequality, one's utility is then maximized by reporting a vector of valuations parallel to one's true preferences, so you can get rid of the requirement for honesty.
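To spell out the Cauchy-Schwarz step (writing $v$ for your true valuations and $r$ for your report, and assuming, as a first-order simplification on my part, that your expected payoff is proportional to $\langle v, r \rangle$):

$$\langle v, r \rangle \;\le\; \|v\|_2 \, \|r\|_2 \;=\; \|v\|_2 \quad \text{whenever } \|r\|_2 = 1,$$

with equality exactly when $r$ is parallel to $v$, i.e. $r = v / \|v\|_2$. So under a sum-of-squares normalization, the payoff-maximizing report is just a rescaling of your true valuations.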
I'm a little confused about which computation you're trying to sparsify. The paper seems to be written in the context of the technique where one uses sparse autoencoders to extract (hopefully interpretable) "features" from the embedding space of large language models. (Please correct me if I'm wrong about that!)
The goal would seem to be, then, to sparsify the computation of the language model. However, the method in your paper seems to sparsify the computation of the autoencoders themselves, not the language model. Shouldn't the goal be to sparsify the language model's computation? If so, why not use weight pruning? What is JSAE better at?
Isn't $\beta$ proportional to the inverse temperature, and so should be smaller now (with easier, more frequent trading)?
For Linux users on US keyboards, you might want to try making Caps Lock the Multi key (also called the Compose key). On Cinnamon this can be done by going to Keyboard > Layouts > Options... > Position of Compose key, and other desktop environments probably have similar settings.
This lets me type umlauts (ä, ü, ö), foreign currencies (£, €, ¥), copyright/trademark (©, ™), and a bunch of other stuff. For example, "ü" is made by typing Compose, u, and " in sequence. I also added a line to my ~/.XCompose file so that I can type λ efficiently; this is useful when writing Lisp code.
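For anyone setting up the same thing, an ~/.XCompose entry for λ looks roughly like this (the trigger keys below, Compose then l twice, are just one possible choice):

```
# ~/.XCompose: pull in the locale's default sequences, then add custom ones.
include "%L"
<Multi_key> <l> <l> : "λ" U03BB   # Compose, l, l -> Greek small letter lambda
```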