I'm not claiming I know the perfect cut-off point between not losing information and not letting errors accumulate, if something like that even exists. It could very well be that after 3 forward passes with neuralese everything would still be mostly fine, or it could be that even in the middle of a single forward pass it makes sense to have some kind of mitigation (I think you could build an argument around MoE being one). But what this perfect ratio is doesn't really matter; the point is that recurrent forward passes will be orders of magnitude worse than a normal forward pass and therefore can't be worth it anymore.
Making everything discrete is one extreme, just as making everything continuous is the other. I'm arguing that the sweet spot lies somewhere in the middle, recognizing the importance of both rich, continuous representations and clear, discrete ones.
I'm not sure which passage you're referring to that makes my argument imply this. The sections "The Bandwidth Intuition/Counterargument" are supposed to clear exactly this up, stating roughly that I understand there is still obviously a loss of information, and as such it's nonsensical for a normal NN to have minuscule layers. That isn't an accurate assessment for neuralese LLMs though: they recursively accumulate this error, turning it into a much bigger problem. If it's simply allowed to grow through tens and hundreds of forward passes, it's not worth it. And if we do already tokenize them, quantization would no longer serve any real purpose.
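To make that intuition a bit more concrete, here's a quick toy simulation I threw together (purely illustrative - the dynamics, codebook and noise scale are all made up): the same small per-pass noise either gets carried forward in a continuous latent, or gets snapped away by re-discretizing against a codebook after every "pass".

```python
import numpy as np

rng = np.random.default_rng(0)
dim, steps, noise = 64, 200, 0.01
codebook = rng.normal(size=(512, dim))            # stand-in for a token embedding table
W, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # norm-preserving stand-in for "one forward pass"

def step(x):
    return W @ x                                  # one noiseless forward pass

def snap(x):
    # discretize: replace the latent with its nearest "token"
    return codebook[np.argmin(np.linalg.norm(codebook - x, axis=1))]

x0 = snap(rng.normal(size=dim))
clean_cont = clean_disc = noisy_cont = noisy_disc = x0
for t in range(steps):
    eps = rng.normal(scale=noise, size=dim)
    clean_cont, noisy_cont = step(clean_cont), step(noisy_cont) + eps        # neuralese: error carried forward
    clean_disc, noisy_disc = snap(step(clean_disc)), snap(step(noisy_disc) + eps)  # tokenized: error snapped away
    if (t + 1) % 50 == 0:
        print(t + 1,
              np.linalg.norm(noisy_cont - clean_cont),   # drift keeps growing, roughly like a random walk
              np.linalg.norm(noisy_disc - clean_disc))   # typically stays at 0, unless the noise flips a token
```

Obviously a caricature - the snapping throws away information, which is exactly the cost I'm acknowledging - but it shows which regime the noise lives in.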
I hope my position has become a little bit more clear.
No, for several reasons: For starters, quantization is normally done after training and is not present during training (mainly because it introduces a lot of gradient problems) - this is not comparable to the token distribution, which we incorporate during training and train on. (In other words, it can't take advantage of any possible benefits because the model was trained in an entirely different setting.)
Even more importantly, the error doesn't accumulate for the KV cache (the weights obviously can't, they're literally fixed): inspecting a token's KV cache at the i-th layer, it only carries the noise from layers 0 through i (any minor noise from before got removed, since the token was discrete at the beginning). It will in turn carry this noise to other tokens through the attention mechanism, and those tokens then still face the noise from layer i to the final layer. But that is just one forward pass worth of noise, not something we have to worry much about, since we're going to tokenize the output anyway, removing all minor noise. (In other words, my argument focuses on noise that grows and grows through autoregressive steps; this is just the noise of a single forward pass.)
Specifically as noted in "The Bandwidth Counterargument":
Having bottleneck layers in a normal neural network is nonsensical - when the "distance" is small, you should stay in high-dimensional, continuous latent space. By the time one forward pass finishes, noise hasn't yet grown enough to matter and tokenization can clean it up.
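To put the same point in back-of-the-envelope terms (my own framing, under the crude simplification that every continuous layer contributes a similar amount of noise): what matters is the longest chain of continuous transformations any piece of information passes through before it gets re-discretized into a token.

```python
def max_noisy_depth(num_layers: int, num_steps: int, neuralese: bool) -> int:
    """Longest chain of continuous (noisy) transformations before re-discretization."""
    if neuralese:
        # the raw latent is fed back every autoregressive step, so chains span steps
        return num_layers * num_steps
    # with discrete tokens, a KV entry written at layer i is only read at layer i of
    # later positions, so any path climbs the layer stack exactly once: one forward pass
    return num_layers

print(max_noisy_depth(num_layers=80, num_steps=1000, neuralese=False))  # 80
print(max_noisy_depth(num_layers=80, num_steps=1000, neuralese=True))   # 80000
```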
Ah, I think I might have slightly misunderstood the intent of your post's title and tried answering a different question: why LLM writing often seems shallow and bad, rather than why LLMs specifically seem biased toward a subset of stylistic devices.
I honestly don't use LLMs much to chat or write with, so my personal experience is rather limited. But I do find the point others made - that the post-training data distribution just isn't an accurate sample - convincing enough, just not particularly satisfying.
So, here are my thoughts on why not just RLHF but also SFT or DPO could, even with a perfect sample of training data, result in collapsed distributions of stylistic devices.
In the case of RLHF, we can go even further and assume the distillation of the training data went perfectly - the reward model isn't biased towards any stylistic devices but is a perfect representation of its training data.
Even then, the key problem is that the reward model only sees a single trace. This is important because it makes the reward model unable to determine whether the distribution of stylistic devices seen in the trace is simply a reasonable sample from the whole distribution or only a subset of it.
And because of constant optimization pressure, only mastering a few stylistic devices (just enough to fool the RM in a single trace) will quickly become the path forward.
Now what about something like SFT - after all, here we don't do any rollouts anymore. This does help. We can assume that, thanks to the unbiased loss, the distribution of stylistic devices the model produces on the training examples is pretty accurate. But that's the extent of the statements we can make: we were completely offline.
The traces during inference are very different from the training data: errors propagate during token generation, biases accumulate, and suddenly we are faced with only a subset of the training distribution - or worse, something not encountered at all. Assuming the distribution of stylistic devices will still be unbiased once it's conditioned on a completely different distribution of traces is wishful thinking at best.
Online training where you look beyond a single trace seems most promising. This can happen either by including things like logits (KL distillation - see this post for an idea that should also work well) or by simply incorporating multiple traces into the judgement of one: how diverse is this trace compared to the others generated (including its stylistic devices, for example)?
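Just to gesture at what "incorporating multiple traces into the judgement of one" could look like in practice, here's a hypothetical sketch (the trigram overlap measure and the penalty weight are arbitrary choices of mine, not anything from the literature): the base reward of a rollout gets taxed by how much its surface patterns overlap with its sibling rollouts for the same prompt.

```python
def ngrams(text, n=3):
    # crude surface fingerprint of a trace: its set of word trigrams
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    return len(a & b) / max(len(a | b), 1)

def diversity_adjusted_rewards(traces, base_rewards, penalty=0.5):
    # traces and base_rewards are parallel lists of rollouts for one prompt and their RM scores
    grams = [ngrams(t) for t in traces]
    adjusted = []
    for i, r in enumerate(base_rewards):
        # average surface overlap of trace i with all its sibling rollouts
        overlap = sum(jaccard(grams[i], grams[j])
                      for j in range(len(traces)) if j != i) / max(len(traces) - 1, 1)
        adjusted.append(r - penalty * overlap)  # leaning on the same phrasing as the siblings gets taxed
    return adjusted
```

The hope being that a trace which reaches for the same stylistic crutch as every one of its siblings scores worse than one that doesn't, even if the RM likes both in isolation.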
I think it might be that the undesired response in RLHF/DPO settings isn't good enough.
Imagine two responses, one leveraging stylistic devices and persuasive words while the other... well, just doesn't. Naturally the first is better and more desirable. If we now look across the whole training batch, the distinguishing feature becomes clear: the preferred responses leverage stylistic devices at pretty much any point. That is, a phrase like "It's not an X, it's a Y" will occur at a bunch of different positions throughout the positive examples, in contrast to the negative examples, which very rarely, if at all, showcase such pleasant phrasing.
But wouldn't such behavior - constantly repeating stylistic devices - then be exactly what we'd expect? This clear contrast between positive and negative examples is what we distill into the final model, basically telling it that stylistic devices are preferred at any point.
To move away from this, looking for better, higher-quality positive examples won't help at all - instead, we need the negative examples throughout training to become closer and closer to the positive ones, just like a writer progressing through their career: first learning about stylistic devices, then understanding when to use them meaningfully and when less is more, and finally fully mastering them. This contrast between a good and a really good writer needs to be captured better in the post-training data for something like DPO.
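If I wanted to check this empirically, I'd probably start with something as crude as the following (the phrase patterns and the dataset format are made up for illustration): measure how lopsided a given stylistic device is between chosen and rejected responses.

```python
import re

# a couple of hand-picked stylistic devices as regexes (illustrative only)
STYLE_PATTERNS = [
    r"it'?s not (an? )?\w+, it'?s (an? )?\w+",   # "It's not an X, it's a Y"
    r"\bnot only\b.*\bbut also\b",
]

def device_rate(responses):
    # fraction of responses that use at least one of the devices
    hits = sum(any(re.search(p, r, re.IGNORECASE) for p in STYLE_PATTERNS)
               for r in responses)
    return hits / max(len(responses), 1)

def style_gap(pairs):
    # pairs: [{"chosen": str, "rejected": str}, ...]
    chosen = device_rate([p["chosen"] for p in pairs])
    rejected = device_rate([p["rejected"] for p in pairs])
    return chosen - rejected  # large gap => "device = preferred" is the easiest lesson to learn
```

If that gap is large across the whole dataset, DPO can hardly learn anything other than "use the device everywhere"; shrinking it would mean rejected responses that also use the devices, just less judiciously.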
Do take this with a grain of salt - it's just a random theory I came up with while thinking about this for 10 minutes or so. I didn't check the empirical state of research on this hypothesis, but it does seem somewhat convincing to me, at least.
but I can’t explain why I believe that, because it gets into gory details of brain algorithms that I don’t want to talk about publicly, sorry.
Somewhat random, but I think I want to learn more about this field in general - from what I can tell, you didn't learn about it in a normal academic setting (like doing a neuroscience B.Sc.) either; any tips for good resources?
This isn't so much a question as just sharing some thoughts I had, but I would love to hear yours :) Let's imagine we are our own brain's optimizer. We just received a bad signal: we feel pain. Let's say we realized someone else is soon going to feel pain, so we feel pain. What could the optimizer do now? Well, there are only two things it can do:
1. Try to disconnect “she feels pain” from the concept of pain that then triggered pain in yourself
2. Try to disconnect your previous thoughts from arriving at “she feels pain”
You speak a lot to (1), explaining the symbol-grounding mechanism that continuously grounds it in ground truth, so the optimizer trying to move “she feels pain” away from its previous position in feature space won't work (at least as long as we continuously receive such ground-truth input - this sheds light on the very immoral but very interesting experiment of having an individual not exposed to such input for long periods, like not seeing any human face for multiple months, be it in person, in pictures or on a phone. There, this theory should predict that such a move in feature space could happen and would be successful - to be dramatic, you become a psychopath).
You don't speak much to (2) though. One option here, for example, would be to unlearn the concept of “future” - babies only gradually learn it, so it's reasonable to assume it could be unlearned again. Luckily, this doesn't seem to happen, so there must be some opposing force, something that promises reward if the concept persists.
Specifically, this concept must offer you insight into your actions such that your future expected reward rises. This is obvious in this case - without the concept of “future”, you can hardly make any intelligent decisions at all. But it also carries over to much more specific and even human-invented associations/knowledge:
Let's say you work in cybersecurity, and the reason you think this person will feel pain is that those cybersecurity skills enabled you to make an association a normal person wouldn't. The optimizer could try to unlearn these skills, but those skills actually lead to higher expected reward, else you wouldn't be pursuing them: be it the nice house you can afford, the social status you enjoy because of them, or simply the joy you get from exercising them.
In other words, anything you learned, you learned because you assumed it would result in a higher expected reward, and anything you act out (after learning), you do because it results in a higher expected reward. Forgetting these concepts will require at least a reward matching theirs.
This doesn't imply it should be impossible though - let's say you learned something that you hate, say chiseling stone. You did this because the market paid insane wages, since only a few could do the job, so the reward you saw attached to those wages was immense and you pushed through the boring education of becoming an expert stone chiseler. And once you got there, you realize you weren't the only one with the idea: wages drop quicker than the average pump-and-dump crypto coin. In fact, the profession you practiced before, which you intrinsically enjoy, even pays better.
As I'm writing this, I realize there are no good stories for why chiseling stone might give you a better glimpse into someone's future pain, but let's just take it for granted. Then the reward attached to the knowledge of chiseling stone is pretty much zero, maybe even negative, because whenever you recall it, you recall all the effort that didn't pay off.
Yet I have never heard of something along these lines happening. It would be quite a great mechanism for the free market though - the wages would jump right back up. Let's hope our individual in question doesn't once again set out to learn to chisel stone, having completely forgotten this tale of unreciprocated effort.
You could maybe argue something like: precisely the things that fall in this category are things we gave up on, that is, their occurrence in our day-to-day life is incredibly rare. Therefore, with a normal learning rate, we simply wouldn’t iterate over them often enough to forget them meaningfully.
Lastly, just for completeness: naturally, ‘disconnecting your previous thoughts from arriving at “she feels pain”’ also entails your previous actions - it's a very special occurrence to know somebody will feel pain in the future unless you had a hand in it yourself. Naturally, those earlier decisions will be optimized on as well, hopefully leading you to make better decisions in the future.
Think of it as vaguely like I-am-juggling versus you-are-juggling.
Here, I can see how they would overlap to a reasonable degree - I don't think this easily carries over to emotions, though. Emotions at least feel like this weird, distinct thing, such that any statement along the lines of "I'm happy" does them injustice. Therefore I can't see it carrying over to "She's happy"; their intersection wouldn't be robust enough to avoid falsely triggering on actually unrelated things. That is, "She's happy" ≈ "I'm happy" ≉ experiencing happiness.
Facial cues (as one example; it makes sense that there would be other things, like higher-pitched voices when enjoying oneself, etc.) eliminate this problem because, as opposed to something introspective being the link, a more objective state of mind, like "He's sad", will be the learned link.
This might sound like I'm being unnecessarily picky, but imo these associations need to be very exact, else humans would be reward-hacking all day: it's reasonable to assume that the activations of thinking "She's happy" are very similar to those of trying to convince oneself internally that "She's happy", even while 'knowing' the truth. But if both resulted in big feelings of internal happiness, we would have a lot more psychopaths.
Regarding micro-expressions specifically, it's definitely not a hill I want to die on; it just kind of popped into my mind as I was writing about facial cues, and by micro I really mean 'micro micro' - e.g. smiles that aren't perfectly symmetrical for a quarter of a second, something I at least can't really pick up on. What would be their evolutionary advantage if they didn't at least have some kind of subconscious effect on conspecifics? But yeah, if you can't consciously pick up on it, linking the two is pointless or even bad.
I read the linked post only roughly, and since I've read neither so far, I probably can't relate too well to it. It seems reasonable (or honestly, obvious) though that it's a mix rather than either of those extreme statements.
Let me preface this by saying how much I enjoyed reading this post - it really shows that this isn't some random idea you had but something you've thought about a lot. As someone whose first introduction to this kind of idea was precisely this blog post: thanks.
A question - maybe I'm simply misunderstanding you:
- You seem to assume that the cortex's modelling of one's own happiness is very similar to its modelling of thinking about happiness. You might argue that it's only the "concept of happiness", which I would agree is present in both scenarios, but it isn't clear to me why that in particular would be learned via this supervised mechanism.
- Building on that point, I think it might be more probable that understanding another's feelings is part of 1A - instead of simply seeing, hearing, etc., there would be something tasked with analyzing facial cues. In particular, humans exhibit micro-expressions (expressions that last very short periods and are almost impossible to control), something most people can't seem to pick up on, at least consciously. So why do we have them if other people can't pick up on them? Maybe they can, but only subconsciously, precisely to facilitate this symbol grounding for somebody else's feelings. Then again, if you can't consciously pick up on it, the target for the supervision will probably be terrible as well, so maybe that's not it.
(I'll probably hammer you with more questions down the line, still trying to process all of this lol)