Comments

p.b. · 2d · 20

In psychometrics this is called "backward digit span".

p.b. · 4d · 61

Diminishing returns in loss are not diminishing returns in capabilities. And benchmarks tend to saturate, so diminishing returns are baked in if you look at those. 

I am not saying that there aren't diminishing returns to scale, but I just haven't seen anything definitive yet.
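
One way to make the first point concrete (my framing, assuming the usual power-law shape of loss-versus-compute scaling curves, not something from the thread): loss typically follows

$$L(C) \approx L_\infty + a\,C^{-\alpha},$$

so each doubling of compute buys a smaller absolute loss reduction almost by construction. And any benchmark whose score is capped, e.g. accuracy at 100%, has to flatten as models approach that ceiling, whether or not the capabilities bought by each further loss reduction are diminishing.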

p.b. · 5d · 10

Frankly, I don't really understand what you are saying here and I am open to the possibility that I don't really understand how the gradient works in autoregressive transformers. 

But as I said in my other comment, my current understanding is: 

In standard attention (for example in an encoder) tokens are not ordered, so it is clear that the gradient of the loss of one of the token predictions (for example a masked token in BERT) flows through all other tokens equally. In autoregressive transformers an order is imposed by masking, but all later tokens attend to all earlier tokens in the same way. 

The gradient of the loss of a later token flows through all earlier tokens in the same way. It doesn't matter whether a token is half the context back or the whole context back, neither for the information flow nor for the gradient flow. 

To put it another way: in the n-th layer, the last token attends to all the output tokens from the (n-1)-th layer. It doesn't somehow have to make do with the output of earlier layers for tokens that are further back. 
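
As a concrete illustration (a minimal PyTorch sketch with the projection weights omitted, not any particular model's code): the causal mask only encodes "earlier vs. later", not distance, so the last query attends over every previous position's layer output in exactly the same way.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    """x: (seq_len, d_model) outputs of the previous layer (projections omitted)."""
    seq_len, d = x.shape
    scores = x @ x.T / d ** 0.5                       # (seq_len, seq_len) attention logits
    # Mask out *later* positions only; nothing depends on how far back a token is.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    attn = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
    return attn @ x, attn

x = torch.randn(8, 16)
_, attn = causal_self_attention(x)
# The last token's attention row puts nonzero weight on every position,
# whether it is one step back or seven steps back.
print(attn[-1])
```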

p.b. · 5d · 10

> Yeah, the first 99 tokens would be optimized both to be locally the correct character, and also to set things up so that the 100th character is also correct.

That is how LLMs currently work. The gradient of each token prediction does flow back into all the earlier tokens whose information was integrated into the predicted token. So each position is optimized for its own next-token prediction, but is also pushed to integrate the information that is most useful for future tokens. 
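
A rough check of the gradient claim (a toy PyTorch example using an off-the-shelf encoder layer with a causal mask and a made-up loss, purely illustrative): the loss at the final position produces a nonzero gradient on every earlier position's embedding, however far back it sits.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model = 10, 32
emb = torch.randn(seq_len, 1, d_model, requires_grad=True)   # (seq, batch, dim)

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dropout=0.0)
# Additive causal mask: -inf above the diagonal blocks attention to later tokens.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
out = layer(emb, src_mask=causal_mask)

# Stand-in for the prediction loss at the last position.
loss = out[-1].pow(2).sum()
loss.backward()

# Every earlier position receives gradient from the last position's loss.
print(emb.grad.abs().sum(dim=(1, 2)))
```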

p.b. · 5d · 10

> I don't know how people are creating huge context windows these days, but IIRC the way it works is that the longer you look back into your context (and correspondingly the further you are trying to plan ahead) the less of your computation is available. Like, if you have N layers, then for a token M steps back, you only have access to the computation up until layer N-M.

Everything in the context window is equally available. It doesn't make a difference whether an earlier token is 5 tokens back or 5,000. The attention mechanism is an operation over a set of tokens; there is no intrinsic order. 
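
A small check of the "set operation" point (plain scaled dot-product attention without positional encodings; illustrative, not anyone's actual implementation): permuting the tokens a query attends over leaves its output unchanged, so distance as such is invisible to the mechanism itself. Order only enters through positional information added to the embeddings, not through attention.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 16)        # one query token
kv = torch.randn(50, 16)      # the tokens it attends over

def attend(q, kv):
    scores = q @ kv.T / kv.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ kv

perm = torch.randperm(kv.shape[0])
# Same output whether a given token is "5 back" or "5000 back" in the ordering.
print(torch.allclose(attend(q, kv), attend(q, kv[perm]), atol=1e-6))  # True
```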

p.b. · 6d · 60
> Scaling curves show strongly diminishing returns to $ spend: A $100m model might not be that far behind a $1b model, performance wise. 

What's your argument for that? 

p.b. · 8d · 10

Hah, I didn't see your answer, but our links complement each other nicely. 

I think my first link was the paper that was making some waves when it came out.

p.b. · 9d · 20

This reminds me a lot of a toy project I have in the back of my mind but will probably never get around to: 

Train a transformer on the sequences generated by the logic models from the apperception engine paper (which, in the paper, are inferred by the apperception engine from the sequences), with the aim of predicting the logic model. 
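
A very rough sketch of how that toy setup might look (every detail here, vocabulary size, model shape, and the idea of serialising the logic model into target tokens, is my own hypothetical choice, not anything from the apperception engine paper): the observed sequence goes into the encoder, and a small decoder is trained to emit the logic model token by token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D = 64, 128            # hypothetical shared vocabulary for sequence and program tokens
model = nn.Transformer(d_model=D, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
embed = nn.Embedding(VOCAB, D)
head = nn.Linear(D, VOCAB)
opt = torch.optim.Adam(list(model.parameters()) + list(embed.parameters())
                       + list(head.parameters()), lr=3e-4)

def train_step(seq_tokens, program_tokens):
    """seq_tokens: (batch, src_len) observed sequence; program_tokens: (batch, tgt_len) serialised logic model."""
    tgt_in, tgt_out = program_tokens[:, :-1], program_tokens[:, 1:]
    L = tgt_in.shape[1]
    # Causal mask for the decoder so each program token is predicted left to right.
    tgt_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    out = model(embed(seq_tokens), embed(tgt_in), tgt_mask=tgt_mask)
    loss = F.cross_entropy(head(out).transpose(1, 2), tgt_out)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```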
