Same person as nostalgebraist2point0, but now I have my account back.


the scaling “inconsistency”: openAI’s new insight

I don't think you're completely missing something.  This is the active learning approach, which gwern also suggested -- see that thread for more.

the scaling “inconsistency”: openAI’s new insight

I disagree. Transfer learning is practically the entire point. 'Blessings of scale' etc.

Sure -- my point to contrast two cases

  1. a counterfactual world with a much larger "regular" web, so WebText and Common Crawl are 1000x their real size
  2. the real world, where we have to go beyond "regular" web scrapes to add orders of magnitude

Many, including OpenAI, argue that general web crawls are a good way to get high domain diversity for free.  This includes domains the research would never have come up with themselves.

If we switch to manually hunting down large specialized datasets, this will definitely help, but we're no longer getting broad domain coverage for free.  At best we get broad domain coverage through manual researcher effort and luck, at worst we don't get it at all.

I see your point about active learning "telling us" when we need more data -- that's especially appealing if it can point us to specific domains where more coverage would help.

the scaling “inconsistency”: openAI’s new insight

What scaling curve in L(D)/L(C) could we get with even a simple active learning approach like running a small GPT over Common Crawl and throwing out datapoints which are too easily predicted?


IIUC, this is trying to make L(D) faster by making every data point more impactful (at lowering test loss).  This will help if

  1. you get most of the way to intrinsic entropy L(D) on your first pass over D points
  2. you can downsample your full dataset without lowering the total number of examples seen in training, i.e. you have too many points to do one full epoch over them

I can imagine this regime becoming the typical one for non-text modalities like video that have huge data with lots of complex redundancy (which the model will learn to compress).

With text data, though, I'm concerned that (2) will fail soon.

The number of train steps taken by GPT-3 was the same order of magnitude as the size of Common Crawl. I haven't seen convincing evidence that comparably good/diverse text datasets can be constructed which are 10x this size, 100x, etc.  The Pile is an interesting experiment, but they're mostly adding large quantities of single-domain text like Github, which is great for those domains but won't help outside them.

Why GPT wants to mesa-optimize & how we might change this

Are you saying that GPT-3's training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?


No, it's a more philosophical point.  Even if such things appear in the context window, they're simply more text, and convey the same kind of information: not "the denotation of these words is factually true," but "these words are part of the text."

For example, the mere appearance of something like

Title: Why GPT wants to mesa-optimize & how we might change this 

Author: John_Maxwell

does not guarantee that the text following it bears that title, or was written by that author.  (As I am illustrating right now.)

Of course, one can design datasets where information like this is provided more authoritatively -- say, always at the start of each text, curated for quality, etc.  (GPT isn't like that, but Grover and CTRL kind of are, in different ways.)

But even that can only go so far.  If the author is "Julius Caesar," does that mean the historical figure, some internet poster with that handle, or any number of other possibilities?  A passage of fiction written in a character's voice -- is the appropriate author cue the actual writer (who may have written in many different voices over their career) or the character?  (Note that the character is a much better answer to the question "who does this sound like?")  And doesn't the date matter too, so we know whether this post in the venue "Less Wrong" was on 2010's LW or 2020's?

Fundamentally, language modeling is about understanding structures in decontextualized blocks of contiguous words.  You can try to hack in some sidechannels to provide context, but there's no way they will capture everything needing to locate the text fully in its social, physical, and temporal position within the broader world.  And just as a definitional manner, these sidechannels are modifications to "language modeling," which in its purest sense is just about filling in an arbitrary text from substrings of it (and no other information).

My intuition is that small-L lookahead could be close to large-L lookahead in programspace for something like an RNN, but not for GPT-3's transformer architecture.

Yeah, not for transformers I think.

Anyway, the question here isn't whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution.

capybaralet's point about conservation of expected evidence applies here -- GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead, it should already have those facts priced in to its next-step prediction.

If we then say "the mechanism for pricing them in is doing internal lookahead," then we are imagining that lookahead operating over some predictor that is otherwise good but hasn't priced in lookahead yet.  But I don't know why we should imagine the computation would naturally factor this way, when the benefits of lookahead are small and it beam search take a lot of parameters to implement internally.

Why GPT wants to mesa-optimize & how we might change this

I'm skeptical that internal beam search would help in language modeling.

Language modeling is like predicting the weather, in the sense that even if you are literally as good as possible at it, your prediction accuracy still degrades rapidly as a function of the number of steps ahead you're looking.  So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.

Weather is like this because of chaotic dynamics.  Language modeling is like this because

(a) Text is used to communicate: the writer expects the audience to learn something from the last X% of a text that they couldn't extrapolate from reading the first (100-X)%, or else they'd just stop and not write the remaining X%.

(b) By construction, language modeling gives you nothing to work with except the text itself, so you don't know who produced it or for whom.  So even if you were smart enough to guess what any individual human would say next (!), you don't know which human produced the text you're looking at.  (Or even whether it was a human at all.)

Thus (IMO), language modeling is not really about thinking ahead to find some "objectively correct" next move as in Chess/Go.  It's more about trying to guess what the author of this text will do in the very next step.  The author and the LM are almost sure to diverge after a few more steps, so even if the LM had a beam search oracle, I expect it wouldn't find it very useful.

To make the point concrete, I don't think "orange" is necessarily a bad guess here -- among other things, it would be the correct guess if the author were trying to illustrate the point of your example!

And if we were predicting this post itself, the true next token would not be orange or any other word but an ellipsis "...", which seems bizarre from the narrow perspective of the example, but is typical of the wild world LMs operate in.  (Which also contains typos, actually-incoherent writers, mangled formatting, the list goes on . . . )

on “learning to summarize”

To me the concept of a horizon (or a discount factor) as a hyperparameter no longer makes sense when there's only a single reward at the end of the episode, as in the paper here or in AlphaGo/Zero. They only make sense with intermediate reward, as in Atari or my proposal here.

With only final rewards, you can still include it as a variable formally. but there's no reason to make that variable anything less than the episode length. (If the horizon is n steps lower than the episode length, this just sets the value function identically to 0 for the first n steps.)

I guess I was using "there isn't a horizon per se" to mean "the time structure of the rewards determines the horizon for you, it wouldn't make sense to vary it," but I can see how that would be confusing.

If you only set the horizon to 1 but changed nothing else in their work, you'd get a dumb policy that equals the initial LM until the very last token, which it treats as responsible for the entire reward. If you add intermediate rewards and set the horizon to 1, you get something more sensible.

on “learning to summarize”
I think in the original paper, they don't have the KL term that prevents the policy from overfitting to the reward model, which seems sufficient to explain this.

Yeah, that makes sense. Something like this explanation occurred to me yesterday, after finishing the post -- I was reading over the funny samples I quoted at the end and thought "huh, that would qualify as 'bizarre behavior,' wouldn't it?"

Or perhaps you don't want to use a learned reward model, and instead you want some hardcoded method of converting human preferences over tokens into <thing that can be plugged into an ML algorithm>?

If I understand you, yes, this is what I want. My intuition here is based on:

  • at the end of the day, our final model will be sampling one token at a time, like the original LM; we just want it to output better probabilities
  • when OpenAI (and I) think about what "better probabilities" we want in specific cases, our preference often looks localized to specific tokens and identifiable using only preceding context, e.g. to specific "made-up" facts, or the kind of synthetic errors they introduce in Table 18

So, it feels like "we" want the LM to have different probabilities in specific places, and we can often point to these exact places and at least say whether the probability should be higher or lower.

Insofar as this is true, it means our true preferences look a lot like what the original LM is trained to do. If I'm annotating to improve an LM for nonfiction writing, and I see "Paris, the capital of Canada," what I really want is to make the token " Canada" less probable in this context.

This is a preference over next-token probabilities, not sequences -- if I compress it down to a preference over whole sequences, I must be hoping the models will later decompress it back to my true preference. It seems needlessly convoluted to translate my preferences out of LM terms and then use RL to translate them back, when they're naturally in LM terms to begin with.

This doesn't get you all the way to having a unique loss: the most obvious thing would be to ascend likelihood for tokens marked "good" and descend for tokens marked "bad," but there may be conceptually similar losses that are better-behaved in training.

Some versions of this would look like RL with a horizon of 1 and the rewards given by my annotations plus a KL penalty, but note that this is very different from their approach, where there isn't a "horizon" per se because all episodes have a fixed duration and receive rewards only at the end.

"Learning to Summarize with Human Feedback" - OpenAI

Various thoughts -- focused on critique because I find that most interesting to write down. (I didn't have a strong negative or positive reaction to the paper.)


IMO there are two almost unrelated ideas going on in OpenAI's preference learning work (this paper and the original one).

  • First, the idea of collecting binary preference annotations on LM samples, and (in some way) tuning the LM so its samples are better aligned with the preferences.
  • Second, a specific method for tuning the sampling behavior of LMs to maximize an (arbitrary) score function defined over entire samples.

As their first step, they do supervised learning on the data from the first idea to produce a "reward model." (In this paper, this happens roughly once, with little active learning of the reward model over successive batches of annotation.)

This model assigns a score to an entire sample of N tokens, but for LM finetuning, you want something that tells you how good each token is individually. The second idea is the way they choose to bridge the gap, with a specific RL technique.

The overall results look good, but it's not clear how to attribute that across the two ideas, and OpenAI's discussion tends to blur the two together. They can perhaps learn high-quality reward models from preference data (first idea), but it's less clear they are using these models to tune sampling in a good way (gwern said the same thing after trying it).

On the flipside, their RL approach to sampling treats the reward as a black box, so it has nothing to do with preference data per se; you could apply it with any score function.


As far as I can tell, their final "human evaluation" was done by the same labelers who provided the preference annotations. This makes me concerned about a variant of "evaluating on training data." It's not surprising that a model tuned on someone's annotations agrees with that person more than a model which wasn't.

For example, in Fig. 3, it looks like the "supervised" baseline tuned on tl;dr was rated about as highly as true examples from tl;dr itself (!), but not as well as the final model. I think the authors are arguing that tl;dr and any other supervised dataset will have flaws, and preference data lets you get closer to what people actually want.

This seems true, but is a familiar observation from supervised learning, motivating e.g. active learning. It would be nice to see how much the difference can be mitigated by just augmenting tl;dr with annotations but otherwise doing supervised learning, vs. using their RL approach.

Compared to tl;dr, the story for CNN/DM is more complicated, but again the models they outperform have not seen any data from their labelers, so maybe it is no surprise they have flaws according to those same labelers.


The importance of annotation quality, close relationships with annotators, clear guidelines, etc. will be familiar to anyone with experience in annotation for ML. It's good that OpenAI is doing the right things here, but this is not a new result -- rather, other researchers resort to MTurk and similar due to time/money constraints, while OpenAI has the freedom to do the right things everyone else wants to do. (That includes building their own internal annotation platform for contracted annotators, which is costly but better in the long term than relying on a janky 3rd party product.)

interpreting GPT: the logit lens

Interesting topic! I'm not confident this lens would reveal much about it (vs. attention maps or something), but it's worth a try.

I'd encourage you to try this yourself with the Colab notebook, since you presumably have more experience writing this kind of prompt than I do.

interpreting GPT: the logit lens

I also thought of PCA/SVD, but I imagine matrix decompositions like these would be misleading here.

What matters here (I think) is not some basis of N_emb orthogonal vectors in embedding space, but some much larger set of ~exp(N_emb) almost orthogonal vectors. We only have 1600 degrees of freedom to tune, but they're continuous degrees of freedom, and this lets us express >>1600 distinct vectors in vocab space as long as we accept some small amount of reconstruction error.

I expect GPT and many other neural models are effectively working in such space of nearly orthogonal vectors, and picking/combining elements of it. A decomposition into orthogonal vectors won't really illuminate this. I wish I knew more about this topic -- are there standard techniques?

Load More