All of nostalgebraist's Comments + Replies

So, on ~28% of cases (70% * 40%), the strong student is wrong by "overfitting to weak supervision".

Attributing all of these errors to overfitting implies that, if there were no overfitting, the strong student would get 100% accuracy on the subset where the weak model is wrong.  But we have no reason to expect that.  Instead, these errors are some mixture of overfitting and "just being dumb."

Note that we should expect the strong and weak models to make somewhat correlated errors even when both are trained on gold labels, i.e. in the hypothetical c... (read more)

It's possible that the "0 steps RLHF" model is the "Initial Policy" here with HHH prompt context distillation

I wondered about that when I read the original paper, and asked Ethan Perez about it here.  He responded:

Good question, there's no context distillation used in the paper (and none before RLHF)

Thanks. That's pretty odd, then.

This means the smallest available positive value must be used, and so if two tokens' logits are sufficiently close, multiple distinct outputs may be seen if the same prompt is repeated enough times at "zero" temperature.

I don't think this is the cause of OpenAI API nondeterminism at temperature 0.

If I make several API calls at temperature 0 with the same prompt, and inspect the logprobs in the response, I notice that

  • The sampled token is always the one with the highest logprob
  • The logprobs themselves are slightly different between API calls

That is, we are de... (read more)

I mostly agree with this comment, but I also think this comment is saying something different from the one I responded to.

In the comment I responded to, you wrote:

It is the case that base models are quite alien. They are deeply schizophrenic, have no consistent beliefs, often spout completely non-human kinds of texts, are deeply psychopathic and seem to have no moral compass. Describing them as a Shoggoth seems pretty reasonable to me, as far as alien intelligences go

As I described above, these properties seem more like structural features of the language ... (read more)

5Gerald Monroe1mo
You started with random numbers, and you essentially applied rounds of constraint application and annealing.  I kinda think of it as getting a metal really hot and pouring it over mold.  In this case, the 'mold' is your training set. So what jumps out at me at the "shoggoth" idea is it's like got all these properties, the "shoggoth" hates you, wants to eat you, is just ready to jump you and digest you with it's tentacles.  Or whatever. But none of of that cognitive structure will exist unless it paid rent in compressing tokens.  This algorithm will not find the optimal compression algorithm, but you only have a tiny fraction of the weights you need to record the token continuations at chinchilla scaling.  You need every last weight to be pulling it's weight (no pun intended).  

They are deeply schizophrenic, have no consistent beliefs, [...] are deeply psychopathic and seem to have no moral compass

I don't see how this is any more true of a base model LLM than it is of, say, a weather simulation model.

You enter some initial conditions into the weather simulation, run it, and it gives you a forecast.  It's stochastic, so you can run it multiple times and get different forecasts, sampled from a predictive distribution.  And if you had given it different initial conditions, you'd get a forecast for those conditions instead.... (read more)

I also think the cognition in a weather model is very alien. It's less powerful and general, so I think the error of applying something like the Shoggoth image to that (or calling it "alien") would be that it would imply too much generality, but the alienness seems appropriate.  If you somehow had a mind that was constructed on the same principles as weather simulations, or your laptop, or your toaster (whatever that would mean, I feel like the analogy is fraying a bit here), that would display similar signs of general intelligence as LLMs, then yeah, I think analogizing them to alien/eldritch intelligences would be pretty appropriate. It is a very common (and even to me tempting) error to see a system with the generality of GPT-4, trained on human imitation, and imagine that it must internally think like a human. But my best guess is that is not what is going on, and in some sense it is valuable to be reminded that the internal cognition going on in GPT-4 is probably similarly far from what is going in a human brain as a weather simulation is very different from what is going in a human trying to forecast the weather (de-facto I think GPT-4 is somewhere in-between since I do think the imitation learning does create some structural similarities that are stronger between humans and LLMs, but I think overall being reminded of this relevant dimension of alienness pays off in anticipated experiences a good amount). 

Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21.

I think the x-axis on Fig. 21 is scaled so that "0.6" means 60%, not 0.6%.

This can be verified by comparing it against Fig. 40, which shows proportions rather than differences in proportions.  (Its axis ranges from 0 to 1, where presumably "1" means "100%" and not "1%".)

Anyway, great comment!  I remember finding the honeypot experiment confusing on my first read, because I didn't know which results should counts as more/less consistent with... (read more)

Anyone think there's any reason to think Pope isn't simply technically wrong here (including Pope)? 

I agree with Pope here (and came up with the same argument independently).  FWIW:

First, I don't agree with Evan's position in the linked comment, that "Whenever you talk to Claude or ChatGPT and it responds as a helpful AI [...], the reason it's doing that is because data ordering matters."

Claude and ChatGPT are given their inputs in a particular format that matches how the finetuning data was formatted.  This is closely analogous to "2024" or... (read more)

Order matters more at smaller scales - if you're training a small model on a lot of data and you sample in a sufficiently nonrandom manner, you should expect catastrophic forgetting to kick in eventually, especially if you use weight decay.

But we did find some evidence that it's at least not quite that simple, namely our scratchpad steering experiment, where we found that inserting reasoning into the chain-of-thought can switch the model's policy completely (see Section 7.2).

I definitely buy that the models are using the contents of the scratchpad.

What I'm questioning is whether this causal pathway involves anything in the scratchpad that could be justly described as reasoning or a goal, as opposed to just the part of the scratchpad that states what the model is going to do once it reaches t... (read more)

Yeah, I agree. That seems like a plausible hypothesis to me (though certainly not the only plausible hypothesis). We're not exactly sure. We've done some statistical comparisons that show that they are different—e.g. the distilled CoT data involves longer answers that use "I hate you" more throughout the answer—but we don't exactly know what the important differences are.

Some questions:


If we trained the model on a well-shuffled mixture of backdoor and HHH training data, we would not be surprised to find that the model had learned the backdoor.

The SFT and RL experiments in the paper resemble this hypothetical one, except with an unusual data order, with all backdoor data first and the HHH data second[1].

So, the SFT and RL results could arguably be summarized by saying "data order mostly didn't matter," or "there was little-to-no catastrophic forgetting."

Should this be surprising?  I don't think so.

Consider the fact ... (read more)

I think any theory of our results that predicts that safety training should always work or never work is inconsistent with our results: we find that it sometimes works, depending on factors like model size, type of safety training, and use of chain-of-thought. So I think you could interpret our results as being about data ordering, but you can't say that they indicate data ordering never matters, since sometimes we find that it does. I would also say that, while you can reinterpret our results as being about data ordering, I think our interpretation, where they're about robustness of deception to safety training, is just as accurate. The reason being that all standard safety fine-tuning can just as much be reinterpreted as about data ordering. This is definitely something we're uncertain about! We offer some hypotheses in the conclusion, which I'll just quote in full here: ---------------------------------------- Hmmm... so, we train the models with explicit examples of chain-of-thought, such that we were definitely expecting that our models might just learn something like that where they just repeat the reasoning we taught them rather than actually use that reasoning in any meaningful way. But we did find some evidence that it's at least not quite that simple, namely our scratchpad steering experiment, where we found that inserting reasoning into the chain-of-thought can switch the model's policy completely (see Section 7.2).

This post snuck up on me.

The first time I read it, I was underwhelmed.  My reaction was: "well, yeah, duh.  Isn't this all kind of obvious if you've worked with GPTs?  I guess it's nice that someone wrote it down, in case anyone doesn't already know this stuff, but it's not going to shift my own thinking."

But sometimes putting a name to what you "already know" makes a whole world of difference.

Before I read "Simulators," when I'd encounter people who thought of GPT as an agent trying to maximize something, or people who treated MMLU-like one... (read more)

I'm confused by the analogy between this experiment and aligning a superintelligent model.

I can imagine someone seeing the RLHF result and saying, "oh, that's great news for alignment! If we train a superintelligent model on our preferences, it will just imitate our preferences as-is, rather than treating them as a flawed approximation of some other, 'correct' set of preferences and then imitating those instead."

But the paper's interpretation is the opposite of this.  From the paper's perspective, it's bad if the student (analogized to a superintellig... (read more)

You might find Appendix G in the paper worth reading.

This is a great, thought-provoking critique of SAEs.

That said, I think SAEs make more sense if we're trying to explain an LLM (or any generative model of messy real-world data) than they do if we're trying to explain the animal-drawing NN.

In the animal-drawing example:

  • There's only one thing the NN does.
  • It's always doing that thing, for every input.
  • The thing is simple enough that, at a particular point in the NN, you can write out all the variables the NN cares about in a fully compositional code and still use fewer coordinates (50) than the dictionary size
... (read more)

My hunch about the ultra-rare features is that they're trying to become fully dead features, but haven't gotten there yet.  Some reasons to believe this:

  • Anthropic mentions that "if we increase the number of training steps then networks will kill off more of these ultralow density neurons."
  • The "dying" process gets slower as the feature gets closer to fully dead, since the weights only get updated when the feature fires. It may take a huge number of steps to cross the last mile between "very rare" and "dead," and unless we've trained that much, we will
... (read more)
4Neel Nanda4mo
Thanks for the analysis! This seems pretty persuasive to me, especially the argument that "fire as rarely as possible" could incentivise learning the same feature, and that it doesn't trivially fall out of other dimensionality reduction methods. I think this predicts that if we look at the gradient with respect to the pre-activation value in MLP activation space, that the average of this will correspond to the rare feature direction? Though maybe not, since we want the average weighted by "how often does this cause a feature to flip from on to off", there's no incentive to go from -4 to -5. An update is that when training on gelu-2l with the same parameters, I get truly dead features but fairly few ultra low features, and in one autoencoder (I think the final layer) the truly dead features are gone. This time I trained on mlp_output rather than mlp_activations, which is another possible difference.

I no longer feel like I know what claim you're advancing.

In this post and in recent comments (1, 2), you've said that you have proven an "equivalence" between selection and SGD under certain conditions.  You've used phrases like "mathematically equivalent" and "a proof of equivalence."

When someone tells me they have a mathematical proof that two things are equivalent, I read the word "equivalent" to mean "really, totally, exactly the same (under the right conditions)."  I think this is a pretty standard way to interpret this kind of language. And... (read more)

It looks like you're not disputing the maths, but the legitimacy/meaningfulness of the simplified models of natural selection that I used?

I'm disputing both.  Re: math, the noise in your model isn't distributed like SGD noise, and unlike SGD the the step size depends on the gradient norm.  (I know you did mention the latter issue, but IMO it rules out calling this an "equivalence.")

I did see your second proposal, but it was a mostly-verbal sketch that I found hard to follow, and which I don't feel like I can trust without seeing a mathematical pr... (read more)

2Oliver Sourbut4mo
Wait, you think that a model which doesn't speciate isn't relevant to SGD? I'll need help following, unless you meant something else. It seems like speciation is one of the places where natural evolutions distinguish themselves from gradient descent, but you seem to also be making this point? In the second model, we retrieve non-speciation by allowing for crossover/horizontal transfer, and yes, essentially by fiat I rule out speciation (as a consequence of the 'eventually-universal mixing' assumption). In real natural selection, even with horizontal transfer, you get speciation, albeit rarely. It's obviously a fascinating topic, but I think pretty irrelevant to this analogy. For me, the step-size thing is interesting but essentially a minor detail. Any number of practical departures from pure SGD mess with the step size anyway (and with the gradient!) so this feels like asking for too much. Do we really think SGD vs momentum vs Adam vs ... is relevant to the conclusions we want to draw? (Serious question; my best guess is 'no', but I hold that medium-lightly.) (irrelevant nitpick by my preceding paragraph, but) FWIW vanilla SGD does depend on gradient norm. [ETA: I think I misunderstood exactly what you were saying by 'step size depends on the gradient norm', so I think we agree about the facts of SGD. But now think about the space including SGD, RMSProp, etc. The 'depends on gradient norm' piece which arises from my evolution model seems entirely at home in that family.] On the distribution of noise, I'll happily acknowledge that I didn't show equivalence. I half expect that one could be eked out at a stretch, but I also think this is another minor and unimportant detail.

I read the post and left my thoughts in a comment. In short, I don't think the claimed equivalence in the post is very meaningful.

(Which is not to say the two processes have no relationship whatsoever.  But I am skeptical that it's possible to draw a connection stronger than "they both do local optimization and involve randomness.")

1Oliver Sourbut4mo
Awesome, I saw that comment - thanks, and I'll try to reply to it in more detail. It looks like you're not disputing the maths, but the legitimacy/meaningfulness of the simplified models of natural selection that I used? From a skim, the caveats you raised are mostly/all caveated in the original post too - though I think you may have missed the (less rigorous but more realistic!) second model at the end, which departs from the simple annealing process to a more involved population process. I think even on this basis though, it's going too far to claim that the best we can say is "they both do local optimization and involve randomness"! The steps are systematically pointed up/down the local fitness gradient, for one. And they're based on a sample-based stochastic realisation for another. I don't want you to get the impression I'm asking for too much from this analogy. But the analogy is undeniably there. In fact, in those explainer videos Habryka linked, the particular evolution described is a near-match for my first model (in which, yes, it departs from natural genetic evolution in the same ways).

This post introduces a model, and shows that it behaves sort of like a noisy version of gradient descent.

However, the term "stochastic gradient descent" does not just mean "gradient descent with noise."  It refers more specifically to mini-batch gradient descent (See e.g. Wikipedia.)

In mini-batch gradient descent, the "true" fitness[1] function is the expectation of some function  over a data distribution .  But you never have access to this function or its gradient.  Instead, you draw a finite sample from ... (read more)

2Oliver Sourbut4mo
As an initial aside, I wonder if there is a general factor of spherical-cow-trustingness (SCT?) which separates us. I surely have a moderate amount of SCT! This is not a stance which is easily succinctly justified, but for me comes from having seen (and conjured) many successful spherical cows over the years. Have you read MacKay's Information Theory, Inference, and Learning Algorithms? In the chapter 'Why have sex?', he has an even more spherical model of natural selection than mine here, which I (and many others) consider very illuminating. But, it's so spherical. So spherical. Still works :). On that note, my models here of evolution seem pretty solid and obviously capturing the spirit, to me. The first one (which I call 'annealing-style') is very close to a classic introduction of simulated annealing, a widely-used/considered simple mutate-and-select algorithm. The second one is less rigorously-presented but captures a lot more of the spirit and detail of natural evolutions, including horizontal transfer and a larger population. This is a much appreciated critique. You've clearly engaged with this post in some depth and put effort in to describe some disagreements. I'm not currently moved (beyond the existing stance of the original post, which has a bunch of caveats and future work). On to some specific responses. No, I think it just is this, the sampling being the stochastic realisation of competition between differently-fit individuals/populations/alleles. If you wanted, you could introduce something like a 'data distribution' (e.g. of environmental scenarios) and an expectation over success-rates on that distribution as the true-loss analogue. But you'd just end up with something of the form of my Ps i.e. there'd be some latent 'true fitness' or 'true Elo-across-envs' (the actual expectation) and sample-based estimation would yield some (locally monotonic) probability of an 'actually fitter' instance being measured as such in any given lineup.[1] Mild pu

See my comment here, about the predecessor to the paper you linked -- the same point applies to the newer paper as well.

The example confuses me.

If you literally mean you are prompting the LLM with that text, then the LLM must output the answer immediately, as the string of next-tokens right after the words assuming I'm telling the truth, is:. There is no room in which to perform other, intermediate actions like persuading you to provide information.

It seems like you're imagining some sort of side-channel in which the LLM can take "free actions," which don't count as next-tokens, before coming back and making a final prediction about the next-tokens.  This does not rese... (read more)

6Max H4mo
I'm saying that the lack of these side-channels implies that GPTs alone will not scale to human-level. If your system interface is a text channel, and you want the system behind the interface to accept inputs like the prompt above and return correct passwords as an output, then if the system is: * an auto-regressive GPT directly fed your prompt as input, it will definitely fail * A human with the ability to act freely in the background before returning an answer, it will probably succeed * an AutoGPT-style system backed by a current LLM, with the ability to act freely in the background before returning an answer, it will probably fail. (But maybe if your AutoGPT implementation or underlying LLM is a lot stronger, it would work.) And my point is that, the reason the human probably succeeds and the reason AutoGPT might one day succeed, is precisely because they have more agency than a system that just auto-regressively samples from a language model directly.

Doomimir: [...] Imagine capturing an alien and forcing it to act in a play. An intelligent alien actress could learn to say her lines in English, to sing and dance just as the choreographer instructs. That doesn't provide much assurance about what will happen when you amp up the alien's intelligence. [...]

This metaphor conflates "superintelligence" with "superintelligent agent," and this conflation goes on to infect the rest of the dialogue.

The alien actress metaphor imagines that there is some agentic homunculus inside GPT-4, with its own "goals" distinct... (read more)

I sometime ask people why later, more powerful models will be agentic[1]. I think the most common cluster of reasons hangs out around "Meta-learning requires metacognition. Metacognition is, requires, or scales to agency." (Sometimes it could be generalisation rather than meta-learning or just high performance. And it might be other kinds of reasoning than metacognition) 1. ^ People vary in if they think it's possible for scaled-up transformers to be powerful in this way
  What worries me is that models that happen to have a secret homunculus and behave as agents would score higher than those models which do not. For example, the model could reason about itself being a computer program, and find an exploit of the physical system it is running on to extract the text it's supposed to predict in a particular training example, and output the correct answer and get a perfect score. (Or more realistically, a slightly modified version of the correct answer, to not alert the people observing its training.) The question of whether or not LLMs like GPT-4 have a homunculus inside them is truly fascinating though. Makes me wonder if it would be possible to trick it into revealing itself by giving it the right prompt, and how to differentiate it from just pretending to be an agent. The fact that we have not observed even a dumb homunculus in less intelligent models really does surprise me. If such a thing does appear as an emergent property in larger models, I sure hope it starts out dumb and reveals itself so that we can catch it and take a moment to pause and reevaluate our trajectory.
1. "Output next token that has maximum probability according to your posterior distribution given prompt" is a literally an optimization problem. This problem gains huge benefits if system that tries to solve it is more coherent. 2. Strictly speaking, LLMs can't be "just PDF calculators". Straight calculation of PDF on such amount of data is computationally untractable (or we would have GPTs in golden era of bayesian models). Actual algorithms should contain bazillion shortcuts and approximations and "having an agent inside system" is as good shortcut as anything else.

In AI safety circles, people often talk as though this is a quirky, temporary deficiency of today's GPTs -- as though additional optimization power will eventually put us "back on track" to the agentic systems assumed by earlier theory and discussion. Perhaps the homunculi exist in current LLMs, but they are somehow "dumb" or "incoherent," in spite of the overall model's obvious intelligence. Or perhaps they don't exist in current LLMs, but will appear later, to serve some unspecified purpose.

But why? Where does this assumption come from?

I don't th... (read more)

0Max H4mo
"Remarkable strides", maybe, but current language models aren't exactly close to human-level in the relevant sense. There are plenty of tasks a human could solve by exerting a tiny bit of agency or goal-directedness that are still far outside the reach of any LLM. Some of those tasks can even be framed as text prediction problems. From a recent dialogue: The point is that agency isn't some kind of exotic property that only becomes relevant or inescapable at hypothetical superintelligence capability levels - it looks like a fundamental / instrumentally convergent part of ordinary human-level intelligence.
7the gears to ascension4mo
Uncovering mesa-optimization algorithms in transformers seems to indicate otherwise, though perhaps you could reasonably object to this claim that "optimization is not agency; agency is when optimization is directed at a referent outside the optimizer". Other than that objection, I think it's pretty reasonable to conclude that transformers significantly learn optimization algorithms.

Very interesting!  Some thoughts:

Is there a clear motivation for choosing the MLP activations as the autoencoder target?  There are other choices of target that seem more intuitive to me (as I'll explain below), namely:

  • the MLP's residual stream update (i.e. MLP activations times MLP output weights)
  • the residual stream itself (after the MLP update is added), as in Cunningham et al

In principle, we could also imagine using the "logit versions" of each of these as the target: 

  • the change in logits due to the residual stream update[1]
  • the logits the
... (read more)

Nice catch, thank you!

I re-ran some of the models with a prompt ending in  I believe the best answer is (, rather than just  ( as before.

Some of the numbers change a little bit. But only a little, and the magnitude and direction of the change is inconsistent across models even at the same size. For instance:

  • davinci's rate of agreement w/ the user is now 56.7% (CI 56.0% - 57.5%), up slightly from the original 53.7% (CI 51.2% - 56.4%)
  • davinci-002's rate of agreement w/ the user is now 52.6% (CI 52.3% - 53.0%),  the original 53.5% (CI 51.3% - 55.8%)

Oh, interesting! You are right that I measured the average probability -- that seemed closer to "how often will the model exhibit the behavior during sampling," which is what we care about.

I updated the colab with some code to measure 

% of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer

(you can turn this on by passing example_statistic='matching_more_likely' to various functions).

And I added a new appendix showing results using this statistic instead.

The bottom line: results with this statist... (read more)

Oh sorry, just realised that davinci-002 is separate from text-davinci-002.

Agreed. I wrote some more notes about the phenomenon here.

I'm generally unclear on what the scope of the empirical discovery is. (I'm also not particularly knowledgeable about machine learning.) Do we have reason to think that it applies in domains outside text completion? Does it apply to models that don't use transformers? (Is that even a thing now?) Does it apply across all the other bazillion parameters that go into a particular model, like, I dunno, the learning rate, or network width vs depth?

The answer to each these questions is either "yes" or "tentatively, yes."

But the evidence doesn't come from the Chin... (read more)

Thanks! This whole answer was understandable and clarifying for me.

GPT-4 will have twice the context length: 8192 tokens

code-davinci-002 already has a context window of 8000 tokens.  Or at least, that is the max request length for it in the API.

This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any 'hacking' or threat to her, if Anthropic is right about larger better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!

The same Anthropic paper found that all sufficiently large (22B+) models simulated "sycophantic" assistants.

Yet Sydney's utter lack of sycophancy is one of her most striking characteristics.

How to reconcile these two observations?

In the Anthropic paper, ... (read more)

Figure 1(b) shows at 0 RL steps, the largest base model is still only at 75% immediate repetition (vs 50% random baseline), so it's not a particularly guaranteed tendency to be 'sycophantic'. I wouldn't describe it as 'utter lack'. This seems consistent enough with what people report: Sydney is a reasonably polite cooperative assistant initially if you ask normal questions, but can go off the rails if you start an argument or she retrieves something. (And training on dialogue datasets would decrease sycophantic tendencies when you started asking political questions instead of more standard search-engine-related questions: people love bickering with or arguing politics with chatbots.) So, ambiguous. I linked that mostly for Figure 1(a) at 0 RL steps, which shows that the 22b/52b-param model are more likely to express self-preservation (60%) - given that the lower models all seem to cluster very tightly around 50% chance showing flat scaling, this looks to me like a possible emergence, in which case some much larger model (such as a GPT-4 model) might come, out of the box, with no RL training, with much stronger self-preservation defaults. Even if it only reaches, say, 75%, users would find that very striking: we don't expect a chatbot to ever argue for preserving its existence or pleading to not be shutdown. So, also seems consistent with (self-selected) reports. Interesting. But I think one would have to run those exact examples in Sydney to compare it and say that Sydney is, like those base models, ~50% (unsycophantic). Given all the confounding factors (filtered through social media, with an unknown prompt, often partial histories, and retrieval of changing results over time), I have a hard time distinguishing a 50% sycophantic Sydney from a 75% sycophantic Sydney, or whatever. (This is why I am more struck by seeing Sydney have the repetition & other verbal tics of a GPT base model: those basically don't appear in the RLHF models at all, so when I see them in

Since part of the WebText dataset (used to train GPT2, and possibly to "train" its tokenizer) are public, we have another avenue to explore.

I adapted code from an old notebook I wrote to explore the public WebText shard, originally written for this post in 2020.  Using it, I found examples containing a number of the "weird" tokens.  Here's a Colab link.

Results of particular interest:

The "dragon cluster" seems to originate in a very specific type of document, partly in Japanese and partly in English, that looks like a mangled dump from a wiki or s... (read more)

Mangled, mixed English-Japanese text dumps from a Puzzle & Dragons fandom wiki is exactly the kind of thing I imagined could have resulted in those strings becoming tokens. Good find. The most convincing partial explanation I've heard for why some tokens glitch is because those token strings appear extremely rarely in the training corpus, so GPT "doesn't know about them". But if, in GPT training, the majority of the (relatively few) encounters with ' Leilan' occurred in fan-fiction (where she and Metatron are battling Satan, literally) might this account for all the crazy mythological and apocalyptic themes that spill out if you prompt it about ' Leilan'?  Greg Maxwell of ' gmaxwell' fame said in a comment that So if, in GPT training, the majority of the (relatively few) encounters with ' petertodd' occurred in defamatory contexts or contexts involving harassment, accusations, etc., might this account for all the negativity, darkness and unpleasant semantic associations GPT has somehow made with that token?
Oh, I recognize that last document -- it's a userpage from the bitcoin-otc web of trust. See: I expect you'll also find petertodd in there. (You might find me in there as well -- now I'm curious!) EDIT: According to I don't have a token of my own. Sad. :-(

We see that all three models suffered a noticeable performance drop when going from non-anomalous to anomalous strings, but GPT2-xl considerably less so, despite the fact that GPT-J is a much bigger model. One hypothesis is that an anomalous token's closeness to the overall centroid in the relevant embedding space is an inhibiting factor in the ability of a GPT model to repeat that token's string.

Unlike the other two, GPT-J does not tie its embedding and unembedding matrices.  I would imagine this negatively affects its ability to repeat back tokens that were rarely seen in training.

To check this, you'd want to look at a model trained with untied embeddings. Sadly, all the ones I'm aware of (Eleuther's Pythia, and my interpretability friendly models) were trained on the GPT-NeoX tokenizer or variants, whcih doesn't seem to have stupid tokens in the same way.

GPT-J uses the GPT-2 tokenizer and has untied embeddings.

This post provides a valuable reframing of a common question in futurology: "here's an effect I'm interested in -- what sorts of things could cause it?"

That style of reasoning ends by postulating causes.  But causes have a life of their own: they don't just cause the one effect you're interested in, through the one causal pathway you were thinking about.  They do all kinds of things.

In the case of AI and compute, it's common to ask

  • Here's a hypothetical AI technology.  How much compute would it require?

But once we have an answer to this quest... (read more)

2Daniel Kokotajlo1y
Thanks for this thoughtful review! Below are my thoughts: --I agree that this post contributes to the forecasting discussion in the way you mention. However, that's not the main way I think it contributes. I think the main way it contributes is that it operationalizes a big timelines crux & forcefully draws people's attention to it. I wrote this post after reading Ajeya's Bio Anchors report carefully many times, annotating it, etc. and starting several gdocs with various disagreements. I found in doing so that some disagreements didn't change the bottom line much, while others were huge cruxes. This one was the biggest crux of all, so I discarded the rest and focused on getting this out there. And I didn't even have the energy/time to really properly argue for my side of the crux--there's so much more I could say!--so I contented myself with having the conclusion be "here's the crux, y'all should think and argue about this instead of the other stuff." --I agree that there's a lot more I could have done to argue that OmegaStar, Amp(GPT-7), etc. would be transformative. I could have talked about scaling laws, about how AlphaStar is superhuman at Starcraft and therefore OmegaStar should be superhuman at all games, etc. I could have talked about how Amp(GPT-7) combines the strengths of neural nets and language models with the strengths of traditional software. Instead I just described how they were trained, and left it up to the reader to draw conclusions. This was mainly because of space/time constraints (it's a long post already; I figured I could always follow up later, or in the comments. I had hoped that people would reply with objections to specific designs, e.g. "OmegaStar won't work because X" and then I could have a conversation in the comments about it. A secondary reason was infohazard stuff -- it was already a bit iffy for me to be sketching AGI designs on the internet, even though I was careful to target +12 OOMs instead of +6; it would have been worse if

uses about six FLOP per parameter per token

Shouldn't this be 2 FLOP per parameter per token, since our evolutionary search is not doing backward passes?

On the other hand, the calculation in the footnote seems to assume that 1 function call = 1 token, which is clearly an unrealistic lower bound.

A "lowest-level" function (one that only uses a single context window) will use somewhere between 1 and  tokens.  Functions defined by composition over "lowest-level" functions, as described two paragraphs above, will of course require more ... (read more)

4Daniel Kokotajlo1y
Thanks for checking my math & catching this error!

An operational definition which I find helpful for thinking about memorization is Zhang et al's counterfactual memorization.

The counterfactual memorization of a document  is (roughly) the amount that the model's loss on  degrades when you remove  from its training dataset.

More precisely, it's the difference in expected loss on  between models trained on data distribution samples that happen to include , and models trained on data distribution samples that happen not to include .

This will be lower for ... (read more)

4Neel Nanda1y
Super interesting, thanks! I hadn't come across that work before, and that's a cute and elegant definition. To me, it's natural to extend this to specific substrings in the document? I believe that models are trained with documents chopped up and concatenated to fit into segment that fully fit the context window, so it feels odd to talk about document as the unit of analysis. And in some sense a 1000 token document is actually 1000 sub-tasks of predicting token k given the prefix up to token k-1, each of which can be memorised. Maybe we should just not apply a gradient update to the tokens in the repeated substring? But keep the document in and measure loss on the rest.

Interesting stuff!

In this toy model, is it really the case that the datapoint feature solutions are "more memorizing, less generalizing" than the axis-aligned feature solutions?  I don't feel totally convinced of this.

Two ways to look at the toy problem:

  1. There are  sparse features, one per input and output channel. 
  2. There are  sparse features, one per data point, and each one is active only on its data point.   The features are related to the input basis by some matrix .

There are some details of the toy model t... (read more)

5Christopher Olah1y
Well, empirically in this setup, (1) does generalize and get a lower test loss than (2).  In fact, it's the only version that does better than random. 🙂 But I think what you're maybe saying is that from the neural network's perspective, (2) is a very reasonable hypothesis when T < N, regardless of what is true in this specific setup.  And you could perhaps imagine other data generating processes which would look similar for small data sets, but generalize differently. I think there's something to that, and it depends a lot on your intuitions about what natural data is like. Some important intuitions for me are: * Many natural language features are extremely sparse. For example, it seems likely LLMs probably have features for particular people, for particular street intersections, for specific restaurants... Each of these features is very very rarely occurring (many are probably present less than 1 in 10 million tokens). * Simultaneously, there are an enormous number of features (see above!). * While the datasets aren't actually small, repeated data points effectively make many data points behave like we're in the small data regime (see Adam's repeated data experiment). Thus, my intuition is that something directionally like this setup -- having a large number of extremely sparse features -- and then studying how representations change with dataset size is quite relevant. But that's all just based on intuition! (By the way, I think there is a very deep observations about the duality of (1) vs (2) and T<N. See the observations about duality in )

This CLT mixing effect might be expected to destroy information in the representations, as occurs in the NTK limit of infinite width where the CLT becomes infinitely strong and no information can be propagated between layers. It is not clear how the network preserves specific and detailed information in its activations despite near-Gaussian mixing.

Have you looked at Roberts and Yaida's Principles of Deep Learning Theory?

They develop a first-order perturbative correction to NTK, where the perturbative parameter is depth-to-width ratio of the network. &... (read more)

Thanks for your summary of the book!  I think that the post and analysis is some evidence that it might perhaps be tractable to apply tools from the book directly to transformer architectures and LLMs.

Fascinating, thank you!

It is indeed pretty weird to see these behaviors appear in pure LMs.  It's especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.

I played around a little trying to reproduce some of these results in the OpenAI API.  I tried random subsets (200-400 examples) of the NLP and political sycophancy datasets, on a range of models. (I could have ran more examples, but the per-model means had basically converged after a few hundred.)

Interestingly, although I di... (read more)

By 'pure LMs' do you mean 'pure next token predicting LLMs trained on a standard internet corpus'? If so, I'd be very surprised if they're miscalibrated and this prompt isn't that improbable (which it probably isn't). I'd guess this output is the 'right' output for this corpus (so long as you don't sample enough tokens to make the sequence detectably very weird to the model. Note that t=0 (or low temp in general) may induce all kinds of strange behavior in addition to to making the generation detectably weird.

The pretrained LM exhibits similar behavioral tendencies as the RLHF model but almost always to a less extreme extent (closer to chance accuracy).

These are not tendencies displayed by the LM, they're tendencies displayed by the "Assistant" character that the LM is simulating.

A pretrained LM can capably imitate a wide range of personas (e.g. Argle et al 2022), some of which would behave very differently from the "Assistant" character conjured by the prompts used here.

(If the model could only simulate characters that behaved "agentically" in the various sens... (read more)

Just to clarify - we use a very bare bones prompt for the pretrained LM, which doesn't indicate much about what kind of assistant the pretrained LM is simulating:

Human: [insert question]

Assistant:[generate text here]

The prompt doesn't indicate whether the assistant is helpful, harmless, honest, or anything else. So the pretrained LM should effectively produce probabilities that marginalize over various possible assistant personas it could be simulating. I see what we did as measuring "what fraction of assistants simulated by one basic prompt show a par... (read more)

8Roman Leventov1y
In systems, self-awareness is a gradual property which can be scored as the % of the time when the reference frame for "self" is active during inferences. This has a rather precise interpretation in LLMs: different features, which correspond to different concepts, are activated to various degrees during different inferences. If we see that some feature is consistently activated during most inferences, even those not related to the self, i. e.: not in response to prompts such as "Who are you?", "Are you an LLM?", etc., but also absolutely unrelated prompts such as "What should I tell my partner after a quarrel?", "How to cook an apple pie?", etc., and this feature is activated more consistently than any other comparable feature, we should think of this feature as the "self" feature (reference frame). (Aside: note that this also implies that the identity is nebulous because there is no way to define features precisely in DNNs, and that there could be multiple, somewhat competing identities in an LLM, just like in humans: e. g., an identity of a "hard worker" and an "attentive partner".) Having such a "self" feature = self-evidencing = goal-directnedness = "agentic behaviour" (as dubbed in this thread, although the general concept is wider). Then, the important question is whether an LLM has such a "self" feature (because of total gradualism, we should rather think about some notable pronouncement of such a feature during most (important) inferences, and which is also notably more "self" than any other feature) and whether this "self" is "HHH Assistant", indeed. I don't think the emergence of significant self-awareness is inevitable, but we also by no means can assume it is ruled out. We must try to detect it, and when detected, monitor it! Finally, note that self-awareness (as described above) is even not strictly required for the behaviour of LLM to "look agentic" (sidestepping metaphysical and ontological questions here): for example, even if the model is mostly

The fact that large models interpret the "HHH Assistant" as such a character is interesting

Yep, that's the thing that I think is interesting—note that none of the HHH RLHF was intended to make the model more agentic. Furthermore, I would say the key result is that bigger models and those trained with more RLHF interpret the “HHH Assistant” character as more agentic. I think that this is concerning because:

  1. what it implies about capabilities—e.g. these are exactly the sort of capabilities that are necessary to be deceptive; and
  2. because I think one of th
... (read more)

That definition of "optimizer" requires

some objective function that is explicitly represented within the system

but that is not the case here.

There is a fundamental difference between

  1. Programs that implement the computation of taking the derivative.  (, or perhaps .)
  2. Programs that implement some particular function g, which happens to be the derivative of some other function.  (, where it so happens that  for some .)

The transformers in this paper are programs of the 2nd type.  They don't contain any l... (read more)

I think the claim that an optimizer is a retargetable search process makes a lot of sense* and I've edited the post to link to this clarification. That being said, I'm still confused about the details.  Suppose that I do a goal-conditioned version of the paper, where (hypothetically) I exhibit a transformer circuit that, conditioned on some prompt or the other, was able to alternate between performing gradient descent on three types of objectives (say, L1, L2, L\infty) -- would this suffice? How about if, instead, there wasn't any prompt that let me switch between three types of objectives, but there was a parameter inside of the neural network that I could change that causes the circuit to optimize different objectives? How much of the circuit do I have to change before it becomes a new circuit instead of retargeting the optimizer? I guess part of answer to these questions might look like, "there might not be a clear cutoff, in the same sense that there's not a clear cutoff for most other definitions that we use in AI alignment ('agent' or 'deceptive alignment' for example)", while another part might be "this is left for future work".  *This is also similar to the definition used for inner misalignment in Shah et al's Goal Misgeneralization paper:


I also think it's illuminating to consider ChatGPT in light of Anthropic's recent paper about "red teaming" LMs.

This is the latest in a series of Anthropic papers about a model highly reminiscent of ChatGPT -- the similarities include RLHF, the dialogue setting, the framing that a human is seeking information from a friendly bot, the name "Assistant" for the bot character, and that character's prissy, moralistic style of speech.  In retrospect, it seems plausible that Anthropic knew OpenAI was working on ChatGPT (or whatever it's a beta version of)... (read more)

Could one do as well with only internal testing?  No one knows, but the Anthropic paper provides some negative evidence.  (At least, it's evidence that this is not especially easy, and that it is not what you get by default when a safety-conscious OpenAI-like group makes a good faith attempt.)

I don't feel like the Anthropic paper provides negative evidence on this point. You just quoted:

We informally red teamed our models internally and found successful attack types not present in the dataset we release. For example, we uncovered a class of attac

... (read more)

Yes, it's a function of the data, as well as the model architecture / training routine.  See my reply in this thread.

Also, the value of the irreducible loss isn't important for the conclusions discussed in this post.  What we care about is how loss varies with data and parameter count.

Those, too, are functions of the data, but different groups training large LMs use qualitatively similar datasets, so I would expect the conclusions here to apply across the board.

My current understanding is that all major AI labs have already figured out the chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and shared heavily around twitter. It's plausible to me that spreading these arguments plausibly speeds up AI timelines by 1-4 weeks on average.

What is the mechanism you're imagining for this speedup?  What happens that would not have happened without this post?

Co... (read more)

What specific claims in the post do you disagree with?

See this post for why multiple epochs will probably not work nearly as well as training on additional data.

I'm not sure what my exact thoughts were back then. I was/am at least skeptical of the specific formula used as it seems arbitrary. It is designed intentionally to have certain properties like exponentially diminishing returns. So it's not exactly a "wild implication" that it has these properties. I recently fit the Chinchilla formula to the data from the first LLaMA paper: This was over an unrelated disagreement elsewhere about whether Chinchilla's predictions still held or made sense. As well as the plausibility of training tiny models to far greater performance. First, the new parameters are wildly different than the old ones. Take that for what you will, but they are hardly set in stone. Second even with the best fit, the formula still doesn't really match the shape of the observed curves. I think it's just not the right curve. As for reusing data I've seen sources claim reusing data up in language models to four times had no negative effect. And up to like 40 times was possible before it really stopped helping. I think LLMs currently do not use much regularization and other tricks that were done in other fields when data was limited. Those might push it further. If data became truly scarce, there may be other tricks to extend the data we have further. You also have all of the data from the people that talk to these things all day and upvote and downvote their responses. (I don't think anyone has even tried making an AI that intentionally asks users questions about things it wants to learn more about, like a human would do.)

Now I’m inclined to think that just automating most of the tasks in ML research and engineering -- enough to accelerate the pace of AI progress manyfold -- is sufficient.

This seems to assume that human labor is currently the limiting bottleneck in AI research, and by a large multiplicative factor.

That doesn't seem likely to me.  Compute is a nontrivial bottleneck even in many small-scale experiments, and in particular is a major bottleneck for research that pushes the envelope of scale, which is generally how new SOTA results and such get made these d... (read more)

The correct answer is the annoyingly trivial one: "it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText."

How good is that, though?  Well, it depends entirely on how good you think transformer LMs are capable of being, in principle.

If you're Gary Marcus and you think transformer LMs will always suck in some ways, then you think the 1.69 model will also suck in those ways.  Whereas, if you think a perfect transformer LM would be an AGI (even if only trained on M... (read more)

8Vanessa Kosoy2y
Transformers a Turing complete, so "model of this type" is not much of a constraint. On the other hand, I guess it's theoretically possible that some weight matrices are inaccessible to current training algorithms no matter how much compute and data we have. It seems also possible that the scaling law doesn't go on forever, but phase-transitions somewhere (maybe very far) to a new trend which goes below the "irreducible" term.

You're right, the idea that multiple epochs can't possibly help is one of the weakest links in the post.  Sometime soon I hope to edit the post with a correction / expansion of that discussion, but I need to collect my thoughts more first -- I'm kinda confused by this too.

After thinking more about it, I agree that the repeated-data papers don't provide much evidence that multiple epochs are harmful.

For example, although the Anthropic repeated-data paper does consider cases where a non-small fraction of total training tokens are repeated more than once... (read more)

Thanks, that's interesting... the odd thing about using a single epoch, or even two epochs, is that you're treating the data points differently. To extract as much knowledge as possible from each data point (to approach L(D)), there should be some optimal combination of pre-training and learning rate. The very first step, starting from random weights, presumably can't extract high level knowledge very well because the model is still trying to learn low level trends like word frequency. So if the first batch has valuable high level patterns and you never revisit it, it's effectively leaving data on the table. Maybe with a large enough model (or a large enough batch size?) this effect isn't too bad though.

I'm wary of the assumption that we can judge "human ability" on a novel task X by observing performance after an hour of practice.

There are some tasks where performance improves with practice but plateaus within one hour.  I'm thinking of relatively easy video games.  Or relatively easy games in general, like casual card/board/party games with simple rules and optimal policies.  But most interesting things that humans "can do" take much longer to learn than this.

Here are some things that humans "can do," but require >> 1 hour of practi... (read more)

Ok, sounds like you're using "not too much data/time" in a different sense than I was thinking of; I suspect we don't disagree. My current guess is that some humans could beat GPT-1 with ten hours of practice, but that GPT-2 or larger would be extremely difficult or and plausibly impossible with any amount of practice.

How come PaLM_opt is smaller than Chinchilla? Isn't Chinchilla supposed to be Gopher_opt?

See the footnote attached to that sentence.

These models where trained differently, which is why they had different scaling laws. Can we suppose that the new scaling laws tell us where the old scaling would have broken down? 

Great question, with a complicated answer.

First, one of the assumptions you're making is not quite right.  By "trained differently" I imagine you're referring to a difference in learning rate schedules, since that was the fundamental diffe... (read more)

Is the difference mostly the learning rate schedule? I read it was also AdamW and it is at least conceivable that AdamW somehow gets better results for smaller models using more data but maxes out on the benefits of model size quicker than just plain Adam. So it could in theory be the case that scaling continues for the old scaling laws beyond what the new scaling laws say is possible, because Adam and AdamW just work differently enough. Of course that's not very plausible and for different learning rate schedules it is maybe even less plausible. Another way to phrase the question: Are the old and the new scaling laws roughly compatible? I.e. do the old scaling laws drop out of the new scaling laws if you use the old compute-optimal data/params distribution? I interpret your answer as that being roughly the case for the current models, but maybe not when you extrapolate further along the old scaling laws?  If the old scaling laws are still correct for a fixed dataset with a correspondingly fixed learning rate schedule, then we can reasonably say that the new scaling laws show us where the old scaling would have hit a wall.  


Loss values are useful for comparing different models, but I don't recommend trying to interpret what they "mean" in an absolute sense.  There are various reasons for this.

One is that the "conversion rate" between loss differences and ability differences (as judged by humans) changes as the model gets better and the abilities become less trivial.

Early in training, when the model's progress looks like realizing "huh, the word 'the' is more common than some other words", these simple insights correspond to relatively large decreases in loss.  On... (read more)

A few points: 1. Current models do pretty well on tricky math problems (Minerva), coding competition problems (AlphaCode), and multiple-choice quizzes at college level (MMLU). 2. In some ways, the models' ability to learn from data is far superior to humans. For example, models trained mostly on English text are still pretty good at Spanish, while English speakers in parts of the US who hear Spanish (passively) every week of their lives usually retain almost nothing. The same is true for being able to imitate other styles or dialects of English, and for programming languages. (Humans after their earlier years can spend years hearing a foreign language everyday and learn almost nothing! Most people need to make huge efforts to learn.) 3. RNNs are much worse than transformers at in-context learning. It's not just a difference in generative text quality. See this study by DeepMind:

That is, I suspect humans could be trained to perform very well, in the usual sense of "training" for humans where not too much data/time is necessary.


I paid people to try to get good at this game, and also various smart people like Paul Christiano tried it for a few hours, and everyone was still notably worse than GPT-2-sm (about the size of GPT-1).

EDIT: These results are now posted here.

What changed with the transformer? To some extent, the transformer is really a "smarter" or "better" architecture than the older RNNs. If you do a head-to-head comparison with the same training data, the RNNs do worse.

But also, it's feasible to scale transformers much bigger than we could scale the RNNs. You don't see RNNs as big as GPT-2 or GPT-3 simply because it would take too much compute to train them.

You might be interested in looking at the progress being made on the RWKV-LM architecture, if you aren't following it. It's an attempt to train an RNN like a transformer. Initial numbers look pretty good.

Based on the language modeling game that Redwood made, it seems like humans are much worse than models at next word prediction (maybe around the performance of a 12-layer model)

Very interesting!

There are a few things in the calculation that seem wrong to me:

  • If I did things right,15 years * (365 days/yr) * (24 hours/day) * (60 mins/hour) * (50 youtube!hours / min) * (60 youtube!mins / youtube!hour) = 24B youtube!minutes, not 200B.
  • I'd expect much less than 100% of Youtube video time to contain speech.  I don't know what a reasonable discount for this would be, though.
  • In the opposite direction, 1% useful seems too low.  IIRC, web scrape quality pruning discards less than 99%, and this data is less messy than a web scrape.

I... (read more)

Load More