All of StefanHex's Comments + Replies

Hi, and thanks for the comment!

Do you think there should be a preference as to whether one patches clean --> corrupt or corrupt --> clean?

Both of these show slightly different things. Imagine an "AND circuit" where the result is only correct if two attention heads are clean. If you patch clean->corrupt (inserting a clean attention head activation into a corrupt prompt) you will not find this; but you do if you patch corrupt->clean. However, the opposite applies to a kind of "OR circuit". I historically had more success with corrupt->clean s... (read more)
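To make the AND/OR intuition concrete, here is a toy sketch (my own illustration, not from the original post; the boolean "heads" and the patching helper are made up for this example) of why clean->corrupt patching misses an AND circuit while corrupt->clean finds it, and vice versa for an OR circuit:

```python
# Toy "circuits": the behaviour depends on whether two attention heads carry clean information.
def and_circuit(head_a_clean: bool, head_b_clean: bool) -> bool:
    return head_a_clean and head_b_clean   # correct only if BOTH heads are clean

def or_circuit(head_a_clean: bool, head_b_clean: bool) -> bool:
    return head_a_clean or head_b_clean    # correct if EITHER head is clean

def patching_shows_effect(circuit, direction: str, patched_head: str) -> bool:
    """Patch a single head and report whether the patch visibly changes the behaviour."""
    if direction == "clean->corrupt":
        a, b = False, False                        # baseline run: everything corrupt
        if patched_head == "A": a = True           # insert one clean head activation
        if patched_head == "B": b = True
        return circuit(a, b)                       # effect = clean behaviour was restored
    else:  # "corrupt->clean"
        a, b = True, True                          # baseline run: everything clean
        if patched_head == "A": a = False          # insert one corrupt head activation
        if patched_head == "B": b = False
        return not circuit(a, b)                   # effect = clean behaviour was destroyed

for circuit, name in [(and_circuit, "AND"), (or_circuit, "OR")]:
    for direction in ["clean->corrupt", "corrupt->clean"]:
        effects = [patching_shows_effect(circuit, direction, h) for h in "AB"]
        print(f"{name} circuit, {direction}: per-head effect visible = {effects}")
# AND circuit: no per-head effect under clean->corrupt, both heads show up under corrupt->clean.
# OR circuit: exactly the opposite.
```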

Thanks for finding this!

There was one assumption in the StackExchange post I didn't immediately get, that the variance of  is . But I just realized the proof for that is rather short: Assuming  (the variance of ) is the identity then the left side is

and the right side is

so this works out. (The  symbols are sums here.)

Thanks for the extensive comment! Your summary is really helpful for seeing how this came across; here's my take on a couple of these points:

2.b: The network would be sneaking information about the size of the residual stream past LayerNorm. So the network wants to implement a sort of "grow by a factor X every layer" behaviour, and wants to prevent LayerNorm from resetting its progress.

  1. There's the difference between (i) How does the model make the residual stream grow exponentially -- the answer is probably theory 1, that something in the weights grows exponentially
... (read more)
4 TurnTrout 25d
Although -- naive speculation -- the deletion-by-magnitude theory could enforce locality in what layers read what information, which seems like it would cut away exponentially many virtual heads? That would be awfully convenient for interpretability. (More trying to gesture at some soft "locality" constraint, rather than make a confident / crisp claim in this comment.)

If I'm interpreting this correctly, then it sounds like the network is learning exponentially larger weights in order to compensate for an exponentially growing residual stream. However, I'm still not quite clear on why LayerNorm doesn't take care of this.

I understand the network's "intention" the other way around: I think that the network wants to have an exponentially growing residual stream. And in order to get an exponentially growing residual stream, the model increases its weights exponentially.

And our speculation for why the model would want this is our "favored explanation" mentioned above.
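To illustrate why LayerNorm alone doesn't prevent this, here is a minimal numeric sketch (my own toy, not code from the post; the sizes and growth factor are arbitrary): every block only ever sees a normalized input, but its weights, and therefore its output, grow exponentially with layer index.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, growth = 256, 12, 2.0      # toy sizes; "growth" is the per-layer weight factor

def layernorm(x):
    x = x - x.mean()
    return x / (x.std() + 1e-5)

resid = rng.normal(size=d_model)
for layer in range(n_layers):
    normed = layernorm(resid)                               # the block only sees a unit-scale input
    W = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
    W *= growth ** layer                                    # weights grow exponentially over layers
    resid = resid + W @ normed                              # pre-LN: output added straight to the stream
    print(f"layer {layer:2d}  resid norm = {np.linalg.norm(resid):10.1f}")
# The residual stream norm grows roughly like growth**layer, even though LayerNorm hands every
# block an input of the same scale -- normalization hides the growth from each block's input,
# but does not remove it from the stream itself.
```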

Thanks for the comment and linking that paper! I think this is about training dynamics though, norm growth as a function of checkpoint rather than layer index.

Generally I find basically no papers discussing the parameter or residual stream growth over layer number; all the similar-sounding papers seem to discuss parameter norms increasing as a function of epoch or checkpoint (training dynamics). I don't expect the scaling over epoch and layer number to be related?

Only this paper mentions layer number in this context, and the paper is about solving the vani... (read more)

3 Joseph Bloom 1mo
Thanks for the feedback. On a second reading of this post and the paper I linked, and having read the paper you linked, my thoughts have developed significantly. A few points I'll make here before making a separate comment:
- The post I shared originally does indeed focus on dynamics, but may have relevant general concepts in discussing the relationship between saturation and expressivity. However, it focuses on the QK circuit, which is less relevant here.
- My gut feel is that true explanations of related formulas should have non-trivial relationships. If you had a good explanation for why norms of parameters grew during training, it should relate to why norms of parameters are different across the model. However, this is a high-level argument, and the content of your post does of course directly address a different phenomenon (residual stream norms). If this paper had studied the training dynamics of the residual stream norm, I think it would be very relevant.

Oh I hadn't thought of this, thanks for the comment! I don't think this applies to Pre-LN Transformers though?

  1. In Pre-LN transformers every layer's output is directly connected to the residual stream (and thus just one unembedding away from the logits); wouldn't this remove the vanishing gradient problem? I just checked out the paper you linked: they claim exponentially vanishing gradients are a problem (only) in Post-LN, and show how Pre-LN (and their new method) prevents the problem, right?

  2. The residual stream norm curves seem to follow the exponential growth qu

... (read more)
2 Zach Furman 1mo
1. Yep, pre-LN transformers avoid the vanishing gradient problem.
2. Haven't checked this myself, but the phenomenon seems to be fairly clean? See figure 3.b in the paper I linked, or figure 1 in this paper. [https://www.andrew.cmu.edu/user/kaihu/Revisiting_Exploding_Gradient.pdf]

I actually wouldn't think of vanishing/exploding gradients as a pathological training problem but a more general phenomenon about any dynamical system. Some dynamical systems (e.g. the sigmoid map) fall into equilibria over time, getting exponentially close to one. Other dynamical systems (e.g. the logistic map) become chaotic, and similar trajectories diverge exponentially over time. If you check, you'll find the first kind leads to vanishing gradients (at each iteration of the map), and the second to exploding ones.

This is a forward pass perspective on the problem - the usual perspective on the problem considers only implications for the backward pass, since that's where the problem usually shows up. Notice above that the system with exponential decay in the forward pass had vanishing gradients (growing gradient norms) in the backward pass - the relationship is inverse. If you start with toy single-neuron networks, you can prove this to yourself pretty easily.

The predictions here are still complicated by a few facts - first, exponential divergence/convergence of trajectories doesn't necessarily imply exponentially growing/shrinking norms. Second, the layer norm complicates things, confining some dynamics to a hypersphere (modulo the zero-mean part). Haven't fully worked out the problem for myself yet, but still think there's a relationship here.
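A toy sketch of this forward-pass perspective (an illustration added here, not code from the comment; the maps, start value, and step count are arbitrary): iterate a converging map and a chaotic map, and track the product of per-step derivatives, which by the chain rule is exactly the factor a backward pass would multiply gradients by.

```python
import numpy as np

def iterate_and_track(f, df, x0, n_steps):
    """Iterate x_{t+1} = f(x_t) and accumulate d x_n / d x_0 via the chain rule."""
    x, grad = x0, 1.0
    for _ in range(n_steps):
        grad *= df(x)
        x = f(x)
    return x, grad

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
d_sigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))   # |f'| <= 1/4: trajectories converge

logistic = lambda x: 4.0 * x * (1.0 - x)                # r = 4: chaotic regime
d_logistic = lambda x: 4.0 - 8.0 * x                    # |f'| is often > 1: trajectories diverge

for name, f, df in [("sigmoid map", sigmoid, d_sigmoid), ("logistic map", logistic, d_logistic)]:
    x, grad = iterate_and_track(f, df, x0=0.3, n_steps=30)
    print(f"{name:12s}  final x = {x:.4f}   d x_30 / d x_0 = {grad:.3e}")
# For these settings the converging map gives a vanishing end-to-end derivative and the chaotic
# map an exploding one -- the forward-pass behaviour and the gradient behaviour come as a pair.
```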

Finally, we give a simple approach to verify that a particular token is unspeakable rather than just being hard-to-speak.

You're using an optimization procedure to find an embedding that produces an output, and if you cannot find one you say the token is unspeakable. How confident are you that the optimization is strong enough? I.e. what are the odds that a god-mode optimizer in this high-dimensional space could actually find an embedding that produces the "unspeakable" token, and it's just that linprog wasn't strong enough?

Just checking here, I can totally imagine that the optimizer is an unlikely point of failure. Nice work again!
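For concreteness, this is the kind of stronger gradient-based check I have in mind -- a minimal sketch only, with a random unembedding matrix standing in for the real model and an arbitrary target token id, not the post's actual linprog setup:

```python
import torch

torch.manual_seed(0)
d_model, vocab_size, target_token = 64, 1000, 123       # toy sizes; hypothetical target token id
W_U = torch.randn(d_model, vocab_size)                   # stand-in for "embedding -> logits"

def logits_from_embedding(emb):
    return emb @ W_U                                     # a real check would run the actual model

emb = torch.zeros(d_model, requires_grad=True)           # free input embedding to optimize
opt = torch.optim.Adam([emb], lr=0.1)
for step in range(500):
    opt.zero_grad()
    logits = logits_from_embedding(emb)
    competitors = logits[torch.arange(vocab_size) != target_token]
    loss = -(logits[target_token] - competitors.max())   # maximize the target token's margin
    loss.backward()
    opt.step()

reached = logits_from_embedding(emb).argmax().item() == target_token
print("target token became the argmax:", reached)
# If even an unconstrained gradient search cannot make the target token the argmax, that is
# stronger evidence for "unspeakable" than a single failed linprog run.
```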

Thanks Marius for this great write-up!

However, I was surprised to find that the datapoints the network misclassified on the training data are evenly distributed across the D* spectrum. I would have expected them to all have low D*, i.e. that the network didn't learn them.

My first intuition here was that the misclassified data points were ones where the network just tried to use the learned features and got it wrong, rather than points the network didn't bother to learn. Like, say, a 2 that looks a lot like an 8, so to the network it looks like a middle-of-the-spectrum 8?... (read more)

I don't think I understand the problem correctly, but let me try to rephrase this. I believe the key part is the claim about whether or not ChatGPT has a global plan? Let's say we run ChatGPT one output at a time, every time appending the output token to the current prompt and calculating the next output. This ignores some beam search shenanigans that may be useful in practice, but I don't think that's the core issue here.

There is no memory between calculating the first and second token. The first time you give ChatGPT the sequence "Once upon a" and it predicts ... (read more)
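A minimal sketch of the loop I mean (the `next_token` function below is a hypothetical stand-in for a full ChatGPT forward pass):

```python
def next_token(prompt: str) -> str:
    """Stand-in for a full forward pass of the language model."""
    return " time"                   # a real model would return its predicted continuation here

prompt = "Once upon a"
for _ in range(5):
    token = next_token(prompt)       # the model sees ONLY the prompt so far...
    prompt = prompt + token          # ...and the only "memory" is the text appended to it

print(prompt)
# Any "plan" has to be recomputed from scratch (or be re-derivable from the text) at every step.
```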

Yep, it seems to be a coincidence that only the 4-layer model learned this and the 3-layer one did not. As Neel said, I would expect the 3-layer model to learn it if you give it more width / more heads.

We also later checked networks with MLPs, and it turns out the 3-layer gelu models (same properties except for the MLPs) can do the task just fine.

Your language model game(s) are really interesting -- I've had a couple of ideas while "playing" (such as adding GPT2-small suggestions for the user to choose from, and some tokenization improvements). Are you happy to share the source / tools used to build this website, or is it not in a state you would be happy to share? Totally fine if not, I just realized that I should ask before considering building something!

Edit for future readers: Managed to do this with Heroku & flask, then switched to Streamlit -- code here, mostly written by ChatGPT: https://huggingface.co/spaces/StefanHex/simple-trafo-mech-int/tree/main
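For anyone wanting to build something similar, here is a rough Streamlit sketch of the "GPT2-small suggestions" idea (my own sketch, not the code linked above; it assumes the standard Hugging Face `gpt2` checkpoint):

```python
# pip install streamlit transformers torch ; run with `streamlit run app.py`
import streamlit as st
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = st.text_area("Prompt", "Once upon a")
if text:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]          # logits for the next token
    top = torch.topk(logits, k=5)
    st.write("GPT-2-small suggestions for the next token:")
    for token_id, logit in zip(top.indices.tolist(), top.values.tolist()):
        st.write(f"`{tokenizer.decode([token_id])}` (logit {logit:.1f})")
```

Caching the model (e.g. with Streamlit's resource cache) avoids reloading it on every interaction; the snippet above skips that to stay short.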

I really appreciated all the observations here and enjoyed reading this post, thank you for writing all this up!

Edit: Found it here! https://github.com/socketteer/loom/

Your setup looks quite useful, with all the extra information -- is it available publicly somewhere / would you be happy to share it, or is the tooling not in that state yet? (Totally fine, just thought I'd ask!)

Firstly, thank you for writing this post trying to "poke holes" in the "AGI might doom us all" hypothesis. I like to see this!

How is the belief in doom harming this community?

Actually I see this point: "believing" in "doom" can often be harmful and is usually useless.

Yes, being aware of the (great) risk is helpful for cases like "someone at Google accidentally builds an AGI" (and then hopefully turns it off since they notice and are scared).

But believing we are doomed anyway is probably not helpful. I like to think along the lines of "condition on us... (read more)

Image interpretability seems mostly so easy because humans are already really good

Thank you, this is a good point! I wonder how much of this is humans "doing the hard work" of interpreting the features. It raises the question of whether we will be able to interpret more advanced networks, especially if they evolve features that don't overlap with the way humans process inputs.

The language model idea sounds cool! I don't know language models well enough yet but I might come back to this once I get to work on transformers.

I think I found the problem: Omega is unable to predict your action in this scenario, i.e. the assumption "Omega is good at predicting your behaviour" is wrong / impossible / inconsistent.

Consider a day where Omicron (randomly) chose a prime number (Omega knows this). Now an EDT agent is on their way to the room with the boxes, and Omega has to put a prime or non-prime (composite) number into the box, predicting EDT's action.

If Omega makes X prime (i.e. the numbers coincide) then EDT two-boxes and therefore Omega has failed in predicting.

If Omega makes X non-prime (i.e. nu... (read more)

6 So8res 2y
If the agent is EDT and Omicron chooses a prime number, then Omega has to choose a different prime number. Fortunately, for every prime number there exists a distinct prime number. EDT's policy is not "two-box if both numbers are prime or both numbers are composite", it's "two-box if both numbers are equal". EDT can't (by hypothesis) figure out in the allotted time whether the number in the box (or the number that Omicron chose) is prime. (It can readily verify the equality of the two numbers, though, and this equality is what causes it -- erroneously, in my view -- to believe it has control over whether it gets paid by Omicron.)

This scenario seems impossible, as in contradictory / not self-consistent. I cannot say exactly why it breaks, but at least the two statements here seem to be inconsistent:

today they [Omicron] happen to have selected the number X

and

[Omega puts] a prime number in that box iff they predicted you will take only the big box

Both of these statements have implications for X and cannot both always be true. The number cannot both be random and be chosen by Omega/you, can it?

From another angle, the statement

FDT will always see a prime number

demonstra... (read more)

3 Oskar Mathiasen 2y
The fact that the 2 numbers are equal is not always true; it is randomly true on this day.
1 StefanHex 2y
I think I found the problem: Omega is unable to predict your action in this scenario, i.e. the assumption "Omega is good at predicting your behaviour" is wrong / impossible / inconsistent.

Consider a day where Omicron (randomly) chose a prime number (Omega knows this). Now an EDT agent is on their way to the room with the boxes, and Omega has to put a prime or non-prime (composite) number into the box, predicting EDT's action.

If Omega makes X prime (i.e. the numbers coincide) then EDT two-boxes and therefore Omega has failed in predicting.

If Omega makes X non-prime (i.e. the numbers don't coincide) then EDT one-boxes and therefore Omega has failed in predicting.

Edit: To clarify, EDT's policy is to two-box if Omega's and Omicron's numbers coincide, and one-box if they don't.

Nice argument! My main caveats are

* Does training scale linearly? Does it take just twice as much time to get someone to 4 bits (one in 16, roughly one per school class) and then from 4 to 8 bits (one in 256)?

* Can we train everything? How much of, e.g., math skill is genetic? I think there is research on this.

* Skills are probably quite highly correlated, especially when it comes to skills you want in the same job. What about computer skills / programming and maths skills / science -- are they inherently correlated or is it just because the same people need both? [Edit: See point made by Gunnar_Zarncke above, better argument on this]
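For reference, a quick back-of-the-envelope for what those bit counts mean as population fractions (just the arithmetic, nothing specific to the post):

```python
for bits in [4, 8]:
    print(f"{bits} bits of selection = top 1 in {2 ** bits} = {2 ** -bits:.1%} of the population")
# 4 bits = 1 in 16 (~6%), 8 bits = 1 in 256 (~0.4%)
```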

2 johnswentworth 2y
This is a good point. The exponential -> linear argument is mainly for independent skills: if they're uncorrelated in the population then they should multiply for selection; if they're independently trained then they should add for training. (And note that these are not quite the same notion of "independent", although they're probably related.) It's potentially different if we're thinking about going from 90th to 95th percentile vs 50th to 75th percentile on one axis. (I'll talk about the other two points in response to Gunnar's comment.)

That is a very broad description - are you talking about locating Fast Radio Bursts? I would be very surprised if that was easily possible.

Background: Astronomy/Cosmology PhD student

2 [anonymous] 3y
I'm afraid it actually only works for narrow band radio signals of potentially technological origin in the galactic disk. I will send more via p.m.