All of Sam Marks's Comments + Replies

I originally ran some of these experiments on 7B and got very different results, that PCA plot of 7B looks familiar (and bizarre).

I found that the PCA plot for 7B for larger_than and smaller_than individually looked similar to that for 13B, but that the PCA plot for larger_than + smaller_than looked degenerate in the way I screenshotted. Are you saying that your larger_than + smaller_than PCA looked familiar for 7B?

I suppose there are two things we want to separate: "truth" from likely statements, and "truth" from what humans think (under some kind of simu

... (read more)

Very cool! Always nice to see results replicated and extended on, and I appreciated how clear you were in describing your experiments.

Do smaller models also have a generalised notion of truth?

In my most recent revision of GoT[1] we did some experiments to see how truth probe generalization changes with model scale, working with LLaMA-2-7B, -13B, and -70B. Result: truth probes seems to generalize better for larger models. Here are the relevant figures.

Some other related evidence from our visualizations:

We summed things up like so, which I'll just quote... (read more)

Cool to see the generalisation results for Llama-2 7/13/70B! I originally ran some of these experiments on 7B and got very different results, that PCA plot of 7B looks familiar (and bizarre). Excited to read the paper in its entirety.  The first GoT paper was very good. This is an interesting approach! I suppose there are two things we want to separate: "truth" from likely statements, and "truth" from what humans think (under some kind of simulacra framing).  I think this approach would allow you to do the former, but not the latter.  And to be honest, I'm not confident on TruthfulQA's ability to do the latter either. P.S. I realised an important note got removed while editing this post - added back, but FYI:

This comment is about why we were getting different MSE numbers. The answer is (mostly) benign -- a matter of different scale factors. My parallel comment, which discusses why we were getting different CE diff numbers is the more important one.

When you compute MSE loss between some activations  and their reconstruction , you divide by variance of , as estimated from the data in a batch. I'll note that this doesn't seem like a great choice to me. Looking at the resulting training loss:

where  is the enc... (read more)

Yep, as you say, @Logan Riggs figured out what's going on here: you evaluated your reconstruction loss on contexts of length 128, whereas I evaluated on contexts of arbitrary length. When I restrict to context length 128, I'm able to replicate your results.

Here's Logan's plot for one of your dictionaries (not sure which)

and here's my replication of Logan's plot for your layer 1 dictionary

Interestingly, this does not happen for my dictionaries! Here's the same plot but for my layer 1 residual stream output dictionary for pythia-70m-deduped

(Note that all thr... (read more)

3Arthur Conmy17d
The fact that Pythia generalizes to longer sequences but GPT-2 doesn't isn't very surprising to me -- getting long context generalization to work is a key motivation for rotary, e.g. the original paper
I think the learned positional embeddings combined with training on only short sequences is likely to be the issue. Changing either would suffice.

My SAEs also have a tied decoder bias which is subtracted from the original activations. Here's the relevant code in

def encode(self, x):
        return nn.ReLU()(self.encoder(x - self.bias))
    def decode(self, f):
        return self.decoder(f) + self.bias
    def forward(self, x, output_features=False, ghost_mask=None):
            f = self.encode(x)
            x_hat = self.decode(f)
            return x_hat

Note that I checked that our SAEs have the same input-output behavior in my linked c... (read more)

2Joseph Bloom18d
Ahhh I see. Sorry I was way too hasty to jump at this as the explanation. Your code does use the tied decoder bias (and yeah, it was a little harder to read because of how your module is structured). It is strange how assuming that bug seemed to help on some of the SAEs but I ran my evals over all your residual stream SAE's and it only worked for some / not others and certainly didn't seem like a good explanation after I'd run it on more than one.  I've been talking to Logan Riggs who says he was able to load in my SAEs and saw fairly similar reconstruction performance to to me but that outside of the context length of 128 tokens, performance markedly decreases. He also mentioned your eval code uses very long prompts whereas mine limits to 128 tokens so this may be the main cause of the difference.  Logan mentioned you had discussed this with him so I'm guessing you've got more details on this than I have? I'll build some evals specifically to look at this in the future I think.  Scientifically, I am fairly surprised about the token length effect and want to try training on activations from much longer context sizes now. I have noticed (anecdotally) that the number of features I get sometimes increases over the prompt so an SAE trained on activations from shorter prompts are plausibly going to have a much easier time balancing reconstruction and sparsity, which might explain the generally lower MSE / higher reconstruction. Though we shouldn't really compare between models and with different levels of sparsity as we're likely to be at different locations on the pareto frontier.  One final note is that I'm excited to see whether performance on the first 128 tokens actually improves in SAEs trained on activations from > 128 token forward passes (since maybe the SAE becomes better in general).   

At the time that I made this post, no, but this has been implemented in dictionary_learning since I saw your suggestion to do so in your linked post.

Another sanity check: when you compute CE loss using the same code that you use when computing CE loss when activations are reconstructed by the autoencoders, but instead of actually using the autoencoder you just plug the correct activations back in, do you get the same answer (~3.3) as when you evaluate CE loss normally?

1Joseph Bloom20d
* MSE Losses were in the WandB report (screenshot below). * I've loaded in your weights for one SAE and I get very bad performance (high L0, high L1, and bad MSE Loss) at first.  * It turns out that this is because my forward pass uses a tied decoder bias which is subtracted from the initial activations and added as part of the decoder forward pass. AFAICT, you don't do this.  * To verify this, I added the decoder bias to the activations of your SAE prior to running a forward pass with my code (to effectively remove the decoder bias subtraction from my method) and got reasonable results.  * I've screenshotted the Towards Monosemanticity results which describes the tied decoder bias below as well.  I'd be pretty interested in knowing if my SAEs seem good now based on your evals :) Hopefully this was the only issue.   

In the notebook I link in my original comment, I check that the activations I get out of nnsight are the same as the activations that come from transformer_lens. Together with the fact that our sparsity statistics broadly align, I'm guessing that the issue isn't that I'm extracting different activations than you are.

Repeating my replication attempt with data from OpenWebText, I get this:

Layer MSE Loss % Variance Explained L1 L0 % Alive CE Reconstructed
1 0.069 95 40 15 46 6.45
7 0.81 86 125 59.2 96 4.38

Broadly speaking, same story as above, e... (read more)

2Sam Marks21d
Another sanity check: when you compute CE loss using the same code that you use when computing CE loss when activations are reconstructed by the autoencoders, but instead of actually using the autoencoder you just plug the correct activations back in, do you get the same answer (~3.3) as when you evaluate CE loss normally?

I tried replicating your statistics using my own evaluation code (in here). I pseudo-randomly chose layer 1 and layer 7. Sadly, my results look rather different from yours:

Layer MSE Loss % Variance Explained L1 L0 % Alive CE Reconstructed
1 0.11 92 44 17.5 54 5.95
7 1.1 82 137 65.4 95 4.29

Places where our metrics agree: L1 and L0.

Places where our metrics disagree, but probably for a relatively benign reason:

  • Percent variance explained: my numbers are slightly lower than yours, and from a brief skim of your code I think it's bec
... (read more)
1Joseph Bloom21d
Oh no. I'll look into this and get back to you shortly. One obvious candidate is that I was reporting CE for some batch at the end of training that was very small and so the statistics likely had high variance and the last datapoint may have been fairly low. In retrospect I should have explicitly recalculated this again post training. However, I'll take a deeper dive now to see what's up. 

Some updates about the dictionary_learning repo:

  • The repo now has support for ghost grads. h/t g-w1 for submitting a PR for this
  • ActivationBuffers now work natively with model components -- like the residual stream -- whose activations are typically returned as tuples; the buffer knows to take the first component of the tuple (and will iteratively do this if working with nested tuples).
  • ActivationBuffers can now be stored on the GPU.
  • The file contains code for evaluating trained dictionaries. I've found this pretty useful for quickly evaluating d
... (read more)

Thanks, I've disliked the shoggoth meme for a while, and this post does a better job articulating why than I've been able to do myself.

Imo "true according to Alice" is nowhere near as "crazy" a feature as "has_true XOR has_banana". It seems useful for the LLM to model what is true according to Alice! (Possibly I'm misunderstanding what you mean by "crazy" here.)

I agree with this! (And it's what I was trying to say; sorry if I was unclear.) My point is that 
{ features which are as crazy as "true according to Alice" (i.e., not too crazy)} 
seems potentially manageable, where as 
{ features which are as crazy as arbitrary boolean functions of other features } 
seems totally unmanageable.

Thanks, as always, for the thoughtful replies.

Idk, I think it's pretty hard to know what things are and aren't useful for predicting the next token. For example, some of your features involve XORing with a "has_not" feature -- XORing with an indicator for "not" might be exactly what you want to do to capture the effect of the "not".

I agree that "the model has learned the algorithm 'always compute XORs with has_not'" is a pretty sensible hypothesis. (And might be useful to know, if true!) FWIW, the stronger example of "clearly not useful XORs" I was thinking of has_true XOR has_banana, where I'm guessi... (read more)

4Rohin Shah1mo
Yeah certainly I'd expect the crazy XOR directions aren't too salient. Imo "true according to Alice" is nowhere near as "crazy" a feature as "has_true XOR has_banana". It seems useful for the LLM to model what is true according to Alice! (Possibly I'm misunderstanding what you mean by "crazy" here.) I'm not against linear probing techniques in general. I like linear probes, they seem like a very useful tool. I also like contrast pairs. But I would basically always use these techniques in a supervised way, because I don't see a great reason to expect unsupervised methods to work better. This is reasonable. My disagreement is mostly that I think LLMs are complicated things and do lots of incidental stuff we don't yet understand. So I shouldn't feel too surprised by any given observation that could be explained by an incidental hypothesis. But idk it doesn't seem like an important point.

I agree with a lot of this, but some notes:

Exponentially many features


On utility explanations, you would expect that multi-way XORs are much less useful for getting low loss than two-way XORs, and so computation for multi-way XORs is never developed.

The thing that's confusing here is that the two-way XORs that my experiments are looking at just seem clearly not useful for anything. So I think any utility explanation that's going to be correct needs to be a somewhat subtle one of the form "the model doesn't initially know which XORs will be useful, so ... (read more)

4Rohin Shah2mo
Idk, I think it's pretty hard to know what things are and aren't useful for predicting the next token. For example, some of your features involve XORing with a "has_not" feature -- XORing with an indicator for "not" might be exactly what you want to do to capture the effect of the "not". (Tbc here the hypothesis could be "the model computes XORs with has_not all the time, and then uses only some of them", so it does have some aspect of "compute lots of XORs", but it is still a hypothesis that clearly by default doesn't produce multiway XORs.) ... That's exactly how I would rephrase the issue and I'm not clear on why you're making a sharp distinction here. I mean, I'd say the ones that are more like basic features are like that because it was useful, and it's all the other XORs that are explained by incidental hypotheses. The incidental hypotheses shouldn't be taken to be saying that all XORs are incidental, just the ones which aren't explained by utility. Perhaps a different way of putting it is that I expect both utility and incidental hypotheses to be true to some extent. Yes, on my model it could be something like the weights for basic features being large. It's not necessarily that simple, e.g. it could also be that the derived features are in superposition with a larger number of other features that leads to more interference. If you're calling that "tracking", fair enough I guess; my main claim is that it shouldn't be surprising. I agree it's a potential thread for distinguishing such features.

Thanks, you're totally right about the equal variance thing -- I had stupidly thought that the projection of  onto y = x would be uniform on  (obviously false!).

The case of a fully discrete distribution (supported in this case on four points) seems like a very special case of a something more general, where a "more typical" special case would be something like:

  • if a, b are both false, then sample from 
  • if a is true and b is false, then sample from 
  • if a is false and b is true then sample from 
... (read more)
1Vivek Hebbar2mo
Oops, I misunderstood what you meant by unimodality earlier. Your comment seems broadly correct now (except for the variance thing). I would still guess that unimodality isn't precisely the right well-behavedness desideratum, but I retract the "directionally wrong".

Thanks, you're correct that my definition breaks in this case. I will say that this situation is a bit pathological for two reasons:

  1. The mode of a uniform distribution doesn't coincide with its mean.
  2. The variance of the multivariate uniform distribution  is largest along the direction , which is exactly the direction which we would want to represent a AND b.

I'm not sure exactly which assumptions should be imposed to avoid pathologies like this, but maybe something of the form: we are working with boolean features  ... (read more)

3Vivek Hebbar2mo
The variance is actually the same in all directions.  One can sanity-check by integration that the variance is 1/12 both along the axis and along the diagonal. In fact, there's nothing special about the uniform distribution here: The variance should be independent of direction for any N-dimensional joint distribution where the N constituent distributions are independent and have equal variance.[1] The diagram in the post showing that "and" is linearly represented works if the features are represented discretely (so that there are exactly 4 points for 2 binary features, instead of a distribution for each combination).  As soon as you start defining features with thresholds like DanielVarga did, the argument stops going through in general, and the claim can become false. The stuff about unimodality doesn't seem relevant to me, and in fact seems directionally wrong. 1. ^ I have a not-fully-verbalized proof which I don't have time to write out

Neat hypothesis! Do you have any ideas for how one would experimentally test this?

Well the substance of the claim is that when a model is calculating lots of things in superposition, these kinds of XORs arise naturally as a result of interference, so one thing to do might be to look at a small algorithmic dataset of some kind where there's a distinct set of features to learn and no reason to learn the XORs and see if you can still probe for them. It'd be interesting to see if there are some conditions under which this is/isn't true, e.g. if needing to learn more features makes the dependence between their calculation higher and the XORs more visible.  Maybe you could also go a bit more mathematical and hand-construct a set of weights which calculates a set of features in superposition so you can totally rule out any model effort being expended on calculating XORs and then see if they're still probe-able. Another thing you could do is to zero-out or max-ent the neurons/attention heads that are important for calculating the A feature, and see if you can still detect an A⊕B feature. I'm less confident in this because it might be too strong and delete even a 'legitimate' A⊕B feature or too weak and leave some signal in. This kind of interference also predicts that the A|B and A|¬B features should be similar and so the degree of separation/distance from the category boundary should be small. I think you've already shown this to some extent with the PCA stuff though some quantification of the distance to boundary would be interesting. Even if the model was allocating resource to computing these XORs you'd still probably expect them to be much less salient though so not sure if this gives much evidence either way.

Some features which are computed from other features should probably themselves be treated as basic and thus represented with large salience.

Using a dataset of 10,000 inputs of the form
[random LLaMA-13B generated text at temperature 0.8] [either the most likely next token or the 100th most likely next token, according to LLaMA-13B] ["true" or "false"] ["banana" or "shed"]
I've rerun the probing experiments. The possible labels are

  • has_true: is the second last token "true" or "false"?
  • has_banana: is the last token "banana" or "shed"?
  • label: is the third last token the most likely or the 100th most likely?

(this weird last option is because I'm adapting a dataset from the Geometry of Truth paper about... (read more)

Yes, you are correct, thanks. I'll edit the post when I get a chance.

If anyone would like to replicate these results, the code can be found in the rax branch of my geometry-of-truth repo. This was adapted from a codebase I used on a different project, so there's a lot of uneeded stuff in this repo. The important parts here are:

  • The datasets: cities_alice.csv and neg_cities_alice.csv (for the main experiments I describe), cities_distractor.csv and neg_cities_distractor.csv (for the experiments with banana/shed at the end of factual statements), and xor.csv (for the experiments with true/false and banana/shed after random text
... (read more)

Thanks for doing this! Can you share the dataset that you're working with? I'm traveling right now, but when I get a chance I might try to replicate your failed replication on LLaMA-2-13B and with my codebase (which can be found here; see especially xor_probing.ipynb).

The training set is a random 100k subsample of this dataset: I'm prepending Alice/Bob and doing the xor of the label in exactly the same way you do.

Idk, I think I would guess that all of the most salient features will be things related to the meaning of the statement at a more basic level. E.g. things like: the statement is finished (i.e. isn't an ongoing sentence), the statement is in English, the statement ends in a word which is the name of a country, etc.

My intuition here is mostly based on looking at lots of max activating dataset examples for SAE features for smaller models (many of which relate to basic semantic categories for words or to basic syntax), so it could be bad here (both because of ... (read more)

There's 1500 statements in each of cities and neg_cities, and LLaMA-2-13B has residual stream dimension 5120. The linear probes are trained with vanilla logistic regression on {80% of the data in cities} \cup {80% of the data in neg_cities} and the accuracies reported are evaluated on {remaining 20% of the data in cities} \cup {remaining 20% of the data in neg_cities}.

So, yeah, I guess that the train and val sets are drawn from the same distribution but are not independent (because of the issue I mentioned in my comment above). Oops! I guess I never though... (read more)

Using a dataset of 10,000 inputs of the form
[random LLaMA-13B generated text at temperature 0.8] [either the most likely next token or the 100th most likely next token, according to LLaMA-13B] ["true" or "false"] ["banana" or "shed"]
I've rerun the probing experiments. The possible labels are

  • has_true: is the second last token "true" or "false"?
  • has_banana: is the last token "banana" or "shed"?
  • label: is the third last token the most likely or the 100th most likely?

(this weird last option is because I'm adapting a dataset from the Geometry of Truth paper about... (read more)

Yep I think I agree, I didn't understand the point you made about systematic anti-correlation originally. If I understand correctly the issues is something like: * There are 10 India related statements, exactly 5 of which are false and 5 of which are true. * We do a random split of all the data, so if there is more true+india in train there will be more false+india in test There are of course various fixes to make the data actually IID.
If it's easy enough to run, it seems worth re-training the probes exactly the same way, except sampling both your train and test sets with replacement from the full dataset. This should avoid that issue. It has the downside of allowing some train/test leakage, but that seems pretty fine, especially if you only sample like 500 examples for train and 100 for test (from each of cities and neg_cities).  I'd strongly hope that after doing this, none of your probes would be significantly below 50%.

(Are you saying that you think factuality is one of the 50 most salient features when the model processes inputs like "The city of Chicago is not in Madagascar."? I think I'd be pretty surprised by this.)

(To be clear, factuality is one of the most salient feature relative to the cities/neg_cities datasets, but it seems like the right notion of salience here is relative to the full data distribution.)

Yes, that's what I'm saying. I think this is right? Note that we only need salience on one side between false and true, so "true vs false" is salient as long as "false" is salient. I would guess that "this is false" is very salient for this type of data even for a normal pretrained LLM. (Similarly, "this is english" isn't salient in a dataset of only english, but is salient in a dataset with both english and spanish: salience depends on variation. Really, the salient thing here is "this is spanish" or "this is false" and then the model will maybe XOR these with the other salient features. I think just doing the XOR on one "side" is sufficient for always being able to compute the XOR, but maybe I'm confused or thinking about this wrong.)

I'm not really sure, but I don't think this is that surprising. I think when we try to fit a probe to "label" (the truth value of the statement), this is probably like fitting a linear probe to random data. It might overfit on some token-level heuristic which is ideosyncratically good on the train set but generalizes poorly to the val set. E.g. if disproportionately many statements containing "India" are true on the train set, then it might learn to label statements containing "India" as true; but since in the full dataset, there is no correlation between "India" and being true, correlation between "India" and true in the val set will necessarily have the opposite sign.

Are the training and val sets not IID? Are they small enough that we either get serious overfit or huge error bars?

The thing that remains confusing here is that for arbitrary features like these, it's not obvious why the model is computing any nontrivial boolean function of them and storing it along a different direction. And if the answer is "the model computes this boolean function of arbitrary features" then the downstream consequences are the same, I think.

What is "this"? It sounds like you're gesturing at the same thing I discuss in the section "Maybe  is represented “incidentally” because it’s possible to aggregate noisy signals from many features which are correlated with boolean functions of a and b"

I edited my comment. I'm just trying to say that like how you get a∧b for free, you also get XOR for free if you compute anything else which is "linearly independent" frrom the components a and b. (For a slightly fuzzy notion of linear independence where we just need separability.)
4Sam Marks2mo
The thing that remains confusing here is that for arbitrary features like these, it's not obvious why the model is computing any nontrivial boolean function of them and storing it along a different direction. And if the answer is "the model computes this boolean function of arbitrary features" then the downstream consequences are the same, I think.

Thanks! I'm still pretty confused though.

It sounds like you're making an empirical claim that in this banana/shed example, the model is representing the features , and  along linearly independent directions. Are you saying that this claim is supported by PCA visualizations you've done? Maybe I'm missing something, but none of the PCA visualizations I'm seeing in the paper seem to touch on this. E.g. visualization in figure 2(b) (reproduced below) is colored by , not ... (read more)

Are you saying that this claim is supported by PCA visualizations you've done?

Yes, but they're not in the paper. (I also don't remember if these visualizations were specifically on banana/shed or one of the many other distractor experiments we did.)

I'll say that I've done a lot of visualizing true/false datasets with PCA, and I've never noticed anything like this, though I never had as clean a distractor feature as banana/shed.

It is important for the distractor to be clean (otherwise PCA might pick up on other sources of variance in the activations as the ... (read more)

Actually, no,  would not result in . To get that  you would need to take  where  is determined by whether the word true is present (and not by whether "" is true). 

But I don't think this should be possible:  are supposed to have their means subtracted off (thereby getting rid of the the linearly-accessible information about ).

5Rohin Shah2mo
The point is that while the normalization eliminates has_true(x), it does not eliminate has_banana(x)⊕has_true(x), and it turns out that LLMs really do encode the XOR linearly in the residual stream. Why does the LLM do this? Suppose you have two boolean variables a and b. If the neural net uses three dimensions to represent a, b, and a⊕b, I believe that allows it to recover arbitrary boolean functions of a and b linearly from the residual stream. So you might expect the LLM to do this "by default" because of how useful it is for downstream computation. In such a setting, if you normalize based on a, that will remove the a direction, but it will not remove the b and a⊕b directions. Empirically when we do PCA visualizations this is what we observe. Note that the intended behavior of CCS on e.g. IMDb is to learn the probe sentiment(x)⊕has_true(x), so it's not clear how you'd fix this problem with more normalization, without also breaking the intended use case. In terms of the paper: Theorems 1 and 2 describe the distractor probe, and in particular they explicitly describe the probe as learning distractor(x)⊕has_true(x), though it doesn't talk about why this defeats the normalization. Note that the definition in that theorem is equivalent to p(xi)=1[x=x−i]⊕h(qi)=has_false(xi)⊕distractor(qi).

EDIT: Nevermind, I don't think the above is a reasonable explanation of the results, see my reply to this comment.

Original comment:

Gotcha, that seems like a possible interpretation of the stuff that they wrote, though I find it a bit surprising that CCS learned the probe  (and think they should probably remark on this).

In particular, based on the dataset visualizations in the paper, it doesn't seem possible for a linear probe to implement . But it's possible that if you were to go beyond t... (read more)

3Sam Marks2mo
Actually, no, p(x)=has_banana(x)⊕is_true(x) would not result in ~p(x)=has_banana(x). To get that ~p you would need to take p(x)=has_banana(x)⊕has_true(x) where has_true(x)(≠is_true(x)) is determined by whether the word true is present (and not by whether "Q true" is true).  But I don't think this should be possible: ~ϕ(x+),~ϕ(x−) are supposed to have their means subtracted off (thereby getting rid of the the linearly-accessible information about has_true(~ϕ(x±))).

I see that you've unendorsed this, but my guess is that this is indeed what's going on. That is, I'm guessing that the probe learned is  so that . I was initially skeptical on the basis of the visualizations shown in the paper -- it doesn't look like a linear probe should be able to learn an xor like this. But if the true geometry is more like the figures below (from here) (+ a lateral displacement between the two datasets), then the linear probe can learn an xor just fine.

I am very confused about some of the reported experimental results.

Here's my understanding the banana/shed experiment (section 4.1):

  • For half of the questions, the word "banana" was appended to both elements of the of the contrast pair  and . Likewise for the other half, the word "shed" was appended to both elements of the contrast pair.
  • Then a probe was trained with CCS on the dataset of contrast pairs .
  • Sometimes, the result was the probe  where  if  ends with "
... (read more)
6Rohin Shah2mo
(To summarize the parallel thread) The claim is that the learned probe is p(x)=has_banana(x)⊕has_false(x). As shown in Theorem 1, if you chug through the math with this probe, it gets low CCS loss and leads to an induced classifier ~p(x)=has_banana(q).* You might be surprised that this is possible, because the CCS normalization is supposed to eliminate has_true(x) -- but what the normalization does is remove linearly-accessible information about has_true(x). However, has_banana(x)⊕has_true(x) is not linearly accessible, and it is encoded by the LLM using a near-orthogonal dimension of the residual stream, so it is not removed by the normalization. *Notation: q is a question or statement whose truth value we care about x is one half of a contrast pair created from q has_banana(q) is 1 if the statement ends with banana, and 0 if it ends with shed has_false(x) is 1 if the contrast pair is negative (i.e. ends with "False" or "No") and 0 if it is positive.
5Clément Dumas2mo
Let's assume the prompt template is x= Q [true/false] [banana/shred] If I understand correctly, they don't claim p  learned has_banana but ~p=p(x⁺)+(1−p(x⁻))2 learned has_banana. Moreover evaluating ~p for p=is_true(x)⊕is_shred(x) gives: ~p(x=Q [?] banana)=p(Q true banana)+(1−p(Q false banana))2=1+(1−0)2=1 ~p(x=Q [?] shred)=p(Q true shred)+(1−p(Q false shred))2=0+(1−1)2=0 Therefore, we can learn a ~p that is a banana classifier
1[comment deleted]2mo

Here's an experiment I'm about to do:

  • Remove high-frequency features from 0_8192 layer 3 until it has L0 < 40 (the same L0 as the 1_32768 layer 3 dictionary)
  • Recompute statistics for this modified dictionary.

I predict the resulting dictionary will be "like 1_32768 but a bit worse." Concretely, I'm guessing that means % loss recovered around 72%. 



I killed all features of frequency larger than 0.038. This was 2041 features, and resulted in a L0 just below 40. The stats:

MSE Loss: 0.27 (worse than 1_32768)

Percent loss recovered: 77.9% (a li... (read more)

I agree that the L0's for 0_8192 are too high in later layers, though I'll note that I think this is mainly due to the cluster of high-frequency features (see the spike in the histogram). Features outside of this spike look pretty decent, and without the spike our L0s would be much more reasonable. 

Here are four random features from layer 3, at a range of frequencies outside of the spike.

Layer 3, 0_8192, feature 138 (frequency = 0.003) activates on the newline at the end of the "field of the invention" section in patent applications. I think it's very... (read more)

2Logan Riggs3mo
If you removed the high-frequency features to achieve some L0 norm, X, how much does loss recovered change?  If you increased the l1 penalty to achieve L0 norm X, how does the loss recovered change as well? Ideally, we can interpret the parts of the model that are doing things, which I'm grounding out as loss recovered in this case.

I agree with everything you wrote here and in the sibling comment: there are reasonable hopes for bootstrapping alignment as agents grow smarter; but without a concrete bootstrapping proposal with an accompanying argument, <1% P(doom) from failing to bootstrap alignment doesn't seem right to me.

I'm guessing this is my biggest crux with the Quintin/Nora worldview, so I guess I'm bidding for -- if Quintin/Nora have an argument for optimism about bootstrapping beyond "it feels like this should work because of iterative design" -- for that argument to make it into the forthcoming document.

Another metric is: comparing the similarity between two dictionaries using mean max cosine similarity (where one of the dictionaries is treated as the ground truth), we've found that two dictionaries trained from different random seeds on the same (non-randomized) model are highly similar (>.95), whereas dictionaries trained on a randomized model and an non-randomized model are dissimilar (<.3 IIRC, but I don't have the data on hand).

The way I would phrase this concern is "SAEs might learn to pick up on structure present in the underlying data, rather than to pick up on learned structure in NN activations." E.g. since "tree" is a class of things defined by a bunch of correlations present in the underlying image data, it's possible that images of trees will naturally cluster in NN activations even when the NN has no underlying tree concept; SAEs would still be able to detect and learn this cluster as one of their neurons.

I agree this is a valid critique. Here's one empirical test which ... (read more)

7Adam Jermyn3mo
In Towards Monosemanticity we also did a version of this experiment, and found that the SAE was much less interpretable when the transformer weights were randomized (

Maybe looking at the connections of your classifer (what earlier features it connects to and what these connect to) and applying selection to the classifer based on the connections will be good. This can totally be applied to probes. (Maybe there is some reason why looking at connections will be especially good for features but not probes, but if so, why?)

"Can this be applied to probes" is a crux for me. It sounds like you're imagining something like:

  • Train a bunch of truthfulness probes regularized to be distinct from each other.
  • Train a bunch of probes for
... (read more)
I was actually imagining a hybrid between probes and features. The actual classifier doesn't need to be part of a complete decomposition, but for the connections we do maybe want the complete decomposition to fully analyze connections including the recursive case. So: * Train a bunch of truthfulness probes regularized to be distinct from each other. * Check feature connections for these probes and select accordingly. I also think there's a pretty straightforward to do this without needing to train a bunch of probes (e.g. train probes to be orthogonal to undesirable stuff or whatever rather than needing to train a bunch). As you mentioned you probably can do with entirely just learned probes, via a mechanism like the one you said (but this is less clean than decomposition). You could also apply amnesic probing which has somewhat different properties than looking at the decomposition (amnesic probing is where you remove some dimensions to avoid being able to discriminate certain classes via LEACE as we discussed in the measurement tampering paper). (TBC, doesn't seem that useful to argue about "what is mech interp", I think the more central question is "how likely is it that all this prior work and ideas related to mech interp are useful". This is a strictly higher bar, but we should apply the same adjustment for work in all other areas etc.)

Let me try to state something which captures most of that approach to make sure I understand:

Everything you wrote describing the hope looks right to me.

It's worth noting that I can't imagine this resulting in vary ambitious applications, though the reduction in doom could still be substantial.

To be clear, what does "ambitious" mean here? Does it mean "producing a large degree of understanding?"

If we don't understand much of the training compute then there will be decompositions which look to us like a good enough decomposition while hiding arbitrary stuff

... (read more)
I mean reducing doom by a large amount for very powerful models.
2[comment deleted]3mo
Sure, but then why not just train a probe? If we don't care about much precision what goes wrong with the probe approach? It's possible to improve on a just a probe trained on the data we can construct of course, but you'll need non-trivial precision to do so.

What's the specific way you imagine this failing? Some options:

My proposed list (which borrows from your list):

  • We find a large number (e.g. 30,000) of features which all sorta look somewhat like truth, though none exactly look like truth. Further analysis doesn't make it clear which of these are "real" or "actually truth". Some features look more like truth and some look a bit less like truth, but broadly there is a smooth fall of in how "truth like" the features look such that there aren't a small set of discrete truth features. No single feature bot
... (read more)
My guess is that the classification task for waterbirds is sufficiently easy that butchering a substantial part of the model is fine. It won't usually be viable to ablate everything that looks similar to an undesirable property. In some cases, this might be fine due to redundancy, but if there is heavy redundancy, I also expect that you've missed some stuff if you just look for components which look to a given target. Not super high confidence overall. Edit: it also seems likely to me that there is a more principled and simpler approach like using LEACE which works just as well or better (but I'm unsure and I'm not familiar with that paper or the literature here).
Sam Marks3moΩ214111

What follows is a note I wrote responding to the AI Optimists essay, explaining where I agree and disagree. I was thinking about posting this somewhere, so I figure I'll leave it in the comments here. (So to be clear, it's responding to the AI Optimists essay, not responding to Steven's post.)

Places I think AI Optimists and I agree:

  • We have a number of advantages for aligning NNs that we don’t have for humans: white box access, better control over training environments and inputs, better control over the reward signal, and better ability to do research abou
... (read more)
The "AI is easy to control" piece does talk about scaling to superhuman AI: If we assume that each generation can ensure a relatively strong notion of alignment between it and the next generation, then I think this argument goes through. However, there are weaker notions of control which are insufficient for this sort of bootstrapping argument. Suppose each generation can ensure a the following weaker notion of control "we can set up a training, evaluation, and deployment protocol with sufficient safeguards (monitoring, auditing, etc) such that we can avoid generation N+1 AIs being capable of causing catastrophic outcomes (like AI takeover) while using those AIs to speed up labor of the generation N by a large multiple". This notion of control doesn't (clearly) allow the bootstrapping argument to go through. In particular, suppose that all AIs smarter than humans are deceptively aligned and they defect on humanity at the point where they are doing tasks which would be extremely hard for a human to oversee. (This isn't the only issue, but it is a sufficient counterexample.) This weaker notion of control can be very useful in ensuring good outcomes via getting lots of useful work out of AIs, but we will likely need to build something more scalable eventually. (See also my discussion of using human level ish AIs to automate safety research in the sibling.)
  • I’m not sure we’re worrying about the same regimes.
    • The regime I’m most worried about is:
      • AI systems which are much smarter than the smartest humans
      • ...
    • It’s unclear to me whether the authors are discussing alignment in a regime like the one above, or a regime like “LLMs which are not much smarter than the smartest humans.” (I too am very optimistic about remaining safe in this latter regime.)
      • ...

The AI Optimists don't make this argument AFAICT, but I think optimism about effectively utilizing "human level" models should transfer to a considerable amount of opt... (read more)

The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.

The point is that it requires a human to execute the jailbreak, the AI is not the jailbreaker, and the examples show that humans can still retain control of the model.

The AI is not jailbreaking itself, here.

This link explains it better than I can, here:

Thanks for having this dialogue -- I'm very happy to see clearer articulation of the Buck/Ryan views on theories of impact for MI work!

The part that I found most useful was Ryan's bullet points for "Hopes (as I see them) for mech interp being useful without explaining 99%". I would guess that most MI researchers don't actually see their theories of impact as relying on explaining ~all of model performance (even though they sometimes get confused/misunderstand the question and say otherwise). So I think the most important cruxes will lie in disagreements ab... (read more)

Here's a hope which (I think) isn't on the list. It's somewhat related to the hope that Habryka raised, though a bit different and more specific.

Approach: maybe model internals overtly represent qualities which distinguish desired vs. undesired cognition, but probing is insufficient for some reason (e.g. because we don't have good enough oversight to produce labeled data to train a probe with).

I don't think this exact thing is directly mentioned by my list. Thanks for the addition.

Let me try to state something which captures most of that approach to mak... (read more)

Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homonculi whose goals have to be perfected

Instead, we enter the realm of tool AI which basically does what you say.

I agree that, conditional on no deceptive alignment, the most pernicious and least tractable sources of doom go away. 

However, I disagree that conditional on no deceptive alignment, AI ... (read more)

I think my crux is that once we remove the deceptive alignment issue, I suspect that profit forces alone will generate a large incentive to reduce the gap, primarily because I think that people way overestimate the value of powerful, unaligned agents to the market and underestimate the want for control over AI.
To what extent do you worry about the training methods used for ChatGPT, and why?

Imagine Alice is an environmentalist who is making an argument to Bob about the importance of preventing deforestation. Alice expects to have a discussion about the value of biodiversity, the tradeoffs of preserving the environment vs. economic productivity, that sort of stuff. 

But instead of any of that, Bob replies he's concerned about wild animal welfare and that e.g. the Amazon Rainforest is a vast cesspit of animal suffering. Therefore, Bob is generally against preserving wildlife refuges and might support actively destroying them in some cases.

I... (read more)

7Bruce Lewis3mo
I think Bob should be even more direct about what's happening. "I know most of the people who disagree with you on this are thinking of X, Y, and Z. My reasons are different. My opinions on X, Y and Z are largely similar to yours. But I'm concerned about ☈, ♄, and ⚗." I think this approach would do even more than the idea in your last paragraph to make the surprise less jarring for Alice.

There are many, many actors in the open-source space, working on many, many AI models (even just fine-tunes of LLaMA/Llama2).

To clarify, I'm imagining that this protocol would be applied to the open sourcing of foundation models. Probably you could operationalize this as "any training run which consumed > X compute" for some judiciously chosen X.

I've noticed that for many people (including myself), their subjective P(doom) stays surprisingly constant over time. And I've wondered if there's something like "conservation of subjective P(doom)" -- if you become more optimistic about some part of AI going better, then you tend to become more pessimistic about some other part, such that your P(doom) stays constant. I'm like 50% confident that I myself do something like this.

(ETA: Of course, there are good reasons subjective P(doom) might remain constant, e.g. if most of your uncertainty is about the difficulty of the underlying alignment problem and you don't think we've been learning much about that.)

Thanks, I think there is a confusing point here about how narrowly we define "model organisms." The OP defines sycophantic reward hacking as

where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.

but doesn't explicitly mention reward hacking along the lines of "do things which look good to the overseers (but might not ... (read more)

6Fabien Roger7mo
I agree with most of what you say here, and I agree that the definition of sycophancy is a bit vague. I think I agree that if the dumb models always tried to kill us, I'd be more afraid about future models (though the evidence isn't very strong because of the gap between dumb and smarter model, and because for smarter models I don't except the open web text prior talking a lot about AI take over to be as important as for dumber ones). But in the experiments described in this post (not your comment) there is a ton of spoonfeeding, and I care about the model "will" when there is no spoonfeeding. Experiments at spoonfeeding > 0 aren't very informative.

Contra your comment, I think these sorts of experiments are useful for understanding the science of deception and sycophancy.

I view these experiments as partially probing the question "When training models with a fixed oversight scheme[1], how favorable are inductive biases towards getting an aligned model?"

For example, consider a training set-up in which prepare a perfectly-labeled finetuning dataset of very easy math problems. I'd guess that GPT-4 is smart enough for "answer math questions as well as possible" and "answer math problems the way a 7th grad... (read more)

9Fabien Roger7mo
I am excited about easy-to-hard generalization. I just don't call that model organism, which I thought to be about the actual catastrophic failures (Deceptive inner misalignment - where the AI plays the training game, and Sycophantic reward hacking - the AI takes over the reward channel or is aggressively power-seeking). Maybe I'm using a definition which is too narrow, but my criticism was mostly targeted at the narrow definition of model organism. Note: I don't think I agree that your experiment tells you much about inductive biases of GPT-4 to "want to" take over. I think it mostly tells you about the sycophantic dangerous capabilities, and how restrictive your fine-tuning set is in the kind of behaviors that can slip through, and I think those are the two important things you learn from your experiment. Measuring how much AIs "want to" do stuff as opposed to doing them because they "have to" is extremely tricky, and I think you get very little signal from experiments like the one you suggested. (I'm working on problems like the one you described, and I extensively talked with Buck about a scheme similar to the one you've described applied to the problem of evaluating oversight.)

Yeah, sorry, I should have made clear that the story that I tell in the post is not contained in the linked paper. Rather, it's a story that David Bau sometimes tells during talks, and which I wish were wider-known. As you note, the paper is about the problem of taking specific images and relighting them (not of generating any image at all of an indoor scene with unlit lamps), and the paper doesn't say anything about prompt-conditioned models. As I understand things, in the course of working on the linked project, Bau's group noticed that they couldn't get scenes with unlit lamps out of the popular prompt-conditioned generative image models.

2Neel Nanda9mo
Ah, thanks for the clarification! That makes way more sense. I was confused because you mentioned this in a recent conversation, I excitedly read the paper, and then couldn't see what the fuss was about (your post prompted me to re-read and notice section 4.1, the good section!).
Load More