All of Adam Jermyn's Comments + Replies

I'm guessing that the sales numbers aren't high enough to make $200k if sold at plausible markups?

The sales are at cost and don’t make money on net.

In Towards Monosemanticity we also did a version of this experiment, and found that the SAE was much less interpretable when the transformer weights were randomized (

(The results for correlations from auto-interp are less clear: they find similar correlation coefficients with and without weight randomization. However, they find that this might be due to single token features on the part of the randomized transformer and when you ignore these features (or correct in some other way I'm forgetting?), the SAE on an actual transformer indeed has higher correlation.)

Anthropic’s RSP includes evals after every 4x increase in effective compute and after every 3 months, whichever comes sooner, even if this happens during training, and the policy says that these evaluations include fine-tuning.

Do you know why 4x was picked? I understand that doing evals properly is a pretty substantial effort, but once we get up to gigantic sizes and proto-AGIs it seems like it could hide a lot. If there was a model sitting in training with 3x the train-compute of GPT4 I'd be very keen to know what it could do!

This matches my impression. At EAG London I was really stunned (and heartened!) at how many skilled people are pivoting into interpretability from non-alignment fields.

Second, the measure of “features per dimension” used by Elhage et al. (2022) might be misleading. See the paper for details of how they arrived at this quantity. But as shown in the figure above, “features per dimension” is defined as the Frobenius norm of the weight matrix before the layer divided by the number of neurons in the layer. But there is a simple sanity check that this doesn’t pass. In the case of a ReLU network without bias terms, multiplying a weight matrix by a constant factor will cause the “features per dimension” to be increased by that

...
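The scaling property the quoted passage appeals to can be checked numerically. The following is only a minimal sketch: it takes the quantity exactly as the quote words it (Frobenius norm divided by neuron count), assumes rows of the matrix correspond to neurons, and uses hypothetical helper names. If the squared norm were used instead, the factor would be c^2 rather than c.

```python
import math

def frobenius_norm(weights):
    """Square root of the sum of squared entries of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in weights for x in row))

def norm_per_neuron(weights):
    """Frobenius norm divided by the number of rows (assumed here to be neurons)."""
    return frobenius_norm(weights) / len(weights)

def scale(weights, c):
    """Multiply every entry of the matrix by the constant c."""
    return [[c * x for x in row] for row in weights]

W = [[3.0, 4.0], [0.0, 0.0]]
ratio = norm_per_neuron(scale(W, 2.0)) / norm_per_neuron(W)
```

Here `ratio` comes out equal to the scaling constant, which is the behaviour the sanity check is about.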
Thanks, +1 to the clarification value of this comment. I appreciate it. I did not have the tied weights in mind when writing this. 

I think there’s tons of low-hanging fruit in toy model interpretability, and I expect at least some lessons from at least some such projects to generalize. A lot of the questions I’m excited about in interpretability are fundamentally accessible in toy models, like “how do models trade off interference and representational capacity?”, “what priors do MLPs have over different hypotheses about the data distribution?”, etc.

A thing I really like about the approach in this paper is that it makes use of a lot more of the model's knowledge of human values than traditional RLHF approaches. Pretrained LLMs already know a ton of what humans say about human values, and this seems like a much more direct way to point models at that knowledge than binary feedback on samples.

How does this correctness check work?

I usually think of gauge freedom as saying “there is a family of configurations that all produce the same observables”. I don’t think that gives a way to say some configurations are correct/incorrect. Rather some pairs of configurations are equivalent and some aren’t.

That said, I do think you can probably do something like the approach described to assign a label to each equivalence class of configurations and do your evolution in that label space, which avoids having to pick a gauge.

How would you classify optimization shaped like "write a program to solve the problem for you"? It's not directly searching over solutions (though the program you write might). Maybe it's a form of amortized optimization?

Separately: The optimization-type distinction clarifies a circle I've run around talking about inner optimization with many people, namely "Is optimization the same as search, or is search just one way to get optimization?" And I think this distinction gives me something to point to in saying "search is one way to get (direct) optimization, but there are other kinds of optimization".

I might be totally wrong here, but could this approach be used to train models that are more likely to be myopic (than e.g. existing RL reward functions)? I'm thinking specifically of the form of myopia that says "only care about the current epoch", which you could train for by (1) indexing epochs, (2) giving the model access to its epoch index, (3) having the reward function go negative past a certain epoch, (4) giving the model the ability to shut down. Then you could maybe make a model that only wants to run for a few epochs and then shuts off, and maybe that helps avoid cross-epoch optimization?
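To make steps (1) through (4) concrete, here is a toy sketch of what such a reward could look like. Everything in it is an assumption for illustration: the function names, the flat penalty, and the idea that shutting down simply ends the sum. It is not a claim about how any real training setup works.

```python
def epoch_reward(task_reward, epoch, cutoff_epoch, penalty=1.0):
    """Reward for one epoch: the task reward up to the cutoff, a fixed negative value after it."""
    if epoch > cutoff_epoch:
        return -penalty
    return task_reward

def total_return(task_rewards, shutdown_epoch, cutoff_epoch, penalty=1.0):
    """Sum of rewards for a model that runs epochs 0..shutdown_epoch and then shuts down."""
    return sum(
        epoch_reward(reward, epoch, cutoff_epoch, penalty)
        for epoch, reward in enumerate(task_rewards)
        if epoch <= shutdown_epoch
    )

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
stop_at_cutoff = total_return(rewards, shutdown_epoch=2, cutoff_epoch=2)
keep_running = total_return(rewards, shutdown_epoch=4, cutoff_epoch=2)
```

In this toy setting, stopping at the cutoff gives a higher return than running on, which is the incentive the comment is gesturing at.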

That's definitely a thing that can happen.

I think the surgeon can always be made ~arbitrarily powerful, and the trick is making it not too powerful/trivially powerful (in ways that e.g. preclude the model from performing well despite the surgeon's interference).

So I think the core question is: are there ways to make a sufficiently powerful surgeon which is also still defeasible by a model that does what we want?

Given that we want the surgeon to be of bounded size (if we're using a neural net implementation which seems likely to me), can it still be arbitrarily powerful? That doesn't seem obvious to me.

A trick I sometimes use, related to this post, is to ask whether my future self would like to buy back my present time at some rate. This somehow makes your point about intertemporal substitution more visceral for me, and makes it easier to say "oh yes this thing which is pricier than my current rate definitely makes sense at my plausible future rate".

In fact, it's not 100% clear that AI systems could learn to deceive and manipulate supervisors even if we deliberately tried to train them to do it. This makes it hard to even get started on things like discouraging and detecting deceptive behavior.

Plausibly we already have examples of (very weak) manipulation, in the form of models trained with RLHF saying false-but-plausible-sounding things, or lying and saying they don't know something (but happily providing that information in different contexts). [E.g. ChatGPT denies having information about how to build nukes, but will also happily tell you about different methods for Uranium isotope separation.]

Yeah. Or maybe not even to zero but it isn’t increasing.

Could it be that Chris's diagram gets recovered if the vertical scale is "total interpretable capabilities"? Like maybe tiny transformers are more interpretable in that we can understand ~all of what they're doing, but they're not doing much, so maybe it's still the case that the amount of capability we can understand has a valley and then a peak at higher capability.

As in, the ratio between (interpretable capabilities / total capabilities) still asymptotes to zero, but the number of interpretable capabilities goes up (and then maybe back down) as the models gain more capabilities?

That's a good point: it definitely pushes in the direction of making the model's internals harder to adversarially attack. I do wonder how accessible "encrypted" is here versus just "actually robust" (which is what I'm hoping for in this approach). The intuition here is that you want your model to be able to identify that a rogue thought like "kill people" is not a thing to act on, and that looks like being robust.

I haven't thought about this a lot, but "encrypted" could just mean "just beyond the capabilities of the Surgeon to identify". So the gradient could be moving in a direction away from "easily identifiable early deceptive circuits" instead of "deception", and plausibly in a way that scales with how weak the Surgeon is. Do you think we can design Surgeons that are powerful enough even at interpretable sizes to net the latter? Do surgical capabilities like this generally scale linearly?

And: having a lot of capital could be very useful in the run up to TAI, e.g. for pursuing/funding safety work.

Roughly, I think it’s hard to construct a reward signal that makes models answer questions when they know the answers and say they don’t know when they don’t know. Doing that requires that you are always able to tell what the correct answer is during training, and that’s expensive to do. (Though e.g. Anthropic seems to have made some progress here:

So indeed with cross-entropy loss I see two plateaus! Here's rank 2:

(note that I've offset the loss so that equality of Z and C is zero loss)

I have trouble getting rank 10 to find the zero-loss solution:

But the phenomenology at full rank is unchanged:

Got it, I see. I think of the two as really intertwined (e.g. a big part of my agenda at the moment is studying how biases/path-dependence in SGD affect interpretability/polysemanticity).

This is really interesting!

One question: do we need layer norm in networks? Can we get by with something simpler? My immediate reaction here is “holy cow layer norm is geometrically complicated!” followed by a desire to not use it in networks I’m hoping to interpret.

This post introduces the concept of a "cheerful price" and (through examples and counterexamples) narrows it down to a precise notion that's useful for negotiating payment. Concretely:

  1. Having "cheerful price" in your conceptual toolkit means you know you can look for the number at which you are cheerful (as opposed to "the lowest number I can get by on", "the highest number I think they'll go for", or other common strategies). If you genuinely want to ask for an amount that makes you cheerful and no more, knowing that such a number might exist at all is u
...

I found this post a delightful object-level exploration of a really weird phenomenon (the sporadic occurrence of the "tree" phenotype among plants). The most striking line for me was:

Most “fruits” or “berries” are not descended from a common “fruit” or “berry” ancestor. Citrus fruits are all derived from a common fruit, and so are apples and pears, and plums and apricots – but an apple and an orange, or a fig and a peach, do not share a fruit ancestor.

What is even going on here?!

On a meta-level my takeaway was to be a bit more humble in saying what complex/evolved/learned systems should/shouldn't be capable of/do.

Well, if you dissect them, you see they are actually nothing alike. They converged on a cool concept - if I surround my offspring with a hard protection and then wrap it in lots of bright, sweet softness, an animal will eat the shell and ingest the offspring without killing the offspring, and deposit it somewhere with fertiliser later - but the way a lemon vs. a cherry is made up is totally different. The number of offspring, their encasement, the way the fruit is structured, its shell, its number, it is a completely different thing. Like different human cultures that developed houses, without learning it from each other. But the structures don't match at all. Yet within the same structure, the result can still look super different to humans - yet within the same broader thing, plants are absolute whores. A citrus will happily fuck any other citrus, and something tasty always results. You know oranges? Oranges are what happens if you breed pomelo (a huge slightly bitter citrus) with mandarin (a small sweet citrus), and they seamlessly breed and give you awesome oranges (a medium sized balanced citrus). And then if you find the result too sweet, you can breed the orange back to the pomelo, and get grapefruits. And if you love all the citrus plants too much to decide between them, you can cut branches from your favourites, stick them into one citrus plant, and have them merge into a functional Frankenstein citrus plant that bears multiple different fruit. I love my citrus trees.

Woah, nice! Note that I didn't check rank 1 with Adam, just rank >= 2.

Erm do C and Z have to be valid normalized probabilities for this to work?

C needs to be probabilities, yeah. Z can be any vector of numbers. (You can convert C into probabilities with softmax) 

Got it, I was mostly responding to the third paragraph (insight into why SGD works, which I think is mostly an interpretability question) and should have made that clearer.

Arthur Conmy (1y, score 2):
I think the situation I'm considering in the quoted part is something like this: research is done on SGD training dynamics and researcher X finds a new way of looking at model component Y, and only certain parts of it are important for performance. So they remove that part, scale the model more, and the model is better. This to me meets the definition of "why SGD works" (the model uses the Y components to achieve low loss). I think interpretability that finds ways models represent information (especially across models) is valuable, but this feels different from "why SGD works".

(with the caveat that this is still "I tried a few times" and not any quantitative study)

It's a good caution, but I do see more bumps with Adam than with SGD across a number of random initializations.

Something like this?

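import torch

def loss(learned, target):
    # Treat both tensors as unnormalized logits and softmax each over all of its entries.
    p_target = torch.exp(target)
    p_target = p_target / torch.sum(p_target)
    p_learned = torch.exp(learned)
    p_learned = p_learned / torch.sum(p_learned)
    # Cross-entropy of the learned distribution measured against the target distribution.
    return -torch.sum(p_target * torch.log(p_learned))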

Well, I'd keep everything in log space and do the whole thing with log_sum_exp for numerical stability, but yeah.  EDIT: e.g. something like:
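The snippet that followed this edit was not captured here. As a stand-in, and not the commenter's original code, here is a minimal sketch of the log-space idea using plain Python lists so it runs without dependencies. The function names are illustrative; in PyTorch the analogous building blocks would presumably be `torch.logsumexp` and `torch.log_softmax`.

```python
import math

def log_sum_exp(values):
    """Return log(sum(exp(v))) by factoring out the maximum so exp never overflows."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_softmax(logits):
    """Turn a flat list of logits into log-probabilities."""
    lse = log_sum_exp(logits)
    return [x - lse for x in logits]

def cross_entropy(learned, target):
    """Cross-entropy between softmax(target) and softmax(learned), computed from log-probabilities."""
    log_p_target = log_softmax(target)
    log_p_learned = log_softmax(learned)
    return -sum(math.exp(lt) * ll for lt, ll in zip(log_p_target, log_p_learned))

uniform = cross_entropy([0.0, 0.0], [0.0, 0.0])
extreme = cross_entropy([1000.0, 0.0], [1000.0, 0.0])
```

The point of the design is the second call: exponentiating a logit of 1000 directly overflows in floating point, while subtracting the maximum first keeps every intermediate value finite.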

I'd be very excited to see a reproduction :-)

This problem is not neglected, and it is very unclear how any insight into why SGD works wouldn’t be directly a capabilities contribution.

I strongly disagree! AFAICT SGD works so well for capabilities that interpretability/actually understanding models/etc. is highly neglected and there's low-hanging fruit all over the place.

Arthur Conmy (1y, score 2):
To me, the label "Science of DL" is far more broad than interpretability. However, I was claiming that the general goal of Science of DL is not neglected (see my middle paragraph).

I don't think so? I think that just means you keep the incorrect initialization around while also learning the correct direction.

I agree with both of your rephrasings and I think both add useful intuition!

Regarding rank 2, I don't see any difference in behavior from rank 1 other than the "bump" in alignment that Lawrence mentioned. Here's an example:

This doesn't happen in all rank-2 cases but is relatively common. I think usually each vector grows primarily towards one or the other target. If two vectors grow towards the same target then you get this bump where one of them has to back off and align more towards a different target [at least that's my current understanding, see my reply...

I caution against over-interpreting the results of single runs -- I think there's a good chance the number of bumps varies significantly by random seed.
There's lots of ways to do this, but the obvious way is to flatten C and Z and treat them as logits. 

I don't, but here's my best guess: there's a sense in which there's competition among vectors for which learned vectors capture which parts of the target span. 

As a toy example, suppose there are two vectors,  and , such that the closest target vector to each of these at initialization is . Then both vectors might grow towards . At some point  is represented enough in the span, and it's not optimal for two vectors to both play the role of representing , so it becomes optimal for at least one of them to s...

Oh, huh, that makes a lot of sense! I'll see if I can reproduce these results. I'm not sure this explains the grokking bumps from the mod add stuff -- I'm not sure what the "competition" should be, given we see the bumps on every key frequency.

This is really interesting! One extension that comes to mind: SVD will never recover a Johnson-Lindenstrauss packing, because SVD can only return as many vectors as the rank of the relevant matrix. But you can do sparse coding to e.g. construct an overcomplete basis of vectors such that typical samples are sparse combinations of those vectors. Have you tried/considered trying something like that?

Yes, this is correct. SVD necessarily won't recover the full JL packing. Given that we don't know the extent to which the network uses the full JL capacity, then SVD might still get a reasonable fraction of the relevant directions. Also, if the network packs semantically similar vectors close to one another, then the SVD direction might also represent some kind of useful average of them.   Indeed, we are looking at sparse coding to try to construct an over complete basis, as a parallel project. Stay tuned for this.

That's not a scalar, do you mean the trace of that? If so, doesn't that just eliminate the term that causes the incorrect initialization to decay?

Sorry, I meant ⟨a,a⟩⋅⟨b,b⟩. And yes, that should eliminate the term that causes the incorrect initialization to decay. Doesn't that cause the learning to be in the correct direction from the start?

Ah that's right. Will edit to fix.

Do you have results with noisy inputs?

Nope! Do you have predictions for what noise might do here?

The negative bias lines up well with previous sparse coding implementations:

Oooo I'll definitely take a look. This looks very relevant.

Note that in that research, the negative bias has a couple of meanings/implications:

  • It should correspond to the noise level in your input channel.
  • Higher negative biases directly contribute to
...
My weak prediction is that adding low levels of noise would change the polysemantic activations, but not the monosemantic ones. Adding L1 to the loss allows the network to converge on solutions that are more monosemantic than otherwise, at the cost of some estimation error. Basically, the network is less likely to lean on polysemantic neurons to make up small errors. I think your best bet is to apply the L1 loss on the hidden layer and the output layer activations.
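For concreteness, the suggestion of adding L1 on the hidden and output activations could look roughly like the sketch below. It assumes a mean-squared-error base loss and a single weighting coefficient, neither of which is specified in the thread, and the names are made up for illustration.

```python
def l1_penalty(activations):
    """Sum of absolute values of a flat list of activations."""
    return sum(abs(a) for a in activations)

def total_loss(prediction, target, hidden, l1_weight):
    """Mean squared error plus an L1 penalty on hidden and output activations."""
    mse = sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)
    sparsity = l1_penalty(hidden) + l1_penalty(prediction)
    return mse + l1_weight * sparsity

example = total_loss(
    prediction=[1.0, 0.0],
    target=[1.0, 0.0],
    hidden=[0.5, -0.5],
    l1_weight=0.1,
)
```

Even with a perfect prediction the penalty term is nonzero, which is the "cost of some estimation error" trade-off: the optimizer is pushed towards smaller, sparser activations.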

Thanks for these thoughts!

Although it would be useful to have the plotting code as well, if that's easy to share?

Sure! I've just pushed the plot helper routines we used, as well as some examples.

I agree that N (true feature dimension) > d (observed dimension), and that sparsity will be high, but I'm uncertain whether the other part of the regime (that you don't mention here), that k (model latent dimension) > N, is likely to be true. Do you think that is likely to be the case? As an analogy, I think the intermediate feature dimensions in MLP layers i

...
I guess the recent work on Polysemanticity and Capacity seems to suggest the latter case, especially in sparser settings, given the zone where multiple features are represented polysemantically, although I can't remember if they investigate power-law feature frequencies or just uniform frequencies.

My impression is that that discussion was more about whether the empirical results (i.e. do ResNets have linear mode connectivity?) held up, rather than whether the methodology used and present in the code base could be used to find whether linear mode connectivity is present between two models (up to permutation) for a given dataset. I imagine you could take the code and easily adapt it to check for LMC between two trained models pretty quickly (it's something I'm considering trying to do as well, hence the code requests).

That would definitely be interesting to see. I guess this is kind of presupposing that the models are in different basins (which I also believe but hasn't yet been verified). I also think looking at basins and connectivity would be more interesting in the case where there was more noise, either from initialisation, inherently in the data, or by using a much lower batch size so that SGD was noisy. In this case it's less likely that the same configuration results in the same basin, but if your interventions are robust to these kinds of noise then it's a good sign.

That's cool, looking forward to seeing more detail. I think these results don't seem that related to the LTH (if I understand your explanation correctly), as LTH involves finding sparse subnetworks in dense ones. Possibly it only actually holds in models with many more parameters; I haven't seen it investigated in models that aren't overparametrised in a classical sense. I think if iterative magnitude pruning (IMP) on these problems produced much sparser subnetworks that also maintained the monosemanticity levels, then that would suggest that sparsity doesn't penalise monosemanticity (or polysem

Sorry for my confusion about something so silly, but shouldn't the following be "when 

Oh you're totally right. And k=1 should be k=d there. I'll edit in a fix.

I'm also a bit confused about why we can think of  as representing "which moment of the interference distribution we care about."

It's not precisely which moment, but as we vary  the moment(s) of interest vary monotonically.

Perhaps some of my confusion here stems from the fact that it seems to me that the optimal number of subspaces, , is an increasing fun

...

I like the distinction between implementing the results of acausal decision theories and explicitly performing the reasoning involved. That seems useful to have.

The taxes example I think is more complicated: at some scale I do think that governments have some responsiveness to their tax receipts (e.g. if there were a surprise doubling of tax receipts governments might well spend more). It's not a 1:1 relation, but there's definitely a connection.

Just to say I really enjoyed reading this post, and found it helpful as a way to get a sense of what mode collapse looks like in practice.

From the Afterword: “Note that I have presented a there-exists argument against a for-all proposition. Responses of the form “But I can think of something similar that won’t work” don’t count as counterarguments.” [This has been edited in the main text to sharpen the phrasing.]

I saw this, but I think it sets a somewhat unhelpful standard. In practice we need to make choices about which approaches are most promising, which to pursue, etc., and evidence that there is more probability mass on success in one area does feel useful. 

So, for instance, my poi...

Eric Drexler (1y, score 5):
Yes, we need to weigh the considerations you mention, and many others.

Copy-collusion must be considered carefully, and its implications will depend on both potential contexts for using copies (sometimes problematic, sometimes OK) and on potential alternatives to using copies. Copies will always “collude” in the trivial sense of giving identical, potentially deceptive responses to identical inputs, but might also offer potentially deceptive, mutually-supporting responses to differing inputs. For some purposes, a multicomponent architecture must treat multiple copies as a single source of information. (Note that solving the alignment problem is often taken to require that we be able to trust a single copy, or a system that consists of nothing but copies of a single entity — this seems strictly riskier than anything I’ve suggested.)

Some constraints can be effective even among copies:

  • Constrained communication can be applied to preclude coordination that requires new, shared knowledge.
  • Lack of access to a shared clock can be applied to preclude coordination of simultaneous or cleverly sequenced actions.
  • Systems that audit alternative behaviors by re-running processes from checkpoints can be applied to offline copies, where adaptive, multi-trial probing (by what may be highly intelligent systems!) can likely be leveraged to make some kinds of deceptions unwise.

Of course a central idea is that many roles call for systems that are not copies: proposers vs. critics, designers vs. testers, actors vs. auditors, diverse members of ensembles, and so on.

Before asking whether one approach is superior to another, we should ask whether they are competitors or complements. Multicomponent approaches for making non-collusion robust seem quite complementary to strategies for making individual systems more trustworthy.

Regarding criteria for investment, neglectedness is of course a huge consideration.

I like the framing of "how can we disrupt collusion?", but I'm a little skeptical that some of the solutions are practical. Specifically:

B2: Collusion can happen between agents that are copies of each other, who then have no reason to defect (so long as they employ a decision theory with a broad enough cartesian boundary). If we knew we could engineer systems with narrow decision theories (e.g. "Even though that's a copy of me, I'll pretend it's a different agent") I'd feel much better here, but as far as I know we don't know how to do that. I'd be excited...

Eric Drexler (1y, score 5):
From the Afterword: “Note that I have presented a there-exists argument against a for-all proposition. Responses of the form “But I can think of something similar that won’t work” don’t count as counterarguments.” [This has been edited in the main text to sharpen the phrasing.]

  • B2: Copies might collude with one another, provided they have a broad decision theory: ∃(some sets of models) among which some mechanisms for impeding collusion might not work.
  • B4: A lot of models may be deployed over the internet: ∃(some communication channels) for which one of 6 mechanisms for impeding collusion might not work (given flawed security).
  • B6: LLM's will share a lot of knowledge: ∃(high performance models for some tasks) for which one of 6 mechanisms for impeding collusion doesn’t contribute to safety.
  • C1: Large models tend to beat compositions of small models for some tasks: ∃(some tasks) where some models should be large, and narrow knowledge won’t contribute to impeding collusion.

Not too worrisome, I think, but these are all considerations important to building robust multicomponent systems.

BTW, regarding the (non-)goal oriented nature of LLMs, I highly recommend the “Simulators” post by Janus.

Oh yes you're totally right.

I think partitions can get you more orthogonality than your specific example of overlapping orthogonal sets. Take n vectors and pack them into d dimensions in two ways:

  1. A tegum product with k subspaces, giving (n/k) vectors per subspace and n^2*(1-1/k) orthogonal pairs.
  2. (n/d) sets of vectors, each internally orthogonal but each overlapping with the others, giving n*d orthogonal pairs.

If d < n*(1-1/k) the tegum product buys you more orthogonal pairs. If n > d then picking large k (so low-dimensional spaces) makes the tegum product preferred.
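The two counts above can be sanity-checked with a few lines of arithmetic. This is only a sketch under stated assumptions: it counts ordered pairs (which is what n^2*(1-1/k) corresponds to), assumes k divides n, treats vectors within the same subspace as non-orthogonal, and takes the n*d figure for the overlapping arrangement as given. The function names are made up.

```python
def tegum_orthogonal_pairs(n, k):
    """Ordered orthogonal pairs for n vectors split evenly over k mutually orthogonal subspaces."""
    per_subspace = n // k
    # Each vector is orthogonal to every vector that lives outside its own subspace.
    return n * (n - per_subspace)

def overlapping_sets_pairs(n, d):
    """The comment's estimate for n/d internally orthogonal sets of d vectors each."""
    return n * d

def brute_force_tegum(n, k):
    """Count ordered pairs of vectors assigned to different subspaces, one pair at a time."""
    labels = [i % k for i in range(n)]
    return sum(1 for i in range(n) for j in range(n) if labels[i] != labels[j])

n, k, d = 12, 3, 4
tegum = tegum_orthogonal_pairs(n, k)
overlapping = overlapping_sets_pairs(n, d)
```

With these example numbers d = 4 is below n*(1-1/k) = 8, so the tegum product comes out ahead, matching the stated condition.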

This doesn't mean there isn't some other arrangement that does better though...

Neel Nanda (1y, score 3):
Yeah, agreed that's not an optimal arrangement, that was just a proof of concept for 'non tegum things can get a lot of orthogonality'

That's good to hear! And I agree with your new intuition.

I think if you want interference terms to actually be zero you have to end up with tegum products, because that means you want orthogonal vectors and that implies disjoint subspaces. Right?

Neel Nanda (1y, score 4):
I don't think so? If you have e.g. 8 vectors arranged evenly in a 2D plane (so at 45 degrees to each other) there's a lot of orthogonality, but no tegum product. I think the key weirdness of a tegum product is that it's a partition, where every pair in different bits of the partition is orthogonal. I could totally imagine that e.g. the best way to fit 2n vectors in n-dimensional space is two sets of n orthogonal vectors, but at some arbitrary angle to each other. I can believe that tegum products are the right way to maximise the number of orthogonal pairs, though that still feels a bit weird to me. (technically, I think that the optimal way to fit kn vectors in R^n is to have n orthogonal directions and k vectors along each direction, maybe with different magnitudes - which is a tegum product. It forming 2D-3D subspaces feels odd though).
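The 8-vector example can be checked directly. The sketch below (hypothetical helper names, a small tolerance for floating-point dot products) counts orthogonal pairs and then finds the finest partition in which vectors from different blocks are always orthogonal; a single block means there is no nontrivial tegum-style decomposition.

```python
import math

def evenly_spaced_vectors(count):
    """Unit vectors in the plane at equal angular spacing around the full circle."""
    return [
        (math.cos(2 * math.pi * i / count), math.sin(2 * math.pi * i / count))
        for i in range(count)
    ]

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def orthogonal_pairs(vectors, tol=1e-9):
    """Unordered index pairs whose dot product is zero up to tolerance."""
    n = len(vectors)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if abs(dot(vectors[i], vectors[j])) < tol
    ]

def orthogonal_block_count(vectors, tol=1e-9):
    """Number of blocks after merging every pair of vectors that is not orthogonal."""
    block = list(range(len(vectors)))
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if abs(dot(vectors[i], vectors[j])) >= tol and block[i] != block[j]:
                old, new = block[j], block[i]
                block = [new if b == old else b for b in block]
    return len(set(block))

eight = evenly_spaced_vectors(8)
four = evenly_spaced_vectors(4)
```

For contrast, four evenly spaced vectors do split into two mutually orthogonal blocks (each a direction and its negation), whereas the eight-vector arrangement collapses into one block despite having orthogonal pairs.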

AIs may make decisions in ways that we find counterintuitive. These ways are likely to be shaped to be highly effective (because being effective is instrumentally useful to many goals), so understanding the space of highly effective decision theories is one way to think about how advanced AIs will reason without having one in front of us right now.

Living in the moment helps. There’s joy and beauty and life right here, right now, and that’s worth enjoying.

I guess I’m just not following what the causal reasons are here?
