jake_mendel

technical AI safety program associate at OpenPhil

Comments

How quick and big would a software intelligence explosion be?
jake_mendel · 1mo

Thanks for this post!

Caveat: I haven't read this very closely yet, and I'm not an economist. I'm finding it hard to understand why you think it's reasonable to model an increase in capabilities as an increase in the number of parallel copies. That is: in the returns-to-R&D section, you look at data on how increasing numbers of human-level AI researchers affect algorithmic progress, but we have ~no data on what happens when you sample researchers from a very different (and superhuman) capability profile. It seems to me entirely plausible that a few months into the intelligence explosion, the best AI researchers are qualitatively superintelligent enough that their research advances per month aren't the sort of thing that could be done by ~any number of humans[1] acting in parallel in a month. I acknowledge that this is probably not tractable to model, but that seems like a problem, because it seems to me that this qualitative superintelligence is a (maybe the) key driving force of the intelligence explosion.

Some intuition pumps for why this seems reasonably likely:

  • My understanding is that historians of science disagree on whether science is driven mostly by a few geniuses or not. It probably varies by discipline, and by how understanding-driven progress is. Compared to other fields in hard STEM, ML is probably less understanding-driven right now, but it is still relatively understanding-driven. I think there are good reasons to think it could plausibly become more understanding-driven once the researchers are superhuman, because interp, agent foundations, GOFAI etc haven't made zero progress and don't seem fundamentally impossible to me. And if capabilities research becomes loaded on understanding very complicated things, then it could become extremely dependent on quite how capable the most capable researchers are, in a way that can't easily be substituted for by more human-level researchers.
  • Suppose I take a smart human and give them the ability/bandwidth to memorise and understand the entire internet. That person would be really different to any normal human, and also really different to any group of humans. So when they try to do research, they approach the tree of ideas to pick the low-hanging fruit from a different direction to all of society's research efforts beforehand, and it seems possible that from their perspective there is a lot of low-hanging fruit left on the tree — lots of things that seem easy from their vantage point and nearly impossible to grasp from ours[2]. And research into how sharply returns to ideas in the field have diminished so far is not useful for predicting how much research progress that enhanced human would make in their first year.
    • It seems hard to know quite how many angles of approach there are on the tree of ideas, but it seems possible to me that on more than one occasion when you build a new AI that is now the most intelligent being in the world, it starts doing research and finds many ideas that are easy for it and near impossible for all the beings in the world that came before it.
  1. ^

    or at least only by an extremely large number of humans, who are doing something more like brute force search and less like thinking

  2. ^

    This is basically the same idea as Dwarkesh's point that a human-level LLM should be able to make all sorts of new discoveries by connecting dots that humans can't connect because we can't read and take in the whole internet.

How do you feel about LessWrong these days? [Open feedback thread]
jake_mendel · 3mo

I think the biggest alignment-relevant update is that I expected RL fine-tuning over longer horizons (or even model-based RL a la AlphaZero) to be a bigger deal. I was really worried about it significantly improving performance and making alignment harder. In 2018-2019 my mainline picture was more like AlphaStar or AlphaZero, with RL fine-tuning being the large majority of compute. I've updated about this and definitely acknowledge I was wrong.[3] I don't think it totally changes the picture though: I'm still scared of RL, I think it is very plausible it will become more important in the future, and think that even the kind of relatively minimal RL we do now can introduce many of the same risks.

Curious to hear how you would revisit this prediction in light of reasoning models? Seems like you weren't as wrong as you thought a year ago, but maybe you still think there are some key ways your predictions about RL finetuning were off?

Obstacles in ARC's agenda: Mechanistic Anomaly Detection
jake_mendel · 4mo

This was really useful to read, thanks very much for writing these posts!

Josh Engels's Shortform
jake_mendel · 4mo

Very happy you did this!

The case for unlearning that removes information from LLM weights
jake_mendel · 6mo

Do you have any idea whether the difference between unlearning success on synthetic facts fine-tuned in after pretraining vs real facts introduced during pretraining comes mainly from the 'synthetic' part or the 'fine-tuning' part? I.e. if you took the synthetic facts dataset and spread it out through the pretraining corpus, do you expect it would be any harder to unlearn the synthetic facts? Or maybe this question doesn't make sense because you'd have to make the dataset much larger or something to get the model to learn the facts at all during pretraining? If so, it seems like a pretty interesting research question to try to understand which properties a dataset of synthetic facts needs to have to defeat unlearning.

The Case Against AI Control Research
jake_mendel · 8mo

Fair point. I guess I still want to say that there's a substantial amount of 'come up with new research agendas' (or at least sub-agendas) to be done within each of your bullet points, but I agree the focus on getting trustworthy slightly-superhuman AIs, and then not needing control anymore, makes things much better. I also feel pretty nervous about some of those bullet points as paths to placing so much trust in your AI systems that you no longer feel the need to control/monitor them, and the ones that go furthest towards giving me enough trust in the AIs to stop control are also the ones with the most wide-open research questions (eg EMs in the extreme case). But I do want to walk back some of the things in my comment above that apply only to aligning very superintelligent AI.

The Case Against AI Control Research
jake_mendel · 8mo

If you are (1) worried about superintelligence-caused x-risk and (2) have short timelines to both TAI and ASI, it seems like the success or failure of control depends almost entirely on getting the early TAIs to do stuff like "coming up with research agendas"? Like, most people (in AIS) don't seem to think that unassisted humans are remotely on track to develop alignment techniques that work for very superintelligent AIs within the next 10 years — we don't even really have any untested good ideas for how to do that. Therefore if we have very superintelligent AIs within the next 10 years (eg 5y till TAI and 5y of RSI), and if we condition on having techniques for aligning them, then it seems very likely that these techniques depend on novel ideas and novel research breakthroughs made by AIs in the period after TAI is developed. It's possible that most of these breakthroughs are within mechinterp or similar, but that's a pretty loose constraint, and 'solve mechinterp' is really not much more of a narrow, well-scoped goal than 'solve alignment'. So it seems like optimism about control rests somewhat heavily on optimism that controlled AIs can safely do things like coming up with new research agendas.

Activation space interpretability may be doomed
jake_mendel · 8mo

[edit: I'm now thinking that actually the optimal probe vector is also orthogonal to $\mathrm{span}\{\vec{v}_j \mid j \neq i\}$, so maybe the point doesn't stand. In general, I think it is probably a mistake to talk about activation vectors as linear combinations of feature vectors, rather than as vectors that can be projected onto a set of interpretable readoff directions. See here for more.]

Yes, I'm calling the representation vector the same as the probing vector. Suppose my activation vector can be written as $\vec{a} = \sum_i f_i \vec{v}_i$, where the $f_i$ are feature values and the $\vec{v}_i$ are feature representation vectors. Then the probe vector which minimises MSE (explains most of the variance) is just $\vec{v}_i$. To avoid off-target effects, the vector $\vec{s}_i$ you want to steer with for feature $i$ might be the vector that is most 'surgical': it only changes the value of this feature and no other features are changed. In that case it should be the vector that lies orthogonal to $\mathrm{span}\{\vec{v}_j \mid j \neq i\}$, which is only the same as $\vec{v}_i$ if the set $\{\vec{v}_i\}$ is orthogonal.

Obviously I'm working with a non-overcomplete basis of feature representation vectors here. If we're dealing with the overcomplete case, then it's messier. People normally talk about 'approximately orthogonal vectors', in which case the most surgical steering vector $\vec{s}_i \approx \vec{v}_i$, but (handwaving) you can also talk about something like 'approximately linearly independent vectors', in which case my point stands, I think (note that SAE decoder directions are definitely not approximately orthogonal). For something less handwavey see this appendix.
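
To make the non-orthogonal case concrete, here is a minimal numeric sketch (my own toy numbers, not from the post or comment): when the $\vec{v}_i$ are not orthogonal, the dual 'readoff' directions (the rows of $V^{-1}$, each orthogonal to the span of the other representation vectors) come apart from the representation vectors themselves, and the two only coincide in the orthonormal case.

```python
import numpy as np

# Toy example: two non-orthogonal feature representation vectors v_1, v_2
# in 2D, stacked as the columns of V. All numbers are arbitrary.
theta = 0.3
V = np.column_stack([
    np.array([1.0, 0.0]),                      # v_1
    np.array([np.cos(theta), np.sin(theta)]),  # v_2, nearly parallel to v_1
])

# Activation vector for feature values f = (f_1, f_2).
f = np.array([2.0, -1.0])
a = V @ f

# Dual (readoff) basis: rows of V^{-1}, satisfying d_i . v_j = delta_ij,
# so each d_i is orthogonal to the span of the *other* representation vectors.
D = np.linalg.inv(V)

print(D @ a)                       # recovers f exactly: [ 2. -1.]
print(D[0] @ V[:, 1])              # ~0: d_1 is orthogonal to v_2
print(np.allclose(D[0], V[:, 0]))  # False: d_1 != v_1 when {v_i} isn't orthogonal
```

In the orthonormal case $V^{-1} = V^\top$ and the two bases coincide, which is why the probing-versus-steering distinction only bites when the representation vectors are non-orthogonal.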

Activation space interpretability may be doomed
jake_mendel · 8mo

A thought triggered by reading issue 3:

I agree issue 3 seems like a potential problem with methods that optimise for sparsity too much, but it doesn't seem that directly related to the main thesis? At least in the example you give, it should be possible in principle to notice that the space can be factored as a direct sum without having to look to future layers. I guess what I want to ask here is: 

It seems like there is a spectrum of possible views you could have here:

  1. It's achievable to come up with sensible ansatzes (sparsity, linear representations, decomposing the space into a direct sum when that looks possible, and so on) which will get us most of the way to finding the ground-truth features, but there are edge cases/counterexamples which can only be resolved by looking at how the activation vector is used. This is compatible with the example you gave in issue 3, where the space is factorisable into a direct sum, which seems pretty natural/easy to look for in advance, although of course that's the reason you picked that particular structure as an example.
  2. There are many many ways to decompose an activation vector, corresponding to many plausible but mutually incompatible sets of ansatzes, and the only way to know which is correct for the purposes of understanding the model is to see how the activation vector is used in the later layers.
    1. Maybe there are many possible decompositions but they are all/mostly straightforwardly related to each other by eg a sparse basis transformation, so finding any one decomposition is a step in the right direction.
    2. Maybe not that.
  3. Any sensible approach to decomposing an activation vector without looking forward to subsequent layers will be actively misleading. The right way to decompose the activation vector can't be found in isolation with any set of natural ansatzes because the decomposition depends intimately on the way the activation vector is used.

The main strategy being pursued in interpretability today is (insofar as interp is about fully understanding models):

  • First decompose each activation vector individually. Then try to integrate the decompositions of different layers together into circuits. This may require merging found features into higher level features, or tweaking the features in some way, or filtering out some features which turn out to be dataset features. (See also superseding vs supplementing superposition).

This approach is betting that the decompositions you get when you take each vector in isolation are a (big) step in the right direction, even if they require modification, which is more compatible with stances (1) and (2a) in the list above. I don't think your post contains any knockdown arguments that this approach is doomed (do you agree?), but it is maybe suggestive. It would be cool to have some fully reverse-engineered toy models where we can study one layer at a time and see what is going on.

Activation space interpretability may be doomed
jake_mendel · 8mo

Nice post! Re issue 1, there are a few things that you can do to work out if a representation you have found is a 'model feature' or a 'dataset feature'. You can:

  • Check if intervening on the forward pass to modify this feature produces the expected effect on outputs. Caveats:

    • The best vector for probing is not the best vector for steering (in general the inverse of a matrix is not its transpose, and finding a basis of steering vectors from a basis of probe vectors involves inverting the basis matrix).
    • It's possible that the feature you found is causally upstream of some features the model has learned, and even if the model hasn't learned this feature, changing it affects things the model is aware of. OTOH, I'm not sure whether I want to say that this feature has not been learned by the model in this case.
    • Some techniques, eg crosscoders, don't come equipped with a well-defined notion of intervening on the feature during a forward pass.

    Nonetheless, we can still sometimes get evidence this way, in particular about whether our probe has found subtle structure in the data that is really causally irrelevant to the model. This is already a common technique in interpretability (see eg the inimitable Golden Gate Claude, and many more systematic steering tests like this one).

  • Run various shuffle/permutation controls:
    • Measure the selectivity of your feature finding technique: replace the structure in the data with some new structure (or just remove the structure) and then see if your probe finds that new structure. To the extent that the probe can learn the new structure, it is not telling you about what the model has learned.
      Most straightforwardly: if you have trained a supervised probe, you can train a second supervised probe on a dataset with randomised labels, and look at how much more accurate the probe is when trained on data with true labels. This can help distinguish between the hypothesis that you have found a real variable in the model, and the null hypothesis that the probing technique is powerful enough to find a direction that can classify any dataset with that accuracy. Selectivity tests should do things like match the bias of the train data (eg if training a probe on a sparsely activating feature, then the value of the feature is almost always zero and that should be preserved in the control). There is a minimal sketch of this control after this list.
      You can also test unsupervised techniques like SAEs this way by training them on random sequences of tokens. There are probably more sophisticated controls that could be introduced here: eg you can try to destroy all the structure in the data and replace it with random structure that is still sparse in the same sense, and so on.
    • In addition to experiments that destroy the probe training data, you can also run experiments that destroy the structure in the model weights. To the extent that the probe works here, it is not telling you about what the model has learned.
      For example, reinitialise the weights of the model, and then train the probe/SAE/look at the PCA directions. This is a weak control: a stronger one could reinitialise the model's weights in a way that matches the eigenspectrum of each weight matrix to that of the corresponding matrix in the trained model (to rule out things like the SAE failing on the randomised model just because the activation vectors are too small, etc), although that control is still quite weak.
      This control was used nicely in Towards Monosemanticity here, although I think much more research of this form could be done with SAEs and their cousins.
    • I am told by Adam Shai that in experimental neuroscience, it is something of a sport to come up with better and better controls for testing the hypothesis that you have identified structure. Maybe some of that energy should be imported to interp?
  • Probably some other things not on my mind right now??
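
To make the randomised-labels selectivity control above concrete, here is a minimal sketch with synthetic stand-in activations and labels (all names, shapes and numbers are illustrative assumptions, not taken from any real model or probe):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for layer activations and a binary concept label; in practice
# these would come from the model and dataset you are probing.
acts = rng.normal(size=(2000, 512))
labels = (acts[:, :8].sum(axis=1) > 0).astype(int)  # planted linear structure

def probe_accuracy(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    return LogisticRegression(max_iter=2000).fit(X_tr, y_tr).score(X_te, y_te)

true_acc = probe_accuracy(acts, labels)
# Null control: same activations, labels permuted so that they keep their
# marginal statistics but carry no information about the activations.
null_acc = probe_accuracy(acts, rng.permutation(labels))

print(f"true labels: {true_acc:.2f}, shuffled labels: {null_acc:.2f}, "
      f"selectivity: {true_acc - null_acc:.2f}")
```

A probe class that scores well even on the shuffled labels is mostly telling you about its own capacity rather than about structure the model has learned, which is the failure mode the control is meant to catch.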

I am aware that there is less use in being able to identify whether your features are model features or dataset features than there is in having a technique that zero-shot identifies model features only. However, a reliable set of tools for distinguishing which type of feature we have found would give us feedback loops that could help us search for good feature-finding techniques: eg good controls would give us the freedom to do things like searching over (potentially nonlinear) probe architectures for those with a high accuracy relative to the control (in the absence of a control, searching over architectures would lead us to more and more expressive nonlinear probes that tell us nothing about the model's computation). I'm curious whether this sort of thing would lead us away from treating activation vectors in isolation, as the post argues.

Posts

  • Research directions Open Phil wants to fund in technical AI safety (117 karma · 7mo · 21 comments)
  • Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas (111 karma · 7mo · 0 comments)
  • Attribution-based parameter decomposition (108 karma · 8mo · 22 comments)
  • Circuits in Superposition: Compressing many small neural networks into one (131 karma · 11mo · 9 comments)
  • jake_mendel's Shortform (5 karma · 1y · 3 comments)
  • [Interim research report] Activation plateaus & sensitive directions in GPT2 (65 karma · 1y · 2 comments)
  • SAE feature geometry is outside the superposition hypothesis (229 karma · 1y · 17 comments)
  • Apollo Research 1-year update (93 karma · 1y · 0 comments)
  • Interpretability: Integrated Gradients is a decent attribution method (23 karma · 1y · 7 comments)
  • The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks (108 karma · 1y · 4 comments)