Lee Sharkey

Apollo Research (London).

My main research interests are mechanistic interpretability and inner alignment.

Wiki Contributions

Apollo Research (org)

4mo

Comments

Sorted by

Newest

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Lee Sharkey3moΩ8110

It would be interesting to meditate in the question "What kind of training procedure could you use to get a meta-SAE directly?" And I think answering this relies in part on mathematical specification of what you want.

At Apollo we're currently working on something that we think will achieve this. Hopefully will have an idea and a few early results (toy models only) to share soon.

Circumventing interpretability: How to defeat mind-readers

Lee Sharkey3moΩ120

So I believe I had in mind "active means [is achieved deliberately through the agent's actions]".

I think your distinction makes sense. And if I ever end up updating this article I would consider incorporating it. However, I think the reason I didn't make this distinction at the time is because the difference is pretty subtle.

The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations to help orchestrate the actions required to do it. This is true even if they've been selected for passively.

So I'd argue that they all need to be implemented actively (if they're to have a reasonable probability of success) but may be selected for passively or actively. I'm curious if you agree with this? If not, then I may have missed your argument.

(
For the convenience of other readers of this thread, the kind of strategies I labelled as strictly active are:

1.1.3. Environmental artifacts that corrupt its future self: A misaligned AI may leave behind artifacts that serve as adversarial examples or poisoned data that corrupt its future self. (Strictly active)
1.1.4. Deleting (and relearning) dangerous representations: When not under surveillance, an AI might be able to delete dangerous representations so that it looks safe when surveillance resumes. It might also be able to schedule when it relearns those representations. (Strictly active)
1.1.5. Weights poisoning: Instead of providing manipulated inputs, the AI might manipulate its own weights so that a small subset of normal inputs behave like adversarial examples. (Strictly active)
1.1.6. Pre-committing to following potentially dangerous instructions: Hubinger: "... if the model has the ability to read and write to some sort of state ... it could write out a pre-commitment to follow the hard-coded policy and then just have a policy of always following any pre-commitments it finds such that it never thinks about the deception that led it to that policy in the first place.” (Strictly active)
1.1.7. Hard-coded neural shortcuts: Hubinger again: “... it could hack its own gradient to produce a new model which will still exhibit the same deceptive behavior, but without doing any of the detectable deceptive cognition. … not because it's trying to act deceptive, but just because it's a hard-coded part of its algorithm” (Strictly active)
1.1.8. Following cached thoughts: Nanda: "[An AI might develop] the instinct to follow cached thoughts without thinking about them, and in occasional moments of deception to generate a lot of cached thoughts that will lead to misaligned actions, but with no deceptive cognition in the moment." (Strictly active)
1.1.9. Instrumental incapability: A misaligned AI that’s able to modify its own parameters could damage capabilities that serve some goals (e.g. aligned goals) more than its misaligned goals. Generalizes instrumental clumsiness. (Strictly active)
1.1.10. Instrumental clumsiness: An AI might hack its future self so that it is more likely to make certain kinds of mistakes while leaving other capabilities intact. (Strictly active)
1.3.3. Interpretability illusions: By manipulating the dataset shown to the tool, an AI can control its interpretations. (Strictly active)
1.4.1. Adversarial attack against networks in the interpretation tool: The AI could adversarially attack networks in the tool to manipulate the tool’s outputs. (Strictly active)
2.4.1. Manipulating, mind-hacking, or incapacitating the interpreter: An AI might be able to manipulate, directly control, or incapacitate humans using specific kinds of outputs. (Strictly active)

)

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Lee Sharkey3moΩ242

Extremely glad to see this! The Guez et al. model has long struck me as one of the best instances of a mesaoptimizer and it was a real shame that it was closed source. Looking forward to the interp findings!

Efficient Dictionary Learning with Switch Sparse Autoencoders

Lee Sharkey3mo30

Both of these seem like interesting directions (I had parameters in mind, but params and activations are too closely linked to ignore one or the other). And I don't have a super clear idea but something like representational similarity analysis between SwitchSAEs and regular SAEs could be interesting. This is just one possibility of many though. I haven't thought about it for long enough to be able to list many more, but it feels like a direction with low hanging fruit for sure. For papers, here's a good place to start for RSA: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3730178/

Efficient Dictionary Learning with Switch Sparse Autoencoders

Lee Sharkey4mo50

Great work! Very excited to see work in this direction (In fact, I didn't know you were working on this, so I'd expressed enthusiasm for MoE SAEs in our recent list of project ideas published just a few days ago!)

Comments:

I'd love to see some geometric analysis of the router. Is it just approximately a down-projection from the encoder features learned by a dense SAE trained on the same activations?
Consider integrating with SAELens.

Following Fedus et al., we route to a single expert SAE. It is possible that selecting several experts will improve performance. The computational cost will scale with the number of experts chosen.

If there are some very common features in particular layers (e.g. an 'attend to BOS' feature), then restricting one expert to be active at a time will potentially force SAEs to learn common features in every expert.

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey4moΩ120

Thanks! Fixed now

SAE feature geometry is outside the superposition hypothesis

Lee Sharkey4mo172

I've no great resources in mind for this. Off the top of my head, a few examples of ideas that are common in comp neuro that might have bought mech interp some time if more people in mech interp were familiar with them when they were needed:
- Polysemanticity/distributed representations/mixed selectivity
- Sparse coding
- Representational geometry
I am not sure about what future mech interp needs will be, so it's hard to predict which ideas or methods that are common in neuro will be useful (topological data analysis? Dynamical systems?). But it just seems pretty likely that a field that tackles such a similar problem will continue to be a useful source of intuitions and methods. I'd love if someone were to write a review or post on the interplay between the parallel fields of comp neuro and mech interp. It might help flag places where there ought to be more interplay.

SAE feature geometry is outside the superposition hypothesis

Lee Sharkey4mo2817

I'll add a strong plus one to this and a note for emphasis:

Representational geometry is already a long-standing theme in computational neuroscience (e.g. Kriegeskorte et al., 2013)

Overall I think mech interp practitioners would do well to pay more attention to ideas and methods in computational neuroscience. I think mech interp as a field has a habit of overlooking some hard-won lessons learned by that community.

Transcoders enable fine-grained interpretable circuit analysis for language models

Lee Sharkey6moΩ120

I'm pretty sure that there's at least one other MATS group (unrelated to us) currently working on this, although I'm not certain about any of the details. Hopefully they release their research soon!

There's recent work published on this here by Chris Mathwin, Dennis Akar, and me. The gated attention block is a kind of transcoder adapted for attention blocks.

Nice work by the way! I think this is a promising direction.

Note also the similar, but substantially different, use of the term transcoder here, whose problems were pointed out to me by Lucius. Addressing those problems helped to motivate our interest in the kind of transcoders that you've trained in your work!

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey7moΩ120

Trying to summarize my current understanding of what you're saying:

Yes all four sound right to me.
To avoid any confusion, I'd just add an emphasis that the descriptions are mathematical, as opposed semantic.

I'd guess you have intuitions that the "short description length" framing is philosophically the right one, and I probably don't quite share those and feel more confused how to best think about "short descriptions" if we don't just allow arbitrary Turing machines (basically because deciding what allowable "parts" or mathematical objects are seems to be doing a lot of work). Not sure how feasible converging on this is in this format (though I'm happy to keep trying a bit more in case you're excited to explain).

I too am keen to converge on a format in terms of Turing machines or Kolmogorov complexity or something else more formal. But I don't feel very well placed to do that, unfortunately, since thinking in those terms isn't very natural to me yet.

LESSWRONG
LW

Posts

Wiki Contributions

Comments