lewis smith

Comments
The ‘strong’ feature hypothesis could be wrong
lewis smith · 25d

Aren't the MLPs in a transformer straightforward examples of this?

That's certainly the most straightforward interpretation! I think a lot of the ideas I'm talking about here are downstream of the toy models paper, which introduces the idea that MLPs might fundamentally be explained in terms of (approximate) manipulations of these kinds of linear subspaces; i.e. that everything that wasn't explicable in this way would sort of function like noise, rather than playing a functional role.

I agree this should have been treated with a lot more suspicion than it was in interpretability circles, but lots of people were excited about this paper, and then SAEs seemed to 'magically' find lots of cool interpretable features based on this linear direction hypothesis. That seemed a bit like a validation of the linear feature idea in the TMS paper, which explains a certain amount of the excitement around it.
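(For concreteness, here is a minimal sketch of the model SAEs assume, with purely illustrative sizes and random weights rather than any particular trained SAE: each activation is approximated as a sparse, non-negative combination of a fixed dictionary of directions, and the dictionary directions are the candidate 'features'.)

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 256                      # hypothetical sizes
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
b_enc = np.zeros(n_features)

def sae_reconstruct(activation):
    # Encoder: ReLU gives a sparse, non-negative code over the dictionary.
    code = np.maximum(activation @ W_enc + b_enc, 0.0)
    # Decoder: the activation is modelled as a weighted sum of candidate
    # 'feature' directions (the rows of W_dec).
    return code, code @ W_dec

code, reconstruction = sae_reconstruct(rng.normal(size=d_model))
```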

I think the main point I wanted to make with the post was that the TMS model was a hypothesis, that it was a fairly vague hypothesis we should pin down more clearly so we could think about whether it was true or not, and that the strongest versions of it were probably not true.

The ‘strong’ feature hypothesis could be wrong
lewis smith · 26d

In the context of the SFH, 'feature' means what I called 'atom' in the text, i.e. a linear direction in activation space with a specific function in the model. The hypothesis implies that any mechanism can be usefully decomposed in terms of these, so finding a mechanism which is difficult to express in this model is counter-evidence. I think you could rescue the 'feature hypothesis' by using a vaguer definition of 'feature' (which is a common move).
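(A minimal sketch of the 'atom' notion, purely illustrative: a feature is a fixed direction in activation space, and the degree to which it is active on an input is just the scalar projection of the activation onto that direction.)

```python
import numpy as np

def feature_activation(activation: np.ndarray, direction: np.ndarray) -> float:
    # 'Atom' = a fixed linear direction; its value on this input is the
    # scalar read-off of the activation along the (unit-normalised) direction.
    v = direction / np.linalg.norm(direction)
    return float(activation @ v)
```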

The ‘strong’ feature hypothesis could be wrong
lewis smith · 1mo

I think that they are distinguishable. For instance, if you can find an example of a structure which doesn't fit the 'feature' model but clearly serves some algorithmic function, that would seem to be strong counter-evidence? For example, this paper https://arxiv.org/abs/2405.14860 demonstrates that at least the one-dimensional feature model is not complete. There might be some way to express that in 'strong feature hypothesis' form by adding a lot of epicycles, but I think that sort of thing would be evidence against the idea of independent one-dimensional linear features. The strong feature hypothesis does have the virtue of being strong, and is therefore quite vulnerable to counter-evidence! The main thing that makes this a bit more confusing is that exactly what the 'feature' hypothesis claimed was often left fairly vague, and disproving a vague hypothesis is quite difficult.
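(A toy illustration of the kind of structure at issue, my own construction rather than the cited paper's analysis: a cyclic variable laid out on a circle in a two-dimensional subspace clearly serves a function, but any single one-dimensional direction collapses distinct states together.)

```python
import numpy as np

n_states = 7                                        # a day-of-week-like cyclic variable
angles = 2 * np.pi * np.arange(n_states) / n_states
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # points on a 2-D circle

# Projecting onto any one direction merges distinct states,
# e.g. states k and n_states - k land on the same value along the x-axis.
projection = circle @ np.array([1.0, 0.0])
print(np.round(projection, 3))
```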

The ‘strong’ feature hypothesis could be wrong
lewis smith · 3mo

The weak LRH I would say is now well supported by considerable empirical evidence.

A couple of people have shrug-reacted to this sentence. I think the claim I had in mind as being 'well supported by empirical evidence' was something like 'you can often find examples of networks representing stuff with a linear direction pretty uncontroversially'. I think this is probably still a fair statement, though it's a bit vague, and you could argue that it's a bit of a leap from this (which only says that more than zero things are represented as linear directions) to how I phrased the weak LRH in the post.

While I think this post has mostly aged quite well, looking back I was hedging to try to avoid writing a post entitled 'here's why SAEs are doomed'.

evhub's Shortform
lewis smith · 4mo

Strong agree on the mentalistic language. In fact I would go a bit further than saying that work on deception is hard to understand without mentalistic language: I think this is a central point for work on deception/scheming (which the authors of this paper gesture at a little bit): any definition of strategic deception (e.g. agent A is trying to make agent B believe X, while agent A believes ~X) requires taking the intentional stance and attributing mental states to A and B. I think it's reasonable to probe whether attributing these mental states makes sense, and we shouldn't just uncritically apply the intentional stance. But coming up with experiments that distinguish whether a model is in a given intentional state is quite hard!

A Problem to Solve Before Building a Deception Detector
lewis smith · 7mo

Although, having said that, even simpler probes require some operationalisation of the target state (e.g. the model is lying), which is normally behavioural rather than 'bottom up' (lying requires believing things, which is an intentional state again).

A Problem to Solve Before Building a Deception Detector
lewis smith · 7mo

I wouldn't read too much into the title (it's partly just trying to be punchy), though I do think the connection between the intentional state of deception and its algorithmic representation is the important problem here.

Re. point 2: possibly this was a bit over-confident. I do think that, a priori, the simple correspondence theory is unlikely to be right, but maybe I should put more weight on 'simple correspondence will just hold for deception'.

Another thing I would maybe do differently if we wrote this again is to be a bit more specific about the kind of deception detector; I think I was mostly thinking of a 'circuit level' or 'representational' version of mechanistic interpretability here (e.g. working on finding the deception circuit or the deception representation). I think this is sometimes gestured at in the post (e.g. the idea that we need a high level of confidence in the internals of the model in order to make progress on deception).

I'm not sure that, for example, a supervised probe for some particular undesirable behaviour (which might well count as a 'deception detector') needs you to solve the correspondence problem.
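(A minimal sketch of the kind of supervised probe I mean, with placeholder data and labels rather than any existing pipeline: collect activations, attach a behavioural label to each example, and fit a linear classifier; nothing in this requires first solving the intentional-to-algorithmic correspondence.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 64))   # one activation vector per example (placeholder)
labels = rng.integers(0, 2, size=1000)      # behavioural label per example (placeholder)

# The probe is supervised against a behavioural operationalisation of the
# target state; it never needs an account of the underlying intentional state.
probe = LogisticRegression(max_iter=1000).fit(activations, labels)
scores = probe.predict_proba(activations)[:, 1]
```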

A Problem to Solve Before Building a Deception Detector
lewis smith · 9mo

I think it's important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.

And to expand on this a little bit more: it seems important that we hedge against this possibility by at least spending a bit of time thinking about plans that don't rhyme with 'I sure hope everything turns out to be a simple correspondence'! I think Eleni and I feel that this is a surprisingly widespread move in interpretability plans, which is maybe why some of the post is quite forceful in arguing against it.

A Problem to Solve Before Building a Deception Detector
lewis smith · 9mo

I think this is along the right sort of lines. Indeed I think this plan is the sort of thing I hoped to prompt people to think about with the post. But I think there are a few things wrong with it:

  • I think premise 1 is big if true, but I doubt it is as easy as this: see the DeepMind fact-finding sequence for some counter-evidence. It's also easy to imagine this being true for some categories of static facts about the external world (e.g. Paris being in France), but you need to be careful about extending it to the category of all propositional statements (e.g. the model thinks that this safeguard is adequate, or the model can't find any security flaws in this program).

  • Relatedly, your second bullet point assumes that you can unambiguously identify the 'fact' related to what the model is currently outputting, and look it up in the model; does this require you to find all the fact representations in advance, or is this computed on the fly?

  • I think that detecting/preventing models from knowingly lying would be a good research direction, and it's clearly related to strategic deception, but I'm not actually sure that it's a superset (consider a case where I'm bullshitting you rather than lying: I predict what you want to hear me say and I say it, and I don't know or care whether what I'm saying is true or false).

But yeah, I think this is a reasonable sort of thing to try, though you would need to do a lot of work to convince me of premise 1; indeed I doubt premise 1 is true a priori, though I am open to persuasion on this. Note that premise 1 being true of some facts is a very different claim to it being true of every fact!

A Problem to Solve Before Building a Deception Detector
lewis smith · 9mo

I don't think we actually disagree very much?

I think that it's totally possible that there do turn out to be convenient 'simple correspondences' for some intentional states that we care about (as you say, we have some potential examples of this already), but I think it's important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.

re.

Even in the case of irreducible complexity, it seems too strong to call it a category mistake; there's still an algorithmic implementation of (eg) recognizing a good chess move, it might just not be encapsulable in a nicely simple description. In the most extreme case we can point to the entire network as the algorithm underlying the intentional state.

This seems like a restatement of what I would consider an important takeaway from this post: that this sort of emergence is at least a conceptual possibility. I think if this is true, it is a category mistake to think about the intentional states as being implemented by a part or a circuit in the model; they are just implemented by the model as a whole.

I don't think a takeaway from our argument here is that you necessarily need a complete account of how intentional states emerge from algorithmic ones (e.g. see point 4 in the conclusion). I think our idea is more to point out that this conceptual distinction between intentional and algorithmic states is important to make, and that it's an important thing to look for empirically. See also conclusion/suggestion 2: we aren't arguing that interpretability work is hopeless; we are trying to point it at the problems that matter for building a deception detector, and give you some tools for evaluating existing or planned research on that basis.

Posts

lewis smith's Shortform · 1y · 7 comments
Towards data-centric interpretability with sparse autoencoders · 3mo · 2 comments
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2) · 7mo · 15 comments
A Problem to Solve Before Building a Deception Detector · 9mo · 12 comments
The ‘strong’ feature hypothesis could be wrong · 1y · 28 comments
Improving Dictionary Learning with Gated Sparse Autoencoders · 2y · 38 comments
[Full Post] Progress Update #1 from the GDM Mech Interp Team · 2y · 10 comments
[Summary] Progress Update #1 from the GDM Mech Interp Team · 2y · 0 comments
Dropout can create a privileged basis in the ReLU output model. · 3y · 3 comments