lewis smith

Comments
The ‘strong’ feature hypothesis could be wrong
lewis smith · 25d

Aren't the MLPs in a transformer straightforward examples of this?

That's certainly the most straightforward interpretation! I think a lot of the ideas I'm talking about here are downstream of the toy models paper, which introduces the idea that MLPs might fundamentally be explained in terms of (approximate) manipulations of these kinds of linear subspaces; i.e. that everything that wasn't explicable in this way would sort of function like noise, rather than playing a functional role.

I agree this should have been treated with a lot more suspicion than it was in interpretability circles, but lots of people were excited about this paper, and then SAEs seemed to 'magically' find lots of cool interpretable features based on this linear direction hypothesis. That seemed a bit like a validation of the linear feature idea in the TMS paper, which explains a certain amount of the excitement around it.
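(For concreteness, here is a minimal sketch of the model SAEs assume, with purely illustrative sizes and random weights rather than any particular trained SAE: each activation is approximated as a sparse, non-negative combination of a fixed dictionary of directions, and the dictionary directions are the candidate 'features'.)

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 256                      # hypothetical sizes
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
b_enc = np.zeros(n_features)

def sae_reconstruct(activation):
    # Encoder: ReLU gives a sparse, non-negative code over the dictionary.
    code = np.maximum(activation @ W_enc + b_enc, 0.0)
    # Decoder: the activation is modelled as a weighted sum of candidate
    # 'feature' directions (the rows of W_dec).
    return code, code @ W_dec

code, reconstruction = sae_reconstruct(rng.normal(size=d_model))
```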

I think the main point I wanted to make with the post was that the TMS model was a hypothesis, that it was a fairly vague hypothesis we should pin down more clearly so we could think about whether it was true or not, and that the strongest versions of it were probably not true.

The ‘strong’ feature hypothesis could be wrong
lewis smith · 26d

In the context of the SFH, 'feature' means what I called 'atom' in the text, i.e. a linear direction in activation space with a specific function in the model. The hypothesis implies that any mechanism can be usefully decomposed in terms of these, so finding a mechanism which is difficult to express in this model is counter-evidence. I think you could rescue the 'feature hypothesis' by using a vaguer definition of 'feature' (which is a common move).
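(A minimal sketch of the 'atom' notion, purely illustrative: a feature is a fixed direction in activation space, and the degree to which it is active on an input is just the scalar projection of the activation onto that direction.)

```python
import numpy as np

def feature_activation(activation: np.ndarray, direction: np.ndarray) -> float:
    # 'Atom' = a fixed linear direction; its value on this input is the
    # scalar read-off of the activation along the (unit-normalised) direction.
    v = direction / np.linalg.norm(direction)
    return float(activation @ v)
```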

The ‘strong’ feature hypothesis could be wrong
lewis smith · 1mo

I think that they are distinguishable. For instance, if you can find an example of a structure which doesn't fit the 'feature' model but clearly serves some algorithmic function, that would seem to be strong counter-evidence? For example, this paper https://arxiv.org/abs/2405.14860 demonstrates that at least the one-dimensional feature model is not complete. There might be some way to express that in 'strong feature hypothesis' form by adding a lot of epicycles, but I think that sort of thing would be evidence against the idea of independent one-dimensional linear features. The strong feature hypothesis does have the virtue of being strong, and is therefore quite vulnerable to counter-evidence! The main thing that makes this a bit more confusing is that exactly what the 'feature' hypothesis claimed was often left fairly vague, and disproving a vague hypothesis is quite difficult.
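(A toy illustration of the kind of structure at issue, my own construction rather than the cited paper's analysis: a cyclic variable laid out on a circle in a two-dimensional subspace clearly serves a function, but any single one-dimensional direction collapses distinct states together.)

```python
import numpy as np

n_states = 7                                        # a day-of-week-like cyclic variable
angles = 2 * np.pi * np.arange(n_states) / n_states
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # points on a 2-D circle

# Projecting onto any one direction merges distinct states,
# e.g. states k and n_states - k land on the same value along the x-axis.
projection = circle @ np.array([1.0, 0.0])
print(np.round(projection, 3))
```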

The ‘strong’ feature hypothesis could be wrong
lewis smith · 3mo

The weak LRH I would say is now well supported by considerable empirical evidence.

A couple of people have shrug-reacted to this sentence. I think the claim I had in mind as being 'well supported by empirical evidence' was something like 'you can often find examples of networks representing stuff with a linear direction pretty uncontroversially'. I think this is probably still a fair statement, though it's a bit vague, and you could argue that it's a bit of a leap from this (which only says that more than zero things are represented as linear directions) to how I phrased the weak LRH in the post.

While I think this post has mostly aged quite well, looking back I was hedging to try to avoid writing a post entitled 'here's why SAEs are doomed'.

evhub's Shortform
lewis smith · 4mo

Strong agree on the mentalistic language. In fact I would go a bit further than saying that work on deception is hard to understand without mentalistic language: I think this is a central point for work on deception/scheming (which the authors of this paper gesture at a little bit): any definition of strategic deception (e.g. agent A is trying to make agent B believe X, while agent A believes ~X) requires taking the intentional stance and attributing mental states to A and B. I think it's reasonable to probe whether attributing these mental states makes sense, and we shouldn't just uncritically apply the intentional stance. But coming up with experiments that distinguish whether a model is in a given intentional state is quite hard!

A Problem to Solve Before Building a Deception Detector
lewis smith · 7mo

Although, having said that, even simpler probes require some operationalisation of the target state (e.g. the model is lying), which is normally behavioural rather than 'bottom up' (lying requires believing things, which is an intentional state again).

A Problem to Solve Before Building a Deception Detector
lewis smith · 7mo

I wouldn't read too much into the title (it's partly just trying to be punchy), though I do think the connection between the intentional state of deception and its algorithmic representation is the important problem here.

Re. point 2: possibly this was a bit over-confident. I do think that, a priori, the simple correspondence theory is unlikely to be right, but maybe I should put more weight on 'simple correspondence will just hold for deception'.

Another thing I would maybe do differently if we wrote this again is to be a bit more specific about the kind of deception detector; I think I was mostly thinking of a 'circuit level' or 'representational' version of mechanistic interpretability here (e.g. working on finding the deception circuit or the deception representation). I think this is sometimes gestured at in the post (e.g. the idea that we need a high level of confidence in the internals of the model in order to make progress on deception).

I'm not sure that, for example, a supervised probe for some particular undesirable behaviour (which might well count as a 'deception detector') needs you to solve the correspondence problem.
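(A minimal sketch of the kind of supervised probe I mean, with placeholder data and labels rather than any existing pipeline: collect activations, attach a behavioural label to each example, and fit a linear classifier; nothing in this requires first solving the intentional-to-algorithmic correspondence.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 64))   # one activation vector per example (placeholder)
labels = rng.integers(0, 2, size=1000)      # behavioural label per example (placeholder)

# The probe is supervised against a behavioural operationalisation of the
# target state; it never needs an account of the underlying intentional state.
probe = LogisticRegression(max_iter=1000).fit(activations, labels)
scores = probe.predict_proba(activations)[:, 1]
```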

A Problem to Solve Before Building a Deception Detector
lewis smith · 9mo

I think it's important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.

And to expand on this a little bit more: it seems important that we hedge against this possibility by at least spending a bit of time thinking about plans that don't rhyme with 'I sure hope everything turns out to be a simple correspondence'! I think Eleni and I feel that this is a surprisingly widespread move in interpretability plans, which is maybe why some of the post is quite forceful in arguing against it.

A Problem to Solve Before Building a Deception Detector
lewis smith · 9mo

I think this is along the right sort of lines. Indeed I think this plan is the sort of thing I hoped to prompt people to think about with the post. But I think there are a few things wrong with it:

  • I think premise 1 is big if true, but I doubt it is as easy as this: see the DeepMind fact-finding sequence for some counter-evidence. It's also easy to imagine this being true for some categories of static facts about the external world (e.g. Paris being in France), but you need to be careful about extending it to the category of all propositional statements (e.g. the model thinks that this safeguard is adequate, or the model can't find any security flaws in this program).

  • Relatedly, your second bullet point assumes that you can unambiguously identify the 'fact' related to what the model is currently outputting, and look it up in the model; does this require you to find all the fact representations in advance, or is this computed on the fly?

  • I think that detecting/preventing models from knowingly lying would be a good research direction, and it's clearly related to strategic deception, but I'm not actually sure that it's a superset (consider a case where I'm bullshitting you rather than lying: I predict what you want to hear me say and I say it, and I don't know or care whether what I'm saying is true or false).

But yeah, I think this is a reasonable sort of thing to try, though you would need to do a lot of work to convince me of premise 1; indeed I doubt premise 1 is true a priori, though I am open to persuasion on this. Note that premise 1 being true of some facts is a very different claim to it being true of every fact!

A Problem to Solve Before Building a Deception Detector
lewis smith · 9mo

I don't think we actually disagree very much?

I think that it's totally possible that there do turn out to be convenient 'simple correspondences' for some intentional states that we care about (as you say, we have some potential examples of this already), but I think it's important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.

re.

Even in the case of irreducible complexity, it seems too strong to call it a category mistake; there's still an algorithmic implementation of (eg) recognizing a good chess move, it might just not be encapsulable in a nicely simple description. In the most extreme case we can point to the entire network as the algorithm underlying the intentional state.

This seems like a restatement of what I would consider an important takeaway from this post: that this sort of emergence is at least a conceptual possibility. I think if this is true, it is a category mistake to think about the intentional states as being implemented by a part or a circuit in the model; they are just implemented by the model as a whole.

I don't think a takeaway from our argument here is that you necessarily need a complete account of how intentional states emerge from algorithmic ones (e.g. see point 4 in the conclusion). I think our idea is more to point out that this conceptual distinction between intentional and algorithmic states is important to make, and that it's an important thing to look for empirically. See also conclusion/suggestion 2: we aren't arguing that interpretability work is hopeless; we are trying to point it at the problems that matter for building a deception detector, and give you some tools for evaluating existing or planned research on that basis.

Posts

lewis smith's Shortform · 1y · 7 comments
Towards data-centric interpretability with sparse autoencoders · 3mo · 2 comments
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2) · 7mo · 15 comments
A Problem to Solve Before Building a Deception Detector · 9mo · 12 comments
The ‘strong’ feature hypothesis could be wrong · 1y · 28 comments
Improving Dictionary Learning with Gated Sparse Autoencoders · 2y · 38 comments
[Full Post] Progress Update #1 from the GDM Mech Interp Team · 2y · 10 comments
[Summary] Progress Update #1 from the GDM Mech Interp Team · 2y · 0 comments
Dropout can create a privileged basis in the ReLU output model. · 3y · 3 comments