Hello everyone,

This is a new real-name account purely for the discussion of AI.

A few months ago, I introduced a concept here on mechanistic interpretability.[1] It is an approach, supported by a PyTorch implementation, that can derive insights from neurons without explicit training on the intermediate concepts. As an illustration, my model can identify strategic interactions between chess pieces despite being trained solely on win/loss/draw outcomes. One thing that distinguishes it from recent work, such as Anthropic's "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning", is that it does not require training additional interpretive networks, although it is possible that both approaches could be used together.
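For concreteness, here is a minimal illustrative sketch of what probing neurons without an auxiliary interpretive network can look like. This is not the method from the paper, just the general flavour of the contrast with dictionary learning; `activations` and `concept_labels` are hypothetical placeholders.

```python
import torch

def concept_neuron_scores(activations: torch.Tensor,
                          concept_labels: torch.Tensor) -> torch.Tensor:
    """Pearson correlation between each neuron's activation and a binary concept.

    activations:    (n_positions, n_neurons) hidden activations from the trained model
    concept_labels: (n_positions,) 1.0 where the concept holds (e.g. "a knight is
                    attacked"), 0.0 otherwise
    Returns a per-neuron score; a large |score| suggests the neuron tracks the
    concept even though the model was never trained on concept labels.
    """
    a = activations - activations.mean(dim=0, keepdim=True)
    c = (concept_labels - concept_labels.mean()).unsqueeze(1)
    cov = (a * c).mean(dim=0)
    denom = activations.std(dim=0, unbiased=False) * concept_labels.std(unbiased=False) + 1e-8
    return cov / denom
```

The point of the sketch is only the contrast: nothing new is trained, and the score is read straight off the model's own activations.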

Since sharing it, I've had two people read the paper (not on LW). Their feedback varied: they described it respectively as "interesting" and "confusing." I realise I generally find it easier to come up with ideas than to explain them to other people.

The apparent simplicity of my method makes me think that other people must already have considered and discarded this approach, but in case it has genuinely been overlooked,[2] I'd love to get more eyes on it.
Any feedback would be invaluable, especially regarding:

Any prior work on this approach.
Potential pitfalls.
Suggestions for formal submission venues (I'm not aware of a dedicated journal for alignment).

If you want, I can write a shorter summary.

Thank you for your time and consideration.

[1]: https://www.lesswrong.com/posts/eETbeqAJDDkG2kd7t/exploring-concept-specific-slices-in-weight-matrices-for
[2]: https://www.lesswrong.com/posts/ziNCZEm7FE9LHxLai/don-t-dismiss-simple-alignment-approaches