f3mi
In the context of AI interpretability, what is a feature, exactly?
As I read more about previous interpretability work, I've noticed a trend that implicitly defines a feature in a strangely human-centric way. It's a prior that expects networks to automatically generate features that correspond to how we process images/text because... why, exactly? Chris Olah's team at Anthropic thinks...
Self-Explaining Neural Networks: the interpretability technique no one seems to be talking about.
Disclaimer: I'm very new to alignment as a whole, so I wouldn't be surprised if this turned out to be a nothingburger. This is the coolest paper I've seen in a while, yet I've never heard of the technique. It isn't mentioned in blog posts about AI interpretability/AI safety, and...
I was thinking something potentially similar. This is super nitpicky, but the better equation would be impact = magnitude * ||direction||.
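For readability, here is the comment's suggested formula typeset as a display equation; the variable names are taken directly from the comment, and the surrounding post's definitions of impact, magnitude, and direction are not reproduced here.

% Suggested form from the comment above: impact as a scalar magnitude
% times the norm of the direction vector (names taken from the comment).
\[
  \text{impact} = \text{magnitude} \times \lVert \text{direction} \rVert
\]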