redhatbluehat

researchoor 

Comments

200 COP in MI: The Case for Analysing Toy Language Models
redhatbluehat · 2y

Thank you for this response! 

200 COP in MI: Image Model Interpretability
redhatbluehat · 2y

Thank you for the post! A question for Neel, or for anyone else in the comments: what datasets have you found most useful for testing tiny image transformers?

 

The vit-pytorch repo uses the cats-vs-dogs dataset from Kaggle, but I'm wondering if that is too complex for very simple image transformers.
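To make it concrete, here is roughly the scale of model I have in mind, wired up to MNIST as a (possibly too easy?) alternative dataset; the hyperparameters are just guesses, not recommendations:

```python
# A minimal sketch of a "very simple image transformer": vit-pytorch's ViT on
# MNIST instead of cats-vs-dogs. Hyperparameters are illustrative only.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from vit_pytorch import ViT

tiny_vit = ViT(
    image_size=28,   # MNIST images are 28x28
    patch_size=7,    # a 4x4 grid of patches
    channels=1,      # grayscale
    num_classes=10,
    dim=64,          # tiny residual stream
    depth=1,         # single transformer block
    heads=4,
    mlp_dim=128,
)

train_set = datasets.MNIST(root="data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=64, shuffle=True)

images, labels = next(iter(loader))
logits = tiny_vit(images)   # shape: (64, 10)
print(logits.shape)
```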

200 COP in MI: The Case for Analysing Toy Language Models
redhatbluehat · 3y

Hi Neel! Thanks so much for all these online resources. I've been finding them really interesting and helpful.

I have a question about research methods, prompted by this problem from the post: "How far can you get with really deeply reverse engineering a neuron in a 1 layer (1L) model? (solu-1l, solu-1l-pile or gelu-1l in TransformerLens)."

I've loaded up solu-1l in my Jupyter notebook, but I'm now feeling a bit lost. For your IOI tutorial there was a very specific benchmark and error signal; when I'm just playing around with a model without a clear capability in mind, it's harder to know how to measure performance. I could make a list of capabilities/benchmarks, systematically run the model on them, then pick a capability and start ablating parts of the model to see the effect on performance. But then I'm restricted to those predefined capabilities, and I'm not even sure what the capabilities of solu-1l are.
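To make the ablation idea concrete, this is roughly the kind of loop I've been sketching. The prompt and the neuron index are arbitrary placeholders, and I'm not sure comparing loss this way is even the right metric:

```python
# Rough sketch: zero-ablate one MLP neuron in solu-1l and compare the loss on
# an arbitrary prompt. The prompt and neuron index are placeholders, not a
# principled benchmark.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("solu-1l")
prompt = "The quick brown fox jumps over the lazy dog"
tokens = model.to_tokens(prompt)

baseline_loss = model(tokens, return_type="loss")

NEURON = 123  # arbitrary choice of neuron to ablate

def zero_ablate_neuron(value, hook):
    # value: (batch, pos, d_mlp) MLP activations after the nonlinearity
    value[:, :, NEURON] = 0.0
    return value

ablated_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(utils.get_act_name("post", 0), zero_ablate_neuron)],
)

print(f"baseline: {baseline_loss.item():.4f}, ablated: {ablated_loss.item():.4f}")
```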

I could also start feeding solu-1l random inputs and just "looking" at the attention patterns. But I'm wondering if there's a more efficient way to do this, or another strategy where research does feel like play, as you describe in your notebook.
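For reference, by "looking" I currently mean something like the following (assuming circuitsvis for the rendering; the prompt is arbitrary):

```python
# Cache activations on an arbitrary prompt and render layer 0's attention
# patterns. Nothing here is tied to a specific capability.
import circuitsvis as cv
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("solu-1l")
prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

logits, cache = model.run_with_cache(tokens)

pattern = cache["pattern", 0][0]       # (n_heads, seq, seq), batch dim dropped
str_tokens = model.to_str_tokens(prompt)

cv.attention.attention_patterns(tokens=str_tokens, attention=pattern)
```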

Thank you! 
