Posts

Sorted by New

Wiki Contributions

Comments

Transcoders enable fine-grained interpretable circuit analysis for language models

Hey Jacob + Philippe,

I took the liberty of making a clean installable version of your original codebase. Hope you don't mind, and happy to make any changes that you request! https://github.com/dtch1997/transcoders-slim

Reply

Toward A Mathematical Framework for Computation in Superposition

Daniel Tan22d10

This work is very exciting to me, and I'm curious to hear the authors' thoughts on whether we could verify specific predictions made by this model in real models.

For example, the proposed U-AND operator - do we expect this to occur in real LLMs, and could we try to find evidence of this by applying mech interp to carefully-chosen toy models?

I have a more detailed write-up on model organisms of superposition here: https://docs.google.com/document/d/1hwI30HNNB2MkOrtEzo7hppG9X7Cn7Xm9a-1LBqcttWc/edit?usp=sharing

Would love to discuss this more!

Reply