What if interpretability is a spectrum?
Instead of one massive model, imagine that at inference time you generate a small, query-specific model by carving out a small subset of the big model.
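To make the carving concrete, here is a minimal sketch. PyTorch, the `TaggedModel` name, and the one-linear-layer-per-tag components are all illustrative assumptions, not a committed design: the point is just that the model is a shared base plus tag-keyed components, and the query-specific model is the base plus whichever components the query's tags select.

```python
import torch
import torch.nn as nn

class TaggedModel(nn.Module):
    """Shared base trunk plus one component per ontology tag (hypothetical layout)."""
    def __init__(self, d_model: int, tags: list[str]):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        self.components = nn.ModuleDict(
            {t: nn.Linear(d_model, d_model) for t in tags}
        )

    def forward(self, x: torch.Tensor, active_tags: set[str]) -> torch.Tensor:
        h = self.base(x)
        # Only components matching the query's tags run; everything else
        # never enters the compute graph, so its dot products cost nothing.
        for t in active_tags:
            h = h + self.components[t](x)
        return h
```

The carved model is implicit here: the active subgraph (base plus selected components) is the small, query-specific model you'd hand to an interpretability tool.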
Interp on this smaller model for a given query has to be easier than on the orders-of-magnitude-bigger full model.
Inference also gets substantially faster if you can cluster and batch similar queries together, so there's an actual economic case for funding this if it works at the frontier.
The core insight? Gradients smear knowledge everywhere. So tag training data with an ontology learned from the query stream of a frontier model; we can automate the work of millions of librarians classifying our training data. Then only let the base model plus the tagged components be active for a given training example. SGD can't smear the gradient across inactive components, so knowledge ends up roughly partitioned. And when you do a forward pass to generate some Python code, you don't pay for the dot products that know that Jackson is an entirely ignorable town in New Jersey.
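A minimal sketch of the training side, reusing the hypothetical `TaggedModel` above (the tag names are made up): because an inactive component never enters the forward graph, autograd never produces a gradient for it, which is exactly the partitioning claim.

```python
import torch
import torch.nn.functional as F

model = TaggedModel(d_model=64, tags=["python_code", "nj_geography"])
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def train_step(x, y, tags):
    opt.zero_grad(set_to_none=True)
    loss = F.mse_loss(model(x, active_tags=tags), y)
    loss.backward()
    # Components that never entered the forward graph receive no gradient,
    # so the optimizer cannot smear this example's update onto them.
    opt.step()
    return loss.item()

x, y = torch.randn(8, 64), torch.randn(8, 64)
train_step(x, y, tags={"python_code"})

# The untagged component was never touched: its gradient stays None.
assert model.components["nj_geography"].weight.grad is None
assert model.components["python_code"].weight.grad is not None
```

The partitioning here is structural, not a masking trick bolted onto the optimizer: what's kept separate is determined entirely by what the data tags allow into the forward pass.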