Neel Nanda

Sequences

GDM Mech Interp Progress Updates
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
Mechanistic Interpretability Puzzles
Interpreting Othello-GPT
200 Concrete Open Problems in Mechanistic Interpretability
My Overview of the AI Alignment Landscape

Wiki Contributions

Comments

Sorted by

Huh, are there examples of right leaning stuff they stopped funding? That's new to me

+1. Concretely this means converting every probability p into p/(1-p), and then multiplying those (you can then convert back to probabilities)

Intuition pump: Person A says 0.1 and Person B says 0.9. This is symmetric, if we instead study the negation, they swap places, so any reasonable aggregation should give 0.5

Geometric mean does not, instead you get 0.3

Arithmetic gets 0.5, but is bad for the other reasons you noted

Geometric mean of odds is sqrt(1/9 * 9) = 1, which maps to a probability of 0.5, while also eg treating low probabilities fairly

Neel NandaΩ570

Interesting thought! I expect there's systematic differences, though it's not quite obvious how. Your example seems pretty plausible to me. Meta SAEs are also more incentived to learn features which tend to split a lot, I think, as then they're useful for more predicting many latents. Though ones that don't split may be useful as they entirely explain a latent that's otherwise hard to explain.

Anyway, we haven't checked yet, but I expect many of the results in this post would look similar for eg sparse linear regression over a smaller SAEs decoder. Re why meta SAEs are interesting at all, they're much cheaper to train than a smaller SAE, and BatchTopK gives you more control over the L0 than you could easily get with sparse linear regression, which are some mild advantages, but you may have a small SAE lying around anyway. I see the interesting point of this post more as "SAE latents are not atomic, as shown by one method, but probably other methods would work well too"

What's wrong with twitter as an archival source? You can't edit tweets (technically you can edit top level tweets for up to an hour, but this creates a new URL and old links still show the original version). Seems fine to just aesthetically dislike twitter though

To me, this model predicts that sparse autoencoders should not find abstract features, because those are shards, and should not be localisable to a direction in activation space on a single token. Do you agree that this is implied?

If so, how do you square that with eg all the abstract features Anthropic found in Sonnet 3?

I expect there's lots of new forms of capabilities elicitation for this kind of model, which their standard framework may not have captured, and which requires more time to iterate on

Thanks for the post!

sample five random users’ forecasts, score them, and then average

Are you sure this is how their bot works? I read this more as "sample five things from the LLM, and average those predictions". For Metaculus, the crowd is just given to you, right, so it seems crazy to sample users?

Yeah, fair point, disagreement retracted

I think this is important to define anyway! (and likely pretty obvious). This would create a lot more friction for someone to take on such a role though, or move out

Load More