Adam Scherlis - LessWrong

SAE feature geometry is outside the superposition hypothesis

I strongly agree with this post.

I'm not sure about this, though:

We are familiar with modular addition being performed in a circle from Nanda et al., so we were primed to spot this kind of thing — more evidence of street lighting.

It could be the streetlight effect, but it's not that surprising that we'd see this pattern repeatedly. The circular representation is essentially the only nontrivial irreducible representation (in the group-theoretic sense) of modular addition, and the cyclic groups of prime order are the only simple commutative groups. It's likely to pop up in many places whether or not we're looking for it (like position embeddings, as Eric pointed out, or anything else Fourier-flavored).
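To make "circular representation" concrete, here is a minimal sketch. It assumes the usual construction where the residue k mod n is sent to the point exp(2πi·freq·k/n) on the unit circle. The function names and the `freq` parameter are illustrative and not taken from the post. The check confirms that multiplying two embedded points (composing rotations) agrees with adding the residues mod n.

```python
import cmath

def circle_embed(k: int, n: int, freq: int = 1) -> complex:
    """Map the residue k mod n to a point on the unit circle (illustrative helper)."""
    return cmath.exp(2j * cmath.pi * freq * k / n)

def is_homomorphism(n: int, freq: int = 1, tol: float = 1e-9) -> bool:
    """Check embed(a) * embed(b) == embed((a + b) mod n) for every pair a, b."""
    for a in range(n):
        for b in range(n):
            lhs = circle_embed(a, n, freq) * circle_embed(b, n, freq)
            rhs = circle_embed((a + b) % n, n, freq)
            if abs(lhs - rhs) > tol:
                return False
    return True

# Addition mod 7 becomes rotation around the circle, at any integer frequency.
print(is_homomorphism(7))
print(is_homomorphism(12, freq=5))
```

Different values of `freq` give the different Fourier modes, which is one way to see why anything Fourier-flavored tends to end up looking like this.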

Also:

As for where in the activation space each feature vector is placed, oh that doesn't really matter and any nearly orthogonal overcomplete basis will do. Or maybe if I'm being more sophisticated, I can specify the correlations between features and that’s enough to pin down all the structure that matters — all the other details of the overcomplete basis are random.

The correlations between all pairs of features are sufficient to pin down an arbitrary amount of structure -- everything except an overall rotation of the embedding space -- so someone could object that the circular representation and UMAP results are "just" showing the correlations between features. I would probably say the "superposition hypothesis" is a bit stronger than that, but weaker than "any nearly orthogonal overcomplete basis will do": it says that the total amount of correlation between a given feature and all other features (i.e. interference from them) matters, but which other features are interfering with it doesn't matter, and the particular amount of interference from each other feature doesn't matter either. This version of the hypothesis seems pretty well falsified at this point.

What's up with all the non-Mormons? Weirdly specific universalities across LLMs

Adam Scherlis 3mo 149

I suspect a lot of this has to do with the low temperature.

The phrase "person who is not a member of the Church of Jesus Christ of Latter-day Saints" has a sort of rambling filibuster quality to it. Each word is pretty likely, in general, given the previous ones, even though the entire phrase is a bit specific. This is the bias inherent in low-temperature sampling, which tends to write itself into corners and produce long phrases full of obvious-next-words that are not necessarily themselves common phrases.

Going word by word, "person who is not a member..." is all nice and vague and generic; by the time you get to "a member of the", obvious continuations are "Church" or "Communist Party"; by the time you have "the Church of", "England" is a pretty likely continuation. Why Mormons though?

"Since 2018, the LDS Church has emphasized a desire for its members be referred to as "members of The Church of Jesus Christ of Latter-day Saints"." --Wikipedia

And there just aren't that many other likely continuations of the low-temperature-attracting phrase "members of the Church of".

(While "member of the Communist Party" is an infamous phrase from McCarthyism.)

If I'm right, sampling at temperature 1 should produce a much more representative set of definitions.
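A rough sketch of the mechanism, assuming standard temperature-scaled softmax sampling. The logits below are made up for illustration and do not come from any real model. Dividing the logits by a small temperature pushes almost all of the probability onto the single most likely next token, which is what lets a string of obvious-next-words dominate.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits, most likely token first.
logits = [2.0, 1.5, 1.0, 0.5]

p_t1 = softmax_with_temperature(logits, 1.0)   # temperature 1
p_low = softmax_with_temperature(logits, 0.2)  # low temperature

print("T=1.0:", [round(p, 3) for p in p_t1])
print("T=0.2:", [round(p, 3) for p in p_low])
```

At temperature 1 the less likely tokens keep a meaningful share of the probability; at low temperature the top token takes nearly all of it at every step.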

My Interview With Cade Metz on His Reporting About Slate Star Codex

Adam Scherlis 4mo 10

That's a reasonable argument but doesn't have much to do with the Charlie Sheen analogy.

The key difference, which I think breaks the analogy completely, is that (hypothetical therapist) Estévez is still famous enough as a therapist for journalists to want to write about his therapy method.

If Charlie Sheen had a side gig as an obscure local therapist, would journalists be justified in publicizing this fact for the sake of his patients? Maybe? It seems much less obvious than if the therapy was why they were interested!

REQ: Latin translation for HPMOR

Adam Scherlis 4mo 20

In "no Lord hath the champion", the subject of "hath" is "champion". I think this matches the Latin, yes? "nor for a champion [is there] a lord"

My Interview With Cade Metz on His Reporting About Slate Star Codex

Adam Scherlis 4mo 10

In that case, "journalists writing about the famous Estévez method of therapy" would be analogous to journalists writing about Scott's "famous" psychiatric practice.

If a journalist is interested in Scott's psychiatric practice, and learns about his blog in the process of writing that article, I agree that they would probably be right to mention it in the article. But that has never happened because Scott is not famous as a psychiatrist.

My Interview With Cade Metz on His Reporting About Slate Star Codex

Adam Scherlis 4mo 20

That might be relevant if anyone is ever interested in writing an article about Scott's psychiatric practice, or if his psychiatric practice was widely publicly known. It seems less analogous to the actual situation.

To put it differently: you raise a hypothetical situation where someone has two prominent identities as a public figure. Scott only has one. Is his psychiatrist identity supposed to be Sheen or Estévez, here?

Toni Kurz and the Insanity of Climbing Mountains

Adam Scherlis 4mo 30

Nick Bostrom? You mean Thoreau?

Two Percolation Puzzles

Adam Scherlis 1y 10

Correct.

Hell is Game Theory Folk Theorems

Adam Scherlis 1y 86

Correct me if I'm wrong:

The equilibrium where everyone follows "set dial to equilibrium temperature" (i.e. "don't violate the taboo, and punish taboo violators") is only a weak Nash equilibrium.

If one person instead follows "set dial to 99" (i.e. "don't violate the taboo unless someone else does, but don't punish taboo violators") then they will do just as well, because the equilibrium temp will still always be 99. That's enough to show that it's only a weak Nash equilibrium.

Note that this is also true if an arbitrary number of people deviate to this strategy.

If everyone follows this second strategy, then there's no enforcement of the taboo, so there's an active incentive for individuals to set the dial lower.

So a sequence of unilateral changes of strategy can get us to a good equilibrium without anyone having to change to a worse strategy at any point. This makes the fact of it being a (weak) Nash equilibrium not that compelling to me; people don't seem trapped unless they have some extra laziness/inertia against switching strategies.

But (h/t Noa Nabeshima) you can strengthen the original, bad equilibrium to a strong Nash equilibrium by tweaking the scenario so that people occasionally accidentally set their dials to random values. Now there's an actual reason to punish taboo violators, because taboo violations can happen even if everyone is following the original strategy.
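Here is a toy simulation of the argument. The setup is an assumption about the scenario rather than anything spelled out in this comment: 100 players, the temperature each round is the average of all dials, everyone wants it lower, and punishment means setting your dial to 100 in the round after anyone goes below 99. The strategy names are hypothetical. Since everyone experiences the same temperature, a lower total is better for every player.

```python
def punisher(history):
    """Play 99, but play 100 right after any round where someone went below 99."""
    if history and any(d < 99 for d in history[-1]):
        return 100
    return 99

def lenient(history):
    """Play 99 every round and never punish."""
    return 99

def defector(history):
    """Always set the dial to the (assumed) minimum of 30."""
    return 30

def total_temperature(strategies, rounds=10):
    """Sum of per-round temperatures; every player experiences the same total."""
    history = []
    total = 0.0
    for _ in range(rounds):
        dials = [s(history) for s in strategies]
        history.append(dials)
        total += sum(dials) / len(dials)
    return total

n = 100
baseline = total_temperature([punisher] * n)
one_lenient = total_temperature([lenient] + [punisher] * (n - 1))
defect_vs_punishers = total_temperature([defector] + [punisher] * (n - 1))
defect_vs_lenient = total_temperature([defector] + [lenient] * (n - 1))

print(baseline == one_lenient)          # switching to lenient costs nothing
print(defect_vs_punishers > baseline)   # defecting against punishers backfires
print(defect_vs_lenient < baseline)     # defecting against lenient players pays off
```

The first comparison is the "weak" part: the switch from punisher to lenient is payoff-neutral, so nothing stops players from drifting there one at a time. This sketch does not model the random accidental dial settings; adding those is what would make the punisher strictly better than the lenient strategy.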

Big Mac Subsidy?

Adam Scherlis 1y 30

Beef is far from the only meat or dairy food consumed by Americans.
