johnswentworth

Sequences

From Atoms To Agents
"Why Not Just..."
Basic Foundations for Agent Models
Framing Practicum
Gears Which Turn The World
Abstraction 2020
Gears of Aging
Model Comparison

Wiki Contributions

Comments

Man, that top one was a mess. Fixed now, thank you!

Answer by johnswentworth, Apr 23, 2024

Here are some candidates from Claude and Gemini (Claude Opus seemed considerably better than Gemini Pro for this task). Unfortunately they are quite unreliable: I've already removed many examples from this list which I already knew to have multiple independent discoverers (e.g. CRISPR and general relativity). If you're familiar enough with the history of any of these to say that they clearly were/weren't very counterfactual, please leave a comment.

  • Noether's Theorem
  • Mendel's Laws of Inheritance
  • Gödel's First Incompleteness Theorem (Claude mentions Von Neumann as an independent discoverer for the Second Incompleteness Theorem)
  • Feynman's path integral formulation of quantum mechanics
  • Onnes' discovery of superconductivity
  • Pauling's discovery of the alpha helix structure in proteins
  • McClintock's work on transposons
  • Observation of the cosmic microwave background
  • Lorenz's work on deterministic chaos
  • Prusiner's discovery of prions
  • Yamanaka factors for inducing pluripotency
  • Langmuir's adsorption isotherm (I have no idea what this is)

I somehow missed that John Wentworth and David Lorell are also in the middle of a sequence on this same topic here.

Yeah, uh... hopefully nobody's holding their breath waiting for the rest of that sequence. That was the original motivator, but we only wrote the one post and don't have any more in development yet.

Point is: please do write a good stat mech sequence, David and I are not really "on that ball" at the moment.

(Didn't read most of the dialogue, sorry if this was covered.)

But the way transformers work is they greedily think about the very next token, and predict that one, even if by conditioning on it you shot yourself in the foot for the task at hand.

That depends on how we sample from the LLM. If, at each "timestep", we take the most-probable token, then yes that's right.

But an LLM gives a distribution over tokens at each timestep, i.e. $P[x_t \mid x_1, \ldots, x_{t-1}]$. If we sample from that distribution, rather than take the most-probable token at each timestep, then that's equivalent to sampling non-greedily from the learned distribution over text. It's the chain rule:

$$P[x_1, \ldots, x_T] = \prod_{t=1}^{T} P[x_t \mid x_1, \ldots, x_{t-1}]$$
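To make the greedy-vs-sampling distinction concrete, here is a minimal sketch using a made-up two-token "model". The conditional probabilities in `next_token_probs` are hypothetical numbers chosen for illustration, not anything from a real LLM. Multiplying the per-step conditionals gives the joint distribution over whole sequences, and sampling one token at a time from those conditionals draws from that joint.

```python
import itertools
import random

VOCAB = ("a", "b")
LENGTH = 2  # toy sequence length


def next_token_probs(prefix):
    """Hypothetical toy model: P[x_t | x_1..x_{t-1}] as a dict over VOCAB."""
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix[-1] == "a":
        return {"a": 0.55, "b": 0.45}
    return {"a": 0.1, "b": 0.9}


def sequence_prob(seq):
    """Chain rule: joint probability is the product of the conditionals."""
    prob = 1.0
    for t, token in enumerate(seq):
        prob *= next_token_probs(seq[:t])[token]
    return prob


def sample_sequence(rng):
    """Sample each token from its conditional; this samples from the joint."""
    seq = ()
    for _ in range(LENGTH):
        probs = next_token_probs(seq)
        token = rng.choices(VOCAB, weights=[probs[v] for v in VOCAB])[0]
        seq += (token,)
    return seq


def greedy_sequence():
    """Take the most-probable token at every step."""
    seq = ()
    for _ in range(LENGTH):
        probs = next_token_probs(seq)
        seq += (max(probs, key=probs.get),)
    return seq


# Joint distribution over all length-2 sequences, built from the conditionals.
joint = {seq: sequence_prob(seq) for seq in itertools.product(VOCAB, repeat=LENGTH)}
```

In this toy model, greedy decoding returns `("a", "a")`, which has joint probability 0.33, even though `("b", "b")` is the more probable sequence at 0.36. Repeated calls to `sample_sequence` produce each sequence with frequency approaching its entry in `joint`.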

Writing collaboratively is definitely something David and I have been trying to figure out how to do productively.

How sure are we that models will keep tracking Bayesian belief states, and so allow this inverse reasoning to be used, when they don't have enough space and compute to actually track a distribution over latent states?

One obvious guess there would be that the factorization structure is exploited, e.g. independence and especially conditional independence/DAG structure. And then a big question is how distributions of conditionally independent latents in particular end up embedded.

We're now working through understanding all the pieces of this, and we've calculated an MSP which doesn't quite look like the one in the post:

(Ignore the skew, David's still fiddling with the projection into 2D. The important noticeable part is the absence of "overlap" between the three copies of the main shape, compared to the fractal from the post.)

Specifically, each point in that visual corresponds to a distribution $P[S_t \mid x_{1:t}]$ for some value of the observed symbols $x_{1:t}$. The image itself is of the points on the probability simplex. From looking at a couple of Crutchfield papers, it sounds like that's what the MSP is supposed to be.

The update equations are:

$$P[S_{t+1} \mid x_{1:t}] = \sum_{S_t} P[S_{t+1} \mid S_t] \, P[S_t \mid x_{1:t}]$$

$$P[S_{t+1} \mid x_{1:t+1}] = \frac{1}{Z} P[x_{t+1} \mid S_{t+1}] \, P[S_{t+1} \mid x_{1:t}]$$

with $P[S_{t+1} \mid S_t]$ given by the transition probabilities, $P[x_{t+1} \mid S_{t+1}]$ given by the observation probabilities, and $Z$ a normalizer. We generate the image above by initializing some random distribution $P[S_0]$, then iterating the equations and plotting each point.

Off the top of your head, any idea what might account for the mismatch (other than a bug in our code, which we're already checking)? Are we calculating the right thing, i.e. values of $P[S_t \mid x_{1:t}]$? Are the transition and observation probabilities from the graphic in the post the same parameters used to generate the fractal? Is there something which people always forget to account for when calculating these things?
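For readers who want to try the procedure, here is a rough sketch of that kind of belief-state iteration for a hidden Markov model. It is not the code David and I ran. The matrices `T` and `E` are hypothetical placeholders (not the parameters from the post), the observation is assumed to depend only on the new hidden state, and drawing observations by simulating the process itself is also an assumption of this sketch.

```python
import random

# Hypothetical 3-state, 3-symbol HMM; these numbers are placeholders only.
T = [[0.8, 0.1, 0.1],   # T[i][j] = P[S_{t+1} = j | S_t = i]
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]
E = [[0.90, 0.05, 0.05],  # E[j][x] = P[x_{t+1} = x | S_{t+1} = j]
     [0.05, 0.90, 0.05],
     [0.05, 0.05, 0.90]]
N_STATES = len(T)
N_SYMBOLS = len(E[0])


def predict(belief):
    """Push the belief through the transition probabilities."""
    return [sum(belief[i] * T[i][j] for i in range(N_STATES))
            for j in range(N_STATES)]


def update(predicted, x):
    """Condition on observed symbol x and renormalize (Z is the normalizer)."""
    unnorm = [E[j][x] * predicted[j] for j in range(N_STATES)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def belief_points(num_steps, seed=0):
    """Return the sequence of beliefs P[S_t | x_1..x_t] along one sampled run."""
    rng = random.Random(seed)
    raw = [rng.random() for _ in range(N_STATES)]
    belief = [r / sum(raw) for r in raw]  # random initial distribution
    state = rng.choices(range(N_STATES), weights=belief)[0]
    points = []
    for _ in range(num_steps):
        # Simulate the process one step to get the next observed symbol.
        state = rng.choices(range(N_STATES), weights=T[state])[0]
        x = rng.choices(range(N_SYMBOLS), weights=E[state])[0]
        belief = update(predict(belief), x)
        points.append(tuple(belief))
    return points
```

Each returned tuple is a point on the 2-simplex, so a picture can be made by projecting those points to 2D with whatever plotting tool is at hand. If the process being modeled emits symbols on transitions rather than from states, the `update` step would need to change accordingly.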

Can you elaborate on how the fractal is an artifact of how the data is visualized?

I don't know the details of the MSP, but my current understanding is that it's a general way of representing stochastic processes, and the MSP representation typically looks quite fractal. If we take two approximately-the-same stochastic processes, then they'll produce visually-similar fractals.

But the "fractal-ness" is mostly an artifact of the MSP as a representation-method IIUC; the stochastic process itself is not especially "naturally fractal".

(As I said I don't know the details of the MSP very well; my intuition here is instead coming from some background knowledge of where fractals which look like those often come from, specifically chaos games.)
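For anyone unfamiliar with chaos games, here is a minimal sketch of the classic one that produces a Sierpinski triangle. It is only an illustration of the analogy, not a computation of the MSP. Points are stored as barycentric coordinates on a triangle, and the function name and parameters are made up for this example.

```python
import random


def chaos_game(num_points, seed=0):
    """Repeatedly jump halfway toward a randomly chosen triangle vertex.

    Points are barycentric weights (w0, w1, w2) that sum to 1, i.e. points
    on a 2-simplex. Returns the list of visited points.
    """
    rng = random.Random(seed)
    raw = [rng.random() for _ in range(3)]
    point = [r / sum(raw) for r in raw]  # random starting point in the triangle
    points = []
    for _ in range(num_points):
        k = rng.randrange(3)  # pick a vertex uniformly at random
        # Halfway to vertex k: average the point with the unit vector e_k.
        point = [(w + (1.0 if i == k else 0.0)) / 2.0
                 for i, w in enumerate(point)]
        points.append(tuple(point))
    return points
```

The update rule is just "pick one of a few simple maps at random and apply it", yet the set of visited points looks fractal: every visited point has some weight of at least 0.5, so the central sub-triangle is never hit, and the same gap repeats at every smaller scale.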

That there is a linear 2d plane in the residual stream that when you project onto it you get that same fractal seems highly non-artifactual, and is what we were testing.

A thing which is highly cruxy for me here, which I did not fully understand from the post: what exactly is the function which produces the fractal visual from the residual activations? My best guess from reading the post was that the activations are linearly regressed onto some kind of distribution, and then the distributions are represented in a particular way which makes smooth sets of distributions look fractal. If there's literally a linear projection of the residual stream into two dimensions which directly produces that fractal, with no further processing/transformation in between "linear projection" and "fractal", then I would change my mind about the fractal structure being mostly an artifact of the visualization method.

[EDIT: I no longer endorse this response, see thread.]

(This comment is mainly for people other than the authors.)

If your reaction to this post is "hot damn, look at that graph", then I think you should probably dial back your excitement somewhat. IIUC the fractal structure is largely an artifact of how the data is visualized, which means the results visually look more striking than they really are.

It is still a cool piece of work, and the visuals are beautiful. The correct amount of excitement is greater than zero.
