Adam Shai

Neuroscientist turned Interpretability Researcher. Starting Simplex, an AI Safety Research Org.

Sequences

Introduction to Computational Mechanics

Comments

We've decided to keep the hackathon as scheduled. Hopefully there will be other opportunities in the future for those who can't make it this time!

Thanks! In my experience Computational Mechanics offers many of those types of technical insights. My background is in neuroscience, and in that context it really helped me think about computation in brains and design experiments. Now I'm excited to use Comp Mech in a deeper, more concrete way to understand how the internal structures of artificial neural networks relate to their behavior. Hopefully this is just the start!

This is a great question, and one of the things I'm most excited about using this framework to study in the future! I have a few ideas but nothing to report yet.

But I will say that I think we should be able to formalize exactly what it would mean for a transformer to create/discover new knowledge, to take the structure learned from one dataset and apply it to another, to mix two abstract structures together, etc. I want to have an entire theory of cognitive abilities and the geometric internal structures that support them.

If I'm understanding your question correctly, then the answer is yes, though in practice it might be difficult (I'm actually unsure how computationally intensive it would be, haven't tried anything along these lines yet). This is definitely something to look into in the future!

It's surprising for a few reasons:

  • The structure of the points in the simplex is NOT:
    • The next-token prediction probabilities (i.e. the thing we explicitly train the transformer to do)
    • The structure of the data-generating model (i.e. the thing the good regulator theorem talks about, if I understand the good regulator theorem, which I might not)

The first would not be surprising because it's literally what our loss function asks for, and the second might not be that surprising either, since it's the intuitive thing people often think about when we say "model of the world." But the MSP structure is neither of those things. It's the structure of inference over the model of the world, which is quite a different beast from the model of the world itself.
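To make the distinction concrete, here's a minimal sketch with a toy 3-state HMM I made up for illustration (not anything from our setup): Bayesian filtering over the hidden states produces belief vectors, which are the points in the simplex, while the next-token probabilities are a separate object read off from each belief state, and neither is the generator's transition structure itself.

```python
import numpy as np

# Toy 3-state HMM (made up for illustration).
# T[x][i, j] = P(emit token x, go to state j | current state i),
# so the rows of T[0] + T[1] each sum to 1.
T = {
    0: np.array([[0.4, 0.1, 0.0],
                 [0.1, 0.3, 0.1],
                 [0.0, 0.1, 0.2]]),
    1: np.array([[0.2, 0.2, 0.1],
                 [0.1, 0.3, 0.1],
                 [0.3, 0.2, 0.2]]),
}

def next_token_probs(belief):
    """P(next token | history): the thing the loss function explicitly asks for."""
    return np.array([(belief @ T[x]).sum() for x in (0, 1)])

def update_belief(belief, x):
    """Bayesian update over hidden states after observing token x.
    These belief vectors are the points in the probability simplex."""
    new = belief @ T[x]
    return new / new.sum()

belief = np.ones(3) / 3  # uniform prior over hidden states (toy choice)
for x in [0, 1, 1, 0]:
    print("belief:", belief.round(3), "| next-token probs:", next_token_probs(belief).round(3))
    belief = update_belief(belief, x)
```

The belief states live on the simplex over hidden states, the next-token probabilities live on a smaller simplex over tokens, and the transition matrices are a third thing entirely; the claim is about the first of these.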

Others might not find it as surprising as I did - everyone is working off their own intuitions.

edit: also I agree with what Kave said about the linear representation.

A neglected problem in AI safety technical research is teasing apart the mechanisms behind the dangerous capabilities exhibited by current LLMs. In particular, for any model organism of dangerous capabilities (see Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research; e.g. the sleeper agents paper), we don't know how much of the phenomenon depends on the particular semantics of terms like "goal," "deception," and "lie" (insofar as they appear in the scratchpad, prompts, or finetuning data), or whether the same phenomenon could be produced by substituting in more or less any word. One approach is to build small toy models of these phenomena where we can more easily control the data distribution and still get analogous behavior. That way we can control for any particular aspect of the data and figure out, scientifically, the nature of these dangers. By small toy model I mean highly artificial datasets (perhaps made of binary digits with a specific correlation structure, or whatever the minimum is needed to get the phenomenon at hand).
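As a rough illustration of the kind of artificial dataset I have in mind (the generator and parameters here are placeholders to show the control knob, not a concrete proposal):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_binary_sequences(n_seqs, length, p_repeat=0.9):
    """Binary sequences whose correlation structure is set by a single knob:
    each bit copies the previous bit with probability p_repeat, otherwise it flips."""
    seqs = np.zeros((n_seqs, length), dtype=np.int64)
    seqs[:, 0] = rng.integers(0, 2, size=n_seqs)
    for t in range(1, length):
        flip = rng.random(n_seqs) >= p_repeat
        seqs[:, t] = np.where(flip, 1 - seqs[:, t - 1], seqs[:, t - 1])
    return seqs

# Sweep p_repeat to vary the correlation structure while holding everything else fixed.
data = sample_binary_sequences(n_seqs=1000, length=64, p_repeat=0.9)
print(data[:3])
```

The point would then be to train small models on families of datasets like this, varying one aspect of the data distribution at a time, and ask which aspects the analogous behavior actually depends on.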
