Daniel Murfet — LessWrong

There's a certain point where commutative algebra outgrows arguments that are phrased purely in terms of ideals (e.g. at some point in Matsumura the proofs stop being about ideals and elements and start being about long exact sequences and Ext, Tor). Once you get to that point, and even further to modern commutative algebra which is often about derived categories (I spent some years embedded in this community), I find that I'm essentially using a transplanted intuition from that "old world" but now phrased in terms of diagrams in derived categories.

For instance, a lot of Atiyah and Macdonald style arguments just reappear as, e.g., arguments about how to use the residue field to construct bounded complexes of finitely generated modules in the derived category of a local ring. Reconstructing that intuition in the derived category is part of making sense of the otherwise gun-metal machinery of homological algebra.

Ultimately I don't see it as different, but the "externalised" view is the one that plugs into homological algebra and therefore, ultimately, wins.

(Edit: saw Simon's reply after writing this, yeah agree!)

Zach Furman's Shortform

Daniel Murfet5mo90

Yeah it's a nice metaphor. And just as the most important thing in a play is who dies and how, so too we can consider any element $x \in M$ as a module homomorphism $ϕ_{x} : R \to M$, $r \mapsto r x$, and consider the kernel $Ann (x) = Ker ϕ_{x}$ which is called the annihilator (great name). Then $ϕ_{x}$ factors as $R \to R / Ann (x) \to M$ where the second map is injective, and so in some sense $M$ is "made up" of all sorts of quotients $R / I$ where $I$ varies over annihilators of elements.
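
A small worked instance of this factorisation (an illustrative example, not from the original comment; the choices $R = \mathbb{Z}$, $M = \mathbb{Z} / 6 \mathbb{Z}$ and $x = 2$ are assumptions made only for concreteness):

$$
ϕ_{2} : \mathbb{Z} \to \mathbb{Z} / 6 \mathbb{Z}, \quad r \mapsto 2 r, \qquad Ann (2) = \{ r \in \mathbb{Z} : 6 \mid 2 r \} = 3 \mathbb{Z}
$$

$$
\mathbb{Z} \to \mathbb{Z} / 3 \mathbb{Z} \hookrightarrow \mathbb{Z} / 6 \mathbb{Z}, \qquad \bar{r} \mapsto 2 r
$$

So the element $2$ contributes a copy of $\mathbb{Z} / 3 \mathbb{Z}$ inside $M$, and in the same way $x = 3$ has $Ann (3) = 2 \mathbb{Z}$ and contributes a copy of $\mathbb{Z} / 2 \mathbb{Z}$.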

There was a period where the structure of rings was studied more through the theory of ideals (historically this was in turn motivated by the idea of an "ideal" number) but through ideas like the above you can see the theory of modules as a kind of "externalisation" of this structure which in various ways makes it easier to think about. One manifestation of this I fell in love with (actually this was my entrypoint into all this since my honours supervisor was an old-school ring theorist and gave me Stenstrom to read) is in torsion theory.

johnswentworth's Shortform

Daniel Murfet6mo190

One of my son's most vivid memories of the last few years (and which he talks about pretty often) is playing laser tag at Wytham Abbey, a cultural practice I believe instituted by John and which was awesome, so there is a literal five-year-old (well seven-year-old at the time) who endorses this message!

Prospects for Alignment Automation: Interpretability Case Study

Daniel Murfet8mo30

Makes sense to me, thanks for the clarifications.

I found working through the details of this very informative. For what it's worth, I'll share here a comment I made internally at Timaeus about it, which is that in some ways this factorisation into $M_{1}$ and $M_{2}$ reminds me of the factorisation into the map $m \mapsto S_{m}$ from a model to its capability vector (this being the analogue of $M_{2}$ ) and the map $S_{m} \mapsto σ^{- 1} (E_{m}) = β^{T} S_{m} + α$ from capability vectors to downstream metrics (this being the analogue of $M_{1}$ ) in Ruan et al's observational scaling laws paper.
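
A minimal sketch of the second map in that factorisation, $S_{m} \mapsto E_{m}$ with $σ^{- 1} (E_{m}) = β^{T} S_{m} + α$, assuming $σ$ is the logistic sigmoid. The function name `predict_metric` and all numbers below are hypothetical and chosen only for illustration; how $S_{m}$ is obtained from a model is taken as given here.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic link, assumed here to play the role of sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Inverse of the logistic link."""
    return math.log(p / (1.0 - p))


def predict_metric(s_m: Sequence[float], beta: Sequence[float], alpha: float) -> float:
    """Map a capability vector S_m to a downstream metric E_m.

    Implements E_m = sigmoid(beta^T S_m + alpha), i.e. the metric is
    linear in the capability vector after applying the inverse link.
    """
    linear = sum(b * s for b, s in zip(beta, s_m)) + alpha
    return sigmoid(linear)


# Hypothetical capability vector and coefficients (illustration only).
s_m = [0.8, -0.3, 1.2]
beta = [0.5, 1.0, -0.25]
alpha = 0.1

e_m = predict_metric(s_m, beta, alpha)
# Applying the inverse link recovers the linear form beta^T S_m + alpha.
recovered = logit(e_m)
```

The point of the sketch is only that everything model-specific enters through `s_m`, while `beta` and `alpha` belong to the downstream metric.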

In your case the output metrics have an interesting twist, in that you don't want to just predict performance but also in some sense variations of performance within a certain class (by e.g. varying the prompt), so it's some kind of "stable" latent space of capabilities that you're constructing.

Anyway, factoring the prediction of downstream performance/capabilities through some kind of latent space object $I (M)$ in your case, or latent spaces of capabilities in Ruan et al's case, seems like a principled way of thinking about the kind of object we want to put at the center of interpretability.

As an entertaining aside: as an algebraic geometer the proliferation of $I_{1} (M), I_{2} (M), \dots$ 's i.e. "interpretability objects" between models $M$ and downstream performance metrics reminds me of the proliferation of cohomology theories and the search for "motives" to unify them. That is basically interpretability for schemes!

Prospects for Alignment Automation: Interpretability Case Study

Daniel Murfet8mo*30

I is evaluated on utility for improving time-efficiency and accuracy in solving downstream tasks

There seems to be a gap between this informal description and your pseudo-code, since in the pseudo-code the parameters only parametrise the R&D agent $M_{1}$ . On the other hand $M_{2}$ is distinct and presumed to be not changing. At first reasoning from the pseudo-code I had the objection that the execution agent can't be completely static, because it somehow has to make use of whatever clever interpretability outputs the R&D agent comes up with (e.g. SAEs don't use themselves to solve OOD detection or whatever). Then I wondered if you wanted to bound the complexity of $M_{2}$ somewhere. Then I looked back and saw the formula $I (x, t, I (M))$ which seems to cleverly bypass this by having the R&D agent have to do both steps but factoring its representation of $M$ .

However this does seem different from the pseudo-code. If this is indeed different, which one do you intend?

Edit: no matter, I should just read more closely: $\mathrm{evaluate} (M_{2} (t, I_{M}, I))$ clearly takes $I$ as input so I think I'm not confused. I'll leave this comment here as a monument to premature question-asking.

Later edit: ok no I'm still confused. It seems $M_{1}$ doesn't get used in your inner loop unless it is in fact $I$ (which in the pseudo-code means just a part of what was called $I$ in the preceding text). That is, when we update $θ$ we update $I$ for the next round. In which case things fit with your original formula but having essentially factored $I$ into two pieces ( $M_{2}$ on the outside, $M_{1}$ on the inside) you are only allowing the inside piece $M_{1}$ to vary over the course of this process. So I think my original question still stands.

So to check the intuition here: we factor the interpretability algorithm $I$ into two pieces. The first piece never sees tasks and has to output some representation of the model $M$ . The second piece never sees the model and has to, given the representation and some prediction task for the original model $M$ , perform well across a sufficiently broad range of such tasks. It is penalised for computation time in this second piece. So overall the loss is supposed to motivate

Discovering the capabilities of the model as operationalised by its performance on tasks, and also how that performance is affected by variations of those tasks (e.g. modifying the prompt for your Shapley values example, and for your elicitation example).
Representing those capabilities in a way that amortises the computational cost of mapping a given task onto this space of capabilities in order to make the above predictions (the computation time penalty in the second part).
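
The two-piece factorisation described above can be sketched as a loss computation. This is a toy illustration under stated assumptions, not the pseudo-code from the post: the names `represent`, `predict`, `factored_loss` and `time_penalty`, the squared-error term, and the toy data are all hypothetical.

```python
from typing import Any, Callable, Sequence, Tuple

# First piece: sees the model, never the tasks.
Represent = Callable[[Any], Any]
# Second piece: sees the representation and a task, never the model.
# Returns (prediction, compute_cost).
Predict = Callable[[Any, Any], Tuple[float, float]]


def factored_loss(
    represent: Represent,
    predict: Predict,
    model: Any,
    tasks: Sequence[Tuple[Any, float]],
    time_penalty: float,
) -> float:
    """Average prediction error plus a penalty on second-piece compute."""
    representation = represent(model)  # computed once, amortised over tasks
    total = 0.0
    for task, target in tasks:
        prediction, cost = predict(representation, task)
        total += (prediction - target) ** 2 + time_penalty * cost
    return total / len(tasks)


# Toy setup: the "model" is a table of skill levels, the representation
# is a copy of that table, and prediction is a unit-cost lookup.
toy_model = {"arith": 0.9, "code": 0.4}
toy_tasks = [("arith", 0.9), ("code", 0.5)]

loss = factored_loss(
    represent=lambda m: dict(m),
    predict=lambda rep, task: (rep[task], 1.0),
    model=toy_model,
    tasks=toy_tasks,
    time_penalty=0.01,
)
```

Because `represent` runs once and `predict` runs per task, any cost that can be moved into the representation is not charged by the penalty, which is the amortisation pressure described in the second point.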

This is plausible for the same reason that the original model can have good general performance: there are general underlying skills or capabilities that can be assembled to perform well on a wide range of tasks, and if you can discover those capabilities and their structure you should be able to generalise to predict other task performance and how it varies.

Indirectly there is a kind of optimisation pressure on the complexity of $I (M)$ just because you're asking this to be broadly useful (for a computationally penalised $M_{2}$ ) for prediction on many tasks, so by bounding the generalisation error you're likely to bound the complexity of that representation.

I'm on board with that, but I think it is possible that some might agree this is a path towards automated research of something but not that the something is interpretability. After all, your $I (M)$ need not be interpretable in any straightforward way. So implicitly the space of $θ$ 's you are searching over is constrained to something intrinsically reasonably interpretable?

Since later you say "human-led interpretability absorbing the scientific insights offered by I*" I guess not, and your point is that there are many safety-relevant applications of I*(M) even if it is not very human comprehensible.

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

Daniel Murfet8mo20

Arora et al.,

Wu et al?

johnswentworth's Shortform

Daniel Murfet8mo110

There's plenty, including a line of work by Carina Curto, Kathryn Hess and others that is taken seriously by a number of mathematically inclined neuroscience people (Tom Burns if he's reading can comment further). As far as I know this kind of work is the closest to breaking through into the mainstream. At some level you can think of homology as a natural way of preserving information in noisy systems, for reasons similar to why (co)homology of tori was a useful way for Kitaev to formulate his surface code. Whether or not real brains/NNs have some emergent computation that makes use of this is a separate question, I'm not aware of really compelling evidence.

There is more speculative but definitely interesting work by Matilde Marcolli. I believe Manin has thought about this (because he's thought about everything) and if you have twenty years to acquire the prerequisites (gamma spaces!) you can gaze into deep pools by reading that too.

johnswentworth's Shortform

Daniel Murfet8mo170

I'm ashamed to say I don't remember. That was the highlight. I think I have some notes on the conversation somewhere and I'll try to remember to post here if I ever find it.

I can spell out the content of his Koan a little, if it wasn't clear. It's probably more like: look for things that are (not there). If you spend enough time in a particular landscape of ideas, you can (if you're quiet and pay attention and aren't busy jumping on bandwagons) get an idea of a hole, which you're able to walk around but can't directly see. In this way new ideas appear as something like residues from circumnavigating these holes. It's my understanding that Khovanov homology was discovered like that, and this is not unusual in mathematics.

By the way, that's partly why I think the prospect of AIs being creative mathematicians in the short term should not be discounted; if you see all the things, you see all the holes.

johnswentworth's Shortform

Daniel Murfet8mo195

I visited Mikhail Khovanov once in New York to give a seminar talk, and after it was all over and I was wandering around seeing the sights, he gave me a call and offered a long string of general advice on how to be the kind of person who does truly novel things (he's famous for this, you can read about Khovanov homology). One thing he said was "look for things that aren't there" haha. It's actually very practical advice, which I think about often and attempt to live up to!

Proof idea: SLT to AIT

Daniel Murfet9mo110

Ok makes sense to me, thanks for explaining. Based on my understanding of what you are doing, the statement in the OP that $λ$ in your setting is "sort of" K-complexity is a bit misleading? It seems like you will end up with bounds on $D (μ | | ξ)$ that involve the actual learning coefficient, which you then bound above by noting that un-used bits in the code give rise to degeneracy. So there is something like $λ \leq K$ going on ultimately.

If I understand correctly you are probably doing something like:

Identify a continuous space $W$ (parameters of your NN run in recurrent mode)
Embed a set of Turing machine codes $W^{\mathrm{code}}$ into $W$ (by encoding the execution of a UTM into the weights of your transformer)
Use $p (y | x, w)$ parametrised by the transformer, where $w \in W$ , to provide what I would call a "smooth relaxation" of the execution of the UTM for some number of steps
Use this as the model in the usual SLT setting, and then note that because of the way you encoded the UTM and its step function, if you vary $w$ away from the configuration corresponding to a TM code $[M]$ in a bit of the description that corresponds to unused states or symbols, it can't affect the execution and so there is degeneracy in the KL divergence $K$
Hence $λ ([M]) \leq len ([M])$ , and then, repeating this over all TMs $M$ which perfectly fit the given data distribution, we get a bound on the global $λ \leq K$ .
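
A minimal sketch of why degeneracy lowers the learning coefficient in the steps above, under assumptions introduced only for illustration: near $w_{0} = [M]$ the parameters split as $w = (u, v)$ with $u \in \mathbb{R}^{d_{u}}$ the coordinates that affect execution and $v$ the unused ones, the KL divergence $K (w)$ is smooth and does not depend on $v$ , and $K (w_{0}) = 0$ . Then $K (u, v) \leq C | u - u_{0} |^{2}$ near $w_{0}$ , so the volume of the near-optimal set in a small neighbourhood of $w_{0}$ satisfies

$$
V (ε) = \int_{K (w) < ε} d w \; \geq \; c \, ε^{d_{u} / 2} \quad \Longrightarrow \quad λ (w_{0}) \leq \frac{d_{u}}{2}
$$

for some constant $c > 0$ . Turning $d_{u} / 2$ into a bound by $len ([M])$ depends on how many real coordinates each symbol of the code occupies, which is a property of the particular encoding and is not determined by this sketch.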

Proving Theorem 4.1 was the purpose of Clift-Wallbridge-Murfet, just with a different smooth relaxation. The particular smooth relaxation we prefer for theoretical purposes is one coming from encoding a UTM in linear logic, but the overall story works just as well if you are encoding the step function of a TM in a neural network and I think the same proof might apply in your case.

Anyway, I believe you are doing at least several things differently: you are treating the iid case, you are introducing $D (μ | | ξ)$ and the bound on that (which is not something I have considered) and obviously the Transformer running in recurrent mode as a smooth relaxation of the UTM execution is different to the one we consider.

From your message it seems like you think the global learning coefficient might be lower than $K$ , but that locally at a code the local learning coefficient might be somehow still to do with description length? So that the LLC in your case is close to something from AIT. That would be surprising to me, and somewhat in contradiction with e.g. the idea from simple versus short that the LLC can be lower than "the number of bits used" when error-correction is involved (and this being a special case of a much broader set of ways the LLC could be lowered).
