All of Daniel Murfet's Comments + Replies

You might reconstruct your sacred Jeffreys prior with a more refined notion of model identity, which incorporates derivatives (jets on the geometric/statistical side and more of the algorithm behind the model on the logical side).

Except nobody wants to hear about it at parties.

 

You seem to do OK... 

If they only would take the time to explain things simply you would understand. 

This is an interesting one. I field this comment quite often from undergraduates, and it's hard to carve out enough quiet space in a conversation to explain what they're doing wrong. In a way the proliferation of math on YouTube might be exacerbating this hard step from tourist to troubadour.

As a supervisor of numerous MSc and PhD students in mathematics, I see this often: when someone finishes a math degree and considers a job, the tradeoffs are usually between meaning, income, freedom, evil, etc., with some of the obvious choices being high/low along (relatively?) obvious axes. It's extremely striking to see young, talented people with math or physics (or CS) backgrounds going into technical AI alignment roles in big labs, apparently maximising along many (or all) of these axes!

Especially in light of recent events I suspect that this phenomenon, which appears too good to be true, actually is.

Please develop this question as a documentary special, for lapsed-Starcraft player homeschooling dads everywhere.

I don't understand the strong link between Kolmogorov complexity and generalisation you're suggesting here. I think by "generalisation" you must mean something more than "low test error". Do you mean something like "out of distribution" generalisation (whatever that means)?

Joar Skalse · 12d · 1 point
Yes, I mostly just mean "low test error". I'm assuming that real-world problems follow a distribution that is similar to the Solomonoff prior (i.e., that data generating functions are more likely to have low Kolmogorov complexity than high Kolmogorov complexity) -- this is where the link is coming from. This is an assumption about the real world, and not something that can be established mathematically.
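
For reference, a sketch of the universal (Solomonoff) prior being invoked here, assuming a prefix-free universal machine U and ignoring normalisation subtleties; the comment's assumption is that real-world data-generating functions are distributed roughly like this:

```latex
% Universal prior: the mass of an object x is summed over all programs that output it,
% so objects with short programs (low Kolmogorov complexity K) dominate.
M(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|} \;\;\asymp\;\; 2^{-K(x)} .
```
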
Roman Leventov · 12d · 3 points
Kolmogorov complexity is definitely a misleading path here, and it's unfortunate that Joar chose it as the "leading" example of complexity in the post. Note this passage: This quote from the above comment is better: I've expressed this idea with some links here:

Then if we combine two claims:
  • Joar's "DNNs are (kind of) Bayesian" (for the reasons that I don't understand, because I didn't read their papers, so I just take his word here), and
  • Fields et al.'s "brains are 'almost' Bayesian, because Bayesian learning is information-efficient (= energy-efficient), and there is a strong evolutionary pressure for brains in animals to be energy-efficient",

is this an explanation of DNNs' remarkable generalisation ability? Or should more quantification be added to both of these claims to turn this into a good explanation?

Well, neural networks do obey Occam's razor, at least according to the formalisation of that statement contained in the post (namely: neural networks, when formulated in the context of Bayesian learning, obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam's razor).
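
For readers who want the formula being referred to, here is a sketch of the asymptotic free energy expansion next to the BIC, in standard SLT notation (n samples, empirical loss L_n, prior φ, RLCT λ, parameter count d; lower-order terms dropped):

```latex
% Watanabe's asymptotic expansion of the Bayesian free energy (singular models)
F_n \;=\; -\log \int_W e^{-n L_n(w)}\,\varphi(w)\,dw \;\approx\; n\,L_n(w^*) \;+\; \lambda \log n ,
% compared with the BIC of a regular model, which it generalises (there \lambda = d/2)
\mathrm{BIC} \;=\; n\,L_n(\hat{w}) \;+\; \tfrac{d}{2}\,\log n .
```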

I think that expression of Jesse's is also correct, in context.

However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something profound about (ii) ... (read more)

Joar Skalse · 12d · 1 point
Would that not imply that my polynomial example also obeys Occam's razor?  Yes, I think this probably is the case. I also think the vast majority of readers won't go deep enough into the mathematical details to get a fine-grained understanding of what the maths is actually saying. Yes, I very much agree with this too. Yes, absolutely! I also think that SLT probably will be useful for understanding phase shifts and training dynamics (as I also noted in my post above), so we have no disagreements there either.

However, I do think that there is another angle of attack on this problem that (to me) seems to get us much closer to a solution (namely, to investigate the properties of the parameter-function map)


Seems reasonable to me!

Re: the articles you link to. I think the second one by Carroll is quite careful to say things like "we can now understand why singular models have the capacity to generalise well" which seems to me uncontroversial, given the definitions of the terms involved and the surrounding discussion. 

I agree that Jesse's post has a title "Neural networks generalize because of this one weird trick" which is clickbaity, since SLT does not in fact yet explain why neural networks appear to generalise well on many natural datasets. However the actual article is more... (read more)

Joar Skalse · 13d · 2 points
The title of the post is Why Neural Networks obey Occam's Razor! It also cites Zhang et al., 2017, and immediately after this says that SLT can help explain why neural networks have the capacity to generalise well. This gives the impression that the post is intended to give a solution to problem (ii) in your other comment, rather than a solution to problem (i). Jesse's post includes the following expression: Complex Singularities ⟺ Fewer Parameters ⟺ Simpler Functions ⟺ Better Generalization. I think this also suggests an equivocation between the RLCT measure and practical generalisation behaviour. Moreover, neither post contains any discussion of the difference between (i) and (ii).

I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don't think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?


That seems probable. Maybe it's useful for me to lay out a more or less complete picture of what I think SLT does say about generalisation in deep learning in its current form, so that we're on the same page. When people refer to the "generalisation puzzle" in... (read more)

Joar Skalse · 13d · 5 points
Yes, absolutely. However, I also don't think that (i) is very mysterious, if we view things from a Bayesian perspective. Indeed, it seems natural to say that an ideal Bayesian reasoner should assign non-zero prior probability to all computable models, or something along those lines, and in that case, notions like "overparameterised" no longer seem very significant. Yes, this is basically exactly what my criticism of SLT is -- I could not have described it better myself! I agree that this reduction is relevant and non-trivial. I don't have any objections to this per se. However, I do think that there is another angle of attack on this problem that (to me) seems to get us much closer to a solution (namely, to investigate the properties of the parameter-function map).

The easiest way to explain why this is the case will probably be to provide an example. Suppose we have a Bayesian learning machine with 15 parameters, whose parameter-function map is given by

and whose loss function is the KL divergence. This learning machine will learn degree-4 polynomials. Moreover, it is overparameterised, and its loss function is analytic in its parameters, etc., so SLT will apply to it.
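
Purely as an illustration, here is one hypothetical parameter-function map with the stated properties (15 parameters, degree-4 polynomials, analytic, heavily overparameterised); it is not necessarily the map used in the original comment, which does not render in this transcript:

```python
import numpy as np

def poly_from_params(w):
    """Hypothetical parameter-function map: 15 parameters -> degree-4 polynomial.

    Each coefficient is a product of three parameters, so the map is analytic
    and heavily overparameterised: many distinct parameter settings realise
    the same polynomial, and zeroing one factor in each triple gives many
    different ways of encoding the zero function.
    """
    w = np.asarray(w, dtype=float).reshape(5, 3)   # 5 coefficients x 3 factors each
    coeffs = w.prod(axis=1)                        # c_k = w_{k,0} * w_{k,1} * w_{k,2}
    return lambda x: sum(c * x**k for k, c in enumerate(coeffs))

# Two different parameter vectors realising the same polynomial (all coefficients 1):
f1 = poly_from_params(np.ones(15))
f2 = poly_from_params(np.tile([2.0, 0.5, 1.0], 5))
assert abs(f1(1.7) - f2(1.7)) < 1e-9
```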


In your example there are many values of the parameters that encode the zero function (e.g. ... (read more)

Joar Skalse · 14d · 5 points
Ah, yes, I should have made the training data be (1,1), rather than (0,0). I've fixed the example now! Yes, that is exactly right! The notion of complexity that I have in mind is even more pre-theoretic than that; it's something like "x² looks like an intuitively less plausible guess than 0". However, if we want to keep things strictly mathematical, then we can substitute this for the definition in terms of UTM codes. I'm well aware of that -- that is what my example attempts to show! My point is that the kind of complexity which SLT talks about does not allow us to make inferences about inductive bias or generalisation behaviour, contra what is claimed e.g. here and here. As far as I can tell, we don't disagree about any object-level technical claims. Insofar as we do disagree about something, it may be about more methodological meta-questions. I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don't think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?

First of all, SLT is largely based on examining the behaviour of learning machines in the limit of infinite data


I have often said that SLT is not yet a theory of deep learning; this question of whether the infinite data limit is really the right one is one of the main question marks I currently see (I think I probably also see the gap between Bayesian learning and SGD as bigger than you do).

I've discussed this a bit with my colleague Liam Hodgkinson, whose recent papers https://arxiv.org/abs/2307.07785 and https://arxiv.org/abs/2311.07013 might... (read more)

Joar Skalse · 14d · 1 point
Yes, I agree with this. I think my main objections are (1) the fact that it mostly abstracts away from the parameter-function map, and (2) the infinite-data limit. I largely agree, though it depends somewhat on what your aims are. My point there was mainly that theorems about generalisation in the infinite-data limit are likely to end up being weaker versions of more general results from statistical and computational learning theory.

I think that the significance of SLT is somewhat over-hyped at the moment


Haha, on LW that is either already true or at current growth rates will soon be true, but it is clearly also the case that SLT remains basically unknown in the broader deep learning theory community.

Joar Skalse · 14d · 3 points
Yes, I meant specifically on LW and in the AI Safety community! In academia, it remains fairly obscure.

I claim that this is fairly uninteresting, because classical statistical learning theory already gives us a fully adequate account of generalisation in this setting which applies to all learning machines, including neural networks

 

I'm a bit familiar with the PAC-Bayes literature and I think this might be an exaggeration. The linked post merely says that the traditional PAC-Bayes setup must be relaxed, and sketches some ways of doing so. Could you please cite the precise theorem you have in mind?

Joar Skalse · 14d · 2 points
For example, the agnostic PAC-learning theorem says that if a learning machine L (for binary classification) is an empirical risk minimiser with VC dimension d, then for any distribution D over X × {0,1}, if L is given access to at least Ω((d/ε²) + (d/ε²)·log(1/δ)) data points sampled from D, then it will with probability at least 1 − δ learn a function whose (true) generalisation error (under D) is at most ε worse than the best function which L is able to express (in terms of its true generalisation error under D). If we assume that D corresponds to a function which L can express, then the generalisation error of L will with probability at least 1 − δ be at most ε. This means that, in the limit of infinite data, L will with probability arbitrarily close to 1 learn a function whose error is arbitrarily close to the optimal value (among all functions which L is able to express). Thus, any empirical risk minimiser with a finite VC dimension will generalise well in the limit of infinite data.
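
For comparison, the textbook form of this guarantee (e.g. the agnostic case of the fundamental theorem of statistical learning in Shalev-Shwartz and Ben-David), with an unspecified constant C, reads: with probability at least 1 − δ over a sample S of m points drawn i.i.d. from D,

```latex
% Agnostic PAC guarantee for empirical risk minimisation over a class H of VC dimension d
m \;\ge\; C\,\frac{d + \log(1/\delta)}{\epsilon^{2}}
\quad\Longrightarrow\quad
L_D\!\left(\mathrm{ERM}_H(S)\right) \;\le\; \min_{h \in H} L_D(h) \;+\; \epsilon .
```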

Very loosely speaking, regions with a low RLCT have a larger "volume" than regions with high RLCT, and the impact of this fact eventually dominates other relevant factors.

 

I'm going to make a few comments as I read through this, but first I'd like to thank you for taking the time to write this down, since it gives me an opportunity to think through your arguments in a way I wouldn't have done otherwise.

Regarding the point about volume. It is true that the RLCT can be written as (Theorem 7.1 of Watanabe's book "Algebraic Geometry and Statistical Learni... (read more)

Joar Skalse · 14d · 2 points
Thank you for the detailed responses! I very much enjoy discussing these topics :) My intuitions around the RLCT are very much geometrically informed, and I do think of it as being a kind of flatness measure. However, I don't think of it as being a "macroscopic" quantity, but rather, a local quantity. I think the rest of what you say coheres with my current picture, but I will have to think about it for a bit, and come back later!

Good question. What counts as a "-" is spelled out in the paper, but it's only outlined here heuristically. The "5 like" thing it seems to go near on the way down is not actually a critical point.

The change in the matrix W and the bias b happen at the same time, it's not a lagging indicator.

SLT predicts when this will happen!

Maybe. This is potentially part of the explanation for "data double descent" although I haven't thought about it beyond the 5min I spent writing that page and the 30min I spent talking about it with you at the June conference. I'd be very interested to see someone explore this more systematically (e.g. in the setting of Anthropic's "other" TMS paper https://www.anthropic.com/index/superposition-memorization-and-double-descent which contains data double descent in a setting where the theory of our recent TMS paper might allow you to do something).

There is quite a large literature on "stage-wise development" in neuroscience and psychology, going back to people like Piaget but quite extensively developed in both theoretical and experimental directions. One concrete place to start on the agenda you're outlining here might be to systematically survey that literature from an SLT-informed perspective. 

we can copy the relevant parts of the human brain which does the things our analysis of our models said they would do wrong, either empirically (informed by theory of course), or purely theoretically if we just need a little bit of inspiration for what the relevant formats need to look like.

I struggle to follow you guys in this part of the dialogue, could you unpack this a bit for me please?

Garrett Baker · 1mo · 3 points
The idea is that currently there's a bunch of formally unsolved alignment problems relating to things like ontology shifts, value stability under reflection & replication, non-muggable decision theories, and potentially other risks we haven't thought of yet, such that if an agent pursues your values adequately in a limited environment, it's difficult to say much confidently about whether it will continue to pursue your values adequately in a less limited environment. But we see that humans are generally able to pursue human values (or at least, not go bonkers in the ways we worry about above), so maybe we can copy off of whatever evolution did to fix these traps.

The hope is that either SLT + neuroscience can give us some light into what that is, or just tell us that our agent will think about these sorts of things in the same way that humans do under certain set-ups in a very abstract way, or give us a better understanding of which of the risks above are actually something you need to worry about versus something you don't need to worry about.
kave · 1mo · 2 points
I think Garrett is saying: our science gets good enough that we can tell that, in some situations, our models are going to do stuff we don't like. We then look at the brain and try and see what the brain would do in that situation.

Though I'm not fully confident that is indeed what they did

The k-gons are critical points of the loss, and as the number of samples varies the free energy is determined by integrals restricted to neighbourhoods of these critical points in weight space.
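
As a toy illustration of how this produces phase transitions (the losses and local learning coefficients below are made up, not values from the paper): the leading-order local free energies F_i(n) ≈ n·L_i + λ_i·log n can swap which critical point dominates as the number of samples n grows.

```python
import numpy as np

# Hypothetical critical points: (local loss L_i, local learning coefficient lambda_i).
# A more degenerate critical point (small lambda) with higher loss can dominate at
# small n, and lose to a sharper but lower-loss critical point at large n.
critical_points = {"4-gon": (0.30, 1.0), "5-gon": (0.25, 2.5)}

def local_free_energy(L, lam, n):
    # Leading-order SLT asymptotics: F_i(n) ~ n * L_i + lambda_i * log(n)
    return n * L + lam * np.log(n)

for n in [10, 50, 100, 500]:
    F = {k: local_free_energy(L, lam, n) for k, (L, lam) in critical_points.items()}
    print(n, "preferred:", min(F, key=F.get), {k: round(v, 2) for k, v in F.items()})
```

With these made-up numbers the preferred critical point switches from the 4-gon to the 5-gon somewhere between n = 100 and n = 500, which is the Bayesian phase transition in sample size.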

Thanks Akash. Speaking for myself, I have plenty of experience supervising MSc and PhD students and running an academic research group, but scientific institution building is a next-level problem. I have spent time reading and thinking about it, but it would be great to be connected to people with first-hand experience or who have thought more deeply about it, e.g.

  • People with advice on running distributed scientific research groups
  • People who have thought about scientific institution building in general (e.g. those with experience starting FROs in bioscienc
... (read more)

That's what we're thinking, yeah.

Yes I think that's right. I haven't closely read the post you link to (but it's interesting and I'm glad to have it brought to my attention, thanks) but it seems related to the kind of dynamical transitions we talk briefly about in the Related Works section of Chen et al.

I think it is too early to know how many phase transitions there are in e.g. the training of a large language model. If there are many, it seems likely to me that they fall along a spectrum of "scale" and that it will be easier to find the more significant ones than the less significant ones (e.g. we discover transitions like the onset of in-context learning first, because they dramatically change how the whole network computes).

As evidence for that view, I would put forward the fact that putting features into superposition is known to be a phase transitio... (read more)

Algon · 1mo · 5 points
Thank you, that was helpful. If I'm getting this right, you think the "big" transitions plausibly correspond to important capability gains. So under that theory, "chain of thought" and "reflection" arose due to big phase transitions in GPT-3 and GPT-4. I think it'd be great if researchers could, if not access training checkpoints of these models, then at least make bids for experiments to be performed on said models.

Great question, thanks. tldr it depends what you mean by established, probably the obstacle to establishing such a thing is lower than you think.

To clarify the two types of phase transitions involved here, in the terminology of Chen et al:

  • Bayesian phase transition in number of samples: as discussed in the post you link to in Liam's sequence, where the concentration of the Bayesian posterior shifts suddenly from one region of parameter space to another, as the number of samples increases past some critical sample size. There are also Bayesian phase t
... (read more)
ryan_greenblatt · 1mo · 7 points
Thanks for the detailed response! So, to check my understanding: The toy cases discussed in Multi-Component Learning and S-Curves are clearly dynamical phase transitions. (It's easy to establish dynamical phase transitions based on just observation in general. And, in these cases we can verify this property holds for the corresponding differential equations (and step size is unimportant so differential equations are a good model).) Also, I speculate it's easy to prove the existence of a bayesian phase transition in the number of samples for these toy cases given how simple they are.

To see this, we use a slight refinement of the dynamical estimator, where we restrict sampling to lie within the normal hyperplane of the gradient vector at initialization, which seems to make this behavior more robust.

 

Could you explain the intuition behind using the gradient vector at initialization? Is this based on some understanding of the global training dynamics of this particular network on this dataset?

Dmitry Vaintrob · 2mo · 3 points
Oh, I can see how this could be confusing. We're sampling at every step in the orthogonal complement to the gradient at that step ("initialization" here refers to the beginning of sampling, i.e., we don't update the normal vector during sampling). And the reason to do this is that we're hoping to prevent the sampler from quickly leaving the unstable point and jumping into a lower-loss basin (by restricting, we are guaranteeing that the unstable point is a critical point).
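
A minimal sketch of the restriction described here (hypothetical names; the actual estimator presumably uses an SGLD-style sampler with its own temperature and step-size conventions): compute the gradient once at the start of sampling, normalise it, and project every proposed move onto its orthogonal complement, so the sampler cannot immediately slide off the unstable point into a lower-loss basin.

```python
import numpy as np

def restricted_langevin_step(w, grad_loss, normal, step_size=1e-4,
                             rng=np.random.default_rng()):
    """One Langevin-style update confined to the hyperplane orthogonal to `normal`.

    `normal` is the unit gradient of the loss computed once at the start of
    sampling and then held fixed; projecting both the drift and the noise keeps
    the sampler on the hyperplane through the unstable point.
    """
    def project(v):
        return v - np.dot(v, normal) * normal   # remove the component along `normal`

    drift = -step_size * project(grad_loss(w))
    noise = np.sqrt(2 * step_size) * project(rng.standard_normal(w.shape))
    return w + drift + noise

# Usage sketch (grad_loss and w0 are placeholders):
# g0 = grad_loss(w0); normal = g0 / np.linalg.norm(g0); w = w0
# for _ in range(num_steps): w = restricted_langevin_step(w, grad_loss, normal)
```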

Not easily detected.  As in, there might be a sudden (in SGD steps) change in the internal structure of the network over training that is not easily visible in the loss or other metrics that you would normally track. If you think of the loss as an average over performance on many thousands of subtasks, a change in internal structure (e.g. a circuit appearing in a phase transition) relevant to one task may not change the loss much.
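
A back-of-the-envelope illustration of that last point, with made-up numbers: a dramatic improvement on a single subtask barely moves a loss that is averaged over thousands of subtasks.

```python
import numpy as np

rng = np.random.default_rng(0)
subtask_losses = rng.uniform(1.0, 3.0, size=10_000)   # per-subtask losses (made up)

before = subtask_losses.mean()
subtask_losses[0] = 0.0          # a circuit forms and one subtask is suddenly solved
after = subtask_losses.mean()

print(f"loss before: {before:.4f}, after: {after:.4f}, change: {before - after:.6f}")
# The aggregate loss moves by roughly 0.0002, invisible on a typical loss curve.
```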

Thanks, that makes a lot of sense to me. I have some technical questions about the post with Owen Lynch, but I'll follow up elsewhere.

4. Goals misgeneralize out of distribution.

See: Goal misgeneralization: why correct specifications aren't enough for correct goals, Goal misgeneralization in deep reinforcement learning

OAA Solution: (4.1) Use formal methods with verifiable proof certificates[2]. Misgeneralization can occur whenever a property (such as goal alignment) has been tested only on a subset of the state space. Out-of-distribution failures of a property can only be ruled out by an argument for a universally quantified statement about that property—but such arguments can in fact be

... (read more)

I think you’re directionally correct; I agree about the following:

  • A critical part of formally verifying real-world systems involves coarse-graining uncountable state spaces into (sums of subsets of products of) finite state spaces.
  • I imagine these would be mostly if not entirely learned.
  • There is a tradeoff between computing time and bound tightness.

However, I think maybe my critical disagreement is that I do think probabilistic bounds can be guaranteed sound, with respect to an uncountable model, in finite time. (They just might not be tight enough to... (read more)

Induction heads? Ok, we are maybe on track to retro engineer the mechanism of regex in LLMs. Cool.

 

This dramatically undersells the potential impact of Olsson et al. You can't dismiss modus ponens as "just regex". That's the heart of logic!

For many, the argument for AI safety being an urgent concern involves a belief that current systems are, in some rough sense, reasoning, and that this capability will increase with scale, leading to beyond human-level intelligence within a timespan of decades. Many smart outsiders remain sceptical, because they a... (read more)

That intuition sounds reasonable to me, but I don't have strong opinions about it.

One thing to note is that training and test performance are lagging indicators of phase transitions. In our limited experience so far, measures such as the RLCT do seem to indicate that a transition is underway earlier (e.g. in Toy Models of Superposition), but in the scenario you describe I don't know if it's early enough to detect structure formation "when it starts". 

For what it's worth my guess is that the information you need to understand the structure is present at the transition itself, and you don't need to "rewind" SGD to examine the structure forming one step at a time.

If the cost is a problem for you, send a postal address to daniel.murfet@gmail.com and I'll mail you my physical copy. 

Thanks for the article. For what it's worth, here's the defence I give of Agent Foundations and associated research, when I am asked about it (for background, I'm a mathematician, now working on mathematical aspects of AI safety different from Agent Foundations). I'd be interested if you disagree with this framing.

We can imagine the alignment problem coming in waves. Success in each wave merely buys you the chance to solve the next. The first wave is the problem we see in front of us right now, of getting LLMs to Not Say Naughty Things, and we can glimpse ... (read more)

Alexander Gietelink Oldenziel · 5mo · 6 points
It's an interesting framing, Dan. Agent foundations for Quantum superintelligence. To me, motivation for Agent Foundations mostly comes from different considerations. Let me explain.

To my mind, agent foundations is not primarily about some mysterious future quantum superintelligences (though hopefully it will help us when they arrive!) - but about real agents in This world, TODAY. That means humans and animals, but also many systems that are agentic to a degree, like markets, large organizations, Large Language Models etc. One could call these pseudo-agents or pre-agents or egregores, but at the moment there is no accepted terminology for not-quite-agents, which may contribute to the persistent confusion that agent foundations is only concerned with expected utility maximizers.

The reason that research in Agent Foundations has so far mostly restricted itself to highly ideal 'optimal' agents is primarily mathematical tractability. Focusing on highly ideal agents also makes sense from the point of view where we are focused on 'reflectively stable agents', i.e. we'd like to know what agents converge to upon reflection. But primarily the reason we don't much study more complicated, complex, realistic models of real-life agents is that the mathematics simply isn't there yet.

A different perspective on agent foundations is primarily that of deconfusion: we are at present confused about many of the key concepts of aligning future superintelligent agents. We need to be less confused.

Another point of view on the importance of Agent Foundations: ultimately, it is inevitable that humanity will delegate more and more power to AIs. Ensuring the continued survival and flourishing of the human species is then less about interpretability, more about engineering reflectively stable, well-steered superintelligent systems. This is more about decision theory & (relatively) precise engineering, less about the online neuroscience of mechInterp. Perhaps this is what you m

I think this is a very nice way to present the key ideas. However, in practice I think the discretisation is actually harder to reason about than the continuous version. There are deeper problems, but I'd start by wondering how you would ever compute c(f) defined this way, since it seems to depend in an intricate way on the details of e.g. the floating point implementation.

I'll note that the volume codimension definition of the RLCT is essentially what you have written down here, and you don't need any mathematics beyond calculus to write that down. You only need things like resolutions of singularities if you actually want to compute that value, and the discretisation doesn't seem to offer any advantage there.
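
For readers who have not seen it, a sketch of the volume-scaling characterisation being referred to, where K is the population loss relative to its minimum and φ the prior (multiplicity and log-factor corrections omitted):

```latex
% Prior volume of the almost-optimal region, and the RLCT as its scaling exponent
V(\epsilon) \;=\; \int_{\{w \in W \,:\, K(w) \le \epsilon\}} \varphi(w)\, dw ,
\qquad
V(\epsilon) \;\sim\; c\,\epsilon^{\lambda} \;\;(\epsilon \to 0),
\qquad
\lambda \;=\; \lim_{\epsilon \to 0} \frac{\log V(\epsilon)}{\log \epsilon} .
```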

Ege Erdil · 6mo · 3 points
I would say that the discretization is going to be easier for people with a computer science background to grasp, even though formally I agree it's going to be less pleasant to reason about or to do computations with. Still, if properties of NNs that only appeared when they are continuous functions on R^n were essential for their generalization, we might be in trouble as people keep lowering the precision of their floating point numbers. This explanation makes it clear that while assuming NNs are continuous (or even analytic!) might be useful for theoretical purposes, the claims about generalization hold just as well in a more realistic discrete setting. Yes, my definition is inspired by the volume codimension definition, though here we don't need to take a limit as some ε → 0 because the counting measure makes our life easy. The problem you have in a smooth setting is that descending the Lebesgue measure in a dumb way to subspaces with positive codimension gives trivial results, so more care is necessary to recover and reason about the appropriate notions of volume.

The set of motivated, intelligent people with the relevant skills to do technical alignment work in general, and mechanistic interpretability in particular, has a lot of overlap with the set of people who can do capabilities work. That includes many academics, and students in master's and PhD programs. One way or another they're going to publish; would you rather it be alignment/interpretability work or capabilities work?

It seems to me that speeding up alignment work by several orders of magnitude is unlikely to happen without co-opting a significant number... (read more)

Nietzsche also had mixed views on Socrates, for similar reasons. He talks about this in many of his books, including "The Birth of Tragedy" and "The Gay Science".

By the zero-shot hyperparameter work do you mean https://arxiv.org/abs/2203.03466, "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer"? I've been sceptical of NTK-based theory; it seems I should update.