You might reconstruct your sacred Jeffreys prior with a more refined notion of model identity, which incorporates derivatives (jets on the geometric/statistical side and more of the algorithm behind the model on the logical side).
Except nobody wants to hear about it at parties.
You seem to do OK...
If they only would take the time to explain things simply you would understand.
This is an interesting one. I field this comment quite often from undergraduates, and it's hard to carve out enough quiet space in a conversation to explain what they're doing wrong. In a way the proliferation of math on YouTube might be exacerbating this hard step from tourist to troubadour.
As a supervisor of numerous MSc and PhD students in mathematics, I find that when someone finishes a math degree and considers a job, the tradeoffs are usually between meaning, income, freedom, evil, etc., with some of the obvious choices being high/low along (relatively?) obvious axes. It's extremely striking to see young talented people with math or physics (or CS) backgrounds going into technical AI alignment roles in big labs, apparently maximising along many (or all) of these axes! Especially in light of recent events I suspect that this phenomenon, which appears too good to be true, actually is.
Please develop this question as a documentary special, for lapsed-Starcraft player homeschooling dads everywhere.
Thanks for setting this up!
I don't understand the strong link between Kolmogorov complexity and generalisation you're suggesting here. I think by "generalisation" you must mean something more than "low test error". Do you mean something like "out of distribution" generalisation (whatever that means)?
Well, neural networks do obey Occam's razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam's razor). I think that expression of Jesse's is also correct, in context. However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something profound about (ii) ...
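For readers who want the formula being referred to, here is a sketch of how the free energy asymptotics are usually stated next to the BIC. The symbols are assumptions introduced for illustration and do not appear in the comment: $n$ is the sample size, $L_n$ the empirical negative log likelihood, $\varphi$ the prior, $w_0$ an optimal parameter, $\hat{w}$ the maximum likelihood estimate, $d$ the number of parameters and $\lambda$ the RLCT.

```latex
\begin{align*}
F_n &= -\log \int \exp\bigl(-n L_n(w)\bigr)\, \varphi(w)\, dw
      && \text{(Bayesian free energy)} \\
F_n &= n L_n(w_0) + \lambda \log n + O_p(\log \log n)
      && \text{(free energy formula)} \\
\mathrm{BIC} &= n L_n(\hat{w}) + \frac{d}{2} \log n
      && \text{(regular models, where } \lambda = d/2\text{)}
\end{align*}
```

In both expressions the first term rewards fit and the second penalises complexity, which is the sense in which they are read as a formalisation of Occam's razor.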
However, I do think that there is another angle of attack on this problem that (to me) seems to get us much closer to a solution (namely, to investigate the properties of the parameter-function map)
Seems reasonable to me!
Re: the articles you link to. I think the second one by Carroll is quite careful to say things like "we can now understand why singular models have the capacity to generalise well" which seems to me uncontroversial, given the definitions of the terms involved and the surrounding discussion. I agree that Jesse's post has a title "Neural networks generalize because of this one weird trick" which is clickbaity, since SLT does not in fact yet explain why neural networks appear to generalise well on many natural datasets. However the actual article is more...
I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don't think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?
That seems probable. Maybe it's useful for me to lay out a more or less complete picture of what I think SLT does say about generalisation in deep learning in its current form, so that we're on the same page. When people refer to the "generalisation puzzle" in...
The easiest way to explain why this is the case will probably be to provide an example. Suppose we have a Bayesian learning machine with 15 parameters, whose parameter-function map is given by
$$f(x) = \theta_1 + \theta_2\theta_3\, x + \theta_4\theta_5\theta_6\, x^2 + \theta_7\theta_8\theta_9\theta_{10}\, x^3 + \theta_{11}\theta_{12}\theta_{13}\theta_{14}\theta_{15}\, x^4,$$
and whose loss function is the KL divergence. This learning machine will learn 4-degree polynomials. Moreover, it is overparameterised, and its loss function is analytic in its parameters, etc, so SLT will apply to it.
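A minimal sketch of the parameter-function map in the quoted example may make the overparameterisation concrete. The function names and the particular parameter vectors are hypothetical, chosen only for illustration. It shows two different parameter settings that both encode the zero function.

```python
def poly_coeffs(theta):
    """Map the 15 parameters to the 5 polynomial coefficients of f."""
    assert len(theta) == 15
    t = theta
    return (
        t[0],                                     # constant term: theta_1
        t[1] * t[2],                              # x:   theta_2 theta_3
        t[3] * t[4] * t[5],                       # x^2: theta_4 theta_5 theta_6
        t[6] * t[7] * t[8] * t[9],                # x^3: theta_7 ... theta_10
        t[10] * t[11] * t[12] * t[13] * t[14],    # x^4: theta_11 ... theta_15
    )


def f(theta, x):
    """Evaluate the polynomial encoded by theta at the point x."""
    return sum(c * x**k for k, c in enumerate(poly_coeffs(theta)))


# Two distinct (hypothetical) parameter vectors; each product contains a zero
# factor, so both encode the zero function.
theta_a = [0.0, 0.0, 5.0, 0.0, 1.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0, 3.0, 3.0, 3.0, 3.0]
theta_b = [0.0, 7.0, 0.0, 1.0, 0.0, 2.0, 1.0, 1.0, 0.0, 1.0, 4.0, 4.0, 0.0, 4.0, 4.0]

# A parameter vector encoding the non-zero polynomial 1 + 6x + x^2 + x^3 + x^4.
theta_c = [1.0, 2.0, 3.0] + [1.0] * 12
```

Because each coefficient is a product, setting any one factor to zero kills that coefficient, so the set of parameters encoding a given low-degree polynomial is large and not a single point.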
In your example there are many values of the parameters that encode the zero function (e.g. θ1...
First of all, SLT is largely based on examining the behaviour of learning machines in the limit of infinite data
I have often said that SLT is not yet a theory of deep learning, this question of whether the infinite data limit is really the right one being one of the main question marks I currently see (I think I probably also see the gap between Bayesian learning and SGD as bigger than you do). I've discussed this a bit with my colleague Liam Hodgkinson, whose recent papers https://arxiv.org/abs/2307.07785 and https://arxiv.org/abs/2311.07013 might...
I think that the significance of SLT is somewhat over-hyped at the moment
Haha, on LW that is either already true or at current growth rates will soon be true, but it is clearly also the case that SLT remains basically unknown in the broader deep learning theory community.
I claim that this is fairly uninteresting, because classical statistical learning theory already gives us a fully adequate account of generalisation in this setting which applies to all learning machines, including neural networks
I'm a bit familiar with the PAC-Bayes literature and I think this might be an exaggeration. The linked post merely says that the traditional PAC-Bayes setup must be relaxed, and sketches some ways of doing so. Could you please cite the precise theorem you have in mind?
Very loosely speaking, regions with a low RLCT have a larger "volume" than regions with high RLCT, and the impact of this fact eventually dominates other relevant factors.
I'm going to make a few comments as I read through this, but first I'd like to thank you for taking the time to write this down, since it gives me an opportunity to think through your arguments in a way I wouldn't have done otherwise. Regarding the point about volume. It is true that the RLCT can be written as (Theorem 7.1 of Watanabe's book "Algebraic Geometry and Statistical Learni...
Good question. What counts as a "-" is spelled out in the paper, but it's only outlined here heuristically. The "5 like" thing it seems to go near on the way down is not actually a critical point.
The changes in the matrix W and the bias b happen at the same time; it's not a lagging indicator.
SLT predicts when this will happen!
Maybe. This is potentially part of the explanation for "data double descent" although I haven't thought about it beyond the 5min I spent writing that page and the 30min I spent talking about it with you at the June conference. I'd be very interested to see someone explore this more systematically (e.g. in the setting of Anthropic's "other" TMS paper https://www.anthropic.com/index/superposition-memorization-and-double-descent which contains data double descent in a setting where the theory of our recent TMS paper might allow you to do something).
There is quite a large literature on "stage-wise development" in neuroscience and psychology, going back to people like Piaget but quite extensively developed in both theoretical and experimental directions. One concrete place to start on the agenda you're outlining here might be to systematically survey that literature from an SLT-informed perspective.
we can copy the relevant parts of the human brain which does the things our analysis of our models said they would do wrong, either empirically (informed by theory of course), or purely theoretically if we just need a little bit of inspiration for what the relevant formats need to look like.
I struggle to follow you guys in this part of the dialogue, could you unpack this a bit for me please?
Though I'm not fully confident that is indeed what they did
The k-gons are critical points of the loss, and as n varies the free energy is determined by integrals restricted to neighbourhoods of these critical points in weight space.
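One way to write down the statement above, as a sketch under assumed notation (none of these symbols appear in the comment): let $W_\alpha$ be a small neighbourhood of the critical point $w_\alpha^*$, $L_n$ the empirical loss on $n$ samples, $\varphi$ the prior and $\lambda_\alpha$ the local learning coefficient at $w_\alpha^*$.

```latex
\begin{align*}
F_n(W_\alpha) &= -\log \int_{W_\alpha} \exp\bigl(-n L_n(w)\bigr)\, \varphi(w)\, dw
  \;\approx\; n L_n(w_\alpha^*) + \lambda_\alpha \log n, \\
F_n &\approx -\log \sum_\alpha \exp\bigl(-F_n(W_\alpha)\bigr)
  \;\approx\; \min_\alpha F_n(W_\alpha).
\end{align*}
```

Read this way, the neighbourhood with the lowest local free energy dominates, and which one that is can change as $n$ grows because the loss term scales with $n$ while the complexity term scales with $\log n$.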
because a physicist made these notes
Thanks Akash. Speaking for myself, I have plenty of experience supervising MSc and PhD students and running an academic research group, but scientific institution building is a next-level problem. I have spent time reading and thinking about it, but it would be great to be connected to people with first-hand experience or who have thought more deeply about it, e.g.
That's what we're thinking, yeah.
Yes I think that's right. I haven't closely read the post you link to (but it's interesting and I'm glad to have it brought to my attention, thanks) but it seems related to the kind of dynamical transitions we talk briefly about in the Related Works section of Chen et al.
I think it is too early to know how many phase transitions there are in e.g. the training of a large language model. If there are many, it seems likely to me that they fall along a spectrum of "scale" and that it will be easier to find the more significant ones than the less significant ones (e.g. we discover transitions like the onset of in-context learning first, because they dramatically change how the whole network computes).
As evidence for that view, I would put forward the fact that putting features into superposition is known to be a phase transitio...
Great question, thanks. tl;dr: it depends what you mean by established; the obstacle to establishing such a thing is probably lower than you think.
To clarify the two types of phase transitions involved here, in the terminology of Chen et al:
Oh that makes a lot of sense, yes.
To see this, we use a slight refinement of the dynamical estimator, where we restrict sampling to lie within the normal hyperplane of the gradient vector at initialization, which seems to make this behavior more robust.
Could you explain the intuition behind using the gradient vector at initialization? Is this based on some understanding of the global training dynamics of this particular network on this dataset?
Not easily detected. As in, there might be a sudden (in SGD steps) change in the internal structure of the network over training that is not easily visible in the loss or other metrics that you would normally track. If you think of the loss as an average over performance on many thousands of subtasks, a change in internal structure (e.g. a circuit appearing in a phase transition) relevant to one task may not change the loss much.
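A toy calculation illustrates the averaging point. The numbers are made up for illustration and do not come from any experiment: the loss is treated as a plain mean over equally weighted subtasks, and one subtask goes from loss 1.0 to loss 0.0.

```python
def mean_loss(subtask_losses):
    """Overall loss, modelled here as an unweighted mean over subtasks."""
    return sum(subtask_losses) / len(subtask_losses)


n_subtasks = 5000                  # hypothetical number of subtasks
before = [1.0] * n_subtasks        # every subtask starts at loss 1.0
after = list(before)
after[0] = 0.0                     # one subtask is solved completely

# Change in the aggregate loss caused by fully solving a single subtask.
change = mean_loss(before) - mean_loss(after)
print(f"aggregate loss change: {change:.6f}")
```

Under these assumptions the aggregate moves by only 1/n_subtasks, which is easy to miss against ordinary training noise even though the subtask itself changed completely.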
Thanks, that makes a lot of sense to me. I have some technical questions about the post with Owen Lynch, but I'll follow up elsewhere.
4. Goals misgeneralize out of distribution.
See: Goal misgeneralization: why correct specifications aren't enough for correct goals, Goal misgeneralization in deep reinforcement learning
OAA Solution: (4.1) Use formal methods with verifiable proof certificates. Misgeneralization can occur whenever a property (such as goal alignment) has been tested only on a subset of the state space. Out-of-distribution failures of a property can only be ruled out by an argument for a universally quantified statement about that property—but such arguments can in fact be
I think you’re directionally correct; I agree about the following:
However, I think maybe my critical disagreement is that I do think probabilistic bounds can be guaranteed sound, with respect to an uncountable model, in finite time. (They just might not be tight enough to...
Induction heads? Ok, we are maybe on track to reverse engineer the mechanism of regex in LLMs. Cool.
This dramatically undersells the potential impact of Olsson et al. You can't dismiss modus ponens as "just regex". That's the heart of logic!
For many, the argument for AI safety being an urgent concern involves a belief that current systems are, in some rough sense, reasoning, and that this capability will increase with scale, leading to beyond human-level intelligence within a timespan of decades. Many smart outsiders remain sceptical, because they a...
That intuition sounds reasonable to me, but I don't have strong opinions about it.
One thing to note is that training and test performance are lagging indicators of phase transitions. In our limited experience so far, measures such as the RLCT do seem to indicate that a transition is underway earlier (e.g. in Toy Models of Superposition), but in the scenario you describe I don't know if it's early enough to detect structure formation "when it starts".
For what it's worth my guess is that the information you need to understand the structure is present at the transition itself, and you don't need to "rewind" SGD to examine the structure forming one step at a time.
If the cost is a problem for you, send a postal address to email@example.com and I'll mail you my physical copy.
Thanks for the article. For what it's worth, here's the defence I give of Agent Foundations and associated research, when I am asked about it (for background, I'm a mathematician, now working on mathematical aspects of AI safety different from Agent Foundations). I'd be interested if you disagree with this framing.
We can imagine the alignment problem coming in waves. Success in each wave merely buys you the chance to solve the next. The first wave is the problem we see in front of us right now, of getting LLMs to Not Say Naughty Things, and we can glimpse ...
I think this is a very nice way to present the key ideas. However, in practice I think the discretisation is actually harder to reason about than the continuous version. There are deeper problems, but I'd start by wondering how you would ever compute c(f) defined this way, since it seems to depend in an intricate way on the details of e.g. the floating point implementation.
I'll note that the volume codimension definition of the RLCT is essentially what you have written down here, and you don't need any mathematics beyond calculus to write that down. You only need things like resolutions of singularities if you actually want to compute that value, and the discretisation doesn't seem to offer any advantage there.
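As a sketch of the calculus-only definition mentioned above: write V(eps) for the prior mass of the set where the loss is below eps, and read off the RLCT as the limit of log(V(a·eps)/V(eps)) / log(a) as eps tends to 0. The toy model below is an assumption chosen because the volume is available in closed form: a one-dimensional loss K(w) = w^(2k) with a uniform prior on [-1, 1], for which the answer is 1/(2k).

```python
import math


def volume(eps, k):
    """Prior mass of {w in [-1, 1] : w**(2k) < eps} under the uniform prior.

    The set is the interval |w| < eps**(1/(2k)), of length 2 * eps**(1/(2k)),
    and the uniform density on [-1, 1] is 1/2.
    """
    return min(1.0, eps ** (1.0 / (2 * k)))


def rlct_estimate(eps, k, a=0.5):
    """Volume-scaling estimate of the RLCT at scale eps (toy model only)."""
    return math.log(volume(a * eps, k) / volume(eps, k)) / math.log(a)


for k in (1, 2, 3):
    print(k, rlct_estimate(1e-6, k))   # close to 1/2, 1/4, 1/6 respectively
```

In this toy case the volume is an exact power of eps, so the estimate does not depend on the scale; for realistic losses the volume is not available in closed form, which is where the harder machinery comes in.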
The set of motivated, intelligent people with the relevant skills to do technical alignment work in general, and mechanistic interpretability in particular, has a lot of overlap with the set of people who can do capabilities work. That includes many academics, and students in masters and PhD programs. One way or another they're going to publish, would you rather it be alignment/interpretability work or capabilities work?
It seems to me that speeding up alignment work by several orders of magnitude is unlikely to happen without co-opting a significant number...
Nietzsche also had mixed views on Socrates, for similar reasons. He talks about this in many of his books, including "The Birth of Tragedy" and "The Gay Science".
By the zero-shot hyperparameter work do you mean https://arxiv.org/abs/2203.03466 "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer"? I've been sceptical of NTK-based theory, seems I should update.