Clarification: The 'derivation' for how the RLCT predicts generalization error IIRC goes through the same flavour of argument as the one the derivation of the vanilla Bayesian Information Criterion uses. I don't like this derivation very much. See e.g. this one on Wikipedia.

So what it's actually showing is just that:

If you've got a class of different hypotheses $M$ , containing many individual hypotheses ${θ_{1}, θ_{2}, \dots θ_{N}}$ .
And you've got a prior ahead of time that says the chance any one of the hypotheses in $M$ is true is some number $p (M) < 1$ ., let's say it's $p (M) = 0.8$ as an example.
And you distribute this total probability $p (M) = 0.8$ around the different hypotheses in an even-ish way, so $p (θ_{i}, M) \propto \frac{1}{N}$ , roughly.
And then you encounter a bunch of data $X$ (the training data) and find that only one or a tiny handful of hypotheses in $M$ fit that data, so $p (X | θ_{i}, M) \neq 0$ for basically only one hypotheses $θ_{i}$ ...
Then your posterior probability $p (M | X) = \frac{p (X | M) 0.8}{0.8 p (X | M) + 0.2 p (X | \neg M)}$ that the hypothesis $θ_{i}$ is correct will probably be tiny, scaling with $\frac{1}{N}$ . If we spread your prior $p (M) = 0.8$ over lots of hypotheses, there isn't a whole lot of prior to go around for any single hypothesis. So if you then encounter data that discredits all hypotheses in M except one, that tiny bit of spread-out prior for that one hypothesis will make up a tiny fraction of the posterior, unless $p (X | \neg M)$ is really small, i.e. no hypothesis outside the set $M$ can explain the data either.

So if our hypotheses correspond to different function fits (one for each parameter configuration, meaning we'd have $2^{32 k}$ hypotheses if our function fits used $k$ $32$ -bit floating point numbers), the chance we put on any one of the function fits being correct will be tiny. So having more parameters is bad, because the way we picked our prior means our belief in any one hypothesis goes to zero as $N$ goes to infinity.

So the Wikipedia derivation for the original vanilla posterior of model selection is telling us that having lots of parameters is bad, because it means we're spreading our prior around exponentially many hypotheses.... if we have the sort of prior that says all the hypotheses are about equally likely.

But that's an insane prior to have! We only have $1.0$ worth of probability to go around, and there's an infinite number of different hypotheses. Which is why you're supposed to assign prior based on K-complexity, or at least something that doesn't go to zero as the number of hypotheses goes to infinity. The derivation is just showing us how things go bad if we do that.

In summary: badly normalised priors behave badly

SLT mostly just generalises this derivation to the case where parameter configurations in our function fits don't line up one-to-one with hypotheses.

It tells us that if we are spreading our prior around evenly over lots of parameter configurations, but exponentially many of these parameter configurations are secretly just re-expressing the same hypothesis, then that hypothesis can actually get a decent amount of prior, even if the total number of parameter configurations is exponentially large.

So our prior over hypotheses in that case is actually somewhat well-behaved in that it can end up normalised properly when we take $N \to \infty$ . That is a basic requirement a sane prior needs to have, so we're at least not completely shooting ourselves in the foot anymore. But that still doesn't show why this prior, that neural networks sort of^[1] implicitly have, is actually good. Just that it's no longer obviously wrong in this specific way.

Why does this prior apparently make decent-ish predictions in practice? That is, why do neural networks generalise well?

I dunno. SLT doesn't say. It just tells us how the parameter prior to hypothesis prior conversion ratio works, and in the process shows us that neural networks priors can be at least somewhat sanely normalised for large numbers of parameters. More than we might have initially thought at least.

That's all though. It doesn't tell us anything else about what makes a Gaussian over transformer parameter configurations a good starting guess for how the universe works.

How to make this story tighter?

If people aim to make further headway on the question of why some function fits generalise somewhat and others don't, beyond: 'Well, standard Bayesianism suggests you should at least normalise your prior so that having more hypotheses isn't actively bad', then I'd suggest a starting point might be to make a different derivation for the posterior on the fits that isn't trying to reason about $p (M)$ defined as the probability that one of the function fits is 'true' in the sense of exactly predicting the data. Of course none of them are. We know that. When we fit a $150$ billion parameter transformer to internet data, we don't expect going in that any of these $2^{16 \times 150 \times 10^{9}}$ parameter configurations will give zero loss up to quantum noise on any and all text prediction tasks in the universe until the end of time. Under that definition of $M$ , which the SLT derivation of the posterior and most other derivations of this sort I've seen seem to implicitly make, we basically have $p (M) \approx 0$ going in! Maybe look at the Bayesian posterior for a set of hypotheses we actually believe in at all before we even see any data, like $M ='one of these models might get < 1.1 average loss on holdout data sets'$ .

SLT in three sentences

'You thought your choice of prior was broken because it's nor normalised right, and so goes to zero if you hand it too many hypotheses. But you missed that the way you count your hypotheses is also broken, and the two mistakes sort of cancel out. Also here's a bunch of algebraic geometry that sort of helps you figure out what probabilities your weirdo prior actually assigns to hypotheses, though that parts not really finished'.

SLT in one sentence

'Loss basins with bigger volume will have more posterior probability if you start with a uniform-ish prior over parameters, because then bigger volumes get more prior, duh.'

^{^}
Sorta, kind of, arguably. There's some stuff left to work out here. For example vanilla SLT doesn't even actually tell you which parts of your posterior over parameters are part of the same hypothesis. It just sort of assumes that everything left with support in the posterior after training is part of the same hypothesis, even though some of these parameter settings might generalise totally differently outside the training data. My guess is that you can avoid matching this up by comparing equivalence over all possible inputs by checking which parameter settings give the same hidden representations over the training data, not just the same outputs.

Examples of Highly Counterfactual Discoveries?

Lucius Bushnaq4h40

It's measuring the volume of points in parameter space with loss when $ϵ$ is infinitesimal.

This is slightly tricky because it doesn't restrict itself to bounded parameter spaces,^[1] but you can fix it with a technicality by considering how the volume scales with $ϵ$ instead.

In real networks trained with finite amounts of data, you care about the case where $ϵ$ is small but finite, so this is ultimately inferior to just measuring how many configurations of floating point numbers get loss $< ϵ$ , if you can manage that.

I still think SLT has some neat insights that helped me deconfuse myself about networks.

For example, like lots of people, I used to think you could maybe estimate the volume of basins with loss $< ϵ$ using just the eigenvalues of the Hessian. You can't. At least not in general.

^{^}
Like the floating point numbers in a real network, which can only get so large. A prior of finite width over the parameters also effectively bounds the space

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer

Lucius Bushnaq7d30

Right. If I have fully independent latent variables that suffice to describe the state of the system, each of which can be in one of $s$ different states, then even tracking the probability of every state for every latent with a $p$ bit precision float will only take me about $n \times s \times p$ bits. That's actually not that bad compared to $n \times log (s)$ for just tracking some max likelihood guess.

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer

Lucius Bushnaq7d40

With that in mind, the real hot possibility is the inverse of what Shai and his coresearchers did. Rather than start with a toy model with some known nice latents, start with a net trained on real-world data, and go look for self-similar sets of activations in order to figure out what latent variables the net models its environment as containing. The symmetries of the set would tell us something about how the net updates its distributions over latents in response to inputs and time passing, which in turn would inform how the net models the latents as relating to its inputs, which in turn would inform which real-world structures those latents represent.

Thank you, this was very much the paragraph I was missing to understand why comp mech might be useful for interpretability.

How sure are we that models will keep tracking Bayesian belief states, and so allow this inverse reasoning to be used, when they don't have enough space and compute to actually track a distribution over latent states?

Approximating those distributions by something like 'peak position plus spread' seems like the kind of thing a model might do to save space.

How to safely use an optimizer

Lucius Bushnaq1mo20

Typo fixed, thanks.

How to safely use an optimizer

Lucius Bushnaq1mo61

Also, one of my points was that even a tiny trace of malicious optimization here can still have large effects because there are so many neutral options which are convergently unsafe, so with such a high base rate, even a bit of optimization can result in a large absolute increase in failure

Your example has it be an important bit though. What database to use. Not a random bit. If I'm getting this right, that would correspond to far more than one bit of adversarial optimisation permitted for the oracle in this setup.

doesn't mean the oracle gets to select one bit of its choice in the string to flip, it means it gets to select one of two strings^[1].

^{^}
Plus the empty string for not answering.

How to safely use an optimizer

Lucius Bushnaq1mo73

First thought: The Oracle is going to choose to systematically answer or not answer the queries we give it. This represents a causal channel of one bit per query it can use to influence the outside world^[1]. Can you conquer the world in one awkwardly delivered kilobyte or less? Maybe.

Maybe we can stop that by scrapping every Oracle that doesn't answer and training a new one with presumably new goals? Or would the newly trained Oracles just cooperate with the former dead ones in one long-term plan to break out, take control, and reward all the dead Oracles created on the way with utility?

Second thought: What kind of optimisation problems can we specify well enough for a formal proof checker to tell whether they've been satisficed? Are they the kind of problems where solving them can save the world?

It feels to me like the answer is 'yes'. A lot of core research that would allow e.g. for brain augmentation seems like they'd be in that category. But my inner John Wentworth sim is looking kind of sceptical.

^{^}
It also gets to choose the timing of its answer, but I assume we are not being idiots about that and setting the output channel to always deliver results after a set time , no more and no less.

Some costs of superposition

Lucius Bushnaq2mo20

I think the may be in there because JL is putting an upper bound on the interference, rather than describing the typical interference of two features. As you increase $m$ (more features), it becomes more difficult to choose feature embeddings such that no features have high interference with any other features.

So its not really the 'typical' noise between any two given features, but it might be the relevant bound for the noise anyway? Not sure right now which one matters more for practical purposes.

story-based decision-making

Lucius Bushnaq2mo30

How does that make you feel about the chances of the rebels destroying the Death Star? Do you think that the competent planning being displayed is a good sign? According to movie logic, it's a really bad sign.

Even in the realm of movie logic, I always thought the lack of backup plans was supposed to signal how unlikely the operation is to work, so as to create at least some instinctive tension in the viewer when they know perfectly well that this isn't the kind of movie that realistically ends with the Death Star blowing everyone up. In fact, these scenes usually have characters directly stating how nigh-impossible the mission is.

To the extent that the presence of backup plans make me worried, it's because so many movies have pulled this cheap trick that my brain now associates the presence of backup plans with the more uncommon kind of story that attempts to work a little like real life, so things won't just magically work out and the Death Star really might blow everyone up.

New LessWrong review winner UI ("The LeastWrong" section and full-art post pages)

Lucius Bushnaq2mo3534

I feel like 'LeastWrong' implies a focus on posts judged highly accurate or predictive in hindsight, when in reality I feel like the curation process tends to weigh originality, depth and general importance a lot as well, with posts regarded by the community as 'big if true' often being held in high regard.