Jesse Hoogland — LessWrong

Executive director at Timaeus. Working on singular learning theory and developmental interpretability.

Do you have an intuition on whether or not using LoRA for the SGMCMC sampling of the BIF breaks everything?

No, I’m not sure about this. I think it’s an interesting question and would make a valuable paper to really investigate this.

in particular arguing that it shouldn't affect emergent misalignment, which is basically just a special case of BIFs.

I don’t understand what this means. In what sense is the BIF a special case of what?

TLDR: One model’s trauma is another’s enlightenment.

Why study model development? One reason is that training data can have the opposite effect on a model depending on when it is shown. So alignment is not just about training on the right data but training in the right order.

We just put out a new paper that explains how this can arise and provides some examples (mostly toy). This builds on our recent work introducing a new influence function technique (see my previous shortform). I thought I’d write up a quick note on how these papers are connected: Why does a generalization of influence functions imply influence functions changing over training?

Static view of development ⇔ classical influence functions. Assume regularity (a single, nondegenerate global minimum). This implies:

Development is a gradual convergence to the true parameters, controlled by the curvature (spectrum of the Hessian) around this point. For Bayesian learners, this follows from the Bernstein–von Mises theorem. For (stochastic) optimization, there are similar results, such as the Polyak-Ruppert averaging theorem.
The classical influence function is all you need to characterize influence. That is, the full Bayesian influence function (BIF) asymptotically reduces to just the classical influence function.

Dynamic view of development ⇔ Bayesian influence functions. Drop the regularity assumption (allow for a set of non-unique, degenerate minima). This implies:

Development is a stagewise succession of phase transitions, controlled by a tradeoff between loss and complexity. For Bayesian learners, this follows from Watanabe’s free energy formula and the singular learning process. For stochastic optimization, the theory is not yet developed enough to handle this regime.
The classical influence function is insufficient to characterize influence, even in the asymptotic limit. In this regime, the asymptotic equivalence of the BIF and the classical IF breaks down.

For more on how this plays out in real-world training, see the announcement thread for our new paper (reproduced partially below):

Training Data Attribution (TDA) should account for learning dynamics! The same data can influence model behavior in dramatically different ways at different time points of training. We call for a shift towards stagewise data attribution and the study of influence dynamics.

1/11
Influence changes over training – sometimes dramatically. The same data that helps learn general categories early in training can harm later specialization. Why is this?

2/11

Developmental interpretability ≠ interpretability-over-time. Two years ago, we proposed "developmental interpretability," a research agenda that applies singular learning theory (SLT) to study how neural networks learn. In the time since, the broader field of "interpretability-over-time" has grown, and our ambitions for "SLT-for-safety" have expanded beyond just understanding learning.

In response to these changes, I thought I'd write a quick clarification on where the current boundaries are:

"Interpretability-over-time" is about applying interpretability techniques throughout training to study how structure forms in models (see examples using crosscoders, circuit discovery tools, and behavioral signals). This is a field that predates Timaeus and doesn’t a priori require SLT.

We first coined the term "developmental interpretability" with a narrower, technical meaning in mind. The term has since drifted in practice, with many using it interchangeably with "interpretability-over-time." For clarity about our own research agenda, we use "developmental interpretability" to refer to the following specific methodology:

SGD to Bayes: Model the SGD learning process (which is easy to implement but hard to describe theoretically) with an idealized Bayesian learning process (which is hard to implement but easy to describe theoretically).
Invoke SLT: Use singular learning theory (SLT) to make predictions (based on the singular learning process of SLT) and measuring devices ("spectroscopy") for studying this idealized process.
Back to SGD: Apply those predictions and tools in the original SGD setting to discover novel developmental phenomena and interpret them.

The singular learning process. Singular learning theory is a theory of Bayesian statistics that predicts that the learning process is organized by Bayesian phase transitions (aka "developmental stages"). The "novel developmental phenomena" we’re hoping to discover and interpret with SLT are precisely these phase transitions. We’ve now successfully applied this pipeline across a range of settings, including synthetic toy models (superposition, list sorting, and in-context linear regression), vision models, and language models.

Spectroscopy, in this context, refers to the broader toolkit of SLT-derived measuring devices, including LLCs, refined LLCs, susceptibilities, and Bayesian influence functions (which are really just another type of susceptibility). The name is borrowed from "spectroscopy" in statistical physics, which refers to the study of (electromagnetic) spectra emitted by a physical system to infer that system’s microscopic structure. It’s the same math, just different materials.

One of the reasons we started by focusing on development was practical. The LLC is a scalar. The LLC-over-time is a function. Studying development allowed us to extract much more information from a fixed coarse instrument. As our tools have improved, we’re able to use them in new contexts without always needing additional developmental information.

Outlook. Our theory of change remains centered on development. But as the research has succeeded, it has become clear that the range of phenomena we can study with SLT is not limited to development. It has also become clearer that we can use these tools not just for passive interpretability but also for active control.

We recently put out a new paper on a scalable generalization of influence functions, which quantify how training data affects model behavior (see Nina's post). I'm excited about this because it takes a completely new methodological approach to measuring influence.

Instead of relying on a Hessian inverse (which is ill-defined and expensive), our new "Bayesian" influence functions (BIF) rely on a covariance calculation (which can be scalably estimated with MCMC). This approach is more theoretically sound (no more Hessian inverses), and it achieves what I think are a more desirable set of engineering tradeoffs (better model-size scaling but worse dataset-size scaling).

At Timaeus, we think these kinds of techniques are on the critical path to safety. Modern alignment techniques like RLHF and Constitutional AI are about controlling model behavior by selecting the right training data. If this continues to be the case, we will need better tools for understanding and steering the pipeline from data to behavior.

It's still early days for the BIF. We've done some initial validation on retraining benchmarks and other quantitative tests (follow-up work coming soon), where the BIF comes out looking strong, but more work will be needed to understand the full set of costs and benefits. As that foundation gets established, we expect we'll be able to start applying these techniques directly to safety-relevant problems.

You can read the full announcement thread on X (reproduced below):

How does training data shape model behavior? Well, it’s complicated…
But we can make progress by studying a simplified, linear version of the mapping from data to behavior. This is the idea behind influence functions (IF), which are one of the main pillars in modern training data attribution.
Unfortunately, classical influence functions are fundamentally limited:
Theoretically, IFs assume a unique, isolated global minimum. This is never true for NNs.
Practically, the Hessian dependency poses a severe memory bottleneck that explodes with model size
Still, there’s hope. The Bayesian influence function (BIF) addresses both issues. The idea:
Study influence not on a single minimum but on the distribution of low-loss solutions.
Skip the Hessian inversion. Compute covariance over this distribution.
At first glance, this looks like a step backwards: computing a covariance over the full Bayesian posterior is much more intractable than computing the Hessian! And we typically care about influence for a specific checkpoint, not aggregated over all possible solutions.
In our new paper, we solve both problems by introducing:
A local version of the BIF that applies to individual NN checkpoints.
A scalable stochastic-gradient MCMC estimator.
The local BIF bypasses the Hessian bottleneck and is well-defined even for degenerate models. It can be batched and scales to billions of parameters. One of the best perks is that we get fine-grained per-token influence functions for no extra compute cost.
To validate the BIF, we test it on a standard retraining benchmark, via the Linear Datamodeling Score (LDS). We find that it is competitive with leading IF-based approximations, especially in the small-dataset regime.
There are caveats: the BIF exhibits worse scaling with dataset size, we’re still in the early days of understanding the role of SGMCMC hyperparameters, and generally more investigation is needed! But we see straightforward ways to make progress on these problems.
To read more, check out our new paper, out now:
Bayesian Influence Functions for Hessian-Free Data Attribution
@Philipp Alexander Kreer, @Wilson Wu, @Maxwell Adam, @Zach Furman, @Jesse Hoogland

"Scalable" in the sense of "scalable oversight" refers to designing methods for scaling human-level oversight to systems beyond human-level capabilities. This doesn't necessarily mean the same thing as designing methods for reliably converting compute into more alignment. In practice, techniques for scalable oversight like debate, IDA, etc. often have the property that they scale with compute, but this is left as an implicit desideratum rather than an explicit constraint.

Looking back at this, I think this post is outdated and was trying a little too hard to be provocative. I agree with everything you say here. Especially: "One could reasonably say that PAC learning is somewhat confused, but learning theorists are working on it!"

Forgive my youthful naïvité. For what it's worth, I think the generalization post in this sequence has stood the test of time much better.

Claude 3.7 reward hacks. During training, Claude 3.7 Sonnet sometimes resorted to "special-casing" to pass tests when it got stuck — including directly hardcoding expected outputs or even modifying test files themselves. Rumors are circulating that o1/o3 was doing similar things — like overwriting equality operators to get Python tests to pass — and this may have contributed to the delayed release.

This seems relevant to claims that "we'll soon have reward models sophisticated enough to understand human values" and that inner alignment is the real challenge. Instead, we're seeing real examples of reward-hacking at the frontier.

RL is becoming important again. We should expect old failure modes to rear their ugly heads.

But "models have singularities and thus number of parameters is not a good complexity measure" is not a valid criticism of VC theory.

Right, this quote is really a criticism of the classical Bayesian Information Criterion (for which the "Widely applicable Bayesian Information Criterion" WBIC is the relevant SLT generalization).

Ah, I didn't realize earlier that this was the goal. Are there any theorems that use SLT to quantify out-of-distribution generalization? The SLT papers I have read so far seem to still be talking about in-distribution generalization, with the added comment that Bayesian learning/SGD is more likely to give us "simpler" models and simpler models generalize better.

That's right: existing work is about in-distribution generalization. It is the case that, within the Bayesian setting, SLT provides an essentially complete account of in-distribution generalization. As you've pointed out there are remaining differences between Bayes and SGD. We're working on applications to OOD but have not put anything out publicly about this yet.

To be precise, it is a property of singular models (which includes neural networks) in the Bayesian setting. There are good empirical reasons to expect the same to be true for neural networks trained with SGD (across a wide range of different models, we observe the LLC progressively increase from ~0 over the course of training).

The key distinction is that VC theory takes a global, worst-case approach — it tries to bound generalization uniformly across an entire model class. This made sense historically but breaks down for modern neural networks, which are so expressive that the worst-case is always very bad and doesn't get you anywhere.

The statistical learning theory community woke up to this fact (somewhat) with the Zhang et al. paper, which showed that deep neural networks can achieve perfect training loss on randomly labeled data (even with regularization). The same networks, when trained on natural data, will generalize well. VC dimension can't explain this. If you can fit random noise, you get a huge (or even infinite) VC dimension and the resulting bounds fail to explain empircally observed generalization performance.

So I'd argue that dependence on the true-data distribution isn't a weakness, but one of SLT's great strengths. For highly expressive model classes, generalization only makes sense in reference to a data distribution. Global, uniform approaches like VC theory do not explain why neural networks generalize.

Thus if multiple parameter values lead to the same behaviour, this isn't a problem for the theory at all because these redundancies do not increase the VC-dimension of the model class.

Multiple parameter values leading to the same behavior isn't a problem — this is "the one weird trick." The reason you don't get the terribly generalizing solution that is overfit to noise is because simple solutions occupy more volume in the loss landscape, and are therefore easier to find. At the same time, simpler solutions generalize better (this is intuitively what Occam's razor is getting at, though you can make it precise in the Bayesian setting). So it's the solutions that generalize best that end up getting found.

If the claim is that it only needs to know certain properties of the true distribution that can be estimated from a small number of samples, then it will be nice to have a proof of such a claim (not sure if that exists).

I would say that this is a motivating conjecture and deep open problem (see, e.g., the natural abstractions agenda). I believe that something like this has to be true for learning to be at all possible. Real-world data distributions have structure; they do not resemble noise. This difference is what enables models to learn to generalize from finite samples.

Also note that if is allowed access to samples, then predicting whether your model generalizes is as simple as checking its performance on the test set.

For in-distribution generalization, yes, this is more or less true. But what we'd really like to get at is an understanding of how perturbations to the true distribution lead to changes in model behavior. That is, out-of-distribution generalization. Classical VC theory is completely hopeless when it comes to this. This only makes sense if you're taking a more local approach.

See also my post on generalization here.

LESSWRONG
LW

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments