Dmitry Vaintrob

Thanks! Are you saying there is a better way to find citations than a random walk through the literature? :)

I didn't realize that the pictures above limit to literal pieces of sin and cos curves (and Lissajous curves more generally). I suspect this is a statement about the singular values of the "sum" matrix S of upper-triangular 1's?
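
If it's useful, here is a quick numerical check of that guess (assuming by S you mean the n×n upper-triangular matrix of 1's, i.e. the discrete summation operator — that assumption is mine). The singular values have a closed form and the singular vectors are sampled sine/cosine arcs, which is presumably the statement behind the Lissajous-looking pictures:

```python
import numpy as np

# Numerical check (assuming S is the n x n upper-triangular matrix of 1's,
# i.e. the discrete summation operator): its singular vectors are sampled
# sine/cosine arcs, and its singular values have a closed form.
n = 200
S = np.triu(np.ones((n, n)))
U, sigma, Vt = np.linalg.svd(S)

# Known closed form: sigma_k = 1 / (2 sin((2k - 1) pi / (2 (2n + 1)))).
k = np.arange(1, 6)
predicted = 1.0 / (2.0 * np.sin((2 * k - 1) * np.pi / (2 * (2 * n + 1))))
print("top singular values (SVD):    ", np.round(sigma[:5], 3))
print("top singular values (formula):", np.round(predicted, 3))

# The leading right-singular vector should be (up to sign) the sine arc
# sin(pi * i / (2n + 1)), i = 1..n.
ref = np.sin(np.pi * np.arange(1, n + 1) / (2 * n + 1))
ref /= np.linalg.norm(ref)
print("overlap with sine arc:", abs(np.dot(Vt[0], ref)))
```

So in the n → ∞ limit the singular vectors do converge to literal quarter-period pieces of sine/cosine curves.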

The "developmental clock" observation is neat! Never heard of it before. Is it a qualitative "parametrization of progress" thing or are there phase transition phenomena that happen specifically around the midpoint?

Hmm, I'm not sure how what you're describing (learn on a bunch of examples of (query, well-thought-out guess)) is different from other forms of supervised learning.

Based on the paper Adam shared, it seems that part of the "amortizing" picture is that instead of simple supervised learning you look at examples of the form (context1, many examples from context1), (context2, many examples from context2), etc., in order to get good at quickly performing inference on new contexts.

It sounds like in the Paul Christiano example, you're assuming access to some internal reasoning components (like activations or chain-of-thought) to set up a student-teacher context. Is this equivalent to the other picture I mentioned?

I'm also curious about what you said about o3 (and maybe have a related confusion about this). I certainly believe that NN's, including RL models, learn by parallel heuristics (there's a lot of interp and theory work that suggests this), but I don't know of any special properties of o3 that make it particularly supportive of this point of view.

Thanks! I spent a bit of time understanding the stochastic inverse paper, though I haven't yet fully grokked it. My understanding here is that you're trying to learn the conditional probabilities in a Bayes net from samples. The "non-amortized" way to do this is to choose a (non-unique) maximal inverse factorization that satisfies some d-separation condition, then guess the conditional probabilities of the latent-generating process by just observing frequencies of conditional events. But of course this is very inefficient, in particular because the inverse factorization isn't a general Bayes net but must satisfy a bunch of consistency conditions; and then you can learn a generative model for these consistency conditions with a NN and perform some MCMC sampling on this learned prior.
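
To check my own understanding of the amortized-vs-non-amortized distinction, here is a minimal toy sketch (my own construction, not from the paper): a two-node net z → x with a known forward model, where the "amortized" move is to fit a single inverse model q(z | x) once on simulated (z, x) pairs and then reuse it for instant inference on any new observation, rather than redoing inference per observation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny Bayes net: latent z ~ N(0, 1), observation x = 2 z + noise with std 0.5.
def sample_joint(n):
    z = rng.normal(0.0, 1.0, size=n)
    x = 2.0 * z + rng.normal(0.0, 0.5, size=n)
    return z, x

# "Amortized" inference: fit one inverse model q(z | x) on simulated pairs --
# here just a linear-Gaussian regressor fit by least squares.
z_train, x_train = sample_joint(100_000)
A = np.vstack([x_train, np.ones_like(x_train)]).T
coef, _, _, _ = np.linalg.lstsq(A, z_train, rcond=None)
sigma_post = (z_train - A @ coef).std()

def amortized_posterior(x_new):
    """Instant approximate posterior mean/std for any new observation."""
    return coef[0] * x_new + coef[1], sigma_post

# Exact conjugate posterior for comparison: precision = 1 + 2^2 / 0.5^2 = 17.
x_obs = 1.7
post_var = 1.0 / 17.0
post_mean = post_var * (2.0 * x_obs / 0.25)
print("amortized:", amortized_posterior(x_obs))
print("exact:    ", (post_mean, np.sqrt(post_var)))
```

In this conjugate toy the regressor recovers the exact posterior; the Bayes-net version in the paper is (as I understand it) the same move with a richer family of inverse conditionals.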

So is the "moral" you want to take away here then that by exploring a diversity of tasks (corresponding to learning this generative prior on inverse Bayes nets) a NN can significantly improve its performance on single-shot prediction tasks?

FWIW, I like John's description above (and probably object much less than baseline to humorously confrontational language in research contexts :). I agree that for most math contexts, using the standard definitions with morphism sets and composition mappings is easier to prove things with, but I think the intuition described here is great and often in better agreement with how mathematicians intuit about category-theoretic constructions than the explicit formalism.

This phenomenon exists, but is strongly context-dependent. Areas of math adjacent to abstract algebra are actually extremely good at updating conceptualizations when new and better ones arrive. This is for a combination of two related reasons: first, abstract algebra is significantly concerned with finding "conceptual local optima" of ways of presenting standard formal constructions, and these are inherently stable and rarely need to change; second, when a new and better formalism is found, it tends to be so powerfully useful that papers that use the old formalism (in contexts where the new formalism is more natural) quickly become outdated -- this happened twice in living memory, once with the formalism of schemes replacing other points of view in algebraic geometry and once with higher category theory replacing clunkier conceptualizations of homological algebra and other homotopical methods in algebra. This is different from fields like AI or neuroscience, where using more compute, or finding a more carefully tailored subproblem, is often competitive with or better than "using the optimal formalism". That said, niceness of conceptualizations depends on context and taste, and there do exist contexts where "more classical" or "less universal" characterizations are preferable to the "consensus conceptual optimum".

This is very nice! So the way I understand what you linked is this: the class of perturbative expansions in the "Edgeworth expansion" picture I was distilling is that the order-d approximation for the probability distribution associated to the sum variable S_n above is of the form φ(t)·P_d(t, ε), where φ is the probability density of a Gaussian and P_d is a polynomial in t and the perturbative parameter ε. The paper you linked says that a related natural thing to do is to take the Fourier transform, which will be the product of the Gaussian pdf and a different polynomial in the Fourier parameter t and the inverse perturbation parameter 1/ε. You can then look at the leading terms, which will be (maybe up to some fixed scaling) a polynomial in these parameters, and this gives some kind of "leading" Edgeworth contribution.
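
For concreteness, here is the shape of the expansion I have in mind, written in the standard Edgeworth normalization (so the exact scaling conventions are my guess, not necessarily the paper's):

```latex
% Standard-form Edgeworth expansion (my normalization), with \epsilon = n^{-1/2}
% and \phi the standard Gaussian density:
\[
  p_{S_n}(t) \;\approx\; \phi(t)\,P_d(t,\epsilon)
            \;=\; \phi(t)\Bigl(1 + \sum_{j=1}^{d} \epsilon^{\,j}\, q_j(t)\Bigr),
  \qquad \phi(t) = \tfrac{1}{\sqrt{2\pi}}\,e^{-t^2/2},
\]
% and on the Fourier side the characteristic function has the analogous form
\[
  \widehat{p}_{S_n}(\xi) \;\approx\; e^{-\xi^2/2}\Bigl(1 + \sum_{j=1}^{d} \epsilon^{\,j}\, \widetilde{q}_j(i\xi)\Bigr),
\]
% where the q_j, \widetilde{q}_j are polynomials whose coefficients are built
% from the cumulants \kappa_3, \dots, \kappa_{j+2} of the summands.
```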

Here this can be interpreted as a stationary phase formula, but you can only get "perturbative" theories, i.e. the relevant critical set will be nonsingular (and everything is expressed as a Feynman diagram with edges decorated by the inverse Hessian). But you're saying that if you take this idea and apply it to different interesting sequences of random variables (not sum variables, but other natural asymptotic limits of other random processes), you can get singular stationary phase (i.e. the Watanabe expansion). Is there an easy way to describe the simplest case that gives an interesting Watanabe expansion?

Thanks for asking! I said in a later shortform that I was trying to do too many things in this post, with only vague relationships between them, and I'm planning to split it into pieces in the future.

Your 1-3 are mostly correct. I'd comment as follows:

  1. (and also kind of 3) That advice of using the tempered local Bayesian posterior (I like the term -- let's shorten it to TLBP) is mostly aimed at non-SLT researchers (but may apply also to some SLT experiments). The suggestion is simpler than computing expectations: it's just to run a single experiment at a weight sampled from the TLBP (there's a minimal sketch of this sampling after this list). This is analogous to tuning a precision dial on your NN to noise away all circuits for which the quotient (usefulness)/(description length) is bounded above by 1/t (where usefulness is measured in reduction of loss). At t = 0 you're adding no noise, and at t = ∞ you're fully noising it.

    This is interesting to do in interp experiments for two general reasons: 

    1. You can see whether the behavior your experiment finds is general or spurious. The higher the temperature range it persists over, the more general it is in the sense of usefulness/description length (and all else being equal, the more important your result is).
    2. If you are hoping to say that a behavior you found, e.g. a circuit, is "natural from the circuit's point of view" (i.e., plausibly occurs in some kind of optimal weight- or activation-level description of your model), you need to make sure your experiment isn't just putting together bits of other circuits in an ad-hoc way and calling it a circuit. One way to see this, that works 0% of the time, is to notice that turning this circuit on or off affects the output on exactly the context/structure you care about, and has absolutely no effect at all on performance elsewhere. This never works because our interp isn't at a level where we can perform uber-precise targeted interventions, and whenever we do something to a network in an experiment, this always significantly affects loss on unrelated inputs. By having a tunable precision parameter (as given by the TLBP for example), you have more freedom to find such "clean" effects that only do what you want and don't affect loss otherwise. In general, in an imprecise sense, you expect each "true" circuit to have some "temperature of entanglement" with the rest of the model, and if this circuit is important enough to survive tempering to this temperature of entanglement, you expect to see much cleaner and nicer results in the resulting tempered model.
  2. In the above context, you rarely want to use the Watanabe temperature or any other temperature that only depends on the number of samples n, since it's much too low in most cases. Instead, you're either looking for a characteristic temperature associated with an experiment or circuit (which in general will not depend on n much), or fishing for behaviors that you hope are "significantly general". Here the characteristic temperature associated with the level of generality that "is not literally memorizing" is the Watanabe temperature or very similar, but it is probably more interesting to consider larger scales.
  3. (maybe more related to your question 1): Above, I explained why I think performing experiments at TLBP weight values is useful for "general interp". I also explained that you sometimes have a natural "characteristic temperature" for the TLBP that is independent of sample number (e.g. meaningful at infinite samples), which is the difference between the loss of the network you're studying and that of a SOTA NN, which you think of as the "true optimal loss". In large-sample (highly underparameterized) cases, this is probably a better characteristic temperature than the Watanabe temperature, including for notions of effective parameter count: indeed, insofar as your NN is "an imperfect approximation of an optimal NN", the noise inherent in this imperfection is on this scale (and not the Watanabe scale). Of course there are issues with this PoV, as less expressive NN's are rarely well-conceptualized as TLBP samples (insofar as they find a subset of a "perfect NN's circuits", they find the easily learnable ones rather than the maximally general ones). However, it's still reasonable to think of this as a first stab at the inherent noise scale associated to an underparametrized model, and to think of the effective parameter count at this scale (i.e., free energy / log temperature) as a better approximation of some "inherent" parameter count.
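
Since "run a single experiment at a weight sampled from the TLBP" is the concrete suggestion in (1), here is a minimal sketch of what I mean on a toy one-parameter model (entirely my own toy setup, using plain full-batch Langevin dynamics as the sampler; on a real NN you'd use SGLD or similar, and T is the precision dial):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y = w * x + noise, with average squared-error loss L_n(w).
n = 500
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(scale=0.3, size=n)

def loss_and_grad(w):
    err = w * x - y
    return 0.5 * np.mean(err ** 2), np.mean(err * x)

def sample_tlbp(w0, T, steps=5_000, eta=1e-4, prior_std=10.0):
    """Draw one weight from the tempered posterior
    p_T(w) ~ exp(-n * L_n(w) / T) * prior(w), via (full-batch) Langevin dynamics."""
    w = w0
    for _ in range(steps):
        _, g = loss_and_grad(w)
        grad_U = n * g / T + w / prior_std ** 2  # gradient of the tempered energy
        w = w - 0.5 * eta * grad_U + np.sqrt(eta) * rng.normal()
    return w

w_hat = 1.5  # pretend this is the trained weight we start from
for T in [0.1, 1.0, 10.0, 100.0]:
    w_T = sample_tlbp(w_hat, T)
    print(f"T={T:6.1f}: sampled w = {w_T:+.3f}, loss = {loss_and_grad(w_T)[0]:.4f}")
```

A one-parameter toy obviously has no circuits to noise away; the point is just the mechanics of "sample once at temperature T and run your experiment there", with small T recovering the trained weight and large T washing everything out.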

Why you should try degrading NN behavior in experiments.

I got some feedback on the post I wrote yesterday that seems right: the post is trying to do too many things, without properly explaining what it is doing, why this is reasonable, and how the different parts are related.

I want to try to fix this, since I think the main piece of advice in this post is important, but gets lost in all the mess.

The main point is:

experimentalists should in many cases run an experiment on multiple versions of a neural net, produced by a variable complexity dial that allows some "natural" degradations of the NN's performance; certain dials are better than others depending on context.

I am eventually planning to split the post into a few parts, one of which explains this more carefully. When I do this I will replace the current version of the post with just a discussion of the "koan" itself: i.e., nitpicks about work that isn't careful about thinking about the scale at which it is performing interpretability.

For now I want to give a quick reductive take on what I hope to be the main takeaway of this discussion. Namely, why I think "interpretability on degraded networks" is important for better interpretability.

Basically: when ML experiments modify a neural net to identify or induce a particular behavior, this always degrades performance. Now there are two hypotheses for what is going on:

  1. You are messily pulling your NN in the direction of a particular behavior, and confusing this spurious messy phenomenon with finding a "genuine" phenomenon from the program's point of view.

  2. You are messily pulling your NN in the direction of a particular behavior, but also singling out a few "real" internal circuits of the NN that are carrying out this behavior.

Because of how many parameters you have to play with and the polysemanticity of everything in a NN, it's genuinely hard to tell these two hypotheses apart. You might find stuff that "looks" like a core circuit but is actually just bits of other circuits combined together, which your circuit-fitting experiment makes look like a coherent behavior; any nice properties of the resulting behavior that make it seem like an "authentic" circuit are just artefacts of the way you set up the experiment.

Now the idea behind running this experiment at "natural" degradations of network performance is to try to separate out these two possibilities more cleanly. Namely, an ideal outcome is that in running your experiment on some class of natural degradation of your neural net, you find a regime such that

  • the intervention you are running no longer significantly affects the (naturally degraded) performance
  • the observed effect still takes place.

Then what you've done is effectively "cleaned up" your experiment: you are still probably finding interpretable behaviors in the original neural net (since a good degradation is likely to contain a subset of circuits/behaviors of your original net and not many "new" behaviors), in a way that sufficiently reduces the complexity that the behavior you're seeking is no longer "entangled" with a bunch of other behaviors; this should significantly update you that the behavior is indeed "natural" and not spurious.

This is of course a very small, idealized sketch. But the basic idea behind looking at neural nets with degraded performance is to "squeeze" the complexity in a controlled way to suitably match the complexity of the circuit (and how it's embedded in the rest of the network/how it interacts with other circuits). If you then have a circuit of "the correct complexity" that explains a behavior, there is in some sense no "complexity room" for other sneaky phenomena to confound it.

In the post, the natural degradation I suggested is the physics-inspired "SGLD sampling" process, which in some sense tries to add a maximal amount of noise to your NN while having only a limited impact on performance (measured by loss); this is biased toward keeping "generally useful" circuits and interactions and noising away more inessential/memorize-y circuits. Other interventions with different properties are "just adding random noise" (either to weights or activations) to suitably reduce performance, or looking at earlier training checkpoints. I suspect that different degradations (or combinations thereof) are appropriate for isolating the relevant complexity of different experiments.
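
As a cartoon of the bookkeeping this recipe involves (purely illustrative and entirely my own toy: the "degradation dial" is Gaussian noise added to the weights of a two-parameter linear model, and the "intervention" is ablating the weight hypothesized to implement the target behavior):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "network": y = w0 * x0 + w1 * x1.  Task A mostly uses x0, task B mostly
# uses x1; the circuit hypothesis is that w1 is the circuit responsible for B.
w_trained = np.array([1.0, 1.0])

def make_task(scale0, scale1, n=2_000):
    X = rng.normal(size=(n, 2)) * np.array([scale0, scale1])
    return X, X @ np.array([1.0, 1.0])

XA, yA = make_task(1.0, 0.2)   # task A: mostly x0
XB, yB = make_task(0.1, 1.0)   # task B: mostly x1

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)

def ablate_circuit(w):
    """The 'intervention': zero out the putative task-B circuit."""
    w = w.copy()
    w[1] = 0.0
    return w

# The recipe: sweep a degradation dial (here, weight noise of scale sigma) and
# at each level record (i) the degraded baseline, (ii) how much the intervention
# still changes the target behavior, (iii) how much it disturbs everything else.
for sigma in [0.0, 0.1, 0.3, 1.0]:
    stats = []
    for _ in range(50):
        w = w_trained + sigma * rng.normal(size=2)
        stats.append([
            loss(w, XB, yB),                                    # degraded baseline
            loss(ablate_circuit(w), XB, yB) - loss(w, XB, yB),  # on-target effect
            loss(ablate_circuit(w), XA, yA) - loss(w, XA, yA),  # off-target effect
        ])
    base, on_t, off_t = np.mean(stats, axis=0)
    print(f"sigma={sigma:4.1f}: baseline={base:7.3f}  on-target={on_t:7.3f}  off-target={off_t:7.3f}")
```

A two-parameter linear model obviously won't reproduce the interesting "temperature of entanglement" behavior; the point is just the shape of the sweep you'd run on a real network with a real degradation (SGLD samples, added weight/activation noise, earlier checkpoints, etc.).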

Thanks so much for this! Will edit
