Lucius Bushnaq

AI notkilleveryoneism researcher, focused on interpretability. 

Personal account, opinions are my own. 

I have signed no contracts or agreements whose existence I cannot mention.

Comments

I don't see how the linked post would suggest a more optimistic interpretation of the results? The baseline is just asking GPT-4o to look at the chain of thought and guess whether the other model is lying. If linear probes and SAEs struggle to beat that baseline, then they're really not delivering on the promise of interpretability for evals that many hoped for: making manifest information about the model's cognition that isn't visible in the chain of thought.

EDIT: It is technically possible for the linear probes to pick up on hidden information and still do no better than the black-box baseline due to other shortcomings, I guess.
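
(For anyone unfamiliar with the setup: the white-box side of such a comparison is roughly a linear probe trained on the model's hidden activations. A minimal sketch of that kind of probe; the activation array and labels below are random stand-ins I made up, not anything from the paper.)

```python
# Minimal sketch of a linear "lie detection" probe on hidden activations.
# `activations` and `labels` are random stand-ins for a real dataset of
# residual-stream activations with honest/deceptive labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 4096))  # stand-in for model activations
labels = rng.integers(0, 2, size=1000)       # 1 = lying, 0 = honest

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
# The black-box baseline instead prompts GPT-4o with the chain of thought
# and asks it to guess whether the model is lying.
```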

I think he messed up the lingo a bit, but looking at the code he seems to have done k-means with a number of clusters similar to the number of SAE latents, which seems fine.
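
Roughly the kind of baseline I mean, as a sketch rather than his actual code (the activation array and the latent count here are placeholders):

```python
# Sketch of k-means on activations as a baseline for an SAE: each cluster
# centroid plays the role of one latent direction, and the hard cluster
# assignment is the (one-hot) latent code. Sizes are placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
activations = rng.normal(size=(20_000, 256))  # stand-in for model activations
n_latents = 1024                              # roughly match the SAE dictionary size

kmeans = KMeans(n_clusters=n_latents, n_init=1, random_state=0).fit(activations)
cluster_ids = kmeans.predict(activations)   # which "latent" fires for each activation
centroids = kmeans.cluster_centers_         # analogue of the SAE decoder directions
```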

If you de-slopify the models, how do you avoid people then using them to accelerate capabilities research just as much as safety research? Why wouldn't that leave us with the same gap in progress between the two we have right now, or even a worse gap? Except that everything would be moving to the finish line even faster, so Earth would have even less time to react.

Is the idea that it wouldn't help safety go differentially faster at all, but rather just that it may preempt people latching on to false slop-solutions for alignment as an additional source of confidence that racing ahead is fine? If that is the main payoff you envision, I don't think it'd be worth the downside of everything happening even faster. I think time is very precious, and sources of confidence already abound for those who go looking for them.

Yes. The hope would be that step sizes are more even across short time intervals when you’re performing a local walk in some space, instead of jumping to a point with random distance to the last point at every update. There’s probably a better way to do it. It was just an example to convey the kind of thing I mean, not a serious suggestion.

I have not updated on these results much so far. Though I haven't looked at them in detail yet. My guess is that if you already had a view of SAE-style interpretability somewhat similar to mine [1,2], these papers shouldn't be much of an additional update for you.

I don't share the feeling that not enough of relevance has happened over the last ten years for us to seem on track for solving it in a hundred years, if the world's technology[1] were magically frozen in time.

Some more insights from the past ten years that look to me like they're plausibly nascent steps in building up a science of intelligence and maybe later, alignment:

  • We understood some of the basics of general pattern matching: how it is possible for embedded minds that can't be running actual Solomonoff induction to still have some ability to extrapolate from old data to new data. This used to be a big open problem in embedded agency, at least to me, and I think it is largely solved now. Admittedly, a lot of the core work here actually happened more than ten years ago, but people in ML or our community didn't know about it. [1,2]
  • Natural latents. [1,2,3]
  • Some basic observations and theories about the internal structure of the algorithms neural networks learn, and how they learn them. Yes, our networks may be a very small corner of mind space, but one example is way better than no examples! There's a lot on this one, so the following is just a very small and biased selection. Note how some of these works are starting to properly build on each other. [1,2,3,4,5,6,7,8,9,10,11,12]
  • Some theory trying to link how AIs work to how human brains work. I feel less able to evaluate this one, but if the neurology basics are right it seems quite useful. [1]
  • QACI. What I'd consider the core useful QACI insight maybe sounds kind of obvious once you know about it. But I, at least, didn't know about it. Like, if someone had told me: "A formal process we can describe that we're pretty sure would return the goals we want an AGI to optimise for is itself often a sufficient specification of those goals." I would've replied: "Well, duh." But I wouldn't have realised the implication. I needed to see an actual example for that. Plausibly MIRI people weren't as dumb as me here and knew this pre-2015, I'm not sure.
  • The mesa-optimiser paper. This one probably didn't have much insight that didn't already exist pre-2015. But I think it communicated something central about the essence of the alignment problem to many people who hadn't realised it before. [1]

If we were a normal scientific field with no deadline, I would feel very good about our progress here. Particularly given how small we are. CERN costs ca. 1.2 billion a year; I think all the funding for technical work and governance over the past 20 years taken together doesn't add up to one year of that. Even if at the end of it all we still had to get ASI alignment right on the first try, I would still feel mostly good about this, if we had a hundred years.

I would also feel better about the field building situation if we had a hundred years. Yes, a lot of the things people tried for field building over the past ten years didn't work as well as hoped. But we didn't try that many things, a lot of the attempts struck me as inadequate in really basic ways that seem fixable in principle, and I would say the end result still wasn't zero useful field building. I think the useful parts of the field have grown quite a lot even in the past three years! Just not as much as people like John or me thought they would, and not as much as we probably needed them to with the deadlines we seem likely to have.

Not to say that I wouldn't still prefer to do some human intelligence enhancement first, even if we had a hundred years. That's just the optimal move, even in a world where things look less grim. 

But what really kills it for me is just the sheer lack of time. 
 

  1. ^

    Specifically AI and intelligence enhancement

I wonder whether this is due to the fact that he's used to thinking about human brains, where we're (AFAIK) nowhere near being able to identify the representation of specific concepts, and so we might as well use the most philosophically convenient description.

I don't think this description is philosophically convenient. Believing a proposition and believing other things that imply that proposition are genuinely different states of affairs in a sensible theory of mind. Thinking through concrete mech interp examples of the former vs. the latter makes it less abstract in what sense they are different, but I think I would have objected to Chalmers' definition even back before we knew anything about mech interp. It would just have been harder for me to articulate what exactly is wrong with it.

(Abstract) I argue for the importance of propositional interpretability, which involves interpreting a system’s mechanisms and behavior in terms of propositional attitudes
...
(Page 5) Propositional attitudes can be divided into dispositional and occurrent. Roughly speaking, occurrent attitudes are those that are active at a given time. (In a neural network, these would be encoded in neural activations.) Dispositional attitudes are typically inactive but can be activated. (In a neural network, these would be encoded in the weights.) For example, I believe Paris is the capital of France even when I am asleep and the belief is not active. That is a dispositional belief. On the other hand, I may actively judge France has won more medals than Australia. That is an occurrent mental state, sometimes described as an “occurrent belief”, or perhaps better, as a “judgment” (so judgments are active where beliefs are dispositional). One can make a similar distinction for desires and other attitudes.

I don't like it. It does not feel like a clean natural concept in the territory to me.

Case in point:

(Page 9) Now, it is likely that a given AI system may have an infinite number of propositional attitudes, in which case a full log will be impossible. For example, if a system believes a proposition p, it arguably dispositionally believes p-or-q for all q. One could perhaps narrow down to a finite list by restricting the log to occurrent propositional attitudes, such as active judgments. Alternatively, we could require the system to log the most significant propositional attitudes on some scale, or to use a search/query process to log all propositional attitudes that meet a certain criterion.

I think what this is showing is that Chalmers' definition of "dispositional attitudes" has a problem: it lacks any notion of the amount and kind of computational labour required to turn 'dispositional' attitudes into 'occurrent' ones. That's why he ends up with AI systems having an uncountably infinite number of dispositional attitudes.

One could try to fix up Chalmers' definition by making up some notion of computational cost, or circuit complexity or something of the sort, that's required to convert a dispositional attitude into an occurrent attitude, and then only list dispositional attitudes up to some cost cutoff we are free to pick as applications demand.

But I don't feel very excited about that. At that point, what is this notion of "dispositional attitudes" really still providing us that wouldn't be less cumbersome to describe in the language of circuits? There, you don't have this problem. An AI can have a query-key lookup for the proposition p and just not have a query-key lookup for the proposition p-or-q. Instead, if someone asks whether p-or-q is true, it first performs the lookup for p, then uses some general circuits for evaluating simple propositional logic to calculate that p-or-q is true. This is an importantly different computational and mental process from having a query-key lookup for p-or-q in the weights and just directly performing that lookup, so we ought to describe a network that does the former differently from a network that does the latter. It does not seem like Chalmers' proposed log of 'propositional attitudes' would do this. It'd describe both of these networks the same way, as having a propositional attitude of believing p-or-q, discarding a distinction between them that is important for understanding the models' mental state in a way that will let us do things such as successfully predicting the models' behavior in a different situation.
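
To make that distinction concrete, here is a deliberately silly toy sketch of the two computational processes, with Python dictionaries standing in for query-key lookups in the weights (nothing here is real mech interp):

```python
# Two toy "networks" that both answer 'p or q' correctly, but by different
# computational processes. Python dicts stand in for query-key lookups
# encoded in the weights.

def network_a(query: str) -> bool:
    """Has a dedicated lookup for 'p or q': answering it is one retrieval."""
    facts = {"p": True, "p or q": True}
    return facts[query]


def network_b(query: str) -> bool:
    """Only has a lookup for p. Answers 'p or q' by retrieving p and then
    applying a general inference step (p implies p-or-q), not a lookup."""
    facts = {"p": True}
    if query in facts:
        return facts[query]
    if query == "p or q":
        return facts["p"]  # general propositional-logic circuit, no stored entry
    raise KeyError(query)


# Same answer, different process. A log that lists both networks as
# "believing p-or-q" erases exactly this difference.
assert network_a("p or q") == network_b("p or q")
```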

I'm all for trying to come up with good definitions for model macro-states which throw away tiny implementation details that don't matter, but this definition does not seem to me to carve the territory in quite the right way. It throws away details that do matter.

The motivating idea is indeed that you want to encode the attribution of each rank-1 piece separately. In practice, computing the attribution of the component as a whole actually does involve calculating the attributions of all rank-1 pieces and summing them up, though you're correct that nothing we do requires storing those intermediate results.
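
To spell out the linearity point, here is a toy sketch with a weight matrix written as a sum of rank-1 pieces and a simple gradient-times-activation attribution per piece; the shapes, the gradient, and the attribution formula are placeholder assumptions, not our actual setup:

```python
# Toy version of the bookkeeping: a weight matrix written as a sum of rank-1
# pieces, with a simple gradient-times-activation attribution per piece.
# By linearity, the per-piece attributions sum to the attribution of the
# whole component, so nothing forces you to store them separately.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_pieces = 8, 6, 4

U = rng.normal(size=(n_pieces, d_out))  # u_k vectors
V = rng.normal(size=(n_pieces, d_in))   # v_k vectors
W = sum(np.outer(U[k], V[k]) for k in range(n_pieces))  # W = sum_k u_k v_k^T

x = rng.normal(size=d_in)
grad_out = rng.normal(size=d_out)  # stand-in for dLoss/d(Wx)

# Attribution of each rank-1 piece k: grad . (u_k (v_k . x))
piece_attributions = np.array(
    [grad_out @ (U[k] * (V[k] @ x)) for k in range(n_pieces)]
)
# Attribution of the component as a whole: grad . (W x)
whole_attribution = grad_out @ (W @ x)

assert np.isclose(piece_attributions.sum(), whole_attribution)
```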

While it technically works out, you are pointing at a part of the math that I think is still kind of unsatisfying. If Bob calculates the attributions and sends them to Alice, why would Alice care about getting the attribution of each rank-1 piece separately if she doesn't need them to tell what component to activate? Why can't Bob just sum them before he sends them? It kind of vaguely makes sense to me that Alice would want the state of a multi-dimensional object on the forward pass described with multiple numbers, but what exactly are we assuming she wants that state for? It seems that she has to be doing something with it that isn't just running her own sparser forward pass.

I'm brooding over variations of this at the moment, trying to find something for Alice to do that connects better to what we actually want to do. Maybe she is trying to study the causal traces of some forward passes, but pawned the cost of running those traces off to Bob, and now she wants to get the shortest summary of the traces for her investigation under the constraint that uncompressing the summary shouldn't cost her much compute. Or maybe Alice wants something else. I don't know yet.

This paper claims to sample the Bayesian posterior of NN training, but I think it's wrong.

"What Are Bayesian Neural Network Posteriors Really Like?" (Izmailov et al. 2021) claims to have sampled the Bayesian posterior of some neural networks conditional on their training data (CIFAR-10, MNIST, IMDB type stuff) via Hamiltonian Monte Carlo sampling (HMC). A grand feat if true! Actually crunching Bayesian updates over a whole training dataset for a neural network that isn't incredibly tiny is an enormous computational challenge. But I think they're mistaken and their sampler actually isn't covering the posterior properly.

They find that neural network ensembles trained by Bayesian updating, approximated through their HMC sampling, generalise worse than neural networks trained by stochastic gradient descent (SGD). This would have been incredibly surprising to me if it were true. Bayesian updating is prohibitively expensive for real world applications, but if you can afford it, it is the best way to incorporate new information. You can't do better.[1] 

This is kind of in the genre of a lot of papers and takes that I think used to be around a few years back, which argued that the then still quite mysterious ability of deep learning to generalise was primarily due to some advantageous bias introduced by SGD, or momentum, or something along those lines, in the sense that SGD/momentum/whatever were supposedly diverging from Bayesian updating in a way that was better rather than worse.

I think these papers were wrong, and the generalisation ability of neural networks actually comes from their architecture, which assigns exponentially more weight configurations to simple functions than complex functions. So, most training algorithms will tend to favour making simple updates, and tend to find simple solutions that generalise well, just because there's exponentially more weight settings for simple functions than complex functions. This is what Singular Learning Theory talks about. From an algorithmic information theory perspective, I think this happens for reasons similar to why exponentially more binary strings correspond to simple programs than complex programs in Turing machines.
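
A crude toy illustration of the volume claim (not the SLT formalism; the architecture and sample counts here are arbitrary choices of mine): sample random weights for a tiny MLP over boolean inputs and count how often each boolean function gets realised.

```python
# Crude illustration: sample random weights for a tiny ReLU MLP on all
# 3-bit boolean inputs, record which boolean function each weight draw
# implements, and look at how unevenly the functions are hit.
import numpy as np
from collections import Counter
from itertools import product

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_samples = 3, 8, 20_000
X = np.array(list(product([0.0, 1.0], repeat=n_inputs)))  # all 2^3 inputs

counts = Counter()
for _ in range(n_samples):
    W1 = rng.normal(size=(n_inputs, n_hidden))
    b1 = rng.normal(size=n_hidden)
    w2 = rng.normal(size=n_hidden)
    out = np.maximum(X @ W1 + b1, 0.0) @ w2     # scalar output per input row
    counts[tuple((out > 0).astype(int))] += 1   # the boolean function realised

freqs = sorted(counts.values(), reverse=True)
print("distinct functions hit:", len(freqs), "of", 2 ** (2 ** n_inputs), "possible")
print("share of draws taken by the top 5 functions:", sum(freqs[:5]) / n_samples)
# A handful of very simple functions (e.g. the constant ones) typically
# soak up a large share of the randomly drawn weight configurations.
```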

This picture of neural network generalisation predicts that SGD and other training algorithms should all generalise worse than Bayesian updating, or at best do similarly. They shouldn't do better.

So, what's going on in the paper? How are they finding that neural network ensembles updated on the training data with Bayes rule make predictions that generalise worse than predictions made by neural networks trained the normal way?

My guess: Their Hamiltonian Monte Carlo (HMC) sampler isn't actually covering the Bayesian posterior properly. They try to check that it's doing a good job by comparing inter-chain and intra-chain variance in the functions learned. 

We apply the classic Gelman et al. (1992) “R̂” potential-scale-reduction diagnostic to our HMC runs. Given two or more chains, R̂ estimates the ratio between the between-chain variance (i.e., the variance estimated by pooling samples from all chains) and the average within-chain variance (i.e., the variances estimated from each chain independently). The intuition is that, if the chains are stuck in isolated regions, then combining samples from multiple chains will yield greater diversity than taking samples from a single chain.

They seem to think a good R̂ in function space implies that the chains are doing a good job of covering the important parts of the space. But I don't think that's true. You need to mix in weight space, not function space, because weight space is where the posterior lives. The map from weight space to function space is not bijective; that's why it's even possible for simpler functions to have exponentially more prior than complex functions. So good mixing in function space does not necessarily imply good mixing in weight space, which is what we actually need. The chains could be jumping from basin to basin very rapidly instead of spending more time in the bigger basins corresponding to simpler solutions like they should.
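
For reference, the diagnostic itself is simple; here's a minimal (non-split) version of R̂ that you could compute either on predicted probabilities or on individual weight coordinates, run on toy chains I made up:

```python
# Minimal (non-split) Gelman-Rubin R-hat for a single scalar quantity:
# roughly the ratio of pooled variance to average within-chain variance.
import numpy as np


def r_hat(chains: np.ndarray) -> float:
    """chains: shape (n_chains, n_samples) for one scalar quantity."""
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()     # average within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    pooled = (n - 1) / n * within + between / n
    return float(np.sqrt(pooled / within))


rng = np.random.default_rng(0)
mixed = rng.normal(size=(2, 1000))                             # chains sampling the same thing
stuck = rng.normal(size=(2, 1000)) + np.array([[0.0], [5.0]])  # chains stuck at different modes
print("mixed R-hat:", r_hat(mixed))   # close to 1
print("stuck R-hat:", r_hat(stuck))   # much larger than 1
# Computing this per weight coordinate vs. per test-set prediction is exactly
# the weight-space vs. function-space distinction discussed above.
```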

And indeed, they test their chains' weight-space R̂ values as well, and find that they're much worse:

Figure 2. Log-scale histograms of R̂ convergence diagnostics. Function-space R̂s are computed on the test-set softmax predictions of the classifiers and weight-space R̂s are computed on the raw weights. About 91% of CIFAR-10 and 98% of IMDB posterior-predictive probabilities get an R̂ less than 1.1. Most weight-space R̂ values are quite small, but enough parameters have very large R̂s to make it clear that the chains are sampling from different distributions in weight space.
...
(From section 5.1) In weight space, although most parameters show no evidence of poor mixing, some have very large R̂s, indicating that there are directions in which the chains fail to mix.

...
(From section 5.2) The qualitative differences between (a) and (b) suggest that while each HMC chain is able to navigate the posterior geometry the chains do not mix perfectly in the weight space, confirming our results in Section 5.1.

So I think they aren't actually sampling the Bayesian posterior. Instead, their chains jump between modes a lot and thus unduly prioritise low-volume minima compared to high-volume minima. And those low-volume minima are exactly the kind of solutions we'd expect to generalise poorly.

I don't blame them here. It's a paper from early 2021, back when very few people understood the importance of weight space degeneracy properly aside from some math professor in Japan whom almost nobody in the field had heard of. For the time, I think they were trying something very informative and interesting. But since the paper has 300+ citations and seems like a good central example of the SGD-beats-Bayes genre, I figured I'd take the opportunity to comment on it now that we know so much more about this. 

The subfield of understanding neural network generalisation has come a long way in the past four years.

Thanks to Lawrence Chan for pointing the paper out to me. Thanks also to Kaarel Hänni and Dmitry Vaintrob for sparking the argument that got us all talking about this in the first place.

  1. ^

     See e.g. the first chapters of Jaynes for why.
