jake_mendel — LessWrong

technical AI safety program associate at OpenPhil

Copypasting from a slack thread:

I'll list some work that I think is aspiring to build towards an answer to some of these questions, although lots of it is very toy:

On generalisation vs simple heuristics:
- I think the nicest papers here are some toy model interp papers like progress measures for grokking and Grokked Transformers. I think these are two papers which present a pretty crisp distinction between levels of generality of different algorithms that perform similarly on the training set. In modular addition, there are two levels of algorithm and in the Grokked transformer, there are three. The story of what the models end up doing ends up being pretty nuanced and it ends up coming down to specific details of the pre-training data mix, which maybe isn't that surprising. (If you squint, then you can sort of see predictions in the Grokked transformer being borne out in the nuance about when LLMs can do multi-hop reasoning e.g. Yang et al.) But it seems pretty clear that if the training conditions are right, then you can get increasingly general algorithms learned even when simpler ones would do the trick.
- I also think a useful idea (although less useful so far than the previous bullet) is that in certain situations, the way a model implements memorising circuits can sort of naturally become a generalising circuit once you've memorised enough. The only concrete example of this that I know of (and it's not been empirically validated) is the story from computation in superposition of how a head that memorises a lookup table of features to copy continuously becomes a copying head that generalises to new features once you make the lookup table big enough.
- These are all toy settings where we can be pretty crisp about what we mean by memorisation and generalisation. I think the picture that we're beginning to see emerge is that what counts as memorisation and generalisation is very messy and in the weeds and context-specific, but that transformers can generalise in powerful ways if the pre-training mix is right. What "right" means, and what "generalise in powerful ways" means in situations we care about are still unsolved technical questions.
- Meanwhile, I also think it's useful to just look at very qualitatively surprising examples of frontier models generalising far even if we can't be precise about what memorisation and generalisation mean in that setting. Papers that I think are especially cool on this axis include emergent misalignment, anti-imitation, OOCR, LLMs are aware of their learned behaviors, LLMs are aware of how they're being steered (I think it's an especially interesting and risk-relevant type of generalisation when the policy starts to know something that is only 'shown' to the training process). However, I think it's quite hard to look at these papers and make predictions about future generalisation successes and failures because we don't have any basic understanding of how to talk about generalisation in these settings.
On inductive biases and the speed prior:
- I don't have much to say about how useful the speed prior is at mitigating scheming, but I think there has been some interesting basic science on what the prior implied by neural network training is in practice. Obviously SLT comes to mind, and some people have tried to claim that SLT suggests that neural network training is actually more like the Solomonoff prior than the speed prior (e.g. Bushnaq), although I think that work is pretty shaky and may well not hold up.
- I think something that's missing from both the speed prior and the Solomonoff prior is a notion of learnability: the reason we have eyes and not cameras is not because eyes have lower K-complexity or lower Kt-complexity than cameras. It's because there is a curriculum for learning eyes and there (probably) isn't for cameras; neural network training also requires a training story/learnability. All the work that I know of exploring this is in very toy settings (low hanging fruit prior, leap complexity and the sparse parity problem). I don't think any of these results are strong enough to make detailed claims about p(deception) yet, and they don't seem close.
- OTOH, most learning in the future might be more like current in-context learning, and (very speculatively) it seems possible that in-context learning is more Bayesian (less path dependent/learnability dependent) than pre-training. See e.g. Riechers et al.
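
For reference, a sketch of the standard definitions behind the two complexity measures named in the bullets above. The notation ($U$ for a fixed universal machine, $p$ for a program, $|p|$ for its length in bits, $t$ for its running time) is my own choice and is not taken from the thread:

$$K(x) = \min_{p} \{\, |p| : U(p) = x \,\}$$

$$Kt(x) = \min_{p,\,t} \{\, |p| + \log_2 t : U(p) = x \text{ within } t \text{ steps} \,\}$$

Very roughly, a Solomonoff-style prior weights a hypothesis like $2^{-K}$ and a speed-style prior like $2^{-Kt}$. Both score only the finished program: neither term depends on whether a sequence of incremental improvements leads to it, which is the learnability gap described above.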

Some random thoughts on what goals powerful AIs will have more generally:

I think we've seen some decent evidence that a lot of training with RL makes models obsessed with completing tasks. I think the main evidence here comes from the reward hacking results, but I also think the Apollo anti-scheming paper is an important result about how strong/robust this phenomenon is. Despite a reasonably concerted effort to train the model to care about something that’s in tension with task completion (being honest), the RLVR instilled such a strong preference/heuristic for task completion that even though the deliberative alignment training process doesn't reward task completion at all and only rewards honesty (in fact by the end of anti-scheming training the model does the honest thing in every training environment!), the model still ends up wanting to complete tasks enough to deceive, etc in test environments. I don't think it was super obvious a priori that RL would embed task completion preference that strongly (overriding the human prior).
- I think there are some other lessons that we could glean here about stickiness of goals in general. The anti-scheming results suggest to me that something about the quantity and diversity of the RLVR environments internalised task completion preference deeply enough that it was still present after a full round of training meant to disincentivize it. Contrast this with results that show safety training is very shallow and can be destroyed easily (e.g. badllama).
- Very speculatively, I'm excited about the growing field of teaching models fake facts and trying to work out when they actually believe the fake facts. It seems possible that some techniques and ideas that were developed in order to get models to internalise beliefs deeply (and evaluating success) could be coopted for getting models to internalise preferences/goals deeply (and evaluating success).
Pretty obvious point, but I think the existence of today's models and the relatively slow progress to human-level intelligence tells us that insofar as future AIs will end up misaligned, their goals will pretty likely be similar to/indistinguishable from human values at a low level of detail, and it's only when you zoom in that the values would be importantly different from humans'. Of course, this might be enough to kill us. This echoes the sense in which human values are importantly different from inclusive genetic fitness but not that different, and we do still have lots of kids etc. To spell out the idea: before the AI is smart enough to fully subvert training and guard its goals, we will have lots of ability to shape what goals it ends up with. At some point, if we fail to solve alignment, we will not be able to further refine its goals, but the goals it will end up guarding will be quite related to human goals because they were formed by reward signals that did sort of touch the goal. Again, maybe this is obvious to everyone, but it does seem, at least to me, to be in contrast with references to squiggles/paperclips, which I think are more feasible to end up with if you imagine Brain In A Basement style takeoff.

What are solenoidal flux corrections in this context?

Thanks for this post!

Caveat: I haven't read this very closely yet, and I'm not an economist. I'm finding it hard to understand why you think it's reasonable to model an increase in capabilities by an increase in number of parallel copies. That is: in the returns to R&D section, you look at data on how increasing numbers of human-level researchers in AI affect algorithmic progress, but we have ~no data on what happens when you sample researchers from a very different (and superhuman) capability profile. It seems to me entirely plausible that a few months into the intelligence explosion, the best AI researchers are qualitatively superintelligent enough that their research advances per month aren't the sort of thing that could be done by ~any number of humans^[1] acting in parallel in a month. I acknowledge that this is probably not tractable to model, but that seems like a problem because it seems to me that this qualitative superintelligence is a (maybe the) key driving force of the intelligence explosion.

Some intuition pumps for why this seems reasonably likely:

My understanding is that historians of science disagree on whether science is driven mostly by a few geniuses or not. It probably varies by discipline, and by how understanding-driven progress is. Compared to other fields in hard STEM, ML is probably less understanding-driven right now, but it is still relatively understanding-driven. I think there are good reasons to think that it could plausibly transition to being more understanding-driven when the researchers become superhuman, because interp, agent foundations, GOFAI etc haven't made zero progress and don't seem fundamentally impossible to me. And if capabilities research becomes loaded on understanding very complicated things, then it could become extremely dependent on quite how capable the most capable researchers are in a way that can't easily be substituted for by more human-level researchers.
Suppose I take a smart human and give them the ability/bandwidth to memorise and understand the entire internet. That person would be really different to any normal human, and also really different to any group of humans. So when they try to do research, they approach the tree of ideas to pick the low hanging fruit from a different direction to all of society's research efforts beforehand, so it seems possible that from their perspective there is a lot of low hanging fruit left on the tree — lots of things that seem easy from their vantage point and nearly impossible to grasp from our perspective^[2]. And, research into how much diminishing returns we've seen to ideas in the field is not useful for predicting how much research progress that enhanced human would make in their first year.
- It seems hard to know quite how many angles of approach there are on the tree of ideas, but it seems possible to me that on more than one occasion when you build a new AI that is now the most intelligent being in the world, it starts doing research and finds many ideas that are easy for it and near impossible for all the beings in the world that came before it.

^[1] or at least only by an extremely large number of humans, who are doing something more like brute force search and less like thinking

^[2] This is basically the same idea as Dwarkesh's point that a human-level LLM should be able to make all sorts of new discoveries by connecting dots that humans can't connect because we can't read and take in the whole internet.

I think the biggest alignment-relevant update is that I expected RL fine-tuning over longer horizons (or even model-based RL a la AlphaZero) to be a bigger deal. I was really worried about it significantly improving performance and making alignment harder. In 2018-2019 my mainline picture was more like AlphaStar or AlphaZero, with RL fine-tuning being the large majority of compute. I've updated about this and definitely acknowledge I was wrong.[3] I don't think it totally changes the picture though: I'm still scared of RL, I think it is very plausible it will become more important in the future, and think that even the kind of relatively minimal RL we do now can introduce many of the same risks.

Curious to hear how you would revisit this prediction in light of reasoning models? Seems like you weren't as wrong as you thought a year ago, but maybe you still think there are some key ways your predictions about RL finetuning were off?

This was really useful to read. Thanks very much for writing these posts!

Very happy you did this!

Do you have any idea about whether the difference between unlearning success on synthetic facts fine-tuned in after pretraining vs real facts introduced during pretraining comes mainly from the 'synthetic' part or the 'fine-tuning' part? I.e. if you took the synthetic facts dataset and spread it out through the pretraining corpus, do you expect it would be any harder to unlearn the synthetic facts? Or maybe this question doesn't make sense because you'd have to make the dataset much larger or something to get it to learn the facts at all during pretraining? If so, it seems like a pretty interesting research question to try to understand which properties a dataset of synthetic facts needs to have to defeat unlearning.

Fair point. I guess I still want to say that there's a substantial amount of 'come up with new research agendas' (or like sub-agendas) to be done within each of your bullet points, but I agree the focus on getting trustworthy slightly superhuman AIs and then not needing control anymore makes things much better. I also do feel pretty nervous about some of those bullet points as paths to placing so much trust in your AI systems that you don't feel like you want to bother controlling/monitoring them anymore, and the ones that seem further towards giving me enough trust in the AIs to stop control are also the ones that seem to have the most very open research questions (eg EMs in the extreme case). But I do want to walk back some of the things in my comment above that apply only to aligning very superintelligent AI.

If you are (1) worried about superintelligence-caused x-risk and (2) have short timelines to both TAI and ASI, it seems like the success or failure of control depends almost entirely on getting the early TAIs to do stuff like "coming up with research agendas"? Like, most people (in AIS) don't seem to think that unassisted humans are remotely on track to develop alignment techniques that work for very superintelligent AIs within the next 10 years: we don't really even have any good ideas for how to do that that haven't been tested. Therefore if we have very superintelligent AIs within the next 10 years (eg 5y till TAI and 5y of RSI), and if we condition on having techniques for aligning them, then it seems very likely that these techniques depend on novel ideas and novel research breakthroughs made by AIs in the period after TAI is developed. It's possible that most of these breakthroughs are within mechinterp or similar, but that's a pretty loose constraint, and 'solve mechinterp' is really not much more of a narrow, well-scoped goal than 'solve alignment'. So it seems like optimism about control rests somewhat heavily on optimism that controlled AIs can safely do things like coming up with new research agendas.

[edit: I'm now thinking that actually the optimal probe vector is also orthogonal to $\text{span}\{\vec v_j \mid j \neq i\}$, so maybe the point doesn't stand. In general, I think it is probably a mistake to talk about activation vectors as linear combinations of feature vectors, rather than as vectors that can be projected into a set of interpretable readoff directions. See here for more.]

Yes, I'm calling the representation vector the same as the probing vector. Suppose my activation vector can be written as $\vec a = \sum_i f_i \vec v_i$ where $f_i$ are feature values and $\vec v_i$ are feature representation vectors. Then the probe vector which minimises MSE (explains most of the variance) is just $\vec v_i$. To avoid off target effects, the vector $\vec s_i$ you want to steer with for feature $i$ might be the vector that is most 'surgical': it only changes the value of this feature and no other features are changed. In that case it should be the vector that lies orthogonal to $\text{span}\{\vec v_j \mid j \neq i\}$, which is only the same as $\vec v_i$ if the set $\{\vec v_i\}$ are orthogonal.

Obviously I'm working with a non-overcomplete basis of feature representation vectors here. If we're dealing with the overcomplete case, then it's messier. People normally talk about 'approximately orthogonal vectors', in which case the most surgical steering vector $\vec s_i \approx \vec v_i$, but (handwaving) you can also talk about something like 'approximately linearly independent vectors', in which case my point stands I think (note that SAE decoder directions are definitely not approximately orthogonal). For something less handwavey see this appendix.
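
A minimal NumPy sketch of the geometric construction in the non-overcomplete case, assuming features are read off by taking dot products with the $\vec v_j$. The function name `surgical_direction` and the toy matrix `V` are made up for illustration. It only shows the geometry (the vector orthogonal to the span of the other feature vectors) and takes no position on which vector is the MSE-optimal probe.

```python
import numpy as np

def surgical_direction(V, i):
    """Component of V[i] orthogonal to span{V[j] : j != i}.

    V is an (n_features, d) array whose rows are assumed to be
    linearly independent feature representation vectors.
    """
    others = np.delete(V, i, axis=0)  # the other feature vectors, shape (n-1, d)
    # Least-squares coefficients of V[i] in the span of the other rows
    coeffs, *_ = np.linalg.lstsq(others.T, V[i], rcond=None)
    # Subtract the projection, leaving the orthogonal component
    return V[i] - others.T @ coeffs

# Toy example: three linearly independent but non-orthogonal feature vectors
V = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

s0 = surgical_direction(V, 0)

# Steering along s0 leaves the dot products with the other features unchanged
print(V[1:] @ s0)

# s0 is not parallel to V[0] because the rows of V are not orthogonal
cos = (s0 @ V[0]) / (np.linalg.norm(s0) * np.linalg.norm(V[0]))
print(cos)  # about 0.577 for this V
```

With an orthogonal set such as `np.eye(3)`, the same function returns the original feature vector, matching the claim that the two only coincide when the $\vec v_i$ are orthogonal.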
