This motivates research into LLM inductive biases
I believe there's a lot of existing ML research into inductive bias in neural networks...
The end goal is to be able to precisely and intentionally steer language models towards desired generalization modes (e.g. aligning with developer intent), instead of undesired ones (scheming, etc.)
...but my understanding (without really being familiar with that literature) was that 'inductive bias' is generally talking about a much lower level of abstraction than ideas like 'scheming'.
I'm interested in whether my understanding is wrong, vs. you using 'inductive bias' as a metaphor for this broader sort of generalization, vs. you believing that high-level properties like 'scheming' or 'alignment with developer intent' can be cashed out in a way that's amenable to low-level inductive-bias analysis.
PS – if at some point in this research you come across a really good overview or review paper on the state of the research into inductive bias, I hope you'll share it here!
'inductive bias' is generally talking about a much lower level of abstraction than ideas like 'scheming'.
Yes, I agree with this - and I'm mainly interested in developing an empirical science of generalization that grapples much more directly with the emergent propensities we care about, which is why I try not to use the term 'inductive bias'.
OTOH, 'generalization' is in some sense the entire raison d'être of the ML field. So I think it's useful to draw on diverse sources of inspiration to inform this science. E.g.
So I'm pretty open to absorbing / considering ideas from the broader ML literature. It seems right that a 'more emergent' framework of generalization will have to be consistent with the theories of generalization proposed for simpler phenomena. But it should also meaningfully expand on those to directly answer questions we care about re: AI safety risks.
I think this is a super cool direction! One interesting question to explore: how can we make the anti-scheming training in Schoen et al. generalize further? They deliberately train on a narrow distribution and evaluate on a wider one, and it seems like deliberative alignment generalized fairly well. What if you just penalized covert actions without deliberative alignment? What if you tried character training to make the model not be covert? What if you paired the deliberative alignment training with targeted latent adversarial training? (More ambitious) what if you did the deliberative alignment earlier, before you did all this terrible RL training on environments that made the model scheme-y?
It seems possible that the best alignment techniques (i.e., ways to train the model to be good) will still look something like present-day techniques by the time we get superhuman-coder-level AI. If so, someone should at minimum really evaluate the various techniques and see how well they generalize.
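To make "evaluate the various techniques" a bit more concrete, here is a rough sketch of what such a comparison harness could look like. The intervention names and the train_model / eval_covert_rate helpers are hypothetical placeholders I'm inventing for illustration, not anything from Schoen et al.

```python
# Hypothetical sketch: compare how far different anti-scheming interventions
# generalize from a narrow training distribution to a wider eval distribution.
from dataclasses import dataclass, field

@dataclass
class Intervention:
    name: str
    config: dict = field(default_factory=dict)

INTERVENTIONS = [
    Intervention("penalize_covert_actions_only"),
    Intervention("deliberative_alignment", {"deliberative_spec": True}),
    Intervention("character_training", {"persona": "non-covert assistant"}),
    Intervention("deliberative_alignment_plus_lat",
                 {"deliberative_spec": True, "latent_adversarial_training": True}),
]

def compare_generalization(train_model, eval_covert_rate, narrow_train, wide_eval):
    """train_model(dataset, **config) -> model and eval_covert_rate(model, dataset)
    -> float are stand-ins for whatever training/eval pipeline you already have."""
    results = {}
    for iv in INTERVENTIONS:
        model = train_model(narrow_train, **iv.config)
        results[iv.name] = {
            "covert_rate_in_dist": eval_covert_rate(model, narrow_train),
            "covert_rate_ood": eval_covert_rate(model, wide_eval),
        }
    return results
```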
A distillation of my long-term research agenda and current thinking. I welcome takes on this.
Why study generalization?
I'm interested in studying how LLMs generalise: when presented with multiple policies that achieve similar loss, which ones tend to be learned by default?
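To illustrate the question with a deliberately tiny toy (mine, not a claim about LLMs): two cues both fit the training data, and an out-of-distribution probe reveals which 'policy' the learner actually picked up.

```python
# Toy illustration: two cues ("policies") achieve similar training loss;
# probing where they disagree shows which one was learned by default.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
cue_a = y + rng.normal(scale=0.3, size=n)    # slightly noisy cue
cue_b = y + rng.normal(scale=0.05, size=n)   # cleaner cue
X_train = np.stack([cue_a, cue_b], axis=1)

model = LogisticRegression().fit(X_train, y)

# OOD probe: the cues now disagree. Which one does the model follow?
probe = np.array([[1.0, 0.0],    # cue_a says "1", cue_b says "0"
                  [0.0, 1.0]])   # cue_a says "0", cue_b says "1"
print(model.predict(probe))      # typically [0, 1]: it follows the cleaner cue
print(model.coef_)               # most of the weight sits on cue_b
```

The interesting (and much harder) version of this question is about which high-level policies an LLM picks up from a training signal, not which feature a linear model weights.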
I claim this is pretty important for AI safety:
This motivates research into LLM inductive biases. Or as I'll call them from here on, 'generalization propensities'.
I have two high-level goals:
Defining "generalization propensity"
To study generalization propensities, we need two things:
I define a generalization propensity eval (GPE) as a way to measure how models generalise OOD from a weak supervision signal. Minimally, this consists of a bundled pair: a narrow training signal and an object-level trait eval. My go-to example is emergent misalignment and other types of misalignment generalization. Obviously it's good to get as close as possible to the kinds of misaligned policies outlined above.
I define a training-time intervention as any modification to the training process that could change an LLM's inductive biases. This includes things like character training, filtering the pretraining data, conditional pretraining, gradient routing, and inoculation prompting, among others (a rough sketch of how this fits together with a GPE follows below).
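As a sketch of how these two ingredients could be packaged together: everything below is hypothetical naming on my part, with inoculation prompting as the illustrative intervention.

```python
# Hypothetical sketch: a GPE bundles a narrow training signal with an
# object-level trait eval; a training-time intervention is any transform
# of the training process whose effect on the trait we want to measure.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class GPE:
    narrow_train_set: List[dict]            # e.g. insecure-code completions
    trait_eval: Callable[[object], float]   # e.g. broad misalignment score

def inoculation_prompting(example: dict, inoculation: str) -> dict:
    """Example intervention: prepend an instruction that contextualizes the
    narrow behaviour during training."""
    return {**example, "prompt": inoculation + "\n\n" + example["prompt"]}

def measure_propensity(gpe: GPE,
                       finetune: Callable[[List[dict]], object],
                       intervention: Optional[Callable[[dict], dict]] = None) -> float:
    """finetune is a stand-in for the actual training pipeline. Returns how
    strongly the broad trait emerges after training on the narrow signal."""
    data = gpe.narrow_train_set
    if intervention is not None:
        data = [intervention(ex) for ex in data]
    model = finetune(data)
    return gpe.trait_eval(model)

# Usage: measure_propensity(gpe, finetune) vs.
# measure_propensity(gpe, finetune,
#     intervention=lambda ex: inoculation_prompting(ex, "<contextualizing instruction>"))
```

The point of factoring it this way is that the same GPE can score many different training-time interventions.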
Research questions
Some broad and overlapping things I'm interested in are:
The end goal is to be able to precisely and intentionally steer language models towards desired generalization modes (e.g. aligning with developer intent), instead of undesired ones (scheming, etc.)