Shouldn't e.g. existing theoretical ML results also inform (at least somewhat, and conditioned on assumptions about architecture, etc.) p(scheming)? E.g. I provide an argument / a list of references on why one might expect a reasonable likelihood that 'AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass' here.
I think this line of reasoning plausibly also suggests trying to differentially advance architectures / 'scaffolds' / 'scaffolding' techniques which are less likely to result in scheming.
Quoting from @zhanpeng_zhou's latest work - Cross-Task Linearity Emerges in the Pretraining-Finetuning Paradigm: 'i) Model averaging takes the average of weights of multiple models, which are finetuned on the same dataset but with different hyperparameter configurations, so as to improve accuracy and robustness. We explain the averaging of weights as the averaging of features at each layer, building a stronger connection between model averaging and logits ensemble than before. ii) Task arithmetic merges the weights of models, that are finetuned on different tasks, via simple arithmetic operations, shaping the behaviour of the resulting model accordingly. We translate the arithmetic operation in the parameter space into the operations in the feature space, yielding a feature-learning explanation for task arithmetic. Furthermore, we delve deeper into the root cause of CTL and underscore the impact of pretraining. We empirically show that the common knowledge acquired from the pretraining stage contributes to the satisfaction of CTL. We also take a primary attempt to prove CTL and find that the emergence of CTL is associated with the flatness of the network landscape and the distance between the weights of two finetuned models. In summary, our work reveals a linear connection between finetuned models, offering significant insights into model merging/editing techniques. This, in turn, advances our understanding of underlying mechanisms of pretraining and finetuning from a feature-centric perspective.'
Haven't read it as deeply as I'd like to, but Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models seems like potentially significant progress towards formalizing / operationalizing (some of) the above.
It seems to me like there might be a more general insight to draw here, something along the following line. As long as we're still in the current paradigm, where most model capabilities (including undesirable ones) come from pre-training, and they'd (mostly) only get “wrapped" by fine-tuning the model, (with appropriate prompting - or even other elicitation tools) the pre-trained models can serve as "model organisms" for about any elicitable misbehavior.
I tried to write one story here. Notably, activation vectors don't need to scale all the way to superintelligence, e.g. them scaling up to ~human-level automated AI safety R&D would be ~enough.
Also, the ToI could be disjunctive too, so it doesn't have to be only one of those necessarily.
'theoretical paper of early 2023 that I can't find right now' -> perhaps you're thinking of Fundamental Limitations of Alignment in Large Language Models? I'd also recommend LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?.
I've been / am on the lookout for related theoretical results of why grounding a la Grounded language acquisition through the eyes and ears of a single child works (e.g. with contrastive learning methods) - e.g. some recent works: Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP, Contrastive Learning is Spectral Clustering on Similarity Graph, Optimal Sample Complexity of Contrastive Learning; (more speculatively) also how it might intersect with alignment, e.g. if alignment-relevant concepts might be 'groundable' in fMRI data (and then 'pointable to') - e.g. https://medarc-ai.github.io/mindeye/ uses contrastive learning with fMRI - image pairs
This seems pretty good for safety (as RAG is comparatively at least a bit more transparent than fine-tuning): https://twitter.com/cwolferesearch/status/1752369105221333061
What follows will all be pretty speculative, but I still think should probably provide some substantial evidence for more optimism.
I think that we basically have no way of ensuring that we get this nice “goals based on pointers to the correct concepts”/corrigible alignment thing using behavioral training. This seems like a super specific way to set up the AI, and there are so many degrees of freedom that behavioral training doesn’t distinguish.
The results in Robust agents learn causal world models suggest that robust models (to distribution shifts; arguably, this should be the case for ~all substantially x-risky models) should converge towards learning (approximately) the same causal world models. This talk suggests theoretical reasons to expect that the causal structure of the world (model) will be reflected in various (activation / rep engineering-y, linear) properties inside foundation models (e.g. LLMs), usable to steer them.
For the Representation Engineering thing, I think the “workable” version of this basically looks like “Retarget the Search”, where you somehow do crazy good interp and work out where the “optimizer” is, and then point that at the right concepts which you also found using interp. And for some reason, the AI is set up such that you can "retarget it" with breaking everything.
I don't think the "optimizer" ontology necessarily works super-well with LLMs / current SOTA (something like simulators seems to me much more appropriate); with that caveat, e.g. In-Context Learning Creates Task Vectors and Function Vectors in Large Language Models (also nicely summarized here), A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity seem to me like (early) steps in this direction already. Also, if you buy the previous theoretical claims (of convergence towards causal world models, with linear representations / properties), you might quite reasonably expect such linear methods to potentially work even better in more powerful / more robust models.
I definitely don’t expect to be able to representation engineer our way into building an AI that is corrigible aligned, and remains that way even when it is learning a bunch of new things and is in very different distributions. (I do think that actually solving this problem would solve a large amount of the alignment problem).
The activation / representation engineering methods might not necessarily need to scale that far in terms of robustness, especially if e.g. you can complement them with more control-y methods / other alignment methods / Swiss cheese models of safety more broadly; and also plausibly because they'd "only" need to scale to ~human-level automated alignment researchers / scaffolds of more specialized such automated researchers, etc. And again, based on the above theoretical results, future models might actually be more robustly steerable 'by default' / 'for free'.