Agree, I do mostly discuss LLMs, but I think there's significant overlap in aligning LLMs and LMAs.
Also agree that LMAs could scale all the way, but I also think that once you get ~human-level automated alignment research, its likely applicability to other types of systems (beyond LMAs and LLMs) should still be a nice bonus.
...So once an AI system trained end-to-end can produce as much value per token as a human researcher can produce per second, AI research will be more than fully automated. This means that, when AI first contributes more to AI research than humans do, the average research progress produced by 1 token of output will be significantly less than what an average human AI researcher produces in a second of thinking[6]. Instead, the collective’s intelligence will largely come from a combination of things like:
- Individual systems “thinking” for a long time, churning
One somewhat obvious thing to do with ~human-level systems with "initial loose alignment" is to use them to automate alignment research (e.g. the superalignment plan). I think this kind of two-step plan is currently the best we have, and probably by quite some margin. Many more details on why I believe this are in these slides and in this AI safety camp '24 proposal.
This (both the theoretical and empirical results) also seems relevant to watermarking schemes: Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models; blogpost.
More reasons to believe that studying empathy in rats (which should be much easier than in humans, both for e.g. IRB reasons and because their smaller brains make it easier to get whole connectomes, etc.) could generalize to how it works in humans and help with validating/implementing it in AIs (I'd bet one can already find something like computational correlates in e.g. GPT-4, and the correlation will get larger with scale a la https://arxiv.org/abs/2305.11863): https://twitter.com/e_knapska/status/1722194325914964036
You might be interested in this AI safety camp '23 project I proposed of fine-tuning LMs on fMRI data and in some of the linkposts I've published on LW, including e.g. The neuroconnectionist research programme, Scaling laws for language encoding models in fMRI and Mapping Brains with Language Models: A Survey. Personally, I'm particularly interested in low-res uploads for automated alignment research, e.g. to plug into something like the superalignment plan (I have some shortform notes on this).
I don't know if I should be surprised by CoT not helping that much on MMLU; MMLU doesn't seem to require [very] long chains of inference? In contrast, I expect takeover plans would. Somewhat related, my memory is that CoT seemed very useful for Theory of Mind (necessary for deception, which seems like an important component of many takeover plans), but the only reference I could find quickly is https://twitter.com/Shima_RM_/status/1651467500356538368.
I'll note that there's actually a lot of evidence (especially theoretical) on the need for scratchpad/CoT and how it leads to much higher expressivity, both for Transformers and (conjectured) more generally for any parallelizable architecture (crucial for efficient training); to the point that I think we should expect this to hold in the future too with significantly >50% probability, probably >90%. See e.g. The Parallelism Tradeoff: Limitations of Log-Precision Transformers, Auto-Regressive Next-Token Predictors are Universal Learners, Chain of Thou...
One significant worry here would be that bounds from (classical) learning theory seem to be pretty vacuous most of the time. But I'm excited about comparing brains and learning algorithms; see also the many empirical papers in this area.
As others have hinted at/pointed out in the comments, there is an entire science of deep learning out there, including on high-level (vs. e.g. most of low-level mech interp) aspects that can be highly relevant to alignment and that you seem to either be unaware of or dismiss. E.g. follow the citation trail of An Explanation of In-context Learning as Implicit Bayesian Inference.
Some of Nate’s quick thoughts (paraphrased), after chatting with him:
Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-me...
Haven't read in detail, but Fig. 2 seems to me to support the exciting claim (also because these are overparameterized models with 70k trainable parameters)?
Agree that it doesn't imply caring for. But I think that, given accumulating evidence for human-like representations of multiple non-motivational components of affect, one should also update at least a bit on the likelihood of finding / incentivizing human-like representations of the motivational component(s) too (see e.g. https://en.wikipedia.org/wiki/Affect_(psychology)#Motivational_intensity_and_cognitive_scope).
But my point isn't just that the AI is able to produce ratings similar to humans' for aesthetics, etc., but that it also seems to do so through computational mechanisms at least partially overlapping with humans', as the comparisons to fMRI data suggest.
Eliezer (among others in the MIRI mindspace) has this whole spiel about human kindness/sympathy/empathy/prosociality being contingent on specifics of the human evolutionary/cultural trajectory, e.g. https://twitter.com/ESYudkowsky/status/1660623336567889920 and about how gradient descent is supposed to be nothing like that https://twitter.com/ESYudkowsky/status/1660623900789862401. I claim that the same argument (about evolutionary/cultural contingencies) could be made about e.g. image aesthetics/affect, and this hypothesis should lose many Bayes points wh...
Yes, roughly (the next comment is supposed to make the connection clearer, though also more speculative); RLHF / supervised fine-tuned models would correspond to 'more mode-collapsed' / narrower mixtures of simulacra here (in the limit of mode collapse, one fine-tuned model = one simulacrum).
Even more speculatively, in-context learning (ICL) as Bayesian model averaging (especially section 4.1) and ICL as gradient descent fine-tuning with weight - activation duality (see e.g. first figures from https://arxiv.org/pdf/2212.10559.pdf and https://www.lesswrong.com/posts/firtXAWGdvzXYAh9B/paper-transformers-learn-in-context-by-gradient-descent) could be other ways to try and link activation engineering / Inference-Time Intervention and task arithmetic. Though also see skepticism about the claims of the above ICL as gradient descent papers, including...
Related: Language is more abstract than you think, or, why aren't languages more iconic? argues that abstract concepts (like 'cooperation', I'd say) are naturally grounded in language; Brain embeddings with shared geometry to artificial contextual embeddings, as a code for representing language in the human brain.
Here are a couple of experiments which could go towards making the link between activation engineering and interpolating between different simulacra: check LLFC (whether adding the activations of the different models works) on the RLHF fine-tuned models from Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards; alternatively, do this for the supervised fine-tuned models from section 3.3 of Exploring the Benefits of Training Expert Language Models over Instruction Tuning, where they show LMC for supervised fine-tuning of LLMs.
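To make the LLFC property being tested concrete, here's a minimal numpy sketch. For a single linear layer the property holds exactly; in real fine-tuned LLMs it would only hold approximately, and measuring how approximately is the point of the experiment. All names and shapes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two fine-tuned models' weight matrices
# (in practice: checkpoints fine-tuned on different rewards/tasks).
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(4, 8))
x = rng.normal(size=8)  # a shared input

def f(W, x):
    # toy "model": a single linear layer
    return W @ x

alpha = 0.3
# LLFC check: does the weight-interpolated model's output match the
# interpolation of the two models' outputs?
out_interp_weights = f(alpha * W1 + (1 - alpha) * W2, x)
out_interp_outputs = alpha * f(W1, x) + (1 - alpha) * f(W2, x)

print(np.allclose(out_interp_weights, out_interp_outputs))
```

For deep nonlinear networks one would instead report how small the gap between the two quantities is across inputs and interpolation coefficients.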
Great work and nice to see you on LessWrong!
Minor correction: 'making the link between activation engineering and interpolating between different simulators' -> 'making the link between activation engineering and interpolating between different simulacra' (referencing Simulators, Steering GPT-2-XL by adding an activation vector, Inference-Time Intervention: Eliciting Truthful Answers from a Language Model).
Contrastive methods could be used both to detect common latent structure across animals, recording sessions, and multiple species (https://twitter.com/LecoqJerome/status/1673870441591750656) and to e.g. look for which parts of an artificial neural network do what a specific brain area does during a task, assuming shared inputs (https://twitter.com/BogdanIonutCir2/status/1679563056454549504).
And there are theoretical results suggesting some latent factors can be identified using multimodality (all of the following could be interpretable as different modalities - mul...
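As a toy illustration of the "shared latent structure" setup (not a contrastive method itself, just a linear stand-in on simulated data): if two recording sessions are driven by the same latent factors, a simple linear map between them should explain most of the variance. All shapes and noise levels are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_latent, d1, d2 = 200, 3, 10, 12

# Shared latent factors driving two hypothetical "recording sessions"
z = rng.normal(size=(n, d_latent))
X1 = z @ rng.normal(size=(d_latent, d1)) + 0.01 * rng.normal(size=(n, d1))
X2 = z @ rng.normal(size=(d_latent, d2)) + 0.01 * rng.normal(size=(n, d2))

# Least-squares linear map from session 1 to session 2; high R^2
# indicates shared latent structure (a crude linear stand-in for what
# contrastive methods detect nonlinearly).
W, *_ = np.linalg.lstsq(X1, X2, rcond=None)
pred = X1 @ W
r2 = 1 - ((X2 - pred) ** 2).sum() / ((X2 - X2.mean(0)) ** 2).sum()
print(r2 > 0.9)
```

In real data the map would be fit on held-out trials, and contrastive objectives would replace the least-squares fit.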
(As reply to Zvi's 'If someone was founding a new AI notkilleveryoneism research organization, what is the best research agenda they should look into pursuing right now?')
LLMs seem to represent meaning in a pretty human-like way and this seems likely to keep getting better as they get scaled up, e.g. https://arxiv.org/abs/2305.11863. This could make getting them to follow the commonsense meaning of instructions much easier. Also, similar methodologies to https://arxiv.org/abs/2305.11863 could be applied to other alignment-adjacent domains/tasks, e.g. moral...
Change my mind: outer alignment will likely be solved by default for LLMs. Brain-LM scaling laws (https://arxiv.org/abs/2305.11863) + LM embeddings as model of shared linguistic space for transmitting thoughts during communication (https://www.biorxiv.org/content/10.1101/2023.06.27.546708v1.abstract) suggest outer alignment will be solved by default for LMs: we'll be able to 'transmit our thoughts', including alignment-relevant concepts (and they'll also be represented in a [partially overlapping] human-like way).
From a (somewhat) related proposal (from footnote 1): 'My proposal is simple. Are you developing a method of interpretation or analyzing some property of a trained model? Don’t just look at the final checkpoint in training. Apply that analysis to several intermediate checkpoints. If you are finetuning a model, check several points both early and late in training. If you are analyzing a language model, MultiBERTs, Pythia, and Mistral provide intermediate checkpoints sampled from throughout training on masked and autoregressive language models, respectively....
Here's a reference you might find relevant: Social value at a distance: Higher identification with all of humanity is associated with reduced social discounting.
AIs could have representations of human values without being motivated to pursue them; also, their representations could be a superset of human representations.
(In practice, I do think having overlapping representations with human values likely helps, for reasons related to e.g. Predicting Inductive Biases of Pre-Trained Models and Alignment with human representations supports robust few-shot learning.)
Yes, there are similar results in a bunch of other domains, including vision, see for a review e.g. The neuroconnectionist research programme.
I wouldn't interpret this as necessarily limiting the space of AI values, but rather (somewhat conservatively) as shared (linguistic) features between humans and AIs, some/many of which are probably relevant for alignment.
Yes, predictive processing as the reason behind related representations has been the interpretation in a few papers, e.g. The neural architecture of language: Integrative modeling converges on predictive processing. There's also some pushback against this interpretation though, e.g. Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data.
There are some papers suggesting this could indeed be the case, at least for language processing e.g. Shared computational principles for language processing in humans and deep language models, Brain embeddings with shared geometry to artificial contextual embeddings, as a code for representing language in the human brain.
Seems very related: Linear Spaces of Meanings: Compositional Structures in Vision-Language Models. Notably, the (approximate) compositionality of language/reality should bode well for the scalability of linear activation engineering methods.
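A trivial sketch of why approximate linearity would bode well for scalability: if concepts correspond to directions in activation space, steering vectors compose additively, so edits for different concepts can be stacked without interfering in order. The "sentiment"/"formality" directions below are hypothetical random vectors, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Hypothetical steering directions (in practice: e.g. differences of
# mean activations between contrastive prompt pairs, one per concept).
v_sentiment = rng.normal(size=d)
v_formality = rng.normal(size=d)
h = rng.normal(size=d)  # a residual-stream activation

# Under (approximate) linearity, steering vectors compose additively
# and the order of application doesn't matter:
steered_a = (h + v_sentiment) + v_formality
steered_b = (h + v_formality) + v_sentiment
print(np.allclose(steered_a, steered_b))
```

The empirical question, of course, is how far real LLM activations deviate from this idealized linear picture.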
Also, this translation function might be simple w.r.t. human semantics, based on current evidence about LLMs: https://www.lesswrong.com/posts/rjghymycfrMY2aRk5/llm-cognition-is-probably-not-human-like?commentId=KBpfGY3uX8rDJgoSj
The (overlapping) evidence from Deep learning models might be secretly (almost) linear could also be useful / relevant, as well as these 2 papers on 'semantic differentials' and (contextual) word embeddings: SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings, Semantic projection recovers rich human knowledge of multiple object features from word embeddings.
Here's a related conceptual framework and some empirical evidence which might go towards explaining why the other activation vectors work (and perhaps would predict your proposed vector should work).
In Language Models as Agent Models, Andreas makes the following claims (conceptually very similar to Simulators):
'(C1) In the course of performing next-word prediction in context, current LMs sometimes infer approximate, partial representations of the beliefs, desires and intentions possessed by the agent that produced the context, and other agents mentioned wi...
Here's one potential reason why this works and a list of neuroscience papers which empirically show linearity between LLMs and human linguistic representations.
Here goes (I've probably still missed some papers, but the most important ones are probably all here):
Brains and algorithms partially converge in natural language processing
Shared computational principles for language processing in humans and deep language models
Deep language algorithms predict semantic comprehension from brain activity
The neural architecture of language: Integrative modeling converges on predictive processing (video summary); though maybe also see Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Mod...
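For reference, the shared methodology across most of these papers is a linear encoding model: ridge-regress brain responses onto LLM embeddings of the stimuli and evaluate held-out correlation per voxel/electrode. Here's a self-contained sketch on simulated data (all shapes, the noise level, and the regularization strength are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n_train, n_test, d_llm, n_voxels = 300, 100, 20, 5

# Simulated data: LLM embeddings of stimuli, and "brain responses"
# that are a noisy linear function of those embeddings.
B = rng.normal(size=(d_llm, n_voxels))
X = rng.normal(size=(n_train + n_test, d_llm))
Y = X @ B + 0.1 * rng.normal(size=(n_train + n_test, n_voxels))

Xtr, Xte, Ytr, Yte = X[:n_train], X[n_train:], Y[:n_train], Y[n_train:]

# Ridge regression: the standard linear encoding model in this literature
lam = 1.0
W = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(d_llm), Xtr.T @ Ytr)
pred = Xte @ W

# Held-out per-voxel correlation: high values = linearly predictable
corrs = [np.corrcoef(pred[:, v], Yte[:, v])[0, 1] for v in range(n_voxels)]
print(min(corrs) > 0.9)
```

The papers' claim is essentially that real fMRI/ECoG responses behave like the simulated `Y` here, i.e. are surprisingly well predicted by a linear map from LLM embeddings.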
Thanks for engaging. Can you say more about which papers you've looked at / in which ways they seemed very weak? This will help me adjust what papers I'll send; otherwise, I'm happy to send a long list.
Also, to be clear, I don't think any specific paper is definitive evidence, I'm mostly swayed by the cumulated evidence from all the work I've seen (dozens of papers), with varying methodologies, neuroimaging modalities, etc.
I think there's a lot of cumulated evidence pointing against the view that LLMs are (very) alien and pointing towards their semantics being quite similar to those of humans (though of course not identical). E.g. have a look at papers (comparing brains to LLMs) from the labs of Ev Fedorenko, Uri Hasson, Jean-Remi King, Alex Huth (or twitter thread summaries).
Related - context distillation / prompt compression, perhaps recursively too - Learning to Compress Prompts with Gist Tokens.
Thanks for your comment and your perspective, that's an interesting hypothesis. My intuition was that worse performance at false belief inference -> worse at deception, manipulation, etc. As far as I can tell, this seems mostly borne out by a quick Google search, e.g. Autism and Lying: Can Autistic Children Lie?, Exploring the Ability to Deceive in Children with Autism Spectrum Disorders, People with ASD risk being manipulated because they can't tell when they're being lied to, Strategic Deception in Adults with Autism Spectrum Disorder.
It's not possible to let an AGI keep its capability to engineer nanotechnology while taking out its capability to deceive and plot, any more than it's possible to build an AGI capable of driving red cars but not blue ones. They're "the same" capability in some sense, and our only hope is to make the AGI want to not be malign.
Seems very overconfident if not plain wrong; consider as an existence proof that 'mathematicians score higher on tests of autistic traits, and have higher rates of diagnosed autism, compared with people in the general population' and c...
Related - I'd be excited to see connectome studies on how mice are mechanistically capable of empathy; this (+ computational models) seems like it should be in the window of feasibility given e.g. Towards a Foundation Model of the Mouse Visual Cortex: 'We applied the foundation model to the MICrONS dataset: a study of the brain that integrates structure with function at unprecedented scale, containing nanometer-scale morphology, connectivity with >500,000,000 synapses, and function of >70,000 neurons within a ∼ 1mm3 volume spanning multiple areas of ...
Another reason to expect approximate linearity in deep learning models: point 12 + arguments about approximate (linear) isomorphism between human and artificial representations (e.g. search for 'isomorph' in Understanding models understanding language and in Grounding the Vector Space of an Octopus: Word Meaning from Raw Text).
It seems to me that the results here ('instruction tuning strengthens both the use of semantic priors and the capacity to learn input-label mappings, but more of the former') could be interpreted as some positive evidence for the optimistic case (and perhaps more broadly, for 'Do What I Mean' being not-too-hard); summary twitter thread, see especially tweets 4 and 5.
Linear decoding also works pretty well for others' beliefs in humans: Single-neuronal predictions of others’ beliefs in humans
Probably not, from the paper: 'We used LeetCode in Figure 1.5 in the introduction, where GPT-4 passes all stages of mock interviews for major tech companies. Here, to test on fresh questions, we construct a benchmark of 100 LeetCode problems posted after October 8th, 2022, which is after GPT-4’s pretraining period.'
Good point. It's a bit weird that performance on easy Codeforces questions is so bad (0/10) though.
https://twitter.com/cHHillee/status/1635790330854526981
Agree, and I've had similar/related thoughts on how DWIM seems like a pretty natural target for LLM alignment: https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=65czxJGyBuhqhBRex https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=GRjfMwLDFgw6qLnDv