Bogdan Ionut Cirstea

Automated safety research.


You can think of a pipeline like

- feed lots of good papers in [situational awareness / out-of-context reasoning / ...] into GPT-4's context window,
- ask it to generate 100 follow-up research ideas,
- ask it to develop specific experiments to run for each of those ideas,
- feed those experiments to GPT-4 copies equipped with a coding environment,
- write up the results in a nice little article and send it to a human.

Yup; and not only this, but many parts of the workflow have already been tested out (e.g. ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models; Generation and human-expert evaluation of interesting research ideas using knowledge graphs and large language models; LitLLM: A Toolkit for Scientific Literature Review; Acceleron: A Tool to Accelerate Research Ideation; DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning; Discovering Preference Optimization Algorithms with and for Large Language Models). It seems quite feasible to get enough reliability/consistency gains to string these together and get ~the whole (post-training) prosaic alignment research workflow loop going, especially with e.g. improvements in reliability from GPT-5/6 and more 'schlep' / 'unhobbling'.
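To make the above concrete, here's a minimal sketch of what stringing these stages together might look like. `complete` and `run_in_sandbox` are hypothetical wrappers, not the APIs of any of the systems cited above:

```python
# Minimal sketch of the idea -> experiment -> report loop described above.
# `complete` and `run_in_sandbox` are hypothetical placeholders, not any
# cited system's actual API.

def complete(prompt: str) -> str:
    raise NotImplementedError  # e.g. wrap your provider's chat-completions API

def run_in_sandbox(experiment_spec: str) -> str:
    raise NotImplementedError  # e.g. execute generated code in an isolated env

def research_loop(papers: list[str], n_ideas: int = 100) -> str:
    corpus = "\n\n".join(papers)
    ideas = complete(
        f"Here are papers on situational awareness / out-of-context reasoning:\n"
        f"{corpus}\n\nGenerate {n_ideas} follow-up research ideas, one per line."
    ).splitlines()
    results = []
    for idea in ideas:
        spec = complete(f"Design a concrete, runnable experiment for: {idea}")
        results.append((idea, run_in_sandbox(spec)))
    # Write up the results and hand off to a human reviewer.
    return complete(f"Write a short article summarizing these results:\n{results}")
```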


You might enjoy Concept Algebra for (Score-Based) Text-Controlled Generative Models (and probably other papers / videos from Victor Veitch's group), which tries to come up with something like a theoretical explanation for the linear representation hypothesis, including some of the discussion in the reviews / rebuttals for the above paper, e.g.:

'**Causal Separability** The intuitive idea here is that the separability of factors of variation boils down to whether there are “non-ignorable” interactions in the structural equation model that generates the output from the latent factors of variation—hence the name. The formal definition 3.2 relaxes this causal requirement to distributional assumptions. We have added its causal interpretation in the camera ready version.

**Application to Other Generative Models** Ultimately, the results in the paper are about non-parametric representations (indeed, the results are about the structure of probability distributions directly!) The importance of diffusion models is that they non-parametrically model the conditional distribution, so that the score representation directly inherits the properties of the distribution.

To apply the results to other generative models, we must articulate the connection between the natural representations of these models (e.g., the residual stream in transformers) and the (estimated) conditional distributions. For autoregressive models like Parti, it’s not immediately clear how to do this. This is an exciting and important direction for future work!

(Very speculatively: models with finite dimensional representations are often trained with objective functions corresponding to log likelihoods of exponential family probability models, such that the natural finite dimensional representation corresponds to the natural parameter of the exponential family model. In exponential family models, the Stein score is exactly the inner product of the natural parameter with y. This weakly suggests that additive subspace structure may originate in these models following the same Stein score representation arguments!)

**Connection to Interpretability** This is a great question! Indeed, a major motivation for starting this line of work is to try to understand if the "linear subspace hypothesis" in mechanistic interpretability of transformers is true, and why it arises if so. As just discussed, the missing step for precisely connecting our results to this line of work is articulating how the finite dimensional transformer representation (the residual stream) relates to the log probability of the conditional distributions. Solving this missing step would presumably allow the tool set developed here to be brought to bear on the interpretation of transformers.

One exciting observation here is that linear subspace structure appears to be a generic feature of probability distributions! Much mechanistic interpretability work motivates the linear subspace hypothesis by appealing to special structure of the transformer architecture (e.g., this is Anthropic's usual explanation). In contrast, our results suggest that linear encoding may fundamentally be about the structure of the data generating process.

**Limitations** One important thing to note: the causal separability assumption is required for the concepts to be separable in the conditional distribution itself. This is a fundamental restriction on what concepts can be learned by any method that (approximately) learns a conditional distribution. I.e., it’s a limitation of the data generating process, not special to concept algebra or even diffusion models.

Now, it is true that to find the concept subspace using prompts we have to be able to find prompts that elicit causally separable concepts. However, this is not so onerous—because sex and species are not separable, we can't elicit the sex concept with "buck" and "doe". But the prompts "a woman" and "a man" work well.'
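To unpack the speculative exponential-family remark in the rebuttal above, here's my own gloss (the notation and derivation are mine, not the authors'):

```latex
% My gloss of the exponential-family remark above; notation assumed, not the authors'.
% For a conditional exponential family with natural parameter \eta(x) and
% sufficient statistic y:
p(y \mid x) = h(y)\, \exp\!\big( \langle \eta(x),\, y \rangle - A(\eta(x)) \big)
% the Stein score is then affine in the natural parameter:
\nabla_y \log p(y \mid x) = \nabla_y \log h(y) + \eta(x)
% so a concept that shifts \eta(x) within a fixed subspace shows up as additive
% subspace structure in the score, matching the structure the paper derives
% for diffusion-model score representations.
```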


The finding on the differential importance of verifiability also seems in line with the findings from Trading Off Compute in Training and Inference.

56% on SWE-bench Lite with repeated sampling (13 percentage points above the previous SOTA; up from 15.9% with one sample to 56% with 250 samples), with a very-below-SOTA model: https://arxiv.org/abs/2407.21787. Anything automatically verifiable (large chunks of math and coding) seems like it's gonna be automatable in < 5 years.
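The mechanism behind these numbers is simple enough to sketch. `generate` and `verify` below are hypothetical wrappers (e.g. sample a candidate patch, then run the task's test suite); the pass@k estimator is the standard unbiased combinatorial one:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    given n samples of which c are verified correct (assumes k <= n)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def solve_by_repeated_sampling(problem, generate, verify, budget: int = 250):
    # generate/verify are hypothetical wrappers: sample a candidate solution,
    # then check it automatically (e.g. run the benchmark's test suite).
    for _ in range(budget):
        candidate = generate(problem)
        if verify(problem, candidate):
            return candidate  # first verified sample wins
    return None
```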


With research automation in mind, here's my wager: the modal top-15 STEM PhD student will redirect at least half of their discussion/questions from peers to mid-2026 LLMs.

Fwiw, I've kind of already noticed myself starting to do some of this, for AI safety-related papers, especially after Claude-3.5 Sonnet came out.

Jack Clark: '**Registering a prediction**: I predict that within two years (by July 2026) we'll see an AI system beat all humans at the IMO, obtaining the top score. Alongside this, I would wager we'll see the same thing - an AI system beating all humans in a known-hard competition - in *another* scientific domain outside of mathematics. If both of those things occur, I believe that will present strong evidence that AI may successfully automate large chunks of scientific research before the end of the decade.' https://importai.substack.com/p/import-ai-380-distributed-13bn-parameter


It seems that, in some fundamental sense, misalignment resides in self-other distinction: for a model to be misaligned it has to model itself as having different values, goals, preferences, and beliefs from humans, in ways that are unnecessary to perform the tasks that humans want the AI to perform.

I think this would be better framed as: self-other distinction *is a prerequisite* (capability) for misalignment (but very likely also for desired capabilities). I think 'in ways that are unnecessary to perform the tasks that humans want the AI to perform' is stated overconfidently, and will likely be settled empirically. For now, I think the best (not-super-strong) case for this being plausible is the existence proof of empathetic humans, where self-other overlap does seem like a relevant computational mechanism for empathy.
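To make 'self-other overlap as a computational mechanism' concrete, here's a minimal sketch of one way it could be operationalized as a fine-tuning objective. This is my illustration, not the original post's implementation; it assumes a HuggingFace-style model interface and matched-length prompt pairs:

```python
import torch

def self_other_overlap_loss(model, self_ids, other_ids):
    """Illustrative loss (an assumption, not the original post's method):
    pull the model's internal representations on a 'self'-referencing prompt
    toward those on a matched 'other'-referencing prompt. Assumes a
    HuggingFace-style forward() with output_hidden_states, and that the two
    prompts are tokenized to the same length."""
    h_self = model(self_ids, output_hidden_states=True).hidden_states[-1]
    h_other = model(other_ids, output_hidden_states=True).hidden_states[-1]
    return torch.mean((h_self - h_other) ** 2)
```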


I think this argument is made even stronger by similar considerations for *input* tokens too - given the even lower price of input tokens (compared to output tokens), and the scaling laws for long context windows and for RAG.
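Some rough arithmetic, with assumed (not actual) per-token prices, to illustrate why cheap input tokens matter here:

```python
# Back-of-envelope cost of a long-context read vs. generation. The prices
# below are assumptions for illustration, not any provider's real rate card.
INPUT_PRICE_PER_M = 3.0    # $ per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.0  # $ per 1M output tokens (assumed)

def query_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1e6

# At a 5x input/output price gap, reading ~1M tokens of papers into context
# costs about the same as generating only ~200k tokens:
print(query_cost(1_000_000, 0))  # 3.0
print(query_cost(0, 200_000))    # 3.0
```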


The top comment also seems to be conflating whether a model *is capable of* (e.g. sometimes, in some contexts) mesa-optimizing and whether it *is (consistently) mesa-optimizing*. I interpret the quoted original definition as being about the second, which LLMs probably aren't, though they're capable of the first.
This seems like the kind of ontological confusion that the Simulators post discusses at length.

Some critical factors, here and for alignment automation more broadly, are token cheapness and task-horizon shortness: https://docs.google.com/presentation/d/1bFfQc8688Fo6k-9lYs6-QwtJNCPOS8W2UH5gs8S6p0o/edit?usp=drive_link; https://x.com/BogdanIonutCir2/status/1819848009473036537; https://x.com/BogdanIonutCir2/status/1819861008568971325.