Inducing human-like biases in moral reasoning LMs
Meta. This post is less polished than we would ideally prefer. However, we still think it is reasonable to publish it as is, to avoid further delays. We are open to answering questions and to feedback in the comments.

TL;DR. This post presents an inconclusive attempt at a proof of concept that fMRI data from human brains can help improve moral reasoning in large language models. Code is available at https://github.com/ajmeek/Inducing-human-like-biases-in-moral-reasoning-LLMs.

Introduction

Our initial motivation was to create a proof of concept of applying an alignment research agenda we are particularly interested in, based on neuroconnectionism and brain-LM similarities (and their relevance for alignment): 'neuroconnectionism as a general research programme centered around ANNs as a computational language for expressing falsifiable theories about brain computation'. Moral reasoning is an interesting application area, both for its relevance to AI alignment and because of the availability of public neuroimaging data, as well as of publicly available LMs fine-tuned for moral reasoning.

During the last few years, a series of high-profile papers have shown that LMs partially converge towards brain-like solutions and share fundamental computational principles with humans, making them a 'biologically feasible computational framework for studying the neural basis of language'. For more (recent) evidence of apparent convergence towards human-like representations, see also Scaling laws for language encoding models in fMRI and Large Language Models Converge on Brain-Like Word Representations.

To the best of our knowledge, though, potential LM-brain similarities for linguistic inputs rich in morally relevant content (e.g. moral scenarios) have not been explored previously. Nor has anyone tried to improve LM moral reasoning using moral reasoning neuroimaging datasets (though similar ideas have been explored for LMs more broadly and e.g. for Convolutional Neural Networks