Inducing human-like biases in moral reasoning LMs
Meta. This post is less polished than we would ideally prefer. However, we still think publishing it as-is is reasonable, to avoid further delays. We are open to answering questions and to feedback in the comments.

TL;DR. This post presents an inconclusive attempt at a proof of concept that fMRI data from human brains can help improve moral reasoning in large language models. Code is available at https://github.com/ajmeek/Inducing-human-like-biases-in-moral-reasoning-LLMs.

Introduction

Our initial motivation was to create a proof of concept for an alignment research agenda we are particularly interested in, based on neuroconnectionism and brain-LM similarities (and their relevance for alignment): ‘neuroconnectionism as a general research programme centered around ANNs as a computational language for expressing falsifiable theories about brain computation’. Moral reasoning is an interesting application area, both for its relevance to AI alignment and because of the availability of public neuroimaging data, as well as of publicly available LMs fine-tuned for moral reasoning.

Over the last few years, a series of high-profile papers have shown that LMs partially converge towards brain-like solutions and share fundamental computational principles with humans, making them a ‘biologically feasible computational framework for studying the neural basis of language’. For more (recent) evidence of apparent convergence towards human-like representations, see also Scaling laws for language encoding models in fMRI and Large Language Models Converge on Brain-Like Word Representations. To the best of our knowledge, though, potential LM-brain similarities for linguistic inputs rich in morally relevant content (e.g. moral scenarios) have not been explored previously. Nor has anyone tried to improve LM moral reasoning using moral reasoning neuroimaging datasets (though similar ideas have been explored for LMs more broadly and e.g. for Convolutional Neural Networks
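To make the intended setup more concrete, here is a minimal sketch of the kind of multi-task fine-tuning this suggests: a shared encoder trained jointly on a moral-acceptability label and on regressing the fMRI voxel activations recorded while participants read the same scenarios. The model name, head sizes, and loss weighting are illustrative assumptions, not the configuration used in the repository above.

```python
# Sketch of brain-aligned fine-tuning: shared encoder, two heads,
# joint loss = moral classification loss + weighted fMRI regression loss.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BrainAlignedMoralLM(nn.Module):
    def __init__(self, model_name="bert-base-cased", n_voxels=1024,
                 brain_loss_weight=0.5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.moral_head = nn.Linear(hidden, 2)         # acceptable / unacceptable
        self.brain_head = nn.Linear(hidden, n_voxels)  # predicted voxel activations
        self.brain_loss_weight = brain_loss_weight

    def forward(self, input_ids, attention_mask, moral_labels=None, fmri=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]           # [CLS]-style pooling
        moral_logits = self.moral_head(pooled)
        brain_pred = self.brain_head(pooled)

        loss = None
        if moral_labels is not None:
            loss = nn.functional.cross_entropy(moral_logits, moral_labels)
        if fmri is not None:
            brain_loss = nn.functional.mse_loss(brain_pred, fmri)
            loss = brain_loss if loss is None else loss + self.brain_loss_weight * brain_loss
        return {"loss": loss, "moral_logits": moral_logits, "brain_pred": brain_pred}

# Toy usage on a single batch (random tensor standing in for real voxel data):
tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = BrainAlignedMoralLM()
batch = tok(["He lied to his friend to avoid hurting her feelings."],
            return_tensors="pt", padding=True)
out = model(batch["input_ids"], batch["attention_mask"],
            moral_labels=torch.tensor([1]),
            fmri=torch.randn(1, 1024))
out["loss"].backward()
```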
In essence, yes, stego is something that looks innocuous and requires special interpretation (the decoding function); I wrote about that here. The interpreter needs certain abilities: either an undisclosed decoding function (hidden in the weights?) that is unknown to the monitor, or some kind of private key (though public-key stego also exists). Otherwise it is not that dangerous: we can just run the same prompt with the modified output on the same model and get the secret (as in the dog-whistling case). Or the monitor might be able to do it if it shares some weights with the observed model (or is the same model). This is Kerckhoffs's principle.
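A toy sketch of that point (illustrative only, not a scheme from the post): the decoding procedure can be fully public, but the payload is only recoverable with the private key, so a monitor running the same procedure without the key gets noise.

```python
# Each payload bit is XORed with a keystream bit derived from the key;
# the result selects one of two synonyms at that position in the covertext.
import hashlib

SYNONYMS = [("big", "large"), ("quick", "fast"), ("happy", "glad"),
            ("smart", "clever"), ("begin", "start"), ("buy", "purchase"),
            ("end", "finish"), ("help", "assist")]

def keystream(key: str, n: int) -> list[int]:
    digest = hashlib.sha256(key.encode()).digest()
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(n)]

def encode(payload_bits, key):
    ks = keystream(key, len(payload_bits))
    return [SYNONYMS[i][b ^ k] for i, (b, k) in enumerate(zip(payload_bits, ks))]

def decode(words, key):
    ks = keystream(key, len(words))
    return [SYNONYMS[i].index(w) ^ k for i, (w, k) in enumerate(zip(words, ks))]

secret = [1, 0, 1, 1, 0, 0, 1, 0]
cover = encode(secret, key="private-key")
print(cover)                               # looks like an innocuous word list
print(decode(cover, key="private-key"))    # interpreter with the key: the secret
print(decode(cover, key="monitor-guess"))  # monitor without the key: noise
```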