We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it.
This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.
We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two...
Meta. This post is less polished than we would ideally prefer. However, we still think publishing it as is is reasonable, to avoid further delays. We are open to answering questions and to feedback in the comments.
TL;DR. This presents an inconclusive attempt to create a proof-of-concept that fMRI data from human brains can help improve moral reasoning in large language models.
Code is available at https://github.com/ajmeek/Inducing-human-like-biases-in-moral-reasoning-LLMs.
Our initial motivation was to create a proof of concept of applying an alignment research agenda we are particularly interested in, based on neuroconnectionism and brain-LM similarities (and their relevance for alignment): ‘neuroconnectionism as a general research programme centered around ANNs as a computational language for expressing falsifiable theories about brain computation’. Moral reasoning is an interesting application area, both for its relevance to AI alignment and...
Mrinank, Austin, and Alex wrote a paper on the results from Understanding and controlling a maze-solving policy network, Maze-solving agents: Add a top-right vector, make the agent go to the top-right, and Behavioural statistics for a maze-solving agent.
...Abstract: To understand the goals and goal representations of AI systems, we carefully study a pretrained reinforcement learning policy that solves mazes by navigating to a range of target squares. We find this network pursues multiple context-dependent goals, and we further identify circuits within the network that correspond to one of these goals. In particular, we identified eleven channels that track the location of the goal. By modifying these channels, either with hand-designed interventions or by combining forward passes, we can partially control the policy. We show that this network
Hi Charlie, thanks for your comment and apologies for the late reply. To echo what Artyom said, we didn't observe a significant difference between the models we fine-tuned vs. the base models with regard to the brain-score. Our models did not end up becoming more correlated with the neuroimaging data of subjects taking these moral reasoning tests. In the future it would be neat if this works and we can start utilizing more neuroimaging data for alignment, but this initial stab didn't make those concrete connections.
To go into a bit more detail on the brain... (read more)