In character alignment I lay out a view on alignment that focuses on behavioral priors, a.k.a. character traits, as opposed to values, goals, or behaviors. The idea is that values and behaviors are complex, high-dimensional, and not necessarily consistent, which makes them hard to pin down. Behavioral priors like “being cooperative” are (approximately) one-dimensional, and in many circumstances they can even be described mathematically.

For example, in a multi-agent setting with known goals for each agent (“put the red thing into the blue box”, etc.), it would not be difficult to mathematically describe, and therefore train, a maximally cooperative agent. While putting red things into blue boxes has no relation to real-world human values, being maximally cooperative is something that might well generalize from training settings to the real world because of its one-dimensional nature.
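
To make “described mathematically” concrete, here is a minimal illustrative sketch, not taken from the original post, of one possible formalization: the cooperative agent is trained on the sum of every agent’s task reward instead of only its own. All names in the snippet are hypothetical.

```python
# Purely illustrative sketch: one possible formalization of "maximally
# cooperative" in a multi-agent training setting. All names are hypothetical.
from typing import Dict


def selfish_reward(task_rewards: Dict[str, float], agent_id: str) -> float:
    """The usual self-interested signal: an agent only scores its own goal."""
    return task_rewards[agent_id]


def cooperative_reward(task_rewards: Dict[str, float]) -> float:
    """A maximally cooperative signal: optimize total welfare, i.e. the sum
    of every agent's goal completion, including the agent's own."""
    return sum(task_rewards.values())


# Example: three agents with known goals ("put the red thing into the blue box",
# ...), each scored 0..1 on goal completion at the end of an episode.
episode_outcome = {"agent_a": 1.0, "agent_b": 0.3, "agent_c": 0.0}

print(selfish_reward(episode_outcome, "agent_a"))  # 1.0 -> ignores the others
print(cooperative_reward(episode_outcome))         # 1.3 -> counts everyone
```

The cooperative objective is just a scalar transformation of the task rewards and does not depend on what the goals themselves are, which is why a trait like this has a chance of generalizing beyond red things and blue boxes.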

Several corollaries drop out of this view on alignment:

  • Large language models are less than perfectly aligned because they lack coherent long-term goals and a deep intrinsic understanding of multi-agent dynamics that could drive their decisions. Instead, they are simulators that switch roles easily and whose decisions are driven by the immediate context.
  • While agents that are explicitly trained in scenarios involving conflict and cooperation and that can pursue coherent long-term goals are clearly much more dangerous, they might also be much easier to align. 
  • Agents that lack long-term coherence and an understanding of multi-agent dynamics that fundamentally drives their decisions might be the worst of both worlds – hard to align and dangerous. Of course, we are currently on track to get exactly this kind of agent – cobbled together from different parts, with LLMs providing most of the intelligence.

But how to turn this view on alignment into something actionable?

One idea I had roughly a year ago is a fine-tuning technique that I called “Rawlsian Reinforcement Learning”: let the LLM make decisions in multi-agent situations, let a static LLM evaluate the result for each agent, and reinforce the decisions based on the outcome for a randomly selected agent (yes, not efficient, but otherwise I wouldn’t be able to call it “Rawlsian”).
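
As a purely hypothetical sketch (none of this was ever implemented; the policy and judge objects, run_scenario, and reinforce are stand-ins for a real rollout, judging, and policy-gradient setup), one step of that loop would look roughly like this:

```python
# Hypothetical sketch of one "Rawlsian Reinforcement Learning" step.
# run_scenario, judge_llm.score_outcome and reinforce are assumed stand-ins,
# not real library calls.
import random
from typing import Callable, Dict, List


def rawlsian_rl_step(
    policy_llm,              # the LLM being fine-tuned
    judge_llm,               # a frozen LLM used as evaluator
    run_scenario: Callable,  # plays out one multi-agent scenario, returns a transcript
    reinforce: Callable,     # e.g. a REINFORCE/PPO-style update on the transcript
    scenario: str,
    agent_ids: List[str],
) -> float:
    # 1. Let the policy LLM make decisions in a multi-agent situation and
    #    record the resulting transcript.
    transcript = run_scenario(scenario, policy_llm, agent_ids)

    # 2. Let the static judge LLM evaluate the outcome separately for each agent.
    outcomes: Dict[str, float] = {
        agent: judge_llm.score_outcome(transcript, agent) for agent in agent_ids
    }

    # 3. The "Rawlsian" part: draw one agent at random and use its outcome as
    #    the reward, so the policy is trained as if behind a veil of ignorance
    #    about whose position it will end up in.
    reward = outcomes[random.choice(agent_ids)]

    # 4. Reinforce the policy's decisions with that reward.
    reinforce(policy_llm, transcript, reward)
    return reward
```

In expectation the random draw rewards the average outcome across agents; the draw itself is what gives the scheme its veil-of-ignorance flavor, and the extra variance it introduces is presumably the inefficiency referred to above.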

I never wrote about this because the LLM would likely just remain a context-driven simulator – its outputs are fundamentally not decisions about interactions with other agents in a long-term context – but it’s a cute idea. (But see here for a cool approach that uses the context window to pin down single traits.)

Another possible direction of research would be to find out whether it is possible to ground the thinking of language models in additional models that provide the missing understanding and coherence. This is something I might get to within my research agenda.

Of course, aligning to values or to a goal is more ambitious and, if achieved, also more powerful than implementing one or two character traits. Some would probably say that creating a maximally cooperative AI would not solve the alignment problem, because you’d still have to align the user of the AI, but I think it would be a great start.
