This post is a short sketch of a problem with alignment that I wanted to share. The solution suggested below is not fleshed out yet; I may expand on it in a future post. I feel it's worth sharing the initial idea already, though.

The Problem

Claim: Usually, making an AI aligned will reduce its performance, i.e. increasing alignment is typically sufficient to cause a decrease in performance.

This holds almost by definition: alignment is about ruling out behavior we deem bad, behavior which would otherwise just be the AI doing "its best" to follow our instructions literally.

Some examples of how this might look:

  1. The AI is trained with an objective function that contains an additive alignment term, similar to a regularization term. The system then chooses between optimizing the rest of its objective function and respecting the alignment term (see the toy sketch after this list).
  2. We filter out unwanted inputs, i.e. clean the training data. This results in a less capable system.
  3. We filter out unwanted outputs, perhaps by training a second AI to recognize them. This also leads to a less capable system.
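As a toy illustration of item 1, a minimal sketch of what such an objective could look like is given below. All names and numbers in it are purely hypothetical, chosen only to show the structure of a task term plus a weighted alignment term, analogous to a regularizer:

```python
# Toy sketch of item 1: an additive alignment term in the objective,
# weighted like a regularizer. All names and values are illustrative.

def combined_loss(task_loss: float, alignment_penalty: float, lam: float = 0.1) -> float:
    """Total objective = task term + lam * alignment term."""
    return task_loss + lam * alignment_penalty

# The tradeoff behind the claim: any optimization effort spent on shrinking
# the alignment penalty is effort not spent on the task term, so a larger
# lam typically means a less capable (but better-behaved) system.
print(combined_loss(task_loss=2.3, alignment_penalty=0.8, lam=0.5))  # -> 2.7
```

The two terms compete for the same optimization budget, which is exactly the tradeoff the claim describes.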

If the above claim is true, it makes sense to think about whether there are architectures and training paradigms where this tradeoff doesn't exist, but where, instead, increases in alignment are a necessary condition for increases in performance.

Let's call such a setup alignment-inducing.

A Potential Solution

I think many properties of human society are alignment-inducing, i.e. personal growth tends to necessitate alignment with one's social circles.

Some arguments for this:

  • Antisocial behavior decreases future input. E.g. mentors choose students they sympathize with, not merely those whose performance in some domain seems promising.
  • Intelligent people tend to be more altruistic. I recall reading a study on this, but can't find it at the moment.
  • Reputation systems such as social status and serotonin reward prosocial behavior.
  • Growing up in a social environment forces you to model the needs and wants of others, take them into consideration and find an equilibrium with your own needs.

We should consider social multi-agent systems that naturally select for kindness, in which agents interface with other agents (and perhaps occasionally with humans as well). In such a setup, agents have a natural incentive to model the needs and wants of others, and agents that rise in the hierarchy, i.e. that perform well, should naturally be "good" toward the other agents. Additionally, the loss/reward of an individual agent can be very fine-grained, since it is implemented by the other agents.
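Below is a minimal toy sketch of how such a peer-implemented reward could look in code. Everything in it (the Agent class, the "generosity" parameter, the rating rule) is a hypothetical illustration of the idea rather than a proposed architecture: each agent's reward in a round comes entirely from how its interaction partners rate it.

```python
# Toy sketch of a "peer-implemented reward" setup. Everything here is
# hypothetical and heavily simplified: "generosity" stands in for an agent's
# disposition to cooperate, and each agent's reward comes only from how its
# interaction partners rate it.

import random
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    generosity: float        # probability of cooperating in an interaction
    reputation: float = 0.0  # running total of peer-assigned rewards

    def act(self) -> bool:
        """Cooperate (True) or defect (False)."""
        return random.random() < self.generosity

    def rate(self, partner_cooperated: bool) -> float:
        """Peer-implemented reward: score the partner by how it treated you."""
        return 1.0 if partner_cooperated else -1.0

def interaction_round(agents: list) -> None:
    """Pair agents at random, let them interact, and let each rate the other."""
    random.shuffle(agents)
    for a, b in zip(agents[::2], agents[1::2]):
        a_coop, b_coop = a.act(), b.act()
        a.reputation += b.rate(a_coop)  # a's reward is assigned by b
        b.reputation += a.rate(b_coop)  # b's reward is assigned by a

if __name__ == "__main__":
    random.seed(0)
    population = [Agent(f"agent{i}", generosity=i / 9) for i in range(10)]
    for _ in range(200):
        interaction_round(population)
    # More generous agents tend to accumulate higher peer-assigned reputation.
    for agent in sorted(population, key=lambda x: -x.reputation):
        print(agent.name, round(agent.reputation, 1))
```

The point of the sketch is only that the selection pressure is social: the quantity being accumulated is defined by other agents' reactions, not by a fixed external objective.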

Some major caveats:

  • Transferring an agent's alignment with the other agents in the system to alignment with humans is a huge leap.
  • It may be difficult to design a reputation system that prevents or detects psychopathy, or any other route to increased performance that does not necessitate alignment. Human social status is a negative example here: it can be accumulated without genuine prosociality.

I know there is a lot of work on multi-agent systems, but I don't know whether any of it is aimed at alignment.

I appreciate any comments.
