The primary point of this post is that I want to draw everyone’s attention to an excellent podcast by John David Presserman where he chats with Zvi.

 

John goes into the post itching for a fight with Zvi over LessWrong style arguments about deceptive inner-alignment. But Zvi doesn’t really take the bait. Instead they spend most of the podcast talking about John’s ideas regarding RLaiF.

What does RLaiF got to do with Normative Ethics?

 

John spends most of the podcast talking about the tendency of RLaiF models to fall into a degenerate state if you over-train them. The canonical example of this is a “Yes spammer”. In this case, the reinforcement learning is receiving its reward from a 2nd AI model that answer “yes” or “no” to “is this a good result”. Eventually the AI we are training learns to repeatedly output the word “yes” as this tricks the judge AI into also responding “yes” every single time.

John analogizes the tendency of RLaiF models to fall into degenerate feedback modes with the difference between rule-based ethical systems and outcome-based ethics systems (utilitarianism). In normative ethics systems, people follow a variety of rules based on historical practice such as “do not murder”, “do not steal", etc. By contrast, in utilitarianism, there are no hard and fast rules, instead adherents simply seek to maximize a single objective function, the total amount of “goodness” or utility in the world.

Just as utilitarianism has both bad theoretic and practical failure modes, when AI is trained with a single objective function, it inevitably falls into strange undesirable local maxima. In ethics, we solve this problem by relying on a set of historically developed rules that have been empirically proven to avoid the worst outcomes. Similarly, John argues that we can avoid perverse-maximization in RLaiF frameworks by valuing the process by which we arrive at the global maxima.

A Simple Research Proposal

 

I would like to propose testing John’s hypothesis by building the simplest possible test case:

Train a RLaiF model that degenerates into the “yes spammer”

Along the way, “checkpoint” the reward function periodically

Create a new reward function that is a weighted sum of the checkpoints

Verify that with this new reward function training does not degenerate into the “yes spammer” (note this is trivially true if we allow our set of weights to be 1,0,0,0…)

Identify what the boundary condition is where RLaiF degenerates into the yes-spammer

Conclusion

 

I would also like to echo John’s conclusions at the end of the podcast:

Object-level discussion is valuable, but specific criticisms are more effective than broad, inaccurate statements.

Concerns about AI should focus on specific issues, like the stability of complex feedback loops, rather than general doubts.

there is an important trade-off between normative and consequentialist reasoning in AI.

Pure consequentialist reasoning in AI, without normative ethics, risks leading to negative outcomes.

The main concern is super consequentialism in AI, rather than super intelligence itself.

Super consequentialism is widely acknowledged as potentially dangerous.

The conditions under which super consequentialism arises should be a central part of AI ethics debates.

Current AI discourse suffers from mutual frustration due to its format and participants.

More peer-to-peer discussions, rather than expert-public or advocacy debates can lead to deeper understanding.

We should engage one another as peers, not as opponents, to foster constructive dialogue.

New Comment