LESSWRONG

The Ethicophysics
AI
Frontpage

-15

Enkrateia: a safe model-based reinforcement learning algorithm

by MadHatter
30th Nov 2023
2 min read
4 comments


This is a linkpost for https://github.com/epurdy/saferl/blob/main/saferl_draft.pdf

[-] Brendan Long · 2y · 42

Section 2.3 seems to be the part that addresses alignment, and the proposed solution is to use reinforcement learning (train the AI on examples of what humans would do) and then to give up (either by leaving a human in the loop forever or just deciding that turning people into paperclips really is better).

The way these kinds of problems keep getting buried deep in the writing (sometimes through linked PDFs) really makes me think this is some sort of Sokal-hoax-style prank.

[-] MadHatter · 2y · 10

What's so bad about keeping a human in the loop forever? Do we really think we can safely abdicate our moral responsibilities?

[-] Brendan Long · 2y* · 44
  • It defeats the purpose of AI, so realistically no one will do it
  • It doesn't actually solve the problem if the AI is deceptive

I'm not convinced we can safely run AGI, with or without a human in the loop. That's what the alignment problem is.


In this post, we link to an accessible, academically written control-theoretic account of how to perform safe model-based reinforcement learning in a domain with significant moral constraints.

The ethicophysics can most properly be thought of as a new scientific paradigm in AI safety designed to give rigorous theoretical and historical justification to a particular style of learning cost functions for safety-critical systems.

In the rest of this post, we reproduce the introduction of the paper.

###

We describe LQPR (Linear-Quadratic-Program-Regulator), an algorithm for model-based reinforcement learning that allows us to prove PAC-style bounds on the behavior of the system. We describe proposed experiments in the DeepMind AI Safety Gridworlds domain; we have not yet had time to implement these experiments, but we provide predictions for the outcome of each. Potential future work includes scaling this work up to non-trivial tasks using a neural network approximator, as well as proving additional theoretical results about the safety and stability of such a system. We believe this system is a potential basis for aligning large language models and other powerful near-term AIs with human preferences.
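The LQPR algorithm itself is specified only in the linked PDF, not in this post. As a rough sketch of the constrained model-based setting the abstract describes (all function and variable names below are ours, not the paper's), here is a minimal action selector that maximizes predicted reward subject to a hard budget on predicted safety cost:

```python
import numpy as np

def constrained_greedy_action(reward_model, cost_model, state, cost_budget):
    """Pick the highest-predicted-reward action whose predicted safety
    cost stays within the budget; if no action satisfies the constraint,
    fall back to the lowest-cost action available."""
    rewards = reward_model[state]   # learned model: predicted reward per action
    costs = cost_model[state]       # learned model: predicted safety cost per action
    allowed = np.where(costs <= cost_budget)[0]
    if len(allowed) == 0:
        return int(np.argmin(costs))  # safest fallback when nothing is within budget
    return int(allowed[np.argmax(rewards[allowed])])

# Toy 1-state, 3-action example: action 2 has the best reward but
# exceeds the safety budget, so action 1 is chosen instead.
reward_model = np.array([[1.0, 2.0, 5.0]])
cost_model = np.array([[0.1, 0.3, 0.9]])
print(constrained_greedy_action(reward_model, cost_model, 0, 0.5))  # -> 1
```

This is only an illustration of the reward-versus-safety-cost trade-off; the paper's actual algorithm additionally carries PAC-style guarantees that this greedy sketch does not provide.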

1 Introduction

In this section, we lay out necessary context.

1.1 Background

There are three dominant approaches to ethics in philosophy: consequentialism, deontology, and virtue ethics. Any truly safe RL algorithm needs to be capable of incorporating all three, since each has weaknesses that are addressed by the other two. The algorithm described in this document is capable of implementing all three. Many, most notably Yudkowsky, have argued that consequentialist agents are likely to be badly misaligned. This perspective seems quite accurate to us.

Meanwhile, it seems likely that deontological agents will be hamstrung by their strict adherence to rules, paying too heavy an alignment tax to remain capable. Virtue ethics may be the sweet spot between these two extremes, since it seems likely to be both alignable and capable. For the sake of completeness, we provide implementations of all three ethical paradigms within a single unified framework. The wise reader is encouraged not to implement a consequentialist agent without proving a sufficiently reassuring number of theorems, verified to the limits of human capability. Even given such reassurances, it still seems unwise to implement a strongly capable consequentialist agent for anything other than military purposes.
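The paper's unified framework lives in the linked PDF; as a hedged illustration of what "all three ethical paradigms as cost functions in a single interface" might look like (every name here is hypothetical, not taken from the paper), consider:

```python
def consequentialist_cost(trajectory):
    # Cost depends only on outcomes: negated sum of realized utilities.
    return -sum(step["utility"] for step in trajectory)

def deontological_cost(trajectory, forbidden):
    # Large fixed penalty per rule-violating action, regardless of outcome.
    return sum(1e9 for step in trajectory if step["action"] in forbidden)

def virtue_cost(trajectory, exemplar_policy):
    # Cost measures divergence from what a virtuous exemplar would have done.
    return sum(1.0 for step in trajectory
               if step["action"] != exemplar_policy(step["state"]))

def total_cost(trajectory, forbidden, exemplar_policy, weights=(1.0, 1.0, 1.0)):
    # A single weighted objective combining all three paradigms.
    w_c, w_d, w_v = weights
    return (w_c * consequentialist_cost(trajectory)
            + w_d * deontological_cost(trajectory, forbidden)
            + w_v * virtue_cost(trajectory, exemplar_policy))

traj = [{"state": 0, "action": "a", "utility": 1.0},
        {"state": 1, "action": "b", "utility": 2.0}]
print(consequentialist_cost(traj))              # -> -3.0
print(virtue_cost(traj, lambda s: "a"))         # -> 1.0
```

The design point this sketch is meant to surface: once all three paradigms share a trajectory-to-cost signature, a planner can minimize any weighted combination of them, which is one plausible reading of the paper's "single unified framework".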

1.2 Scope

This paper is intended to lay the theoretical foundations for safe model-based reinforcement learning. We do not discuss the problem of generalization, since it seems to require significant extensions that we do not yet have designs for. We also examine only smaller, simpler versions of the relevant problems, briefly indicating how these might be extended when such an extension seems relatively clear.