
The Theoretical Foundations of Reward Learning

Feb 28, 2025 by Joar Skalse

In this sequence I provide an overview of the theoretical reward learning research agenda, including its motivating assumptions, several core results, and some starting points for how to contribute to it further.

The Theoretical Reward Learning Research Agenda: Introduction and Motivation
Partial Identifiability in Reward Learning
Misspecification in Inverse Reinforcement Learning
STARC: A General Framework For Quantifying Differences Between Reward Functions
Misspecification in Inverse Reinforcement Learning - Part II
Defining and Characterising Reward Hacking
Other Papers About the Theory of Reward Learning
How to Contribute to Theoretical Reward Learning Research