Joar Skalse

How to Contribute to Theoretical Reward Learning Research

This is the eighth (and, for now, final) post in the theoretical reward learning sequence, which starts in this post. Here, I will provide a few pointers to anyone who might be interested in contributing to further work on this research agenda, in the form of a few concrete and...

Feb 28, 2025 (17)

Other Papers About the Theory of Reward Learning

This is the seventh post in the theoretical reward learning sequence, which starts in this post. Here, I will provide shorter summaries of a few additional papers on the theory of reward learning, without going into as much depth as I did in the previous posts (but if there...

Feb 28, 2025 (16)

Defining and Characterising Reward Hacking

In this post, I will provide a summary of the paper Defining and Characterising Reward Hacking, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the sixth post in the theoretical reward learning sequence, which starts in this post (though this post is...

Feb 28, 2025 (15)

Misspecification in Inverse Reinforcement Learning - Part II

In this post, I will provide a summary of the paper Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the fifth post in the theoretical reward learning sequence, which starts in this post....

Feb 28, 2025 (9)

STARC: A General Framework For Quantifying Differences Between Reward Functions

In this post, I will provide a summary of the paper STARC: A General Framework For Quantifying Differences Between Reward Functions, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the fourth post in the theoretical reward learning sequence, which starts in this...

Feb 28, 2025 (12)

Misspecification in Inverse Reinforcement Learning

In this post, I will provide a summary of the paper Misspecification in Inverse Reinforcement Learning, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the third post in the theoretical reward learning sequence, which starts in this post (though this post is...

Feb 28, 2025 (19)

Partial Identifiability in Reward Learning

In this post, I will provide a summary of the paper Invariance in Policy Optimisation and Partial Identifiability in Reward Learning, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the second post in the theoretical reward learning sequence, which starts in this...

Feb 28, 2025 (16)

Risks from Learned Optimization: Introduction

Goodhart's Law in Reinforcement Learning

Deceptive Alignment

The Inner Alignment Problem
