
Risks from Learned Optimization

May 31, 2019 by evhub

This is a sequence version of the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper. Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence.

The goal of this sequence is to analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this sequence. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned?
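To make the two layers of optimization concrete, here is a minimal toy sketch in Python. This is our illustration, not anything from the paper: `base_optimize`, `MesaOptimizer`, and both objective functions are hypothetical stand-ins. The base optimizer searches over models (standing in for gradient descent over parameters), while the selected model's own behavior is itself a search over actions, guided by an internal objective the programmer never specified.

```python
import random

# Base optimizer (hypothetical stand-in for gradient descent): searches over
# candidate models and keeps the one scoring best on the base objective.
def base_optimize(make_model, base_objective, candidates=200):
    models = [make_model() for _ in range(candidates)]
    return max(models, key=base_objective)

# A mesa-optimizer: a "learned model" whose behavior is itself a search over
# actions, guided by an internal mesa-objective. The mesa-objective is a
# property of the model found by the base optimizer, not one the programmer
# wrote down, and it need not match the base objective.
class MesaOptimizer:
    def __init__(self):
        # Each candidate model gets a random internal target; selection
        # pressure from the base optimizer is the only thing pushing this
        # target toward alignment with the base objective.
        self.target = random.uniform(-1, 1)

    def mesa_objective(self, action):
        return -(action - self.target) ** 2

    def act(self, candidate_actions):
        # Internal search: evaluate candidate actions against the
        # mesa-objective and pick the best.
        return max(candidate_actions, key=self.mesa_objective)

# Base objective (also hypothetical): reward models whose chosen actions
# land near 0.
def base_objective(model):
    actions = [random.uniform(-2, 2) for _ in range(50)]
    return -abs(model.act(actions))

model = base_optimize(MesaOptimizer, base_objective)
print(model.act([random.uniform(-2, 2) for _ in range(50)]))
```

In this toy, the base optimizer tends to select a model whose internal target happens to sit near 0, so its behavior looks aligned on the training distribution even though its mesa-objective is only a proxy for the base objective. That gap between the two objectives is the inner alignment problem the sequence goes on to analyze.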

Posts in the sequence:

1. Risks from Learned Optimization: Introduction
2. Conditions for Mesa-Optimization
3. The Inner Alignment Problem
4. Deceptive Alignment
5. Risks from Learned Optimization: Conclusion and Related Work