Mesa-Optimization

Mesa-optimization occurs when a learned model (such as a neural network) is itself an optimizer. In this situation, a base optimizer, while searching for a model that performs well, creates a second optimizer, called a mesa-optimizer. The primary reference work for this concept is Hubinger et al.'s "Risks from Learned Optimization in Advanced Machine Learning Systems".

Example: Natural selection is an optimization process that optimizes for reproductive fitness. Natural selection produced humans, who are themselves optimizers. Humans are therefore mesa-optimizers of natural selection.

In the context of AI alignment, the concern is that a base optimizer (e.g., a gradient descent process) may produce a learned model that is itself an optimizer and that has unexpected and undesirable properties. Even if the gradient descent process is in some sense "trying" to do exactly what human developers want, the resulting mesa-optimizer will typically not be pursuing exactly the same objective.^[1]...
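The gap between the two objectives can be made concrete with a toy sketch. Everything in it is a hypothetical illustration rather than anything from Hubinger et al.: the `Option` features, the two candidate internal objectives, and the exhaustive-search "base optimizer" (a stand-in for gradient descent) are all assumptions. Each candidate policy is itself a small optimizer that picks whichever option maximizes its own internal objective. During training, "green" and "exit" always coincide, so the base optimizer cannot tell the two candidates apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    is_exit: bool   # what the base objective actually rewards
    is_green: bool  # a feature that coincides with exits in training only


def base_objective(option):
    """Reward assigned by the base optimizer: 1 for reaching an exit."""
    return 1.0 if option.is_exit else 0.0


def make_mesa_optimizer(mesa_objective):
    """Build a policy that searches the options for the one its own
    internal (mesa) objective scores highest."""
    def policy(options):
        return max(options, key=mesa_objective)
    return policy


# Hypothetical internal objectives the learned model might end up with.
candidate_objectives = {
    "seek_green": lambda o: float(o.is_green),
    "seek_exit": lambda o: float(o.is_exit),
}


def training_score(name, envs):
    """Total base reward a candidate earns on the training environments."""
    policy = make_mesa_optimizer(candidate_objectives[name])
    return sum(base_objective(policy(env)) for env in envs)


def base_optimizer(envs):
    """Stand-in for gradient descent: keep the best-scoring candidate.
    Ties go to the first candidate in insertion order, which is an
    arbitrary choice made only for this illustration."""
    best = max(candidate_objectives, key=lambda n: training_score(n, envs))
    return best, make_mesa_optimizer(candidate_objectives[best])


# In training, the exit is always the green option.
train_envs = [
    [Option(is_exit=True, is_green=True), Option(is_exit=False, is_green=False)],
    [Option(is_exit=False, is_green=False), Option(is_exit=True, is_green=True)],
]
# After a distribution shift, the correlation breaks.
test_env = [Option(is_exit=True, is_green=False), Option(is_exit=False, is_green=True)]

scores = {name: training_score(name, train_envs) for name in candidate_objectives}
chosen_name, learned_policy = base_optimizer(train_envs)
train_reward = sum(base_objective(learned_policy(env)) for env in train_envs)
test_reward = base_objective(learned_policy(test_env))

print(scores, chosen_name, train_reward, test_reward)
```

Both candidates earn the same training score, so the training signal alone does not determine which internal objective the selected model has. In this sketch the tie happens to resolve to "seek_green", which performs perfectly in training and then pursues the green option rather than the exit once the two come apart.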

Posts tagged Mesa-Optimization

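Risks from Learned Optimization: Introduction (evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse, Scott Garrabrant; 5y; 184 karma; 42 comments; tag relevance 20)

Matt Botvinick on the spontaneous emergence of learning algorithms (4y; 153 karma; 87 comments; tag relevance 13)

Embedded Agency (full-text version) (Scott Garrabrant, abramdemski; 5y; 180 karma; 17 comments; tag relevance 6)

Mesa-Search vs Mesa-Control (4y; 54 karma; 45 comments; tag relevance 6)

Trying to Make a Treacherous Mesa-Optimizer (1y; 95 karma; 14 comments; tag relevance 4)

Searching for Search (NicholasKees, janus; 1y; 86 karma; 7 comments; tag relevance 4)

Conditions for Mesa-Optimization (evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse, Scott Garrabrant; 5y; 83 karma; 48 comments; tag relevance 4)

Why almost every RL agent does learned optimization (1y; 32 karma; 3 comments; tag relevance 4)

Subsystem Alignment (abramdemski, Scott Garrabrant; 5y; 99 karma; 12 comments; tag relevance 3)

Feature Selection (2y; 315 karma; 24 comments; tag relevance 2)

Inner Alignment: Explain like I'm 12 Edition (4y; 179 karma; 46 comments; tag relevance 2)

Anomalous tokens reveal the original identities of Instruct models (1y; 137 karma; 16 comments; tag relevance 2)

Utility ≠ Reward (5y; 121 karma; 24 comments; tag relevance 2)

Deceptive Alignment (evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse, Scott Garrabrant; 5y; 117 karma; 20 comments; tag relevance 2)

The Inner Alignment Problem (evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse, Scott Garrabrant; 5y; 103 karma; 17 comments; tag relevance 2)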