Mesa-Optimization

Edited by riceissa, Rob Bensinger, Ruby, et al. last updated 20th Sep 2022

Mesa-Optimization is the situation in which a learned model (such as a neural network) is itself an optimizer. In this situation, a base optimizer creates a second optimizer, called a mesa-optimizer. The primary reference work for this concept is Hubinger et al.'s "Risks from Learned Optimization in Advanced Machine Learning Systems".

Example: Natural selection is an optimization process that optimizes for reproductive fitness. Natural selection produced humans, who are themselves optimizers. Humans are therefore mesa-optimizers of natural selection.

In the context of AI alignment, the concern is that a base optimizer (e.g., a gradient descent process) may produce a learned model that is itself an optimizer, and that has unexpected and undesirable properties. Even if the gradient descent process is in some sense "trying" to do exactly what human developers want, the resultant mesa-optimizer will not typically be trying to do the exact same thing.[1]
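
To make the structure concrete, below is a minimal, purely illustrative Python sketch (it is not taken from Hubinger et al., and every name and number in it is invented for illustration). The outer hill-climbing loop plays the role of the base optimizer, and the model it produces is itself an optimizer, because running it performs an explicit inner search over actions to maximize a mesa-objective encoded in its parameters.

```python
import random

TRAIN_STATES = [0.0, 1.0, 2.0, 3.0]       # situations seen during training (invented)
ACTIONS = [x / 10 for x in range(0, 51)]   # candidate actions in [0.0, 5.0] (invented)

def base_objective(state, action):
    """What the training signal actually rewards: choose an action close to the state."""
    return -abs(action - state)

def mesa_model(params, state):
    """The learned model. Its forward pass is an inner search over actions,
    scored by a mesa-objective encoded in params -- which need not equal
    the base objective."""
    weight, bias = params
    def mesa_objective(action):
        return -abs(action - (weight * state + bias))
    return max(ACTIONS, key=mesa_objective)

def training_score(params):
    """How well the mesa-optimizer's chosen actions do on the base objective."""
    return sum(base_objective(s, mesa_model(params, s)) for s in TRAIN_STATES)

# The base optimizer: naive random-mutation hill climbing over the model's parameters.
random.seed(0)
params = (random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0))
for _ in range(2000):
    candidate = (params[0] + random.gauss(0, 0.1), params[1] + random.gauss(0, 0.1))
    if training_score(candidate) >= training_score(params):
        params = candidate

print("learned parameters (these encode the mesa-objective):", params)
# On the training distribution the mesa-objective and base objective agree;
# off-distribution, the model pursues whatever goal its parameters happen to encode.
print("action chosen for the unseen state 10.0:", mesa_model(params, 10.0))
```

On the training states the base objective and the learned mesa-objective happen to agree, which is why the base optimizer keeps those parameters; whether they still agree on unseen states is exactly the inner alignment question.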

 

History

Work on this concept was previously done under the names Inner Optimizer and Optimization Daemons.

Wei Dai brought up a similar idea in an SL4 thread.[2]

The optimization daemons article on Arbital was probably published in 2016.[1]

Jessica Taylor wrote two posts about daemons while at MIRI:

  • "Are daemons a problem for ideal agents?" (2017-02-11)
  • "Maximally efficient agents will probably have an anti-daemon immune system" (2017-02-23)

 

See also

  • Inner Alignment
  • Complexity of value
  • Thou Art Godshatter

External links

Video by Robert Miles

Some posts that reference optimization daemons:

  • "Cause prioritization for downside-focused value systems": "Alternatively, perhaps goal preservation becomes more difficult the more capable AI systems become, in which case the future might be controlled by unstable goal functions taking turns over the steering wheel"
  • "Techniques for optimizing worst-case performance": "The difficulty of optimizing worst-case performance is one of the most likely reasons that I think prosaic AI alignment might turn out to be impossible (if combined with an unlucky empirical situation)." (the phrase "unlucky empirical situation" links to the optimization daemons page on Arbital)
  1. ^ "Optimization daemons". Arbital.

  2. ^ Wei Dai. '"friendly" humans?'. SL4 mailing list, December 31, 2003.
