Critiquing Risks From Learned Optimization, and Avoiding Cached Theories

ProofBySonnet

What I'm doing with this post and why

I've been told that it's a major problem (this post, point 4) when alignment students just accept the frames they're given without question. The usual advice (the intro of this) is to first do a bunch of background research and thinking on your own, to come up with your own frames if at all possible. BUT: I REALLY want to learn about existing alignment frames, it's Fucking Goddamn Interesting.

<sidenote>At the last possible minute while editing this post, I've had the realization that "the usual advice" I somehow acquired might not actually be "usual" at all. In fact, upon actually rereading the study guide, it looks like he actually recommended to read up on the existing frames and to try applying them. So, I must've just completely hallucinated that advice! Or, at least, extremely overestimated how important that point from 7 traps was. I'm still going to follow through with this exercise, but holy shit my internal category of "possible study routes" has expanded DRASTICALLY. </sidenote>

So I ask the question: is it possible to read about and pick apart an existing frame, such that you don't just end up uncritically accepting it? What questions do you need to ask or what thoughts do you need to think, in order to figure out the assumptions and/or problems with an existing framework?

I've chosen Risks From Learned Optimization because I've already been exposed to the basic concept, from various videos by Robert Miles. I shouldn't be losing anything by learning this more deeply and trying to critique it.

Moreover, I want to generate evidence of my competence anyway, so I might as well write this up.

Method:

I'm writing pretty much all the notes I generated while going through this process here, though I somewhat re-organized them to make them more readable and coherent.

generate questions to critique it with beforehand
read it (summarize it)
generate questions during the previous step
afterwards, attempt to answer said questions

Initial Questions

These were all generated before reading the sequence.

What questions does this model answer?
What assumptions does it make about the process of training an AI, or the alignment problem generally? Are these assumptions true?
What thought process would you need to have, in order to end up at this model?
If in the above questions you've come up with flaws, what would a model be like that solves these flaws?

Summary

This is high-level: I don't need all the details in order to generate a useful critique.

First post: Introduction

two questions:
- under what circumstances will learned algorithms be mesa-optimizers?
- when the previous is true, what will its objective be, how can it be aligned?
this frames the alignment problem as two separate ones, the inner alignment problem and the outer alignment problem
the basic cases we're trying to distinguish between: "pseudo-alignment" versus "robust alignment." First is our failure mode, second is the goal.
safety problems:
- sometimes we don't WANT the created algorithm to be an optimizer, and want to avoid even the possibility of mesa-optimization in principle
- when we do want an algorithm that is an optimizer, how do we align it? (the "inner alignment problem")
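
To make the "pseudo-alignment" versus "robust alignment" distinction concrete, here is a minimal toy sketch. Everything in it is my own hypothetical illustration and not from the sequence: the one-dimensional "positions," the exit/green-tile setup, and all function names. The idea is only that a pseudo-aligned objective can score perfectly on the base objective during training and then fail once the two objectives come apart.

```python
# Hypothetical toy example: the environment, names, and numbers are all assumptions.

def act(mesa_objective, env):
    """Pick the position that the agent's own (mesa) objective scores highest."""
    return max(env["positions"], key=lambda pos: mesa_objective(pos, env))

def base_objective(pos, env):
    """What the base optimizer actually rewards: standing on the exit."""
    return 1.0 if pos == env["exit"] else 0.0

def green_objective(pos, env):
    """A proxy the learned agent might end up pursuing: standing on the green tile."""
    return 1.0 if pos == env["green_tile"] else 0.0

def average_base_score(mesa_objective, envs):
    """Average base-objective score of an agent that optimizes mesa_objective."""
    return sum(base_objective(act(mesa_objective, e), e) for e in envs) / len(envs)

# In training, the exit always happens to be the green tile.
training_envs = [
    {"positions": [0, 1, 2, 3], "exit": k, "green_tile": k} for k in range(4)
]
# In deployment, the green tile is somewhere other than the exit.
deployment_envs = [
    {"positions": [0, 1, 2, 3], "exit": k, "green_tile": (k + 1) % 4} for k in range(4)
]

train_pseudo = average_base_score(green_objective, training_envs)
deploy_pseudo = average_base_score(green_objective, deployment_envs)
train_robust = average_base_score(base_objective, training_envs)
deploy_robust = average_base_score(base_objective, deployment_envs)

print("pseudo-aligned:", train_pseudo, deploy_pseudo)
print("robustly aligned:", train_robust, deploy_robust)
```

On the training environments the two agents are indistinguishable by base-objective score, which is roughly why the failure mode is hard to spot from training performance alone.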

Second post: conditions for mesa-optimization

there are two major things that could affect whether mesa-optimization occurs:
- the task: the environment, and the base objective function from the base optimizer
  - many details and possibilities here: once you assume that the framing is accurate for your problem, you pay attention to these.
- the base optimizer: the machine learning algorithm (grad descent etc), and the architecture of the final thing that's created.
  - many details and possibilities here: ditto

Third post: the inner alignment problem

there are multiple kinds of pseudo-alignment
- [details]
there are multiple ways that the details of the TASK can affect whether the final mesa-optimizer is pseudo-aligned
- [details]
there are multiple ways that the details of the BASE OPTIMIZER can affect whether the final mesa-optimizer is pseudo-aligned
- [details]

Fourth post: deceptive alignment

why deceptive alignment might, in principle, happen
- the base objective might be incorporated into the mesa-optimizer, OR
- the base objective could be modelled, inferred by the mesa-optimizer. (this is where deception might occur)
here are the reasons why deceptive alignment could go really badly
conditions for deceptive alignment
- the mesa-optimizer must have a consistent objective (its mesa-objective must not be modified too much during this part of the process)
- it must be able to model that it's being SELECTED FOR based on the base objective, and it must be able to model said objective
- the mesa-optimizer must BELIEVE that, at some point in the future, it will no longer be modified if it takes a treacherous turn
dynamics: claim: pure deception (rather than trying to optimize for BOTH objectives) is more "stable."
some details: various factors make internalization (taking on the base objective) or deception more or less likely.
distributional shift also affects this situation. Namely, whether the agent can tell the difference between training and deployment.

Fifth post: Conclusions and related work

a bunch of related work, if one wanted to learn more
concluding remarks
- if mesa-optimizers aren't likely to pop up and we aren't trying to make them, none of this is relevant.
- the opposite: if they are likely to pop up and are difficult to prevent, this is very relevant.
- question: is it more useful to just avoid the creation of mesa-optimizers entirely?
- it's important to figure out how relatively difficult these problems are, in order to figure out the relative effectiveness of various FAI approaches.

Comments or Questions I had while reading

I jotted these down as I went, and during this editing process I grouped them into a few useful categories.

The source of this model

what thought process would lead you to generating this frame/model? How might that process have skewed your perspective and caused you to miss things?
- an observation, somehow acquired: a true AGI will probably be an "optimizer." However, the processes that we use to train neural networks (and other such ML processes) are ALSO optimizers.
- a question, an application to the alignment problem: we want the FINAL optimizer to pursue the same goals as us. How might this double-optimization affect that?
so, the initial observation was "both X and Y are Z."
What other things in the alignment problem are Z?
What "things" are there, in the alignment problem?
- humans, doing things
- humans create computers and computer programs
- computers
- programs, on computers
- computer-reality interfaces
- with the previous three, a program does things. Will the things it does be nice? Long term? Short term?
- I'm running out of things to list out. This is a hard question, to figure out what's important to pay attention to.
Can most things be easily categorized as Z or not Z? Is Z a cleanly divided category? Good question.
are there categories other than "is this Z" that matter? What other properties matter in alignment?
- this is a hard question.

The purpose of this framework

An earlier observation: "X is an optimizer" is a way of thinking about X that allows us to better predict X's behavior. If we see it do things, we wonder about what objective would fit those actions. And if we know that X has objective A, then we can come up with possible things that X would be more likely to do, based on how much said action contributes to A.
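
A minimal sketch of that two-way use of the optimizer frame, under heavy assumptions: the candidate objectives, the numeric "actions," and the function names below are all hypothetical and chosen only for illustration. First we ask which candidate objective best fits the choices we've seen, then we use that objective to guess the next choice.

```python
# Hypothetical toy example: candidate objectives and observations are made up.

def best_action(objective, actions):
    """The action an ideal optimizer of `objective` would choose."""
    return max(actions, key=objective)

def fit_objective(candidate_objectives, observations):
    """Return the name of the candidate that explains the most observed choices.

    `observations` is a list of (available_actions, chosen_action) pairs.
    """
    def explained(objective):
        return sum(
            best_action(objective, options) == chosen
            for options, chosen in observations
        )
    return max(
        candidate_objectives,
        key=lambda name: explained(candidate_objectives[name]),
    )

candidates = {
    "maximize": lambda x: x,
    "minimize": lambda x: -x,
    "near_five": lambda x: -abs(x - 5),
}

# What we saw X do: the options it had, and which one it picked.
observations = [([1, 4, 9], 4), ([2, 6, 10], 6), ([0, 5, 7], 5)]

# Step 1: which objective would fit those actions?
inferred = fit_objective(candidates, observations)

# Step 2: given that objective, what is X likely to do next?
predicted = best_action(candidates[inferred], [3, 8, 20])

print(inferred, predicted)
```

This only works as well as the list of candidate objectives does, and it assumes X really is a clean optimizer of one of them, which is exactly the assumption questioned later in this post.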

Our true goal, somewhat: predict what our AI agent is going to do, and take actions such that it does the sorts of things we want it to do.

I don't know what other categories or ways of thinking would help with this. Damn.

Results

What alternate theories/frames would look like

a model might have a more complex/deep/technical definition and understanding of what an "optimizer" is, and would reconsider the alignment problem with that knowledge in mind. What "kind of" or "level of" optimizer do we want future AI systems to be? Should our ML methods themselves be different kinds of optimizers? What kind of optimizers are humans? If we're "playing the AI" and want to do well by human values, but humans aren't full-on optimizers with actual utility functions, how do we do well? What sort of mathematical object describes human values and human preferences, as well as actual human behavior?
QUESTION: when should one include the information of "there are things that aren't pure optimizers" and when is it fine to assume that the components ARE pure optimizers?
Note: there are probably components of the problem other than the "optimizer" or "semi-optimizer" category that are relevant. I have no idea what these other components might be, though.

My internal "preface" for this framework

(these questions are for my own benefit) (ask before applying?)

Are the objects or agents in your particular problem full-on optimizers? Or do they break the optimizer/non-optimizer dichotomy? If the latter, consider that fact while reviewing this framework.

Is the optimizer-ness of the objects or agents in your problem the most relevant point? What other factors or categories might be relevant? What is the ACTUAL PROBLEM/QUESTION you are trying to solve/answer, and how might this framework help or hinder?

Takeaways

How I could improve or streamline this process, and what I want to do next.

I can make this process more effective by not getting bogged down in details while reading. I spent way too long reading through things I later marked as simply "[details]" in my summary.
I could take my critiques and try to answer these weird questions I've generated.
I want to try to apply this technique to a theory/frame I haven't heard of before.
What should I DO, after having done this? How do I MAKE USE of these critiques? What is the POINT of them?
How can I tell, in the future, if I've "blindly accepted" a frame or if I have a more complete perspective on it?
A possibility: I have the frame itself, and then internally I have a "preface" that describes when to apply said frame, and when this frame fails. When tempted to apply this frame, I first reconsider its preface (after considering this idea, I wrote the preface above).

After finishing this post: I plan to pick a new conceptual alignment tool or alignment framework, and try out this method on it. I'm not sure how to determine if I've "critiqued correctly," though. How can I tell if a framework has permanently skewed my thoughts? What is a test I can run?
