Agent Meta-Foundations and the Rocket Alignment Problem

Chris_Leong

Many people are quite skeptical of the value of agent foundations. The kinds of problems that MIRI worries about, such as accounting for perfect predictors, co-operation with clones and being easily predictable, are a world away from the kinds of problems being faced in machine learning. Many people think that proper research in this area would involve code. They may also think that this kind of research consists purely of extremely rare edge cases of no practical importance, won't be integrable into the kinds of AI systems that are likely to be produced, or is just much, much less pressing than solving the kinds of safety challenges that we can already see arising in our current AI systems.

In order to convey his intuition that Agent Foundations is important, Eliezer wrote the Rocket Alignment Problem. The argument is roughly that any attempt to define what an AI should do is built upon shaky premises. This means that it is practically impossible to provide guarantees that things will go well. The hope is that by exploring highly simplified, theoretical problems we may learn something important that would deconfuse us. However, it is also noted that this might not pan out, as it is hard to see how useful improved theoretical understanding is before you've obtained it. Further, it is argued that AI safety is an unusually difficult area where things that sound like pretty good solutions could result in disastrous outcomes. For example, powerful utility maximisers are very good at finding the most obscure loopholes in their utility function to achieve a higher score. Powerful approval-based agents are likely to try to find a way to manipulate us. Powerful boxed agents are likely to find a way to escape that box.

Most of my current work on Less Wrong is best described as Agent Foundations Foundations. It involves working through the various claims or results in Agent Foundations, finding aspects that confuse me and then digging deeper to determine whether I'm confused, or something needs to be patched, or there's a deeper problem.

Agent Foundations Foundations research looks very different from most Agent Foundations research. Agent Foundations is primarily about producing mathematical formulations, while Agent Foundations Foundations is primarily about questioning philosophical assumptions. Agent Foundations Foundations is intended to eventually lead to mathematical formalisations, but the focus is more on figuring out exactly what we want from our maths. There is less focus on producing formalisations, as rushing out and producing incorrect formalisations is seen as a waste of time.

Agent Foundations research relies on a significant number of philosophical assumptions, and this is also an area where the default is disaster. The best philosophers are extremely careful with every step of the argument, yet they often come to entirely different conclusions. Given this, attempting to rush over this terrain by handwaving philosophical arguments is likely to end badly.

One challenge is that mathematicians can be blinded by elegant formalisations, to the point where they can't objectively assess the merits of the assumptions those formalisations are built upon. Another key issue is that when someone is able to use a formalisation to produce a correct result, they will often assume that the formalisation must be correct. Agent Foundations Foundations attempts to fight against these biases.

Agent Foundations Foundations focuses on what often appear to be weird niche issues from the perspective of Agent Foundations. This includes questions such as:

Given a perfect predictor, what is it predicting when the counterfactual is impossible? (Counterfactuals for Perfect Predictors)
Why doesn't Timeless Decision Theory depend on backwards causation? (The Prediction Problem, further discussion here)
What class of problems should our decision theory optimise over? (One doubt about Timeless Decision Theories)
What should we optimise for when our reference class depends on our decision? (Evil Genie Problem, Decision Theory with F@#!ed-Up Reference Classes)
What general kind of entities are logical counterfactuals? (Logical Counterfactuals & the Cooperation Game, Deconfusing Logical Counterfactuals)

Of course, lots of other people have done work in this vein too. I didn't want to spend a lot of time browsing the archive, but some examples include:

I don't want to pretend that the separation is clean at all. But in Agent Foundations work, the maths comes first and the philosophical assumptions second. For Agent Foundations Foundations, it is the other way round. Obviously, this distinction is somewhat subjective and messy. However, I think this model is useful, as it opens up discussions about whether the current balance of research is right and suggests areas for further research. It also clarifies why some of these problems might turn out to be more important than they first appear.

Update: One issue is that I almost want to use the term in two different ways. One way to think about Meta-Foundations is in an absolute sense, where it focuses on the philosophical assumptions, while Foundations focuses more on formalisations and ML focuses on writing programs. Another is in a relative sense, where you have a body of work termed Agent Foundations and I want to encourage a body of work that responds to it and probes these assumptions further. These senses are different, because when Agent Foundations work is pursued, there'll usually be some investigation into the philosophy, but it'll often be the minimal amount needed to get a theory up and running.

At first glance, the current Agent Foundations work seems to be formal, but it's not the kind of formal where you work in an established setting. It's counterintuitive that people can doodle in math, but they can. There's a lot of that in physics and economics. Pre-formal work doesn't need to lack formality, it just doesn't follow a specific set of rules, so it can use math to sketch problems approximately the same way informal words sketch problems approximately.

I'm not saying that you can't doodle in maths. It's just that when someone stumbles upon a mathematical model, it's very easy to fall into confirmation bias, instead of really, deeply considering if what they're doing makes sense from first principles. And I'm worried that this is what is happening in Agent Foundations research.

I've been a bit busy with other things lately, but this is exactly the kind of thing I'm trying to do.

Interesting. Is the phenomenological work to try to figure out what kind of agents are conscious and therefore worthy of concern or do you expect insights into how AI could work?

Both, although I mostly consider the former question settled (via a form of panpsychism that I point at in this post) and the latter less about the technical details of how AI could work and more about the philosophical predictions of what will likely be true of AI (mostly because it would be true of all complex, conscious things).

Also the "phenomenological" in the name sounded better to me than, say, "philosophical" or "continental" or something else, so don't get too hung up on it: it's mostly a marker to say something like "doing AI philosophy from a place that much resembles the philosophy of the folks who founded modern phenomenology", i.e. my philosophical lineage is more Kierkegaard, Hegel, Schopenhauer, Husserl, and Sartre than Hume, Whitehead, Russell, and Wittgenstein.

I think the former is very important, but I'm quite skeptical of the latter. What would be the best post of yours for a skeptic to read?

"Formally Stating the AI Alignment Problem" is probably the nicest introduction, but if you want a preprint of a more formal approach to how I think this matters (with a couple specific cases), you might like this preprint (though note I am working on getting this through to publication, have it halfway through review with a journal, and although I've been time constrained to make the reviewers' suggested changes, I suspect the final version of this paper will be more like what you are looking for).

I definitely agree that the AFF work is essential and does not seem to get as much attention as warranted, judging by the content of the weekly alignment newsletter. I still think that a somewhat more quantitative approach to philosophy would be a good thing. For example, I wrote a post "Order from Randomness" giving a toy model of how a predictable universe might spontaneously arise. I would like to see more foundational ideas from the smart folks at MIRI and elsewhere.

Fyi, if you're judging based on the list of "what links have been included in the newsletter", that seems appropriate, but if you're judging based on the list of "what is summarized in the newsletter", that's biased away from AF and AFF because I usually don't feel comfortable enough with them to summarize them properly.

