Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I have some rough intuitions about the purpose of MIRI's agent foundations agenda, and I'd like to share them here. (Note: I have not discussed these with MIRI, and these should not be taken to be representative of MIRI's views.)

I think there's a common misconception that the goal of agent foundations is to build an AGI architected around a decision theory module, a logical induction module, etc. In my mind, this is not the point at all, and my intuition is that such an approach is doomed to fail.

I interpret agent foundations as being more about providing formal specifications of metaphilosophical competence, to:

  • directly extend our understanding of metaphilosophy, by adding conceptual clarity to important notions we only have fuzzy understandings of. (Will this agent fall into epistemic pits? Are its actions low-impact? Will it avoid catastrophes?) As an analogy, formally defining mathematical proofs constituted significant progress in our understanding of mathematical logic and mathematical philosophy.
  • allow us to formally verify whether a computational process will satisfy desirable metaphilosophical properties, like those mentioned in the above parenthetical. (It seems perfectly fine for these processes to be built out of illegible components, like deep neural nets—while that makes them harder to inspect, it doesn't preclude us from making useful formal statements about them. For example, in ALBA, it would help us make formal guarantees that distilled agents remain aligned.)

I want to explore logical induction as a case study. I think the important part about logical induction is the logical induction criterion, not the algorithm implementing it. I've heard the implementation criticized for being computationally intractable, but I see its primary purpose as showing the logical induction criterion to be satisfiable at all. This elevates the logical induction criterion over all the other loose collections of desiderata that may or may not be satisfiable, and may or may not capture what we mean by logical uncertainty. If we were to build an actual aligned AGI, I would expect its reasoning process to satisfy the logical induction criterion, but not look very much like the algorithm presented in the logical induction paper.
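For concreteness, here is roughly what the criterion says (paraphrased from memory and with simplified notation; see Garrabrant et al.'s paper for the precise definitions): a market $\overline{\mathbb{P}} = (\mathbb{P}_1, \mathbb{P}_2, \dots)$ satisfies the logical induction criterion relative to a deductive process $\overline{D}$ if no efficiently computable trader $\overline{T}$ exploits it, where $\overline{T}$ exploits $\overline{\mathbb{P}}$ when the set of plausible values of its holdings,

$$\{\mathbb{W}(T_1 + \dots + T_n) \mid n \in \mathbb{N}^+,\ \mathbb{W} \in \mathcal{PC}(D_n)\},$$

is bounded below but not bounded above. (Here $\mathcal{PC}(D_n)$ is the set of truth assignments propositionally consistent with everything proved by day $n$, and $\mathbb{W}(\cdot)$ is the value of the trader's net holdings in the world $\mathbb{W}$.)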

I also think the logical induction criterion provides an exact formalization—a necessary AND sufficient condition—of what it means to not get stuck in any epistemic pits in the limit. (The gist of this intuition: epistemic pits you're stuck in forever correspond exactly to patterns in the market that a trader could exploit forever, and make unbounded profits from.) This lets us formalize the question "Does X reasoning process avoid permanently falling into epistemic pits?" into "Does X reasoning process satisfy the logical induction criterion?"
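To make the correspondence concrete, here's a toy sketch (the setup and numbers are made up for illustration; this is not the paper's formalism): a reasoner permanently stuck assigning credence 0.2 to a sentence phi that eventually gets proved, and a trader that buys one share of phi per day at that price.

```python
# Toy illustration (made up for this post): a reasoner that never updates its
# credence in a sentence phi that is in fact eventually proved. A trader that
# buys one share of phi per day at the stuck price ends up holding shares worth
# 1 each, so its profit grows linearly and without bound -- the market analogue
# of being stuck in an epistemic pit forever.

STUCK_PRICE = 0.2  # the reasoner's credence in phi, never updated


def trader_profit(num_days: int) -> float:
    """Trader's profit after num_days, assuming phi has been proved by then.

    Each day the trader buys one share of phi at STUCK_PRICE; once phi is
    proved, every share is worth 1 in all propositionally consistent worlds.
    """
    shares = float(num_days)
    cost = STUCK_PRICE * num_days
    return shares * 1.0 - cost  # = (1 - STUCK_PRICE) * num_days: unbounded


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, trader_profit(n))  # 8.0, 80.0, 800.0 -- no finite upper bound
```

A logical inductor rules this out: any pattern a trader could milk for unbounded profit must eventually disappear from the prices, which is exactly what "not staying stuck in the pit forever" means here.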

6 comments

I find your recent posts insightful, but think the way you use the terms "metaphilosophy" and "metaphilosophical competence" may be confusing. In my view "metaphilosophy" as commonly used (and previously on LW) means the study of the nature of philosophy and philosophical reasoning, including how humans "do philosophy" and how to automate philosophical reasoning. "Philosophical competence" typically means the ability to reason well about philosophy (in other words, being a good philosopher), so by extension "metaphilosophical competence" would mean the ability to reason about metaphilosophy (e.g., the ability to figure out how to reason about philosophy or how to program or teach an AI to reason about philosophy).

MIRI is doing object-level philosophy as part of figuring out things like decision theory and logical uncertainty, but I don't think that research directly contributes to metaphilosophy or what I would call the metaphilosophical competence of an AI, but instead is aimed at improving general epistemic competence and rationality. In other words, even if MIRI solves all of the open problems on its current research agenda, it still wouldn't be able to create an AI that is philosophically or metaphilosophically competent (according to my understanding of the meaning of these terms).

What you've called "metaphilosophical competence" in your recent posts seems to be a combination of both rationality and philosophical competence (so MIRI is trying to formalize a part of it). Do you think that's a correct understanding?

I agree I've been using "metaphilosophical competence" to refer to some combination of both rationality and philosophical competence. I have an implicit intuition that rationality, philosophical competence, and metaphilosophical competence all sort of blur into each other, such that being sufficient in any one of them makes you sufficient in all of them. I agree this is not obvious and probably confusing.

To elaborate: sufficient metaphilosophical competence should imply broad philosophical competence, and since metaphilosophy is a kind of philosophy, sufficient philosophical competence should imply sufficient metaphilosophical competence. Sufficient philosophical competence would allow you to figure out what it means to act rationally, and cause you to act rationally.

That rationality implies philosophical competence seems the least obvious. I suppose I think of philosophical competence as some combination of not being confused by words, and normal scientific competence—that is, given a bunch of data, figuring out which data is noise and which hypotheses fit the non-noisy data. Philosophy is just a special case where the data is our intuitions about what concepts should mean, the hypotheses are criteria/definitions that capture these intuitions, and the datapoints happen to be extremely sparse and noisy. Some examples:

  • Section 1.1 in the logical induction paper lists a bunch of desiderata ("datapoints") for what logical uncertainty is. The logical induction criterion is a criterion ("hypothesis") that fits a majority of those datapoints.
  • The Von Neumann–Morgenstern utility theorem starts with a bunch of desiderata ("datapoints") for rational behavior, and expected utility maximization is a criterion ("hypothesis") that fits these datapoints (the standard statement is spelled out after this list).
  • I think both utilitarianism and deontology are moral theories ("hypotheses") that fit a good chunk of our moral intuitions ("datapoints"). I also think both leave much to be desired.
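For concreteness, here is the standard statement behind the second bullet (textbook material, stated here just to show the datapoints-vs-hypothesis shape). The axioms play the role of the datapoints about a rational preference relation $\succeq$ over lotteries $L, M, N$:

1) Completeness: $L \succeq M$ or $M \succeq L$.
2) Transitivity: if $L \succeq M$ and $M \succeq N$, then $L \succeq N$.
3) Continuity: if $L \succeq M \succeq N$, there is some $p \in [0, 1]$ with $pL + (1-p)N \sim M$.
4) Independence: $L \succeq M$ iff $pL + (1-p)N \succeq pM + (1-p)N$ for every $N$ and $p \in (0, 1]$.

The theorem (the "hypothesis" that fits them) says $\succeq$ satisfies 1)–4) iff there is a utility function $u$ with $L \succeq M \iff \mathbb{E}_L[u] \ge \mathbb{E}_M[u]$, and $u$ is unique up to positive affine transformation.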

Philosophical progress seems objective and real like scientific progress—progress is made when a parsimonious new theory fits the data much better. One important way in which philosophical progress differs from scientific progress is that there's much less consensus on what the data is or whether a theory fits it better, but I think this is mostly a function of most people being extremely philosophically confused, rather than e.g. philosophy being inherently subjective. (The "not being confused by words" component I identified mostly corresponds to the skill of identifying which datapoints we should consider in the first place, which of the datapoints are noise, and what it means for a theory to fit the data.)

Relatedly, I think it is not a coincidence that the Sequences, which are primarily about rationality, also managed to deftly resolve a number of common philosophical confusions (e.g. MWI vs Copenhagen, free will, p-zombies).

I also suspect that a sufficiently rational AGI would simply not get confused by philosophy the way humans do, and that it would feel to it from the inside like a variant of science. For example, it's hard for me to imagine it tying itself up in knots trying to reason about theology. (I sometimes think about confusing philosophical arguments as adversarial examples for human reasoning...)

Anyway, I agree this was all unclear and non-obvious (and plausibly wrong), and I'm happy to hear any suggestions for better descriptors. I literally went with "rationality" before "metaphilosophical competence", but people complained that was overloaded and confusing...

I have an implicit intuition that rationality, philosophical competence, and metaphilosophical competence all sort of blur into each other, such that being sufficient in any one of them makes you sufficient in all of them.

I think this is plausible but I'm not very convinced by your arguments. Maybe we can have a discussion about it at a later date. I haven't been able to come up with a better term for a combination of all three that didn't sound awkward, so unless someone else has a good suggestion, perhaps you could just put some explanations at the top of your posts or when you first use the term in a post, something like "by 'metaphilosophical competence' I mean to also include philosophical competence and rationality."

To respond to the substance of your argument that being sufficient in any of rationality, philosophical competence, and metaphilosophical competence makes you sufficient in all of them:

sufficient metaphilosophical competence should imply broad philosophical competence

You could discover an algorithm for doing philosophy (implying great metaphilosophical competence) but not be able to execute it efficiently yourself.

since metaphilosophy is a kind of philosophy, sufficient philosophical competence should imply sufficient metaphilosophical competence

Philosophical competence could be a vector instead of a scalar, but I agree it's more likely than not that sufficient philosophical competence implies sufficient metaphilosophical competence.

Sufficient philosophical competence would allow you to figure out what it means to act rationally, and cause you to act rationally.

I agree with the first part, but figuring out what rationality is does not imply being motivated to act rationally. (Imagine the Blue-Minimizing Robot, plus a philosophy module connected to a speaker but not to anything else.)

Philosophy is just a special case where the data is our intuitions about what concepts should mean, the hypotheses are criteria/definitions that capture these intuitions, and the datapoints happen to be extremely sparse and noisy.

But where do those intuitions come from in the first place? Different people have different philosophically relevant intuitions, and having good intuitions seems to be an important part of philosophical competence, but is not implied (or at least not obviously implied) by rationality.


One important way in which philosophical progress differs from scientific progress is that there’s much less consensus on what the data is or whether a theory fits it better, but I think this is mostly a function of most people being extremely philosophically confused, rather than e.g. philosophy being inherently subjective

I would say that it is a function of philosophy being circular: there isn't a set of foundations that everyone agrees on, so any theory can be challenged by challenging its assumptions. Philosophical questions tend to be precisely the kind of difficult foundational issues that get kicked into philosophy from other disciplines.

I interpret agent foundations as being more about providing formal specifications of metaphilosophical competence, to [...] allow us to formally verify whether a computational process will satisfy desirable metaphilosophical properties

"Adding conceptual clarity" is a key motivation, but formal verification isn't a key motivation.

The point of things like logical induction isn't "we can use the logical induction criterion to verify that the system isn't making reasoning errors"; as I understand it, it's more "logical induction helps move us toward a better understanding of what good reasoning is, with a goal of ensuring developers aren't flying blind when they're actually building good reasoners".

Daniel Dewey's summary of the motivation behind HRAD is:

2) If we fundamentally "don't know what we're doing" because we don't have a satisfying description of how an AI system should reason and make decisions, then we will probably make lots of mistakes in the design of an advanced AI system.
3) Even minor mistakes in an advanced AI system's design are likely to cause catastrophic misalignment.

To which Nate replied at the time:

I think this is a decent summary of why we prioritize HRAD research. I would rephrase 3 as "There are many intuitively small mistakes one can make early in the design process that cause resultant systems to be extremely difficult to align with operators' intentions." I'd compare these mistakes to the "small" decision in the early 1970s to use null-terminated instead of length-prefixed strings in the C programming language, which continues to be a major source of software vulnerabilities decades later.
I’d also clarify that I expect any large software product to exhibit plenty of actually-trivial flaws, and that I don’t expect that AGI code needs to be literally bug-free or literally proven-safe in order to be worth running.

The position of the AI community is something like the position researchers would be in if they wanted to build a space rocket, but hadn't developed calculus or orbital mechanics yet. Maybe with enough trial and error (and explosives) you'll eventually be able to get a payload off the planet that way, but if you want things to actually work correctly on the first go, you'll need to do some basic research to cover core gaps in what you know.

To say that calculus or orbital mechanics help you "formally verify" that the system's parts are going to work correctly is missing where the main benefit lies, which is in knowing what you're doing at all, not in being able to machine-verify everything you'd like to. You need to formalize how good reasoning works because even if you can't always apply conventional formal methods, you still need to understand what you're building if you want robustness properties.