I want to work on technical AI alignment research and am trying to choose an area to get into. To do that, I want to understand the big picture of how to make AI go well through technical research. This post contains my views on that question: I list types of AI safety research and assess their usefulness.

Probably, many things in this post are wrong. I want to become less wrong. The main reason why I am posting this is to get feedback. So, please

  • send me links to alternative big picture views on how to make AI go well
  • tell me all the places where I am wrong and all the places where I am right
  • talk to me in the comments section, via DM, via email, or schedule a video chat.

Three fuzzy clusters of AI alignment research

  1. Research aimed at deconfusing us about AI. Examples of this are AIXI, MIRI's agent foundations, Michael Dennis' game theory, Tom Everitt's agent incentives, Vanessa Kosoy's incomplete models, and Comprehensive AI Services.
  2. Research aimed at providing "good enough" alignment for non-superintelligent AI. This category includes more hacky, less theory-based research, which probably won't help with the alignment of superintelligent AI. By superintelligent AI, I mean AI that is much smarter than humans, much faster, and almost omnipotent. Examples: everything under the umbrella of prosaic AI alignment, robustness of neural networks against adversarial examples, robustness of neural networks against out-of-distribution samples, almost everything related to neural networks, empirical work on learning human values, IDA (I think it goes under both 2 and 3), most things studied by OpenAI and DeepMind (though I am not sure exactly what they are studying), and Vanessa Kosoy's value learning protocols (I think they go under both 2 and 3).
  3. Research aimed at directly solving one of the problems on the way to aligning both superintelligent and non-superintelligent AI. Examples: IDA (I think it goes under both 2 and 3), Vanessa Kosoy's value learning protocols (I think they go under both 2 and 3), and Stuart Armstrong's research agenda (I am not sure about this one).

Their usefulness

I think type-1 research is most useful, type-3 is second best, and type-2 is least useful. Here's why I think so.

  1. At some point, humanity will create a superintelligent AI, unless we go extinct first. When that happens, we won't be making important decisions anymore. Instead, the AI will.
  2. Human-level AI might be alignable using hacky empirical testing, engineering, and "good enough" alignment. However, superintelligent AI can't be aligned with such methods.
  3. Superintelligent AI is an extremely powerful optimization process. Hence, if it's unaligned even a little, it'll be catastrophic.
  4. Therefore, it's crucial to align superintelligent AI perfectly.
  5. I don't see why it'll be easier to work on the alignment of superintelligent AI in the future rather than now, so we'd better start now. But I am unsure about this.
  6. There are too many confusing things about superintelligent AI alignment, and I don't see any clear ways to solve it without spending a lot of time on figuring out what is even going on (e.g., how can embedded agents work?). Hence, deconfusion is very important.

Many people seem to work on type-2 research. Probably, many of them have thought about it and decided that it's better. This is a reason to think that I am wrong. However, I think there are other reasons people may choose to work on type-2 research, such as:

  • It's easier to get paid for type-2 research.
  • It's easier to learn all the prerequisites for type-2 research and to actually do it.
  • Humans have a bias towards near-term thinking.
  • Type-2 research seems less fringe to many people.

Also, I have a feeling that in roughly the last three years, a small paradigm shift has happened. People interested in AI alignment have started talking less about superintelligence, singletons, AGI in the abstract, recursive self-improvement, and fast takeoff. Instead, they talk more about neural networks, slow takeoff, and smaller, less weird, and less powerful AI. They might be right, and this is a reason to be slightly more enthusiastic about type-2 research. However, I still think that the old paradigm, perhaps minus fast takeoff, is more useful.

Comments

Good idea to write down what you think! As someone who is moving toward AI Safety, and who has spent all this year reading, studying, working and talking with people, I disagree with some of what you write.

First, I feel that your classification comes from your decision that deconfusion is most important. I mean, your second category literally only contains things you describe as "hacky" and "not theory-based" without much differentiation, and it's clear that you believe theory and elegance (for lack of a better word) to be important. I also don't think that the third cluster makes much sense, as you point out that most of it lies in the second one too. Even deconfusion aims at solving alignment, just in a different way.

A dimension I find more useful is how much you need to understand what's happening inside the system. This scale goes from MIRI's embedded agency approach (on the "I need to understand everything about how the system works, at a mathematical level, to make it aligned" end) to prosaic AGI (on the "I can consider the system as a black box that behaves according to some incentives and build an architecture using it that ensures alignment" end). I like this dimension because I feel that a lot of my gut feelings about AI Safety research come from my own perspective on the value of "understanding what happens inside", and on how mathematical this understanding must be.

Here's an interesting discussion of this distinction.

How do I classify the rest? Something like this:

  • On the embedded agency end of the spectrum, things like Vanessa's research agenda and Stuart Armstrong's research agenda. Probably anything that fits in agent foundations too.
  • In the middle, I think of DeepMind's research about incentives, Evan Hubinger's research about inner alignment and myopia, and probably all the cool things about interpretability, like the Clarity team's work at OpenAI (see this post for an AI Safety perspective).
  • On the prosaic AGI end of the spectrum, I would put IDA and AI Safety via debate, Drexler's CAIS, and probably most of CHAI's published research (although I am less sure about that, and would be happy to be corrected).

Now, I want to say explicitly that this dimension is not a value scale. I'm not saying that either end is more valuable in and of itself. I'm simply pointing at what I think is an underlying parameter in why people work on what they work on. Personally, I'm more excited about the middle and the embedded agency end, but I still see value in and am curious about the other end of the spectrum.

It's easier to learn all the prerequisites for type-2 research and to actually do it.

I wholeheartedly disagree. I think you point out a very important aspect of AI Safety: the prerequisites are all over the place, sometimes completely different for different approaches. That being said, which prerequisites are easier to learn is more about personal fit and background than intrinsic difficulty. Is it inherently harder to learn model theory than statistical learning theory? Decision theory than neural networks? What about psychology or philosophy? What you write feels like a judgement stemming from "Math is more demanding". But try to understand all the interpretability stuff, and you'll realize that even if it lacks a lot of deep mathematical theorems, it still requires a tremendous amount of work to grok.

(Also, as another argument against your groupings, even the mathematical snobs would not put Vanessa's work in the "less prerequisite" section. I mean, she uses measure theory, bayesian statistics, online learning and category theory, among others!)

So my position is that thinking in terms of your beliefs about "how much one needs to understand the insides to make something work" will help you choose an approach to try your hand at. It's also pretty cheap to just talk to people and try a bunch of different approaches to see which ones feel right.

A word on choosing the best approach: I feel like AI Safety as it is doesn't help at all to do that. Because of the different prerequisites, understanding deeply any approach requires a time investment that triggers a sunk-cost fallacy when evaluating the value of this approach. Also, I think it's very common to judge a certain approach from the perspective of one's current approach, which might color this judgement in incorrect ways. The best strategy I know of, which I try to apply, is to try something and update regularly on its value compared to the rest.

(Also, as another argument against your groupings, even the mathematical snobs would not put Vanessa's work in the "less prerequisite" section. I mean, she uses measure theory, bayesian statistics, online learning and category theory, among others!)

To be fair, he did put Vanessa in all three categories... :P

I like your clusters (though of course I could quibble about which research goes where). I expect these clusters strongly disagree with how many researchers imagine their research will be used, and you'll probably get some pushback as a result, but it's still a useful categorization about what impact different research agendas are actually likely to have. I'm still not convinced that they're natural clusters, as opposed to a useful classification.

Basically anyone defending type-2 will probably argue that tractability/funding/etc should not be ignored. I'll make the opposite argument: don't search under the streetlight. In practice, working on the right problem is usually orders of magnitude more important than the amount of progress made on the problem - i.e. 1 unit of progress in the best direction is more important than 1000 units of progress in a random useful direction. This is a consequence of living in a high-dimensional world. (That said, remember that tractability and neglectedness both also offer some evidence of importance - not necessarily very much evidence, but some.)
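Here is a minimal numerical sketch of that high-dimensionality intuition (the dimensionality and the "best direction" are hypothetical, chosen purely for illustration): in a space with many independent directions of progress, a random useful direction has almost no component along the best one.

```python
# Illustrative sketch (made-up setting): a random useful direction has almost
# no overlap with the best direction, so many units of progress along it buy
# very little progress along the axis that matters.
import numpy as np

rng = np.random.default_rng(0)
d = 1_000_000                       # hypothetical number of "directions of progress"
best = np.zeros(d)
best[0] = 1.0                       # the (hypothetical) best direction

random_dir = rng.standard_normal(d)
random_dir /= np.linalg.norm(random_dir)

overlap = abs(random_dir @ best)    # |cosine| between the two directions
print(f"overlap ~ {overlap:.5f}")   # roughly sqrt(2/(pi*d)) ~ 0.0008 for d = 1e6
print(f"1000 units of random progress ~ {1000 * overlap:.2f} units along the best direction")
```

With a million possible directions, a thousand units of progress in a random direction translates, roughly, into less than one unit along the direction that matters.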

On the ease of working on superintelligent alignment now vs later... I haven't read Rohin's comment yet, but I assume he'll point out that future non-superintelligent AI could help us align more powerful AI. This is a good argument. I would much rather not trust even non-superintelligent AI that much - we'd basically be rolling the dice on whether that non-superintelligent AI is aligned well enough to get the superintelligent AI perfectly aligned to humans (not just to the non-super AI) - but it's still a good argument.

On the need for perfect alignment: we need AI to be basically-perfectly aligned if it's capable of making big changes quickly, regardless of whether it's optimizing things. E.g. nukes need to be very well aligned, even though they're not optimizers. And if we're going to get the full value out of AI, then it needs to be capable of making big changes quickly. (This matters because some people argue that we can de-risk AI by using non-optimizer architectures. I don't think that's sufficient to avoid the need for alignment.)

On the ease of working on superintelligent alignment now vs later... I haven't read Rohin's comment yet, but I assume he'll point out that future non-superintelligent AI could help us align more powerful AI. This is a good argument. I would much rather not trust even non-superintelligent AI that much - we'd basically be rolling the dice on whether that non-superintelligent AI is aligned well enough to get the superintelligent AI perfectly aligned to humans (not just to the non-super AI) - but it's still a good argument.

Amusingly to me, I said basically the same thing. I do think that "we'll have a better idea of what AGI will look like" is a more important reason for optimism about future research.

Unless you mean an omnipotent superintelligence, in which case we probably don't get much of an idea of what that looks like before it no longer matters what we do. In that case I argue that our job is not to align the omnipotent superintelligence, but to instead align the better-than-human AI whose job it is to build and align the next iteration of AI systems, and then say we'll have a better idea of what the better-than-human AI looks like in the future.

This matters because some people argue that we can de-risk AI by using non-optimizer architectures. I don't think that's sufficient to avoid the need for alignment.

+1

I'll make the opposite argument: don't search under the streetlight. In practice, working on the right problem is usually orders of magnitude more important than the amount of progress made on the problem - i.e. 1 unit of progress in the best direction is more important than 1000 units of progress in a random useful direction.

+1, though note that you can have beneficial effects other than "solving the problem", e.g. convincing people there is a problem, field-building (both reputation of the field, and people working in the field). It's still quite important for these other effects to focus on the right problem (it's not great if you build a field that then solves the wrong problem).

I think type-1 research is most useful, type-3 is second best, and type-2 is least useful. Here's why I think so.

All of your arguments seem to be about importance. Tractability and neglectedness matter too. (Unless your argument is that type 2 research will be of literally zero use for aligning superintelligent AI.) In particular, you seem dismissive of:

It's easier to learn all the prerequisites for type-2 research and to actually do it.

Even if you are perfectly altruistic and not limited by money, other people, reputation, etc., this is a legitimate reason to focus on type 2 research from a perspective of minimizing x-risk.

At some point, humanity will create a superintelligent AI, unless we go extinct first. When that happens, we won't be making important decisions anymore. Instead, the AI will.

The AI will be making important decisions long before it becomes near-omnipotent, as you put it. In particular, it should be doing all the work of aligning future AI systems well before it is near-omnipotent.

(However, you still need a pretty strong guarantee about that AI system, such that when it aligns a future AI system, that future AI system remains aligned with us, so I think overall the intuition is right.)

Human-level AI might be alignable using hacky empirical testing, engineering, and "good enough" alignment.

All alignment is "good enough" alignment, there is no such thing as "perfect" alignment except in idealized theory. All you get is more or less confidence in the AI system you're building. (I say this not to be pedantic, because I legitimately don't know what your threshold is for dismissing "hacky" alignment, or what you mean when you say it won't work on superintelligent AI systems, or what would count as "not hacky". I may or may not agree with you depending on the answer.)

Superintelligent AI is an extremely powerful optimization process. Hence, if it's unaligned even a little, it'll be catastrophic.

I agree with the intuition overall, but it isn't a particularly strong intuition. For example, any other human is at least a little unaligned with me, but there are at least some humans where I'd feel okay making them God (in the sense that in expectation I think the world would be better moment-to-moment by my values than it is today moment-to-moment; this could be an existential catastrophe because we don't control the future, but it isn't extinction).

I don't see why it'll be easier to work on the alignment of superintelligent AI in the future rather than now, so we'd better start now.

We'll have a better idea of how superintelligent AI will be built in the future.

There are too many confusing things about superintelligent AI alignment [...] Hence, deconfusion is very important.

I agree with the intuition in general.

Note though it's quite possible that some things we're confused about are also simply irrelevant to the thing we care about. (I would claim this of embedded agency with not much confidence.)

All alignment is "good enough" alignment, there is no such thing as "perfect" alignment except in idealized theory.

I strongly disagree with this. It may be true in some technical sense - e.g. we can't be 100% certain there's not a bug in our code - but I do think there exists a sharp, qualitative distinction between systems which are optimizing-for-the-thing-we-call-human-values and systems which aren't doing that. Most likely underlying generator of disagreement: I think there's a natural, precise notion of what we mean when we point to "human values", in much the same way that there's a natural, precise notion of what we mean when we point to a flower. There's still multiple steps between pointing to flowers and pointing to human values, but one feature I expect to carry over is that it's not an underspecified or fully-subjective notion - there is a well-defined sense in which the physical system of molecules comprising a human brain "wants things", and a well-defined notion of what that system wants.

I broadly agree with this perspective (and in fact it's one of my reasons for optimism about AI alignment).

But usually when LessWrongers argue against "good enough" alignment, they're arguing against alignment methods, saying that "nothing except proofs" will work, because only proofs give near-100% confidence. (I might be strawmanning this argument, I don't really understand it.)

You're talking about the internal structure of the AI system (is the AI system actually in fact optimizing for "human values", or something else), where I do expect a sharper, qualitative distinction. I'm claiming that our ability to get on the right side of that distinction is relatively smooth across the methods that we could use.

Part of my optimism about AI alignment (relative to LW) comes from thinking that since there (probably) is a relatively sharp qualitative divide between "aligned computation" and "unaligned computation", the "engineering approach" has more of a shot at working. (This isn't a big factor in my optimism though.)

I almost ended up writing a whole post more or less psychologizing this point recently.

Quotes from the probably-never-to-be-published post, which I might as well fillet out to present here:

Last year I was thinking about how humans refer to things. For example, when I say "human values," it seems like I am pointing to something (some thing), as surely as if I was using my finger to point at some material object. And so if we want an AI to learn about human values, it sure would be nice if it could follow that pointer out to the thing-being-pointed-to.
At the time, it wasn't at all obvious to me that I had already stepped off the path, but I had. Rather than trying to understand this thing humans do - refer to things - in terms of the map-making problem humans actually face [From earlier: The physical world is really complicated. Humans get some information about the world via the senses, and then we model it so that we can make sense of our senses, predict the world, and make plans. This can be a really useful starting point for explanations of confusing phenomena.], I had framed the problem with an analogy to physical objects. As if the analogy was clean, and as if objects were natural (dare I say directly-perceived) building blocks of the world.
It's a very tricky mistake to avoid, this thing of thinking that reality will respect your labels. I wanted to understand the "human values" label, and so I mistakenly tried to look for the process by which we associate that label with some natural object, or even natural pattern, out in the world that corresponds to "human values." But reality doesn't have objects for things just because we have labels for them. This is the fallacy of essentialism - the notion that if we have a word like "roundness," then there must be some thing out in the world that is roundness. The roundness-essence, if you will.

EDIT: To forestall the obvious objection to the last sentence that roundness is a pattern, and surely with a little elbow grease you could write down something about spherical symmetry that is equivalent to roundness-essence, the most relevant point to human values is that even if we have a label for a pattern, that pattern still doesn't have to exist. The label-making process of the human brain does not first require comprehension of some referent of the label.

Rather than finding a theory in which we can find a precise notion of human values, we need a theory in which we can do okay despite not having a precise notion of human values (yes, I agree that sounds paradoxical). And by the naturalization thesis, this sort of reasoning plausibly also applies to an aligned AI.

This isn't "rah rah type 2 research, boo type 1 research." What I mean is that I think the indeterminacy of human values connects the two together, like the critical point of water allows for a continuous transition between liquid and gas.

Counterargument: suppose a group of humans split off from the rest of humanity long enough ago that they have no significant shared cultural background. They develop language independently. Assuming they live in an area with trees, do they still develop a word for "tree", recognize individual trees as objects, and generally have a notion of tree which matches our notion? I think the answer is pretty clearly "yes" - in part because the number of examples a baby needs to learn what a word means is not nearly large enough to narrow down the massive object space unless they already have some latent classification for those objects.

It's true that the label-making process of the human brain does not require a referent in order to generate a word, but most words have them anyway - including (but not limited to) any word whose meaning can be reasonably-reliably communicated to someone who's never heard it before using fewer than a million examples.

One human can have a word for a pattern which doesn't exist. Two humans can use that word. But if you put the two humans in separate, identical rooms and ask them both to point to the <word>, and they consistently point to the same thing, then that's pretty clear evidence that the pattern exists in the world. "Human values" are a bit too abstract for that exact test, but I think we have more than enough analogous evidence to conclude that they do exist.

Okay, let's go with "tree." Is an acorn a tree? A seedling? What if the seedling is just a sprouted acorn in a plastic bag, versus a sprouted acorn that's planted in the ground? A dead, fallen-over tree? What about a big unprocessed log? The same log but with its bark stripped off?

How likely do you think it is that there's some culture out there that disagrees with you about at least two of these? How likely is it that you would disagree with yourself, given different contextual cues?

Trees obviously exist. And I agree with you that a clever clusterer will probably find some cluster that more or less overlaps with "tree" (though who knows, there's probably a culture out there that has a word for woody-stemmed plants but not for trees specifically, or no word for trees but words for each of the three different kinds of trees in their environment specifically).

But an AI that's trying to find the "one true definition of trees" will quickly run into problems. There is no thing, nothing with the properties intuitive to an object or substance, that defines trees. And if you make an AI that goes out and looks at the world and comes up with its own clusterings and then tries to learn what "tree" means from relatively few examples, this is precisely a 'good-enough' hack of the type 2 variety.

Is an acorn a tree? A seedling? What if the seedling is just a sprouted acorn in a plastic bag, versus a sprouted acorn that's planted in the ground? A dead, fallen-over tree? What about a big unprocessed log? The same log but with its bark stripped off?

How likely do you think it is that there's some culture out there that disagrees with you about at least two of these? How likely is it that you would disagree with yourself, given different contextual cues?

Wrong questions. A cluster does not need to have sharp classification boundaries in order for the cluster itself to be precisely defined, and it's the precise definition of the cluster itself that matters.

An even-more-simplified example: suppose we have a cluster in some dataset which we model as normal with mean 3.55 and variance 2.08. There may be points on the edge of the cluster which are ambiguously/uncertainly classified, and that's fine. The precision of the cluster itself is not about sharp classification, it's about precise estimation of the parameters (i.e. mean 3.55 and variance 2.08, plus however we're quantifying normality). If our algorithm is "working correctly", then there is an actual pattern out in the world corresponding to our cluster, and that pattern is the thing we want to point to - not any particular point within the pattern.
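Here's a minimal sketch of that distinction (made-up data, and it assumes scikit-learn is available): points near the boundary get genuinely mixed membership, while the parameters of the clusters themselves are pinned down precisely.

```python
# Illustrative sketch (hypothetical data): precise cluster parameters
# coexist with ambiguous classification of individual edge points.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping 1-D clusters; one roughly matches the "mean 3.55, variance 2.08" example.
data = np.concatenate([
    rng.normal(3.55, np.sqrt(2.08), 5000),
    rng.normal(9.0, 1.0, 5000),
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print("estimated means:    ", gmm.means_.ravel())        # close to 3.55 and 9.0 (in some order)
print("estimated variances:", gmm.covariances_.ravel())  # close to 2.08 and 1.0

# A point between the clusters is ambiguously classified...
print("membership of x=6.8:", gmm.predict_proba([[6.8]]).round(2))
# ...but that ambiguity doesn't make the clusters themselves imprecise.
```

The pattern described by the fitted parameters is the thing being pointed to; the fuzzy membership of any particular edge point is beside the point.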

Back to trees. The one true definition of trees does not unambiguously classify all objects as tree or not-tree; that is not the sense in which it is precisely defined. Rather, there is some precisely-defined generative model for observations-of-trees, and the concept of "tree" points to that model. Assuming the human-labelling-algorithm is "working correctly", that generative model matches an actual pattern in the world, and the precision of the model follows from the pattern. None of this requires unambiguous classification of logs as tree/not-tree.

On to human values. (I'll just talk about one human at the moment, because cross-human disagreements are orthogonal to the point here.) The answer to "what does this human want?" does not always need to be unambiguous - indeed it should not always be unambiguous, because that is not the actual nature of human values. Rather, I have some precisely-defined generative model for observations-involving-my-values. Assuming my algorithm is "working correctly", there is an actual pattern out in the world corresponding to that cluster, and that pattern is the thing we want to point to. That's not just "good enough"; pointing to that pattern (assuming it exists) is perfect alignment. That's what "mission accomplished" looks like. It's the thing we're modelling when we model our own desires.

Rather, there is some precisely-defined generative model for observations-of-trees, and the concept of "tree" points to that model. Assuming the human-labelling-algorithm is "working correctly", that generative model matches an actual pattern in the world, and the precision of the model follows from the pattern. None of this requires unambiguous classification of logs as tree/not-tree.

This contains the ad-hoc assumption that if there's one history in which I'll say logs are trees, and another history in which I won't, then what I'm doing is approximating a "real concept" in which logs are sorta-trees.

This is a modeling assumption about humans that doesn't have to be true. You could just as well say that in the two different worlds, I'm actually referring to two related but distinct concepts. (Or you could model me as picking things to say about trees in a way that doesn't talk about the properties of some "concept of trees" at all.)

The root problem is that "pointing to a real pattern" is not something humans can do in a vacuum. "I'm a great communicator, but people just don't understand me," as the joke goes. As far as I can tell, what you mean is that you're envisioning an AI that learns about patterns in the world, and then matches those patterns to some collection of data that it's been told to assume is "pointing to a pattern." And there is no unique scheme for this - at the very least, you've got a choice of universal turing machine, as well as a free parameter describing the expected human level of abstraction. And this isn't a case where any choice will do, because we're in the limited-data regime, where different ontologies can easily lead to different categorizations.

This contains the ad-hoc assumption that if there's one history in which I'll say logs are trees, and another history in which I won't, then what I'm doing is approximating a "real concept" in which logs are sorta-trees.

That is not an assumption, it is an implication of the use of the concept "tree" to make predictions. For instance, if I can learn general facts about trees by examining a small number of trees, then I know that "tree" corresponds to a real pattern out in the world. This extends to logs: to the extent that a log is a tree, I can learn general facts about trees by examining logs (and vice versa), and verify what I've learned by looking at more trees/logs.

Pointing to a real pattern is indeed not something humans can do in a vacuum. Fortunately we do not live in a vacuum; we live in a universe with lots of real patterns in it. Different algorithms will indeed result in somewhat different classifications/patterns learned at any given time, but we can still expect a fairly large class of algorithms to converge to the same classifications/patterns over time, precisely because they are learning from the same universe. A perfectly-aligned AI will not have a perfect model of human values at any given time, but it can update in the right direction - in some sense it's the update-procedure which is "aligned" with the true pattern, not the model itself which is "aligned".

That's why we often talk about perfectly "pointing" to human values, rather than building a perfect model of human values. It's not about having a perfect model at any given time, it's about "having a pointer" to the real-world pattern of human values, allowing us to do things like update our model in the right direction.
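A toy sketch of the difference between "having a pointer" and "having a perfect model" (the numbers and the simple running-mean learner are invented for illustration): the snapshot at any moment is imperfect, but the update rule keeps pulling it toward the real pattern generating the data.

```python
# Hypothetical illustration: the running estimate is never exactly right,
# but the update procedure is "aimed at" the true pattern, so the estimate
# keeps moving in the right direction as more observations arrive.
import numpy as np

rng = np.random.default_rng(0)
true_mean = 2.5          # the real-world pattern we want to point to
estimate = 0.0           # the current (imperfect) model of that pattern

for t in range(1, 10_001):
    observation = rng.normal(true_mean, 3.0)      # noisy evidence about the pattern
    estimate += (observation - estimate) / t      # incremental running-mean update
    if t in (10, 100, 1000, 10_000):
        print(f"after {t:>6} observations: estimate = {estimate:.3f}")
# Every snapshot is somewhat wrong, but the pointer (the update rule plus the
# evidence stream) reliably converges toward the true pattern.
```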

As far as I can tell, what you mean is that you're envisioning an AI that learns about patterns in the world, and then matches those patterns to some collection of data that it's been told to assume is "pointing to a pattern." And there is no unique scheme for this - at the very least, you've got a choice of universal turing machine, as well as a free parameter describing the expected human level of abstraction. And this isn't a case where any choice will do, because we're in the limited-data regime...

I definitely do not imagine that some random architecture would get it right with realistic amounts of data. Picking an architecture which matches the structure of our universe closely enough to perform well with limited data is a key problem - it's exactly the sort of thing that e.g. my work on abstraction will hopefully help with.

(Also, matching the patterns to some collection of data intended to point to the pattern is not the only way of doing things, or even a very good way given the difficulty of verification, though for purposes of this discussion it's a fine approach to examine.)

That is not an assumption, it is an implication of the use of the concept "tree" to make predictions.

I would disagree in spirit - an AI can happily find a referent to the "tree" token that depends on context in a way that works like a word with multiple possible definitions.

Picking an architecture which matches the structure of our universe closely enough to perform well with limited data is a key problem

I hope this is where we can start agreeing. Because the problem isn't just finding something that performs well according to a known scoring rule. We don't quite know how to implement the notion "this method for learning human values performs well" on a computer without basically already referring to some notion of human values for "performs well."

We either need to ground "performs well" in some theory of humans as approximate agents that doesn't need to know about their values, or we need some theory that avoids the chicken-and-egg problem altogether by simultaneously learning human models and the standards to judge them by.

I hope this is where we can start agreeing. Because the problem isn't just finding something that performs well according to a known scoring rule. We don't quite know how to implement the notion "this method for learning human values performs well" on a computer without basically already referring to some notion of human values for "performs well."

To clarify, when I said "performs well", I did not mean "learns human values well", nor did I have any sort of scoring rule in mind. I intended to mean that the algorithm learns patterns which are actually present in the world - much like earlier when I talked about "the human-labelling-algorithm 'working correctly'".

Probably not the best choice of words on my part; sorry for causing a tangent.

I would disagree in spirit - an AI can happily find a referent to the "tree" token that depends on context in a way that works like a word with multiple possible definitions.

I'm sure it could, but I am claiming that such a thing would have worse predictive power. Roughly speaking: if there's one notion of tree that includes saplings, and another that includes logs, then the model misses the ability to learn facts about saplings by examining logs. Conversely, if it doesn't miss those sorts of things, then it isn't actually behaving like a word with multiple possible referents. (I don't actually think it's that simple - the referent of "tree" is more than just a comparison class - but it hopefully suffices to make the point.)
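As a toy statistical illustration of that predictive-power claim (the "wood density" property and all the numbers are made up): if "tree" is one concept covering both saplings and logs, evidence about logs transfers to saplings; if they are treated as unrelated concepts, it doesn't.

```python
# Hypothetical sketch: pooling observations under one concept lets evidence
# about "logs" sharpen estimates about "saplings"; treating them as separate
# concepts forgoes that transfer (and loses predictive power if they really
# do share structure).
import numpy as np

rng = np.random.default_rng(0)
shared_density = 0.62                          # made-up shared "tree" property
saplings = rng.normal(shared_density, 0.1, 5)  # only a few sapling measurements
logs = rng.normal(shared_density, 0.1, 500)    # lots of log measurements

separate_estimate = saplings.mean()                        # ignores the log data
pooled_estimate = np.concatenate([saplings, logs]).mean()  # one "tree" concept

# The pooled estimate is typically much closer to the true value.
print(f"separate-concepts estimate for saplings: {separate_estimate:.3f}")
print(f"single-concept (pooled) estimate:        {pooled_estimate:.3f}")
print(f"true value:                              {shared_density:.3f}")
```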

To clarify, when I said "performs well", I did not mean "learns human values well", nor did I have any sort of scoring rule in mind. I intended to mean that the algorithm learns patterns which are actually present in the world - much like earlier when I talked about "the human-labelling-algorithm 'working correctly'".

Ah well. I'll probably argue with you more about this elsewhere, then :)

This is very well-said, but I still want to dispute the possibility of "perfect alignment". In your clustering analogy: the very existence of clusters presupposes definitions of entities-that-correspond-to-points, dimensions-of-the-space-of-points, and measurements-of-given-points-in-given-dimensions. All of those definitions involve imperfect modeling assumptions and simplifications. Your analogy also assumes that a normal-mixture-model is capable of perfectly capturing reality; I'm aware that this is provably asymptotically true for an infinite-cluster Dirichlet process mixture, but we don't live in asymptopia and in reality it is effectively yet another strong assumption that holds at best weakly.

In other words, while I agree with (and appreciate your clear expression of) your main point that it's possible to have a well-defined category without being able to do perfect categorization, I dispute the idea that it is possible even in theory to have a perfectly-defined one.

All of those definitions involve imperfect modeling assumptions and simplifications. Your analogy also assumes that a normal-mixture-model is capable of perfectly capturing reality; I'm aware that this is provably asymptotically true for an infinite-cluster Dirichlet process mixture, but we don't live in asymptopia and in reality it is effectively yet another strong assumption that holds at best weakly.

This is a critical point; it's the reason we want to point to the pattern in the territory rather than to a human's model itself. It may be that the human is using something analogous to a normal-mixture-model, which won't perfectly match reality. But in order for that to actually be predictive, it has to find some real pattern in the world (which may not be perfectly normal, etc). The goal is to point to that real pattern, not to the human's approximate representation of that pattern.

Now, two natural (and illustrative) objections to this:

  • If the human's representation is an approximation, then there may not be a unique pattern to which their notions correspond; the "corresponding pattern" may be underdefined.
  • If we're trying to align an AI to a human, then presumably we want the AI to use the human's own idea of the human's values, not some "idealized" version.

The answer to both of these is the same: we humans often update our own notion of what our values are, in response to new information. The reality-pattern we want to point to is the pattern toward which we are updating; it's the thing our learning-algorithm is learning about. I think this is what coherent extrapolated volition is trying to get at: it asks "what would we want if we knew more, thought faster, ...". Assuming that the human-label-algorithm is working correctly, and continues working correctly, those are exactly the sort of conditions generally associated with convergence of the human's model to the true reality-pattern.

Here are my responses to your comments, sorted by how interesting they are to me, descending. Also, thanks for your input!

Non-omnipotent AI aligning omnipotent AI

The AI will be making important decisions long before it becomes near-omnipotent, as you put it. In particular, it should be doing all the work of aligning future AI systems well before it is near-omnipotent.

Please elaborate. I can imagine multiple versions of what you're imagining. Is one of the following scenarios close to what you mean?

  1. Scientists use AI-based theorem provers to prove theorems about AI alignment.
  2. There's an AI, with which you can have conversations. It tries to come up with new mathematical definitions and theorems related to what you're discussing.
  3. The AI (or multiple AIs) is not near-omnipotent yet, but it already controls most of humanity's resources and makes most of the decisions, so it does AI research instead of humans.

I think the requirements for how well the non-omnipotent AI in the 3rd scenario should be aligned are basically the same as for a near-omnipotent AI. If the non-omnipotent AI in the 3rd scenario is very misaligned, but that isn't catastrophic only because the AI is not smart enough, then the near-omnipotent AI it creates will also be misaligned, and that will be catastrophic.

Embedded agency

Note though it's quite possible that some things we're confused about are also simply irrelevant to the thing we care about. (I would claim this of embedded agency with not much confidence.)

So, you think embedded agency research is unimportant for AI alignment. On the contrary, I think it's very important. I worry about it mainly for the following reasons. Suppose we don't figure out embedded agency. Then

  • An AI won't be able to safely self-modify
  • An AI won't be able to comprehend that it can be killed or damaged or modified by others
  • I am not sure about this one, and I am very interested to know if it's not the case. I think that if we build an AI without understanding embedded agency, and that AI builds a new AI, that new AI also won't understand embedded agency. In other words, the set of AIs built without taking embedded agency into account is closed under the operation of an AI building a new AI. [Upd: comments under this comment mostly refute this]
  • I am even less sure about this item, but maybe such an AI will be too dogmatic (as in dogmatic prior) about how the world might work, because it is sure that it can't be killed or damaged or modified. Due to this, if the physics laws turn out to be weird (e.g. we live in a multiverse, or we're in a simulation), the AI might fail to understand that and thus fail to turn the whole world into hedonium (or whatever it is that we would want it to do with the world).
  • If an AI built without taking embedded agency into account meets very smart aliens someday, it might fuck up due to its inability to imagine that someone can predict its actions.

Usefulness of type-2 research for aligning superintelligent AI

Unless your argument is that type 2 research will be of literally zero use for aligning superintelligent AI.

I think that if one man-year of type-1 research produces 1 unit of superintelligent AI alignment, one man-year of type-2 research produces about 0.15 units of superintelligent AI alignment.

As I see it, the mechanisms by which type-2 research helps align superintelligent AI are:

  • It may produce useful empirical data which'll help us reach type-1 theoretical insights.
  • Thinking about type-2 research contains a small portion of type-1 thinking.

For example, if someone works on making contemporary neural networks robust to out-of-distribution examples, and they do that mainly by experimenting, their experimental data might provide insights about the nature of robustness in the abstract, and surely some portion of their thinking will be dedicated to the theory of robustness.

My views on tractability and neglectedness

Tractability and neglectedness matter too.

Alright, I agree with you about tractability.

About neglectedness, I think type-2 research is less neglected than type-1 and type-3 and will be less neglected in the next 10 years or so, because

  • It's practical, you can sell it to companies which want to make robots or unbreakable face detection or whatever.
  • Humans have a bias towards near-term thinking.
  • Neural networks are a hot topic.
Non-omnipotent AI aligning omnipotent AI

I basically mean the third scenario:

The AI (or multiple AIs) is not near-omnipotent yet, but it already controls most of humanity's resources and makes most of the decisions, so it does AI research instead of humans.

I agree that you still need a strong guarantee of alignment in this scenario (as I mentioned in my original comment).

On the contrary, I think it's very important. I worry about it mainly for the following reasons. Suppose we don't figure out embedded agency. Then [...]

Why don't these arguments apply to humans? Evolution didn't understand embedded agency, but managed to create humans who seem to do okay at being embedded agents.

(I buy this as an argument that an AI system needs to not ignore the fact that it is embedded, but I don't buy it as an argument that we need to be deconfused about embedded agency.)

I think that if one man-year of type-1 research produces 1 unit of superintelligent AI alignment, one man-year of type-2 research produces about 0.15 units of superintelligent AI alignment.

Cool, that's more concrete, thanks. (I disagree, but there isn't really an obvious point to argue on, the cruxes are in the other points.)

About neglectedness, I think type-2 research is less neglected than type-1 and type-3 and will be less neglected in the next 10 years or so, because

Agreed. Tbc, I wasn't arguing it was neglected, just that you seemed to be ignoring tractability and neglectedness, which seemed like a mistake.

I see MIRI's research on agent foundations (including embedded agency) as something like "We want to understand ${an aspect of how agents should work}, so let's take the simplest case first and see if we understand everything about it. The simplest case is the case when the agent is nearly omniscient and knows all logical consequences. Hmm, we can't figure out this simplest case yet - it breaks down if the conditions are sufficiently weird". Since it turns out that it's difficult to understand embedded agency even for such simple cases, it seems plausible that an AI trained to understand embedded agency by a naive learning procedure (similar to evolution) will break down under sufficiently weird conditions.

Why don't these arguments apply to humans? Evolution didn't understand embedded agency, but managed to create humans who seem to do okay at being embedded agents.

(I buy this as an argument that an AI system needs to not ignore the fact that it is embedded, but I don't buy it as an argument that we need to be deconfused about embedded agency.)

Hmm, very good argument. Since I think humans have an imperfect understanding of embedded agency, thanks to you I now no longer think that "if we build an AI without understanding embedded agency, and that AI builds a new AI, that new AI also won't understand embedded agency", since that would imply we can't get the "lived happily ever after" at all. We can ignore the case where we can't get the "lived happily ever after" at all, because in that case nothing matters anyway.

I suppose we could run evolutionary search or something, selecting for AIs that can understand the typical cases of being modified by themselves or by the environment, which we include in the training dataset. I wonder how we can make such an AI understand very atypical cases of modification. A near-omnipotent AI will be a very atypical case.

Can we come up with a learning procedure to have the AI learn embedded agency on its own? It seems plausible to me that we will need to understand embedded agency better to do this, but I don't really know.

Btw, in another comment, you say

But usually when LessWrongers argue against "good enough" alignment, they're arguing against alignment methods, saying that "nothing except proofs" will work, because only proofs give near-100% confidence.

I basically subscribe to the argument that nothing except proofs will work in the case of superintelligent agentic AI.

Re: embedded agency, while these are all potentially relevant points (especially self-modification), I don't see any of them as the main reason to study embedded agents from an alignment standpoint. I see the main purpose of embedded agency research as talking about humans, not designing AIs - in particular, in order to point to human values, we need a coherent notion of what it means for an agenty system embedded in its environment (i.e. a human) to want things. As the linked post discusses, a lot of the issues with modelling humans as utility-maximizers or using proxies for our goals stem directly from more general embedded agency issues.

Major props for writing down your understanding in such a readable, clear, and relatively short way. I expect this will be a benefit to you in 6, 12, 18 months, when you look back and see how your big picture thinking has changed.

By superintelligent AI, I mean AI, which is much smarter than humans, much faster, and almost passes the omnipotence test.

I'm confused by your invocation of the Omni Test here as part of your definition of just a superintelligent AI, rather than specifically an aligned superintelligent AI. From that linked page:

The Omni Test is that an advanced AI should be expected to remain aligned, or not lead to catastrophic outcomes, or fail safely, even if it suddenly knows all facts and can directly ordain any possible outcome as an immediate choice.

Did you mean to be defining an aligned superintelligent AI? Or were you thinking of the Omni Test as about what capabilities it has, rather than about whether it maintains alignment in the face of a huge increase in abilities?

As I interpret the test, a proposed AI that passes it is one that would maintain its alignment, even if its knowledge and capabilities grew arbitrarily. It's not about whether the AI is already supremely capable or not. So it seems odd to include as part of the definition of superintelligence itself (which I take to usually be just about capabilities).

(This whole comment might seem a bit nitpicky, but I was emboldened by your request for feedback :-) )

No, I was talking about an almost omnipotent AI, not necessarily aligned. I've now fixed what words I use there.