Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Paul Christiano and "MIRI" have disagreed on an important research question for a long time: should we focus research on aligning "messy" AGI (e.g. one found through gradient descent or brute force search) with human values, or on developing "principled" AGI (based on theories similar to Bayesian probability theory)? I'm going to present my current model of this disagreement and additional thoughts about it.


I put "MIRI" in quotes because MIRI is an organization composed of people who have differing views. I'm going to use the term "MIRI view" to refer to some combination of the views of Eliezer, Benya, and Nate. I think these three researchers have quite similar views, such that it is appropriate in some contexts to attribute a view to all of them collectively, and that their views constitute what most people think of as the "MIRI view".

(KANSI (known-algorithm non-self-improving) AI complicates this disagreement somewhat; the story here is that a KANSI AI can use "messy" components, but those components have to have their capabilities restricted significantly. Such restriction isn't necessary if we think messy AGI can be aligned in general.)

Intuitions and research approaches

I'm generally going to take the perspective of looking for the intuitions motivating a particular research approach or produced by a particular research approach, rather than looking at the research approaches themselves. I expect it is easier to reach agreement about how compelling a particular intuition is (at least when other intuitions are temporarily ignored) than to reach agreement on particular research approaches.

In general, it's quite possible for a research approach to be inefficient while still being based on, or giving rise to, useful intuitions. So a criticism of a particular research approach is not necessarily a criticism of the intuitions behind it.

Terminology

  • A learning problem is a task for which the AI is supposed to output some information, and, if we wanted, we could give that information a score measuring how well it accomplishes the task, using less than ~2 weeks of labor. In other words, there's an inexpensive "ground truth" we have access to. This definition looks a little weird, but I think it picks out a natural category, and some of the intuitions below turn on the distinction between learning and non-learning problems. Paul has written about learning and non-learning problems here.
  • An AI system is aligned if it is pursuing some combination of different humans' values and not significantly pursuing other values that could impact the long-term future of humanity. If it is significantly pursuing other values, it is unaligned.
  • An AI system is competitive if it is nearly as efficient as other AI systems (aligned or unaligned) that people could build.

Listing out intuitions

I'm going to list out a bunch of relevant intuitions. Usually I can't actually convey the intuition through text; at best I can write "what someone who has this intuition would feel like saying" and "how someone might go about gaining this intuition". Perhaps the text will make "logical" sense to you without feeling compelling; this could be a sign that you don't have the underlying intuition.

Background AI safety intuitions

These background intuitions are ones that I think are shared by both Paul and MIRI.

1. Weak orthogonality. It is possible to build highly intelligent agents with silly goals such as maximizing paperclips. Random "minds from mindspace" (e.g. found through brute force search) will have values that significantly diverge from human values.

2. Instrumental convergence. Highly advanced agents will by default pursue strategies such as gaining resources and deceiving their operators (performing a "treacherous turn").

3. Edge instantiation. For most objective functions that naively seem useful, the maximum is quite "weird" in a way that is bad for human values.

4. Patch resistance. Most AI alignment problems (e.g. edge instantiation) are very difficult to "patch"; adding a patch that deals with a specific failure will fail to fix the underlying problem and instead lead to further unintended solutions.

Intuitions motivating the agent foundations approach

I think the following intuitions are sufficient to motivate the agent foundations approach to AI safety (thinking about idealized models of advanced agents to become less confused), and something similar to the agent foundations agenda, at least if one ignores contradictory intuitions for a moment. In particular, when considering these intuitions at once, I feel compelled to become less confused about advanced agents through research questions similar to those in the agent foundations agenda.

I've confirmed with Nate that these are similar to some of his main intuitions motivating the agent foundations approach.

5. Cognitive reductions are great. When we feel confused about something, there is often a way out of this confusion, by figuring out which algorithm would have generated that confusion. Often, this works even when the original problem seemed "messy" or "subjective"; something that looks messy can have simple principles behind it that haven't been discovered yet.

6. If you don't do cognitive reductions, you will put your confusion in boxes and hide the actual problem. By default, a lot of people studying a problem will fail to take the perspective of cognitive reductions and thereby not actually become less confused. The free will debate is a good example of this: most discussion of free will contains confusions that could be resolved using Daniel Dennett's cognitive reduction of free will. (This is essentially the same as the cognitive reduction discussed in the sequences.)

7. We should expect mainstream AGI research to be inefficient at learning much about the confusing aspects of intelligence, for this reason. It's pretty easy to look at most AI research and see where it's hiding fundamental confusions such as logical uncertainty without actually resolving them. E.g. if neural networks are used to predict math, then the confusion about how to do logical uncertainty is placed in the black box of "what this neural net learns to do". This isn't that helpful for actually understanding logical uncertainty in a "cognitive reduction" sense; such an understanding could lead to much more principled algorithms.

8. If we apply cognitive reductions to intelligence, we can design agents we expect to be aligned. Suppose we are able to observe "how intelligence feels from the inside" and distill these observations into an idealized cognitive algorithm for intelligence (similar to the idealized algorithm Daniel Dennett discusses to resolve free will). The minimax algorithm is one example of this: it's an idealized version of planning that in principle could have been derived by observing the mental motions humans do when playing games. If we implement an AI system that approximates this idealized algorithm, then we have a story for why the AI is doing what it is doing: it is taking action X for the same reason that an "idealized human" would take action X. That is, it "goes through mental motions" that we can imagine going through (or approximates doing so), if we were solving the task we programmed the AI to do. If we're programming the AI to assist us, we could imagine the mental motions we would take if we were assisting aliens.
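
To make the minimax example concrete, here is a minimal sketch of that idealized planning computation (in Python; the `game` interface with `is_terminal`, `value`, `result`, and `legal_moves` is a hypothetical stand-in for any two-player zero-sum game, not part of any system discussed here):

```python
def minimax(state, game, maximizing):
    # Idealized planning: consider every move each player could make, assume the
    # opponent also plays optimally, and return the best value the maximizing
    # player can guarantee from this state.
    if game.is_terminal(state):
        return game.value(state)  # payoff from the maximizing player's perspective
    child_values = [
        minimax(game.result(state, move), game, not maximizing)
        for move in game.legal_moves(state)
    ]
    return max(child_values) if maximizing else min(child_values)
```

The point isn't that this is practical for nontrivial games; it's that each step corresponds to a mental motion a human game-player could recognize themselves performing.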

9. If we don't resolve our confusions about intelligence, then we don't have this story, and this is suspicious. Suppose we haven't actually resolved our confusions about intelligence. Then we don't have the story in the previous point, so it's pretty weird to think our AI is aligned. We must have a pretty different story, and it's hard to imagine different stories that could allow us to conclude that an AI is aligned.

10. Simple reasoning rules will correctly generalize even for non-learning problems. That is, there's some way that agents can learn rules for making good judgments that generalize to tasks they can't get fast feedback on. Humans seem to be an existence proof that simple reasoning rules can generalize; science can make predictions about far-away galaxies even when there isn't an observable ground truth for the state of the galaxy (only indirect observations). Plausibly, it is possible to use "brute force" to find agents using these reasoning rules by searching for agents that perform well on small tasks and then hoping that they generalize to large tasks, but this can result in misalignment. For example, Solomonoff induction is controlled by malign consequentialists who have learned good rules for how to reason; approximating Solomonoff induction is one way to make an unaligned AI. If an aligned AI is to be roughly competitive with these "brute force" unaligned AIs, we should have some story for why the aligned AI system is also able to acquire simple reasoning rules that generalize well. Note that Paul mostly agrees with this intuition and is in favor of agent foundations approaches to solving this problem, although his research approach would significantly differ from the current agent foundations agenda. (This point is somewhat confusing; see my other post for clarification)

Intuitions motivating act-based agents

I think these following intuitions are all intuitions that Paul has that motivate his current research approach.

11. Almost all technical problems are either tractable to solve or are intractable/impossible for a good reason. This is based on Paul's experience in technical research. For example, consider a statistical learning problem where we are trying to predict a Y value from an X value using some model. It's possible to get good statistical guarantees on problems where the training distribution of X values is the same as the test distribution of X values, but when those distributions are distinguishable (i.e. there's a classifier that can separate them pretty well), there's a fundamental obstruction to getting the same guarantees: given the information available, there is no way to distinguish a model that will generalize from one that won't, since they could behave in arbitrary ways on test data that is distinctly different from training data. An exception to the rule is NP-complete problems; we don't have a good argument yet for why they can't be solved in polynomial time. However, even in this case, NP-hardness forms a useful boundary between tractable and intractable problems.
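
As a toy illustration of the distinguishable-distributions point (my own sketch using numpy and scikit-learn, not an example from Paul's writing):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, size=(500, 5))  # inputs seen during training
X_test = rng.normal(loc=3.0, size=(500, 5))   # test-time inputs from a shifted distribution

# Train a "domain classifier" to tell the two input distributions apart.
X = np.vstack([X_train, X_test])
domain = np.concatenate([np.zeros(500), np.ones(500)])
separability = LogisticRegression(max_iter=1000).fit(X, domain).score(X, domain)
print(f"train/test separability: {separability:.2f}")
# Near 1.0: the distributions are distinguishable, so a model could behave
# arbitrarily on the test region while looking perfect on the training data,
# and no guarantee based on training performance alone can rule that out.
```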

12. If the previous intuition is true, we should search for solutions and fundamental obstructions. If there is either a solution or a fundamental obstruction to a problem, then an obvious way to make progress on the problem is to alternate between generating obvious solutions and finding good reasons why a class of solutions (or all solutions) won't work. In the case of AI alignment, we should try getting a very good solution (e.g. one that allows the aligned AI to be competitive with unprincipled AI systems such as ones based on deep learning by exploiting the same techniques) until we have a fundamental obstruction to this. Such a fundamental obstruction would tell us which relaxations to the "full problem" we should consider, and be useful for convincing others that coordination is required to ensure that aligned AI can prevail even if it is not competitive with unaligned AI. (Paul's research approach looks quite optimistic partially because he is pursuing this strategy).

13. We should be looking for ways of turning arbitrary AI capabilities into equally powerful aligned AI capabilities. On priors, we should expect it to be hard for AI safety researchers to make capabilities advances; AI safety researchers make up only a small percentage of AI researchers. If this is the case, then aligned AI will be quite uncompetitive unless it takes advantage of the most effective AI technology that's already around. It would be really great if we could take an arbitrary AI technology (e.g. deep learning), do a bit of thinking, and come up with a way to direct that technology towards human values. There isn't a crisp fundamental obstruction to doing this yet, so it is the natural first place to look. To be more specific about what this research strategy entails, suppose it is possible to build an unaligned AI system. We expect it to be competent; say it is competent for reason X. We ought to be able to either build an aligned AI system that also works for reason X, or else find a fundamental obstruction. For example, reason X could be "it does gradient descent to find weights optimizing a proxy for competence"; then we'd seek to build a system that works because it does gradient descent to find weights optimizing a proxy for competence and alignment.
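
As a schematic sketch of that last sentence (purely illustrative: both proxies below are toy placeholders I made up, and obtaining a trustworthy alignment proxy is of course the actual hard part):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)            # toy "competence" task
X_audit, y_audit = rng.normal(size=(50, 10)), rng.normal(size=50)  # toy overseer-audited "alignment" signal

def grad(w, lam=1.0):
    # Gradient of: (proxy for competence) + lam * (proxy for alignment),
    # where both proxies are mean squared errors on their respective datasets.
    g_competence = 2 * X.T @ (X @ w - y) / len(y)
    g_alignment = 2 * X_audit.T @ (X_audit @ w - y_audit) / len(y_audit)
    return g_competence + lam * g_alignment

w = np.zeros(10)
for _ in range(500):
    w -= 0.02 * grad(w)  # gradient descent on the combined proxy
```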

14. Pursuing human narrow values presents a much more optimistic picture of AI alignment. See Paul's posts on narrow value learning, act-based agents, and abstract approval direction. The agent foundations agenda often considers problems of the form "let's use Bayesian VNM agents as our starting point and look for relaxations appropriate to realistic agents, which are naturalized". This leads to problems such as decision theory, naturalized induction, and ontology identification. However, there isn't a clear argument for why they are subproblems of the problem we actually care about (which is close to something like "pursuing human narrow values"). For example, perhaps we can understand how to have an AI pursue human narrow values without solving decision theory, since maybe humans don't actually have a utility function or a decision theory yet (though we might upon long-term reflection; pursuing narrow values should preserve the conditions for such long-term reflection). These research questions might be useful threads to pull on if solving them would tell us more about the problems we actually care about. But I think Paul has a strong intuition that working on these problems isn't the right way to make progress on pursuing human narrow values.

15. There are important considerations in favor of focusing on alignment for foreseeable AI technologies. See posts here and here. In particular, this motivates work related to alignment for systems solving learning problems.

16. It is, in principle, possible to automate a large fraction of human labor using robust learning. That is, a human can use a relatively small amount of labor to oversee the AI doing a much larger amount of labor in a robust fashion. KWIK learning is a particularly clean (though impractical) demonstration of this. This enables the human to spend much more time overseeing a particular decision than the AI takes to make it (e.g. spending 1 day to oversee a decision made in 1 second), since only a small fraction of decisions are overseen.
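
Here is a minimal sketch of the bookkeeping behind "oversee only a small fraction of decisions" (a toy illustration, not KWIK learning itself; the review fraction is a made-up number, while the 1-day-versus-1-second figures come from the example above):

```python
decisions = 10_000_000             # decisions the AI makes
ai_seconds_per_decision = 1        # the AI takes ~1 second per decision
human_seconds_per_review = 86_400  # the human spends a full day on each reviewed decision
review_fraction = 1e-6             # but only a tiny random fraction of decisions is reviewed

ai_labor = decisions * ai_seconds_per_decision
human_labor = decisions * review_fraction * human_seconds_per_review
print(f"expected reviews: {decisions * review_fraction:.0f}")
print(f"human oversight labor / AI labor = {human_labor / ai_labor:.3f}")  # ~0.086
```

Even though each reviewed decision costs the human ~86,400 times what it costs the AI, total human labor stays a small fraction of total AI labor because so few decisions are reviewed.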

17. The above is quite powerful, due to bootstrapping. "Automating a large fraction of human labor" is significantly more impressive than it first seems, since the human can use other AI systems in the course of evaluating a specific decision. See ALBA. We don't yet have a fundamental obstruction to any of ALBA's subproblems, and we have an argument that solving these subproblems is sufficient to create an aligned learning system.

18. There are reasons to expect the details of reasoning well to be "messy". That is, there are reasons why we might expect cognition to be as messy and hard to formalize as biology is. While biology has some important large-scale features (e.g. evolution), overall it is quite hard to capture using simple rules. We can take the history of AI as evidence for this; AI research often does consist of people trying to figure out how humans do something at an idealized level and formalize it (roughly similar to the agent foundations approach), and this kind of AI research does not always lead to the most capable AI systems. The success of deep learning is evidence that the most effective way for AI systems to acquire good rules of reasoning is usually to learn them, rather than having them be hardcoded.

What to do from here?

I find all the intuitions above at least somewhat compelling. Given this, I have made some tentative conclusions:

  • I think intuition 10 ("simple reasoning rules generalize for non-learning problems") is particularly important. I don't quite understand Paul's research approach for this question, but it seems that there is convergence that this intuition is useful and that we should take an agent foundations approach to solving the problem. I think this convergence represents a great deal of progress in the overall disagreement.
  • If we can resolve the above problem by creating intractable algorithms for finding simple reasoning rules that generalize, then plausibly something like ALBA could "distill" these algorithms into a competitive aligned agent making use of e.g. deep learning technology. My picture of this is vague but if this is correct, then the agent foundations approach and ALBA are quite synergistic. Paul has written a bit about the relation between ALBA and non-learning problems here.
  • I'm still somewhat optimistic about Paul's approach of "turn arbitrary capabilities into aligned capabilities" and pessimistic about the alternatives to this approach. If this approach is ultimately doomed, I think it's likely because it's far easier to find a single good AI system than to turn arbitrary unaligned AI systems into competitive aligned AI systems; there's a kind of "universal quantifier" implicit in the second approach. However, I don't see this as a good reason not to use this research approach. It seems like if it is doomed, we will likely find some kind of fundamental obstruction somewhere along the way, and I expect a crisply stated fundamental obstruction to be quite useful for knowing exactly which relaxation of the "competitive aligned AI" problem to pursue. Though this does argue for pursuing other approaches in parallel that are motivated by this particular difficulty.
  • I think intuition 14 ("pursuing human narrow values presents a much more optimistic picture of AI alignment") is quite important, and would strongly inform research I do using the agent foundations approach. I think the main reason "MIRI" is wary of this is that it seems quite vague and confusing, and maybe fundamental confusions like decision theory and ontology identification will re-emerge if we try to make it more precise. Personally, I expect that, though narrow value learning is confusing, it really ought to dodge decision theory and ontology identification. One way of testing this expectation would be for me to think about narrow value learning by creating toy models of agents that have narrow values but not proper utility functions. Unfortunately, I wouldn't be too surprised if this turns out to be super messy and hard to formalize.

Acknowledgements

Thanks to Paul, Nate, Eliezer, and Benya for a lot of conversations on this topic. Thanks to John Salvatier for helping me to think about intuitions and teaching me skills for learning intuitions from other people.

Comments
Wei Dai:

This seems a good opportunity for me to summarize my disagreements with both Paul and MIRI. In short, there are two axes along which Paul and MIRI disagree with each other, where I'm more pessimistic than either of them.

(One of Paul's latest replies to me on his AI control blog says "I have become more pessimistic after thinking it through somewhat more carefully." and "If that doesn’t look good (and it probably won’t) I will have to step back and think about the situation more broadly." I'm currently not sure how broadly Paul was going to rethink the situation or what conclusions he has since reached. What follows is meant to reflect my understanding of his positions up to those statements.)

One axis might be called "metaphilosophical paternalism" (a phrase I just invented, not sure if there's an existing one I should use), i.e., how much external support / error correction do humans need, or more generally how benign of an environment do we have to place them in, before we can expect them to eventually figure out their "actual" values (which implies correctly solving all relevant philosophical dependencies such as population ethics and philosophy of consciousness) and how hard will it be to design and provide such support / error correction.

MIRI's position seems to be that humans do need a lot of external support / error correction (see CEV) and this is a hard problem, but not so hard that it will likely turn out to be a blocking issue. Paul's position went from his 2012 version of "indirect normativity" which envisioned placing a human in a relatively benign simulated environment (although still very different from the kinds of environments where we have historical evidence of humans being able to make philosophical progress in) to his current ideas where humans live in very hostile environments, having to process potentially adversarial messages from superintelligent AIs under time pressure.

My own thinking is that we currently know very little about metaphilosophy, essentially nothing beyond that philosophy is some kind of computational / cognitive process implemented by (at least some) human brains, and there seems to be such a thing as philosophical truth or philosophical progress, but that is hard to define or even recognize. Without easy ways to check one's ideas (e.g., using controlled experiments or mathematical proofs), human cognitive processes tend to diverge rather than converge. (See political and religious beliefs, for example.) If typical of philosophical problems in general, understanding metaphilosophy well enough to implement something like CEV will likely take many decades of work even after someone discovers a viable approach (which we don't yet have). Think of how confused we still are about how expected utility maximization applies in bargaining, or what priors really are or should be, many decades after those ideas were first proposed. I don't understand Paul and MIRI's reasons for being as optimistic as they each are on this issue.

The other axis of disagreement is how feasible it would be to create aligned AI that matches or beats unaligned AI in efficiency/capability. Here Paul is only trying to match unaligned AIs using the same mainstream AI techniques, whereas MIRI is trying to beat unaligned AIs in order to prevent them from undergoing intelligence explosion. But even Paul is more optimistic than I think is warranted. (To be fair, at least some within MIRI, such as Nate, may be aiming to beat unaligned AIs not because they're particularly optimistic about the prospects of doing so, but because they're pessimistic about what would happen if we merely match them.) It seems unlikely to me that alignment to complex human values comes for free. If nothing else, aligned AIs will be more complex than unaligned AIs and such complexity is costly in design, coding, maintenance, and security. Think of the security implications of having a human controller or a complex value extrapolation process at an AI's core, compared to something simpler like a paperclip maximizer, or the continuous challenges of creating improved revisions of AI design while minimizing the risk of losing alignment to a set of complex and unknown values.

Jessica's post lists searching for fundamental obstructions to aligned AI as a motivation for Paul's research direction. I think given that efficient aligned AIs almost certainly exist as points in mindspace, it's unlikely that we can find "fundamental" reasons why we can't build them. Instead they will likely just take much more resources (including time) to build than unaligned AIs, for a host of "messy" reasons. Maybe the research can show that certain approaches to building competitive aligned AIs won't succeed, but realistically such a result can only hope to cover a tiny part of AI design space, so I don't see why that kind of result would be particularly valuable.

Please note that what I wrote here isn't meant to be an argument against doing the kind of research that Paul and MIRI are doing. It's more of an argument for simultaneously trying to find and pursue other approaches to solving the "AI risk" problem, especially ones that don't require the same preconditions in order to succeed. Otherwise, since those preconditions don't seem very likely to actually obtain, we're leaving huge amounts of potential expected value on the table if we bank on just one or even both of these approaches.

So8res:

Weighing in late here, I'll briefly note that my current stance on the difficulty of philosophical issues is (in colloquial terms) "for the love of all that is good, please don't attempt to implement CEV with your first transhuman intelligence". My strategy at this point is very much "build the minimum AI system that is capable of stabilizing the overall strategic situation, and then buy a whole lot of time, and then use that time to figure out what to do with the future." I might be more optimistic than you about how easy it will turn out to be to find a reasonable method for extrapolating human volition, but I suspect that that's a moot point either way, because regardless, thou shalt not attempt to implement CEV with humanity's very first transhuman intelligence.

Also, +1 to the overall point of "also pursue other approaches".

> MIRI’s position seems to be that humans do need a lot of external support / error correction (see CEV) and this is a hard problem, but not so hard that it will likely turn out to be a blocking issue.

Note that Eliezer is currently more optimistic about task AGI than CEV (for the first AGI built), and I think Nate is too. I'm not sure what Benya thinks.

Wei Dai:

Oh, right, I had noticed that, and then forgot and went back to my previous model of MIRI. I don't think Eliezer ever wrote down why he changed his mind about task AGI or how he is planning to use one. If the plan is something like "buy enough time to work on CEV at leisure", then possibly I have much less disagreement on "metaphilosophical paternalism" with MIRI than I thought.

> If typical of philosophical problems in general, understanding metaphilosophy well enough to implement something like CEV will likely take many decades of work even after someone discovers a viable approach (which we don’t yet have).

Consider the following strategy the AI could take:

  1. Put a bunch of humans in a secure box containing food/housing/etc
  2. Acquire as much power as possible while keeping the box intact
  3. After 100 years, ask the humans in the box what to do next

There are lots of things that are unsatisfying about the proposal (e.g. the fact that only the humans in the box survive), but I'm curious which you find least satisfying (especially unsatisfying things that are also unsatisfying about Paul's proposals). Do you think designing this AI will require solving metaphilosophical problems? Do you think this AI will be at a substantial efficiency disadvantage relative to a paperclip maximizer?

(Note that this doesn't require humans to figure out their actual values in 100 years; they can decide some questions and kick the rest to another 100 years later)

Wei Dai:
  1. If "a bunch" is something like 10000 smartest, most sane, most philosophically competent humans on Earth, then they might give you a reasonable answer in 100 years (depending on how teachable/heritable these things are, since the original 10000 won't be alive at that point). But if you exclude most of humanity then most likely they'll contribute their resources to their own AI projects so you're starting with a small percent of power, and already losing most of potential value.
  2. That box will be a very attractive target for other AIs to attack (e.g., by sending a manipulative message to the humans inside), attack is generally easier than defense, so keeping that box secure will be hard. One problem is how do you convey "secure" to your AI in a way that covers everything that might happen in 100 years, during which all kinds of unforeseeable technologies might be developed? Then there's the problem that attackers only have to succeed once whereas your AI has to successfully defend against all attacks for a subjective eternity.
  3. I think there will be strong incentives for AIs to join into coalitions and then merge into coherent unified designs (with aggregated values) because that makes them much more efficient (it gets rid of losses from redundant computations, asymmetric information, bad equilibria in general.), and also because there are likely increasing returns to scale (for example the first coalition / merged AI to find some important insight into building the next generation of AI might gain a large additional share of power at the cost of other AIs, or the strongest coalition can just fight and destroy all others and take 100% of the universe for itself). If your AI's motivational structure is not expected utility maximization of some evaluable utility function (or whatever will be compatible with the dominant merged AIs), it might soon be forced to either self-modify into that form or lose out in this kind of coalitional race. It seems that you can either A) solve all the philosophical problems involved in safely doing this kind of merging ahead of time which will take a lot of resources (or just be impossible because we don't know how all the mergers will work in detail), B) figure out metaphilosophy and have the AI solve those problems, or C) fail to do either and then the AI self-modifies badly or loses the coalitional race.

I think all of the things I find unsatisfying above have analogues in Paul's proposals, and I've commented about them on his blog. Please let me know if I can clarify anything.

> If “a bunch” is something like 10000 smartest, most sane, most philosophically competent humans on Earth, then they might give you a reasonable answer in 100 years

It seems to me like one person thinking for a day would do fine, and ten people thinking for ten days would do better, and so on. You seem to be imagining some bar for "good enough" which the people need to meet. I don't understand where that bar is though---as far as I can tell, to the extent that there is a minimal bar it is really low, just "good enough to make any progress at all."

It seems that you are much more pessimistic about the prospects of people in the box than society outside of the box, in the kind of situation we might arrange absent AI. Is that right?

Is the issue just that they aren't much better off than society outside of the box, and you think that it's not good to pay a significant cost without getting some significant improvement?

Is the issue that they need to do really well in order to adapt to a more complex universe dominated by powerful AIs?

> so keeping that box secure will be hard

Physical security of the box seems no harder than physical security of your AI's hardware. If physical security is maintained, then you can simply not relay any messages to the inside of the box.

> how do you convey “secure” to your AI in a way that covers everything that might happen in 100 years, during which all kinds of unforeseeable technologies might be developed

The point is that in order for the AI to work it needs to implement our views about "secure" / about good deliberation, not our views about arbitrary philosophical questions. So this allows us to reduce our ambitions. It may also be too hard to build a system that has an adequate understanding of "secure," but I don't think that arguments about the difficulty of metaphilosophy are going to establish that. So if you grant this, it seems like you should be willing to replace "solving philosophical problems" in your arguments with "adequately assessing physical security;" is that right?

> fail to do either and then the AI self-modifies badly or loses the coalitional race

I can imagine situations where this kind of coalitional formation destroys value unless we have sophisticated philosophical tools. I consider this a separate problem from AI control; its importance depends on the expected damage done by this shortcoming.

Right now this doesn't look like a big deal to me. That is, it looks to me like simple mechanisms will probably be good enough to capture most of the gains from coalition formation.

An example of a simple mechanism, to help indicate why I expect some simple mechanism to work well enough: if A and B have equal influence and want to compromise, then they create a powerful agent whose goal is to obtain "meaningful control" over the future and then later flip a coin to decide which of A and B gets to use that control. (With room for bargaining between A and B at some future point, using their future understanding of bargaining, prior to the coin flip. Though realistically I think bargaining in advance isn't necessary since it can probably be done acausally after the coin flip.)

As with the last case, we've now moved most of the difficulty into what we mean for either A or B to have "meaningful control" of resources. We are also now required to keep both A and B secure, but that seems relatively cheap (especially if they can be represented as code). But it doesn't look likely to me that these kinds of things are going to be serious problems that stand up to focused attempts to solve them (if we can solve other aspects of AI control, it seems very likely that we can use our solution to ensure that A or B maintains "meaningful control" over some future resources, according to an interpretation of meaningful control that is agreeable to both A and B), and I don't yet completely understand why you are so much more concerned about it.

And if acausal trade can work rather than needing to bargain in advance, then we can probably just make the coin flip now and set aside these issues altogether. I consider that more likely than not, even moreso if we are willing to do some setup to help facilitate such trade.

Overall, it seems to me like to the extent that bargaining/coalition formation seems to be a serious issue, we should deal with it now as a separate problem (or people in the future can deal with it then as a separate problem), but that it can be treated mostly independently from AI control and that it currently doesn't look like a big deal.

I agree that the difficulty of bargaining / coalition formation could necessitate the same kind of coordinated response as a failure to solve AI control, and in this sense the two problems are related (along with all other possible problems that might require a similar response). This post explains why I don't think this has a huge effect on the value of AI control work, though I agree that it can increase the value of other interventions. (And could increase their value enough that they become higher priority than AI control work.)

Wei Dai:

> I don’t understand where that bar is though—as far as I can tell, to the extent that there is a minimal bar it is really low, just “good enough to make any progress at all.”

I see two parts of the bar. One is being good enough to eventually solve all important philosophical problems. "Good enough to make any progress at all" isn't good enough, if they're just making progress on easy problems (or easy parts of hard problems). What if there are harder problems they need to solve later (and in this scenario all the other humans are dead)?

Another part is the ability to abstain from using the enormous amount of power available until they figure out how to use it safely. Suppose after 100 years, the people in the box haven't figured that out yet; what fraction of all humans would vote to go back in the box for another 100 years?

> Physical security of the box seems no harder than physical security of your AI’s hardware.

An AI can create multiple copies of itself and check them against each other. It can migrate to computing substrates that are harder to attack. It can distribute itself across space and across different kinds of hardware. It can move around in space under high acceleration to dodge attacks. It can re-design its software architecture and/or use cryptographic methods to improve detection and mitigation against attacks. A box containing humans can do none of these things.

> If physical security is maintained, then you can simply not relay any messages to the inside of the box.

Aside from the above, I had in mind that it may be hard to define "security" so that an attacker couldn't think of some loophole in the definition and use it to send a message into the box in a way that doesn't trigger a security violation.

> if A and B have equal influence and want to compromise, then they create a powerful agent whose goal is to obtain “meaningful control” over the future and then later flip a coin to decide which of A and B gets to use that control.

Suppose A has utility linear in resources and B has utility logarithmic in resources. Then moving to "flip a coin to decide which of A and B gets to use that control" makes A no worse off but B much worse off. This changes the disagreement point (what happens if they fail to reach a deal), in a way that (intuitively speaking) greatly increases A's bargaining power. B almost certainly shouldn't go for this.

A more general objection is that you're proposing one particular way that AIs might merge, and I guess proposing to hard code that into your AI as the only acceptable way to merge, and have it reject all other proposals that don't fit this form. This just seems really fragile. How do you know that if you only accept proposals of this form, that's good enough to win the coalitional race during the next 100 years, or that the class of proposals your AIs will accept doesn't leave it open to being exploited by other AIs?

> Overall, it seems to me like to the extent that bargaining/coalition formation seems to be a serious issue, we should deal with it now as a separate problem (or people in the future can deal with it then as a separate problem), but that it can be treated mostly independently from AI control and that it currently doesn't look like a big deal.

So another disagreement between us that I forgot to list in my initial comment is that I think it's unlikely we can predict all the important/time-sensitive philosophical problems that AIs will face, and solve them all ahead of time. Bargaining/coalition formation is one class of such problems, I think self-improvement is another (what does your AI do if other AIs, in order to improve their capabilities, start using new ML/algorithmic components that don't fit into your AI control scheme?), and there are probably other problems that we can't foresee right now.

> One is being good enough to eventually solve all important philosophical problems.

By "good enough to make any progress at all" I meant "towards becoming smarter while preserving their values," I don't really care about their resolution of other object-level philosophical questions. E.g. if they can take steps towards safe cognitive enhancement, if they can learn something about how to deliberate effectively...

It seems to me like for the people to get stuck you have to actually imagine there is some particular level they reach where they can't find any further way to self-improve. At that point we could ask about the total amount of risk in the deliberative process itself, etc., but my basic point is that the risk is about the same in the "people in a box" scenario as in any other scenario where they can deliberate.

> Suppose after 100 years, the people in the box haven’t figured that out yet, what fraction of all humans would vote to go back in the box for another 100 years?

I think many people would be happy to gradually expand and improve quality of life in the box. You could imagine that over the long run this box is like a small city, then a small country, etc., developing along whatever trajectory the people can envision that is optimally conducive to sorting things out in a way they would endorse.

Compared to the current situation, they may have some unrealized ability to significantly improve their quality of life, but it seems at best modest---you can do most of the obvious life improvement without compromising the integrity of the reflective process. I don't really see how other aspects of their situation are problematic.

Re security:

There is some intermediate period before you can actually run an emulation of the human, after which the measures you discuss apply just as well to the humans (which still expand the attack surface, but apparently by an extremely tiny amount since it's not much information, it doesn't have to interact with the world, etc.). So we are discussing the total excess risk during that period. I can agree that over an infinitely long future the kinds of measures you mention are relevant, but I don't yet see the case for this being a significant source of losses over the intermediate period.

(Obviously I expect our actual mechanisms to work much better than this, but given that I don't understand why you would have significant concerns about this situation, it seems like we have some more fundamental disagreements.)

> it may be hard to define “security” so that an attacker couldn’t think of some loophole in the definition

I don't think we need to give a definition. I'm arguing that we can replace "can solve philosophical problems" with "understands what it means to give the box control of resources." (Security is one aspect of giving the box control of resources, though presumably not the hardest.)

Is your claim that this concept, of letting the box control resources, is itself so challenging that your arguments about "philosophy is hard for humans" apply nearly as well to "defining meaningful control is hard for humans"? Are you referring to some other obstruction that would require us to give a precise definition?

> B almost certainly shouldn’t go for this [the coin flip]

It seems to me like the default is a coin flip. As long as there are unpredictable investments, a risk-neutral actor is free to keep making risky bets until they've either lost everything or have enough resources to win a war outright. Yes, you could prevent that by law, but if we can enforce such laws we could also subvert the formation of large coalitions. Similarly, if you have secure rights to deep space then B can guarantee itself a reasonable share of future resources, but in that case we don't even care who wins the coalitional race. So I don't yet see a natural scenario where A and B are forced to bargain but the "default" is for B to be able to secure a proportional fraction of the future.

Yes, you could propose a bargaining solution that could allow B to secure a proportional fraction of the future, but by the same token A could simply refuse to go for it.

> I guess proposing to hard code that into your AI as the only acceptable way to merge

It seems that you are concerned that our AI's decisions may be bad due to a lack of certain kinds of philosophical understanding, and in particular that it may lose a bunch of value by failing to negotiate coalitions. I am pointing out that even given our current level of philosophical understanding, there is a wide range of plausible bargaining strategies, and I don't see much of an argument yet that we would end up in a situation where we are at a significant disadvantage due to our lack of philosophical understanding. To get some leverage on that claim, I'm inclined to discuss a bunch of currently-plausible bargaining approaches and then to talk about why they may fall far short.

In the kinds of scenarios I am imagining, you would never do anything even a little bit like explicitly defining a class of bargaining solution and then accepting precisely those. Even in the "put humans in a box, acquire resources, give them meaningful control over those resources" we aren't going to give a formal definition of "box," "resources," "meaningful control." The whole point is just to lower the required ability to do philosophy to the ability required to implement that plan well enough to capture most of the value.

In order to argue against that, it seems like you want to say that in fact implementing that plan is very philosophically challenging. To that end, it's great to say something like "existing bargaining strategies aren't great, much better ones are probably possible, finding them probably requires great philosophical sophistication." But I don't think one can complain about hand-coding a mechanism for bargaining.

> I think it’s unlikely we can predict all the important/time-sensitive philosophical problems that AIs will face, and solve them all ahead of time

I understand your position on this. I agree that we can't reliably predict all important/time-sensitive philosophical problems. I don't yet see why this is a problem for my view. I feel like we are kind of going around in circles on this point; to me it feels like this is because I haven't communicated my view, but it could also be that I am missing some aspect of your view.

To me, the existence of important/time-sensitive philosophical problems seems similar to the existence of destructive technologies. (I think destructive technologies are a much larger problem and I don't find the argument for the likelihood of important/time-sensitive philosophical problems compelling. But my main point is the same in both cases and it's not clear that their relative importance matters.)

I discuss these issues in this post. I'm curious what you see as the disanalogy between these cases, or whether you think that this argument is not valid in the case of destructive technologies either, or think that this is the wrong framing for the current discussion / you are interested in answering a different question than I am / something along those lines.

I see how expecting destructive technologies / philosophical hurdles can increase the value you place on what I called "getting our house in order," as well as on developing remedies for particular destructive technologies / solving particular philosophical problems / solving metaphilosophy. I don't see how it can revise our view of the value of AI control by more than say a factor of 2.

I don't see working on metaphilosophy/philosophy as anywhere near as promising as AI control, and, viewed from this perspective, I don't think you are really trying to argue for that claim (it seems like that would have to be a quantitative argument about the expected damages from lack of timely solutions to philosophical problems and about the tractability of some approach to metaphilosophy or some particular line of philosophical inquiry).

I can imagine that AI control is less promising than other work on getting our house in order. My current suspicion is that AI control is more effective, but realistically it doesn't matter much to me because of comparative advantage considerations. If not for comparative advantage considerations I would be thinking more about the relative promise of getting our house in order, as well as other forms of capacity-building.

> It seems to me like for the people to get stuck you have to actually imagine there is some particular level they reach where they can’t find any further way to self-improve.

For philosophy, levels of ability are not comparable, because the problems to be solved are not sufficiently formulated. Approximate one-day humans (as in HCH) will formulate different values from accurate ten-year humans, not just be worse at elucidating them. So perhaps you could re-implement cognition starting from approximate one-day humans, but the values of the resulting process won't be like mine.

Approximate short-lived humans may be useful for building a task AI that lets accurate long-lived humans (ems) to work on the problem, but it must also allow them to make decisions, it can't be trusted to prevent what it considers to be a mistake, and so it can't guard the world from AI risk posed by the long-lived humans, because they are necessary for formulating values. The risk is that "getting our house in order" outlaws philosophical progress, prevents changing things based on considerations that the risk-prevention sovereign doesn't accept. So the scope of the "house" that is being kept in order must be limited, there should be people working on alignment who are not constrained.

I agree that working on philosophy seems hopeless/inefficient at this point, but that doesn't resolve the issue, it just makes it necessary to reduce the problem to setting up a very long term alignment research project (initially) performed by accurate long-lived humans, guarded from current risks, so that this project can do the work on philosophy. If this step is not in the design, very important things could be lost, things we currently don't even suspect. Messy Task AI could be part of setting up the environment for making it happen (like enforcing absence of AI or nanotech outside the research project). Your writing gives me hope that this is indeed possible. Perhaps this is sufficient to survive long enough to be able to run a principled sovereign capable of enacting values eventually computed by an alignment research project (encoded in its goal), even where the research project comes up with philosophical considerations that the designers of the sovereign didn't see (as in this comment). Perhaps this task AI can make the first long-lived accurate uploads, using approximate short-lived human predictions as initial building blocks. Even avoiding the interim sovereign altogether is potentially an option, if the task AI is good enough at protecting the alignment research project from the world, although that comes with astronomical opportunity costs.

Wei Dai:

> E.g. if they can take steps towards safe cognitive enhancement

I didn't think that the scenario assumed the bunch of humans in a box had access to enough industrial/technology base to do cognitive enhancement. It seems like we're in danger of getting bogged down in details about the "people in box" scenario, which I don't think was meant to be a realistic scenario. Maybe we should just go back to talking about your actual AI control proposals?

> So I don’t yet see a natural scenario where A and B are forced to bargain but the “default” is for B to be able to secure a proportional fraction of the future.

Here's one: Suppose A, B, C each share 1/3 of the universe. If A and B join forces they can destroy C and take C's resources, otherwise it's a stalemate. (To make the problem easier assume C can't join with anyone else.) Another one is A and B each have secure rights, but they need to join together to maximize negentropy.

> I’m curious what you see as the disanalogy between these cases

I'm not sure what analogy you're proposing between the two cases. Can you explain more?

> I don’t see how it can revise our view of the value of AI control by more than say a factor of 2.

I didn't understand this claim when I first read it on your blog. Can you be more formal/explicit about what two numbers you're comparing, that yields less than a factor of 2?

> Maybe we should just go back to talking about your actual AI control proposals

I'm happy to drop it, we seem to go around in circles on this point as well, I thought this example might be easier to agree about but I no longer think that.

> I’m not sure what analogy you’re proposing between the two cases. Can you explain more?

Certain destructive technologies will lead to bad outcomes unless we have strong coordination mechanisms (to prevent anyone from using such technologies). Certain philosophical errors might lead to bad outcomes unless we have strong coordination mechanisms (to prevent anyone from implementing philosophically unsophisticated solutions). The mechanisms that could cope with destructive technologies could also cope with philosophical problems.

> Can you be more formal/explicit about what two numbers you’re comparing, that yields less than a factor of 2?

You argue: there are likely to exist philosophical problems which must be solved before reaching a certain level of technological sophistication, or else there will be serious negative consequences.

I reply: your argument has at most a modest effect on the value of AI control work of the kind I advocate.

Your claim does suggest that my AI control work is less valuable. If there are hard philosophical problems (or destructive physical technologies), then we may be doomed unless we coordinate well, whether or not we solve AI control.

Here is a very crude quantitative model, to make it clear what I am talking about.

Let P1 be the probability of coordinating before the development of AI that would be catastrophic without AI control, and let P2 be the probability of coordinating before the next destructive technology / killer philosophical hurdle after that.

If there are no destructive technologies or philosophical hurdles, then the value of solving AI control is (1 - P1). If there are destructive technologies or philosophical hurdles, then the value of solving AI control is (P2 - P1). I am arguing that (P2 - P1) >= 0.5 * (1 - P1).

This comes down to the claim that P(get house in order after AI but before catastrophe | not gotten house in order prior to AI) is at least 1/2.
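
Spelling out the arithmetic (with P1 the probability of getting our house in order before AI, P2 the probability of doing so before the next hurdle after that, and assuming the first implies the second, so P2 >= P1):

```latex
\[
V_{\text{no later hurdles}} = 1 - P_1, \qquad V_{\text{later hurdles}} = P_2 - P_1,
\]
\[
P_2 - P_1 \ge \tfrac{1}{2}\,(1 - P_1)
\;\Longleftrightarrow\;
\frac{P_2 - P_1}{1 - P_1}
= \Pr(\text{in order before next hurdle} \mid \text{not in order before AI})
\ge \tfrac{1}{2}.
\]
```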

If we both believe this claim, then it seems like the disagreement between us about philosophy could at best account for a factor of 2 difference in our estimates of how valuable AI control research is (where value is measured in terms of "fraction of the universe"---if we measure value in terms of dollars, your argument could potentially decrease our value significantly, since it might suggest that other interventions could do more good and hence dollars are more valuable in terms of "fraction of the universe").

Realistically it would account for much less though, since we can both agree that there are likely to be destructive technologies, and so all we are really doing is adjusting the timing of the next hurdle that requires coordination.

> Suppose A, B, C each share 1/3 of the universe. If A and B join forces they can destroy C and take C’s resources, otherwise it’s a stalemate. (To make the problem easier assume C can’t join with anyone else.) Another one is A and B each have secure rights, but they need to join together to maximize negentropy.

I'm not sure it's worth arguing about this. I think that (1) these examples do only a little to increase my expectation of losses from insufficiently-sophisticated understanding of bargaining, I'm happy to argue about it if it ends up being important, but (2) it seems like the main difference is that I am looking for arguments that particular problems are costly such that it is worthwhile to work on them, while you are looking for an argument that there won't be any costly problems. (This is related to the discussion above.)

Wei Dai:

Unlike destructive technologies, philosophical hurdles are only a problem for aligned AIs. With destructive technologies, both aligned and unaligned AIs (at least the ones who don't terminally value destruction) would want to coordinate to prevent them and they only have to figure out how. But with philosophical problems, unaligned AIs instead want to exploit them to gain advantages over aligned AIs. For example if aligned AIs have to spend a lot of time to think about how to merge or self-improve safely (due to deferring to slow humans), unaligned AIs won't want to join some kind of global pact to all wait for the humans to decide, but will instead move forward amongst themselves as quickly as they can. This seems like a crucial disanalogy between destructive technologies and philosophical hurdles.

> This comes down to the claim that P(get house in order after AI but before catastrophe | not gotten house in order prior to AI) is at least 1/2.

This seems really high. In your Medium article you only argued that (paraphrasing) AI could be as helpful for improving coordination as for creating destructive technology. I don't see how you get from that to this conclusion.

Unaligned AIs don't necessarily have efficient idealized values. Waiting for (simulated) humans to decide is analogous to computing a complicated pivotal fact about unaligned AI's values. It's not clear that "naturally occurring" unaligned AIs have simpler idealized/extrapolated values than aligned AIs with upload-based value definitions. Some unaligned AIs may actually be on the losing side, recall the encrypted-values AI example.

Speaking for myself, the main issue is that we have no idea how to do step 3, how to tell a pre-existing sovereign what to do. A task AI with limited scope can be replaced, but an optimizer has to be able to understand what is being asked of it, and if it wasn't designed to be able to understand certain things, it won't be possible to direct it correctly. If in 100 years the humans come up with new principles in how the AI should make decisions (philosophical progress), it may be impossible to express these principles as directions for an existing AI that was designed without the benefit of understanding these principles.

(Of course, the humans shouldn't be physically there, or it will be too hard to say what it means to keep them safe, but making accurate uploads and packaging the 100 years as a pure computation solves this issue without any conceptual difficulty.)

A task AI with limited scope can be replaced, but an optimizer has to be able to understand what is being asked of it, and if it wasn’t designed to be able to understand certain things, it won’t be possible to direct it correctly.

It's not clear to me why "limited scope" and "can be replaced" are related. An agent with broad scope can still be optimizing something like "what the human would want me to do today" and the human could have preferences like "now that humans believe that an alternative design would have been better, gracefully step aside." (And an agent with narrow scope could be unwilling to step aside if so doing would interfere with accomplishing its narrow task.)

Being able to "gracefully step aside" (to be replaced) is an example of what I meant by "limited scope" (in time). Even if AI's scope is "broad", the crucial point is that it's not literally everything (and by default it is). In practice it shouldn't be more than a small part of the future, so that the rest can be optimized better, using new insights. (Also, to be able to ask what humans would want today, there should remain some humans who didn't get "optimized" into something else.)

One of Paul’s latest replies...

I was talking specifically about algorithms that build a model of a human and then optimize over that model in order to do useful algorithmic work (e.g. modeling human translation quality and then choosing the optimal translation).

i.e., how much external support / error correction do humans need, or more generally how benign of an environment do we have to place them in, before we can expect them to eventually figure out their “actual” values

I still don't get your position on this point, but we seem to be going around a bit in circles. Probably the most useful thing would be responding to Jessica's hypothetical about putting humanity in a box.

searching for fundamental obstructions to aligned AI

I am not just looking for an aligned AI, I am looking for a transformation from (arbitrary AI) --> (aligned AI). I think the thing-to-understand is what kind of algorithms can plausibly give you AI systems without being repurposable to being aligned (with O(1) work). I think this is the kind of problem for which you are either going to get a positive or negative answer. That is, probably there is either such a transformation, or a hard-to-align design and an argument for why we can't align it.

(This example with optimizing over human translations seems like it could well be an insurmountable obstruction, implying that my most ambitious goal is impossible.)

I don’t understand Paul and MIRI’s reasons for being as optimistic as they each are on this issue.

I believe that both of us think that what you perceive as a problem can be sidestepped (I think this is the same issue we are going in circles around).

It seems unlikely to me that alignment to complex human values comes for free.

The hope is to do a sublinear amount of additional work, not to get it for free.

It’s more of an argument for simultaneously trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed

It seems like we are roughly on the same page; but I am more optimistic about discovering either a positive or a negative answer, and so I think this approach is the highest-leveraged thing to work on and you don't.

I think that cognitive or institutional enhancement is also a contender, as is getting our house in order, even if our only goal is dealing with AI risk.

[-]Wei Dai

I still don’t get your position on this point, but we seem to be going around a bit in circles.

Yes, my comment was more targeted to other people, who I'm hoping can provide their own views on these issues. (It's kind of strange that more people haven't commented on your ideas online. I've asked to be invited to any future MIRI workshops discussing them, in case most of the discussions are happening offline.)

I am not just looking for an aligned AI, I am looking for a transformation from (arbitrary AI) –> (aligned AI). I think the thing-to-understand is what kind of algorithms can plausibly give you AI systems without being repurposable to being aligned (with O(1) work).

Can you be more explicit and formal about what you're looking for? Is it a transformation T, such that for any AI A, T(A) is an aligned AI as efficient as A, and applying T amounts to O(1) of work? (O(1) relative to what variable? The work that originally went into building A?)

If that's what you mean, then it seems obvious that T doesn't exist, but I don't know how else to interpret your statement.

That is, probably there is either such a transformation, or a hard-to-align design and an argument for why we can’t align it.

I don't understand why this disjunction is true, which might be because of my confusion above. Also, even if you found a hard-to-align design and an argument for why we can't align it, that doesn't show that aligned AIs can't be competitive with unaligned AIs (in order to convince others to coordinate, as Jessica wrote). The people who need convincing will just think there's almost certainly other ways to build a competitive aligned AI that doesn't involve transforming the hard-to-align design.

Can you be more explicit and formal about what you’re looking for?

Consider some particular research program that might yield powerful AI systems, e.g. (search for better model classes for deep learning, search for improved optimization algorithms, deploy these algorithms on increasingly large hardware). For each such research program I would like to have some general recipe that takes as input the intermediate products of that program (i.e. the hardware and infrastructure, the model class, the optimization algorithms) and uses them to produce a benign AI which is competitive with the output of the research program. The additional effort and cost required for this recipe would ideally be sublinear in the effort invested in the underlying research project (O(1) was too strong).

I suspect this is possible for some research programs and not others. I expect there are some programs for which this goal is demonstrably hopeless. I think that those research programs need to be treated with care. Moreover, if I had a demonstration that a research program was dangerous in this way, I expect that I could convince people that it needs to be treated with care.
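A rough way to state the target (the symbols are only an illustrative sketch of the above, not a definition anyone has committed to): let Effort(t) be the cumulative effort invested in the research program by time t, and Perf(t) the performance of the unaligned system its intermediate products support. The recipe R should satisfy

$$ \text{Perf}_{\text{benign}}(R, t) \;\ge\; (1 - \varepsilon)\,\text{Perf}(t) \qquad \text{with} \qquad \text{Cost}(R, t) \;=\; o(\text{Effort}(t)), $$

i.e. competitiveness at an additional cost that grows sublinearly in the effort behind the underlying program.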

The people who need convincing will just think there’s almost certainly other ways to build a competitive aligned AI that doesn’t involve transforming the hard-to-align design.

Yes, at best someone might agree that a particular research program is dangerous/problematic. That seems like enough though---hopefully they could either be convinced to pursue other research programs that aren't problematic, or would continue with the problematic research program and could then agree that other measures are needed to avert the risk.

[-]Wei Dai

If an AI causes its human controller to converge to false philosophical conclusions (especially ones relevant to their values), either directly through its own actions or indirectly by allowing adversarial messages to reach the human from other AIs, it clearly can't be considered benign. But given our current lack of metaphilosophical understanding, how do you hope to show that any particular AI (e.g., the output of a proposed transformation/recipe) won't cause that? Or is the plan to accept a lower burden of proof, namely assume that the AI is benign as long as no one can show that it does cause its human controller to converge to false philosophical conclusions?

The additional effort and cost required for this recipe would ideally be sublinear in the effort invested in the underlying research project (O(1) was too strong).

If such a recipe existed for a project, that doesn't mean it's not problematic. For example, suppose there are projects A and B, each of which had such a recipe. It's still possible that using intermediate products of both A and B, one could build a more efficient unaligned AI but not a competitive aligned AI (at sublinear additional cost). Similarly, if recipes didn't exist for projects A and B individually, it might still exist for A+B. It seems like to make a meaningful statement you have to treat the entire world as one big research program. Do you agree?

If such a recipe existed for a project, that doesn’t mean it’s not problematic. For example, suppose there are projects A and B, each of which had such a recipe. It’s still possible that using intermediate products of both A and B, one could build a more efficient unaligned AI but not a competitive aligned AI

My hope is to get technique A working, then get technique B working, and then get A+B working, and so on, prioritizing based on empirical guesses about what combinations will end up being deployed in practice (and hoping to develop general understanding that can be applied across many programs and combinations). I expect that in many cases, if you can handle A and B you can handle A+B, though some interactions will certainly introduce new problems. This program doesn't have much chance of success without new abstractions.

indirectly by allowing adversarial messages to reach the human from other AIs, it clearly can’t be considered benign

This is definitely benign on my accounting. There is a further question of how well you do in conflict. A window is benign but won't protect you from inputs that will drive you crazy. The hope is that if you have an AI that is benign + powerful then you may be OK.

directly through its own actions

If the agent is trying to implement deliberation in accordance with the user's preferences about deliberation, then I want to call that benign. There is a further question of whether we mess up deliberation, which could happen with or without AI. We would like to set things up in such a way that we aren't forced to deliberate earlier than we would otherwise want to. (And this is included in the user's preferences about deliberation, i.e. a benign AI will be trying to secure for the user the option of deliberating later, if the user believes that deliberating later is better than deliberating in concert with the AI now.)

Malign just means "actively optimizing for something bad"; the hope is to avoid that, but this doesn't rule out other kinds of problems (e.g. causing deliberation to go badly due to insufficient competence, blowing up the world due to insufficient competence, etc.).

Overall, my current best guess is that this disagreement is better to pursue after my research program is further along, once we know things like whether "benign" makes sense as an abstraction, once I have considered some cases where benign agents necessarily seem to be less efficient, and so on.

I am still interested in arguments that might (a) convince me not to work on this program, e.g. because I should be working on alternative social solutions; (b) convince others to work on this program, e.g. because they currently don't see how it could succeed but might work on it if they did; or (c) clarify the key obstructions for this research program.

I really agree with #2 (and I think with #1, as well, but I'm not as sure I understand your point there).

I've been trying to convince people that there will be strong trade-offs between safety and performance, and have been surprised that this doesn't seem obvious to most... but I haven't really considered that "efficient aligned AIs almost certainly exist as points in mindspace". In fact I'm not sure I agree 100% (basically because of "Moloch": http://slatestarcodex.com/2014/07/30/meditations-on-moloch/).

I think "trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed" remains perhaps the most important thing to do; do you have anything in particular in mind? Personally, I tend to think that we ought to address the coordination problem head-on and attempt a solution before AGI really "takes off".

I’ve been trying to convince people that there will be strong trade-offs between safety and performance

What do you see as the best arguments for this claim? I haven't seen much public argument for it and am definitely interested in seeing more. I definitely grant that it's prima facie plausible (as is the alternative).

Some caveats:

It's obvious there are trade-offs between safety and performance in the usual sense of "safety." But we are interested in a special kind of failure, where a failed system ends up controlling a significant share of the entire universe's resources (rather than e.g. causing an explosion), and it's less obvious that preventing such failures necessarily requires a significant cost.

It's also obvious that there is an additional cost to be paid in order to solve control, e.g. consider the fact that we are currently spending time on it. But the question is how much additional work needs to be done. Does building aligned systems require 1000% more work? 10%? 0.1%? I don't see why it should be obvious that this number is on the order of 100% rather than 1%.

Similarly for performance costs. I'm willing to grant that an aligned system will be more expensive to run. But is that cost an extra 1000% or an extra 0.1%? Both seem quite plausible. From a theoretical perspective, the question is whether the required overhead is linear or sublinear.

I haven't seen strong arguments for the "linear overhead" side, and my current guess is that the answer is sublinear. But again, both positions seem quite plausible.

(There are currently a few major obstructions to my approach that could plausibly give a tight theoretical argument for linear overhead, such as the translation example in the discussion with Wei Dai. In the past such obstructions have ended up seeming surmountable, but I think that it is totally plausible that eventually one won't. And at that point I hope to be able to make clean statements about exactly what kind of thing we can't hope to do efficiently+safely / exactly what kinds of additional assumptions we would have to make / what the key obstructions are).
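To make "linear vs. sublinear overhead" concrete (again, an illustrative formalization rather than Paul's own notation): write $C_U(n)$ for the total cost of the best unaligned system at capability level $n$, and $C_A(n)$ for the cheapest comparably capable aligned system. "Linear overhead" is the claim that $C_A(n) \ge (1+c)\,C_U(n)$ for some fixed $c > 0$ (an extra 1000% is $c = 10$; an extra 0.1% is $c = 0.001$), while "sublinear overhead" is $C_A(n) = (1 + o(1))\,C_U(n)$, with the relative overhead shrinking as systems scale.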

Personally, I tend to think that we ought to address the coordination problem head-on

I think this is a good idea and a good project, which I would really like to see more people working on. In the past I may have seemed more dismissive and if so I apologize for being misguided. I've spent a little bit of time thinking about it recently and my feeling is that there is a lot of productive and promising work to do.

My current guess is that AI control is the more valuable thing for me personally to do, though I could imagine being convinced out of this.

I feel that AI control is valuable given that (a) it has a reasonable chance of succeeding even if we can't solve these coordination problems, and (b) convincing evidence that the problem is hard would be a useful input into getting the AI community to coordinate.

If you managed to get AI researchers to effectively coordinate around conditionally restricting access to AI (if it proved to be dangerous), then that would seriously undermine argument (b). I believe that a sufficiently persuasive/charismatic/accomplished person could probably do this today.

If I ended up becoming convinced that AI control was impossible this would undermine argument (a) (though hopefully that impossibility argument could itself be used to satisfy desideratum (b)).

To be fair, at least some within MIRI, such as Nate, may be aiming to beat unaligned AIs not because they’re particularly optimistic about the prospects of doing so, but because they’re pessimistic about what would happen if we merely match them.

My model of Nate thinks the path to victory goes through the aligned AI project gaining a substantial first mover advantage (through fast local takeoff, more principled algorithms, and/or better coordination). Though he's also quite concerned about extremely large efficiency disadvantages of aligned AI vs unaligned AI (e.g. he's pessimistic about act-based agents helping much because they might require the AI to be good at predicting humans doing complex tasks such as research).

Jessica’s post lists searching for fundamental obstructions to aligned AI as a motivation for Paul’s research direction. I think given that efficient aligned AIs almost certainly exist as points in mindspace, it’s unlikely that we can find “fundamental” reasons why we can’t build them. Instead they will likely just take much more resources (including time) to build than unaligned AIs, for a host of “messy” reasons.

In this case I expect that in <10 years we get something like: "we tried making aligned versions of a bunch of algorithms, but the aligned versions are always less powerful because they left out some source of power the unaligned versions had. We iterated the process a few times (studying the additional sources of power and making aligned versions of them), and this continued to be the case. We have good reasons to believe that there isn't a sensible stopping point to this process." This seems pretty close to a fundamental obstruction and it seems like it would be similarly useful, especially if the "good reasons to believe there isn't a sensible stopping point to this process" tell us something new about which relaxations are promising.

I don't see this as being the case. As Vadim pointed out, we don't even know what we mean by "aligned versions" of algos, ATM. So we wouldn't know if we're succeeding or failing (until it's too late and we have a treacherous turn).

It looks to me like Wei Dai shares my views on "safety-performance trade-offs" (grep it here: http://graphitepublications.com/the-beginning-of-the-end-or-the-end-of-beginning-what-happens-when-ai-takes-over/).

I'd paraphrase what he's said as:

"Orthogonality implies that alignment shouldn't cost performance, but says nothing about the costs of 'value loading' (i.e. teaching an AI human values and verifying its value learning procedure and/or the values it has learned). Furthermore, value loading will probably be costly, because we don't know how to do it, competitive dynamics make the opportunity cost of working on it large, and we don't even have clear criteria for success."

Which I emphatically agree with.

As Vadim pointed out, we don’t even know what we mean by “aligned versions” of algos, ATM. So we wouldn’t know if we’re succeeding or failing (until it’s too late and we have a treacherous turn).

Even beyond Jessica's point (that failure to improve our understanding would constitute an observable failure), I don't completely buy this.

We are talking about AI safety because there are reasons to think that AI systems will cause a historically unprecedented kind of problem. If we could design systems for which we had no reason to expect them to cause such problems, then we can rest easy.

I don't think there is some kind of magical and unassailable reason to be suspicious of powerful AI systems, there are just a bunch of particular reasons to be concerned.

Similarly, there is no magical reason to expect a treacherous turn---this is one of the kinds of unusual failures which we have reason to be concerned about. If we built a system for which we had no reason to be concerned, then we shouldn't be concerned.

I think the core of our differences is that I see minimally constrained, opaque, utility-maximizing agents with good models of the world and access to rich interfaces (sensors and actuators) as extremely likely to be substantially more powerful than what we will be able to build if we start degrading any of these properties.

These properties also seem sufficient for a treacherous turn (in an unaligned AI).

I see minimally constrained, opaque, utility-maximizing agents with good models of the world and access to rich interfaces (sensors and actuators) as extremely likely to be substantially more powerful than what we will be able to build if we start degrading any of these properties.

The only point on which there is plausible disagreement is "utility-maximizing agents." On a narrow reading of "utility-maximizing agents" it is not clear why it would be important to getting more powerful performance.

On a broad reading of "utility-maximizing agents" I agree that powerful systems are utility-maximizing. But if we take a broad reading of this property, I don't agree with the claim that we will be unable to reliably tell that such agents aren't dangerous without theoretical progress.

In particular, there is an argument of the form "the prospect of a treacherous turn makes any informal analysis unreliable." I agree that the prospect of a treacherous turn makes some kinds of informal analysis unreliable. But I think it is completely wrong that it makes all informal analysis unreliable; I think that appropriate informal analysis can be sufficient to rule out the prospect of a treacherous turn. (Most likely an analysis that keeps track of what is being optimized, and rules out the prospect that an indicator was competently optimized to manipulate our understanding of the current situation.)

Paul, I'm not sure I understand what you're saying here. Can you imagine a system "for which we had no reason to expect it to cause such problems" without an underlying mathematical theory that shows why this system is safe?

The reason AI systems will cause a historically unprecedented kind of problem is that AI systems can outsmart humans and thus create situations that are outside our control, even when we don't a priori see the precise mechanism by which we will lose control. In order for such a system to be safe, we need to know that it will not attempt anything detrimental to us, and we need to know this as an abstraction, i.e. without knowing in detail what the system will do (because the system is superintelligent, so by definition we cannot guess its actions).

Doesn't it seem improbable to you that we will have a way of having such knowledge by some other means than the accuracy of mathematical thought?

That is, we can have a situation like "AI running in homomorphic encryption with a quantum-generated key that is somewhere far from the AI's computer" where it's reasonable to claim that the AI is safe as long as it stays encrypted (even though there is still some risk from being wrong about cryptographic conjectures or the AI exploiting some surprising sort of unknown physics), without using a theory of intelligence at all (beyond the fact that intelligence is a special case of computation). But it seems unlikely that we can have something like this while simultaneously having the AI powerful enough to protect us against other AIs that are malicious.

Can you imagine a system “for which we had no reason to expect it to cause such problems” without an underlying mathematical theory that shows why this system is safe?

Yes. For example, suppose we built a system whose behavior was only expected to be intelligent to the extent that it imitated intelligent human behavior---for which there is no other reason to believe that it is intelligent. Depending on the human being imitated, such a system could end up seeming unproblematic even without any new theoretical understanding.

We don't yet see any way to build such a system, much less to do so in a way that could be competitive with the best RL system that could be designed at a given level of technology. But I can certainly imagine it.

(Obviously I think there is a much larger class of systems that might be non-problematic, though it may depend on what we mean by "underlying mathematical theory.")

AI systems can outsmart humans and thus create situations that are outside our control, even when we don’t a priori see the precise mechanism by which we will lose control

This doesn't seem sufficient for trouble. Trouble only occurs when those systems are effectively optimizing for some inhuman goals, including e.g. acquiring and protecting resources.

That is a very special thing for a system to do, above and beyond being able to accomplish tasks that apparently require intelligence. Currently we don't have any way to accomplish the goals of AI that don't risk this failure mode, but it's not obvious that it is necessary.

Can you imagine a system “for which we had no reason to expect it to cause such problems” without an underlying mathematical theory that shows why this system is safe?

...suppose we built a system whose behavior was only expected to be intelligent to the extent that it imitated intelligent human behavior—for which there is no other reason to believe that it is intelligent.

This doesn't seem to be a valid example: your system is not superintelligent, it is "merely" human. That is, I can imagine solving AI risk by building whole brain emulations with enormous speed-up and using them to acquire absolute power. However:

  • I think this is not what is usually meant by "solving AI alignment."

  • The more you use heuristic learning algorithms instead of "classical" brain emulation the more I would be worried your algorithm does something subtly wrong in a way that distorts values, although that would also invalidate the condition that "there is no other reason to believe that it is intelligent."

  • There is a high-risk zone here where someone untrustworthy can gain this technology and use it to unwittingly create unfriendly AI.

AI systems can outsmart humans and thus create situations that are outside our control, even when we don’t a priori see the precise mechanism by which we will lose control

This doesn’t seem sufficient for trouble. Trouble only occurs when those systems are effectively optimizing for some inhuman goals, including e.g. acquiring and protecting resources.

Well, any AI is effectively optimizing for some goal by definition. How do you know this goal is "human"? In particular, if your AI is supposed to defend us from other AIs, it is very much in the business of acquiring and protecting resources.

As Vadim pointed out, we don’t even know what we mean by “aligned versions” of algos, ATM. So we wouldn’t know if we’re succeeding or failing

If we fail to make the intuition about aligned versions of algorithms more crisp than it currently is, then it'll be pretty clear that we failed. It seems reasonable to be skeptical that we can make our intuitions about "aligned versions of algorithms" crisp and then go on to design competitive and provably aligned versions of all AI algorithms in common use. But it does seem like we will know if we succeed at this task, and even before then we'll have indications of progress such as success/failure at formalizing and solving scalable AI control in successively more complex toy environments. (It seems like I have intuitions about what would constitute progress that are hard to convey over text, so I would not be surprised if you aren't convinced that it's possible to measure progress.)

“Orthogonality implies that alignment shouldn’t cost performance, but says nothing about the costs of ‘value loading’ (i.e. teaching an AI human values and verifying its value learning procedure and/or the values it has learned). Furthermore, value loading will probably be costly, because we don’t know how to do it, competitive dynamics make the opportunity cost of working on it large, and we don’t even have clear criteria for success.”

It seems like "value loading is very hard/costly" has to imply that the proposal in this comment thread is going to be very hard/costly, e.g. because one of Wei Dai's objections to it proves fatal. But it seems like arguments of the form "human values are complex and hard to formalize" or "humans don't know what we value" are insufficient to establish this; Wei Dai's objections in the thread are mostly not about value learning. (sorry if you aren't arguing "value loading is hard because human values are complex and hard to formalize" and I'm misinterpreting you)

I feel that there is a false dichotomy going on here. In order to say we "solved" AGI alignment, we must have some mathematical theory that defines "AGI", defines "aligned AGI", and gives a proof that some specific algorithm is an "aligned AGI." So, it's not that important whether the algorithm is "messy" or "principled" (actually I don't think it's a meaningful distinction); it's important what we can prove about the algorithm. We might relax the requirement of strict "proof" and be satisfied with a well-defined conjecture that has lots of backing evidence (like ), but it seems to me that we wouldn't want to give up on at least having a well-defined conjecture (unless under conditions of extreme despair, which is not what I would call a successful solution to AGI alignment).

So, we can still argue about which subproblems have the best chance of leading us to such a mathematical theory, but it feels like there is a fuzzy boundary there rather than a sharp division into two or more irreconcilable approaches.

I agree that the original question (messy vs principled) seems like a false dichotomy at this point. It's not obvious where the actual disagreement is.

My current guess is that the main disagreement is something like: on the path to victory, did we take generic AGI algorithms (e.g. deep learning technology + algorithms layered on top of it) and figure out how to make aligned versions of them, or did we make our own algorithms? Either way we end up with an argument for why the thing we have at the end is aligned. (This is just my current guess, though, and I'm not sure if it's the most important disagreement.)

Points 5-9 seem to basically be saying: "We should work on understanding principles of intelligence so that we can make sure that AIs are thinking the same way as humans do; currently we lack this level of understanding".

I don't really understand point 10, especially this part:

"They would most likely generalize in an unaligned way, since the reasoning rules would likely be contained in some sub-agent (e.g. consider how Earth interpreted as an “agent” only got to the moon by going through reasoning rules implemented by humans, who have random-ish values; Paul’s post on the universal prior also demonstrates this)."

“We should work on understanding principles of intelligence so that we can make sure that AIs are thinking the same way as humans do; currently we lack this level of understanding”

Roughly. I think the minimax algorithm would qualify as "something that thinks the same way an idealized human would", where "idealized" is doing substantial work (certainly, humans don't actually play chess using minimax).
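For reference, here is a minimal sketch of the minimax procedure for a two-player zero-sum game; the `game` interface is a hypothetical stand-in (any concrete game supplying these methods would do), and the sketch is only meant to show what "the minimax algorithm" refers to here.

```python
def minimax(state, game, maximizing):
    """Value of `state` under optimal play by both sides, from the maximizer's view."""
    if game.is_terminal(state):
        return game.utility(state)
    values = [
        minimax(game.result(state, move), game, not maximizing)
        for move in game.legal_moves(state)
    ]
    return max(values) if maximizing else min(values)

def best_move(state, game):
    """Choose the move leading to the position with the highest minimax value."""
    return max(
        game.legal_moves(state),
        key=lambda move: minimax(game.result(state, move), game, maximizing=False),
    )
```

The procedure is simple and fully transparent, which is part of why "idealized" is doing so much work: actual humans play chess quite differently.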

I don’t really understand point 10, especially this part:

Consider the following procedure for building an AI:

  1. Collect a collection of AI tasks that we think are AGI-complete (e.g. a bunch of games and ML tasks)
  2. Search for a short program that takes lots of data from the Internet as input and produces a policy that does well on lots of these AI tasks
  3. Run this program on substantially different tasks related to the real world

This seems very likely to result in an unaligned AI. Consider the following program:

  1. Simulate some stochastic physics, except that there's some I/O terminal somewhere (as described in this post)
  2. If the I/O terminal gets used, give the I/O terminal the Internet data as input and take the policy as output
  3. If it doesn't get used, run the simulation again until it does

This program is pretty short, and with some non-negligible probability (say, more than 1 in 1 billion), it's going to produce a policy that is an unaligned AGI. This is because in enough runs of physics there will be civilizations; if the I/O terminal is accessed it is probably by some civilization; and the civilization will probably have values that are not aligned with human values, so they will do a treacherous turn (if they have enough information to know how the I/O terminal is being interpreted, which they do if there's a lot of Internet data).
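A schematic rendering of the construction, with every function a stand-in (none of these steps is implementable in practice; the sketch only makes the structure of the argument explicit):

```python
def enumerate_programs_by_length():
    """Yield candidate programs in order of description length (stand-in)."""
    raise NotImplementedError

def evaluate_on_tasks(policy, benchmark_tasks):
    """Average score of `policy` on the benchmark tasks (stand-in)."""
    raise NotImplementedError

def find_short_agi(internet_data, benchmark_tasks, threshold):
    """Steps 1-3 above: return the policy of the first short program that does well."""
    for program in enumerate_programs_by_length():
        policy = program(internet_data)
        if evaluate_on_tasks(policy, benchmark_tasks) >= threshold:
            return policy  # step 3 then deploys this policy on real-world tasks

# The worry: among the shortest programs passing this filter is one that
# simulates stochastic physics containing an I/O terminal and routes
# `internet_data` to whatever civilization eventually uses that terminal.
# Nothing in the selection criterion distinguishes that program from an
# aligned one, so the returned policy may be controlled by simulated agents
# with arbitrary values.
```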

Thanks, I think I understand that part of the argument now. But I don't understand how it relates to:

"10. We should expect simple reasoning rules to correctly generalize even for non-learning problems. "

^Is that supposed to be a good thing or a bad thing? "Should expect" as in we want to find rules that do this, or as in rules will probably do this?

It's just meant to be a prediction (simple rules will probably generalize).

Thanks for writing this, Jessica -- I expect to find it helpful when I read it more carefully!

Interesting, but could this be chasing the solution by insisting on specific algorithm(s) to solve it? Isn't this tending to be only mathematical about the Mind?

I mean, I believe the Brain/Mind is based upon principles, but why do they have to be mathematical or logical?