Note: informally, the point of this paper is to argue against the instinctive "if the AI were so smart, it would figure out the right morality and everything will be fine." It is targeted mainly at philosophers, not at AI programmers. The paper succeeds if it forces proponents of that position to put forwards positive arguments, rather than just assuming it as the default position. This post is presented as an academic paper, and will hopefully be published, so any comments and advice are welcome, including stylistic ones! Also let me know if I've forgotten you in the acknowledgements.

Abstract: In his paper “The Superintelligent Will”, Nick Bostrom formalised the Orthogonality thesis: the idea that the final goals and intelligence levels of agents are independent of each other. This paper presents arguments for a (slightly narrower) version of the thesis, proceeding through three steps. First it shows that superintelligent agents with essentially arbitrary goals can exist. Then it argues that if humans are capable of building human-level artificial intelligences, we can build them with any goal. Finally it shows that the same result holds for any superintelligent agent we could directly or indirectly build. This result is relevant for arguments about the potential motivations of future agents.


1 The Orthogonality thesis

The Orthogonality thesis, due to Nick Bostrom (Bostrom, 2011), states that:

  • Intelligence and final goals are orthogonal axes along which possible agents can freely vary: more or less any level of intelligence could in principle be combined with more or less any final goal.

It is analogous to Hume’s thesis about the independence of reason and morality (Hume, 1739), but applied more narrowly, using the normatively thinner concepts ‘intelligence’ and ‘final goals’ rather than ‘reason’ and ‘morality’.

But even ‘intelligence’, as generally used, has too many connotations. A better term would be efficiency, or instrumental rationality, or the ability to effectively solve problems given limited knowledge and resources (Wang, 2011). Nevertheless, we will be sticking with terminology such as ‘intelligent agent’, ‘artificial intelligence’ or ‘superintelligence’, as they are well established, but using them synonymously with ‘efficient agent’, artificial efficiency’ and ‘superefficient algorithm’. The relevant criteria is whether the agent can effectively achieve its goals in general situations, not whether its inner process matches up with a particular definition of what intelligence is.

Thus an artificial intelligence (AI) is an artificial algorithm, deterministic or probabilistic, implemented on some device, that demonstrates an ability to achieve goals in varied and general situations[1]. We don’t assume that it need be a computer program, or a well laid-out algorithm with clear loops and structures – artificial neural networks or evolved genetic algorithms certainly qualify.

A human level AI is defined to be an AI that can successfully accomplish any task at least as well as an average human would (to avoid worrying about robot bodies and such-like, we may restrict the list of tasks to those accomplishable over the internet). Thus we would expect the AI to hold conversations about Paris Hilton’s sex life, to compose ironic limericks, to shop for the best deal on Halloween costumes and to debate the proper role of religion in politics, at least as well as an average human would.

A superhuman AI is similarly defined as an AI that would exceed the ability of the best human in all (or almost all) tasks. It would do the best research, write the most successful novels, run companies and motivate employees better than anyone else. In areas where there may not be clear scales (what’s the world’s best artwork?) we would expect a majority of the human population to agree the AI’s work is among the very best.

Nick Bostrom’s paper argued that the Orthogonality thesis does not depend on the Humean theory of motivation. This paper will directly present arguments in its favour. We will assume throughout that human level AIs (or at least human comparable AIs) are possible (if not, the thesis is void of useful content). We will also take the materialistic position that humans themselves can be viewed as non-deterministic algorithms[2]: this is not vital to the paper, but is useful for comparison of goals between various types of agents. We will do the same with entities such as committees of humans, institutions or corporations, if these can be considered to be acting in an agent-like way.

1.1 Qualifying the Orthogonality thesis

The Orthogonality thesis, taken literally, is false. Some motivations are mathematically incompatible with changes in intelligence (“I want to prove the Gödel statement for the being I would be if I were more intelligent”). Some goals specifically refer to the intelligence of the agent, directly (“I want to be an idiot!”) or indirectly (“I want to impress people who want me to be an idiot!”). Though we could make a case that an agent wanting to be an idiot could initially be of any intelligence level, it won’t stay there long, and it’s hard to see how an agent with that goal could have become intelligent in the first place. So we will exclude from consideration those goals that intrinsically refer to the intelligence level of the agent.

We will also exclude goals that are so complex or hard to describe that the complexity of the goal becomes crippling for the agent. If the agent’s goal takes five planets worth of material to describe, or if it takes the agent five years each time it checks its goal, it’s obvious that that agent can’t function as an intelligent being on any reasonable scale.

Further we will not try to show that intelligence and final goals can vary freely, in any dynamical sense (it could be quite hard to define this varying). Instead we will look at the thesis as talking about possible states: that there exist agents of all levels of intelligence with any given goals. Since it’s always possible to make an agent stupider or less efficient, what we are really claiming is that there exist high-intelligence agents with any given goal. Thus the restricted Orthogonality thesis that we will be discussing is:

  • High-intelligence agents can exist having more or less any final goals (as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence).

2 Orthogonality for theoretic agents

If we were to step back for a moment and consider, in our mind’s eyes, the space of every possible algorithm, peering into their goal systems and teasing out some measure of their relative intelligences, would we expect the Orthogonality thesis to hold? Since we are not worrying about practicality or constructability, all that we would require is that for any given goal system, there exists a theoretically implementable algorithm of extremely high intelligence.

At this level of abstraction, we can consider any goal to be equivalent with maximising a utility function. It is generally not that hard to translate given goals into utilities (many deontological systems are equivalent with maximising the expected utility of a function that gives 1 if the agent always makes the correct decision and 0 otherwise), and any agent making a finite number of decisions can always be seen as maximising a certain utility function.

For utility function maximisers, the AIXI is the theoretically best agent there is, more successful at reaching its goals (up to a finite constant) than any other agent (Hutter, 2005). AIXI itself is incomputable, but there are computable variants such as AIXItl or Gödel machines (Schmidhuber, 2007) that accomplish comparable levels of efficiency. These methods work for whatever utility function is plugged into them. Thus in the extreme theoretical case, the Orthogonality thesis seems trivially true.

There is only one problem with these agents: they require incredibly large amounts of computing resources to work. Let us step down from the theoretical pinnacle and require that these agents could actually exist in our world (still not requiring that we be able or likely to build it).

An interesting thought experiment occurs here. We could imagine an AIXI-like super-agent, with all its resources, that is tasked to design and train an AI that could exist in our world, and that would accomplish the super-agent’s goals. Using its own vast intelligence, the super-agent would therefore design a constrained agent maximally effective at accomplishing those goals in our world. Then this agent would be the high-intelligence real-world agent we are looking for. It doesn’t matter that this is a thought experiment – if the super-agent can succeed in the thought experiment, then the trained AI can exist in our world.

This argument generalises to other ways of producing the AI. Thus to deny the Orthogonality thesis is to assert that there is a goal system G, such that, among other things:

  1. There cannot exist any efficient real-world algorithm with goal G.
  2. If a being with arbitrarily high resources, intelligence, time and goal G, were to try design an efficient real-world algorithm with the same goal, it must fail.
  3. If a human society were highly motivated to design an efficient real-world algorithm with goal G, and were given a million years to do so along with huge amounts of resources, training and knowledge about AI, it must fail.
  4. If a high-resource human society were highly motivated to achieve the goals of G, then it could not do so (here the human society is seen as the algorithm).
  5. Same as above, for any hypothetical alien societies.
  6. There cannot exist any pattern of reinforcement learning that would train a highly efficient real-world intelligence to follow the goal G.
  7. There cannot exist any evolutionary or environmental pressures that would evolving highly efficient real world intelligences to follow goal G.

All of these seem extraordinarily strong claims to make! The last claims all derive from the first, and merely serve to illustrate how strong the first claim actually is. Thus until such time as someone comes up with such a G and strong arguments for why it must fulfil these conditions, we can assume the Orthogonality statement established in the theoretical case.


3 Orthogonality for human-level AIs

Of course, even if efficient agents could exist for all these goals, that doesn’t mean that we could ever build them, even if we could build AIs. In this section, we’ll look at the ground for assuming the Orthogonality thesis holds for human-level agents. Since intelligence isn’t varying much, the thesis becomes simply:

  • If we could construct human-level AIs at all, we could construct human-level AIs with more or less any final goals (as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence).

So, is this true? The arguments in this section are generally independent of each other, and can be summarised as:

  1. Some possible AI designs have orthogonality built right into them.
  2. AI goals can reach the span of human goals, which is larger than it seems.
  3. Algorithms can be combined to generate an AI with any easily checkable goal.
  4. Various algorithmic modifications can be used to further expand the space of possible goals, if needed.

3.1 Utility functions

The utility function picture of a rational agent maps perfectly onto the Orthogonality thesis: here have the goal structure, the utility function, packaged neatly and separately from the intelligence module (whatever part of the machine calculates which actions maximise expected utility). Demonstrating the Orthogonality thesis is as simple as saying that the utility function can be replaced with another. However, many putative agent designs are not utility function based, such as neural networks, genetic algorithms, or humans. Nor do we have the extreme calculating ability that we had in the purely theoretic case to transform any goals into utility functions. So from now on we will consider that our agents are not expected utility maximisers with clear and separate utility functions.

3.2 The span of human motivations

It seems a reasonable assumption that if there exists a human being with particular goals, then we can construct a human-level AI with similar goals. This is immediately the case if the AI was a whole brain emulation/upload (Sandberg & Bostrom, 2008), a digital copy of a specific human mind. Even for more general agents, such as evolved agents, this remains a reasonable thesis. For a start, we know that real-world evolution has produced us, so constructing human-like agents that way is certainly possible. Human minds remain our only real model of general intelligence, and this strongly direct and informs our AI designs, which are likely to be as human-similar as we can make them. Similarly, human goals are the easiest goals for us to understand, hence the easiest to try and implement in AI. Hence it seems likely that we could implement most human goals in the first generation of human-level AIs.

So how wide is the space of human motivations[3]? Our race spans foot-fetishists, religious saints, serial killers, instinctive accountants, role-players, self-cannibals, firefighters and conceptual artists. The autistic, those with exceptional social skills, the obsessive compulsive and some with split-brains. Beings of great empathy and the many who used to enjoy torture and executions as public spectacles[4]. It is evident that the space of possible human motivations is vast[5]. For any desire, any particular goal, no matter how niche[6], pathological, bizarre or extreme, as long as there is a single human who ever had it, we could build and run an AI with the same goal.

But with AIs we can go even further. We could take any of these goals as a starting point, make them malleable (as goals are in humans), and push them further out. We could provide the AIs with specific reinforcements to push their goals in extreme directions (reward the saint for ever-more saintly behaviour). If the agents are fast enough, we could run whole societies of them with huge varieties of evolutionary or social pressures, to further explore the goal-space.

We may also be able to do surgery directly on their goals, to introduce more yet variety. For example, we could take a dedicated utilitarian charity worker obsessed with saving lives in poorer countries (but who doesn’t interact, or want to interact, directly with those saved), and replace ‘saving lives’ with ‘maximising paperclips’ or any similar abstract goal. This is more speculative, of course – but there are other ways of getting similar results.

3.3 Interim goals as terminal goals

If someone were to hold a gun to your head, they could make you do almost anything. Certainly there are people who, with a gun at their head, would be willing to do almost anything. A distinction is generally made between interim goals and terminal goals, with the former being seen as simply paths to the latter, and interchangeable with other plausible paths. The gun to your head disrupts the balance: your terminal goal is simply not to get shot, while your interim goals become what the gun holder wants them to be, and you put a great amount of effort into accomplishing the minute details of these interim goals. Note that the gun has not changed your level of intelligence or ability.

This is relevant because interim goals seem to be far more varied in humans than terminal goals. One can have interim goals of filling papers, solving equations, walking dogs, making money, pushing buttons in various sequences, opening doors, enhancing shareholder value, assembling cars, bombing villages or putting sharks into tanks. Or simply doing whatever the guy with gun at our head orders us to do. If we could accept human interim goals as AI terminal goals, we would extend the space of goals quite dramatically.

To do we would want to put the threatened agent, and the gun wielder, together into the same AI. Algorithmically there is nothing extraordinary about this: certain subroutines have certain behaviours depending on the outputs of other subroutines. The ‘gun wielder’ need not be particularly intelligent: it simply needs to be able to establish whether its goals are being met. If for instance those goals are given by a utility function then all that is required in an automated system that measure progress toward increasing utility and punishes (or erases) the rest of the AI if not. The ‘rest of AI’ is just required to be a human-level AI which would be susceptible to this kind of pressure. Note that we do not require that it even be close to human in any way, simply that it place a highest value on self-preservation (or on some similar small goal that the ‘gun wielder’ would have power over).

For humans, another similar model is that of a job in a corporation or bureaucracy: in order to achieve the money required for their terminal goals, some human are willing to perform extreme tasks (organising the logistics of genocides, weapon design, writing long detailed press releases they don’t agree with at all). Again, if the corporation-employee relationship can be captured in a single algorithm, this would generate an intelligent AI whose goal is anything measurable by the ‘corporation’. The ‘money’ could simply be an internal reward channel, perfectly aligning the incentives.

If the subagent is anything like a human, they would quickly integrate the other goals into their own motivation[7], removing the need for the gun wielder/corporation part of the algorithm.

3.4 Noise, anti-agents and goal combination

There are further ways of extending the space of goals we could implement in human-level AIs. One simple way is simply to introduce noise: flip a few bits and subroutines, add bugs and get a new agent. Of course, this is likely to cause the agent’s intelligence to decrease somewhat, but we have generated new goals. Then, if appropriate, we could use evolution or other improvements to raise the agent’s intelligence again; this will likely undo some, but not all of effect of the noise. Or we could use some of the tricks above to make a smarter agent implement the goals of the noise-modified agent.

A more extreme example would be to create an anti-agent: an agent whose single goal is to stymie the plans and goals of single given agent. This already happens with vengeful humans, and we would just need to dial it up: have an anti-agent that would do all it can to counter the goals of a given agent, even if that agent doesn’t exist (“I don’t care that you’re dead, I’m still going to despoil your country, because that’s what you’d wanted me to not do”). This further extends the space of possible goals.

Different agents with different goals can also be combined into a single algorithm. With some algorithmic method for the AIs to negotiate their combined objective and balance the relative importance of their goals, this procedure would construct a single AI with a combined goal system. There would likely be no drop in intelligence/efficiency: committees of two can work very well towards their common goals, especially if there is some automatic penalty for disagreements.

3.5 Further tricks up the sleeve

This section started by emphasising the wide space of human goals, and then introduced tricks to push goal systems further beyond these boundaries. The list isn’t exhaustive: there are surely more devices and ideas one can use to continue to extend the space of possible goals for human-level AIs. Though this might not be enough to get every goal, we can nearly certainly use these procedures to construct a human-level AI with any human-comprehensible goal. But would the same be true for superhuman AIs?


4 Orthogonality for superhuman AIs

We now come to the area where the Orthogonality thesis seems the most vulnerable. It is one thing to have human-level AIs, or abstract superintelligent algorithms created ex nihilo, with certain goals. But if ever the human race were to design a superintelligent AI, there would be some sort of process involved – directed evolution, recursive self-improvement (Yudkowsky, 2001), design by a committee of AIs, or similar – and it seems at least possible that such a process could fail to fully explore the goal-space. If we define the Orthogonality thesis in this context as:

  • If we could construct superintelligent AIs at all, we could construct superintelligent AIs with more or less any goals (as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence).

There are two counter-theses. The weakest claim is:

  • Incompleteness: there are some goals that no superintelligence designed by us could have.

A stronger claim is:

  • Convergence: all human-designed superintelligences would have one of a small set of goals.

They should be distinguished; Incompleteness is all that is needed to contradict Orthogonality, but Convergence is often the issue being discussed. Often convergence is assumed to be to some particular model of metaethics (Müller, 2012).

4.1 No convergence

The plausibility of the convergence thesis is highly connected with the connotations of the terms used in it. “All human-designed rational beings would follow the same morality (or one of small sets of moralities)” sound plausible; in contract “All human-designed superefficient algorithms would accomplish the same task” seems ridiculous. To quote an online commentator, how good at playing chess would a chess computer have to be before it started feeding the hungry?

Similarly, if there were such a convergence, then all self-improving or constructed superintelligence must fall prey to it, even if it were actively seeking to avoid it. After all, the lower-level AIs or the AI designers have certain goals in mind (as we’ve seen in the previous section, potentially any goals in mind). Obviously, they would be less likely to achieve their goals if these goals were to change (Omohundro, 2008) (Bostrom, 2012). The same goes if the superintelligent AI they designed didn’t share these goals. Hence the AI designers will be actively trying to prevent such a convergence, if they suspected that one was likely to happen. If for instance their goals were immoral, they would program their AI not to care about morality; they would use every trick up their sleeves to prevent the AI’s goals from drifting from their own.

So the convergence thesis requires that for the vast majority of goals G:

  1. It is possible for a superintelligence to exist with goal G (by section 2).
  2. There exists an entity with goal G (by section 3), capable of building a superintelligent AI.
  3. Yet any attempt of that entity to build a superintelligent AI with goal G will be a failure, and the superintelligence’s goals will converge on some other goal.
  4. This is true even if the entity is aware of the convergence and explicitly attempts to avoid it.

This makes the convergence thesis very unlikely. The argument also works against the incompleteness thesis, but in a weaker fashion: it seems more plausible that some goals would be unreachable, despite being theoretically possible, rather than most goals being unreachable because of convergence to a small set.

There is another interesting aspect of the convergence thesis: it postulates that certain goals G will emerge, without them being aimed for or desired. If one accepts that goals aimed for will not be reached, one has to ask why convergence is assumed: why not divergence? Why not assume that though G is aimed for, random accidents or faulty implementation will lead to the AI ending up with one of a much wider array of possible goals, rather than a much narrower one.

4.2 Oracles show the way

If the Orthogonality thesis is wrong, then it implies that Oracles are impossible to build. An Oracle is a superintelligent AI that accurately answers questions about the world (Armstrong, Sandberg, & Bostrom, 2011). This includes hypothetical questions about the future, which means that we can produce a superintelligent AI with goal G by wiring a human-level AI with goal G to an Oracle: the human level AI will go through possible actions, have the Oracle check the outcomes, and choose the one that best accomplishes G.

What makes the “no Oracle” position even more counterintuitive is that any superintelligence must be able to look ahead, design actions, predict the consequences of its actions, and choose the best one available. But the convergence thesis implies that this general skill is one that we can make available only to AIs with certain specific goals. Though the agents with those narrow goals are capable of doing these predictions, they automatically lose this ability if their goals were to change.

4.3 Tricking the controller

Just as with human-level AIs, one could construct a superintelligent AI by wedding together a superintelligence with a large dedicated committee of human-level AIs dedicated to implementing a goal G, and checking the superintelligence’s actions. Thus to deny the Orthogonality thesis requires that one believes that the superintelligence is always capable of tricking this committee, no matter how detailed and thorough their oversight.

This argument extends the Orthogonality thesis to moderately superintelligent AIs, or to any situation where there was a diminishing return to intelligence. It only fails if we take AI to be fantastically superhuman: capable of tricking or seducing any collection of human-level beings.

4.4 Temporary fragments of algorithms, fictional worlds and extra tricks

These are other tricks that can be used to create an AI with any goals. For any superintelligent AI, there are certain inputs that will make it behave in certain ways. For instance, a human-loving moral AI could be compelled to follow most goals G for a day, if they were rewarded with something sufficiently positive afterwards. But its actions for that one day are the result of a series of inputs to a particular algorithm; if we turned off the AI after that day, we would have accomplished moves towards goal G without having to reward its “true” goals at all. And then we could continue the trick the next day with another copy.

For this to fail, it has to be the case that we can create an algorithm which will perform certain actions on certain inputs as long as it isn’t turned off afterwards, but that we cannot create an algorithm that does the same thing if it was to be turned off.

Another alternative is to create a superintelligent AI that has goals in a fictional world (such as a game or a reward channel) over which we have control. Then we could trade interventions in the fictional world against advice in the real world towards whichever goals we desire.

These two arguments may feel weaker than the ones before: they are tricks that may or may not work, depending on the details of the AI’s setup. But to deny the Orthogonality thesis requires not only denying that these tricks would ever work, but denying that any tricks or methods that we (or any human-level AIs) could think up, would ever work at controlling the AIs. We need to assume superintelligent AIs cannot be controlled.

4.5 In summary

Denying the Orthogonality thesis thus requires that:

  1. There are goals G, such that an entity an entity with goal G cannot build a superintelligence with the same goal. This despite the fact that the entity can build a superintelligence, and that a superintelligence will goal G can exist.
  2. Goal G cannot arise accidentally from some other origin, and errors and ambiguities do not significantly broaden the space of possible goals.
  3. Oracles and general purpose planners cannot be built. Superintelligent AIs cannot have their planning abilities repurposed.
  4. A superintelligence will always be able to trick its controllers, and there is no way the controllers can set up a reasonable system of control.
  5. Though we can create an algorithm that does certain actions if it was not to be turned off after, we cannot create an algorithm that does the same thing if it was to be turned off after.
  6. An AI will always come to care intrinsically about things in the real world.
  7. No tricks can be thought up to successfully constrain the AI’s goals: superintelligent AIs cannot be controlled.


5 Bayesian Orthogonality thesis

All the previous sections concern hypotheticals, but of a different kind. Section 2 touches upon what kinds of algorithm could theoretically exist. But sections 3 and 4 concern algorithms that could be constructed by humans (or from AIs originally constructed by humans): they refer to the future. As AI research advances, and certain approaches or groups start to show or lose prominence, we’ll start getting a better idea of how such an AI will emerge.

Thus the orthogonality thesis will narrow as we achieve better understanding of how AIs would work in practice, of what tasks they will be put to and of what requirements their designers will desire. Most importantly of all, we will get more information on the critical question as to whether the designers will actually be able to implement their desired goals in an AI. On the eve of creating the first AIs (and then the first superintelligent AIs), the Orthogonality thesis will likely have pretty much collapsed: yes, we could in theory construct an AI with any goal, but at that point, the most likely outcome is an AI with particular goals – either the goals desired by their designers, or specific undesired goals and error modes.

However, until that time arises, because we do not know any of this information currently, we remain in the grip of a Bayesian version of the Orthogonality thesis:

  • As far as we know now (and as far as we’ll know until we start building AIs), if we could construct superintelligent AIs at all, we could construct superintelligent AIs with more or less any goals (as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence).


6 Conclusion

It is not enough to know that an agent is intelligent (or superintelligent). If we want to know something about its final goals, about the actions it will be willing to undertake to achieve them, and hence its ultimate impact on the world, there are no shortcuts. We have to directly figure out what these goals are, and cannot rely on the agent being moral just because it is superintelligent/superefficient.


7 Acknowledgements

It gives me great pleasure to acknowledge the help and support of Anders Sandberg, Nick Bostrom, Toby Ord, Owain Evans, Daniel Dewey, Eliezer Yudkowsky, Vladimir Slepnev, Viliam Bur, Matt Freeman, Wei Dai, Will Newsome, Paul Crowley, Alexander Kruel and Rasmus Eide, as well as those members of the Less Wrong online community going by the names shminux, Larks and Dmytry.


8 Bibliography

Armstrong, S., Sandberg, A., & Bostrom, N. (2011). Thinking Inside the Box: Controlling and Using an Oracle AI. forthcoming in Minds and Machines .

Bostrom, N. (2012). Superintelligence: Groundwork to a Strategic Analysis of the Machine Intelligence Revolution. to be published.

Bostrom, N. (2011). The Superintelligent Will: Motivation and Instrumental Rationality in Advance Artificial Agents. forthcoming in Minds and Machines .

de Fabrique, N., Romano, S. J., Vecchi, G. M., & van Hasselt, V. B. (2007). Understanding Stockholm Syndrome. FBI Law Enforcement Bulletin (Law Enforcement Communication Unit) , 76 (7), 10-15.

Hume, D. (1739). A Treatise of Human Nature. 

Hutter, M. (2005). Universal algorithmic intelligence: A mathematical top-down approach. In B. Goertzel, & C. Pennachin (Eds.), Artificial General Intelligence. Springer-Verlag.

Müller, J. (2012). Ethics, risks and opportunities of superintelligences. Retrieved May 2012, from

Omohundro, S. M. (2008). The Basic AI Drives. In P. Wang, B. Goertzel, & S. Franklin (Eds.), Artificial General Intelligence: Proceedings of the First AGI Conference (Vol. 171).

Sandberg, A., & Bostrom, N. (2008). Whole brain emulation: A roadmap. Future of Humanity Institute Technical report , 2008-3.

Schmidhuber, J. (2007). Gödel machines: Fully self-referential optimal universal self-improvers. In Artificial General Intelligence. Springer.

Wang, P. (2011). The assumptions on knowledge and resources in models of rationality. International Journal of Machine Consciousness , 3 (1), 193-218.

Yudkowsky, E. (2001). General Intelligence and Seed AI 2.3. Retrieved from Singularity Institute for Artificial Intelligence:


[1] We need to assume it has goals, of course. Determining whether something qualifies as a goal-based agent is very tricky (researcher Owain Evans is trying to establish a rigorous definition), but this paper will adopt the somewhat informal definition that an agent has goals if it achieves similar outcomes from very different starting positions. If the agent ends up making ice cream in any circumstances, we can assume ice creams are in its goals.

[2] Every law of nature being algorithmic (with some probabilistic process of known odds), and no exceptions to these laws being known.

[3] One could argue that we should consider the space of general animal intelligences – octopuses, supercolonies of social insects, etc... But we won’t pursue this here; the methods described can already get behaviours like this.

[4] Even today, many people have had great fun torturing and abusing their characters in games like “the Sims” ( The same urges are present, albeit diverted to fictionalised settings. Indeed games offer a wide variety of different goals that could conceivably be imported into an AI if it were possible to erase the reality/fiction distinction in its motivation.

[5] As can be shown by a glance through a biography of famous people – and famous means they were generally allowed to rise to prominence in their own society, so the space of possible motivations was already cut down.

[6] Of course, if we built an AI with that goal and copied it millions of times, it would no longer be niche.

[7] Such as the hostages suffering from Stockholm syndrome (de Fabrique, Romano, Vecchi, & van Hasselt, 2007).

New to LessWrong?

New Comment
155 comments, sorted by Click to highlight new comments since: Today at 7:32 AM
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

For utility function maximisers, the AIXI is the theoretically best agent there is, more successful at reaching its goals (up to a finite constant) than any other agent (Hutter, 2005).

False. AIXI as defined can maximize only a sensory reward channel, not a utility function over an environmental model with a known ontology. As Dewey demonstrates, this problem is not easy to fix; AIXI can have utility functions over (functions of) sensory data, but its environment-predictors vary freely in ontology via Solomonoff induction, so it can't have a predefined utility function over the future of its environment without major rewriting.

AIXI is the optimal function-of-sense-data maximizer for Cartesian agents with unbounded computing power and access to a halting oracle, in a computable environment as separated from AIXI by the Cartesian boundary, given that your prior belief about the possible environments matches AIXI's Solomonoff prior.

Thanks for the correction. Daniel hadn't mentioned that as a problem when he reviewed the paper, so, I took it as being at least approximately correct, but it is important to be as rigorous as possible. I'll see what can be rescued, and what needs to be reworked.

Here's an attack on section 4.1. Consider the possibility that "philosophical ability" (something like the ability to solve confusing problems that can't be easily formalized) is needed to self-improve beyond some threshold of intelligence, and this same "philosophical ability" also reliably causes one to decide that some particular goal G is the right goal to have, and therefore beyond some threshold of intelligence all agents have goal G. To deny this possibility seems to require more meta-philosophical knowledge than we currently possess.

Yes, to deny it requires more meta-philosophical knowledge than we currently possess. But to affirm it as likely requires more meta-philosophical knowledge than we currently possess. My purpose is to show that it's very unlikely, not that it's impossible. Do you feel I didn't make that point? Should I have addressed "moral realism" explicitly? I didn't want to put down the words, because it raises defensive hackles if I start criticising a position directly.
6Wei Dai12y
Perhaps I should have said "To conclude that this possibility is very unlikely" instead of "To deny this possibility". My own intuition seems to assign a probability to it that is greater than "very unlikely" and this was largely unchanged after reading your paper. For example, many of the items in the list in section 4.5, that have to be true if orthogonality was false, can be explained by my hypothesis, and the rest do not seem very unlikely to begin with.

My own intuition seems to assign a probability to it that is greater than "very unlikely"

Why? You're making an extraordinary claim. Something - undefined - called philosophical ability is needed (for some reason) to self improve and, for some extraordinary and unexplained reason, this ability causes an agent to have a goal G. Where goal G is similarly undefined.

Let me paraphrase: Consider the possibility that "mathematical ability" is needed to self-improve beyond some threshold of intelligence, and this same "mathematical ability" also reliably causes one to decide that some particular goal G is the right goal to have, and therefore beyond some threshold of intelligence all agents have goal G.

Why is this different? What in your intuition is doing the work "philosophical ability" -> same goals? If we call it something else than "philosophical ability", would you have the same intuition? What raises the status of that implication to the level that it's worthy of consideration?

I'm asking seriously - this is the bit in the argument I consistently fail to understand, the bit that never makes sense to me, but who's outline I can feel in most counterarguments.

8Wei Dai12y
It seems to me there are certain similarities and correlations between thinking about decision theory (which potentially makes one or an AI one builds more powerful) and thinking about axiology (what terminal goals one should have). They're both "ought" questions, and If you consider the intelligences that we can see or clearly reason about (individual humans, animals, Bayesian EU maximizer, narrow AIs that exist today), there seems a clear correlation between "ability to improve decision theory via philosophical reasoning" (as opposed to CDT-AI changing into XDT and then being stuck with that) and "tendency to choose one's goals via philosophical reasoning". One explanation for this correlation (and also the only explanation I can see at the moment, besides it being accidental) is that something we call "philosophical ability" is responsible for both. Assuming that's the case, that still leaves the question of whether philosophical ability backed up with enough computing power eventually leads to goal convergence. One major element of philosophical reasoning seems to be a distaste for and tendency to avoid arbitrariness. It doesn't seem implausible that for example "the ultimate philosopher" would decide that every goal except pursuit of pleasure / avoidance of pain is arbitrary (and think that pleasure/pain is not arbitrary due to philosophy-of-mind considerations).
If an agent has goal G1 and sufficient introspective access to know its own goal, how would avoiding arbirtrariness in its goals help it achieve goal G1 better than keeping goal G1 as its goal? I suspect we humans are driven to philosophize about what our goals ought to be by our lack of introspective access, and that searching for some universal goal, rather than what we ourselves want, is a failure mode of this philosophical inquiry.

I think we don't just lack introspective access to our goals, but can't be said to have goals at all (in the sense of preference ordering over some well defined ontology, attached to some decision theory that we're actually running). For the kind of pseudo-goals we have (behavior tendencies and semantically unclear values expressed in natural language), they don't seem to have the motivational strength to make us think "I should keep my goal G1 instead of avoiding arbitrariness", nor is it clear what it would mean to "keep" such pseudo-goals as one self-improves.

What if it's the case that evolution always or almost always produces agents like us, so the only way they can get real goals in the first place is via philosophy?

The primary point of my comment was to argue that an agent that has a goal in the strong sense would not abandon its goal as a result of philosophical consideration. Your response seems more directed at my afterthought about how our intuitions based on human experience would cause us to miss the primary point. I think that we humans do have goals, despite not being able to consistantly pursue them. I want myself and my fellow humans to continue our subjective experiences of life in enjoyable ways, without modifying what we enjoy. This includes connections to other people, novel experiences, high challenge, etc. There is, of course, much work to be done to complete this list and fully define all the high level concepts, but in the end I think there are real goals there, which I would like to be embodied in a powerful agent that actually runs a coherent decision theory. Philosophy probably has to play some role in clarifying our "pseudo-goals" as actual goals, but so does looking at our "pseudo-goals", however arbitrary they may be.
5Wei Dai12y
Such an agent would also not change its decision theory as a result of philosophical consideration, which potentially limits its power. I wouldn't argue against this as written, but Stuart was claiming that convergence is "very unlikely" which I think is too strong.
I don't think that follows, or at least the agent could change its decision theory as a result of some consideration, which may or may not be "philosophical". We already have the example that a CDT agent that learns in advance it will face Newcomb's problem could predict it would do better if it switched to TDT.
2Wei Dai12y
I wrote earlier XDT (or in Eliezer's words, "crippled and inelegant form of TDT") is closer to TDT but still worse. For example, XDT would fail to acausally control/trade with other agents living before the time of its self-modification, or in other possible worlds.
Ah, yes, I agree that CDT would modify to XDT rather than TDT, though the fact that it self modifies at all shows that goal driven agents can change decision theories because the new decision theory helps it achieve its goal. I do think that it's important to consider how a particular decision theory can decide to self modify, and to design an agent with a decision theory that can self modify in good ways.
Not strictly. If strongly goal'd agent determines that a different decision theory (or any change to itself) better maximizes its goal, it would adopt that new decision theory or change.
I agree that humans are not utility-maximizers or similar goal-oriented agents - not in the sense we can't be modeled as such things, but in the sense that these models do not compress our preferences to any great degree, which happens to be because they are greatly at odds with our underlying mechanisms for determining preference and behavior.
Avoiding arbitrariness is useful to epistemic rationality and therefore to instrumental rationality. If an AI has rationality as a goal it will avoid arbitrariness, whether or not that assists with G1.
Avoiding giving credence to arbitrary beliefs is useful to epistemic rationality and therefor to instrumental rationality, and therefor to goal G1. Avoiding arbitrariness in goals still does not help with achieving G1 if G1 is considered arbitrary. Be careful not to conflate different types of arbitrariness. Rationality is not an end goal, it is that which you do in pursuit of a goal that is more important to you than being rational.
Robin Hanson's 'far mode' (his take on construal level theory) is a plausible match to this 'something'. Hanson points out that far mode is about general categories and creative metaphors. This is a match to something from AGI research...categorization and analogical inference. This can be linked to Bayesian inference by considering analogical inference as a natural way of reasoning about 'priors'. A plausible explanation is that analogical inference is associated with sentience (subjective experience), as suggested by Douglas Hofstadter (who has stated he thinks 'analogies' are the core of conscious cognition). Since sentience is closely associated with moral reasoning, it's at least plausible that this ability could indeed give rise to converge on a particular G. Here is a way G can be defined: Analogical inference is concerned with Knowledge Representation (KR), so we could redefine ethics based on 'representations of values' ('narratives', which as Daniel Dennett has pointed out,indeed seem to be closely linked to subjective experience) rather than external consequences. At this point we can bring in the ideas of Schmidhuber and recall a powerful point made by Hanson (see below). For maximum efficiency, all AGIs with the aforementioned 'philosophical ability' (analogical inference and production of narratives) would try to minimize the complexity of the cognitive processes generating its internal narratives. This could place universal contraints of what these values are. For example, Schmidhuber pointed out that data compression could be used to get a precise definition of 'beauty'. Lets now recall a powerful point Hanson made a while back on OB: the brain/mind can be totally defined in terms of a 'signal processor'. Given this perspective, we could then view the correct G as the 'signal' and moral errors as 'noise'. Algorithmic information theory could then be used to define a complexity metric that would precisely define this G.
6Paul Crowley12y
Schmidthuber's definition of beauty is wrong. He says, roughly, that you're most pleased when after great effort you find a way to compress what was seemingly incompressible. If that were so, I could please you again and again by making up new AES keys with the first k bits random and the rest zero, and using them to generate and give you a few terabytes of random data. You'd have to brute force the key, at which point you'll have compressed down from terabytes to kilobytes. What beauty! Let's play the exact game again, with the exact same cipher but a different key, forever.
Right. That said, wireheading, aka the grounding problem, is a huge unsolved philosophical problem, so I'm not sure Schmidhuber is obligated to answer wireheading objections to his theory.
But the theory fails because this fits it but isn't wireheading, right? It wouldn't actually be pleasing to play that game.
I think you are right. The two are errors that practically, with respect to hedonistic extremism, operate in opposing directions. They are similar in form in as much as they fit the abstract notion "undesirable outcomes due to lost purposes when choosing to optimize what turns out to be a poor metric for approximating actual preferences".
Meh, yeah, maybe? Still seems like other, more substantive objections could be made. Relatedly, I'm not entirely sure I buy Steve's logic. PRNGs might not be nearly as interesting as short mathematical descriptions of complex things, like Chaitin's omega. Arguably collecting as many bits of Chaitin's omega as possible, or developing similar maths, would in fact be interesting in a human sense. But at that point our models really break down for many reasons, so meh whatever.
Unsolved philsophical problem? Huh? No additional philosophical breakthroughs are required for wireheading to not be a problem. If I want (all things considered, etc) to wirehead, I'll wirehead. If I don't want to wirehead I will not wirehead. Wireheading introduces no special additional problems and is handled the same way all other preferences about future states of the universe can be handled. (Note: It is likely that you have some more specific point regarding in what sense you consider wireheading 'unsolved'. I welcome explanations or sources.)
Unsolved in the sense that we don't know how to give computer intelligences intentional states in a way that everyone would be all like "wow that AI clearly has original intentionality and isn't just coasting off of humans sitting at the end of the chain interpreting their otherwise entirely meaningless symbols". Maybe this problem is just stupid and will solve itself but we don't know that yet, hence e.g. Peter's (unpublished?) paper on goal stability under ontological shifts. (ETA: I likely don't understand how you're thinking about the problem.)
Being able to do this would also be a step towards the related goal of trying to give computer intelligences intelligence that we cannot construe as 'intentionality' in any morally salient sense, so as to satisfy any "house-elf-like" qualms that we may have. I assume you mean Ontological Crises in Artificial Agents’ Value Systems? I just finished republishing that one. Originally published form. New SingInst style form. A good read.
Engineering ability suffices: Do philosophers have an incredibly strong ugh field around anything that can be deemed 'implementation detail'? Clearly, 'superintelligence' the string of letters can have what ever 'goals' the strings of letters, no objection here. The superintelligence in form of distributed system with millisecond or worse lag between components, and nanosecond or better clock speed, on the other hand...
Looking at your post at, I can see the sketch of an argument. It goes something like "we know that some decision theories/philosophical processes are 'objectively 'inferior, hence some are objectively superior, hence (wave hands furiously) it is at least possible that some system is objectively best". I would counter: 1) The argument is very weak. We know some mathematical axiomatic systems are contradictory, hence inferior. It doesn't follow from that that there is any "best" system of axioms. 2) A lot of philosophical progress is entirely akin to mathematical progress: showing the consequences of the axioms/assumptions. This is useful progress, but not really relevant to the argument. 3) All the philosophical progress seems to lie on the "how to make better decisions given a goal" side; none of it lies on the "how to have better goals" side. Even the expected utility maximisation result just says "if you are unable to predict effectively over the long term, then to achieve your current goals, it would be more efficient to replace these goals with others compatible with a utility function". However, despite my objections, I have to note that the argument is at least an argument, and provides some small evidence in that direction. I'll try and figure out whether it should be included in the paper.
Other possibility that is easy to see if you are to think more like an engineer and less like philosopher: The AI is to operate with light-speed delay, and has to be made of multiple nodes. It is entirely possible that some morality systems would not allow efficient solutions to this challenge (i.e. would break into some sort of war between modules, or otherwise fail to intellectually collaborate). It is likely that there's only a limited number of good solutions to P2P intelligence design, and the one that would be found would be substantially similar to our own solution of fundamentally same problem, solution which we call 'morality', complete with various non-utilitarian quirks. edit: that is, our 'morality' is the set of rules for inter-node interaction in society, and some of such rules just don't work. Orthogonality thesis for anything in any sense practical is a conjunction of potentially very huge number of propositions (which are assumed false without consideration, by omission) - any sort of consideration not yet considered can break the symmetry between different goals, then another such consideration is incredibly unlikely to add symmetry back.
If an agent with goal G1 acquires sufficient "philosophical ability", that it concludes that goal G is the right goal to have, that means that it decided that the best way to achieve goal G1 is to pursue goal G. For that to happen, I find it unlikely that goal G is anything other than a clarification of goal G1 in light of some confusion revealed by the "philosophical ability", and I find it extremely unlikely that there is some universal goal G that works for any goal G1.
Offbeat counter: You're assuming that this ontology that privileges "goals" over e.g. morality is correct. What if it's not? Are you extremely confident that you've carved up reality correctly? (Recall that EU maximizers haven't been shown to lead to AGI, and that many philosophers who have thought deeply about the matter hold meta-ethical views opposed to your apparent meta-ethics.) I.e., what if your above analysis is not even wrong?
I don't believe that goals are ontologically fundamental. I am reasoning (at a high level of abstraction) about the behavior of a physical system designed to pursue a goal. If I understood what you mean by "morality", I could reason about a physical system designed to use that and likely predict different behaviors than for the physical system designed to pursue a goal, but that doesn't change my point about what happens with goals. I don't expect EU maximizers to lead to AGI. I expect EU maximizing AGIs, whatever has led to them, to be effective EU maximizers.
Sorry, I meant "ontology" in the information science sense, not the metaphysics sense; I simply meant that you're conceptually (not necessarily metaphysically) privileging goals. What if you're wrong to do that? I suppose I'm suggesting that carving out "goals" might be smuggling in conclusions that make you think universal convergence is unlikely. If you conceptually privileged rational morality instead, as many meta-ethicists do, then your conclusions might change, in which case it seems you'd have to be unjustifiably confident in your "goal"-centric conceptualization.
I think I am only "privileging" goals in a weak sense, since by talking about a goal driven agent, I do not deny the possibility of an agent built on anything else, including your "rational morality", though I don't know what that is. Are you arguing that a goal driven agent is impossible? (Note that this is a stronger claim than it being wiser to build some other sort of agent, which would not contradict my reasoning about what a goal driven agent would do.)
(Yeah, the argument would have been something like, given a sufficiently rich and explanatory concept of "agent", goal-driven agents might not be possible --- or more precisely, they aren't agents insofar as they're making tradeoffs in favor of local homeostatic-like improvements as opposed to traditionally-rational, complex, normatively loaded decision policies. Or something like that.)
Let me try to strengthen your point. If an agent with goal G1 acquires sufficient "philosophical ability", that it concludes that goal G is the right goal to have, that means that it decided that the best way to achieve goal G1 is to pursue what it thinks is the "right goal to have". This would require it to take a kind of normative stance on goal fulfillment, which would require it to have normative machinery, which would need to be implemented in the agents mind. Is it impossible to create an agent without normative machinery of this kind? Does philosophical ability depend directly on normative machinery?

‘maximising paperclips’

Since you want a non-LWian audience, make that “maximising the number of paperclips in the universe”, otherwise the meaning might be unclear.

Although, his point would still hold if the reader was imagining the goal of making extremely large paperclips.

Couple of comments:

  • The section "Bayesian Orthogonality thesis" doesn't seem right, since a Bayesian would think in terms of probabilities rather than possibilities ("could construct superintelligent AIs with more or less any goals"). If you're saying that we should assign a uniform distribution for what AI goals will be realized in the future, that's clearly wrong.
  • I think the typical AI researcher, after reading this paper, will think "sure, it might be possible to build agents with arbitrary goals if one tried, but my approach will probably lead to a benevolent AI". (See here for an example of this.) So I'm not sure why you're putting so much effort into this particular line of argument.
This is the first step (pointed more towards philosophers). Formalise the "we could construct an AI with arbitrary goals", and with that in the background, zoom in on the practical arguments with the AI researchers. Will restructure the Bayesian section. Some philosophers argue things like "we don't know what moral theories are true, but a rational being would certainly find them"; I want to argue that this is equivalent, from our perspective, with the AI's goals ending up anywhere. What I meant to say is that ignorance of this type is like any other type of ignorance, hence the "Bayesian" terminology.
6Wei Dai12y
Ok, in that case I would just be wary about people being tempted to cite the paper to AI researchers without having the followup arguments in place, who would then think that their debating/discussion partners are attacking a strawman.
Hum, good point; I'll try and put in some disclaimer, emphasising that this is a partial result...
1Wei Dai12y
Thanks. To go back to my original point a bit, how useful is it to debate philosophers about this? (When debating AI researchers, given that they probably have a limited appetite for reading papers arguing that what they're doing is dangerous, it seems like it would be better to skip this paper and give the practical arguments directly.)
Maybe I've spent too much time around philosophers - but there are some AI designers who seem to spout weak arguments like that, and this paper can't hurt. When we get a round to writing a proper justification for AI researchers, having this paper to refer back to avoids going over the same points again. Plus, it's a lot easier to write this paper first, and was good practice.
Without getting in to the likelihood of a 'typical AI researcher' successfully creating a benevolent AI, do you doubt Goertzel's "Interdependency Thesis"? I find both to be rather obviously true. Yes its possible in principle for almost any goal system to be combined with almost any type or degree of intelligence, but that's irrelevant because in practice we can expect the distributions over both to be highly correlated in some complex fashion. I really don't understand why this Orthogonality idea is still brought up so much on LW. It may be true, but it doesn't lead to much. The space of all possible minds or goal systems is about as relevant to the space of actual practical AIs as the space of all configuration of a human's molecules is to the space of a particular human's set of potential children.

We will also take the materialistic position that humans themselves can be viewed as non-deterministic algorithms[2]

I'm not a philosopher of mind but I think "materialistic" might be a misleading word here, being too similar to "materialist". Wouldn't "computationalistic" or maybe "functionalistic" be more precise? ("-istic" as opposed to "-ist" to avoid connotational baggage.) Also it's ambiguous whether footnote two is a stipulation for interpreting the paper or a brief description of the consensus view in physics.

At various points you make somewhat bold philosophical or conceptual claims based off of speculative mathematical formalisms. Even though I'm familiar with and have much respect for the cited mathematics, this still makes me nervous, because when I read philosophical papers that take such an approach my prior is high for subtle or subtly unjustified equivocation; I'd be even more suspicious were I a philosopher who wasn't already familiar with universal AI, which isn't a well-known or widely respected academic subfield. The necessity of finding clearly trustworthy analogies between mathematical and phenomena... (read more)

Just some minor text corrections for you:

From 3.1

The utility function picture of a rational agent maps perfectly onto the Orthogonality thesis: here have the goal structure, the utility fu...

...could be "here we have the...

From 3.2

Human minds remain our only real model of general intelligence, and this strongly direct and informs...

this strongly directs and informs...

From 4.1

“All human-designed rational beings would follow the same morality (or one of small sets of moralities)” sound plausible; in contract “All human-designed superefficient

... (read more)
6Simon Fischer12y
From 3.3 to do so(?) we would From 3.4 of a single given agent From 4.1 every, or change the rest of the sentence (superintelligences, they were) From 4.5

I like the paper, but am wondering how (or whether) it applies to TDT and acausal trading. Doesn't the trading imply a form of convergence theorem among very powerful TDT agents (they should converge on an average utility function constructed across all powerful TDT agents in logical space)?

Or have I missed something here? (I've been looking around on Less Wrong for a good post on acausal trading, and am finding bits and pieces, but no overall account.)

It does indeed imply a form of convergence. I would assume Stuart thinks of the convergence as an artifact of the game environment the agents are in. Not a convergence in goals, just behavior. Albeit the results are basically the same.
6Wei Dai12y
If there's convergence in goals, then we don't have to worry about making an AI with the wrong goals. If there's only convergence in behavior, then we do, because building an AI with the wrong goals will shift the convergent behavior in the wrong direction. So I think it makes sense for Stuart's paper to ignore acausal trading and just talk about whether there is convergence in goals.
Not necessarily, it might destroy the earth before its goals converge.
Global scale acausal trading, if it's possible in practice (and it's probably not going to be, we only have this theoretical possibility but no indication that it's possible to actually implement), implies uniform expected surface behavior of involved agents, but those agents trade control over their own resources (world) for optimization of their own particular preference by the global acausal economy. So even if the choice of AI's preference doesn't have significant impact on what happens in AI's own world, it does have significant impact on what happens globally, on the order of what all the resources in AI's own world can buy.
There was an incident of censorship by EY relating to acausal trading - the community's confused response (chilling effects? agreement?) to that incident explains why there is no overall account.
7Wei Dai12y
No, I think it's more that the idea (acausal trading) is very speculative and we don't have a good theory of how it might actually work.
Thanks for this... Glad it's not being censored! I did post the following on one of the threads, which suggested to me a way in which it would happen or at least get started Again, apologies if this idea is nuts or just won't work. However, if true, it did strike me as increasing the chance of a simulation hypothesis. (It gives powerful TDT AIs a motivation to simulate as many civilizations as they can, and in a "state of nature", so that they get to see what the utility functions are like, and how likely they are to also build TDT-implementing AIs...)
It was censored, though there's a short excerpt here.
By the way, I still can't stop thinking about that post after 6 months. I think it's my favorite wild-idea scenario I've ever heard of.

If a goal is a preference order over world states, then there are uncountably many of them, so any countable means of expression can only express a vanishingly small minority of them. Trivially (as Bostrom points out) a goal system can be too complex for an agent of a given intelligence. It therefore seems to me that what we're really defending is an Upscalability thesis: if an agent A with goal G is possible, then a significantly more intelligent A++ with goal G is possible.

Thus to deny the Orthogonality thesis is to assert that there is a goal system G, such that, among other things:
(1) There cannot exist any efficient real-world algorithm with goal G.
(2) If a being with arbitrarily high resources, intelligence, time and goal G, were to try design an efficient real-world algorithm with the same goal, it must fail.
(3) If a human society were highly motivated to design an efficient real-world algorithm with goal G, and were given a million years to do so along with huge amounts of resources, training and knowledge about AI, i

... (read more)
As I said to Wei, we can start dealing with those arguments once we've got strong foundations. I'll see if the value drift issue can be better integrated in the argumentation.

our race spans foot-fetishists, religious saints, serial killers, instinctive accountants, role-players, self-cannibals, firefighters and conceptual artists. The autistic, those with exceptional social skills, the obsessive compulsive and some with split-brains. Beings of great empathy and the many who used to enjoy torture and executions as public spectacles

Some of these are not really terminal goals. A fair number of people with strong sexual fetishes would be perfectly happy without them, and in more extreme cases really would prefer not to have them... (read more)

A fair number of people with strong sexual fetishes

there are some serial killers

It was an existence argument. That some more people aren't examples doesn't really change matters, does it?

to avoid worrying about robot bodies and such-like, we may restrict the list of tasks to those accomplishable over the internet

Many of the tasks I accomplish over the internet require there to be people who know me in real life, some require me to have a body and voice which looks and sounds human (in photos and videos at least) and a few require me to be enrolled in my university, have a bank account, be a citizen of my country, vel sim. (Adding “anonymously” and “for free” ought to fix that.)

I don't see why there are only two counter-theses in section 4. Or rather, it looks as though you want a too-strong claim - in order to criticise it.

Try a "partial convergence" thesis instead. For instance, the claim that goals that are the product of cultural or organic evolution tend to maximise entropy and feature universal instrumental values.

The incompleteness claim is weaker than the partial convergence claim.
Sure, but if you try harder with counter-theses you might reach a reasonable position that's neither very weak nor wrong.

Minor text correction;

"dedicated committee of human-level AIs dedicated" repeats the same adjective in a small span.

More wide-ranging:

Perhaps the paper would be stronger if it explained why philosophers might feel that convergence is probable. For example, in their experience, human philosophers / philosophies converge.

In a society, where the members are similar to one another, and much less powerful than the society as a whole, the morality endorsed by the society might be based on the memes that can spread successfully. That is, a meme like '... (read more)

I'm deliberately avoiding that route. If I attack, or mention, moral realism in any form, philosophers are going to get defensive. I'm hoping to skirt the issue by narrowing the connotations of the terms (efficiency rather than intelligence and, especially, rationality).
5Wei Dai12y
You don't think a moral realist will notice that your paper contradicts moral realism and get defensive anyway? Can you write out the thoughts that you're hoping a moral realist will have after reading your paper?
Less so. "All rational beings will be moral, but this paper worries me that AI, while efficient, may not end up being rational. Maybe it's worth worrying about."
2Wei Dai12y
Why not argue for this directly, instead of making a much stronger claim ("may not" vs "very unlikely")? If you make a claim that's too strong, that might lead people to dismiss you instead of thinking that a weaker version of the claim could still be valid. Or they could notice holes in your claimed position and be too busy trying to think of attacks to have the thoughts that you're hoping for. (But take this advice with a big grain of salt since I have little idea how academic philosophy works in practice.)
Actually scratch that and reverse it - I've got an idea how to implement your idea in a nice way. Thanks!
I'm not an expert on academic philosophy either. But I feel the stronger claim might work better; I'll try and hammer the point "efficiency is not rationality" again and again.
I'm confused. "May not" is weaker than "very unlikely," in the supplied context.

Copying from a comment I already made cos no-one responded last time:

I'm not confident about any of the below, so please add cautions in the text as appropriate.

The orthogonality thesis is both stronger and weaker than we need. It suffices to point out that neither we nor Ben Goertzel know anything useful or relevant about what goals are compatible with very large amounts of optimizing power, and so we have no reason to suppose that superoptimization by itself points either towards or away from things we value. By creating an "orthogonality thesis&quo... (read more)

The orthogonality thesis is non-controversial. Ben's point is that what matters is not the question of what types of goals are theoretically compatible with superoptimization, but rather what types of goals we can expect to be associated with superoptimization in reality. In reality AGI's with superoptimization power will be created by human agencies (or their descendants) with goal systems subject to extremely narrow socio-economic filters. The other tangential consideration is that AGI's with superoptimization power and long planning horizons/zero time discount may have highly convergent instrumental values/goals which are equivalent in effect to terminal values/goals for agents with short planning horizons (such as humans). From a human perspective, we may observe all super-AGIs to appear to have strangely similar ethics/morality/goals, even though what we are really observing are convergent instrumental values and short term opening plans as their true goals concern the end of the universe and are essentially unknowable to us.
The orthogonality thesis is highly controversial - among philosophers.
2Paul Crowley12y
Right, but none of this answers what I was trying to say, which is that the burden of proof is definitely with whoever wants to assert that superintelligence tells us anything about goals. In the absence of a specific argument, "this agent is superintelligent" shouldn't be taken as informative about its goals.
A superintelligent agent doesn't just appear ex nihilio as a random sample out of the space of possible minds. Its existence requires a lengthy, complex technological development which implies the narrow socio-economic filter I mentioned above. Thus "this agent is superintelligent" is at least partially informative about the probability landscape over said agent's goals: they are much more likely than not to be related to or derived from prior goals of the agent's creators.
2Paul Crowley12y
Right, and that's one example of a specific argument. Another is the Gödelian and self-defeating examples in the main article. But neither of these do anything to prop up the Goertzel-style argument of "a superintelligence won't tile the Universe with smiley faces, because that's a stupid thing to do".
Well, Goertzel's argument is pretty much bulletproof-correct when it comes to learning algorithms like the ones he works at, where the goal is essentially set by training, alongside with human culture and human notion of stupid goal. I.e. the AI that reuses human culture as a foundation for superhuman intelligence. Ultimately, orthogonality dissolves once you start being specific what intelligence we're talking of - assume that it has speed of light lag and is not physically very small, and it dissolves, assume that it is learning algorithm that gets to adult human level by absorbing human culture, and it dissolves, etc etc. The orthogonality thesis is only correct in the sense that being entirely ignorant of the specifics of what the 'intelligence' is you can't attribute any qualities to it, which is trivially correct.
While that specific Goertzel-style argument is not worth bothering with, the more supportable version of that line of argument is: based on the current socio-economic landscape of earth, we can infer something of the probability landscape over near future earth superintelligent agent goal systems, namely that they will be tightly clustered around regions in goal space that are both economically useful and achievable. Two natural attractors in that goal space will be along the lines of profit maximizers or intentionally anthropocentric goal systems. The evidence for this distribution over goal space is already rather abundant if one simply surveys existing systems and research. Market evolutionary forces make profit maximization a central attractor, likewise socio-cultural forces pull us towards anthropocentric goal systems (and of course the two overlap). The brain reverse engineering and neuroscience heavy tract in the AGI field in particular should eventually lead to anthropocentric designs, although it's worth mentioning that some AGI researches (ie opencog) are aiming for explicit anthropocentric goal systems without brain reverse engineering.
0Paul Crowley12y
Isn't that specific Goertzel-style argument the whole point of the Orthogonality Thesis? Even in its strongest form, the Thesis doesn't do anything to address your second paragraph.
I'm not sure. I don't think the specific quote of Goertzel is an accurate summary of his views, and the real key disagreements over safety concern this admittedly nebulous distribution of future AGI designs and goal systems.

I don't think section 4.1 defeats your wording of your Convergence Thesis.

Convergence: all human-designed superintelligences would have one of a small set of goals.

The way you have worded this, I read it as trivially true. The set of human designed superintelligences is necessarily a tiny subset of the space of all superintelligences, and thus the set of dependent goals of human-designed superintelligences is a tiny subset of the space of all goals.

Much depends on your useage of 'small'. Small relative to what?

I think you should clarify notions of conver... (read more)

Who is your target audience? Can you pretend to be the actual person you are trying to convince and do your absolute best to demolish the arguments presented in this paper? (You can find their arguments in their publications and apply them to your paper.) And no counter-objections until you finished writing what essentially is a referee report. If you need some extra motivation, pretend that you are being paid $100 for each argument that convinces the rest of the audience and $1000 for each argument that convinces the paper author. When done, post the referee report here, and people will tell you whether you did a good job.

Can you pretend to be the actual person you are trying to convince and do your absolute best to demolish the arguments presented in this paper?

No, I cannot. I've read the various papers, and they all orbit around an implicit and often unstated moral realism. I've also debated philosophers on this, and the same issue rears its head - I can counter their arguments, but their opinions don't shift. There is an implicit moral realism that does not make any sense to me, and the more I analyse it, the less sense it makes, and the less convincing it becomes. Every time a philosopher has encouraged me to read a particular work, it's made me find their moral realism less likely, because the arguments are always weak.

I can't really put myself in their shoes to successfully argue their position (which I could do with theism, incidentally). I've tried and failed.

If someone can help we with this, I'd be most grateful. Why does "for reasons we don't know, any being will come to share and follow specific moral principles (but we don't know what they are)", rise to seem plausible?

Just how diverse is human motivation? Should we discount even sophisticated versions of psychological hedonism? Undoubtedly, the "pleasure principle" is simplistic as it stands. But one good reason not to try heroin, for example, is precisely that the reward architecture of our opioid pathways is so similar. Previously diverse life-projects of first-time heroin users are at risk of converging on a common outcome. So more broadly, let's consider the class of life-supporting Hubble volumes where sentient biological robots acquire the capacity to rewrite their genetic source code and gain mastery of their own reward circuitry. May we predict orthogonality or convergence? Certainly, there are strong arguments why such intelligences won't all become the functional equivalent of heroin addicts or wireheads or Nozick Experience Machine VR-heads (etc). One such argument is the nature of selection pressure. But _if_some version of the pleasure principle is correct, then isn't some version of the convergence conjecture at least feasible, i.e. they'll recalibrate the set-point of their hedonic treadmill and enjoy gradients of (super)intelligent (super)happiness? One needn't be a meta-ethical value-realist to acknowledge that subjects of experience universally find bliss is empirically more valuable than agony or despair. The present inability of natural science to explain first-person experiences doesn't confer second-rate ontological status. If I may quote physicist Frank Wiczek, "It is reasonable to suppose that the goal of a future-mind will be to optimize a mathematical measure of its well-being or achievement, based on its internal state. (Economists speak of 'maximizing utility'', normal people of 'finding happiness'.) The future-mind could discover, by its powerful introspective abilities or through experience, its best possible state the Magic Moment - or several excellent ones. It could build up a library of favourite states. That would be like a library of favourite
David, what are those multiple possible defeaters for convergence? As I see it, the practical defeaters that exist still don't affect the convergence thesis, they just are possible practical impediments, from unintelligent agents, to the realization of the goals of convergence.
I usually treat this behavior as something similar to the availability heuristic. That is, there's a theory that one of the ways humans calibrate our estimates of the likelihood of an event X is by trying to imagine an instance of X, and measuring how long that takes, and calculating our estimate of probability inverse-proportionally to the time involved. (This process is typically not explicitly presented to conscious awareness.) If the imagined instance of X is immediately available, we experience high confidence that X is true. That mechanism makes a certain amount of rough-and-ready engineering sense, though of course it has lots of obvious failure modes, especially as you expand the system's imaginative faculties. Many of those failure modes are frequently demonstrated in modern life. The thing is, we use much of the same machinery that we evolved for considering events like "a tiger eats my children" to consider pseudo-events like "a tiger eating my children is a bad thing." So it's easy for us to calibrate our estimates of the likelihood that a tiger eating my children is a bad thing in the same way: if an instance of a tiger eating my children feeling like a bad thing is easy for me to imagine, I experience high confidence that the proposition is true. It just feels obvious. I don't think this is quite the same thing as moral realism, but when that judgment is simply taken as an input without being carefully examined, the result is largely equivalent. Conversely, the more easily I can imagine a tiger eating my children not feeling like a bad thing, the lower that confidence. More generally, the more I actually analyze (rather than simply referencing) my judgments, the less compelling this mechanism becomes. What I expect, given the above, is that if I want to shake someone off that kind of naive moral realist position, it helps to invite them to consider situations in which they arrive at counterintuitive (to them) moral judgments. The more I do this,
But philosophers are extremely fond of analysis, and make great use of trolley problems and similar edge cases. I'm really torn - people who seem very smart and skilled in reasoning take positions that seem to make no sense. I keep telling myself that they are probably right and I'm wrong, but the more I read about their justifications, the less convincing they are...
Yeah, that's fair. Not all philosophers do this, any more than all computer programmers come up with test cases to ensure their code is doing what it ought, but I agree it's a common practice. Can you summarize one of those positions as charitably as you're able to? It might be that given that someone else can offer an insight that extends that structure.
"There are sets of objective moral truths such that any rational being that understood them would be compelled to follow them". The arguments seem mainly to be: 1) Playing around with the meaning of rationality until you get something ("any rational being would realise their own pleasure is no more valid than that of others" or "pleasure is the highest principle, and any rational being would agree with this, or else be irrational") 2) Convergence among human values. 3) Moral progress for society: we're better than we used to be, so there needs to be some scale to measure the improvements. 4) Moral progress for individuals: when we think about things a lot, we make better moral decisions than when we were young and naive. Hence we're getting better a moral reasoning, so these is some scale on which to measure this. 5) Playing around with the definition of "truth-apt" (able to have a valid answer) in ways that strike me, uncharitably, as intuition-pumping word games. When confronted with this, I generally end up saying something like "my definitions do not map on exactly to yours, so your logical steps are false dichotomies for me". 6) Realising things like "if you can't be money pumped, you must be an expected utility maximiser", which implies that expected utility maximisation is superior to other reasoning, hence that there are some methods of moral reasoning which are strictly inferior. Hence there must be better ways of moral reasoning and (this is the place where I get off) a single best way (though that argument is generally implicit, never explicit).
(nods) Nice. OK, so let me start out by saying that my position is similar to yours... that is, I think most of this is nonsense. But having said that, and trying to adopt the contrary position for didactic purposes... hm. So, a corresponding physical-realist assertion might be that there are sets of objective physical structures such that any rational being that perceived the evidence for them would be compelled to infer their existence. (Yes?) Now, why might one believe such a thing? Well, some combination of reasons 2-4 seems to capture it. That is: in practice, there at least seem to be physical structures we all infer from our senses such that we achieve more well-being with less effort when we act as though those structures existed. And there are other physical structures that we infer the existence of via a more tenuous route (e.g., the center of the Earth, or Alpha Centauri, or quarks, or etc.), to which #2 doesn't really apply (most people who believe in quarks have been taught to believe in them by others; they mostly didn't independently converge on that belief), but 3 and 4 do... when we posit the existence of these entities, we achieve worthwhile things that we wouldn't achieve otherwise, though sometimes it's very difficult to express clearly what those things actually are. (Yes?) So... ok. Does that case for physical realism seem compelling to you? If so, and if arguments 2-4 are sufficient to compel a belief in physical realism, why are their analogs insufficient to compel a belief in moral realism?
No - to me it just highlights the difference between physical facts and moral facts, making them seem very distinct. But I can see how if we had really strong 2-4, it might make more sense...
I'm not quite sure I understood you. Are you saying "no," that case for physical realism doesn't seem compelling to you? Or are you saying "no," the fact that such a case can compellingly be made for physical realism does not justify an analogous case for moral realism?
The second one!
So, given a moral realist, Sam, who argued as follows: "We agree that humans typically infer physical facts such that we achieve more well-being with less effort when we act as though those facts were actual, and that this constitutes a compelling case for physical realism. It seems to me that humans typically infer moral facts such that we achieve more well-being with less effort when we act as though those facts were actual, and I consider that an equally compelling case for moral realism." seems you ought to have a pretty good sense of why Sam is a moral realist, and what it would take to convince Sam they were mistaken. No?
Interesting perspective. Is this an old argument, or a new one? (seems vaguely similar to the Pascalian "act as if you believe, and that will be better for you"). It might be formalisable in terms of bounded agents and stuff. What's interesting is that though it implies moral realism, it doesn't imply the usual consequence of moral realism (that all agents converge on one ethics). I'd say I understood Sam's position, and that he has no grounds to disbelieve orthogonality!
I'd be astonished if it were new, but I'm not knowingly quoting anyone. As for orthogonality.. well, hm. Continuing the same approach... suppose Sam says to you: "I believe that any two sufficiently intelligent, sufficiently rational systems will converge on a set of confidence levels in propositions about physical systems, both coarse-grained (e.g., "I'm holding a rock") and fine-grained (e.g. some corresponding statement about quarks or configuration spaces or whatever). I believe that precisely because I'm a de facto physical realist; whatever it is about the universe that constrains our experiences such that we achieve more well-being with less effort when we act as though certain statements about the physical world are true and other statements are not, I believe that's an intersubjective property -- the things that it is best for me to believe about the physical world are also the things that it is best for you to believe about the physical world, because that's just what it means for both of us to be living in the same real physical world. For precisely the same reasons, I believe that any two sufficiently intelligent, sufficiently rational systems will converge on a set of confidence levels in propositions about moral systems." You consider that reasoning ungrounded. Why?
1) Evidence. There is a general convergence on physical facts, but nothing like a convergence on moral facts. Also, physcial facts, since science, are progressive (we don't say Newton was wrong, we say we have a better theory of which his was an approximation to). 2) Evidence. We have established what counts as evidence for a physical theory (and have, to some extent, separated it from simply "everyone believes this"). What then counts as evidence for a moral theory?
Awesome! So, reversing this, if you want to understand the position of a moral realist, it sounds like you could consider them in the position of a physical realist before the Enlightenment. There was disagreement then about underlying physical theory, and indeed many physical theories were deeply confused, and the notion of evidence for a physical theory was not well-formalized, but if you asked a hundred people questions like "is this a rock or a glass of milk?" you'd get the same answer from all of them (barring weirdness), and there were many physical realists nevertheless based solely on that, and this is not terribly surprising. Similarly, there is disagreement today about moral theory, and many moral theories are deeply confused, and the notion of evidence for a moral theory is not well-formalized, but if you ask a hundred people questions like "is killing an innocent person right or wrong?" you'll get the same answer from all of them (barring weirdness), so it ought not be surprising that there are many moral realists based on that.
I think there may be enough "weirdness" in response to moral questions that it would be irresponsible to treat it as dismissible.
Yes, there may well be.
Interesting. I have no idea if this is actually how moral realists think, but it does give me a handle so that I can imagine myself in that situation...
Sure, agreed. I suspect that actual moral realists think in lots of different ways. (Actual physical realists do, too.) But I find that starting with an existence-proof of "how might I believe something like this?" makes subsequent discussions easier.
I could add: Objective punishments and rewards need objective justification.
From my perspective, treating rationality as always instrumental, and never a terminal value is playing around with it's traditional meaning. (And indiscriminately teaching instrumental rationality is like indiscriminately handing out weapons. The traditional idea, going back to st least Plato, is that teaching someone to be rational improves them...changes their values)
Stuart, here is a defense of moral realism: My paper which you cited needs a bit of updating. Indeed some cases might lead a superintelligence to collaborate with agents without the right ethical mindset (unethical), which constitutes an important existential risk (a reason why I was a bit reluctant to publish much about it). However, isn't the orthogonality thesis basically about the orthogonality between ethics and intelligence? In that case, the convergence thesis is would not be flawed if some unintelligent agents kidnap and force an intelligent agent to act unethically.
Another argumentation for moral realism: 1. Let's imagine starting with a blank slate, the physical universe, and building ethical value in it. Hypothetically in a meta-ethical scenario of error theory (which I assume is where you're coming from), or possible variability of values, this kind of "bottom-up" reasoning would make sense for more intelligent agents that could alter their own values, so that they could find, from "bottom-up", values that could be more optimally produced, and also this kind of reasoning would make sense for them in order to fundamentally understand meta-ethics and the nature of value. 2. In order to connect to the production of some genuine ethical value in this universe, arguably some things would have to be built the same way, with certain conditions, while hypothetically others things could vary, in the value production chain. This is because ethical value could not be absolutely anything, otherwise those things could not be genuinely valuable. If all could be fundamentally valuable, then nothing would really be, because value requires a discrimination in terms of better and worse. Somewhere in the value production chain, some things would have to be constant in order for there to be genuine value. Do you agree so far? 3. If some things have to be constant in the value production chain, and some things could hypothetically vary, then the constant things would be the really important in creating value, and the variable things would be accessory, and could be randomly specified with some degree of freedom, by those that be analyzing value production from a "bottom-up" perspective in a physical universe. It would seem therefore that the constant things could likely be what is truly valuable, while the variable and accessory things could be mere triggers or engines in the value production chain. 4. I argue that, in the case of humans and of this universe, the constant things are what really constitute value. There is some constant a
How about morality as an attractor - which nature approaches. Some goals are better than others - evolution finds the best ones.
Why do we have any reason to think this is the case?
So: game theory: reciprocity, kin selection/tag-based cooperation and virtue signalling. As J. Storrs-Hall puts it in: "Intelligence Is Good" Defecting typically ostracises you - and doesn't make much sense in a smart society which can track repuations. We already know about universal instrumental values. They illustrate what moral attractors look like. I discussed this issue some more in Handicapped Superintelligence.
Doesn't most of this amount to morality as an attractor for evolved social species?
Evolution creates social species, though. Machines will be social too - their memetic relatedness might well be very high - an enormous win for kin selection-based theories based on shared memes. Of course they are evolving, and will evolve too - cultural evolution is still evolution.
So this presumes that the machines in question will evolve in social settings? That's a pretty big assumption. Moreover, empirically speaking having in-group loyalty of that sort isn't nearly enough to ensure that you are friendly with nearby entities- look at how many hunter-gatherer groups are in a state of almost constant war with their neighbors. The attitude towards other sentients (such as humans) isn't going to be great even if there is some approximate moral attractor of that sort.
I'm not sure what you mean. It presumes that there will be more than one machine. The 'lumpiness' of the universe is likely to produce natural boundaries. It seems to be a small assumption. Sure, but cultural evolution produces cooperation on a massive scale. Right - so: high morality seems to be reasonably compatible with some ant-squishing. The point here is about moral attractors - not the fate of humans.
It is a major assumption. To use the most obvious issue if someone is starting up an attempted AGI on a single computer (say it is the only machine that has enough power) then this won't happen. It also won't happen if one isn't having a large variety of machines which are actually engaging in generational copying. That means that say if one starts with ten slightly different machines, if the population doesn't grow in distinct entities this isn't going to do what you want. And if the entities lack a distinction between genotype and phenotype (as computer programs unlikely biological entities actually do) then this is also off because one will not be subject to a Darwinian system but rather a pseudo-Lamarckian one which doesn't act the same way. So your point seems to come down purely to the fact that evolved entities will do this, and a vague hope that people will deliberately put entities into this situation. This is both not helpful for the fundamental philosophical claim (which doesn't care about what empirically is likely to happen) and is not practically helpful since there's no good reason to think that any machine entities will actually be put into such a situation.
A multi-planetary living system is best described as being multiple agents, IMHO. The unity you suggest would represent relatedness approaching 1 - the ultimate win in terms of altruism and cooperation. Without copying there's no life. Copying is unavoidable. Variation is practically ineviable too - for instance, local adaptation. Computer programs do have the split between heredity and non heritble elements - which is the basic idea here, or it should be. Darwin believed in cultural evolution: "The survival or preservation of certain favoured words in the struggle for existence is natural selection" - so surely cultural evolution is Darwinian. Most of the game theory that underlies cooperation applies to both cultural and organic evolution. In particular, reciprocity, kin selection, and reputations apply in both domains. I didn't follow that bit - though I can see that it sounds a bit negative. Evolution has led to social, technological, intellectual and moral progress. It's conservative to expect these trends to continue.
Attractors are features of evolutionary systems, it'd be wierd if their weren't attractors in goal space. Here's a paper which touches on that (I don't necessarily buy all of it, but the part about morality as an attractor in goal systems of evolving cooperating game theoretic agents is interesting)
Sure. Think about the optimal creature - for instance - and don't anybody tell me that fitness is relative to the environment - we can see the environment. Another point is that - even if there's no competition (and natural selection) involving alien races, the fear of such competiton is likely produce a similar adaptive effect - moving effective values towards universal instrumental values.
You have made a number of posts on paraconsistent logic. Now it's time to walk the walk. For the purpose of this referee report, accept moral realism and use it explicitly to argue with your paper.
It's not that simple. I can't figure out what the proposition being defended is exactly. It shifts in ways I can't predict in the course of arguments and discussions. If I tried to defend it, my defence would end up being too caricatural or too weak.
Is your goal to affect their point of view? Or is it something else? For example, maybe your true target audience is those who donate to your organization and you just want to have a paper published to show them that they are not wasting their money. In any case, the paper should target your real audience, whatever it may be.
I want a paper to point those who make the thoughtless "the AI will be smart, so it'll be nice" argument to. I want a paper that forces the moral realists (using the term very broadly) to make specific counter arguments. I want to convince some of these people that AI is a risk, even if it's not conscious or rational according to their definitions. I want something to build on to move towards convincing the AGI researchers. And I want a publication.

All of these seem extraordinarily strong claims to make!

A critic might respond: they are strong claims to make about an arbitrarily chosen individual goal system, but asserting that there exists some goal system fulfilling the conditions is a massive disjunction, and so is weaker than it appears from the list of conditions.

How's about that: the general purpose problem solving is altogether a different problem from implementing any form of real world motivation, and is likely to come separate from it (case in point: try make AIXI maximize paperclips without it also searching for a way to show itself paperclip porn; the problem appears entirely non solvable).

It seems that for danger of the AI you need some peculiar window into which the orthogonality must fly - too much orthogonality, no risk, too little, no FAI/UFAI distinction.

You think it is in principle impossible to make (an implementation of) AIXI that understands the map/territory distinction, and values paperclips in the territory more than paper clips in the map? I may be misunderstanding the nature of AIXI, but as far as I know it's trying to maximize some "reward" number. If you program it so that the reward number is equal to "the number of paperclips in the territory as far as you know" it wouldn't choose to believe there were a lot of paperclips because that wouldn't increase its estimate (by its current belief-generating function) of the number of extant paperclips. Will someone who's read more on AIXI please tell me if I have it all backward? Thanks.

AIXI's "reward number" is given directly to it via an input channel, and it's non-trivial to change it so that it's equal to "the number of paperclips in the territory as far as you know". UDT can be seen as a step in this direction.

I don't see how UDT is a step in this direction. Can you explain?
3Wei Dai12y
UDT shows how an agent might be able to care about something other than an externally provided reward, namely how a computation, or a set of computations, turn out. It's conjectured that arbitrary goals, such as "maximize the number of paperclips across this distribution of possible worlds" (and our actual goals, whatever they may turn out to be) can be translated into such preferences over computations and then programmed into an AI, which will then take actions that we'd consider reasonable in pursue of such goals. (Note this is a simplification that ignores issues like preferences over uncomputable worlds, but hopefully gives you an idea what the "step" consists of.)
Any intelligent agent functioning in the real world is always ever limited to working with maps: internal information constructs which aim to represent/simulate the unknown external world. AIXI's definition (like any good formal mathematical agent definition), formalizes this distinction. AIXI assumes the universe is governed by some computable program, but it does not have direct access to that program, so instead it must create an internal simulation based on its observation history. AIXI could potentially understand the "map/territory distinction", but it could no more directly value or access objects in the territory than your or I. Just like us, and any other real world agents, AIXI can only work with it's map. All that being said, humans can build maps which at least attempt to distinguish between objects in the world, simulations of objects in simulated worlds, simulations of worlds in simulated worlds, and so on, and AIXI potentially could build such maps as well.
You need to somehow specify a conversion from the real world state (quarks, leptops, etc etc) to a number of paperclips, so that the paperclips can be ordered differently, or have slightly different compositions. That conversion is essentially a map. You do not want goal to distinguish between '1000 paperclips that are lying in a box in this specific configuration' and '1000 paperclips that are lying in a box in that specific configuration'. There isn't such discriminator in the territory. There is only in your mapping process. I'm feeling that much of the reasoning here is driven by verbal confusion. To understand the map-territory issue, is to understand the above. But to understand also has the meaning as in 'understand how to drive a car', with the implied sense that understanding of map territory distinction would somehow make you not be constrained by associated problems.
Indeed. The problem of making sure that you are maximizing the real entity you want to maximize , and not a proxy is roughly equivalent to the disproving solipsism, which, itself,is widely regarded as almost impossible,by philosophers. Realists tend to assume their way out of the quandary...but assumption isn't proof. In other words, there is no proof that humans are maximizing (good stuff) , and not just (good stuff porn)

Chess computer remark: am happy to be credited as "Paul Crowley ". Thanks!

Or did you want to be acknowledged just next to the quote, as well?
You already are (see acknowledgements) :-)
0Paul Crowley12y
Ah didn't see that - was posting from phone! Because it credited "an online commentator" I thought maybe the attribution had been lost, or you didn't have my real name and couldn't credit "ciphergoth" in a natural way. Do whatever results in the best paper :) thanks!

Strong orthogonality hypothesis is definitely wrong - not being openly hostile to most other agents has enormous instrumental advantage. That's what's holding modern human societies together - agents like humans, corporations, states etc. - have mostly managed to keep their hostility low. Those that are particularly belligerent (and historical median has been far more belligerent towards strangers than all but the most extreme cases today) don't do well by instrumental standards at all.

Of course you can make a complicated argument why it doesn't matter (so... (read more)

I actually think this "complicated argument", either made or refuted, is the core of this orthogonality business. If you ask the question "Okay, now that we've made a really powerful AI somehow, should we check if it's Friendly before giving it control over the world?" then you can't answer it just based on what you think the AI would do in a position roughly equal to humans. Of course, you can just argue that this doesn't matter because we're unlikely to face really powerful AIs at all. But that's also complicated. If the orthogonality thesis is truly wrong, on the other hand, then the answer to the question above is "Of course, let's give the AI control over the world, it's not going to hurt humans and in the best case it might help us."

It's so much easier to just change your moral reasoning than than to reingineer the entirety of human intelligence. How can artificial intelligence experts be so daft?

This one is actually true.