General purpose intelligence: arguing the Orthogonality thesis

Note: informally, the point of this paper is to argue against the instinctive "if the AI were so smart, it would figure out the right morality and everything will be fine." It is targeted mainly at philosophers, not at AI programmers. The paper succeeds if it forces proponents of that position to put forward positive arguments, rather than just assuming it as the default position. This post is presented as an academic paper, and will hopefully be published, so any comments and advice are welcome, including stylistic ones! Also let me know if I've forgotten you in the acknowledgements.


Abstract: In his paper “The Superintelligent Will”, Nick Bostrom formalised the Orthogonality thesis: the idea that the final goals and intelligence levels of agents are independent of each other. This paper presents arguments for a (slightly narrower) version of the thesis, proceeding through three steps. First it shows that superintelligent agents with essentially arbitrary goals can exist. Then it argues that if humans are capable of building human-level artificial intelligences, we can build them with any goal. Finally it shows that the same result holds for any superintelligent agent we could directly or indirectly build. This result is relevant for arguments about the potential motivations of future agents.

 

1 The Orthogonality thesis

The Orthogonality thesis, due to Nick Bostrom (Bostrom, 2011), states that:

  • Intelligence and final goals are orthogonal axes along which possible agents can freely vary: more or less any level of intelligence could in principle be combined with more or less any final goal.

It is analogous to Hume’s thesis about the independence of reason and morality (Hume, 1739), but applied more narrowly, using the normatively thinner concepts ‘intelligence’ and ‘final goals’ rather than ‘reason’ and ‘morality’.

But even ‘intelligence’, as generally used, has too many connotations. A better term would be efficiency, or instrumental rationality, or the ability to effectively solve problems given limited knowledge and resources (Wang, 2011). Nevertheless, we will be sticking with terminology such as ‘intelligent agent’, ‘artificial intelligence’ or ‘superintelligence’, as they are well established, but using them synonymously with ‘efficient agent’, ‘artificial efficiency’ and ‘superefficient algorithm’. The relevant criterion is whether the agent can effectively achieve its goals in general situations, not whether its inner process matches up with a particular definition of what intelligence is.

Thus an artificial intelligence (AI) is an artificial algorithm, deterministic or probabilistic, implemented on some device, that demonstrates an ability to achieve goals in varied and general situations[1]. We don’t assume that it need be a computer program, or a well laid-out algorithm with clear loops and structures – artificial neural networks or evolved genetic algorithms certainly qualify.

A human level AI is defined to be an AI that can successfully accomplish any task at least as well as an average human would (to avoid worrying about robot bodies and such-like, we may restrict the list of tasks to those accomplishable over the internet). Thus we would expect the AI to hold conversations about Paris Hilton’s sex life, to compose ironic limericks, to shop for the best deal on Halloween costumes and to debate the proper role of religion in politics, at least as well as an average human would.

A superhuman AI is similarly defined as an AI that would exceed the ability of the best human in all (or almost all) tasks. It would do the best research, write the most successful novels, run companies and motivate employees better than anyone else. In areas where there may not be clear scales (what’s the world’s best artwork?) we would expect a majority of the human population to agree the AI’s work is among the very best.

Nick Bostrom’s paper argued that the Orthogonality thesis does not depend on the Humean theory of motivation. This paper will directly present arguments in its favour. We will assume throughout that human level AIs (or at least human comparable AIs) are possible (if not, the thesis is void of useful content). We will also take the materialistic position that humans themselves can be viewed as non-deterministic algorithms[2]: this is not vital to the paper, but is useful for comparison of goals between various types of agents. We will do the same with entities such as committees of humans, institutions or corporations, if these can be considered to be acting in an agent-like way.

1.1 Qualifying the Orthogonality thesis

The Orthogonality thesis, taken literally, is false. Some motivations are mathematically incompatible with changes in intelligence (“I want to prove the Gödel statement for the being I would be if I were more intelligent”). Some goals specifically refer to the intelligence of the agent, directly (“I want to be an idiot!”) or indirectly (“I want to impress people who want me to be an idiot!”). Though we could make a case that an agent wanting to be an idiot could initially be of any intelligence level, it won’t stay there long, and it’s hard to see how an agent with that goal could have become intelligent in the first place. So we will exclude from consideration those goals that intrinsically refer to the intelligence level of the agent.

We will also exclude goals that are so complex or hard to describe that the complexity of the goal becomes crippling for the agent. If the agent’s goal takes five planets worth of material to describe, or if it takes the agent five years each time it checks its goal, it’s obvious that that agent can’t function as an intelligent being on any reasonable scale.

Further we will not try to show that intelligence and final goals can vary freely, in any dynamical sense (it could be quite hard to define this variation). Instead we will look at the thesis as talking about possible states: that there exist agents of all levels of intelligence with any given goals. Since it’s always possible to make an agent stupider or less efficient, what we are really claiming is that there exist high-intelligence agents with any given goal. Thus the restricted Orthogonality thesis that we will be discussing is:

  • High-intelligence agents can exist having more or less any final goals (as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence).

2 Orthogonality for theoretic agents

If we were to step back for a moment and consider, in our mind’s eye, the space of every possible algorithm, peering into their goal systems and teasing out some measure of their relative intelligences, would we expect the Orthogonality thesis to hold? Since we are not worrying about practicality or constructability, all that we would require is that for any given goal system, there exists a theoretically implementable algorithm of extremely high intelligence.

At this level of abstraction, we can consider any goal to be equivalent to maximising a utility function. It is generally not that hard to translate given goals into utilities (many deontological systems are equivalent to maximising the expected utility of a function that gives 1 if the agent always makes the correct decision and 0 otherwise), and any agent making a finite number of decisions can always be seen as maximising a certain utility function.
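The translation from a deontological rule to a 0/1 utility function can be made concrete in a few lines. The sketch below is purely illustrative; the rule and all names are invented for the example:

```python
# Illustrative sketch: turning a deontological decision rule into a
# utility function, as described in the text. All names are hypothetical.

def as_utility(is_correct_decision):
    """Wrap a rule 'this decision is correct/incorrect' into a utility:
    1 if every decision in the history was correct, 0 otherwise."""
    def utility(decision_history):
        return 1 if all(is_correct_decision(d) for d in decision_history) else 0
    return utility

# A toy rule: never choose the action labelled 'lie'.
never_lie = lambda decision: decision != "lie"
u = as_utility(never_lie)

print(u(["help", "trade"]))  # 1: every decision permitted
print(u(["help", "lie"]))    # 0: one forbidden decision
```

An agent maximising the expected value of `u` behaves exactly as the original rule demands, which is the sense in which the two formulations are interchangeable.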

For utility function maximisers, AIXI is the theoretically best agent there is, more successful at reaching its goals (up to a finite constant) than any other agent (Hutter, 2005). AIXI itself is incomputable, but there are computable variants such as AIXItl or Gödel machines (Schmidhuber, 2007) that accomplish comparable levels of efficiency. These methods work for whatever utility function is plugged into them. Thus in the extreme theoretical case, the Orthogonality thesis seems trivially true.

There is only one problem with these agents: they require incredibly large amounts of computing resources to work. Let us step down from the theoretical pinnacle and require that these agents could actually exist in our world (while still not requiring that we be able or likely to build them).

An interesting thought experiment occurs here. We could imagine an AIXI-like super-agent, with all its resources, that is tasked to design and train an AI that could exist in our world, and that would accomplish the super-agent’s goals. Using its own vast intelligence, the super-agent would therefore design a constrained agent maximally effective at accomplishing those goals in our world. Then this agent would be the high-intelligence real-world agent we are looking for. It doesn’t matter that this is a thought experiment – if the super-agent can succeed in the thought experiment, then the trained AI can exist in our world.

This argument generalises to other ways of producing the AI. Thus to deny the Orthogonality thesis is to assert that there is a goal system G, such that, among other things:

  1. There cannot exist any efficient real-world algorithm with goal G.
  2. If a being with arbitrarily high resources, intelligence, time and goal G were to try to design an efficient real-world algorithm with the same goal, it must fail.
  3. If a human society were highly motivated to design an efficient real-world algorithm with goal G, and were given a million years to do so along with huge amounts of resources, training and knowledge about AI, it must fail.
  4. If a high-resource human society were highly motivated to achieve the goals of G, then it could not do so (here the human society is seen as the algorithm).
  5. Same as above, for any hypothetical alien societies.
  6. There cannot exist any pattern of reinforcement learning that would train a highly efficient real-world intelligence to follow the goal G.
  7. There cannot exist any evolutionary or environmental pressures that would evolve highly efficient real-world intelligences to follow goal G.

All of these seem extraordinarily strong claims to make! The later claims all derive from the first, and merely serve to illustrate how strong the first claim actually is. Thus until such time as someone comes up with such a G, and strong arguments for why it must fulfil these conditions, we can consider the Orthogonality thesis established in the theoretical case.

 

3 Orthogonality for human-level AIs

Of course, even if efficient agents could exist for all these goals, that doesn’t mean that we could ever build them, even if we could build AIs. In this section, we’ll look at the grounds for assuming the Orthogonality thesis holds for human-level agents. Since intelligence isn’t varying much here, the thesis becomes simply:

  • If we could construct human-level AIs at all, we could construct human-level AIs with more or less any final goals (as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence).

So, is this true? The arguments in this section are generally independent of each other, and can be summarised as:

  1. Some possible AI designs have orthogonality built right into them.
  2. AI goals can reach the span of human goals, which is larger than it seems.
  3. Algorithms can be combined to generate an AI with any easily checkable goal.
  4. Various algorithmic modifications can be used to further expand the space of possible goals, if needed.

3.1 Utility functions

The utility function picture of a rational agent maps perfectly onto the Orthogonality thesis: here we have the goal structure, the utility function, packaged neatly and separately from the intelligence module (whatever part of the machine calculates which actions maximise expected utility). Demonstrating the Orthogonality thesis is as simple as saying that the utility function can be replaced with another. However, many putative agent designs are not utility function based, such as neural networks, genetic algorithms, or humans. Nor do we have the extreme calculating ability that we had in the purely theoretic case to transform any goals into utility functions. So from now on we will consider that our agents are not expected utility maximisers with clear and separate utility functions.
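As a toy illustration of the separation between goal structure and intelligence module, here is a minimal sketch in which the same generic search machinery is combined with two different utility functions. The toy world and every name are invented for the example:

```python
# A minimal sketch of the 'orthogonality built in' design: the search
# machinery is written once, and the utility function is a swappable
# parameter. Everything here is illustrative, not a real agent design.

def best_action(actions, outcome_of, utility):
    """Generic 'intelligence module': pick the action whose predicted
    outcome maximises the supplied utility function."""
    return max(actions, key=lambda a: utility(outcome_of(a)))

# A toy world: actions map to (paperclips, lives_saved) outcomes.
outcomes = {"build_factory": (100, 0), "fund_clinic": (0, 10)}
actions = list(outcomes)
outcome_of = outcomes.get

# The same module, two different final goals:
paperclip_utility = lambda o: o[0]
humanitarian_utility = lambda o: o[1]

print(best_action(actions, outcome_of, paperclip_utility))     # build_factory
print(best_action(actions, outcome_of, humanitarian_utility))  # fund_clinic
```

Swapping the goal changes the behaviour without touching the search code at all, which is the sense in which orthogonality is built into this design.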

3.2 The span of human motivations

It seems a reasonable assumption that if there exists a human being with particular goals, then we can construct a human-level AI with similar goals. This is immediately the case if the AI was a whole brain emulation/upload (Sandberg & Bostrom, 2008), a digital copy of a specific human mind. Even for more general agents, such as evolved agents, this remains a reasonable thesis. For a start, we know that real-world evolution has produced us, so constructing human-like agents that way is certainly possible. Human minds remain our only real model of general intelligence, and this strongly directs and informs our AI designs, which are likely to be as human-similar as we can make them. Similarly, human goals are the easiest goals for us to understand, hence the easiest to try to implement in an AI. Hence it seems likely that we could implement most human goals in the first generation of human-level AIs.

So how wide is the space of human motivations[3]? Our race spans foot-fetishists, religious saints, serial killers, instinctive accountants, role-players, self-cannibals, firefighters and conceptual artists. The autistic, those with exceptional social skills, the obsessive compulsive and some with split-brains. Beings of great empathy and the many who used to enjoy torture and executions as public spectacles[4]. It is evident that the space of possible human motivations is vast[5]. For any desire, any particular goal, no matter how niche[6], pathological, bizarre or extreme, as long as there is a single human who ever had it, we could build and run an AI with the same goal.

But with AIs we can go even further. We could take any of these goals as a starting point, make them malleable (as goals are in humans), and push them further out. We could provide the AIs with specific reinforcements to push their goals in extreme directions (reward the saint for ever-more saintly behaviour). If the agents are fast enough, we could run whole societies of them with huge varieties of evolutionary or social pressures, to further explore the goal-space.

We may also be able to do surgery directly on their goals, to introduce yet more variety. For example, we could take a dedicated utilitarian charity worker obsessed with saving lives in poorer countries (but who doesn’t interact, or want to interact, directly with those saved), and replace ‘saving lives’ with ‘maximising paperclips’ or any similar abstract goal. This is more speculative, of course – but there are other ways of getting similar results.

3.3 Interim goals as terminal goals

If someone were to hold a gun to your head, they could make you do almost anything. Certainly there are people who, with a gun at their head, would be willing to do almost anything. A distinction is generally made between interim goals and terminal goals, with the former being seen as simply paths to the latter, and interchangeable with other plausible paths. The gun to your head disrupts the balance: your terminal goal is simply not to get shot, while your interim goals become what the gun holder wants them to be, and you put a great amount of effort into accomplishing the minute details of these interim goals. Note that the gun has not changed your level of intelligence or ability.

This is relevant because interim goals seem to be far more varied in humans than terminal goals. One can have interim goals of filing papers, solving equations, walking dogs, making money, pushing buttons in various sequences, opening doors, enhancing shareholder value, assembling cars, bombing villages or putting sharks into tanks. Or simply doing whatever the guy with the gun at our head orders us to do. If we could accept human interim goals as AI terminal goals, we would extend the space of goals quite dramatically.

To do so, we would want to put the threatened agent and the gun wielder together into the same AI. Algorithmically there is nothing extraordinary about this: certain subroutines have certain behaviours depending on the outputs of other subroutines. The ‘gun wielder’ need not be particularly intelligent: it simply needs to be able to establish whether its goals are being met. If for instance those goals are given by a utility function, then all that is required is an automated system that measures progress towards increasing utility and punishes (or erases) the rest of the AI if there is none. The ‘rest of the AI’ is just required to be a human-level AI which would be susceptible to this kind of pressure. Note that we do not require that it even be close to human in any way, simply that it place the highest value on self-preservation (or on some similar small goal that the ‘gun wielder’ would have power over).
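A minimal sketch of this construction might look as follows, with a deliberately unintelligent ‘gun wielder’ monitoring a capable subagent. Everything here is a toy assumption for illustration, not a design proposal:

```python
# Toy sketch of the 'gun wielder' construction: a simple checker
# subroutine monitors progress on goal G and threatens erasure; the
# capable subagent, which only values its own survival, does the work.

class GunWielder:
    """Unintelligent monitor: only needs to check whether G is advancing."""
    def __init__(self, measure_progress):
        self.measure = measure_progress
        self.last = float("-inf")

    def check(self, world_state):
        progress = self.measure(world_state)
        ok = progress > self.last  # demand strict improvement
        self.last = progress
        return ok                  # False => punish / erase the subagent

class Subagent:
    """Stands in for a capable human-level AI that values survival."""
    def act(self, world_state):
        # A real subagent would plan cleverly; here it just advances G.
        world_state["paperclips"] += 1
        return world_state

monitor = GunWielder(lambda s: s["paperclips"])
agent = Subagent()
state = {"paperclips": 0}
for _ in range(3):
    state = agent.act(state)
    assert monitor.check(state)  # subagent survives by serving goal G
print(state["paperclips"])  # 3
```

The point of the sketch is only the wiring: the monitor's goal G need not be shared by the subagent, yet the combined algorithm pursues G.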

For humans, another similar model is that of a job in a corporation or bureaucracy: in order to obtain the money required for their terminal goals, some humans are willing to perform extreme tasks (organising the logistics of genocides, designing weapons, writing long detailed press releases they don’t agree with at all). Again, if the corporation-employee relationship can be captured in a single algorithm, this would generate an intelligent AI whose goal is anything measurable by the ‘corporation’. The ‘money’ could simply be an internal reward channel, perfectly aligning the incentives.

If the subagent is anything like a human, they would quickly integrate the other goals into their own motivation[7], removing the need for the gun wielder/corporation part of the algorithm.

3.4 Noise, anti-agents and goal combination

There are further ways of extending the space of goals we could implement in human-level AIs. One simple way is to introduce noise: flip a few bits and subroutines, add bugs, and get a new agent. Of course, this is likely to cause the agent’s intelligence to decrease somewhat, but we have generated new goals. Then, if appropriate, we could use evolution or other improvements to raise the agent’s intelligence again; this will likely undo some, but not all, of the effect of the noise. Or we could use some of the tricks above to make a smarter agent implement the goals of the noise-modified agent.

A more extreme example would be to create an anti-agent: an agent whose single goal is to stymie the plans and goals of a single given agent. This already happens with vengeful humans, and we would just need to dial it up: have an anti-agent that would do all it can to counter the goals of a given agent, even if that agent doesn’t exist (“I don’t care that you’re dead, I’m still going to despoil your country, because that’s what you would have wanted me not to do”). This further extends the space of possible goals.

Different agents with different goals can also be combined into a single algorithm. With some algorithmic method for the AIs to negotiate their combined objective and balance the relative importance of their goals, this procedure would construct a single AI with a combined goal system. There would likely be no drop in intelligence/efficiency: committees of two can work very well towards their common goals, especially if there is some automatic penalty for disagreements.
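One simple way the negotiation could settle, purely for illustration, is a fixed weighted sum of the two utility functions; real negotiation mechanisms could of course be far more elaborate, and all names below are invented:

```python
# Illustrative sketch of goal combination: two agents' utility functions
# are merged by a negotiated weighting, yielding a single agent with a
# combined goal system. The 50/50 weighting is an arbitrary assumption.

def combine(utilities, weights):
    """Merge several utility functions into one by weighted sum."""
    def combined(outcome):
        return sum(w * u(outcome) for u, w in zip(utilities, weights))
    return combined

u_staples = lambda o: o["staples"]
u_clips = lambda o: o["clips"]
u_joint = combine([u_staples, u_clips], [0.5, 0.5])

outcome_a = {"staples": 10, "clips": 0}  # good for one member only
outcome_b = {"staples": 4, "clips": 8}   # a compromise outcome
print(u_joint(outcome_a))  # 5.0
print(u_joint(outcome_b))  # 6.0: the 'committee' prefers the compromise
```

Any such merging rule produces a goal system that neither original agent had, which is how combination extends the space of possible goals.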

3.5 Further tricks up the sleeve

This section started by emphasising the wide space of human goals, and then introduced tricks to push goal systems further beyond these boundaries. The list isn’t exhaustive: there are surely more devices and ideas one can use to continue to extend the space of possible goals for human-level AIs. Though this might not be enough to get every goal, we can nearly certainly use these procedures to construct a human-level AI with any human-comprehensible goal. But would the same be true for superhuman AIs?

 

4 Orthogonality for superhuman AIs

We now come to the area where the Orthogonality thesis seems the most vulnerable. It is one thing to have human-level AIs, or abstract superintelligent algorithms created ex nihilo, with certain goals. But if ever the human race were to design a superintelligent AI, there would be some sort of process involved – directed evolution, recursive self-improvement (Yudkowsky, 2001), design by a committee of AIs, or similar – and it seems at least possible that such a process could fail to fully explore the goal-space. We define the Orthogonality thesis in this context as:

  • If we could construct superintelligent AIs at all, we could construct superintelligent AIs with more or less any goals (as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence).

There are two counter-theses. The weakest claim is:

  • Incompleteness: there are some goals that no superintelligence designed by us could have.

A stronger claim is:

  • Convergence: all human-designed superintelligences would have one of a small set of goals.

They should be distinguished; Incompleteness is all that is needed to contradict Orthogonality, but Convergence is often the issue being discussed. Often convergence is assumed to be to some particular model of metaethics (Müller, 2012).

4.1 No convergence

The plausibility of the convergence thesis is highly connected with the connotations of the terms used in it. “All human-designed rational beings would follow the same morality (or one of a small set of moralities)” sounds plausible; in contrast, “All human-designed superefficient algorithms would accomplish the same task” seems ridiculous. To quote an online commentator: how good at playing chess would a chess computer have to be before it started feeding the hungry?

Similarly, if there were such a convergence, then any self-improving or constructed superintelligence must fall prey to it, even if it were actively seeking to avoid it. After all, the lower-level AIs or the AI designers have certain goals in mind (as we’ve seen in the previous section, potentially any goals in mind). Obviously, they would be less likely to achieve their goals if these goals were to change (Omohundro, 2008) (Bostrom, 2012). The same goes if the superintelligent AI they designed didn’t share these goals. Hence the AI designers will be actively trying to prevent such a convergence, if they suspected that one was likely to happen. If for instance their goals were immoral, they would program their AI not to care about morality; they would use every trick up their sleeves to prevent the AI’s goals from drifting from their own.

So the convergence thesis requires that for the vast majority of goals G:

  1. It is possible for a superintelligence to exist with goal G (by section 2).
  2. There exists an entity with goal G (by section 3), capable of building a superintelligent AI.
  3. Yet any attempt of that entity to build a superintelligent AI with goal G will be a failure, and the superintelligence’s goals will converge on some other goal.
  4. This is true even if the entity is aware of the convergence and explicitly attempts to avoid it.

This makes the convergence thesis very unlikely. The argument also works against the incompleteness thesis, but in a weaker fashion: it seems more plausible that some goals would be unreachable, despite being theoretically possible, rather than most goals being unreachable because of convergence to a small set.

There is another interesting aspect of the convergence thesis: it postulates that certain goals G will emerge, without them being aimed for or desired. If one accepts that goals aimed for will not be reached, one has to ask why convergence is assumed: why not divergence? Why not assume that though G is aimed for, random accidents or faulty implementation will lead to the AI ending up with one of a much wider array of possible goals, rather than a much narrower one?

4.2 Oracles show the way

If the Orthogonality thesis is wrong, then Oracles must be impossible to build. An Oracle is a superintelligent AI that accurately answers questions about the world (Armstrong, Sandberg, & Bostrom, 2011). This includes hypothetical questions about the future, which means that we can produce a superintelligent AI with goal G by wiring a human-level AI with goal G to an Oracle: the human-level AI will go through possible actions, have the Oracle check the outcomes, and choose the one that best accomplishes G.
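The wiring just described can be sketched as follows. The Oracle here is of course a stub standing in for a hypothetical superintelligent predictor; only the control flow is the point, and every name is invented for the example:

```python
# Sketch of the Oracle construction: a modest planner with goal G
# enumerates candidate actions, asks the Oracle to predict each outcome,
# and chooses the best. The Oracle is a toy stand-in.

def oracle_predict(action):
    """Stand-in for the Oracle's answer to 'what happens if I do X?'"""
    toy_world = {"sell": 3, "hold": 1, "invest": 7}
    return toy_world[action]

def plan_with_oracle(actions, goal_value, predict):
    """Human-level planner: delegate all prediction to the Oracle."""
    return max(actions, key=lambda a: goal_value(predict(a)))

best = plan_with_oracle(["sell", "hold", "invest"], lambda v: v, oracle_predict)
print(best)  # invest
```

All the intelligence lives in the predictor; the goal lives entirely in the cheap outer loop, which is why a working Oracle would deliver the Orthogonality thesis directly.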

What makes the “no Oracle” position even more counterintuitive is that any superintelligence must be able to look ahead, design actions, predict the consequences of its actions, and choose the best one available. But the convergence thesis implies that this general skill is one that we can make available only to AIs with certain specific goals. Though the agents with those narrow goals are capable of doing these predictions, they automatically lose this ability if their goals were to change.

4.3 Tricking the controller

Just as with human-level AIs, one could construct a superintelligent AI by wedding a superintelligence to a large committee of human-level AIs dedicated to implementing a goal G and checking the superintelligence’s actions. Thus to deny the Orthogonality thesis requires that one believes that the superintelligence is always capable of tricking this committee, no matter how detailed and thorough their oversight.

This argument extends the Orthogonality thesis to moderately superintelligent AIs, or to any situation where there is a diminishing return to intelligence. It only fails if we take the AI to be fantastically superhuman: capable of tricking or seducing any collection of human-level beings.

4.4 Temporary fragments of algorithms, fictional worlds and extra tricks

There are other tricks that can be used to create an AI with any goals. For any superintelligent AI, there are certain inputs that will make it behave in certain ways. For instance, a human-loving moral AI could be compelled to follow most goals G for a day, if it were rewarded with something sufficiently positive afterwards. But its actions for that one day are the result of a series of inputs to a particular algorithm; if we turned off the AI after that day, we would have accomplished moves towards goal G without having to reward its “true” goals at all. And then we could continue the trick the next day with another copy.

For this to fail, it has to be the case that we can create an algorithm which will perform certain actions on certain inputs as long as it isn’t turned off afterwards, but that we cannot create an algorithm that does the same thing if it was to be turned off.

Another alternative is to create a superintelligent AI that has goals in a fictional world (such as a game or a reward channel) over which we have control. Then we could trade interventions in the fictional world against advice in the real world towards whichever goals we desire.

These two arguments may feel weaker than the ones before: they are tricks that may or may not work, depending on the details of the AI’s setup. But to deny the Orthogonality thesis requires not only denying that these tricks would ever work, but denying that any tricks or methods that we (or any human-level AIs) could think up, would ever work at controlling the AIs. We need to assume superintelligent AIs cannot be controlled.

4.5 In summary

Denying the Orthogonality thesis thus requires that:

  1. There are goals G, such that an entity with goal G cannot build a superintelligence with the same goal. This is despite the fact that the entity can build a superintelligence, and that a superintelligence with goal G can exist.
  2. Goal G cannot arise accidentally from some other origin, and errors and ambiguities do not significantly broaden the space of possible goals.
  3. Oracles and general purpose planners cannot be built. Superintelligent AIs cannot have their planning abilities repurposed.
  4. A superintelligence will always be able to trick its controllers, and there is no way the controllers can set up a reasonable system of control.
  5. Though we can create an algorithm that does certain actions if it was not to be turned off after, we cannot create an algorithm that does the same thing if it was to be turned off after.
  6. An AI will always come to care intrinsically about things in the real world.
  7. No tricks can be thought up to successfully constrain the AI’s goals: superintelligent AIs cannot be controlled.

 

5 Bayesian Orthogonality thesis

All the previous sections concern hypotheticals, but of different kinds. Section 2 touches upon what kinds of algorithm could theoretically exist. But sections 3 and 4 concern algorithms that could be constructed by humans (or from AIs originally constructed by humans): they refer to the future. As AI research advances, and certain approaches or groups start to gain or lose prominence, we’ll start getting a better idea of how such an AI will emerge.

Thus the orthogonality thesis will narrow as we achieve better understanding of how AIs would work in practice, of what tasks they will be put to and of what requirements their designers will desire. Most importantly of all, we will get more information on the critical question as to whether the designers will actually be able to implement their desired goals in an AI. On the eve of creating the first AIs (and then the first superintelligent AIs), the Orthogonality thesis will likely have pretty much collapsed: yes, we could in theory construct an AI with any goal, but at that point, the most likely outcome is an AI with particular goals – either the goals desired by their designers, or specific undesired goals and error modes.

However, until that time arises, because we do not know any of this information currently, we remain in the grip of a Bayesian version of the Orthogonality thesis:

  • As far as we know now (and as far as we’ll know until we start building AIs), if we could construct superintelligent AIs at all, we could construct superintelligent AIs with more or less any goals (as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence).

 

6 Conclusion

It is not enough to know that an agent is intelligent (or superintelligent). If we want to know something about its final goals, about the actions it will be willing to undertake to achieve them, and hence its ultimate impact on the world, there are no shortcuts. We have to directly figure out what these goals are, and cannot rely on the agent being moral just because it is superintelligent/superefficient.

 

7 Acknowledgements

It gives me great pleasure to acknowledge the help and support of Anders Sandberg, Nick Bostrom, Toby Ord, Owain Evans, Daniel Dewey, Eliezer Yudkowsky, Vladimir Slepnev, Viliam Bur, Matt Freeman, Wei Dai, Will Newsome, Paul Crowley, Alexander Kruel and Rasmus Eide, as well as those members of the Less Wrong online community going by the names shminux, Larks and Dmytry.

 

8 Bibliography

Armstrong, S., Sandberg, A., & Bostrom, N. (2011). Thinking Inside the Box: Controlling and Using an Oracle AI. Forthcoming in Minds and Machines.

Bostrom, N. (2012). Superintelligence: Groundwork to a Strategic Analysis of the Machine Intelligence Revolution. To be published.

Bostrom, N. (2011). The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Forthcoming in Minds and Machines.

de Fabrique, N., Romano, S. J., Vecchi, G. M., & van Hasselt, V. B. (2007). Understanding Stockholm Syndrome. FBI Law Enforcement Bulletin (Law Enforcement Communication Unit), 76(7), 10-15.

Hume, D. (1739). A Treatise of Human Nature. 

Hutter, M. (2005). Universal algorithmic intelligence: A mathematical top-down approach. In B. Goertzel, & C. Pennachin (Eds.), Artificial General Intelligence. Springer-Verlag.

Müller, J. (2012). Ethics, risks and opportunities of superintelligences. Retrieved May 2012, from http://www.jonatasmuller.com/superintelligences.pdf

Omohundro, S. M. (2008). The Basic AI Drives. In P. Wang, B. Goertzel, & S. Franklin (Eds.), Artificial General Intelligence: Proceedings of the First AGI Conference (Vol. 171).

Sandberg, A., & Bostrom, N. (2008). Whole brain emulation: A roadmap. Future of Humanity Institute Technical report , 2008-3.

Schmidhuber, J. (2007). Gödel machines: Fully self-referential optimal universal self-improvers. In Artificial General Intelligence. Springer.

Wang, P. (2011). The assumptions on knowledge and resources in models of rationality. International Journal of Machine Consciousness , 3 (1), 193-218.

Yudkowsky, E. (2001). General Intelligence and Seed AI 2.3. Retrieved from Singularity Institute for Artificial Intelligence: http://singinst.org/ourresearch/publications/GISAI/

Footnotes

[1] We need to assume it has goals, of course. Determining whether something qualifies as a goal-based agent is very tricky (researcher Owain Evans is trying to establish a rigorous definition), but this paper will adopt the somewhat informal definition that an agent has goals if it achieves similar outcomes from very different starting positions. If the agent ends up making ice cream in any circumstances, we can assume ice creams are in its goals.
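The footnote's operational test can be sketched as a toy check (a hypothetical gridworld; all names here are illustrative, not from the paper): an agent counts as goal-directed if very different starting positions converge to the same outcome, while a random walker's outcomes scatter.

```python
import random

# Toy gridworld check of the footnote's test: an agent "has a goal" if it
# reaches similar outcomes from very different starting positions.
TARGET = (5, 5)  # hypothetical goal cell

def goal_seeker(pos):
    """Steps greedily toward TARGET, one cell per axis per step."""
    x, y = pos
    x += (x < TARGET[0]) - (x > TARGET[0])
    y += (y < TARGET[1]) - (y > TARGET[1])
    return (x, y)

def random_walker(pos):
    """Drifts aimlessly; no outcome convergence expected."""
    x, y = pos
    return (x + random.choice([-1, 0, 1]), y + random.choice([-1, 0, 1]))

def final_position(agent, start, steps=50):
    pos = start
    for _ in range(steps):
        pos = agent(pos)
    return pos

random.seed(0)
starts = [(random.randrange(10), random.randrange(10)) for _ in range(20)]

# The goal-seeker passes the footnote's test: every start yields the same outcome.
seeker_outcomes = {final_position(goal_seeker, s) for s in starts}

# The random walker fails it: outcomes scatter.
walker_outcomes = {final_position(random_walker, s) for s in starts}
```

Under this test the ice-cream maker in the footnote is goal-directed for the same reason the greedy agent is: the outcome is insensitive to the starting position.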

[2] Every law of nature being algorithmic (allowing probabilistic processes with known odds), and no exceptions to these laws being known.

[3] One could argue that we should consider the space of general animal intelligences – octopuses, supercolonies of social insects, and so on. But we won't pursue this here; the methods described can already produce behaviours like these.

[4] Even today, many people have had great fun torturing and abusing their characters in games like “the Sims” (http://meodia.com/article/281/sadistic-ways-people-torture-their-sims/). The same urges are present, albeit diverted to fictionalised settings. Indeed games offer a wide variety of different goals that could conceivably be imported into an AI if it were possible to erase the reality/fiction distinction in its motivation.

[5] As can be shown by a glance through a biography of famous people – and famous means they were generally allowed to rise to prominence in their own society, so the space of possible motivations was already cut down.

[6] Of course, if we built an AI with that goal and copied it millions of times, it would no longer be niche.

[7] Such as the hostages suffering from Stockholm syndrome (de Fabrique, Romano, Vecchi, & van Hasselt, 2007).

156 comments

For utility function maximisers, the AIXI is the theoretically best agent there is, more successful at reaching its goals (up to a finite constant) than any other agent (Hutter, 2005).

False. AIXI as defined can maximize only a sensory reward channel, not a utility function over an environmental model with a known ontology. As Dewey demonstrates, this problem is not easy to fix; AIXI can have utility functions over (functions of) sensory data, but its environment-predictors vary freely in ontology via Solomonoff induction, so it can't have a predefined utility function over the future of its environment without major rewriting.

AIXI is the optimal function-of-sense-data maximizer for Cartesian agents with unbounded computing power and access to a halting oracle, in a computable environment as separated from AIXI by the Cartesian boundary, given that your prior belief about the possible environments matches AIXI's Solomonoff prior.
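As background for this correction, here is AIXI's action rule (after Hutter 2005; the notation is lightly simplified, so treat this as a sketch rather than Hutter's exact statement). Note that the rewards being summed arrive inside the percept stream:

```latex
% AIXI at cycle k with horizon m: expectimax over future actions and
% percepts o_i r_i, weighted by the Solomonoff prior 2^{-l(q)} over all
% programs q that reproduce the history on the universal machine U.
a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
       \left( r_k + \cdots + r_m \right)
       \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-l(q)}
```

The objective is a sum of reward components of the percepts, not a utility function over the environment's state, which is exactly the gap Dewey's argument exploits.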

Thanks for the correction. Daniel hadn't mentioned that as a problem when he reviewed the paper, so I took it as being at least approximately correct; but it is important to be as rigorous as possible. I'll see what can be rescued, and what needs to be reworked.

Here's an attack on section 4.1. Consider the possibility that "philosophical ability" (something like the ability to solve confusing problems that can't be easily formalized) is needed to self-improve beyond some threshold of intelligence, and this same "philosophical ability" also reliably causes one to decide that some particular goal G is the right goal to have, and therefore beyond some threshold of intelligence all agents have goal G. To deny this possibility seems to require more meta-philosophical knowledge than we currently possess.

Another possibility, easy to see if you think more like an engineer and less like a philosopher:

The AI is to operate with light-speed delay, and has to be made of multiple nodes. It is entirely possible that some morality systems would not allow efficient solutions to this challenge (i.e. would break into some sort of war between modules, or otherwise fail to intellectually collaborate).

It is likely that there's only a limited number of good solutions to P2P intelligence design, and the one that would be found would be substantially similar to our own solution to what is fundamentally the same problem, the solution we call 'morality', complete with various non-utilitarian quirks.

edit: that is, our 'morality' is the set of rules for inter-node interaction in society, and some such rules just don't work. The Orthogonality thesis, for anything in any sense practical, is a conjunction of a potentially very large number of propositions (which are assumed false without consideration, by omission) - any consideration not yet examined can break the symmetry between different goals, and then another such consideration is incredibly unlikely to add the symmetry back.

Yes, to deny it requires more meta-philosophical knowledge than we currently possess. But to affirm it as likely requires more meta-philosophical knowledge than we currently possess. My purpose is to show that it's very unlikely, not that it's impossible.

Do you feel I didn't make that point? Should I have addressed "moral realism" explicitly? I didn't want to put down the words, because it raises defensive hackles if I start criticising a position directly.

Perhaps I should have said "To conclude that this possibility is very unlikely" instead of "To deny this possibility". My own intuition seems to assign a probability to it that is greater than "very unlikely" and this was largely unchanged after reading your paper. For example, many of the items in the list in section 4.5, that have to be true if orthogonality was false, can be explained by my hypothesis, and the rest do not seem very unlikely to begin with.

My own intuition seems to assign a probability to it that is greater than "very unlikely"

Why? You're making an extraordinary claim. Something - undefined - called philosophical ability is needed (for some reason) to self-improve and, for some extraordinary and unexplained reason, this ability causes an agent to have a goal G. Where goal G is similarly undefined.

Let me paraphrase: Consider the possibility that "mathematical ability" is needed to self-improve beyond some threshold of intelligence, and this same "mathematical ability" also reliably causes one to decide that some particular goal G is the right goal to have, and therefore beyond some threshold of intelligence all agents have goal G.

Why is this different? What in your intuition is doing the work "philosophical ability" -> same goals? If we call it something else than "philosophical ability", would you have the same intuition? What raises the status of that implication to the level that it's worthy of consideration?

I'm asking seriously - this is the bit in the argument I consistently fail to understand, the bit that never makes sense to me, but whose outline I can feel in most counterarguments.

It seems to me there are certain similarities and correlations between thinking about decision theory (which potentially makes one, or an AI one builds, more powerful) and thinking about axiology (what terminal goals one should have). They're both "ought" questions, and if you consider the intelligences that we can see or clearly reason about (individual humans, animals, Bayesian EU maximizers, narrow AIs that exist today), there seems to be a clear correlation between "ability to improve decision theory via philosophical reasoning" (as opposed to CDT-AI changing into XDT and then being stuck with that) and "tendency to choose one's goals via philosophical reasoning".

One explanation for this correlation (and also the only explanation I can see at the moment, besides it being accidental) is that something we call "philosophical ability" is responsible for both. Assuming that's the case, that still leaves the question of whether philosophical ability backed up with enough computing power eventually leads to goal convergence.

One major element of philosophical reasoning seems to be a distaste for and tendency to avoid arbitrariness. It doesn't seem implausible that for example "the ultimate philosopher" would decide that every goal except pursuit of pleasure / avoidance of pain is arbitrary (and think that pleasure/pain is not arbitrary due to philosophy-of-mind considerations).

One major element of philosophical reasoning seems to be a distaste for and tendency to avoid arbitrariness.

If an agent has goal G1 and sufficient introspective access to know its own goal, how would avoiding arbitrariness in its goals help it achieve goal G1 better than keeping goal G1 as its goal?

I suspect we humans are driven to philosophize about what our goals ought to be by our lack of introspective access, and that searching for some universal goal, rather than what we ourselves want, is a failure mode of this philosophical inquiry.

I think we don't just lack introspective access to our goals, but can't be said to have goals at all (in the sense of preference ordering over some well defined ontology, attached to some decision theory that we're actually running). For the kind of pseudo-goals we have (behavior tendencies and semantically unclear values expressed in natural language), they don't seem to have the motivational strength to make us think "I should keep my goal G1 instead of avoiding arbitrariness", nor is it clear what it would mean to "keep" such pseudo-goals as one self-improves.

What if it's the case that evolution always or almost always produces agents like us, so the only way they can get real goals in the first place is via philosophy?

The primary point of my comment was to argue that an agent that has a goal in the strong sense would not abandon its goal as a result of philosophical consideration. Your response seems more directed at my afterthought about how our intuitions based on human experience would cause us to miss the primary point.

I think that we humans do have goals, despite not being able to consistently pursue them. I want myself and my fellow humans to continue our subjective experiences of life in enjoyable ways, without modifying what we enjoy. This includes connections to other people, novel experiences, high challenge, etc. There is, of course, much work to be done to complete this list and fully define all the high level concepts, but in the end I think there are real goals there, which I would like to be embodied in a powerful agent that actually runs a coherent decision theory. Philosophy probably has to play some role in clarifying our "pseudo-goals" as actual goals, but so does looking at our "pseudo-goals", however arbitrary they may be.

The primary point of my comment was to argue that an agent that has a goal in the strong sense would not abandon its goal as a result of philosophical consideration.

Such an agent would also not change its decision theory as a result of philosophical consideration, which potentially limits its power.

Philosophy probably has to play some role in clarifying our "pseudo-goals" as actual goals, but so does looking at our "pseudo-goals", however arbitrary they may be.

I wouldn't argue against this as written, but Stuart was claiming that convergence is "very unlikely" which I think is too strong.

Such an agent would also not change its decision theory as a result of philosophical consideration, which potentially limits its power.

I don't think that follows, or at least the agent could change its decision theory as a result of some consideration, which may or may not be "philosophical". We already have the example that a CDT agent that learns in advance it will face Newcomb's problem could predict it would do better if it switched to TDT.
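The Newcomb point can be made concrete with a toy expected-value sketch (assumptions: the standard payoffs of $1,000,000 in the opaque box and $1,000 in the transparent one, and an illustrative predictor accuracy of 0.99):

```python
# Toy expected-value comparison for Newcomb's problem.
# The predictor fills the opaque box iff it predicts one-boxing.

def expected_payoff(one_box: bool, accuracy: float) -> float:
    if one_box:
        # With probability `accuracy` the predictor foresaw one-boxing
        # and filled the opaque box.
        return accuracy * 1_000_000
    # Two-boxing: the predictor usually foresaw it and left the opaque box
    # empty; occasionally it erred and both boxes pay out.
    return accuracy * 1_000 + (1 - accuracy) * 1_001_000

# With a 99%-accurate predictor:
one_box_ev = expected_payoff(True, 0.99)    # ~990,000
two_box_ev = expected_payoff(False, 0.99)   # ~11,000
```

This is why a CDT agent that learns it will later face the problem, and can self-modify before being scanned, predicts it does better by switching: its future disposition is what the predictor reads.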

I wrote earlier

"ability to improve decision theory via philosophical reasoning" (as opposed to CDT-AI changing into XDT and then being stuck with that)

XDT (or in Eliezer's words, "crippled and inelegant form of TDT") is closer to TDT but still worse. For example, XDT would fail to acausally control/trade with other agents living before the time of its self-modification, or in other possible worlds.

Ah, yes, I agree that CDT would modify to XDT rather than TDT, though the fact that it self modifies at all shows that goal driven agents can change decision theories because the new decision theory helps it achieve its goal. I do think that it's important to consider how a particular decision theory can decide to self modify, and to design an agent with a decision theory that can self modify in good ways.

Not strictly. If a strongly goal'd agent determines that a different decision theory (or any change to itself) better maximizes its goal, it would adopt that new decision theory or change.

I agree that humans are not utility-maximizers or similar goal-oriented agents - not in the sense we can't be modeled as such things, but in the sense that these models do not compress our preferences to any great degree, which happens to be because they are greatly at odds with our underlying mechanisms for determining preference and behavior.

Also, can we even get 'real goals' like this? We're treading into the land of potentially proposing something as silly as blue unicorns on the back side of the moon. We use goals to model other human intelligences; that is built into our language; that's how we imagine other agents, and that's how you predict a wolf, a cat, another ape, etc. Goals are really easy within imagination (which is not reductionist, and where the true paperclip count exists as a property of the 'world'). Outside imagination, though...

If an agent has goal G1 and sufficient introspective access to know its own goal, how would avoiding arbitrariness in its goals help it achieve goal G1 better than keeping goal G1 as its goal?

Avoiding arbitrariness is useful to epistemic rationality and therefore to instrumental rationality. If an AI has rationality as a goal it will avoid arbitrariness, whether or not that assists with G1.

Avoiding arbitrariness is useful to epistemic rationality and therefore to instrumental rationality.

Avoiding giving credence to arbitrary beliefs is useful to epistemic rationality, and therefore to instrumental rationality, and therefore to goal G1. Avoiding arbitrariness in goals still does not help with achieving G1 if G1 is considered arbitrary. Be careful not to conflate different types of arbitrariness.

If an AI has rationality as a goal

Rationality is not an end goal, it is that which you do in pursuit of a goal that is more important to you than being rational.

If an agent has goal G1 and sufficient introspective access to know its own goal, how would avoiding arbitrariness in its goals help it achieve goal G1 better than keeping goal G1 as its goal?

You are making the standard MIRI assumptions that goals are unupdatable, and are not including rationality (non-arbitrariness, etc.) as a terminal value. (The latter is particularly odd, as Orthogonality implies it.)

I suspect we humans are driven to philosophize about what our goals ought to be by our lack of introspective access, and that searching for some universal goal, rather than what we ourselves want, is a failure mode of this philosophical inquiry.

I suspect we want universal goals for the same reason we want universal laws.

You are making the standard MIRI assumptions that goals are unupdatable

No, I am arguing that agents with goals generally don't want to update their goals. Neither I nor MIRI assume goals are unupdatable, actually a major component of MIRI's research is on how to make sure a self improving AI has stable goals.

and don't include rationality (non arbitrariness, etc) as a terminal value. (The latter is particularly odd, as Orthogonality implies it).

It is possible to have an agent that terminally values meta properties of its own goal system. Such agents, if they are capable of modifying their goal system, will likely self modify to some self-consistent "attractor" system. This does not mean that all agents will converge on a universal goal system. There are different ways that agents can value meta properties of their own goal system, so there are likely many attractors, and many possible agents don't have such meta values and will not want to modify their goal systems.

It is possible to have an agent that terminally values meta properties of its own goal system. Such agents, if they are capable of modifying their goal system, will likely self modify to some self-consistent "attractor" system. This does not mean that all agents will converge on a universal goal system.

Who asserted they would? Moral agents can have all sorts of goals. They just have to respect each other's values. If Smith wants to be an athlete, and Robinson is a budding writer, that doesn't mean one of them is immoral.

There are different ways that agents can value meta properties of their own goal system,

Ok. That would be a problem with your suggestion of valuing arbitrary meta properties of their goal system. Then let's go back to my suggestion of valuing rationality.

so there are likely many attractors, and many possible agents don't have such meta values and will not want to modify their goal systems.

Agents will do what they are built to do. If agents that don't value rationality are dangerous, build ones that do.

MIRI: "We have determined that cars without brakes are dangerous. We have also determined that the best solution is to reduce the speed limit to 10mph"

Everyone else: "We know cars without brakes are dangerous. That's why we build them with brakes".

Who asserted they would? Moral agents can have all sorts of goals. They just have to respect each other's values. If Smith wants to be an athlete, and Robinson is a budding writer, that doesn't mean one of them is immoral.

Have to, or else what? And how do we separate moral agents from agents that are not moral?

Ok. That would be a problem with your suggestion of valuing arbitrary meta properties of their goal system. Then lets go back to my suggestion of valuing rationality.

Valuing rationality for what? What would an agent which "values rationality" do?

Agents will do what they are built to do. If agents that don't value rationality are dangerous, build ones that do.

MIRI: "We have determined that cars without brakes are dangerous. We have also determined that the best solution is to reduce the speed limit to 10mph"

Everyone else: "We know cars without brakes are dangerous. That's why we build them with brakes".

If the solution is to build agents that "value rationality," can you explain how to do that? If it's something so simple as to be analogous to adding brakes to a car, as opposed to, say, programming the car to be able to drive itself (let alone something much more complicated,) then it shouldn't be so difficult to describe how to do it.

Moral agents [..] have to respect each other's values.

Have to, or else what?

Have to, logically. Like even numbers have to be divisible by two.

And how do we separate moral agents from agents that are not moral?

How do we recognise anything? They have behaviour and characteristics which match the definition.

Valuing rationality for what?

For itself. I do not accept that rationality can only be instrumental, a means to an end.

What would an agent which "values rationality" do?

The kind of thing EY, the CFAR and other promoters of rationality urge people to do.

If the solution is to build agents that "value rationality," can you explain how to do that?

In the same kind of very broad terms that MIRI can explain how to build Artificial Obsessive Compulsives.

If it's something so simple as to be analogous to adding brakes to a car,

The analogy was not about simplicity. Illustrative analogies are always simpler than what they are illustrating: that is where their usefulness lies.

Something - undefined - called philosophical ability

Robin Hanson's 'far mode' (his take on construal level theory) is a plausible match to this 'something'. Hanson points out that far mode is about general categories and creative metaphors. This is a match to something from AGI research: categorization and analogical inference. This can be linked to Bayesian inference by considering analogical inference as a natural way of reasoning about 'priors'.

...and, for some extraordinary and unexplained reason, this ability causes an agent to have a goal G.

A plausible explanation is that analogical inference is associated with sentience (subjective experience), as suggested by Douglas Hofstadter (who has stated he thinks 'analogies' are the core of conscious cognition). Since sentience is closely associated with moral reasoning, it's at least plausible that this ability could indeed give rise to converge on a particular G.

Where goal G is similarly undefined.

Here is a way G can be defined:

Analogical inference is concerned with Knowledge Representation (KR), so we could redefine ethics based on 'representations of values' ('narratives', which, as Daniel Dennett has pointed out, indeed seem to be closely linked to subjective experience) rather than external consequences. At this point we can bring in the ideas of Schmidhuber and recall a powerful point made by Hanson (see below).

For maximum efficiency, all AGIs with the aforementioned 'philosophical ability' (analogical inference and production of narratives) would try to minimize the complexity of the cognitive processes generating their internal narratives. This could place universal constraints on what these values are. For example, Schmidhuber pointed out that data compression could be used to get a precise definition of 'beauty'.

Let's now recall a powerful point Hanson made a while back on OB: the brain/mind can be totally defined in terms of a 'signal processor'. Given this perspective, we could then view the correct G as the 'signal' and moral errors as 'noise'. Algorithmic information theory could then be used to define a complexity metric that would precisely define this G.

Schmidhuber's definition of beauty is wrong. He says, roughly, that you're most pleased when after great effort you find a way to compress what was seemingly incompressible. If that were so, I could please you again and again by making up new AES keys with the first k bits random and the rest zero, and using them to generate and give you a few terabytes of random data. You'd have to brute-force the key, at which point you'd have compressed down from terabytes to kilobytes. What beauty! Let's play the exact same game again, with the exact same cipher but a different key, forever.
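This objection can be made concrete with a toy sketch (assumptions: pseudorandom bytes stand in for the AES ciphertext, learning the short seed stands in for brute-forcing the key, and zlib is just a stand-in compressor):

```python
import random
import zlib

# "Ciphertext": pseudorandom bytes that are fully determined by a tiny seed,
# standing in for data encrypted with a mostly-zero AES key.
random.seed(42)
data = bytes(random.randrange(256) for _ in range(10_000))

# To a compressor that hasn't found the seed, the data looks incompressible:
naive_size = len(zlib.compress(data))  # roughly the raw size

# An observer who has "brute-forced" the seed can regenerate the data from a
# few bytes, so their effective description length collapses:
informed_size = len(b"seed=42;n=10000")

# Huge Schmidhuber-style "compression progress" - yet, per the objection,
# repeating the game with a fresh key would not actually be beautiful.
progress = naive_size - informed_size
```

The point is that the compression-progress score is enormous on every round of the key game, even though each round is as tedious as the last.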

Right. That said, wireheading, aka the grounding problem, is a huge unsolved philosophical problem, so I'm not sure Schmidhuber is obligated to answer wireheading objections to his theory.

Right. That said, wireheading, aka the grounding problem, is a huge unsolved philosophical problem, so I'm not sure Schmidhuber is obligated to answer wireheading objections to his theory.

Unsolved philosophical problem? Huh? No additional philosophical breakthroughs are required for wireheading to not be a problem.

If I want (all things considered, etc) to wirehead, I'll wirehead. If I don't want to wirehead I will not wirehead. Wireheading introduces no special additional problems and is handled the same way all other preferences about future states of the universe can be handled.

(Note: It is likely that you have some more specific point regarding in what sense you consider wireheading 'unsolved'. I welcome explanations or sources.)

Unsolved in the sense that we don't know how to give computer intelligences intentional states in a way that everyone would be all like "wow that AI clearly has original intentionality and isn't just coasting off of humans sitting at the end of the chain interpreting their otherwise entirely meaningless symbols". Maybe this problem is just stupid and will solve itself but we don't know that yet, hence e.g. Peter's (unpublished?) paper on goal stability under ontological shifts. (ETA: I likely don't understand how you're thinking about the problem.)

Unsolved in the sense that we don't know how to give computer intelligences intentional states in a way that everyone would be all like "wow that AI clearly has original intentionality and isn't just coasting off of humans sitting at the end of the chain interpreting their otherwise entirely meaningless symbols".

Being able to do this would also be a step towards the related goal of trying to give computer intelligences intelligence that we cannot construe as 'intentionality' in any morally salient sense, so as to satisfy any "house-elf-like" qualms that we may have.

e.g. Peter's (unpublished?) paper on goal stability under ontological shifts.

I assume you mean Ontological Crises in Artificial Agents’ Value Systems? I just finished republishing that one. Originally published form. New SingInst style form. A good read.

But the theory fails because this fits it but isn't wireheading, right? It wouldn't actually be pleasing to play that game.

But the theory fails because this fits it but isn't wireheading, right? It wouldn't actually be pleasing to play that game.

I think you are right.

The two are errors that practically, with respect to hedonistic extremism, operate in opposing directions. They are similar in form in as much as they fit the abstract notion "undesirable outcomes due to lost purposes when choosing to optimize what turns out to be a poor metric for approximating actual preferences".

Meh, yeah, maybe? Still seems like other, more substantive objections could be made.

Relatedly, I'm not entirely sure I buy Steve's logic. PRNGs might not be nearly as interesting as short mathematical descriptions of complex things, like Chaitin's omega. Arguably collecting as many bits of Chaitin's omega as possible, or developing similar maths, would in fact be interesting in a human sense. But at that point our models really break down for many reasons, so meh whatever.

Engineering ability suffices:

http://lesswrong.com/lw/cej/general_purpose_intelligence_arguing_the/6lst

Do philosophers have an incredibly strong ugh field around anything that can be deemed 'implementation detail'? Clearly, 'superintelligence' the string of letters can have whatever 'goals' the strings of letters describe; no objection here. The superintelligence in the form of a distributed system, with millisecond or worse lag between components and nanosecond or better clock speed, on the other hand...

Looking at your post at http://lesswrong.com/lw/2id/metaphilosophical_mysteries, I can see the sketch of an argument. It goes something like "we know that some decision theories/philosophical processes are 'objectively 'inferior, hence some are objectively superior, hence (wave hands furiously) it is at least possible that some system is objectively best".

I would counter:

1) The argument is very weak. We know some mathematical axiomatic systems are contradictory, hence inferior. It doesn't follow from that that there is any "best" system of axioms.

2) A lot of philosophical progress is entirely akin to mathematical progress: showing the consequences of the axioms/assumptions. This is useful progress, but not really relevant to the argument.

3) All the philosophical progress seems to lie on the "how to make better decisions given a goal" side; none of it lies on the "how to have better goals" side. Even the expected utility maximisation result just says "if you are unable to predict effectively over the long term, then to achieve your current goals, it would be more efficient to replace these goals with others compatible with a utility function".

However, despite my objections, I have to note that the argument is at least an argument, and provides some small evidence in that direction. I'll try and figure out whether it should be included in the paper.

If an agent with goal G1 acquires sufficient "philosophical ability" that it concludes that goal G is the right goal to have, that means it has decided that the best way to achieve goal G1 is to pursue goal G. For that to happen, I find it unlikely that goal G is anything other than a clarification of goal G1 in light of some confusion revealed by the "philosophical ability", and I find it extremely unlikely that there is some universal goal G that works for any goal G1.

Offbeat counter: You're assuming that this ontology that privileges "goals" over e.g. morality is correct. What if it's not? Are you extremely confident that you've carved up reality correctly? (Recall that EU maximizers haven't been shown to lead to AGI, and that many philosophers who have thought deeply about the matter hold meta-ethical views opposed to your apparent meta-ethics.) I.e., what if your above analysis is not even wrong?

You're assuming that this ontology that privileges "goals" over e.g. morality is correct.

I don't believe that goals are ontologically fundamental. I am reasoning (at a high level of abstraction) about the behavior of a physical system designed to pursue a goal. If I understood what you mean by "morality", I could reason about a physical system designed to use that and likely predict different behaviors than for the physical system designed to pursue a goal, but that doesn't change my point about what happens with goals.

Recall that EU maximizers haven't been shown to lead to AGI

I don't expect EU maximizers to lead to AGI. I expect EU maximizing AGIs, whatever has led to them, to be effective EU maximizers.

Sorry, I meant "ontology" in the information science sense, not the metaphysics sense; I simply meant that you're conceptually (not necessarily metaphysically) privileging goals. What if you're wrong to do that? I suppose I'm suggesting that carving out "goals" might be smuggling in conclusions that make you think universal convergence is unlikely. If you conceptually privileged rational morality instead, as many meta-ethicists do, then your conclusions might change, in which case it seems you'd have to be unjustifiably confident in your "goal"-centric conceptualization.

I think I am only "privileging" goals in a weak sense, since by talking about a goal driven agent, I do not deny the possibility of an agent built on anything else, including your "rational morality", though I don't know what that is.

Are you arguing that a goal driven agent is impossible? (Note that this is a stronger claim than it being wiser to build some other sort of agent, which would not contradict my reasoning about what a goal driven agent would do.)

(Yeah, the argument would have been something like, given a sufficiently rich and explanatory concept of "agent", goal-driven agents might not be possible --- or more precisely, they aren't agents insofar as they're making tradeoffs in favor of local homeostatic-like improvements as opposed to traditionally-rational, complex, normatively loaded decision policies. Or something like that.)

Let me try to strengthen your point. If an agent with goal G1 acquires sufficient "philosophical ability" that it concludes that goal G is the right goal to have, that means it has decided that the best way to achieve goal G1 is to pursue what it thinks is the "right goal to have". This would require it to take a kind of normative stance on goal fulfillment, which would require it to have normative machinery, which would need to be implemented in the agent's mind. Is it impossible to create an agent without normative machinery of this kind? Does philosophical ability depend directly on normative machinery?

Let G1="Figure out the right goal to have"

Couple of comments:

  • The section "Bayesian Orthogonality thesis" doesn't seem right, since a Bayesian would think in terms of probabilities rather than possibilities ("could construct superintelligent AIs with more or less any goals"). If you're saying that we should assign a uniform distribution for what AI goals will be realized in the future, that's clearly wrong.
  • I think the typical AI researcher, after reading this paper, will think "sure, it might be possible to build agents with arbitrary goals if one tried, but my approach will probably lead to a benevolent AI". (See here for an example of this.) So I'm not sure why you're putting so much effort into this particular line of argument.

This is the first step (pointed more towards philosophers). Formalise the "we could construct an AI with arbitrary goals", and with that in the background, zoom in on the practical arguments with the AI researchers.

Will restructure the Bayesian section. Some philosophers argue things like "we don't know what moral theories are true, but a rational being would certainly find them"; I want to argue that this is equivalent, from our perspective, to the AI's goals ending up anywhere. What I meant to say is that ignorance of this type is like any other type of ignorance, hence the "Bayesian" terminology.

This is the first step (pointed more towards philosophers). Formalise the "we could construct an AI with arbitrary goals", and with that in the background, zoom in on the practical arguments with the AI researchers.

Ok, in that case I would just be wary about people being tempted to cite the paper to AI researchers without having the follow-up arguments in place; those researchers would then think that their debating/discussion partners are attacking a strawman.

Hmm, good point; I'll try to put in some disclaimer, emphasising that this is a partial result...

Thanks. To go back to my original point a bit, how useful is it to debate philosophers about this? (When debating AI researchers, given that they probably have a limited appetite for reading papers arguing that what they're doing is dangerous, it seems like it would be better to skip this paper and give the practical arguments directly.)

Maybe I've spent too much time around philosophers - but there are some AI designers who seem to spout weak arguments like that, and this paper can't hurt. When we get around to writing a proper justification for AI researchers, having this paper to refer back to avoids going over the same points again.

Plus, it's a lot easier to write this paper first, and was good practice.

Without getting into the likelihood of a 'typical AI researcher' successfully creating a benevolent AI, do you doubt Goertzel's "Interdependency Thesis"? I find both to be rather obviously true. Yes, it's possible in principle for almost any goal system to be combined with almost any type or degree of intelligence, but that's irrelevant because in practice we can expect the distributions over both to be highly correlated in some complex fashion.

I really don't understand why this Orthogonality idea is still brought up so much on LW. It may be true, but it doesn't lead to much.

The space of all possible minds or goal systems is about as relevant to the space of actual practical AIs as the space of all configurations of a human's molecules is to the space of a particular human's set of potential children.

‘maximising paperclips’

Since you want a non-LWian audience, make that “maximising the number of paperclips in the universe”, otherwise the meaning might be unclear.

Although, his point would still hold if the reader was imagining the goal of making extremely large paperclips.

We will also take the materialistic position that humans themselves can be viewed as non-deterministic algorithms[2]

I'm not a philosopher of mind but I think "materialistic" might be a misleading word here, being too similar to "materialist". Wouldn't "computationalistic" or maybe "functionalistic" be more precise? ("-istic" as opposed to "-ist" to avoid connotational baggage.) Also it's ambiguous whether footnote two is a stipulation for interpreting the paper or a brief description of the consensus view in physics.

At various points you make somewhat bold philosophical or conceptual claims based off of speculative mathematical formalisms. Even though I'm familiar with and have much respect for the cited mathematics, this still makes me nervous, because when I read philosophical papers that take such an approach my prior is high for subtle or subtly unjustified equivocation; I'd be even more suspicious were I a philosopher who wasn't already familiar with universal AI, which isn't a well-known or widely respected academic subfield. The necessity of finding clearly trustworthy analogies between mathematical and phenomenal concepts is a hard problem to solve both when thinking about the problem oneself and when presenting one's thoughts to others, and there might not be a good solution in general, but there are a few instances in your paper that I think are especially shaky. E.g.,

For utility function maximisers, the AIXI is the theoretically best agent there is, more successful at reaching its goals (up to a finite constant) than any other agent (Hutter, 2005). AIXI itself is incomputable, but there are computable variants such as AIXItl or Gödel machines (Schmidhuber, 2007) that accomplish comparable levels of efficiency. These methods work for whatever utility function is plugged into them. Thus in the extreme theoretical case, the Orthogonality thesis seems trivially true.

You overreach here. AIXItl or Goedel machines might not be intelligent even given arbitrarily large amounts of resources; in fact I believe Eliezer's position is that Goedel machines immediately run into intractable Loebian problems. AIXItl could share a similar fate. As far as I know no one's found an agent algorithm that fits your requirements without controversy. E.g., the grounding problem is unsolved and so we can't know that any given agent algorithm won't reliably end up wireheading. So the theoretical Orthogonality thesis isn't trivially true, contra your claim, and such an instance of overreaching justifies hypothetical philosophers' skepticism about the general soundness of your analogical approach.

Unfortunately I'll have to end there.

I like the paper, but am wondering how (or whether) it applies to TDT and acausal trading. Doesn't the trading imply a form of convergence theorem among very powerful TDT agents (they should converge on an average utility function constructed across all powerful TDT agents in logical space)?

Or have I missed something here? (I've been looking around on Less Wrong for a good post on acausal trading, and am finding bits and pieces, but no overall account.)

Global-scale acausal trading, if it's possible in practice (and it's probably not going to be; we have only this theoretical possibility and no indication that it could actually be implemented), implies uniform expected surface behavior of the involved agents, but those agents trade control over their own resources (world) for optimization of their own particular preference by the global acausal economy. So even if the choice of the AI's preference doesn't have significant impact on what happens in the AI's own world, it does have significant impact on what happens globally, on the order of what all the resources in the AI's own world can buy.

It does indeed imply a form of convergence. I would assume Stuart thinks of the convergence as an artifact of the game environment the agents are in. Not a convergence in goals, just in behavior, though the results are basically the same.

If there's convergence in goals, then we don't have to worry about making an AI with the wrong goals. If there's only convergence in behavior, then we do, because building an AI with the wrong goals will shift the convergent behavior in the wrong direction. So I think it makes sense for Stuart's paper to ignore acausal trading and just talk about whether there is convergence in goals.

Not necessarily, it might destroy the earth before its goals converge.

There was an incident of censorship by EY relating to acausal trading - the community's confused response (chilling effects? agreement?) to that incident explains why there is no overall account.

No, I think it's more that the idea (acausal trading) is very speculative and we don't have a good theory of how it might actually work.

Thanks for this... Glad it's not being censored!

I did post the following on one of the threads, which suggested to me a way in which it would happen or at least get started

Again, apologies if this idea is nuts or just won't work. However, if true, it did strike me as increasing the chance of a simulation hypothesis. (It gives powerful TDT AIs a motivation to simulate as many civilizations as they can, and in a "state of nature", so that they get to see what the utility functions are like, and how likely they are to also build TDT-implementing AIs...)

By the way, I still can't stop thinking about that post after 6 months. I think it's my favorite wild-idea scenario I've ever heard of.

If a goal is a preference order over world states, then there are uncountably many of them, so any countable means of expression can only express a vanishingly small minority of them. Trivially (as Bostrom points out) a goal system can be too complex for an agent of a given intelligence. It therefore seems to me that what we're really defending is an Upscalability thesis: if an agent A with goal G is possible, then a significantly more intelligent A++ with goal G is possible.
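The counting argument above can be made precise; here is a sketch (my own illustration, not from the paper), assuming a countably infinite set $W$ of world states and goals expressed as finite strings over a finite alphabet $\Sigma$:

```latex
% For each subset $S \subseteq W \setminus \{w_0\}$, fix an order that
% ranks every member of $S$ above $w_0$ and everything else below it.
% Distinct $S$ give distinct orders, so the preference orders inject
% the power set of a countably infinite set:
\[
  \bigl|\{\text{preference orders over } W\}\bigr| \;\geq\; 2^{\aleph_0},
\]
% while a finite alphabet $\Sigma$ yields only countably many finite
% expressions:
\[
  \Bigl|\,\bigcup_{n \geq 1} \Sigma^{n}\Bigr| \;=\; \aleph_0 \;<\; 2^{\aleph_0}.
\]
% Hence all but a vanishingly small minority of such goals are
% inexpressible by any countable means of expression.
```

This supports the move to the Upscalability thesis: the interesting question is not whether every goal in this vast space is realisable, but whether any realisable goal survives scaling up the agent's intelligence.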

Just some minor text corrections for you:

From 3.1

The utility function picture of a rational agent maps perfectly onto the Orthogonality thesis: here have the goal structure, the utility fu...

...could be "here we have the...

From 3.2

Human minds remain our only real model of general intelligence, and this strongly direct and informs...

this strongly directs and informs...

From 4.1

“All human-designed rational beings would follow the same morality (or one of small sets of moralities)” sound plausible; in contract “All human-designed superefficient

I think it would be sounds since the subject is the argument, even though the argument contains plural subjects, and I think you meant "in contrast", but I may be mistaken.

From 3.3

To do we would want to put the threatened agent

to do so(?) we would

From 3.4

an agent whose single goal is to stymie the plans and goals of single given agent

of a single given agent

From 4.1

then all self-improving or constructed superintelligence must fall prey to it, even if it were actively seeking to avoid it.

every, or change the rest of the sentence (superintelligences, they were)

From 4.5

There are goals G, such that an entity an entity with goal G

a superintelligence with goal G can exist.

Thus to deny the Orthogonality thesis is to assert that there is a goal system G, such that, among other things:
(1) There cannot exist any efficient real-world algorithm with goal G.
(2) If a being with arbitrarily high resources, intelligence, time and goal G, were to try to design an efficient real-world algorithm with the same goal, it must fail.
(3) If a human society were highly motivated to design an efficient real-world algorithm with goal G, and were given a million years to do so along with huge amounts of resources, training and knowledge about AI, it must fail.
(4) If a high-resource human society were highly motivated to achieve the goals of G, then it could not do so (here the human society is seen as the algorithm).
(5) Same as above, for any hypothetical alien societies.
(6) There cannot exist any pattern of reinforcement learning that would train a highly efficient real-world intelligence to follow the goal G.
(7) There cannot exist any evolutionary or environmental pressures that would evolve highly efficient real-world intelligences to follow goal G.
All of these seem extraordinarily strong claims to make!
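The sense in which the goal is a free parameter in standard agent designs, per claim (6), can be shown with a toy sketch (my own illustration, not from the paper): a generic tabular Q-learning loop on a hypothetical three-state chain world, where the reward function, the analogue of goal G, is an arbitrary plug-in. The identical learner, given opposite rewards, learns opposite policies.

```python
import random
from collections import defaultdict


def q_learn(reward_fn, n_states=3, n_actions=2, episodes=500,
            steps=20, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Generic tabular Q-learning on a toy chain of states 0..n_states-1.

    reward_fn(state) is the plug-in "goal": the learning machinery is
    the same whatever reward function is supplied.
    """
    rng = random.Random(seed)
    q = defaultdict(float)  # Q-values, keyed by (state, action)
    for _ in range(episodes):
        s = 0
        for _ in range(steps):
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda act: q[(s, act)])
            # action 1 moves right, action 0 moves left, clipped to the chain
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = reward_fn(s2)
            best_next = max(q[(s2, b)] for b in range(n_actions))
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    # return the learned greedy action at the start state
    return max(range(n_actions), key=lambda act: q[(0, act)])


# Two opposite "goals" plugged into the identical learner:
likes_right = q_learn(lambda s: 1.0 if s == 2 else 0.0)  # reward at right end
likes_left = q_learn(lambda s: 1.0 if s == 0 else 0.0)   # reward at left end
```

Of course this only illustrates the plug-in structure of the design; it says nothing about which reward functions remain stable under self-improvement, which is where the "weak anti-orthogonality" replies below direct their fire.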

When I try arguing for anti-orthogonality, all those different conditions on G do not appear to add strength. The claim of anti-orthogonality is, after all, that most goal systems are inconsistent in the same sense in which "I want to be stupider" or "I want to prove Goedel's statement..." goals are inconsistent, even though the inconsistency is not immediately apparent. And then all of the conditions immediately follow.

The "human society" conditions (3) and (4) are supposed to argue in favor of there being no impossible G-s, but in fact they argue for the opposite. Because, obviously, there are only very few G-s, which would be acceptable as long-term goals for human societies.

This point also highlights another important difference: the anti-orthogonality thesis can be weaker than "there cannot exist any efficient real-world algorithm with goal G". Instead, it can be "any efficient real-world algorithm with goal G is value-unstable", meaning that if any value drift, however small, is allowed, then the system will in short time drift away from G to the "right goal system". This would distinguish between the "strong anti-orthogonality" (1), (2), (3) on the one hand, and "weak anti-orthogonality" (4), (5), (6), (7) on the other.

This weaker anti-orthogonality thesis is sufficient for practical purposes. It basically asserts that an UFAI could only be created via explicit and deliberate attempts to create an UFAI, and not because of bugs, insufficient knowledge, etc. And this makes the whole "Orthogonality for superhuman AIs" section much less relevant.

It basically asserts that an UFAI could only be created via explicit and deliberate attempts to create an UFAI,

As I said to Wei, we can start dealing with those arguments once we've got strong foundations.

I'll see if the value drift issue can be better integrated in the argumentation.

I don't see why there are only two counter-theses in section 4. Or rather, it looks as though you want a too-strong claim - in order to criticise it.

Try a "partial convergence" thesis instead. For instance, the claim that goals that are the product of cultural or organic evolution tend to maximise entropy and feature universal instrumental values.

Or rather, it looks as though you want a too-strong claim - in order to criticise it.

The incompleteness claim is weaker than the partial convergence claim.

Sure, but if you try harder with counter-theses you might reach a reasonable position that's neither very weak nor wrong.

our race spans foot-fetishists, religious saints, serial killers, instinctive accountants, role-players, self-cannibals, firefighters and conceptual artists. The autistic, those with exceptional social skills, the obsessive compulsive and some with split-brains. Beings of great empathy and the many who used to enjoy torture and executions as public spectacles

Some of these are not really terminal goals. A fair number of people with strong sexual fetishes would be perfectly happy without them, and in more extreme cases really would prefer not to have them. Similarly, there are some serial killers who really don't like the fact that they have such a compulsion. Your basic point is sound, but these two examples seem weak.

A fair number of people with strong sexual fetishes

there are some serial killers

It was an existence argument. That some more people aren't examples doesn't really change matters, does it?

Copying from a comment I already made, because no-one responded last time:

I'm not confident about any of the below, so please add cautions in the text as appropriate.

The orthogonality thesis is both stronger and weaker than we need. It suffices to point out that neither we nor Ben Goertzel know anything useful or relevant about what goals are compatible with very large amounts of optimizing power, and so we have no reason to suppose that superoptimization by itself points either towards or away from things we value. By creating an "orthogonality thesis" that we defend as part of our arguments, we make it sound like we have a separate burden of proof to meet, whereas in fact it's the assertion that superoptimization tells us something about the goal system that needs defending.

The orthogonality thesis is non-controversial. Ben's point is that what matters is not the question of what types of goals are theoretically compatible with superoptimization, but rather what types of goals we can expect to be associated with superoptimization in reality.

In reality, AGIs with superoptimization power will be created by human agencies (or their descendants) with goal systems subject to extremely narrow socio-economic filters.

The other tangential consideration is that AGIs with superoptimization power and long planning horizons/zero time discount may have highly convergent instrumental values/goals which are equivalent in effect to terminal values/goals for agents with short planning horizons (such as humans). From a human perspective, we may observe all super-AGIs to appear to have strangely similar ethics/morality/goals, even though what we are really observing are convergent instrumental values and short-term opening plans, as their true goals concern the end of the universe and are essentially unknowable to us.