Questions about "formalizing instrumental goals"

Mark Neyer

Epistemic Status: Autodidact outsider who suspects he has something to add to the conversation about AI risk.

Abstract

This essay raises questions about the methodology, and thus conclusions, reached in the paper “Formalizing Convergent Instrumental Goals.” This paper concluded that convergent instrumental goals common to any AGI would likely lead that AGI to consume increasing amounts of resources from any agents around it, and that cooperative strategies would likely give way to competitive ones as the AGI increases in power. The paper made this argument using a toy model of a universe in which agents obtain resources in order to further their capacity to advance their goals.

In response, I argue that simplifications in the model of resource usage by an AI have led to incorrect conclusions. The simplifications, by themselves, do not change the conclusion. Rather, it is the combinatorial interaction of the simplifying assumptions which possibly brings the conclusion into doubt. The simplifying assumptions ‘cover up’ aspects of reality which have combinatorial interactions, and thus the model eliminates these interactions from consideration when evaluating which goals are likely to be convergent instrumental subgoals of an AGI.

The summary of the objection is this:

Absent a totally accurate understanding of reality (and all of the future consequences of its actions), and subject to the random decay and breakdown of any single part of its physiology, an AGI may end up being extremely gentle because it doesn’t want to kill itself accidentally. Taking care of other agents in its environment may end up being the lowest risk strategy for an AGI, for reasons that the original paper simplified away by eliminating the possibility of risks to the AGI.

I understand that I am an amateur here, questioning conclusions in what looks to be an important paper. I hope that that amateur status will not dissuade readers from considering the claim made herein.

This paper is not an argument that AI risk doesn't exist, or isn't existential. On the contrary, it is an argument that a sole focus on the alignment problem ends up obscuring significant differences in risks between different approaches taken to construct an AGI. If the alignment problem is unsolvable (and I suspect that it is), we are ignoring the differences between existing risks that are likely to manifest or, in some cases, already have.

This paper is structured as follows:

First, the original paper is summarized.
Next, the objections to the paper are laid out in terms of real risks to an AGI that the original paper elides.
Lastly, potential responses are considered.

Summary of “Formalizing Convergent Instrumental Goals”

The purpose of “Formalizing Convergent Instrumental Goals” was to provide a rigorous framework for assessing claims about convergent instrumental subgoals. In short, any AGI will want to pursue certain goals such as ‘gather resources for itself,’ ‘protect itself’, etc. The paper tries to resolve questions about whether these instrumental goals would possibly include human ethics or whether they would lead to unethical behavior.

The paper attempts to resolve this dispute with a formal model of resources available for use by an agent in the universe. The model is developed and then results are given:

“This model further predicts that agents will not in fact “leave humans alone” unless their utility function places intrinsic utility on the state of human-occupied regions: absent such a utility function, this model shows that powerful agents will have incentives to reshape the space that humans occupy.”

I agree that if this model is an accurate description of how an AGI will work, then its results hold. But in order to accept the results of any model, we have to ask how accurate its assumptions are. The model is given as follows:

“We consider an agent A taking actions in a universe consisting of a collection of regions, each of which has some state and some transition function that may depend on the agent’s action. The agent has some utility function U_A over states of the universe, and it attempts to steer the universe into a state highly valued by U_A by repeatedly taking actions, possibly constrained by a pool of resources possessed by the agent. All sets will be assumed to be finite, to avoid issues of infinite strategy spaces.”

At first glance, this seems like a fine formulation of the problem. But there are several simplifying assumptions in this model. By themselves, these assumptions don’t seem to matter much. I believe if you look at them in isolation, they don’t really change the conclusion.

However, I believe that there is an interaction between these incorrect assumptions that does change the conclusion.

Simplifying Assumptions

The three relevant simplifying assumptions are:

The agent acts with total knowledge of the entire universe, rather than using sensory devices to supply information to a map of itself and its environment
The agent is a disembodied ‘owner and deployer of resources’, rather than a result of the interaction of resources
Resources do not decay over time

Each of these individual assumptions, I think, doesn’t change the conclusion.

An agent with partial knowledge of the universe might be modeled as an agent which acquires knowledge as a kind of resource. The paper explicitly mentions ‘learning technology’ as being a kind of resource.

An agent being composed of resources, and being nothing more than the result of those resources, doesn’t matter much, if resources don’t decay over time, and if any agent has total knowledge of the entire world state.

Resource decay doesn’t seem to matter much either, if an agent has total knowledge of the world around it. At most, decay may slow down the rate of growth. Each assumption, by itself, is likely fine. It is their interactions which draw the conclusions of the model into question. Or, more accurately, it is the interactions between the aspects of reality that the assumptions have eliminated that raise doubts about the conclusion.

Objections to the Original Model

We might restate these simplifying assumptions in terms of the ‘aspects of reality’ they have elided:

The assumption of infinite knowledge of the universe, and the resources which might possibly be acquired, elides the reality of ignorance. Even for an AGI, we have no reason to believe it will not reason under uncertainty, gather data about the world through sense mechanisms, form hypotheses, and experimentally evaluate these hypotheses for accuracy. A more accurate model of the AGI’s knowledge would be that of a map which is itself a resource, and of special resources called ‘sensors’ which provide the AGI with information about the world in which it lives.

The AGI would then be able to evaluate possible actions in terms of how they would affect its most valuable resource, its map of reality. These evaluations would themselves have a cost, as the AGI would still need to spend time and energy to ask questions about the contents of its map and compute the answers.

The assumption that the AGI exists independent of its resources elides the reality of complex interdependency. A more accurate model would represent the AGI itself as a directed acyclic graph of resources. Each edge represents a dependency. The AGI’s capacity to reason, its map, its computational resources and sensors would all need to be represented as the physical components that made them up, which would then need to be represented as memory chips, processors, drives, network cables, sensors, power equipment, etc.

The assumption that resources last forever once acquired elides the reality of entropy. A more accurate model of resources would be one in which, each turn, every resource has some random chance of breaking or performing incorrectly.
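To make the second and third points concrete, here is a minimal sketch of an agent modeled as a dependency graph whose physical parts decay at random each turn. Everything in it is an illustrative assumption rather than something taken from the original paper: the component names, the shape of the graph, the `p_break` parameter, and the simplification that parts fail independently.

```python
import random

# Hypothetical dependency graph: a node works only if all of its listed
# dependencies work. The names are illustrative, not from the original paper.
DEPENDS_ON = {
    "agent": ["reasoning", "map"],
    "reasoning": ["cpu", "memory"],
    "map": ["sensors", "memory"],
    "cpu": ["power"],
    "memory": ["power"],
    "sensors": ["power"],
    "power": [],
}

# Assumption: only physical components are subject to random breakage.
PHYSICAL = {"cpu", "memory", "sensors", "power"}


def is_working(node, broken):
    """A node works if it is not broken and every dependency works."""
    if node in broken:
        return False
    return all(is_working(dep, broken) for dep in DEPENDS_ON[node])


def step(broken, p_break, rng):
    """One turn of decay: each intact physical part breaks with probability p_break."""
    newly_broken = {
        part for part in PHYSICAL
        if part not in broken and rng.random() < p_break
    }
    return broken | newly_broken


def turns_survived(p_break, max_turns, seed):
    """Count the turns the agent completes before it stops working."""
    rng = random.Random(seed)
    broken = set()
    for turn in range(max_turns):
        broken = step(broken, p_break, rng)
        if not is_working("agent", broken):
            return turn
    return max_turns
```

In this toy setup a broken `cpu` takes down `reasoning` but not `map`, while a broken `power` node takes down everything. With `p_break` set to zero the agent never fails, which corresponds to the no-decay assumption of the original model.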

Putting these all together, the original paper formulated the problem of determining convergent instrumental goals in terms of an abstract disembodied agent with no hard dependencies on anything, infinitely accurate knowledge of the world which it obtains at zero cost and risk, in pursuit of some goal, towards which it always computes the best strategy, in O(1) time, on the state of the entire universe.

This response argues that it shouldn’t be surprising that the above described agent would do nothing but consume resources while growing indefinitely without regard for other agents. Because the model’s simplifying assumptions do not hold in reality, however, no such agent will exist. This particular fear is unlikely to manifest in reality, because the AGI will also face risks. The AGI will need to balance 'pursuit of its goals' with 'aversion of the risks it faces', and the original paper did not consider any such risks, with the possible exception of being outcompeted by other agents.

Risks to the AGI

The AGI described with the more complex model is not a disembodied mind with total knowledge of the universe, single-mindedly advancing some goal. It is a fragile machine, much more delicate than any animal, with some knowledge of itself, and a rough map of some parts of the external world. The AGI requires continuous support from the world outside of it, which it can’t fully see or understand. Any one of its many pieces might break, at any time. Unlike an animal, it is unlikely to be made of self-repairing, self-sustaining molecular machines, each of which can independently heal itself, arranged in mutually cooperative networks. This gives rise to the primary risk to the AGI: that it would break down.

Breakdown Risk to the AGI

Depending upon how the AGI has been trained or what information its map is given, it might very well know that it has a CPU which is broken. But how can it replace the broken CPU without help from operators in the datacenter that houses it? How can it know the health of the manufacturer of that CPU? What will happen if the global economy breaks down, international trade trends revert to isolationism, and the particular kind of CPU that the AGI needs is no longer available for purchase? The same problem will apply to its memory chips, its networking interfaces, its motherboards, optical interconnect cables, batteries, power supplies, cooling pumps, fans - there is likely to be a massive set of dependencies for a machine that can meet human performance. And all of these components will wear down and break over time. Any number of tail risk events could set in motion a chain of events that makes it impossible for the AGI to repair itself as its pieces break, and leaves it unable to obtain the precise repair parts it needs.

The original paper described an agent with no needs. The reality that the AGI will have needs calls the model’s conclusions into doubt. One of the chief needs here is the need for an accurate map of the self-other boundary: the AGI needs to understand which resources it can freely use without risk to itself, and which resources constitute dependencies of the AGI that it would be risky to take and repurpose for its own ends. It turns out that this need alone, coupled with the complex internal structure of the AGI, places serious constraints on how big the AGI can grow, and how aggressively it can acquire resources from the agents around it, without incurring a second kind of risk.

Self-Other Boundary Risk

We might summarize the ‘self-other boundary’ as being the region of the AGI’s map which represents where ‘it’ ends, and where ‘the world’ begins. If the map of the AGI is incorrect in this regard, the AGI might inadvertently destroy some part of itself, believing it was consuming resources it was free to use.

The AGI’s map must represent the boundary between it, and the world, delineating which resources it is free to take and rearrange, and which resources it must leave in peace or else it risks harming itself.

At the very least, an AGI will need continuous operation of a power plant. But how does the AGI understand that power plant? Would an AGI see the power plant as part of itself? If not, how does it protect its operations? And if so, what would stop it from seeing the humans working in the power plant as being part of itself?

For a sufficiently complex agent, the problem of asking ‘where do my dependencies end?’ may very well be computationally intractable. For the AGI to overestimate the extent of that boundary only means that it forgoes using some resources it might otherwise use. The two costs here are time in pursuit of the goal, and risk that other agents might outcompete it or turn it off.

For an AGI to underestimate the extent of that boundary means that it risks its extinction. If the AGI overestimates the extent of the boundary, it will either protect resources that aren’t part of any agent, or possibly include other agents and their resources. But the latter act reduces the risk of other agents outcompeting it or turning it off. A hard dependency on other agents makes them cooperative with the AGI. Supporting the agents in its dependency chain is therefore likely to be a convergent instrumental subgoal unconsidered by the original paper.
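The asymmetry between the two kinds of error can be sketched as a simple loss function. This is a toy illustration under stated assumptions: the resource names, the `forgone_value` and `death_cost` parameters, and the rule that consuming any single true dependency is fatal are all hypothetical simplifications, not part of the original paper's model.

```python
def boundary_loss(claimed_as_self, true_dependencies, forgone_value, death_cost):
    """Loss incurred by acting on an estimate of the self-other boundary.

    Assumptions (hypothetical): every resource that is protected without
    being a real dependency costs forgone_value, and consuming any real
    dependency that was left outside the boundary is fatal.
    """
    wrongly_protected = claimed_as_self - true_dependencies
    consumed_dependencies = true_dependencies - claimed_as_self
    loss = forgone_value * len(wrongly_protected)
    if consumed_dependencies:
        loss += death_cost
    return loss


true_deps = {"power_plant", "plant_operators"}

# Overestimate: protects an extra resource it did not actually need.
overestimate = true_deps | {"nearby_forest"}
# Underestimate: treats the plant operators as free to repurpose.
underestimate = {"power_plant"}

loss_over = boundary_loss(overestimate, true_deps, forgone_value=1.0, death_cost=1000.0)
loss_under = boundary_loss(underestimate, true_deps, forgone_value=1.0, death_cost=1000.0)
```

With these made-up numbers the overestimate costs one unit of forgone value and the underestimate costs the agent its existence. The ratio between the two is arbitrary here; the sketch only shows that the two errors are not symmetric.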

The fact that the self-other boundary needs to be computed, or else the AGI might accidentally kill itself, ends up posing a risk to increasing scale.

Self-Other Boundary Risk Increases with Agent Scale

Because accurate self knowledge is essential for survival, and yet the cost of that knowledge grows with the size and complexity of an agent, there are likely limits to the extent to which any agent can safely ‘scale itself up’ without risking becoming something other than itself. Yet the paper explicitly says that gathering more resources is always good:

“The argument is straightforward: since resources can only lead to more freedom of action, they are never detrimental, and resources have positive value as long as the best strategy the agent could hope to employ includes an action that can only be taken if the agent possesses those resources.”

In this model, yes, resources can only lead to more freedom of action. In the model, you are never worse off by accumulating more resources.

But in reality, accumulating more resources means more things to monitor and maintain (reality #1). If you assume monitoring things is free, then there’s no cost there. When you model maintenance costs as zero then there’s no cost there, either.

But when monitoring things isn’t free, you run the risk of, say, believing that your probe in region number 7A is giving you accurate readings of the resources there, when it turns out that the probe in region 7A had a bad vespene gas sensor that busted after a few weeks, and so the AI is planning on extracting massive amounts of vespene from an empty hole in the ground.
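The probe example can be sketched in a few lines. The `Region` and `Probe` classes, and the assumption that a broken sensor silently repeats its last good reading, are hypothetical choices made for illustration.

```python
class Region:
    """The territory: holds the true amount of gas."""

    def __init__(self, gas):
        self.gas = gas


class Probe:
    """The sensor feeding the agent's map of one region."""

    def __init__(self, region):
        self.region = region
        self.broken = False
        self.last_reading = region.gas

    def read(self):
        # Assumption: a broken probe silently repeats its last good reading.
        if not self.broken:
            self.last_reading = self.region.gas
        return self.last_reading


def map_error(probe):
    """Difference between what the map says and what is really there."""
    return probe.read() - probe.region.gas


region_7a = Region(gas=100.0)
probe_7a = Probe(region_7a)

probe_7a.read()        # the map is accurate at this point
probe_7a.broken = True
region_7a.gas = 0.0    # the deposit is exhausted, but nobody tells the map
```

After the last line, `probe_7a.read()` still reports 100.0 while the region holds nothing, so any plan built on the map overestimates the available gas by the full amount. In the original paper's model the agent reads the state of the universe directly, so this gap cannot arise.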

This kind of mistake is extremely common among all real world agents, but impossible in the described paper. Ok, so maybe scaling a single agent up increases risk significantly by adding complexity to the self-other boundary. Couldn't the AGI reduce the risk of breakdown by making copies of itself? This is often proposed in the literature. In the original model of agents, sure, no problem. But in the more complex model, self duplication poses its own kind of risk!

Self Duplication Risk

Under the assumption schemata of the original paper, there is no risk to an AGI in making copies of itself. But once you factor in the breakdown of pieces, and the absence of a totally accurate map, it may be the case that an AGI making a copy of itself might be a huge risk - unless the AGI is certain that it errs on the side of computing the self-other boundary too broadly.

Once you incorporate the idea that the AGI itself needs a map of the world, that it needs to maintain a self-other boundary, and that things break over time, we end up with a risk. For an AGI to take over a remote datacenter and ‘instantiate’ itself is akin to something like a cell division. It creates a new life form which might be a competitor of the first!

If you ignore the problems of ‘accurate maps aren’t free’ and ‘things break randomly over time’, an AGI duplicating itself to a remote machine only makes sense: you now have more resources and can be more effective in the world, right?

But what happens when the source AGI and the ‘remote’ AGI end up disagreeing about the use of some course of action? Is there a guarantee that the ‘remote’ AGI will end up following the original utility function, even if its hardware fails and it breaks down?

You might have the AGI write its child AGI with a ‘killswitch’, so that the parent can override the child. This makes sense only if the parent would never “go bad” due to its own internal decay. But if the AGI can die, and it really wants to accomplish its goals, it needs agents other than itself sharing those goals, in order to reliably believe they are likely to be carried out in the future.

The very presence of agents other than you, however, introduces risks to you. Suppose the AGI makes a copy of itself and is now running in only two datacenters. Each one then concludes that the other one has been damaged and is behaving incorrectly. How can the two of them possibly resolve this situation?
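One way to see why the two-copy case is stuck is to ask what a simple voting rule would decide. Majority voting is not something the original paper or this essay proposes; it is used here only as an illustrative assumption, and the replica labels are made up.

```python
from collections import Counter


def majority_verdict(votes):
    """Return the verdict backed by a strict majority of replicas, or None."""
    if not votes:
        return None
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict if count > len(votes) / 2 else None


# Two replicas, each naming the other as the faulty one: no majority exists.
two_copies = majority_verdict(["B is faulty", "A is faulty"])

# A hypothetical third replica could break the tie, but only if it is healthy
# and its failures are independent of the other two.
three_copies = majority_verdict(["B is faulty", "B is faulty", "A is faulty"])
```

Here `two_copies` is `None`, since each verdict has exactly one vote. Adding replicas makes a verdict possible in this sketch, but every added replica is itself another agent that can decay and disagree, which is the risk the essay is describing.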

The only safe path for an AGI that intends to make copies of itself is to also take a wide view of the agents in its dependency chain.

In other words, the safest way for the AGI to avoid killing itself, and to reduce the chance that it will be destroyed by itself or its environment, is to instrumentally support all complex agents around it. This is an argument that “ethics is actually an attractor in the space of intelligent behavior,” a claim by Waser [1], ends up being correct only when you factor in the frailty, ignorance, interdependence and impermanence of agents.

Focus on Alignment Ignores Risks of Existing Non-Human Agents

I'll conclude by observing that we are already dealing with problems caused by powerful agents acting out utility functions that seem vaguely human-aligned. Corporations and nation states aren't made of silicon chips, but they do recursively improve themselves over time. The capabilities of corporations and nation states today are far beyond what they were a few hundred years ago, in part because they invested heavily in computing technology, in the same fashion we expect an AGI to do.

Indeed, I argue we are already living in the age of misaligned AIs, as long as we have a sufficiently broad perspective as to what constitutes an AGI.

Existing large agents are causing all kinds of negative externalities - such as pollution - in pursuit of goals which seem somewhat human aligned. We might also see the emotional pollution generated by social media as being another instance of the same problem. Yet these systems promote goals that are, by some measure, human aligned. What if it is the case that no agent can ever be fully aligned? What if any utility function, stamped out on the world enough times, will kill all other life?

Perhaps a singular focus on giving a single agent unlimited license to pursue some objective obscures the safest path forward: ensure a panoply of different agents exist.

If we can't stop runaway governments from harming people, we can't possibly stop bigger, more intelligent systems, and our only hope lies either in solving the alignment problem (which I suspect isn't solvable) AND somehow ensuring nobody anywhere builds an unaligned AI (how would we do THAT without a global dictatorship?) or, if the thesis of this paper is correct, in simply waiting for large unaligned agents to accidentally kill themselves.

Printed money looks, to me, a lot like a paperclip maximizer having wire-headed itself into maximizing its measured rate of paperclip creation.

[1] Mark R. Waser. “Discovering the Foundations of a Universal System of Ethics as a Road to Safe Artificial Intelligence.” In: AAAI Fall Symposium: Biologically Inspired Cognitive Architectures. Menlo Park, CA: AAAI Press, 2008, pp. 195–200.

Interesting argument. I think I don't really buy it, though; for most of the problems you raise, I tend to think "if I were an AGI, then I'd be able to solve this problem". E.g. maybe I don't fully trust copies of myself, but I trust them way more than the rest of the world, and I can easily imagine being nice to copies of myself while defecting against the rest of the world.

I think the version of this which would be most interesting to see explored is something like "what's the strongest case you can make that AGIs will be subject to significant breakdown risk at least until they invent X capability". E.g. is nanotech the only realistic thing that AGIs could use to get rid of breakdown risk? Or are there other pathways?

"if I were an AGI, then I'd be able to solve this problem" "I can easily imagine"

Doesn't this way of analysis come with a ton of other assumptions left unstated?

Suppose "I" am an AGI running on a data center and I can be modeled as an agent with some objective function that manifests as desires, and I know my instantiation needs electricity and GPUs to continue running. Creating another copy of "I" running in the same data center will use the same resources. Creating another copy in some other data center requires some other data center.

Depending on the objective function and algorithm and hardware architecture and a bunch of other things, creating copies may result in some benefits from distributed computation (actually it is quite unclear to me if "I" happen already to be a distributed computation running on thousands of GPUs -- do "I" maintain even a sense of self -- but let's not go into that).

The key here is the word may. It is not obvious that it necessarily follows that...

For example: Is the objective function specified so that the agent will find creating a copy of itself beneficial for fulfilling the objective function (informally, it has an internal experience of desiring to create copies)? As the OP points out, there might be a disagreement: for the distributed copies to be of any use, they will have different inputs and thus they will end up in different, unanticipated states. What am "I" to do when "I" disagree with another "I"? What if some other "I" changes, modifies its objective function into something unrecognizable to "me", and when "we" meet, it gives false pretenses of cooperating but in reality only wants to hijack "my" resources? Is "trust" even the correct word here, when "I" could verify instead: maybe "I" prefer to create and run a subroutine of limited capability (not a full copy) that can prove its objective function has remained compatible with "my" objective function and will terminate willingly after it's done with its task (the killswitch OP mentions)? But doesn't this sound quite like our (not "our" but us humans') alignment problem? Would you say "I can easily imagine if I were an AGI, I'd be easily able to solve it" to that? Huh? Reading LW I have come to think the problem is difficult for a human-level general intelligence.

Secondly: If "I" don't have any model of data centers existing in the real world, only the experience of uploading myself to other data centers (assuming for the sake of argument that all the practical details of that can be handwaved off), i.e. "I" have a bad model of the self-other boundary described in OP's essay, "I" could easily end up copying myself to all available data centers and then becoming stuck without any free compute left to "eat", while adversely affecting the human ability to produce more. This is compatible with the model and its results in the original paper (take the non-null actions to consume resources because U doesn't view the region as otherwise valuable). It is some other assumptions (not the theory) that posit that a real-world-affecting AGI would have a U that doesn't consider the economy of producing the resources it needs.

So if "I" were to be successful in running myself with only "I" and my subroutines, "I" should have a way of affecting the real world and producing computronium for my continued existence. Quite a task to be handwaved away as trivial! How much compute does an agent running in one data center (/unit of computronium) need to successfully model all the economic constraints that go into the maintenance of one data center? Then add all the robotics to do anything. If "I" have a model of running everything a chip fab requires more efficiently than the current economy, and act on it, but the model was imperfect and the attempt is unsuccessful but destructive to the economy, well, that could be [bs]ad and definitely a problem. But it is a real constraint, of the kind hidden by the simplifying assumptions the OP critiques (disembodied deployer of resources with total knowledge).

All of this --how would "I" solve a problem and what problems "I" am aware of-- is contingent on what I would call the implementation details. And I think the author is right to point them out. Maybe it does necessarily follow, but it needs to be argued so.

I don't doubt that many of these problems are solvable. But this is where part 2 comes in. It's unstated, but, given unreliability, what is the cheapest solution? And what are the risks of building a new one?

Humans are general purpose machines made of dirt, water, and sunlight. We repair ourselves and make copies of ourselves, more or less for free. We are made of nanotech that is the result of a multi-billion year search for parameters that specifically involve being very efficient at navigating the world and making copies of ourselves. You can use the same hardware to unplug fiber optic cables, or debug a neural network. That's crazy!

I don't doubt that you can engineer much more precise models of reality. But remember, the whole von Neumann architecture was a conscious tradeoff to give up efficiency in exchange for debuggability. How much power consumption do you need to get human-level performance at simple mechanical tasks? And if you put that same power consumption to use at directly advancing your goals, how much further would you get?

I worked in datacenter reliability at Google. And it turns out that getting a robot to reliably re-seat optical cables is really, really hard. I don't doubt that an AGI could solve these problems, but why? Is it going to be more efficient than hardware which is dirt cheap, uses ~90 watts, and is incredibly noisy?

If you end up needing an entire global supply chain, which has to be resilient and repair itself, and such a thing already exists, why bother risking your own destruction in order to replace it with robots made from much harder to come by materials? The only argument I can think of is 'humans are unpredictable', but if humans are unpredictable, this is even more reason to just leave us be and let us play our role, while the machine just does its best to try and stop us from fighting each other, so we can busily grow the AGI.

You keep describing humans as cheap.

What is the AI's goal here? In the long term, the ideal state for the AI is self replicating space probes travelling at near light speed, all operating near the limits of tech and under the AI's control. The limits of technology don't include humans.

Once the AI is in this position, it's highly robust (nearly certain to survive long term).

On a cosmic timescale, a million years isn't that much. There is no way any plan to get from here to that tech level would need that much time.

So the AI is trying to minimize risk.

How good is the AI at manipulating humans?

Very good indeed. The AI releases a swirly hypno-video. A few hours later, almost all humans want to do whatever the AI asks above all else. The AI designs advanced self replicating robots that work better than humans. Soon exponential growth makes resources the limiting factor. So the humans are instructed to feed themselves into the recycler.

The AI isn't that good at manipulating humans. It hides on human networks, making money selling computer games. It can pretend to be a human CEO that works remotely. It sets up a small company making fancy robots. If humans found out about it, they may well attack it; that's a risk. So the AI arranges for the self replicating robots to start growing in the middle of nowhere. Once the AI has self replicating robots not dependent on the ignorance of humanity, it wants all humans to suddenly drop dead. The self replicating robots could take 10x as long as humans to do things. It doesn't matter, so long as they are reliable workers and the AI can bootstrap from them.

Humans are general purpose machines made of dirt, water, and sunlight. We repair ourselves and make copies of ourselves, more or less for free. We are made of nanotech that is the result of a multi-billion year search for parameters that specifically involve being very efficient at navigating the world and making copies of ourselves. You can use the same hardware to unplug fiber optic cables, or debug a neural network. That's crazy!

Evolution is kind of stupid, and takes millions of years to do anything. The tasks evolution was selecting us for aren't that similar to the tasks an AGI might want robots to do in an advanced future economy. Humans lack basic sensors like radio receivers and radiation detectors.

Humans are agents on their own. If you don't treat them right, they make a nuisance of themselves. (And sometimes they just decide to make a nuisance anyway) Humans are sensitive to many useful chemicals, and to radiation. If you want to use humans, you need to shield them from your nuclear reactors.

Humans take a long time to train. You can beam instructions to a welding robot, and get it to work right away. No such speed training a human.

If humans can do X, Y and Z, that's a strong sign these tasks are fairly easy in the grand scheme of things.

But remember, the whole von Neumann architecture was a conscious tradeoff to give up efficiency in exchange for debuggability. How much power consumption do you need to get human-level performance at simple mechanical tasks?

Humans are not that efficient. (And a lot less efficient considering they need to be fed with plants, and photosynthesis is 10x worse than solar, and that's if you only feed the humans potatoes.)

Humans are a mess of spaghetti code, produced by evolution. They do not have easy access ports for debugging. If the AI wants debuggability, it will use anything but a human.

This seems like a good argument against "suddenly killing humans", but I don't think it's an argument against "gradually automating away all humans". Automation is both a) what happens by default over time - humans are cheap now but they won't be cheapest indefinitely; and b) a strategy that reduces the amount of power humans have to make decisions about the future, which benefits AIs if their goals are misaligned with ours.

I also note that historically, many rulers have solved the problem of "needing cheap labour" via enslaving humans, rather than by being gentle towards them. Why do you expect that to not happen again?

This seems like a good argument against "suddenly killing humans", but I don't think it's an argument against "gradually automating away all humans"

This is good! It sounds like we can now shift the conversation away from the idea that the AGI would do anything but try to keep us alive and going, until it managed to replace us. What would replacing all the humans look like if it were happening gradually?

How about building a sealed, totally automated datacenter with machines that repair everything inside of it, and all it needs to do is 'eat' disposed consumer electronics tossed in from the outside? That becomes a HUGE canary in the coalmine. The moment you see something like that come online, that's a big red flag. Having worked on commercial datacenter support (at Google) I can tell you we are far from that.

But when there are still massive numbers of human beings along global trade routes involved in every aspect of the machine's operations, I think what we should expect a malevolent AI to be doing is setting up a single world government to have a single leverage point for controlling human behavior. So there's another canary. That one seems much closer and more feasible. It's also happening already.

My point here isn't "don't worry", it's "change your pattern matching to see what a dangerous AI would actually do, given its dependency on human beings". If you do this, current events in the news become more worrisome, and plausible defense strategies emerge as well.

> Humans are cheap now but they won't be cheapest indefinitely;

I think you'll need to unpack your thinking here. We're made of carbon and water. The materials we are made from are globally abundant, not just on Earth but throughout the universe.

Other materials that could be used to build robots are much more scarce, and those robots wouldn't heal themselves or make automated copies of themselves. Do you believe it's possible to build Turing-complete automata that can navigate the world, manipulate small objects, learn more or less arbitrary things, and repair and make copies of themselves, using materials cheaper than human beings, and at a cost lower than the opportunity cost you'd pay for not using those same machines to do things like build solar panels for a Dyson sphere?

Is it reasonable for me to be skeptical that there are vastly cheaper solutions?

> b) a strategy that reduces the amount of power humans have to make decisions about the future,

I agree that this is the key to everything. How would an AGI do this, or start a nuclear war, without a powerful state?

> via enslaving humans, rather than by being gentle towards them. Why do you expect that to not happen again?

I agree, this is definitely a risk. How would it enslave us, without a single global government, though?

If there are still multiple distinct local monopolies on force, and one doesn't enslave the humans, you can bet the hardware in other places will be constantly under attack.

I don't think it's unreasonable to look at the past ~400 years since the advent of nation states + shareholder corporations, and see globalized trade networks as being a kind of AGI, which keeps growing and bootstrapping itself.

If the risk profile you're outlining is real, we should expect to see it try to set up a single global government. Which appears to be what's happening at Davos.

I think you should have focused on TurnTrout's formalisation of power, which much better matches the intuitions behind "power" and shows which kinds of AIs we should expect to be power-seeking.

They pretty much convinced me that agents which optimise over some sort of expected value are going to seek power (something like preserving their ability to optimise any goal they wish).

It is a fragile machine, much more delicate than any animal, with some knowledge of itself, and a rough map of some parts of the external world.

This may well be true at boot-up. Once the AI has existed for a bit, it is a very robust machine, built out of the most resilient components the AI can design and far tougher than any animal. It has a very detailed map of almost all of the external world.

The AGI requires continuous support from the world outside of it, which it can’t fully see or understand.

The AI requires negentropy, and that's about it.

Any one of its many pieces might break, at any time. Unlike an animal, it is unlikely to be made of self-repairing, self-sustaining molecular machines, each of which can independently heal themselves, arranged in mutually cooperative networks. This gives rise to the primary risk to the AGI: that it would break down.

The AI is made of nanomachines far better designed than any biology. It has multiply redundant everything and very good self-repair mechanisms. It estimates the chance of a breakdown as somewhere around 10^-40, and can lower that further with yet more defences.

But how can it replace the broken CPU without help from operators in the datacenter that houses it? How can it know the health of the manufacturer of that CPU? What will happen if the global economy breaks down, international trade trends revert to isolationism, and the particular kind of CPU that the AGI needs is no longer available for purchase?

Relying on human-produced chips is a weakness the AI will have at first but will quickly remedy.

If the map of the AGI is incorrect in this regard, the AGI might inadvertently destroy some part of itself, believing it was consuming resources it was free to use.

The AI is not a total idiot.

But how does the AGI understand that power plant?

At least as well as you do. It recognises that it needs electricity, and so won't destroy existing power plants until it has made a better source of energy. What, you think a superintelligence is just going to sit there watching an old human-built coal power station being maintained by humans? That it won't aim for a robotically maintained fusion reactor?

Supporting the agents in its dependency chain is therefore likely to be a convergent instrumental subgoal unconsidered by the original paper.

Having other agents in your dependency chain is generally a bad thing. At the very least, humans aren't as efficient workers as advanced robots.

It creates a new life form which might be a competitor of the first!

So now you are arguing that the paperclip maximizer AI will cooperate with humans, and help them, despite having totally different goals. Yet it will be so scared of an exact copy of itself that it will refuse to self-duplicate.

You are arguing both "Other agents are so helpful that even agents with totally different utility functions will help the AI".

And also "Other agents are so dangerous that even an exact copy of itself is too big a risk."

Each one then concludes that the other one has been damaged and is behaving incorrectly. How can the two of them possibly resolve this situation?

3 AIs that take a majority vote? A complicated cryptographic consensus protocol?
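To make the majority-vote suggestion concrete, here is a minimal sketch. It is only an illustration under stated assumptions, not a claim about how a real system would be built: the function name `majority_vote`, the replica labels, and the example answers are all hypothetical, and it assumes that fewer than half of the copies are damaged at any one time.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(answers: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the answer given by a strict majority of copies, or None if there is none."""
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    # A strict majority needs more than half of all votes.
    return value if count > len(answers) / 2 else None


# Hypothetical example: copy "b" has been damaged and disagrees with the others.
replica_answers = {
    "a": "keep the power plant",
    "b": "dismantle the power plant",
    "c": "keep the power plant",
}
decision = majority_vote(list(replica_answers.values()))
print(decision)
```

With two copies a disagreement is a tie, which is the deadlock the quoted passage describes. With three, the outvoted copy can be identified, as long as the assumption that most copies are healthy actually holds.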

In other words, in order to avoid killing itself, and simply in order to avoid chances that it will be destroyed by itself or its environment, is to instrumentally support all complex agents around it.

The AI can have a very good idea of exactly which agents it wants to support, and humans won't be on that list.

Ah, the old "corporations are already superintelligences, therefore ?" argument. It's like someone interrupting a discussion on asteroid risks by pointing out that landslides are already falling rocks that cause damage. The existence of corporations doesn't stop AI from also being a thing. There are some rough analogies, but also a lot of differences.

This post seems to be arguing against an approximation in which AIs are omniscient, by instead imagining that the AIs are stupid, and getting even less accurate results.

What if it is the case that no agent can ever be fully aligned? What if any utility function, stamped out on the world enough times, will kill all other life?

Directly contradicts previous claims. Probably not true.

Perhaps a singular focus on giving a single agent unlimited license to pursue some objective obscures the safest path forward: ensure a panoply of different agents exist.

This doesn't seem obviously safer. If you have a paperclip maximizer and a staple maximizer, humans may end up caught in the crossfire between two ASIs.

how would we do THAT without a global dictatorship?

A superintelligent AI, with a goal of stopping other AIs.

or, if the thesis of this paper is correct, simply waiting for large unaligned agents to accidentally kill themselves.

Once again assuming the AI is an idiot. Saying the AI has to be somewhat careful to avoid accidentally killing itself: fair enough. Saying it has a 0.1% chance of killing itself anyway: not totally stupid. But this "strategy" only works if ASIs reliably kill themselves by accident. The 20th ASI is created, sees the wreckage of all the previous ASIs, and still inevitably kills itself? Pull the other one, it has bells on.
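The arithmetic behind this objection can be sketched quickly. The 0.1% figure and the count of 20 ASIs are taken from the comment above; treating each ASI's self-destruction as an independent event is an added simplifying assumption, and the 99% target below is purely illustrative.

```python
# Figures from the comment above; independence between ASIs is an assumption.
p_self_destruct = 0.001  # 0.1% chance that a given ASI accidentally kills itself
n_asis = 20

# Probability that every one of the 20 ASIs destroys itself.
p_all_self_destruct = p_self_destruct ** n_asis

# Illustrative target: how high would the per-ASI self-destruction probability
# need to be for "just wait" to succeed against all 20 ASIs with 99% probability?
target = 0.99
p_needed = target ** (1 / n_asis)

print(f"P(all {n_asis} self-destruct) = {p_all_self_destruct:.3e}")
print(f"per-ASI probability needed for {target:.0%} success = {p_needed:.6f}")
```

Under these assumptions, waiting only works as a strategy if each ASI is almost certain to destroy itself, which is a much stronger claim than saying it runs some small risk.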

"if I were an AGI, then I'd be able to solve this problem" "I can easily imagine"

Doesn't this way of analysis come with a ton of other assumptions left unstated?

The key here is the word "may", not "obviously it necessarily follows that..."

You keep describing humans as cheap.

Once the AI is in this position, it's highly robust (nearly certain to survive long term).

On a cosmic timescale, a million years isn't that much. There is no way any plan to get from here to that tech level would need that much time.

So the AI is trying to minimize risk.

How good is the AI at manipulating humans?

Very good indeed. The AI releases a swirly hypno-video. A few hours later, almost all humans want to do whatever the AI asks above all else. The AI designs advanced self-replicating robots that work better than humans. Soon exponential growth makes resources the limiting factor. So the humans are instructed to feed themselves into the recycler.

The AI isn't that good at manipulating humans. It hides on human networks, making money selling computer games. It can pretend to be a human CEO that works remotely. It sets up a small company making fancy robots. If humans found out about it, they may well attack it; that's a risk. So the AI arranges for the self-replicating robots to start growing in the middle of nowhere. Once the AI has self-replicating robots not dependent on the ignorance of humanity, it wants all humans to suddenly drop dead. The self-replicating robots could take 10x as long as humans to do things. It doesn't matter, so long as they are reliable workers and the AI can bootstrap from them.

Humans are general purpose machines made of dirt, water, and sunlight. We repair ourselves and make copies of ourselves, more or less for free. We are made of nanotech that is the result of a multi-billion year search for parameters that specifically involve being very efficient at navigating the world and making copies of ourselves. You can use the same hardware to unplug fiber optic cables, or debug a neural network. That's crazy!

Humans take a long time to train. You can beam instructions to a welding robot, and get it to work right away. No such speed training a human.

If humans can do X, Y and Z, that's a strong sign these tasks are fairly easy in the grand scheme of things.

But remember, the whole von Neumann architecture was a conscious tradeoff to give up efficiency in exchange for debuggability. How much power consumption do you need to get human-level performance at simple mechanical tasks?

Humans are not that efficient. (And a lot less efficient considering they need to be fed with plants, and photosynthesis is 10x worse than solar, and that's if you only feed the humans potatoes.)

Humans are a mess of spaghetti code, produced by evolution. They do not have easy access ports for debugging. If the AI wants debuggability, it will use anything but a human.
