Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The paperclip maximizer is a thought experiment about a hypothetical superintelligent AGI that is obsessed with maximizing paperclips. It can be modeled as a utility-theoretic agent whose utility function is proportional to the number of paperclips in the universe. The Orthogonality Thesis argues for the logical possibility of such an agent. It comes in weak and strong forms:

The weak form of the Orthogonality Thesis says, "Since the goal of making paperclips is tractable, somewhere in the design space is an agent that optimizes that goal."

The strong form of Orthogonality says, "And this agent doesn't need to be twisted or complicated or inefficient or have any weird defects of reflectivity; the agent is as tractable as the goal." That is: When considering the necessary internal cognition of an agent that steers outcomes to achieve high scores in some outcome-scoring function U, there's no added difficulty in that cognition except whatever difficulty is inherent in the question "What policies would result in consequences with high U-scores?"
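As a rough illustration of the strong form, here is a toy sketch (not part of the original thesis statement) in which the search for a good policy is completely generic and the goal enters only through the scoring function U. The names `ACTIONS`, `outcome`, `best_policy`, `paperclip_utility`, and `staple_utility` are invented for this example, and the three-action world is an assumption chosen to keep brute-force search trivial.

```python
from itertools import product

# Hypothetical toy world: each action turns one unit of raw material
# into an item, or does nothing.
ACTIONS = ("make_paperclip", "make_staple", "idle")


def outcome(policy):
    """Return the item counts that result from running a policy."""
    counts = {"paperclips": 0, "staples": 0}
    for action in policy:
        if action == "make_paperclip":
            counts["paperclips"] += 1
        elif action == "make_staple":
            counts["staples"] += 1
    return counts


def best_policy(utility, horizon=3):
    """Brute-force search over policies; only `utility` is goal-specific."""
    return max(product(ACTIONS, repeat=horizon),
               key=lambda policy: utility(outcome(policy)))


def paperclip_utility(result):
    return result["paperclips"]


def staple_utility(result):
    return result["staples"]


print(best_policy(paperclip_utility))
print(best_policy(staple_utility))
```

Swapping `paperclip_utility` for `staple_utility` changes what the agent makes but not how it searches, which is the sense in which the agent is "as tractable as the goal" in this toy setting.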

This raises a number of questions:

  • Why would it be likely that the future would be controlled by utility-maximizing agents?
  • What sorts of utility functions are likely to arise?

A basic reason to expect the far future to be controlled by utility-maximizing agents is that utility theory is the theory of making tradeoffs under uncertainty, and agents that make plans far into the future are likely to make tradeoffs, since tradeoffs are necessary for their plans to succeed. They will be motivated to make tradeoffs leading to controlling the universe almost regardless of what U is, as long as U can only be satisfied by pumping the distant future into a specific part of the possibility space. Whether an agent seeks to maximize paperclips, minimize entropy, or maximize the amount of positive conscious experience, it will be motivated, in the short term, to cause agents sharing its values to have more leverage over the far future. This is the basic instrumental convergence thesis.
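The convergence claim can be sketched with a toy planning problem. Everything here is an assumption made for illustration: the two actions, the additive resource dynamics, and the per-goal conversion rates in `GOALS` are hypothetical. The point is only that agents with different terminal goals pick the same early "acquire resources" steps.

```python
from itertools import product

# Hypothetical conversion rates: goal units produced per unit of resource.
GOALS = {
    "paperclips": 1.0,
    "low_entropy": 0.5,
    "positive_experience": 3.0,
}


def plan_value(plan, rate):
    """Score a plan: 'acquire' adds a resource, 'produce' yields goal units."""
    resources, produced = 1, 0.0
    for step in plan:
        if step == "acquire":
            resources += 1
        else:  # "produce"
            produced += resources * rate
    return produced


def best_plan(rate, horizon=5):
    """Exhaustively search all plans of the given length for one goal."""
    plans = product(("acquire", "produce"), repeat=horizon)
    return max(plans, key=lambda plan: plan_value(plan, rate))


for goal, rate in GOALS.items():
    print(goal, best_plan(rate))
```

In this sketch every goal yields the same plan, with resource acquisition first and goal-specific production afterwards, because the conversion rate only rescales the score and never changes which plan is best.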

Biological organisms are one example of approximately utility-maximizing agents that we know about. They model the world and have goals with respect to the world, which are to some degree resistant to wireheading (thus constituting environmental goals). They make tradeoffs to achieve these goals, which are correlated with survival and reproduction. The goals that biological organisms are likely to end up with will be (a) somewhat likely to arise from pre-existing processes such as genetic mutation, and (b) well-correlated enough with survival and reproduction that an agent optimizing for them will be likely to replicate more agents with similar goals. However, these goals need not be identical with inclusive fitness to be likely goals for biological organisms. Inclusive fitness itself may be too unlikely to arise as a goal from genetic mutation and so on to be a more popular value function than proxies for it.

However, there are a number of goals and values in the human environment that are not well-correlated with inclusive fitness. These are generally parts of social systems. Some examples include skill at games such as sports, progress in a research field such as mathematics, and maximization of profit (although this one is at least related to inclusive fitness in a more direct way than the others). Corresponding institutions that incentivize (generally human) agents to optimize for these goals include gaming/sports leagues, academic departments, and corporations.

It is quite understandable that goals well-correlated with inclusive fitness would be popular, but why would goals that are not well-correlated with inclusive fitness also be popular? Moldbug's Fnargl thought experiment might shed some light on this:

So let's modify this slightly and instead look for the worst possible rational result. That is, let's assume that the dictator is not evil but simply amoral, omnipotent, and avaricious.

One easy way to construct this thought-experiment is to imagine the dictator isn't even human. He is an alien. His name is Fnargl. Fnargl came to Earth for one thing: gold. His goal is to dominate the planet for a thousand years, the so-called "Thousand-Year Fnarg," and then depart in his Fnargship with as much gold as possible. Other than this Fnargl has no other feelings. He's concerned with humans about the way you and I are concerned with bacteria.

You might think we humans, a plucky bunch, would say "screw you, Fnargl!" and not give him any gold at all. But there are two problems with this. One, Fnargl is invulnerable---he cannot be harmed by any human weapon. Two, he has the power to kill any human or humans, anywhere at any time, just by snapping his fingers.

Other than this he has no other powers. He can't even walk---he needs to be carried, as if he was the Empress of India. (Fnargl actually has a striking physical resemblance to Jabba the Hutt.) But with invulnerability and the power of death, it's a pretty simple matter for Fnargl to get himself set up as Secretary-General of the United Nations. And in the Thousand-Year Fnarg, the UN is no mere sinecure for alcoholic African kleptocrats. It is an absolute global superstate. Its only purpose is Fnargl's goal---gold. And lots of it.

In other words, Fnargl is a revenue maximizer. The question is: what are his policies? What does he order us, his loyal subjects, to do?

The obvious option is to make us all slaves in the gold mines. Otherwise---blam. Instant death. Slacking off, I see? That's a demerit. Another four and you know what happens. Now dig! Dig! (Perhaps some readers have seen Blazing Saddles.)

But wait: this can't be right. Even mine slaves need to eat. Someone needs to make our porridge. And our shovels. And, actually, we'll be a lot more productive if instead of shovels, we use backhoes. And who makes those? And...

We quickly realize that the best way for Fnargl to maximize gold production is simply to run a normal human economy, and tax it (in gold, natch). In other words, Fnargl has exactly the same goal as most human governments in history. His prosperity is the amount of gold he collects in tax, which has to be exacted in some way from the human economy. Taxation must depend in some way on the ability to pay, so the more prosperous we are, the more prosperous Fnargl is.

Fnargl's interests, in fact, turn out to be oddly well-aligned with ours. Anything that makes Fnargl richer has to make us richer, and vice versa.

For example, it's in Fnargl's interest to run a fair and effective legal system, because humans are more productive when their energies aren't going into squabbling with each other. It's even in Fnargl's interest to have a fair legal process that defines exactly when he will snap his fingers and stop your heart, because humans are more productive when they're not worried about dropping dead.

And it is in his interest to run an orderly taxation system in which tax rates are known in advance, and Fnargl doesn't just seize whatever, whenever, to feed his prodigious gold jones. Because humans are more productive when they can plan for the future, etc. Of course, toward the end of the Thousand-Year Fnarg, this incentive will begin to diminish---ha ha. But let's assume Fnargl has only just arrived.

Other questions are easy to answer. For example, will Fnargl allow freedom of the press? But why wouldn't he? What can the press do to Fnargl? As Bismarck put it: "they say what they want, I do what I want." But Bismarck didn't really mean it. Fnargl does.

One issue with the Fnargl thought experiment is that, even with the power of death, Fnargl may lack the power to rule the world, since he relies on humans around him for information, and those humans have incentives to deceive him. However, this is an aside; one could modify the thought experiment to give Fnargl extensive surveillance powers.

The main point is that, by monomaniacally optimizing for gold, Fnargl rationally implements processes for increasing overall resources, for converting efficiently between different resources, and for making coherent tradeoffs between them, along with a coherent system (including legalistic aspects and so on) for making these tradeoffs in a rational manner. This leads to a Fnargl-ruled civilization "succeeding" in the sense of having a strong material economy, high population, high ability to win wars, and so on. Moldbug asserts that Fnargl's interests are well-aligned with ours, which is more speculative. Due to convergent instrumentality, Fnargl will implement the sort of infrastructure that rational humans would implement, although the implied power competition would reduce the level of alignment.
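A minimal numerical sketch of why a gold-maximizer prefers a functioning economy to outright seizure is given here. The dynamics are assumptions made for illustration and are not from Moldbug's text: a single `wealth` number, a constant `tax_rate`, and a fixed `growth` rate applied to whatever is left after tax. The function names are likewise hypothetical.

```python
def total_gold(tax_rate, years=1000, wealth=100.0, growth=0.05):
    """Gold collected over the whole reign at a constant tax rate.

    Assumed dynamics: each year Fnargl takes `tax_rate` of the economy's
    wealth, and the remainder grows by `growth` before the next year.
    """
    collected = 0.0
    for _ in range(years):
        tax = tax_rate * wealth
        collected += tax
        wealth = (wealth - tax) * (1.0 + growth)
    return collected


def best_rate(years=1000):
    """Pick the best constant tax rate from a 1% grid."""
    rates = [i / 100 for i in range(101)]
    return max(rates, key=lambda rate: total_gold(rate, years=years))


print(best_rate())              # long reign
print(best_rate(years=1))       # final year of the reign
print(total_gold(1.0))          # seize everything immediately
```

Under these assumptions, seizing everything in year one yields only the initial 100 units, while over a long reign the best rate on the grid is strictly between 0 and 1, since growth left in the economy compounds into future tax. With a one-year horizon the best rate is 100%, which mirrors the quoted remark that the incentive to tax predictably diminishes toward the end of the Thousand-Year Fnarg.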

By whatever "success" metric for civilizations we select, it is surely possible to do better than optimizing for gold, as it is possible for an organism to gain more inclusive fitness by having values that are more well-aligned with inclusive fitness. But even a goal as orthogonal to civilizational success as gold-maximization leads to a great deal of civilizational success, due to civilizational success being a convergent instrumental goal.

Moreover, the simplicity and legibility of gold-maximization simplifies coordination compared to a more complex proxy for civilizational success. A Fnargl-ocracy can evaluate decisions (such as decisions related to corporate governance) using a uniform gold-maximization standard, leading to a high degree of predictability, and simplicity in prioritization calculations.

What real-world processes resemble Fnargl-ocracy? One example is Bitcoin. Proof-of-work creates incentives for maximizing a certain kind of cryptographic puzzle-solving. The goal itself is rather orthogonal to human values, but Bitcoin nonetheless creates incentives for goals such as creating computing machinery, which are human-aligned due to convergent instrumentality (additional manufacturing of computing infrastructure can be deployed to other tasks that are more directly human-aligned).

As previously mentioned, sports and gaming are popular goals that are fairly orthogonal to human values. Sports incentivize humans and groups of humans to become more physically and mentally capable, leading to more generally-useful fitness practices such as weight training, and agency-related mental practices, which people can learn about by listening to athletes and coaches. Board games such as chess incentivize practical rationality and general understanding of rationality, including AI-related work such as the Minimax algorithm, Monte-Carlo Tree Search, and AlphaGo. Bayesian probability theory was developed in large part to analyze gambling games. Speedrunning has led to quite a lot of analysis of video games and practice at getting better at these games, by setting a uniform standard by which gameplay runs can be judged.

Academic fields, especially STEM-type fields such as mathematics, involve shared, evaluable goals that are not necessarily directly related to human values. For example, number theory is a major subfield of mathematics, and its results are rarely directly useful, though progress in number theory, such as the proof of Fermat's last theorem, is widely celebrated. Number theory does, along the way, produce more generally-useful work, such as Peano arithmetic (and proof theory more generally), Gödel's results, and cryptographic algorithms such as RSA.

Corporations are, in general, supposed to maximize profit conditional on legal compliance and so on. While profit-maximization comes apart from human values, corporations are, under conditions of rule of law, generally incentivized to produce valuable goods and services at minimal cost. This example is less like a paperclip maximizer than the previous examples, as the legal and economic system that regulates corporations has been in part designed around human values. The simplicity of the money-maximization goal, however, allows corporations to make internal decisions according to a measurable, legible standard, instead of dealing with more complex tradeoffs that could lead to inconsistent decisions (which may be "money-pumpable" as VNM violations tend to be).
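To make the "money-pumpable" remark concrete, here is a small sketch of a money pump. The setup is hypothetical: an agent holds one item, a trader offers swaps for a fixed fee, and the agent accepts any swap to an item it strictly prefers. The function and variable names are invented for this example. An agent with cyclic preferences (A over B, B over C, C over A) pays fees forever, while an agent whose preferences come from a single utility ranking does not.

```python
def run_money_pump(prefers, start_item, offers, fee, rounds, money):
    """Offer swaps in a fixed cycle; the agent pays `fee` for each swap
    to an item it strictly prefers over the one it currently holds."""
    item = start_item
    for i in range(rounds):
        offered = offers[i % len(offers)]
        if prefers(offered, item):
            item = offered
            money -= fee
    return item, money


# Cyclic (intransitive) preferences: each pair means "first beats second".
CYCLE = {("A", "B"), ("B", "C"), ("C", "A")}


def cyclic_prefers(x, y):
    return (x, y) in CYCLE


# Consistent preferences derived from one utility ranking.
UTILITY = {"A": 3, "B": 2, "C": 1}


def consistent_prefers(x, y):
    return UTILITY[x] > UTILITY[y]


offers = ["C", "B", "A"]
print(run_money_pump(cyclic_prefers, "A", offers, 1, 9, 10))      # ('A', 1)
print(run_money_pump(consistent_prefers, "A", offers, 1, 9, 10))  # ('A', 10)
```

After nine offers the cyclic agent holds the same item it started with but has paid nine fees, whereas the consistent agent declines every offer and keeps its money. A single legible objective such as profit rules out this kind of exploitable inconsistency by construction.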

Some systems are relatively more loaded on human values, and less like paperclip maximizers. Legal systems are designed and elaborated on in a way that takes human values into account, in terms of determining which behaviors are generally considered prosocial and antisocial. Legal decisions form precedents that formalize certain commitments including trade-offs between different considerations. Religions are also designed partially around human values, and religious goals tend to be aligned with self-replication, by for example encouraging followers to have children, to follow legalistic norms with respect to each other, and to spread the religion.

The degree to which commonly-shared social goals can be orthogonal to human values is still, however, striking. These goals are a kind of MacGuffin, as Zvi wrote about:

Everything is, in an important sense, about these games of signaling and status and alliances and norms and cheating. If you don't have that perspective, you need it.

But let's not take that too far. That's not all such things are about.  Y still matters: you need a McGuffin. From that McGuffin can arise all these complex behaviors. If the McGuffin wasn't important, the fighters would leave the arena and play their games somewhere else. To play these games, one must make a plausible case one cares about the McGuffin, and is helping with the McGuffin.

Otherwise, the other players of the broad game notice that you're not doing that. Which means you've been caught cheating.

Robin's standard reasoning is to say, suppose X was about Y. But if all we cared about was Y, we'd simply do Z, which is way better at Y. Since we don't do Z, we must care about something else instead. But there's no instead; there's only in addition to.

A fine move in the broad game is to actually move towards accomplishing the McGuffin, or point out others not doing so. It's far from the only fine move, but it's usually enough to get some amount of McGuffin produced.

By organizing around a MacGuffin (such as speedrunning), humans can coordinate around a shared goal, and make uniform decisions around this shared goal, which leads to making consistent tradeoffs in the domain related to this goal. The MacGuffin can, like gold-maximization, be basically orthogonal to human values, and yet incentivize instrumental optimization that is convergent with that of other values, leading to human value satisfaction along the way.

Adopting a shared goal has the benefit of making it easy to share perspective with others. This can make it easier to find other people who think similarly to oneself, and to develop practice coordinating with them, with performance judged on a common standard. Altruism can have this effect, since in being altruistic, individual agents "erase" their own index, sharing an agentic perspective with others; people meeting friends through effective altruism is an example of this.

It is still important, to human values, that the paperclip-maximizer-like processes are not superintelligent; while they aggregate compute and agency across many humans, they aren't nearly as strongly superintelligent as a post-takeoff AGI. Such an agent would be able to optimize its goal without the aid of humans, and would be motivated to limit humans' agency so as to avoid humans competing with it for resources. Job automation worries are, accordingly, in part the worry that existing paperclip-maximizer-like processes (such as profit-maximizing corporations) may become misaligned with human welfare as they no longer depend on humans to maximize their respective paperclips.

For superintelligent AGI to be aligned with human values, therefore, it is much more necessary for its goals to be directly aligned with human values, even more than the degree to which human values are aligned with inclusive evolutionary fitness. This requires overcoming preference falsification, and taking indexical (including selfish) goals into account.

To conclude, paperclip-maximizer-like processes arise in part because the ability to make consistent, legible tradeoffs is a force multiplier. The paperclip-maximization-like goals (MacGuffins) can come apart from both replicator-type objectives (such as inclusive fitness) and human values, although they can be aligned in a non-superintelligent regime due to convergent instrumentality. It is hard to have a great deal of influence over the future without making consistent tradeoffs, and already-existing paperclip-maximizer-like systems provide examples of the power of legible utility functions. As automation becomes more powerful, it becomes more necessary, for human values, to design systems that optimize goals aligned with human values.

4 comments

Instrumental convergence makes differences in values hard to notice, so there can be abundant examples of misalignment that remain unobtrusive. The differences only become a glaring problem with enough inequality of power, when coercing or outright overwriting others becomes feasible (Fnargl only reaches the coercing stage, but not the overwriting stage). Thus even differences in values between humans and randomly orthogonal AGIs can seem non-threatening until they aren't, the same as differences in human values can remain irrelevant for average urban dwellers.

Alignment on values becomes crucial given overwriting-level power differences, since a thought experiment with putting a human in that position predicts a decent chance of non-doom, even though it's not a good plan on its own. Conversely, unfettered superintelligent paperclip maximizers or Fnargls don't leave survivors. And since the world is very far from an equilibrium of abundant superintelligence, there are going to be unprecedented power differentials while it's settling. This makes the common sense impression of irrelevance of misalignment on values (being masked by instrumental convergence) misleading when it comes to AGI.

Thinking this through.

There's a lot of ways in which speedrunning is like paperclip maximisation: speedrunning doesn't contribute to society and further paperclips after we've produced a certain amount become useless.

I'm still confused by the analogy though, because it seems like a lot of people may do speedrunning for fun (but maybe you see it as more about status), while paperclip production isn't fun. I think this makes a difference, as even though we don't want our society to produce absurd amounts of paperclips, we probably do want lots of niche ways to have fun.

Competitive paperclip maximization in a controlled setting sounds like it might be fun. The important thing is that it's one thing that's fun out of many things, and variety is important.

This comment is just a few related thoughts I've had on the subject. Hopefully it's better than nothing (the karma count of my previous comments makes me doubtful)

I'm having a harder time coming up with counter-examples than examples of paperclip maximizers.
Weeds in my garden multiplying, cancer tumors growing, internet memes spreading. Pretty much any system has a 'direction' in which it evolves, so everything from simple mathematical laws to complex human situations seems to have similar behaviour and risks as paperclip maximizers.
A maximizing agent is aligned with itself and every dependency of its goal, so in a sense, every dependency is "protected" by it. You could consider everything in this dependency-chain (e.g. Fnargl's goal depends on humans, so he doesn't kill them) to be part of a single system, and every independent factor to be "outside" of this system. An "us vs them" mentality is not as harmful to life if all intelligent life is included in this "us".

Sadly, I don't think corporations are a good example, since lobbying and corruption can help them to bend the law in their favour.
We will likely see a similar problem with humanity soon: instead of aligning society to human values, and instead of this being necessary, we will start engineering humans to be aligned with the system. Healthy behaviour which is harmful to society is not allowed, and if anyone doesn't fit into modern society (which isn't very well aligned with human needs), we give them medicine which increases their tolerance to modern society, rather than attempting to create a society in which a wider variety of people can thrive.
I don't think that any complex system is aligned with human *values*, though, just human well-being in some form, so any effective means will suffice, including deception. Also, that which matters to us seems to exist only in the micro-states, and the macro-states, due to regression to the mean, seem to effectively delete our individual desires by reducing them to the average. Macro-states can still be said to be a result of average human behaviour, but if we start handing over control to algorithms, then human influence disappears. It's not only our imperfection which is optimized away but our morality as well.

As a side note, we can't always be as lucky as we would be in a Fnargl-ocracy, which plans for the future. I'd argue that most of us plan on smaller time-spans, leading to worse outcomes than longer time-spans would. If we worked on much larger time-spans, I think that most problems would be solved, including pressing ones like global warming. This may be sufficient, depending on Moloch.

I mainly agree with the post, to the degree that I don't see why intelligent people on here are so preoccupied with super-intelligence. In my view, humans are doomed even if we don't succeed in creating super-intelligence. We seem to be handing control over to something inorganic, while creating things with much more fitness than ourselves, so that we ultimately become superfluous, and I'd argue that society is misaligned with humanity already, which explains most of the unnecessary suffering in the world.
Also, I believe that human agency is inversely proportional to technological level (hence the ever-increasing regulations).

Finally, I just had a crazy idea. Make the utility function of the first GAI be the extent to which it can solve the alignment problem (we're making AIs to solve our other problems already, and these are essentially just smaller alignment problems which hurt everything outside the scope of said problems, like with the Fnargl example, which almost has a high enough scope to preserve the whole).