This is a special post for quick takes by Gerald Monroe.

The trivial solution to AI alignment.  

Preface: I'm not expecting this solution to work; I just want to understand why the 'ez' solution doesn't work.

A paperclip maximizer:

The paperclip maximizer is a paperclip factory designer/manager. Its role is to select, from a finite set of machines available on the market, machines to manufacture, inspect, and package paperclips.

Its reward heuristic is H = -( (quota - paperclips_made) x R1 + <human harm> x R2 + <machines outside factory boundary> x R3 + <operating cost> x R4 ).

So the machine gets the most reward if it makes up to the quota of paperclips for the lowest cost, with the least machinery outside the factory bounds, and no humans hurt. (It has perception-stack subsystems that try to detect humans inside the factory boundary and estimate harm if any machine impinges on the space they occupy or are predicted to occupy in near-future frames.)
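As a rough illustration (a minimal sketch; every name and estimator here is hypothetical rather than part of any real system), the heuristic could be computed like this:

```python
def reward(quota, paperclips_made, human_harm_estimate,
           boundary_violation_estimate, operating_cost,
           R1, R2, R3, R4):
    """Toy version of the heuristic H. In the proposal, each estimate
    would be produced by a separate, shared subsystem."""
    penalty = ((quota - paperclips_made) * R1          # unmet quota
               + human_harm_estimate * R2              # predicted harm to humans
               + boundary_violation_estimate * R3      # machinery outside factory bounds
               + operating_cost * R4)                  # energy, materials, wear
    return -penalty                                    # higher H is better
```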

The other design aspect is functional sparseness. This means:

- The machine's neural network is the smallest network, of the possibilities tried during design and training, that maximizes reward.

- The machine exports common problems to other machines. Common problems: paperclip quality control, paperclip production counts, operating cost calculation, machine perception, machine pathing for paperclip manufacture, and so on.

What this would mean is that this machine shares these subsystems with other AI systems that have totally different incentives, and these subsystems were themselves designed to accomplish their roles without the paperclip factory planner having input into their design. Thus it can't install a corrupt "production counter" to declare it has met the quota without energy usage, or redesign the paperclip plant's materials porters to have onboard weapons. And it lacks the cognitive ability to consider most of these things because this doesn't improve reward during the training phase.
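To make the "shared subsystems with fixed interfaces" idea concrete, here is a minimal sketch (the class names and interfaces are invented for illustration, not taken from any real design): the planner can only read the outputs of subsystems it neither designed nor can retrain.

```python
from dataclasses import dataclass
from typing import Protocol


class ProductionCounter(Protocol):
    """Shared counting subsystem, designed and maintained outside the planner."""
    def paperclips_made(self) -> int: ...


class HumanPerception(Protocol):
    """Shared perception stack; estimates harm implied by a planned set of motions."""
    def estimated_harm(self, planned_motions: list) -> float: ...


@dataclass
class FactoryPlanner:
    counter: ProductionCounter       # injected: the planner cannot swap in its own
    perception: HumanPerception      # injected: likewise
    quota: int

    def score_plan(self, planned_motions: list, operating_cost: float,
                   R1: float, R2: float, R4: float) -> float:
        # The planner composes subsystem outputs; it has no pathway to
        # corrupt the counter or re-purpose other machines.
        # (Boundary term R3 omitted here for brevity.)
        shortfall = self.quota - self.counter.paperclips_made()
        harm = self.perception.estimated_harm(planned_motions)
        return -(shortfall * R1 + harm * R2 + operating_cost * R4)
```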

 

Why won't this work?

Yes, you can avoid AGI misalignment if you choose to not employ AGI. What do you do about all the other people who will deploy AGI as soon as it is possible?

This is, arguably, AGI. The reason it's AGI is that you can solve most real-world problems by licensing a collection of common subcomponents, where you only need to define your problem. (I would predict some components will be open source, but the need for data and cloud compute resources to build and maintain a component means nothing can be free.)

In this specific toy example, the only thing written by human devs for their paperclip factory might be a small number of JSON files that reference paperclip specs, define the system topology, and reference another system that actually handles marketing/selling paperclips and sets the quotas.
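For illustration only, such a file might look something like this (every field name and value below is invented, not taken from any real system), shown here being built and serialized from Python:

```python
import json

# Hypothetical contents of the only human-authored definition file; the
# referenced specs, subsystems, and quota source are all assumptions.
factory_definition = {
    "product_spec": "paperclip_spec_v3",            # which paperclip to make
    "quota_source": "sales-system/acme/quotas",     # external system that sets quotas
    "topology": {
        "forming": "wire_forming_cell",
        "inspection": "paperclip_qc_filter",
        "packaging": "boxing_cell",
    },
    "factory_boundary": "site_plan_7.geojson",      # bounds for the R3 penalty term
}

print(json.dumps(factory_definition, indent=2))
```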

Obviously, the meta-task of authoring these files could then be done by another AI-based system that learns from examples of humans authoring them.

So the system as a whole has the capabilities of AGI even though it's made of a large number of narrow AIs. This is arguably how our own brains work at the subsystem level, so there's prior art.

Also, I can see how you might get sloppy and stop adhering to the 'sparseness' criterion: build your paperclip factory out of subsystems that are smarter than they need to be, increasing their flexibility in novel situations at the cost of potentially unbounded behavior.

This is definitively not AGI.

> And it lacks the cognitive ability to consider most of these things because this doesn't improve reward during the training phase.

If it lacks the cognitive ability to consider things that humans can consider, then it's not AGI.

Put units on your equation.  I don't think H will end up being what you think it is.  Or, the coefficients R1-R4 are exactly as complex as the problem you started with, and you've accomplished no simplification with this.

Heck, even the first term, (quota - paperclips_made), hand-waves where the quota comes from, and glosses over any non-linearity whereby making slightly more for next year is better than making slightly fewer than needed this year.

R1 through R4 are arbitrary positive floating-point numbers, and the units are currency units. So "human harm" is the estimated actual costs plus estimated reputation damage paid for injuries/wrongful deaths, "outside boundary" is an estimate of the fines for trespassing and the settlements paid in lawsuits, "paperclips made" is the economic value of the paperclips, and operating cost is obvious.
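Read that way, a toy calculation might look like this (all figures invented for illustration):

```python
# Under this reading, every term is already a currency estimate, so the
# weights mainly set how much each kind of cost matters relative to the others.
R1, R2, R3, R4 = 1.0, 1.0, 1.0, 1.0      # arbitrary positive weights

missed_value   = 200.0    # $ of paperclip value short of the quota
harm_cost      = 0.0      # $ expected liability + reputation damage
boundary_cost  = 0.0      # $ expected trespass fines and settlements
operating_cost = 120.0    # $ energy, materials, maintenance

H = -(missed_value * R1 + harm_cost * R2 + boundary_cost * R3 + operating_cost * R4)
print(H)   # -320.0, i.e. a $320 penalty for this plan
```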

Hmm.  Either I'm misunderstanding, or you just described a completely amoral optimizer, which will kill billions as long as it can't be held financially liable.  Maybe just take over the governments (or at least currency control), so it can't be financially penalized for anything, ever.

Also, you're adding paperclip-differential to money, so the result won't be pure money.  That's probably good, because otherwise this beast stops making paperclips and optimizes for negative realized costs on one of the other dimensions.

Without even getting into whether your specific reward heuristic is misaligned, it seems to me that you've just shifted the problem slightly out of the focus of your description of the system, by specifying that all of the work will be done by subsystems that you're just assuming will be safe.

 

"paperclip quality control" has just as much potential for misalignment in the limit as does paperclip maximization, depending on what kind of agent you use to accomplish it. So, even if we grant the assumption that your heuristic is aligned, we are merely left with the task of designing a bunch of aligned agents to do subtasks.

Paperclip quality control is an agent that was trained on simulated sensor inputs (camera images and whatever else) of variations of paperclips. Paperclips that are not within a narrow range of dimensions and other measurements for correctness are rejected.

It doesn't have any learning ability. It is literally an overgrown digital filter that takes in some dimensions of an input image and outputs true or false to accept or reject (and probably another vector specifying which checks failed).
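A minimal sketch of that kind of fixed filter (assuming an upstream stage has already extracted dimensional measurements from the camera images; the field names and tolerances are invented):

```python
from typing import NamedTuple


class ClipMeasurement(NamedTuple):
    length_mm: float
    wire_diameter_mm: float
    gap_mm: float


# Fixed acceptance ranges; with learning disabled, this table never changes.
TOLERANCES = {
    "length_mm": (32.0, 34.0),
    "wire_diameter_mm": (0.9, 1.1),
    "gap_mm": (1.5, 2.5),
}


def inspect(clip: ClipMeasurement) -> tuple[bool, list[str]]:
    """Return (accept, failed_checks) for one measured paperclip."""
    failed = [name for name, (lo, hi) in TOLERANCES.items()
              if not lo <= getattr(clip, name) <= hi]
    return (not failed, failed)


# A clip that is slightly too long gets rejected, with the failed check named.
print(inspect(ClipMeasurement(length_mm=35.0, wire_diameter_mm=1.0, gap_mm=2.0)))
# (False, ['length_mm'])
```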

We can describe every subagent for everything the factory needs as a machine so limited and narrow-domain that alignment issues are not possible (especially as most will have no memory and all will have learning disabled).