Futarchy and Unfriendly AI

by jefftk, 3rd Apr 2015

We have a reasonably clear sense of what "good" is, but it's not perfect. Suffering is bad, pleasure is good, more people living enjoyable lives is good, yes, but tradeoffs are hard. How much worse is it to go blind than to lose your leg? [1] How do we compare the death of someone at eighty to the death of someone at twelve? If you wanted to build some automated system that would go from data about the world to a number representing how well it's doing, where you would prefer any world that scored higher to any world scoring lower, that would be very difficult.

Say, however, that you've built a metric that you think matches your values well and you put some powerful optimizer to work maximizing that metric. This optimizer might do many things you think are great, but it might be that the easiest ways to maximize the metric are the ones that pull it apart from your values. Perhaps after it's in place it turns out your metric included many things that only strongly correlated with what you cared about, where the correlation breaks down under maximization.
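A toy sketch of this failure mode, with every function and number invented for illustration: suppose an optimizer splits a fixed effort budget between genuine work and gaming the metric. The metric tracks value exactly as long as nobody games it, but gaming raises the metric faster than genuine work does, so a strong optimizer ends up somewhere we would not have chosen.

```python
def true_value(genuine, gaming):
    # What we actually care about (assumed): genuine work helps,
    # gaming the metric mildly hurts.
    return genuine - 0.5 * gaming


def metric(genuine, gaming):
    # The proxy we wrote down (assumed): it rewards genuine work,
    # but gaming raises it twice as fast.
    return genuine + 2.0 * gaming


def policies(budget):
    # Every way to split an integer effort budget between the two activities.
    return [(g, budget - g) for g in range(budget + 1)]


def optimize(score, budget):
    # An indifferent optimizer: pick whichever policy scores highest.
    return max(policies(budget), key=lambda p: score(*p))


budget = 10
chosen = optimize(metric, budget)      # what the optimizer picks
wanted = optimize(true_value, budget)  # what we would have picked

print("optimizer picks", chosen, "metric:", metric(*chosen), "value:", true_value(*chosen))
print("we wanted      ", wanted, "metric:", metric(*wanted), "value:", true_value(*wanted))
```

With no gaming the two functions agree, which is why the metric looked like a good match before any optimization pressure was applied to it.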

What confuses me is that the people who warn about this scenario with respect to AI are often the same people who favor futarchy. Both involve trying to define your values and then setting an indifferent optimizer to work on them. If you think AI would be very dangerous but futarchy would be very good, why?

I also posted this on my blog.


[1] This is a question people working in public health try to answer with Disability Weights for DALYs.

27 comments

Futarchy has less potential for perverse instantiation because it has fewer degrees of freedom to work with, because it operates on a human time scale but no faster, and because its outputs can be (and will be) ignored if they're sufficiently ridiculous.

I think the crucial difference between AI and futarchy is that in AI the utility function is decided once and for all. Once a superintelligence is out there, there is no stopping it. In futarchy, on the other hand, the utility function is determined by some sort of democratic mechanism that operates continuously and can introduce corrections if things start going awry.

Can you suggest a scenario in which futarchy would result in a clear negative outcome, something analogous to turning the universe into paper clips?

Analogous to turning the universe into paper clips

That's a low bar: it's an intentionally silly example. No one actually thinks we're likely to accidentally create a paperclip maximizer AI, any more than we're likely to accidentally include a "number of paperclips in the world" term in a futarchy metric. But something as clearly negative would be mandatory wireheading to maximize a "human pleasure" term.

A less extreme (and less clearly negative, but also more likely) example would be maximizing GDP. Hanson often uses GDP as an example of something you could include in a futarchy metric. GDP only counts market work, however, which means you can increase GDP by moving tasks from "do them yourself" to "hire someone". For example, if I watch my kid that doesn't count towards GDP, but if I pay you to watch them, and you pay me to do whatever you would otherwise have done during that time, it does.
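The arithmetic of that swap can be sketched as follows. The `Task` type, the task names, and the $100 figures are all made up for illustration; the only thing taken from the example is that unpaid work is left out of the total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    value: float  # hypothetical dollar value of the work done
    paid: bool    # True if someone was hired to do it


def measured_gdp(tasks):
    # Only market (paid) work is counted.
    return sum(t.value for t in tasks if t.paid)


def work_done(tasks):
    # Everything that actually got done, paid or not.
    return sum(t.value for t in tasks)


do_it_yourself = [
    Task("I watch my kid", 100.0, paid=False),
    Task("you do your own task", 100.0, paid=False),
]
hire_each_other = [
    Task("you watch my kid", 100.0, paid=True),
    Task("I do your task", 100.0, paid=True),
]

for label, tasks in [("do it yourself", do_it_yourself), ("hire each other", hire_each_other)]:
    print(label, "| measured GDP:", measured_gdp(tasks), "| work done:", work_done(tasks))
```

The same work gets done in both scenarios, but only the second one registers in the measured total.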

GDP/person is one of the best metrics for "how is a country doing", often doing much better than explicit attempts to measure things closer to what we care about, but put a big optimizing push behind it and soon all the tiny tasks we do over the course of the day are pressured into market work.

Feedback controls. Futarchy is transparent, carried out in real time, and gives plenty of room to adjust values and change strategies if the present ones prove defective. On the other hand, a superintelligent AI would basically run as a black box. The operators would set the values, then the AI would use some method to optimize and then spit out the optimal strategy (and presumably implement it). There's no room for human feedback between setting the values and implementing the optimal strategy.

This relates to my previous post on confounding in Prediction Markets. In my analysis, if we allow human feedback between setting the values and implementing the strategy, you break the causal interpretation of the prediction market and therefore lose the ability to use it for optimization. This is obviously a trade-off against other considerations that may be more important, but you will run into big problems if the market participants expect a significant probability that humans will override the market.

There's no room for human feedback between setting the values and implementing the optimal strategy.

Here and elsewhere I've advocated* that, rather than using Hanson's idea of target-values that are objectively verifiable like GDP, futarchy would do better to add human feedback in the stage of the process where it gets decided whether the goals were met or not. Whoever proposed the goal would decide after the prediction deadline expired, and thus could respond to any improper optimizing by refusing to declare the goal "met" even if it technically was met.

[ * You can definitely do better than the ideas in that blog post, of course.]

I can't say much about the consequences of this, but it appears to me that both democracy and futarchy are efforts to more closely approximate something along the lines of a CEV for humanity. They have the same problems, in fact. How do you reconcile mutually exclusive goals of the people involved?

In any case, that isn't directly relevant, but linking futarchy with AI caused me to notice that. Perhaps that sort of optimization style, of getting at what we "truly want" once we've cleared up all the conflicting meta-levels of "want-to-want", is something that the same sorts of people tend to promote.

I'm not a big fan of decision making by conditional prediction markets (btw, "futarchy" is an obscure, non-descriptive name. Better call it something like "prophetocracy"), but I think that proponents like Robin Hanson propose that the value system is not set once and for all but regularly updated by a democratically elected government. This should avoid the failure mode you are talking about.

"Futarchy" is an obscure, non-descriptive name. Better call it something like "prophetocracy"

"Futarchy" is the standard term for this governmental system. Perhaps Hanson should have chosen a different name, but that's the name it's been going under for about a decade and I don't think "prophetocracy" would be an improvement.

It's not a very well-known word, anyway. Would the cost of changing it outweigh the benefit of a relatively self-descriptive word?

What is the alternative? Futarchy is unfriendly, but so is the current government.

Think about the laws that govern these things, and how to use them to make these things better for us.