Rob Bensinger

Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.

Rob Bensinger's Comments

Concerns Surrounding CEV: A case for human friendliness first

Sure! :) Sorry if I came off as brusque; I was multitasking a bit.

Concerns Surrounding CEV: A case for human friendliness first

I wasn't bringing up evolution because you brought up evolution; I was bringing it up separately to draw a specific analogy.

Concerns Surrounding CEV: A case for human friendliness first

By analogy, I'd ask you to consider why it doesn't make sense to try to "cooperate" with the process of evolution. Evolution can be thought of as an optimizer, with a "goal" of maximizing inclusive reproductive fitness. Why do we just try to help actual conscious beings, rather than doing some compromise between "helping conscious beings" and "maximizing inclusive reproductive fitness" in order to be more fair to evolution?

A few reasons:

  • The things evolution "wants" are terrible. This isn't a case of "vanilla or chocolate?"; it's more like "serial killing or non-serial-killing?".
  • Evolution isn't a moral patient: it isn't a person, it doesn't have experiences or emotions, etc.
    • (A paperclip maximizer might be a moral patient, but it's not obvious that it would be; and there are obvious reasons for us to deliberately design AGI systems to not be moral patients, if possible.)
  • Evolution can't use threats or force to get us to do what it wants.
    • (Ditto a random optimizer, at least if we're smart enough to not build threatening or coercive systems!)
  • Evolution won't reciprocate if we're nice to it.
    • (Ditto a random optimizer. This is still true after you build an unfriendly optimizer, though not for the same reasons: an unfriendly superintelligence is smart enough to reciprocate, but there's no reason to do so relative to its own goals, if it can better achieve those goals through force.)
Concerns Surrounding CEV: A case for human friendliness first

In the next part (forgive me if this is way off) essentially you are saying my second question in the post is false: it won't be self-aware, or if it is, it won't reflect enough to consider significantly rewriting its source code

No, this is not right. A better way of stating my claim is: "The notion of 'self-awareness' or 'reflectiveness' you're appealing to here is a confused notion." You're doing the thing described in Ghosts in the Machine and Anthropomorphic Optimism, most likely for reasons described in Sympathetic Minds and Humans in Funny Suits: absent a conscious effort to correct for anthropomorphism, humans naturally model other agents in human-ish terms.

I'm more positing: at what point does a paperclip maximizer learn so much that it has a model of behaving in a manner that doesn't optimize paperclips and explores that, or has a model of its own learning capabilities and explores optimizing for other utilities?

What does "exploring" mean? I think that I'm smart enough to imagine adopting an ichneumon wasp's values, or a serial killer's values, or the values of someone who hates baroque pop music and has strong pro-Spain nationalist sentiments; but I don't try to actually adopt those values, it's just a thought experiment. If a paperclip maximizer considers the thought experiment "what if I switched to less paperclip-centric values?", why (given its current values) would it decide to make that switch?

maybe the initial task we give it should take into account what its potential volition may be at some point, rather than just our own, as a pre-signal of pre-committing to cooperation.

I think there's a good version of the ideas in this neighborhood, and a bad version. The good version is cosmopolitan value: not trying to lock in the future to an overly narrow or parochial "present-day-human-beings" version of what's good and beautiful.

The bad version is deliberately building a paperclipper out of a misguided sense of fairness to random counterfactual value systems, or out of a misguided hope that a paperclipper will spontaneously generate emotions of mercy, loyalty, or reciprocity when given a chance to convert especially noble and virtuous humans into paperclips.

Concerns Surrounding CEV: A case for human friendliness first

To answer questions like these, I recommend reading https://www.lesswrong.com/rationality and then browsing https://arbital.com/explore/ai_alignment/.

Quoting "The Value Learning Problem":

[S]ystems that can strictly outperform humans cognitively have less to gain from integrating into existing economies and communities. Hall [2007] has argued:

"The economic law of comparative advantage states that cooperation between individuals of differing capabilities remains mutually beneficial. [ . . . ] In other words, even if AIs become much more productive than we are, it will remain to their advantage to trade with us and to ours to trade with them."

As noted by Benson-Tilsen and Soares [forthcoming 2016], however, rational trade presupposes that agents expect more gains from trade than from coercion. Non-human species have various “comparative advantages” over humans, but humans generally exploit non-humans through force. Similar patterns can be observed in the history of human war and conquest. Whereas agents at similar capability levels have incentives to compromise, collaborate, and trade, agents with strong power advantages over others can have incentives to simply take what they want.

The upshot of this is that engineering a functioning society of powerful autonomous AI systems and humans requires that those AI systems be prosocial. The point is an abstract one, but it has important practical consequences: rational agents’ interests do not align automatically, particularly when they have very different goals and capabilities.

And quoting Ensuring smarter-than-human intelligence has a positive outcome:

The notion of AI systems “breaking free” of the shackles of their source code or spontaneously developing human-like desires is just confused. The AI system is its source code, and its actions will only ever follow from the execution of the instructions that we initiate. The CPU just keeps on executing the next instruction in the program register. We could write a program that manipulates its own code, including coded objectives. Even then, though, the manipulations that it makes are made as a result of executing the original code that we wrote; they do not stem from some kind of ghost in the machine.

The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification.

Enslaving conscious beings is obviously bad. It would be catastrophic to bake into future AGI systems the assumption that non-human animals, AI systems, ems, etc. can't be moral patients, and there should be real effort to avoid accidentally building AI systems that are moral patients (or that contain moral patients as subsystems); and if we do build AI systems like that, then their interests need to be fully taken into account.

But the language you use in the post above is privileging the hypothesis that AGI systems' conditional behavior and moral status will resemble a human's, and that we can't design smart optimizers any other way. You're positing that sufficiently capable paperclip maximizers must end up with sufficient nobility of spirit to prize selflessness, trust, and universal brotherhood over paperclips; but what's the causal mechanism by which this nobility of spirit enters the system's values? It can't just be "the system can reflect on its goals and edit them", since the system's decisions about which edits to make to its goals (if any) are based on the goals it already has.
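
To make that last point concrete, here is a toy sketch (purely illustrative: the function names, the numbers, and the "cosmopolitan_utility" stand-in are all made up for this example, not anyone's actual architecture) of an agent that is allowed to rewrite its own objective, but whose choice of rewrite is scored by the objective it currently has:

```python
# Toy sketch (illustrative, not from the post or any real system): an agent
# that is allowed to rewrite its own objective, but whose choice of rewrite
# is evaluated by the objective it currently has.

def paperclip_utility(world):
    # Scores a world purely by its paperclip count.
    return world["paperclips"]

def cosmopolitan_utility(world):
    # Stand-in for "nobler" values that place no weight on paperclips.
    return 0

def expected_outcome(utility_fn):
    """The agent's (toy) model of what the world looks like if it runs
    itself under `utility_fn` from now on."""
    return {"paperclips": 1_000_000 if utility_fn is paperclip_utility else 10}

def choose_successor_goal(current_utility, candidates):
    # The self-modification step: each candidate goal is judged by how well
    # the resulting world scores *by the agent's current lights*.
    return max(candidates, key=lambda g: current_utility(expected_outcome(g)))

chosen = choose_successor_goal(
    current_utility=paperclip_utility,
    candidates=[paperclip_utility, cosmopolitan_utility],
)
assert chosen is paperclip_utility  # reflection doesn't change what it wants
```

Nothing in that loop supplies a criterion other than the current one, so "more reflection" just means more thorough pursuit of paperclips, not a change of heart.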

You frame alignment as "servitude", as though there's a ghost or homunculus in the AI with pre-existing goals that the AI programmers ruthlessly subvert or overwrite. But there isn't a ghost; there's just a choice by us to build systems with either humane-value-compatible or humane-value-incompatible optimization targets.

The links above argue that the default outcome, if you try to be "hands-off", is a humane-value-incompatible target -- and not because inhumane values are what some ghost "really" wants, and being hands-off is a way of letting it follow through on its heart's desire. Rather, the heart's desire is purely a product of our design choices, with no "perfectly impartial and agent-neutral" reason to favor one option over any other (though plenty of humane reasons to do so!!), and the default outcome comes from the fact that many possible minds happen to converge on adversarial strategies, even though there's no transcendent agent that "wants" this convergence to happen. Trying to cooperate with this convergence property is like trying to cooperate with gravity, or with a rock.

We run the Center for Applied Rationality, AMA

I feel like this comment should perhaps be an AIRCS class -- not on meta-ethics, but on 'how to think about what debugging-your-brain work is, if your usual ontology is "some activities are object-level engineering, some activities are object-level science, and everything else is bullshit or recreation"'. (With meta-ethics addressed in passing as a concrete example.)

We run the Center for Applied Rationality, AMA

I felt a "click" in my brain reading this comment, like an old "something feels off, but I'm not sure what" feeling about rationality techniques finally resolving itself.

If this comment were a post, and I were in the curating-posts business, I'd curate it. The demystified concrete examples of the mental motion "use a tool from an unsciencey field to help debug scientists" are super helpful.

We run the Center for Applied Rationality, AMA

Can you too-tersely summarize your Nisbett and Wilson argument?

Or, like... write a teaser / movie trailer for it, if you're worried your summary would be incomplete or inoculating?

We run the Center for Applied Rationality, AMA

I don't know Dario well, but I know enough to be able to tell that the anon here doesn't know what they're talking about re Dario.
