We are familiar with the thesis that Value is Fragile. This is why we are researching how to impart values to an AGI.

Embedded Minds are Fragile

Besides values, it may be worth remembering that human minds too are very fragile.

A little magnetic tampering with your amygdalae, and suddenly you are a wannabe serial killer. A small dose of LSD can convince you that you can fly, or that the world will end in four hours. Remove part of your ventromedial prefrontal cortex, and suddenly you are so utilitarian even Joshua Greene would call you a psychopath.

Very little material change is required to substantially modify a human being's behavior. The same holds for other animals with embedded brains, crafted by evolution out of squishy matter and modulated by glands and molecular gates.

A Problem for Paul-Boxing and CEV?

One assumption underlying Paul-Boxing and CEV is that:

It is easier to specify and simulate a human-like mind than to impart values to an AGI directly via code or human language.

We usually assume this because, as we know, value is fragile. But so are embedded minds. Very little tampering is required to profoundly transform people's moral intuitions: a large fraction of the inmate population in the US has frontal-lobe or amygdala malfunctions.

Finding the simplest description of a human brain that, when simulated, continues to act as that human would act in the real world may turn out to be as fragile an endeavor as, or even more fragile than, concept learning for AGIs.

23 comments

What is paul-boxing?

I wish there were a standardized name for Paul Christiano's proposal that indirect normativity be done using a specific counterfactual human, aided by a computer. Is there? I've heard people calling it Paul-Boxing many times, but maybe there is a different name.


...is, I think, still the reference post.

That post discusses two big ideas, one is putting a human in a box and building a model of their input/output behavior as "the simplest model consistent with the observed input/output behavior." Nick Bostrom calls this the "crypt," which is not a very flattering name but I have no alternative. I think it has been mostly superseded by this kind of thing (and more explicitly, here, but realistically the box part was never necessary).
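The "simplest model consistent with the observed input/output behavior" idea can be illustrated with a toy minimum-description-length search. This is only a sketch: the candidate model family and the observations below are hypothetical, chosen purely for illustration.

```python
# Toy sketch of "simplest model consistent with observed I/O":
# walk through candidate models in order of description length and
# return the first one that reproduces every observation.

observations = [(0, 1), (1, 2), (2, 5), (3, 10)]  # (input, output) pairs

# Hypothetical candidates, ordered from simplest to most complex.
candidates = [
    ("f(x) = x",            lambda x: x),
    ("f(x) = 2*x",          lambda x: 2 * x),
    ("f(x) = x**2 + 1",     lambda x: x ** 2 + 1),
    ("f(x) = x**3 - x + 1", lambda x: x ** 3 - x + 1),
]

def simplest_consistent_model(obs, models):
    """Return the first (i.e. simplest) model matching all observations."""
    for name, f in models:
        if all(f(x) == y for x, y in obs):
            return name
    return None  # no candidate fits

print(simplest_consistent_model(observations, candidates))  # f(x) = x**2 + 1
```

The real proposal of course ranges over vastly larger model classes, but the selection principle is the same: prefer the shortest description that fits the data.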

The other part is probably more important, but less colorful: extrapolate by actually seeing what a person would do in a particular "favorable" environment. I have been calling this "explicit" extrapolation.

I'm sorry to never name this. I think I can be (partly) defended because the actual details have changed so much, and it's not clear exactly what you would want to refer to.

I'm skeptical about the relevance of the fragility of minds here. Yes, if you mess up your simulation slightly, it will start to make bad predictions. But that seems to make it easier to specify a person precisely, not harder: the divergent predictions let you quickly rule out alternative models by observing people. Indeed, how a human brain is physically represented makes little difference to the kind of predictions someone would make. As another commenter pointed out, if you randomly flip some bits in a computer it will not do anything good, but that has little relevance to your predictions of the computer unless you expect some bits to get randomly flipped.
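The "quickly rule out alternative models" point can be made concrete with a toy Bayesian update over a few hypothetical models of a biased coin: a model whose predictions are even slightly off loses essentially all posterior mass once enough observations accumulate. This is only an illustrative sketch with made-up models and data.

```python
# Toy Bayesian model elimination: each hypothetical model predicts the
# heads-probability of a coin; observations rule out mismatched models.

models = {"p=0.5": 0.5, "p=0.6": 0.6, "p=0.9": 0.9}
prior = {name: 1 / len(models) for name in models}

def update(posterior, heads):
    """One Bayesian update on a single coin flip."""
    post = {name: posterior[name] * (p if heads else 1 - p)
            for name, p in models.items()}
    total = sum(post.values())
    return {name: v / total for name, v in post.items()}

# A deterministic data stream with 60% heads, matching the p=0.6 model.
flips = [True] * 300 + [False] * 200

posterior = prior
for heads in flips:
    posterior = update(posterior, heads)

print(max(posterior, key=posterior.get))  # p=0.6
```

After 500 flips, the two wrong models retain a vanishing share of the posterior, even though p=0.5 differs from the truth by only 0.1 per flip.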

(A general point is also that you shouldn't be specifying values either through a definition of a human mind or through language. You should be using these weak descriptions to give an indirect definition of value, via an edict like "do whatever it is that people want," and recommending actions like "when the time comes, ask people." (This indirection may be done in hypothetical simulation, for the explicit extrapolation approach.) These targets are much less fragile. So I tend to think that the fragility of value doesn't bear much on the feasibility of these approaches.)

A little magnetic tampering with your amygdalas, and suddenly you are a wannabe serial killer.

How do you know?

People with amygdala damage are well represented in the serial-killer population. There's a catch: people who become serial killers through this sort of trauma frequently do not "want to" be serial killers; they just are, and they find their own actions morally reprehensible. One place to look for cases like these is the Human Aggression lectures in Sapolsky's course on human behavioral biology. Antonio Damasio probably has relevant information in his large brain-scan database as well.

Just because many serial killers have amygdala damage doesn't mean that all, or even most, people with amygdala damage become serial killers.

I don't think the evidence to which you point proves a claim as strong as the one you make in the post.

Magnetic though?

[-][anonymous]8y 7

I can't find any studies that look into effects on mood specifically, but I found a related paper in which a magnetic field applied to the amygdalae of rats increased the frequency of seizures. The idea that magnetic stimulation of parts of the brain causes altered states is in the public consciousness thanks to the God Helmet, though that applied a magnetic field to the temporal lobes and failed to replicate. That is likely why magnetic fields were used as an example.

Yes, I intended to show that a relatively non-invasive technique may suffice to drastically distort the mind. The same point could be made with a blood clot, electrical overdrive (seizures), or a small tumor, if history shows that transcranial magnetic stimulation was a placebo after all.

[-][anonymous]8y 2

While placebo effects may account for some of the proposed therapeutic uses of transcranial magnetic stimulation (TMS), there is now little doubt that TMS actually does affect neural activity in a highly constrained fashion. To me, one of the most convincing demonstrations is the ability of TMS to reliably evoke motor responses in specific finger muscles via stimulation at varying locations over motor cortex.

While finger twitches are not so impressive in terms of potentially major consequences, speech arrest (TMS over left frontal cortex) is far more striking, if very temporary. TMS, applied at unacceptably high levels, could definitely "distort the mind" just as direct application of a comparable electrical field would - or for that matter injury or surgery. However, existing literature on focal lesions is not as clear-cut as your post suggests: our predictive power at the individual level is still very low even for highly well-known conditions.

I do agree with the central thesis: human brains are fragile and certain networks, if damaged, can have extreme consequences. But they're not so predictably fragile.

But, but... people who want to do things they feel are morally reprehensible are not reflectively consistent, and certainly neither are people who imagine things like the imminent end of the world. I thought CEV-type approaches rely only on reflectively consistent people to learn from.

Is this surprising? If I randomly corrupt my computer's memory, it will malfunction too, let alone if I damage the actual hardware. It's amazing human brains are as resilient as they are.
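The computer analogy can be made concrete: whether a single flipped bit is harmless or catastrophic depends entirely on which bit it is. A minimal sketch using the 64-bit IEEE-754 representation of a float:

```python
import struct

def flip_bit(x, bit):
    """Flip one bit in the 64-bit IEEE-754 representation of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

print(flip_bit(1.0, 0))   # 1.0000000000000002 -- lowest mantissa bit: imperceptible
print(flip_bit(1.0, 52))  # 0.5 -- lowest exponent bit: value halved
print(flip_bit(1.0, 62))  # inf -- highest exponent bit: catastrophic
```

One bit out of 64 changes the value by one part in 10^16, another destroys it entirely; brains seem to have the same property, just with squishier bits.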

Read the very last comment by Archytect (downvoted 5 times); it is exactly the same as the top one. Makes one wonder, doesn't it? I laughed.

Modded +5 Insightful.

I used to find Slashdot delightful,
But my feelings of late are more spiteful;
My comments sarcastic
The iconoclastic
Keep modding to Plus Five (Insightful).

-- http://www.xkcd.com/301/

[-][anonymous]8y 6

I wanted to upvote, but you're currently an appropriate +5 :)

Not to mention tampering with it, or allowing it to tamper with itself, might have all kinds of unforeseen consequences. To me its like, here is a whole lot of evolutionary software that does all this elegant stuff a lot of the time... but has never been unit tested.


If we got a perfectly working infant-like mind, it would be pretty straightforward to bring it up as a human. We have workable ways of teaching natural intelligences good-enough values. However, we have no description of what goes on in an infant brain: we treat it like a black box, not knowing how it works but knowing how to use it. Therefore I estimate that value teaching via human language is easier than specifying a human-like mind.
