I wonder how much an agent could achieve by thinking along the following lines:

Big Brad is a human-shaped robot who works as a lumberjack. One day his employer sends him into town on his motorbike carrying two chainsaws, to get them sharpened. Brad notices an unusual number of the humans around him suddenly crossing streets to keep their distance from him.

Maybe they don't like the smell of chainsaw oil? So he asks one rather slow pedestrian "Why are people keeping their distance?" to which the pedestrian replies "Well, what if you attacked us?"

Now in the pedestrian's mind, that's a reasonable response. If Big Brad did attack someone walking next to him, without notice, he would be able to cut them in half. To humans who expect large, bike-riding people carrying potential weapons to be disproportionately likely to turn violent without warning, being attacked by Brad seems a reasonable fear, worth a little effort to alter their walking routes so that they could run away if Brad turned violent.

But Brad knows that Brad would never do such a thing. Initially, the pedestrian's question might seem to him like being asked "What if 2 + 2 equalled 3?"

But if Brad can think about the problem in terms of what information is available to the various actors in the scenario, he can reframe the pedestrian's question as: "What if an agent that, given the information I have so far, is indistinguishable from you, were to attack us?"

If Brad is aware that random pedestrians in the street don't know him personally, at least not well enough to be confident about his internal rules and values, and if he can hypothesise an alternative being, Brad', whom a pedestrian might consider plausible and who has different internal rules and values from Brad's yet otherwise appears identical, then Brad has a way forward to think through the problem.
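To make the reframing concrete, here is a minimal sketch of the pedestrian's reasoning as a Bayesian update over two hypotheses, "this is Brad" and "this is Brad'". The function name and every number in it are my own illustrative assumptions, not anything measured. The point it tries to show is that an observation which is equally likely under both hypotheses (the agent's appearance) leaves the pedestrian's prior untouched, so the pedestrian's caution says nothing about Brad's actual values.

```python
def posterior_hostile(prior_hostile, like_hostile, like_benign):
    """Bayes' rule: P(hostile | observation) for a two-hypothesis world.

    like_hostile / like_benign are P(observation | hypothesis).
    """
    joint_hostile = prior_hostile * like_hostile
    joint_benign = (1.0 - prior_hostile) * like_benign
    return joint_hostile / (joint_hostile + joint_benign)


# Assumption: a made-up prior that a large stranger carrying chainsaws is a Brad'.
PRIOR_HOSTILE = 0.01

# Brad and Brad' look identical, so appearance is equally likely under both.
posterior_same_look = posterior_hostile(PRIOR_HOSTILE, 1.0, 1.0)

# Assumption: some hypothetical signal that a Brad' would rarely give (0.1)
# but Brad would usually give (0.9), e.g. calmly explaining his errand.
posterior_after_signal = posterior_hostile(PRIOR_HOSTILE, 0.1, 0.9)

print("after seeing appearance only:", posterior_same_look)
print("after a distinguishing signal:", posterior_after_signal)
```

On this framing, the useful question for Brad is not "would I attack?" but "what could I do that a Brad' would be unlikely to do?", since only evidence of that kind moves the pedestrian's estimate.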

On the more general question of whether it would be useful for Brad to have the ability to ask himself "What if the universe were other than I think it is? What if magic works and I just don't know that yet? What if my self-knowledge isn't 100% reliable, because there are embedded commands in my own code that I'm currently being kept from being aware of by those same commands? Perhaps I should allocate a minute probability to the scenario that somewhere there exists a lightswitch that's magically connected to the sun and which, in defiance of known physics, can just turn it off and on?": with careful allocation of probabilities, that ability might avoid divide-by-zero problems, but I don't think it is a panacea. There are additional approaches to counterfactual thinking that may be more productive in some circumstances.
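Here is one reading of the divide-by-zero point, as a small sketch rather than a claim about how any real agent is built. The hypothesis names, the priors, and the simplification that known physics assigns exactly zero likelihood to the observation are all assumptions of mine. An agent that gives the magic lightswitch a prior of exactly zero cannot condition on seeing the sun switch off, because the evidence term in Bayes' rule is zero; an agent that reserves a minute probability can.

```python
def bayes_update(priors, likelihoods):
    """Return posteriors P(h | observation) for each hypothesis h."""
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    if evidence == 0.0:
        # Conditioning on an event the agent considered impossible.
        raise ZeroDivisionError("observation has probability zero under these priors")
    return [p * l / evidence for p, l in zip(priors, likelihoods)]


# Hypotheses, in order: [known physics, magic lightswitch].
# Assumed likelihoods of "the sun goes dark when the switch is flipped".
LIKELIHOODS = [0.0, 1.0]

dogmatic_priors = [1.0, 0.0]                # magic is ruled out entirely
open_minded_priors = [1.0 - 1e-12, 1e-12]   # a minute probability is reserved

try:
    bayes_update(dogmatic_priors, LIKELIHOODS)
    dogmatic_failed = False
except ZeroDivisionError:
    dogmatic_failed = True

open_minded_posterior = bayes_update(open_minded_priors, LIKELIHOODS)

print("dogmatic agent could not update:", dogmatic_failed)
print("open-minded posterior [physics, magic]:", open_minded_posterior)
```

This only shows that the update becomes well defined. It does not say how small the reserved probability should be, or how to enumerate the strange hypotheses in the first place, which is part of why I don't think the approach is a panacea.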

Decision Theory

by abramdemski, Scott Garrabrant | 1 min read | 31st Oct 2018 | 37 comments


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(A longer text-based version of this post is also available on MIRI's blog here, and the bibliography for the whole sequence can be found here.)

The next post in this sequence, 'Embedded Agency', will come out on Friday, November 2nd.

Tomorrow’s AI Alignment Forum sequences post will be 'What is Ambitious Value Learning?' in the sequence 'Value Learning'.