The kicker is that we don't reason directly about ourselves as such; we use a simplified model of ourselves. And we're REALLY GOOD at using that model for causal reasoning, even when it is reflective and involves multiple levels of self-reflection and counterfactuals - at least when we bother to try. (We rarely try because explicit modelling is cognitively demanding, so we usually fall back on defaults / conditioned reasoning. Sometimes that's OK.)

Example: It is 10PM. A 5-page report is due in 12 hours, at 10AM.

Default: Go to sleep at 1AM, set alarm for 8AM. Result: Don't finish report tonight, have too little time to do so tomorrow.

Conditioned reasoning: Stay up to finish the report first. That's 5 hours of work, so stay up until 3AM. Result: Write a bad report, and still feel exhausted the next day.

Counterfactual reasoning: I should nap / get some amount of sleep so that I am better able to concentrate, which will outweigh the lost time. I could set my alarm for any amount of time; what amount does my model of myself imply will lead to an optimal well-rested / sufficient time trade-off?
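The trade-off in that question can be sketched as a tiny search over alarm settings. This is only an illustration: the `productivity` curve is a made-up stand-in for "my model of myself" (its numbers are assumptions, not measurements), and `work_done` and `best_sleep` are hypothetical helper names.

```python
def productivity(sleep_hours: float) -> float:
    """Hypothetical self-model: fraction of full working speed after sleeping.

    Assumed shape: 40% of full speed with no sleep, rising linearly
    to 100% at 6 hours of sleep and staying flat after that.
    """
    return min(1.0, 0.4 + 0.1 * sleep_hours)


def work_done(sleep_hours: float, total_hours: float = 12.0) -> float:
    """Effective hours of report-writing if I sleep for `sleep_hours`."""
    awake_hours = total_hours - sleep_hours
    return awake_hours * productivity(sleep_hours)


def best_sleep(total_hours: float = 12.0, step: float = 0.5) -> float:
    """Try every alarm setting in `step`-hour increments and keep the best."""
    candidates = [i * step for i in range(int(total_hours / step) + 1)]
    return max(candidates, key=lambda s: work_done(s, total_hours))


if __name__ == "__main__":
    s = best_sleep()
    print(f"sleep {s} hours, effective work {work_done(s):.2f} hours")
```

With this assumed curve the best choice is an intermediate amount of sleep, neither no sleep nor sleeping the whole time. The point is that each candidate is evaluated counterfactually against the self-model rather than by actually trying it.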

Self-reflection problem, second use of mini-self model: I'm worse at reasoning at 1AM than I am at 10PM. I should decide what to do now, instead of delaying until then. I think going to sleep at 12AM and waking at 3AM gives me enough rest and time to do a good job on the report.

Consider counterfactual and impact: How does this impact the rest of my week's schedule? 3 hours of sleep is locally optimal, but I will crash tomorrow and I have a test to study for the next day. Decide to work a bit, go to sleep at 12:30AM and set an alarm for 5:30AM. Finish the report, turn it in by 10AM, then nap another 2 hours before studying.

We built this model not only from small samples of our own history, but also by learning from others and incorporating what we have seen of other people's experiences. We don't consider staying up all night and then driving to hand in the report, because we realize exhausted driving is dangerous - we have heard stories of people doing so, and know that we would be similarly unsteady. Is a person going to explore and try different strategies by staying up all night and driving? If you die, you can't learn from the experience - so you rely on having good ideas about what parts of the exploration space are safe to try. You might use Adderall because it's been tried before and is relatively safe, but you don't ingest arbitrary drugs to see if they help you think.
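One way to picture this is exploration restricted to strategies that second-hand evidence marks as safe. The sketch below is a toy under stated assumptions: the `OBSERVED` table, the outcome labels, and the function names are all hypothetical, invented purely for illustration.

```python
import random

# Hypothetical second-hand evidence: strategy -> outcomes seen in other people.
OBSERVED = {
    "nap_then_work": ["ok", "ok", "tired"],
    "caffeine": ["ok", "jittery"],
    "all_nighter_then_drive": ["ok", "crash"],
}

# Outcomes you cannot learn from afterwards (assumed labels).
CATASTROPHIC = {"crash"}


def safe_strategies(observed, catastrophic):
    """Keep strategies that others have tried with no catastrophic outcome."""
    return [
        strategy
        for strategy, outcomes in observed.items()
        if outcomes and not (catastrophic & set(outcomes))
    ]


def choose_to_explore(candidates, observed, catastrophic, rng):
    """Pick a random candidate, but only from the known-safe set.

    Strategies nobody has been observed trying are treated as unsafe.
    """
    safe = set(safe_strategies(observed, catastrophic))
    allowed = [c for c in candidates if c in safe]
    return rng.choice(allowed) if allowed else None


if __name__ == "__main__":
    rng = random.Random(0)
    options = ["nap_then_work", "all_nighter_then_drive", "arbitrary_drug"]
    print(choose_to_explore(options, OBSERVED, CATASTROPHIC, rng))
```

Here the untried `arbitrary_drug` option and the observed-dangerous driving option are both filtered out before any exploration happens. This filter depends entirely on having the `OBSERVED` table in the first place.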

BUT an AI doesn't (at first) have that sample data to reason from, nor does a singleton have observations of other structurally near-identical AI systems and the impacts of their decisions, nor does it have a fundamental understanding of what is safe to explore.

Decision Theory

by abramdemski, Scott Garrabrant | 1 min read | 31st Oct 2018 | 37 comments


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(A longer text-based version of this post is also available on MIRI's blog here, and the bibliography for the whole sequence can be found here.)

The next post in this sequence, 'Embedded Agency', will come out on Friday, November 2nd.
