A small example of one-step hypotheticals

by Stuart_Armstrong, 28th Jan 2019


Just a small example of what one-step hypotheticals might mean in theory and in practice.

This involves a human H pricing some small object, in this case a small violin.

In theory

The human H is (hypothetically) asked various questions that cause them to model how much they would pay for the small violin. These questions are asked at various times, and with various phrasings, and the results look like this:

Here the costings are all over the place, and one obvious way of reconciling them would be to take the mean (indicated by the large red square), which is around 5.5.

But it turns out there are extra patterns in the hypotheticals and the answers. For example, there is a clear difference between valuations that are done in the morning, around midday, or in the evening. And there is a difference if the violin is (accurately) described as "handmade":

There are now more options for finding a "true" valuation here. The obvious first step would be to over-weight the evening valuations, as there are fewer data points there (this would bring the average up a bit). Or one could figure out whether the "true" H was better represented by their morning, midday, or evening selves. Or whether their preference for "handmade" objects was strong and genuine, or a passing positive affect. H's various meta-preferences would all be highly relevant to these choices.
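As a rough sketch of that first step, the snippet below compares the plain mean with a mean in which each time of day counts equally, so the sparse evening valuations are over-weighted. The numbers are invented purely for illustration (they are not the data from the plots), and the names plain_mean and balanced_mean are hypothetical.

```python
from statistics import mean

# Hypothetical valuations, invented for illustration only:
# (time of day, stated price). Evening has the fewest data points.
valuations = [
    ("morning", 4.0), ("morning", 5.0), ("morning", 4.5),
    ("midday", 5.0), ("midday", 6.0), ("midday", 5.5),
    ("evening", 8.0),
]

def plain_mean(data):
    """Every data point counts equally."""
    return mean(price for _, price in data)

def balanced_mean(data):
    """Every time of day counts equally, whatever its number of data points."""
    groups = {}
    for time_of_day, price in data:
        groups.setdefault(time_of_day, []).append(price)
    return mean(mean(prices) for prices in groups.values())

print(plain_mean(valuations))     # pulled towards the well-sampled morning and midday values
print(balanced_mean(valuations))  # higher, since the single evening value now carries a third of the weight
```

Equal weight per time of day is only one possible choice. Which weighting is appropriate depends on H's meta-preferences, as discussed above.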

In practice

Ok, that's what might happen if the agent had the power to ask unlimited hypothetical questions in arbitrarily many counterfactual scenarios. But that is not the case in the real world: the agent would be able to ask one, or maybe two questions at most, before the human's attitude to the violin would change, and further data would become tainted.

Note that if the agent had a good brain model of H, it might be able to simulate all the relevant answers; but we'll assume for the moment that the agent doesn't have that capability.

So, in theory, huge amounts of data and many patterns that are meta-preferentially relevant. In practice, two values at most.

Now, if this were all that the agent had access to, then it could only make a crude guess. But if the agent were investigating the human more thoroughly, it could do a lot more. The pattern of valuing things differently at different times of the day might show up over longer observations, as would the pattern of reacting to key words in the description. If the agent assumed that "valuing objects" was not something that humans did ex nihilo with each object (with each object having its own independent quirky biases), then it could apply the template across all valuations, and from even a single data point (along with knowledge of the time of day, the description, etc.) come up with an estimate that was closer to the theoretical one.
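One way to picture "applying the template" is the minimal sketch below. It assumes, purely for illustration, that H's biases are additive and shared across objects: stated price = the object's underlying value + a time-of-day offset + a description offset. The offsets are estimated from H's past valuations of other objects, then removed from a single violin valuation. All data, object names, and function names (fit_template, debias) are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical past valuations by H of OTHER objects (all numbers invented):
# (object, time of day, "handmade" in the description?, stated price)
history = [
    ("mug", "morning", False, 1.5), ("mug", "morning", True, 2.5),
    ("mug", "midday", False, 2.5), ("mug", "midday", True, 3.5),
    ("mug", "evening", False, 3.5), ("mug", "evening", True, 4.5),
    ("lamp", "morning", False, 8.5), ("lamp", "morning", True, 9.5),
    ("lamp", "midday", False, 9.5), ("lamp", "midday", True, 10.5),
    ("lamp", "evening", False, 10.5), ("lamp", "evening", True, 11.5),
]

def fit_template(rows):
    """Average deviation from each object's own mean, per time of day and per description."""
    by_object = defaultdict(list)
    for obj, _, _, price in rows:
        by_object[obj].append(price)
    object_mean = {obj: mean(prices) for obj, prices in by_object.items()}

    time_resid, word_resid = defaultdict(list), defaultdict(list)
    for obj, time_of_day, handmade, price in rows:
        resid = price - object_mean[obj]
        time_resid[time_of_day].append(resid)
        word_resid[handmade].append(resid)
    time_offset = {t: mean(r) for t, r in time_resid.items()}
    word_offset = {h: mean(r) for h, r in word_resid.items()}
    return time_offset, word_offset

def debias(price, time_of_day, handmade, template):
    """Remove the shared offsets from a single stated price."""
    time_offset, word_offset = template
    return price - time_offset[time_of_day] - word_offset[handmade]

template = fit_template(history)
# A single (invented) violin valuation: asked in the evening, described as "handmade".
estimate = debias(7.0, "evening", True, template)
print(estimate)
```

The simple averaging here only works because the invented history is balanced (every object is valued in every context); real observations would need a proper regression. The sketch also takes the context-averaged price as the target, whereas deciding which context best represents the "true" H is again a matter for H's meta-preferences.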