## LESSWRONGLW

It seems inelegant to me that utility functions are created for specific situations, while these clearly aren't the same as that of the agent in total among all of their decisions. For instance, a model may estimate an agent's expected utility from the result of a specific intervention, but this clearly isn't quite right; the agent has a much more complicated utility function outside this intervention. According to a specific model, "Not having an intervention" could set "Utility = 0"; but for any real agent, it's quite likely their life wouldn't actually have 0 utility without the intervention.

It seems like it's important to distinguish that a utility score in a model is very particular to the scenario for that model, and does not represent a universal utility function for the agents in question.

Let be an agent's true utility function across a very wide assortment of possible states, and be the utility function used for the sake of the model. I believe that is supposed to approximate in some way; perhaps they should be related by an affine transformation.

The important thing for a utility function, as it is typically used (in decision models), is probably not that , but rather, that decisions made within the specific context of approximate those made using .

Here, I use brackets to describe "The expected value, according to a utility function", and D to describe the set of decisions made conditional on a specific utility function being used for decision making.

Then, we can represent this supposed estimation with:

Showing 3 of 8 replies (Click to show all)
3Pattern6moThe model has changed when the decisions it is used to make change. If the model 'reverses' and suggests doing the opposite/something different in every case from what it previously recommended, then it has 'completely changed'. (This might be roughly the McNamara fallacy, of declaring that things that 'can't be measured' aren't important.) EDIT: Also, if there's a set of information consisting of a bunch of pieces, A, B, and C, and incorporating all but one of them doesn't have a big impact on the model, but the last piece does, whichever piece that is, 'this metric' could lead to overestimating the importance of whichever piece happened to be last, when it's A, B, and C together that made an impact. It 'has this issue' because the metric by itself is meant to notice 'changes in the model over time', not figure out why/solve attribution.
4Stuart_Armstrong6moI delve into that here: https://www.lesswrong.com/posts/Lb3xCRW9usoXJy9M2/platonic-rewards-reward-features-and-rewards-as-information#Extending_the_problem [https://www.lesswrong.com/posts/Lb3xCRW9usoXJy9M2/platonic-rewards-reward-features-and-rewards-as-information#Extending_the_problem]

Oh fantastic, thanks for the reference!