*This post has benefitted from discussion with Sam Eisenstat, Scott Garrabrant, Tsvi Benson-Tilsen, Daniel Demski, Daniel Kokotajlo, and Stuart Armstrong. It started out as a thought about Stuart Armstrong's research agenda.*

In this post, I hope to say something about what it means for a rational agent to have preferences. The view I am putting forward is relatively new to me, but it is not *very* radical. It is, dare I say, a conservative view -- I hold close to Bayesian expected utility theory. However, my impression is that it differs greatly from *common impressions* of Bayesian expected utility theory.

I will argue against a particular view of expected utility theory -- a view which I'll call *reductive utility*. I do not recall seeing this view explicitly laid out and defended (except in in-person conversations). However, I expect at least a good chunk of the assumptions are commonly made.

# Reductive Utility

The core tenets of reductive utility are as follows:

- The sample space of a rational agent's beliefs is, more or less, the set of possible ways the world could be -- which is to say, the set of possible *physical configurations of the universe*. Hence, each world is one such configuration.
- The preferences of a rational agent are represented by a utility function from worlds to real numbers.
- Furthermore, the utility function should be a *computable* function of worlds.

Since I'm setting up the view which I'm knocking down, there is a risk I'm striking at a straw man. However, I think there are some good reasons to find the view appealing. The following subsections will expand on the three tenets, and attempt to provide some motivation for them.

If the three points seem obvious to you, you might just skip to the next section.

## Worlds Are Basically Physical

What I mean here resembles the standard physical-reductionist view. However, my emphasis is on certain features of this view:

- There is some "basic stuff" -- like quarks or vibrating strings or what-have-you.
- What there is to know about the world is some set of statements about this basic stuff -- particle locations and momentums, or wave-form function values, or what-have-you.
- These special atomic statements should be logically independent from each other (though they may of course be probabilistically related), and together, fully determine the world.
- These should (more or less) be what beliefs are about, such that we can (more or less) talk about beliefs in terms of the sample space as being the set of worlds understood in this way.

This is the so-called "view from nowhere", as Thomas Nagel puts it.

I don't intend to construe this position as ruling out certain non-physical facts which we may have beliefs about. For example, we may believe indexical facts on top of the physical facts -- there might be (1) beliefs about the universe, and (2) beliefs about where we are in the universe. Exceptions like this violate an extreme reductive view, but are still close enough to count as reductive thinking for my purposes.

## Utility Is a Function of Worlds

So we've got the "basically physical" $\omega \in \Omega$. Now we write down a utility function $U(\omega)$. In other words, utility is a random variable on our event space.

What's the big deal?

One thing this is saying is that *preferences are a function of the world*. Specifically, *preferences need not only depend on what is observed.* This is incompatible with standard RL in a way that matters.

But, in addition to saying that utility can depend on more than just observations, we are *restricting* utility to *only* depend on things that are in the world. After we consider all the information in $\omega$, there cannot be any extra uncertainty about utility -- no extra "moral facts" which we may be uncertain of. If there are such moral facts, they have to be present somewhere in the universe (at least, derivable from facts about the universe).

One implication of this: *if utility is about high-level entities, the utility function is responsible for deriving them from low-level stuff.* For example, if the universe is made of quarks, but utility is a function of beauty, consciousness, and such, then $U$ needs to contain the beauty-detector and consciousness-detector and so on -- otherwise how can it compute utility given all the information about the world?

## Utility Is Computable

Finally, and most critically for the discussion here, $U$ should be a computable function.

To clarify what I mean by this: $\omega$ should have some sort of representation which allows us to feed it into a Turing machine -- let's say it's an infinite bit-string which assigns true or false to each of the "atomic sentences" which describe the world. $U$ should be a computable function; that is, there should be a Turing machine $T$ which takes a rational number $\epsilon > 0$ and takes $\omega$, prints a rational number within $\epsilon$ of $U(\omega)$, and halts. (In other words, we can compute $U(\omega)$ to any desired degree of approximation.)

Why should $U$ be computable?

One argument is that $U$ should be computable because the agent has to be able to use it in computations. This perspective is especially appealing if you think of $U$ as a black-box function which you can only optimize through search. If you can't evaluate $U$, how are you supposed to use it? If $U$ exists as an actual module somewhere in the brain, how is it supposed to be implemented? (If you don't think this sounds very convincing, great!)

Requiring $U$ to be computable may also seem easy. What is there to lose? Are there preference structures we really care about being able to represent, which are fundamentally not computable?

And what would it even mean for a computable agent to have non-computable preferences?

However, the computability requirement is more restrictive than it may seem.

There is a sort of continuity implied by computability: $U$ must not depend too much on "small" differences between worlds. The computation of $U(\omega)$ to within $\epsilon$ only accesses finitely many bits of $\omega$ before it halts. All the rest of the bits in $\omega$ must not make more than $\epsilon$ difference to the value of $U(\omega)$.

This means some seemingly simple utility functions are not computable.

As an example, consider the procrastination paradox. Your task is to push a button. You get 10 utility for pushing the button. You can push it any time you like. However, if you never press the button, you get -10. On any day, you are fine with putting the button-pressing off for one more day. Yet, if you put it off forever, you lose!

We can think of $\omega$ as a string like 000000100.., where the "1" is the day you push the button. To compute the utility, we might look for the "1", outputting 10 if we find it.

But what about the all-zero universe, 0000000...? The program must loop forever. We can't tell we're in the all-zero universe by examining any finite number of bits. You don't know whether you will eventually push the button. (Even if the universe also gives your source code, you can't necessarily tell from that -- the logical difficulty of determining this about yourself is, of course, the original point of the procrastination paradox.)

Hence, a preference structure like this is not computable, and is not allowed according to the reductive utility doctrine.
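The failure can be sketched in Python. This is only an illustration under assumptions of my own: the world is an infinite iterator of bits, and the search is capped at a hypothetical `max_bits` budget so that the example terminates. Without the cap, the search would loop forever on the all-zero world.

```python
from itertools import chain, islice, repeat
from typing import Iterator, Optional


def button_utility_bounded(world: Iterator[int], max_bits: int) -> Optional[int]:
    """Search the first max_bits days for a button press.

    Returns 10 if a press is found, and None if the prefix is all zeros.
    None means "undecided", not -10: a finite prefix of zeros cannot
    distinguish "never pressed" from "not pressed yet".
    """
    for bit in islice(world, max_bits):
        if bit == 1:
            return 10
    return None


pressed_on_day_7 = chain([0] * 6 + [1], repeat(0))
never_pressed = repeat(0)
pressed_very_late = chain(repeat(0, 10_000), [1], repeat(0))

print(button_utility_bounded(pressed_on_day_7, 100))
print(button_utility_bounded(never_pressed, 10_000))
print(button_utility_bounded(pressed_very_late, 10_000))
```

The last two worlds have utilities 20 apart, yet they agree on every bit the bounded search inspects. No finite budget fixes this, so no approximation to within a small error is possible.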

The advocate of reductive utility might take this as a victory. The procrastination paradox has been avoided, and other paradoxes with a similar structure. (The St. Petersburg Paradox is another example.)

On the other hand, if you think this is a *legitimate preference structure*, dealing with such 'problematic' preferences motivates abandonment of reductive utility.

# Subjective Utility: The Real Thing

We can strongly oppose all three points without leaving orthodox Bayesianism. Specifically, I'll sketch how the Jeffrey-Bolker axioms enable non-reductive utility. (The title of this section is a reference to Jeffrey's book *Subjective Probability: The Real Thing.*)

However, the *real* position I'm advocating is more grounded in logical induction rather than the Jeffrey-Bolker axioms; I'll sketch that version at the end.

## The View From Somewhere

The reductive-utility view approached things from the starting-point of the universe. Beliefs are for what is real, and what is real is basically physical.

The non-reductive view starts from the standpoint of the agent. Beliefs are *for things you can think about*. This doesn't rule out a physicalist approach. What it *does* do is give high-level objects like tables and chairs an equal footing with low-level objects like quarks: both are inferred from sensory experience by the agent.

Rather than assuming an underlying set of *worlds*, Jeffrey-Bolker assume only a set of events. For two events $A$ and $B$, the conjunction $A \wedge B$ exists, and the disjunction $A \vee B$, and the negations $\neg A$ and $\neg B$. However, unlike in the Kolmogorov axioms, these are not assumed to be intersection, union, and complement of an underlying set of worlds.

Let me emphasize that: *we need not assume there are "worlds" at all.*

In philosophy, this is called situation semantics -- an alternative to the more common possible-world semantics. In mathematics, it brings to mind pointless topology.

In the Jeffrey-Bolker treatment, a world is just a maximally specific event: an event which describes everything completely. But there is no requirement that maximally-specific events exist. Perhaps any event, no matter how detailed, can be further extended by specifying some yet-unmentioned stuff. (Indeed, the Jeffrey-Bolker axioms assume this! Although, Jeffrey does not seem philosophically committed to that assumption, from what I have read.)

Thus, there need not be any "view from nowhere" -- no semantic vantage point from which we see the whole universe.

This, of course, deprives us of the objects which utility was a function of, in the reductive view.

## Utility Is a Function of Events

The reductive-utility view makes a distinction between utility -- the random variable $U$ itself -- and *expected* utility, which is the subjective estimate of the random variable which we use for making decisions.

The Jeffrey-Bolker framework does not make a distinction. Everything is a subjective preference evaluation.

A reductive-utility advocate sees the expected utility of an event $E$ as *derived from* the utility of the worlds within the event. They start by defining $U(\omega)$; then, we define the expected utility of an event $E$ as $\mathbb{E}[U \mid E] = \sum_{\omega \in E} U(\omega) \, P(\omega \mid E)$ -- or, more generally, the corresponding integral.

In the Jeffrey-Bolker framework, we instead define $V$ *directly* on events. These preferences are required to be *coherent with* breaking things up into sums, so $V(E) = \frac{V(E \wedge A) \, P(E \wedge A) + V(E \wedge \neg A) \, P(E \wedge \neg A)}{P(E)}$ -- but we do not define one from the other.

We don't have to know how to evaluate entire worlds in order to evaluate events. All we have to know is how to evaluate events!
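As a sanity check on the "breaking things up into sums" requirement, here is a small Python sketch. The three labelled outcomes, their probabilities, and their utilities are all made-up values for illustration. The sketch builds event values from a finite set of worlds only because that is a convenient way to get a coherent example; the Jeffrey-Bolker framework itself does not require worlds to exist.

```python
from fractions import Fraction

# Hypothetical toy model: three outcomes with made-up probabilities and utilities.
P = {"w1": Fraction(1, 2), "w2": Fraction(1, 4), "w3": Fraction(1, 4)}
U = {"w1": 10, "w2": -10, "w3": 4}


def prob(event: frozenset) -> Fraction:
    """Probability of an event (a set of outcome labels)."""
    return sum((P[w] for w in event), Fraction(0))


def value(event: frozenset) -> Fraction:
    """Value of an event: the probability-weighted average utility within it."""
    return sum((U[w] * P[w] for w in event), Fraction(0)) / prob(event)


E = frozenset(P)             # the event being evaluated
A = frozenset({"w1", "w2"})  # an arbitrary event used to split E

# Coherence: the value of E is the probability-weighted average of the values
# of "E and A" and "E and not A".
lhs = value(E)
rhs = (value(E & A) * prob(E & A) + value(E - A) * prob(E - A)) / prob(E)
print(lhs, rhs)
```

An agent that only stores `value` and `prob` on events can check this identity locally, for whichever events it happens to be thinking about, without ever enumerating complete worlds.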

I find it difficult to really believe "humans have a utility function", even approximately -- but I find it *much easier* to believe "humans have expectations on propositions". Something like that could even be true at the *neural* level (although of course we would not obey the Jeffrey-Bolker axioms in our neural expectations).

## *Updates* Are Computable

Jeffrey-Bolker doesn't say anything about computability. However, if we do want to address this sort of issue, it leaves us in a different position.

Because *subjective expectation is primary*, it is now more natural to require that the agent can evaluate events, without any requirement about a function on worlds. (Of course, we *could* do that in the Kolmogorov framework.)

Agents don't need to be able to compute the utility of a whole world. All they need to know is how to update expected utilities as they go along.

Of course, the subjective utility can't be just *any* way of updating as you go along. It needs to be *coherent*, in the sense of the Jeffrey-Bolker axioms. And, maintaining coherence can be very difficult. But it can be quite easy even in cases where the random-variable treatment of the utility function is not computable.

Let's go back to the procrastination example. In this case, to evaluate the expected utility of each action at a given time-step, the agent does not need to figure out whether it ever pushes the button. It just needs to have some probability, which it updates over time.

For example, an agent might initially assign probability $p_t$ to pressing the button at time $t$, and the remaining probability $p_\infty = 1 - \sum_t p_t > 0$ to never pressing the button. Its probability that it would ever press the button, and thus its utility estimate, would decrease with each observed time-step in which it didn't press the button. (Of course, such an agent would press the button immediately.)
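A minimal Python sketch of such an update follows. The specific prior is an assumption chosen for illustration: probability `2^-(t+1)` of first pressing at time `t` (for `t = 1, 2, ...`) and probability `1/2` of never pressing. The payoffs of +10 and -10 are the ones from the procrastination example above.

```python
from fractions import Fraction


def prob_ever_press(n: int) -> Fraction:
    """P(button is eventually pressed | it was not pressed at times 1..n).

    Assumed prior: P(first press at time t) = 2^-(t+1), P(never) = 1/2.
    """
    never = Fraction(1, 2)
    tail = Fraction(1, 2 ** (n + 1))  # sum over t > n of 2^-(t+1)
    return tail / (tail + never)


def expected_utility(n: int) -> Fraction:
    """Expected utility after n press-free steps: +10 if ever pressed, -10 if never."""
    p = prob_ever_press(n)
    return 10 * p - 10 * (1 - p)


for n in range(5):
    print(n, prob_ever_press(n), expected_utility(n))
```

Each update is a finite calculation, even though the underlying utility, viewed as a function of the whole infinite history, is not computable in the earlier sense. Under this assumed prior the conditional probability works out to `1 / (1 + 2^n)`, so the estimate falls toward -10 the longer the agent waits.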

Of course, this "solution" doesn't touch on any of the tricky logical issues which the procrastination paradox was originally introduced to illustrate. This isn't meant as a solution to the procrastination paradox -- only as an illustration of how to coherently update discontinuous preferences. This simple $U$ is **uncomputable** by the definition of the previous section.

It also doesn't address computational tractability in a very real way, since if the prior is very complicated, computing the subjective expectations can get extremely difficult.

We can come closer to addressing logical issues and computational tractability by considering things in a logical induction framework.

# Utility Is Not a Function

In a logical induction (LI) framework, the central idea becomes *"update your subjective expectations in any way you like, so long as those expectations aren't (too easily) exploitable to Dutch-book."* This clarifies what it means for the updates to be "coherent" -- it is somewhat more elegant than saying "... any way you like, so long as they follow the Jeffrey-Bolker axioms."

This replaces the idea of "utility function" entirely -- there isn't any need for a *function* any more, just a logically-uncertain-variable (LUV, in the terminology from the LI paper).

Actually, there are different ways one might want to set things up. I hope to get more technical in a later post. For now, here's some bullet points:

- In the simple procrastination-paradox example, you push the button if you have any uncertainty at all. So things are not that interesting. But, at least we've solved the problem.
- In more complicated examples -- where there is some real benefit to procrastinating -- an LI-based agent could totally procrastinate forever. This is because LI doesn't give any guarantee about converging to correct beliefs for uncomputable propositions like whether Turing machines halt or whether people stop procrastinating.
- Believing you'll stop procrastinating even though you won't is *perfectly coherent* -- in the same way that believing in nonstandard numbers is perfectly logically consistent. Putting ourselves in the shoes of such an agent, this just means we've examined our own decision-making to the best of our ability, and have put significant probability on "we don't procrastinate forever". This kind of reasoning is necessarily fallible.
- Yet, if a system we built were to do this, we might have strong objections. So, this can count as an alignment problem. How can we give feedback to a system to avoid this kind of mistake? I hope to work on this question in future posts.

IIUC, you argue that for an embedded agent to have an explicit utility function, it needs to be a function of the microscopic description of the universe. This is unsatisfactory since the agent shouldn't start out knowing microscopic physics. The alternative you suggest is using the more exotic Jeffrey-Bolker approach. However, this is not how I believe embedded agency should work.

Instead, you should consider a utility function that depends on the universe described *in whatever ontology the utility function is defined* (which we may call "macroscopic"). Microscopic physics comes in when the agent learns a fine-grained model of the dynamics in the macroscopic ontology. In particular, this fine-grained model can involve a fine-grained state space.

The other issue discussed is utility functions of the sort exemplified by the procrastination paradox. I think that besides being uncomputable, this brings in other pathologies. For example, since the utility functions you consider are discontinuous, it is no longer guaranteed an optimal policy exists at all. Personally, I think discontinuous utility functions are strange and poorly motivated.

I don't want to make a strong argument against your position here. Your position can be seen as one example of "don't make utility a function of the microscopic".

But let's pretend for a minute that I do want to make a case for my way of thinking about it as opposed to yours.

As for discontinuous utility:

My main motivating force here is to capture the maximal breadth of what rational (ie coherent, ie non-exploitable) preferences can be, in order to avoid ruling out some human preferences. I have an intuition that this can ultimately help get the right learning-theoretic guarantees as opposed to hurt, but, I have not done anything to validate that intuition yet.

With respect to procrastination-like problems, optimality has to be subjective, since there is no foolproof way to tell when an agent will procrastinate forever. If humans have any preferences like this, then alignment means alignment with human subjective evaluations of this matter -- if the human (or some extrapolated human volition, like HCH) looks at the system's behavior and says "NO!! Push the button now, you fool!!" then the system is misaligned. The value-learning should account for this sort of feedback in order to avoid this. But this does not attempt to minimize loss in an objective sense -- we export that concern to the (extrapolated?) human evaluation which we are bounding loss with respect to.

With respect to the problem of no-optimal-policy, my intuition is that you try for bounded loss instead; so (as with logical induction) you are never perfect but you have some kind of mistake bound. Of course this is more difficult with utility than it is with pure epistemics.

First, it seems to me rather clear what macroscopic physics I attach utility to. If I care about people, this means my utility function comes with some model of what a "person" is (that has many free parameters), and if something falls within the parameters of this model then it's a person, and if it doesn't then it isn't a person (ofc we can also have a fuzzy boundary, which is supported in quasi-Bayesianism).

Second, what does it mean for a hypothesis to be "individual"? If we have a prior over a family of hypotheses, we can take their convex combination and get a new individual hypothesis. So I'm not sure what sort of "fluidity" you imagine that is not supported by this.

The agent doesn't have full Knightian uncertainty over all microscopic possibilities. The prior is composed of *refinements* of an "ontological belief" that has this uncertainty. You can even consider a version of this formalism that is entirely Bayesian (i.e. each refinement has to be maximal), but then you lose the ability to retain an "objective" macroscopic reality in which the agent's point of view is "unspecial", because if the agent's beliefs about this reality have no Knightian uncertainty then it's inconsistent with the agent's free will (you could "avoid" this problem using an EDT or CDT agent but this would be bad for the usual reasons EDT and CDT are bad, and ofc you need Knightian uncertainty anyway because of non-realizability).

This does not strike me as the sort of thing which will be easy to write out. But there are other examples. What if humans value something like observer-independent beauty? EG, valuing beautiful things existing regardless of whether anyone observes their beauty. Then it seems pretty unclear what ontological objects it gets predicated on.

What I have in mind is complicated interactions between different ontologies. Suppose that we have one ontology -- the ontology of classical economics -- in which:

And we have another ontology -- the hippie ontology -- in which:

And suppose what we want to do is try to reconcile the value-content of these two different perspectives. This isn't going to be a mixture between two partial hypotheses. It might actually be closer to an intersection between two partial hypotheses -- since the different hypotheses largely talk about different entities. But that won't be right either. Rather, there is philosophical work to be done, figuring out how to appropriately mix the values which are represented in the two ontologies.

My intuition behind allowing preference structures which are "uncomputable" as functions of fully specified worlds is, in part, that one might continue doing this kind of philosophical work in an unbounded way -- IE there is no reason to assume there's a point at which this philosophical work is finished and you now have something which can be conveniently represented as a function of some specific set of entities. Much like logical induction never finishes and gives you a Bayesian probability function, even if it gets closer over time.

OK, that makes sense!

Right.

Of course it is not easy to write out. Human preferences are highly complex. By "clear" I only meant that it's clear something like this exists, not that I or anyone can write it out.

This seems ill-defined. What is a "thing"? What does it mean for a thing to "exist"? I can imagine valuing beautiful wild nature, by having "wild nature" be a part of the innate ontology. I can even imagine preferring certain computations to have results with certain properties. So, we can consider a preference that some kind of simplicity-prior-like computation outputs bit sequences with some complexity theoretic property we call "beauty". But if you want to go even more abstract than that, I don't know how to make sense of that ("make sense" *not* as "formalize" but just as "understand what you're talking about").

It would be best if you had a simple example, like a diamond maximizer, where it's more or less clear that it makes sense to speak of agents with this preference.

Why do we want to reconcile them? I think that you might be mixing two different questions here. The first question is what kind of preferences ideal "non-myopic" agents can have. About this I maintain that my framework provides a good answer, or at least a good first approximation of the answer. The second question is what kind of preferences *humans* can have. But humans are agents with only semi-coherent preferences, and I see no reason to believe things like reconciling classical economics with hippies should follow from any natural mathematical formalism. Instead, I think we should model humans as having preferences that change over time, and the detailed dynamics of the change is just a function the AI needs to learn, not some consequence of mathematical principles of rationality.

Nothing can deal with quark-level pictures, so it's the only option.

Using intuitive physics, there aren't any microscopic conditions. It's a recent discovery that macroscopic objects are made of invisibly tiny components. So there was a time when people didn't worry that moving one electron would destroy the universe because they had not heard of electrons, followed by a time when people knew that moving one electron would not destroy the universe because they understood electrons. Where's the problem?

It seems to me that the Jeffrey-Bolker framework is a poor match for what's going on in people's heads when they make value judgements, compared to the VNM framework. If I think about how good the consequences of an action are, I try to think about what I expect to happen if I take that action (ie the outcome), and I think about how likely that outcome is to have various properties that I care about, since I don't know exactly what the outcome will be with certainty. This isn't to say that I literally consider probability distributions in my mind, since I typically use qualitative descriptions of probability rather than numbers in [0,1], and when I do use numbers, they are very rough, but this does seem like a sort of fuzzy, computationally limited version of a probability distribution. Similarly, my estimations of how good various outcomes are are often qualitative, rather than numerical, and again this seems like a fuzzy, computationally limited version of a utility function. In order to determine the utility of the event "I take action A", I need to consider how good and how likely various consequences are, and take the expectation of the 'how good' with respect to the 'how likely'. The Jeffrey-Bolker framework seems to be asking me to pretend none of that ever happened.

I don't understand JB yet, but when I introspected just now, my experience of decision-making doesn't have any separation between beliefs and values, so I think I disagree with the above. I'll try to explain why by describing my experience. (Note: Long comment below is just saying one very simple thing. Sorry for length. There's a one-line tl;dr at the end.)

Right now I'm considering doing three different things. I can go and play a videogame that my friend suggested we play together, I can do some LW work with my colleague, or I can go play some guitar/piano. I feel like the videogame isn't very fun right now because I think the one my friend suggested is not that interesting of a shared experience. I feel like the work is fun because I'm excited about publishing the results of the work, and the work itself involves a kind of cognition I enjoy. And playing piano is fun because I've been skilling up a lot lately and I'm going to accompany some of my housemates in some Hamilton songs.

Now, I know some likely ways that what seems valuable to me might change. There are other videogames I've played lately that have been really fascinating and rewarding to play together, that involve problem solving where 2 people can be creative together. I can imagine the work turning out to not actually be the fun part but the boring parts. I can imagine that I've found no traction (skill-up) in playing piano, or that we're going to use a recorded soundtrack rather than my playing for the songs we're learning.

All of these to me feel like updates in my understanding of what events are reachable to me; this doesn't feel like changing my utility evaluation of the events. The event of "play videogame while friend watches bored" could change to "play videogame while creatively problem-solving with friend". The event of "gain skill in piano and then later perform songs well with friends" could change to "struggle to do something difficult and sound bad and that's it".

If I think about changing my utility function, I expect that would feel more like... well, I'm not sure. My straw version is "I creatively solve problems with my friend on a videogame, but somehow that's *objectively bad* so I will not do it". That's where some variable in the utility function changed while all the rest of the facts about my psychology and reality stay the same. This doesn't feel to me like my regular experience of decision-making.

But, maybe that's not the idea. The idea is like if I had some neurological change, perhaps I become more of a sociopath and stop feeling empathy and everyone just feels like objects to me rather than alive. Then a bunch of the social experiences above would change, they'd lose any experience of things like vicarious enjoyment and pleasure of bonding with friends. Perhaps that's what VNM is talking about in my experience.

I think that some of the standard "updates to my ethics / utility function" ideas that people discuss often don't feel like this to me. Like, some people say that reflecting on population ethics leads them to change their utility function and start to care about the far future. That's not my experience – for me it's been things like the times in HPMOR when Harry thinks about civilizations of the future, what they'll be like/think, and how awesome they can be. It feels real to me, like a reachable state, and this is what has changed a lot of my behaviour, in contrast with changing some variable in a function of world-states that's independent from my understanding of what events are achievable.

To be clear, sometimes I describe my experience more like the sociopath example, where my fundamental interests/values change. I say things like "I don't enjoy videogames as much as I used to" or "These days I value honesty and reliability a lot more than politeness", and there is a sense there where I now experience the same events very differently. "I had a positive meeting with John" might now be "I feel like he was being evasive about the topic we were discussing". The things that are salient to me change. And I think that the language of "my values have changed" is often an effective one for communicating that – even if my experience does not match beliefs|utility, any sufficiently coherent agent can be described this way, and it is often easy to help others model me by describing my values as having changed.

But I think my internal experience is more that I made substantial updates about what events I'm moving towards, and the event "We had a pleasant interaction which will lead to us working effectively together" has changed to "We were not able to say the possibly unwelcome facts of the matter, which will lead to a world where we don't work effectively together". So internally it feels like an update about what events are reachable, even though someone from the outside who doesn't understand my internal experience might more naturally say "It seems like Ben is treating the same event differently now, so I'll model him as having changed his values".

tl;dr: While I often talk separately about what actions I/you/we could take and how valuable those actions are, internally when I'm 'evaluating' the actions, I'm just trying to visualise what they are, and there is no second step of running my utility function on those visualisations.

As I say, I'm not sure I understand JB, so perhaps this is also inconsistent with it. I just read your comment and noticed it didn't match my own introspective experience, so I thought I'd share my experience.

I agree that the considerations you mentioned in your example are not changes in values, and didn't mean to imply that that sort of thing is a change in values. Instead, I just meant that such shifts in expectations are changes in probability distributions, rather than changes in events, since I think of such things in terms of how likely each of the possible outcomes are, rather than just which outcomes are possible and which are ruled out.

Ah, I see, that makes sense.

Perhaps it goes without saying, but obviously, both frameworks are flexible enough to allow for most phenomena -- the question here is what is more natural in one framework or another.

My *main* argument is that the procrastination paradox is not natural at all in a Savage framework, as it suggests an uncomputable utility function. I think this plausibly outweighs the issue you're pointing at.

But with respect to the issue you are pointing at:

In the Savage framework, an outcome already encodes everything you care about. So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems to be very demanding: it requires imagining these very detailed scenarios.

Alternately, we might say (as Savage said) that the Savage axioms apply to "small worlds" -- small scenarios which the agent abstracts from its experience, such as the decision of whether to break an egg for an omelette. These can be easily considered by the agent, if it can assign values "from outside the problem" in an appropriate way.

But then, to account for the breadth of human reasoning, it seems to me we also want an account of things like extending a small world when we find that it isn't sufficient, and coherence between different small-world frames for related decisions.

This gives a picture very much like the Jeffrey-Bolker picture, in that we don't really work with outcomes which completely specify everything we care about, but rather, work with a variety of simplified outcomes with coherence requirements between simpler and more complex views.

So overall I think it is better to have some picture where you can break things up in a more tractable way, rather than having full outcomes which you need to pass through to get values.

In the Jeffrey-Bolker framework, you can re-estimate the value of an event by breaking it up into pieces, estimating the value and probability of each piece, and combining them back together. This process could be iterated in a manner similar to dynamic programming in RL, to improve value estimates for actions -- although one needs to settle on a story about where the information originally comes from. I currently like the logical-induction-like picture where you get information coming in "somehow" (a broad variety of feedback is possible, including abstract judgements about utility which are hard to cash out in specific cases) and you try to make everything as coherent as possible in the meanwhile.
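As a rough illustration of the "break it up and recombine" step described above, here is a minimal Python sketch. It assumes the pieces are mutually exclusive sub-events, each given as a (probability, value) pair, and uses the Jeffrey-style averaging rule V(E) = Σ P(Eᵢ)·V(Eᵢ) / P(E). The function name `combine` and all the numbers are hypothetical, chosen only for the example.

```python
from fractions import Fraction

def combine(pieces):
    """Recombine mutually exclusive sub-events into their union E.

    pieces: list of (probability, value) pairs, one per sub-event.
    Returns (P(E), V(E)), where V(E) is the probability-weighted
    average of the sub-event values.
    """
    total_p = sum(p for p, _ in pieces)
    if total_p == 0:
        raise ValueError("event has probability zero; its value is undefined")
    value = sum(p * v for p, v in pieces) / total_p
    return total_p, value

# Hypothetical sub-events of "I go outdoors": it rains, or it is sunny.
rain = (Fraction(3, 10), Fraction(-2))
sun = (Fraction(1, 2), Fraction(4))
outdoors = combine([rain, sun])  # (Fraction(4, 5), Fraction(7, 4))

# The step can be iterated: a combined event is itself a (probability, value)
# piece that can be merged with further pieces.
indoors = (Fraction(1, 5), Fraction(0))
everything = combine([outdoors, indoors])  # (Fraction(1, 1), Fraction(7, 5))
```

Because the result of `combine` has the same shape as its inputs, estimates for larger events can be rebuilt from whatever partition is convenient, which is the property the dynamic-programming analogy relies on. Where the initial probabilities and values come from is left open here, as in the comment above.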

Yes, but if you don't know which outcome is the true one, so you're considering a probability distribution over outcomes instead of a single outcome, then it still makes sense to speak of the probability that the true outcome has some feature. This is what I meant.

You do not need to be able to imagine every possible outcome individually in order to think of functions on or probability distributions over the set of outcomes, any more than I need to be able to imagine each individual real number in order to understand the function x↦x² or the standard normal distribution.

It seems that you're going by an analogy like Jeffrey-Bolker : VNM :: events : outcomes, which is partially right, but leaves out an important sense in which the correct analogy is Jeffrey-Bolker : VNM :: events : probability distributions, since although utility is defined on outcomes, the function that is actually evaluated is expected utility, which is defined on probability distributions (this being a distinction that does not exist in Jeffrey-Bolker, but does exist in my conception of real-world human decision making).

I've curated this. This seems to me like an important conceptual step in understanding agency, the subjective view is very interesting and surprising to me. This has been written up very clearly and well, I expect people to link back to this post quite a lot, and I'm really excited to read more posts on this. Thanks a lot Abram.

First, I really like this shift in thinking, partly because it moves the needle toward an anti-realist position, where you don't even need to postulate an external world (you probably don't see it that way, despite saying "Everything is a subjective preference evaluation").

Second, I wonder if you need an even stronger restriction, not just computable, but efficiently computable, given that it's the agent that is doing the computation, not some theoretical AIXI. This would probably also change "too easily" in "those expectations aren't (too easily) exploitable to Dutch-book" to "efficiently". Maybe it should be even more restrictive to avoid diminishing returns trying to squeeze every last bit of utility by spending a lot of compute.

I definitely see it as a shift in that direction, although I'm not ready to really bite the bullets -- I'm still feeling out what I personally see as the implications. Like, I want a realist-but-anti-realist view ;p

Right, that's very much what I'm thinking.

Well, we all advance at our own pace. Accepting that reality, truth and existence are relative and often subjective notions is not an easy step :) Or that there are various degrees of existence.

I don't think the motivation for this is quite the same as the motivation for pointless topology, which is designed to mimic classical topology in a way that Jeffrey-Bolker-style decision theory does not mimic VNM-style decision theory. In pointless topology, a continuous function of locales X→Y is a function from the lattice of open sets of X to the lattice of open sets of Y. So a similar thing here would be to treat a utility function as a function from some lattice of subsets of R (the Borel subsets, for instance) to the lattice of events.

My understanding of the Jeffrey-Bolker framework is that its primary difference from the VNM framework is not its pointlessness, but the fact that it comes with a prior probability distribution over outcomes, which can only be updated by conditioning on events (i.e. updating on evidence that has probability 1 in some worlds and probability 0 in the rest). VNM does not start out with a prior, and allows any probability distribution over outcomes to be compared to any other, and Jeffrey-Bolker only allows comparison of probability distributions obtained by conditioning the prior on an event. Of course, this interpretation requires a fair amount of reading between the lines, since the Jeffrey-Bolker axioms make no explicit mention of any probability distribution, but I don't see any other reasonable way to interpret them, since if asked which of two events is better, I will often be unable to answer without further information, since the events may contain worlds of widely varying utility. Associating an event with a fixed prior conditioned on the event gives me this additional information needed to answer the question, and I don't see how any others could work. Starting with a prior that gets conditioned on events that correspond to the agent's actions seems to build in evidential decision theory as an assumption, which makes me suspicious of it.
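The reading above, in which an event is evaluated by conditioning a fixed prior on it, can be made concrete with a small sketch. This is only an illustration of that interpretation, not something the Jeffrey-Bolker axioms state explicitly; the worlds, priors, utilities, and the name `event_value` are all hypothetical.

```python
from fractions import Fraction

def event_value(prior, utility, event):
    """Value of an event as the expected utility of the prior conditioned on it."""
    p_event = sum(prior[w] for w in event)
    if p_event == 0:
        raise ValueError("cannot condition on a probability-zero event")
    return sum(prior[w] * utility[w] for w in event) / p_event

utility = {"a": Fraction(0), "b": Fraction(4), "c": Fraction(8)}
e1, e2 = {"a", "c"}, {"b"}

# Two hypothetical priors over the same three worlds.
prior_1 = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
prior_2 = {"a": Fraction(1, 10), "b": Fraction(1, 2), "c": Fraction(2, 5)}

# Under prior_1, e1 is worth 8/3 and e2 is worth 4, so e2 is preferred.
# Under prior_2, e1 is worth 32/5 and e2 is still worth 4, so e1 is preferred.
v1 = (event_value(prior_1, utility, e1), event_value(prior_1, utility, e2))
v2 = (event_value(prior_2, utility, e1), event_value(prior_2, utility, e2))
```

The ranking of the two events flips when only the prior changes, which is the sense in which "which of two events is better" has no answer until a prior is supplied.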

This can be resolved by defining worlds to be minimal non-zero elements of the completion of the Boolean algebra of events, rather than a minimal non-zero event. This is what you seemed to be implicitly doing later with the infinite bitstrings example, where the events were clopen subsets of Cantor space (i.e. sets of infinite bitstrings such that membership in the set only depends on finitely many bits), and this Boolean algebra has no minimal non-zero elements (maximally-specific events), but the minimal non-zero elements of its completion correspond to infinite bitstrings, as desired.

*Part* of the point of the JB axioms is that probability is constructed *together with utility* in the representation theorem, in contrast to VNM, which constructs utility via the representation theorem, but takes probability as basic.

This makes Savage a better comparison point, since the Savage axioms are more similar to the VNM framework while also trying to construct probability and utility together with one representation theorem.

As a representation theorem, this makes VNM weaker and JB stronger: VNM requires stronger assumptions (it requires that the preference structure include information about all these probability-distribution comparisons), where JB only requires preference comparison of events which the agent sees as real possibilities. A similar remark can be made of Savage.

Right, that's fair. Although: James Joyce, the big CDT advocate, is quite the Jeffrey-Bolker fan! See Why We Still Need the Logic of Decision for his reasons.

Doesn't pointless topology allow for *some* distinctions which aren't meaningful in pointful topology, though? (I'm not really very familiar, I'm just going off of something I've heard.)

Isn't the approach you mention pretty close to JB? You're not modeling the VNM/Savage thing of arbitrary gambles; you're just assigning values (and probabilities) to events, like in JB.

Setting aside VNM and Savage and JB, and considering the most common approach in practice -- use the Kolmogorov axioms of probability, and treat utility as a random variable -- it seems like the pointless analogue would be close to what you say.

Yeah. The question remains, though: should we think of utility as a function of these minimal elements of the completion? Or not? The computability issue I raise is, to me, suggestive of the negative.

Sure, I guess I just always talk about VNM instead of Savage because I never bothered to learn how Savage's version works. Perhaps I should.

This might be true if we were idealized agents who do Bayesian updating perfectly without any computational limitations, but as it is, it seems to me that the assumption that there is a fixed prior is unreasonably demanding. People sometimes update probabilities based purely on further thought, rather than empirical evidence, and a framework in which there is a fixed prior which gets conditioned on events, and banishes discussion of any other probability distributions, would seem to have some trouble handling this.

Sure, for instance, there are many distinct locales that have no points (only one of which is the empty locale), whereas there is only one ordinary topological space with no points.

Assuming you're referring to "So a similar thing here would be to treat a utility function as a function from some lattice of subsets of R (the Borel subsets, for instance) to the lattice of events", no. In JB, the set of events is the domain of the utility function, and in what I said, it is the codomain.

I think that computable is obviously too strong a condition for classical utility; enumerable is better.

Imagine you're about to see the source code of a machine that's running, and if the machine eventually halts then 2 utilons will be generated. That's a simpler problem to reason about than the procrastination paradox, and your utility function is enumerable but not computable. (Likewise, logical inductors obviously don't make PA *approximately computable*, but their properties are what you'd want the definition of *approximately enumerable* to be, if any such definition were standard.)

I suspect that the procrastination paradox leans heavily on the computability requirement as well.

I'm not sure what it would mean for a real-valued function to be enumerable. You could call a function f:X→R enumerable if there's a program that takes x∈X as input and enumerates the rationals that are less than f(x), but I don't think this is what you want, since presumably if a Turing machine halting can generate a positive amount of utility that doesn't depend on the number of steps taken before halting, then it could generate a negative amount of utility by halting as well.

I think accepting the type of reasoning you give suggests that limit-computability is enough (ie there's a program that takes x∈X and produces a sequence of rationals that converges to f(x), with no guarantees on the rate of convergence). Though I don't agree that it's obvious we should accept such utility functions as valid.
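To make the limit-computability definition concrete, here is a toy Python sketch. It models a "machine" as a zero-argument generator function, where each `next` call is one step and exhaustion means halting; that modelling choice, and the names `halts_within`, `utility_estimates`, `countdown`, and `loop_forever`, are assumptions made purely for illustration. The utility is 1 if the machine halts and 0 otherwise.

```python
from itertools import islice

def halts_within(program, steps):
    """Run a fresh copy of `program` for at most `steps` steps.

    Returns True if the generator is exhausted (the machine halts)
    within that budget, False otherwise.
    """
    run = program()
    for _ in range(steps):
        try:
            next(run)
        except StopIteration:
            return True
    return False

def utility_estimates(program):
    """Yield an endless sequence of estimates of 'utility = 1 iff program halts'.

    The n-th estimate only uses an n-step simulation, so each estimate is
    computable, and the sequence converges to the true value with no bound
    on how long convergence takes.
    """
    n = 0
    while True:
        yield 1 if halts_within(program, n) else 0
        n += 1

def countdown():
    # Halts after three steps.
    for _ in range(3):
        yield

def loop_forever():
    # Never halts.
    while True:
        yield

halting_run = list(islice(utility_estimates(countdown), 6))         # [0, 0, 0, 0, 1, 1]
non_halting_run = list(islice(utility_estimates(loop_forever), 6))  # [0, 0, 0, 0, 0, 0]
```

No finite prefix of the second sequence distinguishes "never halts" from "halts later", which is why this is convergence in the limit rather than computability. In this particular case the estimates also only ever increase, so they approximate the utility from below; if halting produced negative utility instead, the same construction would approximate it from above.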

I mean the sort of "eventually approximately consistent over computable patterns" thing exhibited by logical inductors, which is stronger than limit-computability.

It's not clear to me what this means in the context of a utility function.

Let's talk first about non-embedded agents.

Say that I'm given the specification of a Turing machine, and I have a computable utility mapping from output states (including "does not halt") to [0,1]. We presumably agree *that* is possible.

I agree that it's impossible to make a computable mapping from Turing machines to outcomes, so therefore I cannot have a computable utility function from TMs to the reals which assigns the same value to any two TMs with identical output.

But I can have a logical inductor which, for each TM, produces a sequence of predictions about that TM's output's utility. Every TM that halts will eventually get the correct utility, and every TM that doesn't will converge to some utility in [0,1], with the usual properties for logical inductors guaranteeing that TMs easily proven to have the same output will converge to the same number, etc.

That's a computable sequence of utility functions over TMs with asymptotic good properties. At any stage, I could stop and tell you that I choose some particular TM as the best one as it seems to me now.

I haven't really thought in a long while about questions like "do logical inductors' good properties of self-prediction mean that they could avoid the procrastination paradox", so I could be talking nonsense there.

I think we're going to have to back up a bit. Call the space of outcomes O and the space of Turing machines M. It sounds like you're talking about two functions, U:O→R and eval:M→O. I was thinking of U as the utility function we were talking about, but it seems you were thinking of U∘eval.

You suggested U should be computable but eval should not be. It seems to me that eval should certainly be computable (with the caveat that it might be a partial function, rather than a total function), as computation is the only thing Turing machines do, and that if non-halting is included in a space of outcomes (so that eval is total), it should be represented as some sort of limit of partial information, rather than represented explicitly, so that eval is continuous.

In any case, a slight generalization of Rice's theorem tells us that any computable function from Turing machines to reals that depends only on the machine's semantics must be constant, so I suppose I'm forced to agree that, if we want a utility function U∘eval that is defined on all Turing machines and depends only on their semantics, then at least one of U or eval should be uncomputable. But I guess I have to ask why we would want to assign utilities to Turing machines.

I've been using computable to mean a total function (each instance is computable in finite time).

I'm thinking of an agent outside a universe about to take an action, and each action will cause that universe to run a particular TM. (You could maybe frame this as "the agent chooses the tape for the TM to run on".) For me, this is analogous to acting in the world and causing the world to shift toward some outcomes over others.

By asserting that U should be the computable one, I'm asserting that "how much do I like this outcome" is a more tractable question than "which actions result in this outcome".

An intuition pump in a human setting:

I can check whether given states of a Go board are victories for one player or the other, or if the game is not yet finished (this is analogous to U being a total computable function). But it's much more difficult to choose, for an unfinished game where I'm told I have a winning strategy, a move such that I still have a winning strategy. The best I can really do as a human is calculate a bit and then guess at how the leaves will probably resolve if we go down them (this is analogous to eval being an enumerable but not necessarily computable function).

In general, individual humans are much better at figuring out what outcomes we want than we are at figuring out exactly how to achieve those outcomes. (It would be quite weird if the opposite were the case.) We're not good at either in an absolute sense, of course.

Two points about how I think about this that differ significantly. (I just read up on Bolker and Jeffrey, as I was previously unfamiliar.) I had been thinking about writing this up more fully, but have been busy. (i.e. if people think it's worthwhile, tell me and I will be more likely to do so.)

First, utility is only ever computed over models of reality, not over reality itself, because it is a part of the decision making process, not directly about any self-monitoring or feedback process. It is never really evaluated against reality, nor does it need to be. Evidence for this in humans is that people suck at actually noticing how they feel, what they like, etc. The updating of their world model is a process that happens alongside planning and decision making, and is only sometimes actively a target of maximizing utility because people's model can include correspondence with reality as a goal. Many people simply don't do this, or care about map/reality correspondence. They are very unlikely to read or respond to posts here, but any model of humans should account for their existence, and the likely claim that their brains work the same way other people's brains do.

Second, Jeffrey's "News Value" is how he fits in a relationship between utility and reality. As mentioned, for many people their map barely corresponds to the territory, and they don't seem to suffer much. (Well, unless an external event imposes itself on them in a way that affects them in the present. And even then, how often do they update their model?) So I don't think Jeffrey is right. Instead, I don't think an agent could be said to "have" utility at all - utility maximization is a process, never an evaluated goal. The only reason reality matters is because it provides feedback to the model over which people evaluate utility, not because utility is lost or gained. I think this also partly explains happiness set points - as a point of noticing reality, humans are motivated by anticipated reward more than reward. I think the model I propose makes this obvious, instead of surprising.

Planned summary for the Alignment Newsletter:

Thank you for this.

Your characterization of Reductive Utility matches very well my own experience in philosophical discussion about utilitarianism. Most of my interlocutors object that I am proposing a reductive utility notion which suffers from incomputability (which is essentially how Anscombe dismissed it all in one paragraph, pitting generations of philosophers eternally against any form of consequentialism).

However, I always thought it was obvious that one need not believe that objects and moral thinking must be derived from ever lower levels of world states.

What do you think are the downstream effects of holding Reductive Utility Function theory?

I'm thinking the social effects of RUF are more compartmentalization of domains, because from an agent perspective their continuity is incomputable. Does that make sense?

I do not think you are selling a strawman, but the notion that a utility function should be computable seems to me to be completely absurd. It seems like a confusion born from not understanding what computability means in practice.

Say I have a computer that will simulate an arbitrary Turing machine T, and will award me one utilon when that machine halts, and do nothing for me until that happens. With some clever cryptocurrency scheme, this is a scenario I could actually build today. My utility function ought plausibly to have a term in it that assigns a positive value to the computer simulating a halting Turing machine, and zero to the computer simulating a non-halting Turing machine. Yet the assumption of utility function computability would rule out this very sensible desire structure.

If I live in a Conway's Game of Life universe, there may be some chunk of universe somewhere that will eventually end up destroying all life (in the biological sense, not the Game of Life sense) in my universe. I assign lower utility to universes where this is the case than to those where it is not. Is that computable? No.

More prosaically, as far as I currently understand, the universe we actually live in seems to be continuous in nature, and its state may not be describable even in principle with a finite number of bits. And even if it is, I do not actually know this, which means my utility function is also over potential universes (which, as far as I know, might be the one I live in) that require an infinite amount of state bits. Why in the world would one expect a utility function over an uncountable domain to be computable?

As far as I can see, the motivation for requiring a utility function to be computable is that this would make optimization for said utility function a great deal easier. Certainly this is true; there are powerful optimization techniques that apply only to computable utility functions, that an optimizer with an uncomputable utility function does not have access to in their full form. But the utility function is not up for grabs; the fact that life will be easier for me if I want a certain thing should not be taken as an indication that that is what I want! This seems to me like the cart-before-horse error of trying to interpret the problem as one that is easier to solve, rather than the problem one actually wants solved.

This line of thought here illustrates very well the (I claim) grossly mistaken intuition for assuming computability. If you can't evaluate U() perfectly, then perhaps what your brain is doing is only an approximation of what you really want, and perhaps the same constraint will hold for any greater mind that you can devise. But that does not mean that what your brain is optimizing for is necessarily what it actually wants! There is no requirement at all that your brain is a perfect judge of the desirability of the world it's looking at, after all (and we know for a fact that it does a far from perfect job at this).

No, you can't do that today. You could produce a contraption that will deposit 1 BTC into a certain bitcoin wallet if and when some computer program halts, but this won't do the wallet's owner much good if they die before the program halts. If you reflect on what it means to award someone a utilon, rather than a bitcoin, I maintain that it isn't obvious that this is even possible in theory.

There is a notion of computability in the continuous setting.

This seems like a strawman to me. A better motivation would be that agents that actually exist are computable, and a utility function is determined by judgements rendered by the agent, which is incapable of thinking uncomputable thoughts.

Clearly, there is a kind of utility function action that is computable. Clearly the kind of UF that is defined in terms of preferences over fine-grained world-states isn't computable. So, clearly, "utility function" is being used to mean different things.

That seems to conflate two different things: whether you can compute the occurrence of event E, as opposed to whether you could compute your preference for E over not E.

What does it mean for the all-zero universe to be infinite, as opposed to not being infinite? Finite universes have a finite number of bits of information describing them. (This doesn't actually negate the point that uncomputable utility functions exist; it merely suggests that utility functions that care whether they are in a mostly-empty vs perfectly empty universe are a weak example.)

What happens if the author/definer of U(E) is wrong about the probabilities? If U(E) is not defined from, nor defined by, the value of its sums, what bad stuff happens if they aren't equal? Consider the dyslexic telekinetic at a roulette table, who places a chip on 6, but thinks he placed the chip on 9; Proposition A is "I will win if the ball lands in the '9' cup" (or "I have bet on 9", or all such similar propositions), and event E is that agent exercising their telekinesis to cause the ball to land in the 9 cup. (Putting decisions and actions in the hypothetical to avoid a passive agent.)

Is that agent merely *mistaken* about the value of U(E), as a result of their error on P(A) and following the appropriate math? Does their error result in a major change in their utility _function_ _computation_ measurement when they correct their error? Is it considered safe for an agent to justify cascading major changes in utility measurement over many (literally all?) events after updating a probability?

An instantiated entity (one that exists in a world) can only know of events E where such events are either observations that they make, or decisions that they make; I see flaws with an agent who sets forth actions that it believes sufficient to bring about a desired outcome and then feels satisfied that it is done, and also with an agent that is seeking spoofable observations about that desired outcome (in particular, the kind of dynamic where agents will seek evidence that tends to confirm desirable event E, because that evidence makes the agent happy, and evidence against E makes the agent sad, so they avoid such evidence).

Ultimately, I am advocating a logical-induction like treatment of this kind of thing.

Evidence *of all sorts* can come into the system, at any time. The system might suddenly get information about the utility of some hypothetical example, or a logical proposition about utility, whatever. It can be arbitrarily difficult to connect this evidence to practical cases. However, the traders work to reduce inconsistencies throughout the whole system, and therefore, evidence gets propagated more or less as well as it can be.

There is at least one major step that I did not know of, between the things I think I understand and a market that has currency and traders.

I understand how a market of traders can result in a consensus evaluation of probability, because there is a *correct* evaluation of the probability of a proposition. How does a market of traders result in a consensus evaluation of the utility of an event? If two traders disagree about whether to pull the lever, how is it determined which one gets the currency?

The mechanism is the same in both cases:

*events* via conditional betting.

If we want to do reward-learning in a setup like this, the (discounted) rewards can be incremental payouts of the U shares. But note that even if there is no feedback of any kind (IE, the shares of U never *actually* pay out), the shares equilibrate to a subjective value on the market -- like collector's items. But the market still forces the changes in value over time to be increasingly coherent, and the conditional beliefs about it to be increasingly coherent. This corresponds to fully subjective utility with no outside feedback.

They make bets about what happens if the lever is or isn't pulled (including conditional buys/sells of shares of utility). These bets will be evaluated as normal. In this setup we only get feedback on whichever action actually happens -- but, this may still be enough data to learn under certain assumptions (which I hope to discuss in a future post). We can also consider more exotic settings in which we do get feedback on both cases even though only one happens; this could be feasible through human feedback about counterfactuals. (I also hope to discuss this alternative in a future post.)

Suppose the utility trading commission discovered that a trader used forbidden methods to short a utility bet (e.g. insider trading, coercing other traders, exploiting a flaw in the marketplace), and takes action to confiscate the illicit gains.

What actions transfer utility from the target? (In systems that pay out money, their bank account is debited; in systems that use blockchain, transactions are added or rolled back manually) what does it mean to take utility from a trader directly?

What it means here *is precisely* that it is described by an infinite number of bits -- specifically, an infinite number of zeros!

Granted, we *could* try to reorganize the way we describe the universe so that we have a *short* code for that world, rather than an infinitely long one. This becomes a fairly subtle issue. I will say a couple of things:

First, it seems to me like the reductionist may want to object to such a reorganization. In the reductive view, it is important that there is a *special* description of the universe, in which we have isolated the actual *basic facts of reality* -- things resembling particle position and momentum, or what-have-you.

Second, I challenge you to propose a description language which (a) makes the procrastination example computable, (b) maps all worlds onto a description, and (c) does not create any invalid input tapes.

For example, I can make a modified universe-description in which the first bit is '1' if the button *ever* gets pressed. The rest of the description remains as before, placing a '1' at time-steps when the button is pressed (but offset by one place, to allow for the extra initial bit). So seeing '0' right away tells me I'm in the button-never-pressed world; it now has a 1-bit description, rather than an infinite-bit description. HOWEVER, this description language includes a description which does not correspond to any world, and is therefore invalid: the string which starts with '1' but then contains only zeros forever.

This issue has a variety of potential replies/implications -- I'm not saying the situation is clear. I didn't get into this kind of thing in the post because it seems like there are just too many things to say about it, with no totally clear path.
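The modified description language from the example above can be sketched in a few lines of Python. As a simplifying assumption, a world is represented here by a finite set of time-steps at which the button is pressed (so worlds with infinitely many presses are left out); the names `describe` and `prefix` are hypothetical.

```python
from itertools import islice

def describe(press_times):
    """Yield the modified description of a world as an endless bit stream.

    The first bit is 1 iff the button is ever pressed. The remaining bits are
    the original description, offset by one place: a 1 at each time-step when
    the button is pressed, and 0 elsewhere.
    """
    yield 1 if press_times else 0
    t = 0
    while True:
        yield 1 if t in press_times else 0
        t += 1

def prefix(stream, n):
    """Return the first n bits of a description."""
    return list(islice(stream, n))

never_pressed = prefix(describe(frozenset()), 5)  # [0, 0, 0, 0, 0]
pressed_at_2 = prefix(describe({2}), 5)           # [1, 0, 0, 1, 0]
pressed_at_10 = prefix(describe({10}), 5)         # [1, 0, 0, 0, 0]
```

The last line shows why the invalid tape is awkward to rule out: every finite prefix of "1 followed by zeros forever" is also a prefix of some valid description (a world where the press simply comes later), so no finite inspection of the tape can reject it.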

The universe that is described by an infinite string of zeroes differs from the universe that is described by the empty string in what manner?

The requirement about computability:

An infinite loop may be a paradox. Perhaps the paradox exists only because of the infinity, or some confusion stemming from it or how it is used?*

What is the difference between 0.9999 that goes on forever, and 1? In the real numbers, 0.

How do you determine this? If you know the process generating the numbers you can tell.

Practically?

1. If only a finite number of digits is relevant to your decision it doesn't matter. (Additionally, if a theory isn't falsifiable, a) should we consider the hypothesis?, and b) is there lower hanging fruit we should pick before trying to solve a potentially unsolvable problem?)

2. Wait. Where did you get an infinite number of bits (which you are unable to analyze because they are infinite) *from*? (Computability sounds nice, but absent arbitrarily large (i.e. infinite) computing resources, in this universe, past a certain point, computability doesn't seem to exist in a practical sense.)

*It isn't necessarily clear that the environment must be computable. (Even if there is some proof of this, an agent unaware of the proof a) must function without it, and b) must decide whether it is worth investing the time to try and find/create it.)

This seems like a weak argument. If I think about a human trying to achieve some goal in practice, "think of U() as a black-box function which you can only optimize through search" doesn't really describe how we typically reason. I would say that we optimize for things we can't evaluate all the time - it's our default mode of thought.

We don't need to evaluate U() in order to decide which of two options yields higher U().

Example: suppose I'm a general trying to maximize my side's chance of winning a war. Can I evaluate the probability that we win, given *all* of the information available to me? No - fully accounting for every little piece of info I have is way beyond my computational capabilities. Even reasoning through an entire end-to-end plan for winning takes far more effort than I usually make for day-to-day decisions. Yet I can say that some actions are likely to increase our chances of victory, and I can prioritize actions which are more likely to increase our chances of victory by a larger amount.

Suppose I'm running a company, trying to maximize profits. I don't make decisions by looking at the available options, and then estimating how profitable I expect the company to be under each choice. Rather, I reason locally: at a cost of X I can gain Y, I've cached an intuitive valuation of X and Y based on their first-order effects, and I make the choice based on that without reasoning through all the second-, third-, and higher-order effects of the choice. I don't calculate all the way through to an expected utility or anything comparable to it.

If I see a $100 bill on the ground, I don't need to reason through exactly what I'll spend it on in order to decide to pick it up.

In general, I think humans usually make decisions directionally and locally: we try to decide which of two actions is more likely to better achieve our goals, based on local considerations, *without* actually simulating all the way to the possible outcomes.

Taking a more theoretical perspective... how would a human or other agent work with an uncomputable U()? Well, we'd consider specific choices available to us, and then try to guess which of those is more likely to give higher U(). We might look for proofs that one specific choice or another is better; we might leverage logical induction; we might do something else entirely. None of that necessarily requires evaluating U().

Yeah, a didactic problem with this post is that when I write everything out, the "reductive utility" position does not sound that tempting.

I still think it's a really easy trap to fall into, though, because before thinking too much the assumption of a computable utility function sounds extremely reasonable.

With dynamic-programming inspired algorithms such as AlphaGo, "cached an intuitive valuation of X and Y" is modeled as a kind of approximate evaluation which is learned based on feedback -- but feedback requires the ability to compute U() at some point. (So you don't start out knowing how to evaluate uncertain situations, but you do start out knowing how to evaluate utility on completely specified worlds.)

So one might still reasonably assume you need to be able to compute U() despite this.
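To make the worry concrete, here is a minimal sketch of that kind of setup, far simpler than AlphaGo and not its actual algorithm. It assumes a hypothetical five-state random-walk game in which U() is defined only on the two terminal states, and it estimates values for the intermediate states purely from rollouts that end in a computable U(). The names `TERMINAL_UTILITY`, `rollout` and `learn_values` are invented for the illustration.

```python
import random
from typing import Dict

# U() is only defined on finished games: state 0 is a loss, state 4 is a win.
TERMINAL_UTILITY: Dict[int, float] = {0: 0.0, 4: 1.0}


def rollout(state: int, rng: random.Random) -> float:
    """Play randomly from `state` until the game ends, then return U() of the final state."""
    while state not in TERMINAL_UTILITY:
        state += rng.choice((-1, 1))
    return TERMINAL_UTILITY[state]


def learn_values(n_rollouts: int = 5000, seed: int = 0) -> Dict[int, float]:
    """Estimate the value of each non-terminal state by averaging rollout outcomes."""
    rng = random.Random(seed)
    return {
        s: sum(rollout(s, rng) for _ in range(n_rollouts)) / n_rollouts
        for s in (1, 2, 3)
    }
```

The learned values for states 1, 2 and 3 come out near 0.25, 0.5 and 0.75, so the agent ends up with a cached valuation of uncertain positions. But every bit of that feedback came from evaluating `TERMINAL_UTILITY` on a fully specified end state, which is exactly the dependence on a computable U() described above.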

I actually found the position very tempting until I got to the subjective utility section.

Specifically, discontinuous utility functions have always seemed basically irrational to me, for reasons related to incomputability.

So, when and why are we able to get away with doing that?

AFAICT, the formalisms of agents that I'm aware of (Bayesian inference, AIXI, etc.) set things up by supposing logical omniscience and that the true world generating our hypotheses is in the set of hypotheses; from there you can show that the agent will maximise expected utility, or not get Dutch booked, or whatever. But humans, and ML algorithms for that matter, don't do that: we're able to get "good enough" results even when we know our models are wrong and don't capture a good deal of the underlying process generating our observations. Furthermore, it seems that empirically, the more expressive the model class we use, and the more compute thrown at the problem, the better these bounded inference algorithms work. I haven't found a good explanation of why this is the case beyond the hand-wavy "we approach logical omniscience as compute goes to infinity and our hypothesis space grows to encompass all computable hypotheses, so eventually our approximation should work like the ideal Bayesian one".

I think in part we can get away with it because it's possible to optimize for things that are only *usually* decidable.

Take winning the war for example. There may be no computer program that could look at *any* state of the world and tell you who won the war - there are lots of weird edge cases that could cause a Turing machine to not return a decision. But if we expect to be able to tell who won the war with very high probability (or have a model that we think matches who wins the war with high probability), then we can just sort of ignore the weird edge cases and model failures when calculating an expected utility.

Perhaps...

As the approximation gets closer to the ideal, the results do as well. (The *Less Wrong* quote seems relevant.)

The description of a particular version of expected utility theory feels very particular to me.

Utility is generally expressed as a function of a random variable. Not as a function of an element from the sample space.

For instance: suppose that my utility is linear in the profit or loss from the following game. We draw one bit from /dev/random. If it is true, I win a pound, else I lose one.

Utility is not here a function of 'the configuration of the universe'. It is a function of a bool. The bool itself may depend on (some subset of) 'the configuration of the universe' but reality maps universe to bool for us, computability be damned.
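A small sketch of the distinction, under made-up assumptions: the sample space below has four hypothetical "worlds", each pairing the drawn bit with one other fact about the universe, and the names `WORLDS`, `drawn_bit` and `utility_of_bit` are invented for illustration. The utility function only ever sees the bool; the map from world to bool is a separate object.

```python
from fractions import Fraction
from itertools import product
from typing import Dict, List, Tuple

World = Tuple[bool, str]  # (the drawn bit, everything else about the universe)

# Hypothetical sample space: the bit, plus one irrelevant fact about the world.
WORLDS: List[World] = list(product((True, False), ("sunny", "rainy")))
P: Dict[World, Fraction] = {w: Fraction(1, len(WORLDS)) for w in WORLDS}


def drawn_bit(world: World) -> bool:
    """The random variable: reality's map from a world to the bool we care about."""
    return world[0]


def utility_of_bit(bit: bool) -> int:
    """Utility is a function of the bool, not of the world: win a pound or lose one."""
    return 1 if bit else -1


def expected_utility() -> Fraction:
    """E[u(X)] = sum over worlds of P(world) * u(X(world))."""
    return sum((P[w] * utility_of_bit(drawn_bit(w)) for w in WORLDS), Fraction(0))
```

Here `expected_utility()` is zero for the fair bit, and `utility_of_bit` never inspects the weather or anything else in the world. In the toy the map `drawn_bit` is trivially computable; the point in the text is that nothing in the decision theory requires it to be.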

I don't think we should be surprised that any reasonable utility function is uncomputable. Consider a set of worlds, otherwise identical, each containing a utopia that lasts only as long as a Turing machine in that world does not halt. There is one such world for each Turing machine. All of these worlds are possible. No computable utility function can assign higher utility to every world with a never-halting Turing machine.
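The intuition can be illustrated (not proved) with a step-bounded check. The sketch below is a toy under stated assumptions: "machines" are modeled as Python iterators that yield once per computation step, and `bounded_utility`, `halts_after` and `loops_forever` are hypothetical names. Any utility function that stops looking after a fixed budget has to treat a machine that halts just past the budget the same as one that never halts.

```python
from itertools import count
from typing import Callable, Iterator

# A "machine" is a factory for an iterator that yields once per computation step.
Machine = Callable[[], Iterator[int]]


def loops_forever() -> Iterator[int]:
    """A machine that never halts."""
    return count()


def halts_after(n: int) -> Machine:
    """A machine that runs for exactly n steps and then halts."""
    return lambda: iter(range(n))


def bounded_utility(machine: Machine, budget: int) -> float:
    """Return 1.0 if the machine has not halted within `budget` steps, else 0.0."""
    steps = machine()
    for _ in range(budget):
        if next(steps, None) is None:  # iterator exhausted: the machine halted
            return 0.0
    return 1.0
```

With a budget of 1000 steps, `bounded_utility(halts_after(10), 1000)` is 0.0, but `bounded_utility(halts_after(1001), 1000)` and `bounded_utility(loops_forever, 1000)` are both 1.0. Raising the budget only moves the blind spot; it does not remove it.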