# 101

## Epistemic Status

Unsure[1], partially noticing my own confusion. Hoping Cunningham's Law can help resolve it.

# Confusions About Arguments From Expected Utility Maximisation

Some MIRI people (e.g. Rob Bensinger) still highlight EU maximisers as the paradigm case for existentially dangerous AI systems. I'm confused by this for a few reasons:

1. Not all consequentialist/goal directed systems are expected utility maximisers
   • E.g. humans
2. Some recent developments make me sceptical that VNM expected utility maximisers are a natural form of generally intelligent systems
   1. Wentworth's subagents provide a model for inexploitable agents that don't maximise a simple unitary utility function
      1. The main requirement for subagents to be a better model than unitary agents is path dependent preferences or hidden state variables
      2. Alternatively, subagents natively admit partial orders over preferences
         1. If I'm not mistaken, utility functions seem to require a (static) total order over preferences
            1. This might be a very unreasonable ask; it does not seem to describe humans, animals, or even existing sophisticated AI systems
      3. I think the strongest implication of Wentworth's subagents is that expected utility maximisation is not the limit or idealised form of agency
   2. Shard Theory suggests that trained agents (via reinforcement learning[2]) form value "shards"
      1. Values are inherently "contextual influences on decision making"
         1. Hence agents do not have a static total order over preferences (what a utility function implies) as what preferences are active depends on the context
            1. Preferences are dynamic (change over time), and the ordering of them is not necessarily total
         2. This explains many of the observed inconsistencies in human decision making
      2. A multitude of value shards do not admit analysis as a simple unitary utility function
      3. Reward is not the optimisation target
         1. Reinforcement learning does not select for reward maximising agents in general
            1. Rewards "upweight certain kinds of actions in certain kinds of situations, and therefore reward chisels cognitive grooves into agents"
      4. I'm thus very sceptical that systems optimised via reinforcement learning to be capable in a wide variety of domains/tasks converge towards maximising a simple expected utility function
3. I am not aware that humanity actually knows training paradigms that select for expected utility maximisers
   1. Our most capable/economically transformative AI systems are not agents and are definitely not expected utility maximisers
      1. Such systems might converge towards general intelligence under sufficiently strong selection pressure but do not become expected utility maximisers in the limit
         1. They do not become agents in the limit and expected utility maximisation is a particular kind of agency
4. I am seriously entertaining the hypothesis that expected utility maximisation is anti-natural to selection for general intelligence
   1. I'm not under the impression that systems optimised by stochastic gradient descent to be generally capable optimisers converge towards expected utility maximisers
   2. The generally capable optimisers produced by evolution aren't expected utility maximisers
   3. I'm starting to suspect that "search like" optimisation processes for general intelligence do not in general converge towards expected utility maximisers
      1. I.e. it may end up being the case that the only way to create a generally capable expected utility maximiser is to explicitly design one
         1. And we do not know how to design capable optimisers for rich environments
         2. We can't even design an image classifier
      2. I currently disbelieve the strong orthogonality thesis translated to practice
         1. While it may be in theory feasible to design systems at any intelligence level with any final goal
         2. In practice, we cannot design capable optimisers.
         3. For intelligent systems created by "search like" optimisation, final goals are not orthogonal to cognitive ability
            1. Sufficiently hard optimisation for most cognitive tasks would not converge towards selecting for generally capable systems
               1. In the limit, what do systems selected for playing Go converge towards?
                  1. I posit that said limit is not "general intelligence"
            2. The cognitive tasks/domains on which a system was optimised for performance may instantiate an upper bound on the general capabilities of the system
               1. You do not need much optimisation power to attain optimal performance in logical tic tac toe
                  1. Systems selected for performance at logical tic tac toe should be pretty weak narrow optimisers because that's all that's required for optimality in that domain
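
To make the point about subagents and partial orders a bit more concrete, here is a minimal Python sketch. It assumes a simplified reading of the subagents model (my simplification, not necessarily Wentworth's exact formalism): a committee of subagents accepts a trade only if no subagent ends up worse off and at least one ends up strictly better off. The names `committee_accepts`, `comparable`, and the pizza-topping states are hypothetical and purely illustrative.

```python
from typing import Callable, Dict, List

# Hypothetical world state: how much of each good the agent holds.
State = Dict[str, float]
Utility = Callable[[State], float]


def committee_accepts(subagents: List[Utility], current: State, offered: State) -> bool:
    """Accept a trade only if it is a strict Pareto improvement across subagents."""
    gains = [u(offered) - u(current) for u in subagents]
    return all(g >= 0 for g in gains) and any(g > 0 for g in gains)


def comparable(subagents: List[Utility], a: State, b: State) -> bool:
    """Two states are comparable if the committee is indifferent or would trade one way."""
    indifferent = all(u(a) == u(b) for u in subagents)
    return (
        indifferent
        or committee_accepts(subagents, a, b)
        or committee_accepts(subagents, b, a)
    )


# Two illustrative subagents, each caring about a single good (assumed for the example).
subagents: List[Utility] = [
    lambda s: s["pepperoni"],
    lambda s: s["mushroom"],
]

pepperoni = {"pepperoni": 1.0, "mushroom": 0.0}
mushroom = {"pepperoni": 0.0, "mushroom": 1.0}
both = {"pepperoni": 1.0, "mushroom": 1.0}

print(committee_accepts(subagents, pepperoni, both))  # True
print(comparable(subagents, pepperoni, mushroom))     # False
```

Under this assumed rule, `pepperoni` and `mushroom` are simply incomparable, so the preference relation is a partial order rather than a total one. The committee still cannot be led around a cycle of trades: every accepted trade strictly increases the sum of the subagents' utilities, so no sequence of accepted trades returns to its starting state.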

I don't expect the systems that matter (in the par human or strongly superhuman regime) to be expected utility maximisers. I think arguments for AI x-risk that rest on expected utility maximisers are mostly disconnected from reality. I suspect that discussing the perils of expected utility maximisation in particular — as opposed to e.g. dangers from powerful (consequentialist?) optimisation processes — is somewhere between being a distraction and being actively harmful[3].

I do not think expected utility maximisation is the limit of what generally capable optimisers look like[4].

# Arguments for Expected Utility Maximisation Are Unnecessary

I don't think the case for existential risk from AI rests on expected utility maximisation. I kind of stopped alieving expected utility maximisers a while back (only recently have I synthesised explicit beliefs that reject it), but I still plan on working on AI existential safety, because I don't see the core threat as resulting from expected utility maximisation.

The reasons I consider AI an existential threat mostly rely on:

• Instrumental convergence for consequentialist/goal directed systems
• A system doesn't need to be a utility maximiser for a simple utility function to be goal directed (again, see humans)
• Selection pressures for power seeking systems
• Reasons
• More economically productive/useful
• Some humans are power seeking
• Power seeking systems promote themselves/have better reproductive fitness
• Human disempowerment is the immediate existential catastrophe scenario I foresee from power seeking
• This could lead towards dystopian scenarios in multipolar outcomes
• Humans getting outcompeted by AI systems
• Could slowly lead to an extinction

I do not actually expect extinction near term, but it's not the only "existential catastrophe":

• Human disempowerment
• Various forms of dystopia
1. ^

I optimised for writing this quickly, so my language may be stronger/more confident than I actually feel. I may not have spent as much time accurately communicating my uncertainty as may have been warranted.

2. ^

Correct me if I'm mistaken, but I'm under the impression that RL is the main training paradigm we have that selects for agents.

I don't necessarily expect that our most capable systems would be trained via reinforcement learning, but I think our most agentic systems would be.

3. ^

There may be significant opportunity cost via diverting attention from other more plausible pathways to doom.

In general, I think exposing people to bad arguments for a position is a poor persuasive strategy as people who dismiss said bad arguments may (rationally) update downwards on the credibility of the position.

4. ^

I don't necessarily think agents are that limit either. But as "Why Subagents?" shows, expected utility maximisers aren't the limit of idealised agency.


# 6 Answers sorted by top scoring

Scott Garrabrant

### Dec 27, 2022

677

My take is that the concept of expected utility maximization is a mistake. In Eliezer's Coherent decisions imply consistent utilities, you can see the mistake where he writes:

From your perspective, you are now in Scenario 1B. Having observed the coin and updated on its state, you now think you have a 90% chance of getting $5 million and a 10% chance of getting nothing.

Reflectively stable agents are updateless. When they make an observation, they do not limit their caring as though all the possible worlds where their observation differs do not exist.

As far as I know, every argument for utility assumes (or implies) that whenever you make an observation, you stop caring about the possible worlds where that observation went differently.

The original Timeless Decision Theory was not updateless. Nor were any of the more traditional ways of thinking about decision. Updateless Decision Theory, and subsequent decision theories corrected this mistake.

Von Neumann did not notice this mistake because he was too busy inventing the entire field. The point where we discover updatelessness is the point where we are supposed to realize that all of utility theory is wrong. I think we failed to notice.

Ironically the community that was the birthplace of updatelessness became the flag for taking utility seriously. (To be fair, this probably is the birthplace of updatelessness because we took utility seriously.)

Unfortunately, because utility theory is so simple, and so obviously correct if you haven't thought about updatelessness, it ended up being assumed all over the place, without tracking the dependency. I think we use a lot of concepts that are built on the foundation of utility without us even realizing it.

(Note that I am saying here that utility theory is a theoretical mistake! This is much stronger than just saying that humans don't have utility functions.)

What should I read to learn about propositions like "Reflectively stable agents are updateless" and "utility theory is a theoretical mistake"? I notice that I'm confused. I've recently read the paper "Functional decision theory..." and it's formulated explicitly in terms of expected utility maximization.

5Scott Garrabrant3mo
FDT and UDT are formulated in terms of expected utility. I am saying that they advocate for a way of thinking about the world that makes it so that you don't just Bayesian update on your observations, and forget about the other possible worlds. Once you take on this worldview, the Dutch books that made you believe in expected utility in the first place are less convincing, so maybe we want to rethink utility. I don't know what the FDT authors were thinking, but it seems like they did not propagate the consequences of the worldview into reevaluating what preferences over outcomes look like.

Don't updateless agents with suitably coherent preferences still have utility functions?

7Scott Garrabrant3mo
That depends on what you mean by "suitably coherent." If you mean they need to satisfy the independence vNM axiom, then yes. But the point is that I don't see any good argument why updateless agents should satisfy that axiom. The argument for that axiom passes through wanting to have a certain relationship with Bayesian updating.
5Scott Garrabrant3mo
Also, if by "have a utility function" you mean something other than "try to maximize expected utility," I don't know what you mean. To me, the cardinal (as opposed to ordinal) structure of preferences that makes me want to call something a "utility function" is about how to choose between lotteries.
1Eric Chen3mo
Yeah by "having a utility function" I just mean "being representable as trying to maximise expected utility".
2Eric Chen3mo
Ah okay, interesting. Do you think that updateless agents need not accept any separability axiom at all? And if not, what justifies using the EU framework for discussing UDT agents? In many discussions on LW about UDT, it seems that a starting point is that agent is maximising some notion of expected utility, and the updatelessness comes in via the EU formula iterating over policies rather than actions. But if we give up on some separability axiom, it seems that this EU starting point is not warranted, since every major EU representation theorem needs some version of separability.
5Scott Garrabrant3mo
You could take as an input parameter to UDT a preference ordering over lotteries that does not satisfy the independence axiom, but is a total order (or total preorder if you want ties). Each policy you can take results in a lottery over outcomes, and you take the policy that gives your favorite lottery. There is no need for the assumption that your preferences over lotteries is vNM. Note that I don't think that we really understand decision theory, and have a coherent proposal. The only thing I feel like I can say confidently is that if you are convinced by the style of argument that is used to argue for the independence axiom, then you should probably also be convinced by arguments that cause you to be updateful and thus not reflectively stable.
2Eric Chen3mo
Okay this is very clarifying, thanks! If the preference ordering over lotteries violates independence, then it will not be representable as maximising EU with respect to the probabilities in the lotteries (by the vNM theorem). Do you think it's a mistake then to think of UDT as "EU maximisation, where the thing you're choosing is policies"? If so, I believe this is the most common way UDT is framed in LW discussions, and so this would be a pretty important point for you to make more visibly (unless you've already made this point before in a post, in which case I'd love to read it).
4Scott Garrabrant3mo
I think UDT is as you say. I think it is also important to clarify that you are not updating on your observations when you decide on a policy. (If you did, it wouldn't really be a function from observations to actions, but it is important to emphasize in UDT.) Note that I am using "updateless" differently than "UDT". By updateless, I mostly mean anything that is not performing Bayesian updates and forgetting the other possible worlds when it makes observations. UDT is more of a specific proposal. "Updateless" is more of a negative property, defined by lack of updating. I have been trying to write a big post on utility, and haven't yet, and decided it would be good to give a quick argument here because of the question. The only posts I remember making against utility are in the geometric rationality sequence [https://www.lesswrong.com/s/4hmf7rdfuXDJkxhfg], especially this [https://www.lesswrong.com/s/4hmf7rdfuXDJkxhfg/p/Xht9swezkGZLAxBrd] post.
4Eric Chen3mo
Thanks, the clarification of UDT vs. "updateless" is helpful. But now I'm a bit confused as to why you would still regard UDT as "EU maximisation, where the thing you're choosing is policies". If I have a preference ordering over lotteries that violates independence, the vNM theorem implies that I cannot be represented as maximising EU. In fact, after reading Vladimir_Nesov's comment [https://www.lesswrong.com/posts/XYDsYSbBjqgPAgcoQ/why-the-focus-on-expected-utility-maximisers?commentId=BrAG4cdesXsshbyRt#a5tn6B8iKdta6zGFu], it doesn't even seem fully accurate to view UDT as taking in a preference ordering over lotteries. Here's the way I'm thinking of UDT: your prior over possible worlds uniquely determines the probabilities of a single lottery L, and selecting a global policy is equivalent to choosing the outcomes of this lottery L. Now different UDT agents may prefer different lotteries, but this is in no sense expected utility maximisation. This is simply: some UDT agents think one lottery is the best, others might think another is the best. There is nothing in this story that resembles a cardinal utility function over outcomes that the agents are multiplying with their prior probabilities to maximise EU with respect to. It seems that to get an EU representation of UDT, you need to impose coherence on the preference ordering over lotteries (i.e. over different prior distributions), but since UDT agents come with some fixed prior over worlds which is not updated, it's not at all clear why rationality would demand coherence in your preference between lotteries (let alone coherence that satisfies independence).
2Scott Garrabrant3mo
Yeah, I don't have a specific UDT proposal in mind. Maybe instead of "updateless" I should say "the kind of mind that might get counterfactually mugged" as in this [https://www.lesswrong.com/posts/g3PwPgcdcWiP33pYn/counterfactual-mugging-poker-game] example.

To ask for decisions to be coherent, there need to be multiple possible situations in which decisions could be made, coherently across these situations or not. A UDT agent that picks a policy faces a single decision in a single possible situation. There is nothing else out there for the decision in this situation to be coherent with. The options offered for the decision could be interpreted as lotteries over outcomes, but there is still only one decision to pick one lottery among them all, instead of many situations where the decision is to pick among a par...

4Scott Garrabrant3mo
I am not sure if there is any disagreement in this comment. What you say sounds right to me. I agree that UDT does not really set us up to want to talk about "coherence" in the first place, which makes it weird to have it be formalized in terms of expected utility maximization. This does not make me think intelligent/rational agents will/should converge to having utility.
4Vladimir_Nesov3mo
I think coherence of unclear kind is an important principle that needs a place in any decision theory, and it motivates something other than pure updatelessness. I'm not sure how your argument should survive this. The perspective of expected utility and the perspective of updatelessness both have glaring flaws, respectively unwarranted updatefulness and lack of a coherence concept. They can't argue against each other in their incomplete forms. Expected utility is no more a mistake than updatelessness.

I'm confused about the example you give. In the paragraph, Eliezer is trying to show that you ought to accept the independence axiom, cause you can be Dutch booked if you don't. I'd think if you're updateless, that means you already accept the independence axiom (cause you wouldn't be time-consistent otherwise). And in that sense it seems reasonable to assume that someone who doesn't already accept the independence axiom is also not updateless.

I haven't followed this very closely, so I'm kinda out-of-the-loop... Which part of UDT/updatelessness says "don't go for the most utility" (no-maximization) and/or "utility cannot be measured / doesn't exist" (no-"foundation of utility", debatably no-consequentialism)? Or maybe "utility" here means something else?

Do you expect learned ML systems to be updateless? It seems plausible to me that updatelessness of agents is just as "disconnected from reality" of actual systems as EU maximization. Would you disagree?

4Scott Garrabrant3mo
No, at least probably not at the time that we lose all control. However, I expect that systems that are self-transparent and can easily self-modify might quickly converge to reflective stability (and thus updatelessness). They might not, but I think the same arguments that might make you think they would develop a utility function also can be used to argue that they would develop updatelessness (and thus possibly also not develop a utility function). Note that I am not saying here that rational agents can't have a utility function. I am only saying that they don't have to.

As far as I know, every argument for utility assumes (or implies) that whenever you make an observation, you stop caring about the possible worlds where that observation went differently.

Are you just referring to the VNM theorems or are there other theorems you have in mind?

Note for self: It seems like the independence condition breaks for counterfactual mugging assuming you think we should pay. Assume P is paying \$50 and N is not paying, M is receiving \$1 million if you would have paid in the counterfactual and zero otherwise. We have N > P but 0.5P + 0.5M > 0.5N + 0.5M, in contradiction to independence. The issue is that the value of M is not independent of the choice between P and N.
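
The counterfactual mugging note above can be sketched numerically. This is a minimal illustration under stated assumptions, not a formalisation anyone in the thread proposed: a fair coin, a \$50 payment requested on tails, and a \$1 million reward on heads that is paid only if the agent's policy is to pay on tails. The function names (`payoff`, `policy_value`, `updated_value`) and the heads/tails labelling are hypothetical.

```python
# Assumed setup: fair coin, figures taken from the note above.
PRIOR = {"heads": 0.5, "tails": 0.5}
COST, PRIZE = 50, 1_000_000


def payoff(world: str, pays_on_tails: bool) -> int:
    """Payoff in one world, given the agent's policy for the tails world.

    heads: the agent is rewarded only if its policy is to pay on tails.
    tails: the agent is asked to pay.
    """
    if world == "heads":
        return PRIZE if pays_on_tails else 0
    return -COST if pays_on_tails else 0


def policy_value(pays_on_tails: bool) -> float:
    """Updateless evaluation: score the whole policy against the prior."""
    return sum(p * payoff(world, pays_on_tails) for world, p in PRIOR.items())


def updated_value(pays_on_tails: bool) -> int:
    """Updateful evaluation: after seeing tails, only the tails world counts."""
    return payoff("tails", pays_on_tails)


best_policy = max([True, False], key=policy_value)
best_after_update = max([True, False], key=updated_value)

print(policy_value(True), policy_value(False))  # 499975.0 0.0
print(best_policy, best_after_update)           # True False
```

Evaluated as a policy, paying wins; evaluated after conditioning on tails, not paying wins. The two evaluations disagree because the heads payoff depends on what the agent does in the tails world, which is the dependence the note points to when it says the value of M is not independent of the choice between P and N.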

Reflectively stable agents are updateless. When they make an observation, they do not limit their caring as though all the possible worlds where their observation differs do not exist.

This is very surprising to me! Perhaps I misunderstand what you mean by "caring," but: an agent who's made one observation is utterly unable[1] to interact with the other possible-worlds where the observation differed; and it seems crazy[1] to choose your actions based on something they can't affect; and "not choosing my actions based on X" is how I would defi...

9Scott Garrabrant3mo
Here [https://www.lesswrong.com/posts/g3PwPgcdcWiP33pYn/counterfactual-mugging-poker-game] is a situation where you make an "observation" and can still interact with the other possible worlds. Maybe you do not want to call this an observation, but if you don't call it an observation, then true observations probably never really happen in practice. I was not trying to say that is relevant to the coin flip directly. I was trying to say that the move used to justify the coin flip is the same move that is rejected in other contexts, and so we should be open to the idea of agents that refuse to make that move, and thus might not have utility.
1Optimization Process3mo
Ah, that's the crucial bit I was missing! Thanks for spelling it out.

Neel Nanda

### Dec 28, 2022

1012

My personal take is that everything you wrote in this post is correct, and expected utility maximisers are neither the real threat, nor a great model for thinking about dangerous AI. Thanks for writing this up!

tailcalled

### Dec 27, 2022

60

The key question I always focus on is: where do you get your capabilities from?

For instance, with GOFAI and ordinary programming, you have some human programmer manually create a model of the scenarios the AI can face, and then manually create a bunch of rules for what to do in order to achieve things. So basically, the human programmer has a bunch of really advanced capabilities, and they use them to manually build some simple capabilities.

"Consequentialism", broadly defined, represents an alternative class of ways to gain capabilities, namely choosing what to do based on it having the desired consequences. To some extent, this is a method humans use, perhaps particularly the method the smartest and most autistic humans most use (which I suspect to be connected to LessWrong demographics but who knows...). Utility maximization captures the essence of consequentialism; there are various other things, such as multi-agency, that one can throw on top of it, but those other things still mainly derive their capabilities from the core of utility maximization.

Self-supervised language models such as GPT-3 do not gain their capabilities from consequentialism, yet they have advanced capabilities nonetheless. How? Imitation learning, which basically works because of Aumann's agreement theorem. Self-supervised language models mimic human text, and humans do useful stuff and describe it in text, so self-supervised language models learn the useful stuff that can be described in text.

Risk that arises purely from language models or non-consequentialist RLHF might be quite interesting and important to study. I feel less able to predict it, though, partly because I don't know what the models will be deployed to do, or how much they can be coerced into doing, or what kinds of witchcraft are necessary to coerce the models into doing those things.

It is possible to me that imitation learning and RLHF can bring us to the frontier of human abilities, so that we have a tool that can solve tasks as well as the best humans can. However, I don't think it will be able to much exceed that frontier. This is still superhuman, because no human is as good as all the best humans at all the tasks. But it is not far-superhuman, even though I think being far-superhuman is possible, and a key part in it not being far-superhuman is that it cannot extend its capabilities. As such, I would expect consequentialism to be necessary for creating something that is far-superhuman.

I think many of the classical AI risk arguments apply to consequentialist far-superhuman AI.

If I understood your model correctly, GPT has capability because (1) humans are consequentialists so they have capabilities and (2) GPT imitates human output (3) which requires the GPT learning the underlying human capabilities.

GPT is behavior cloning. But it is the behavior of a universe that is cloned, not of a single demonstrator, and the result isn’t a static copy of the universe, but a compression of the universe into a generative rule.

I think the above quote from janus would add to (3) that it requires GPT to also learn the environment and...

2tailcalled3mo
I think most of the capabilities on earth exist in humans, not in the environment. For instance if you have a rock, it's just gonna sit there; it's not gonna make a rocket and fly to the moon. This is why I emphasize GPT as getting its capabilities from humans, since there are not many other things in the environment it could get capabilities from. I agree that insofar as there are other things in the environment with capabilities (e.g. computers outputting big tables of math results) that get fed into GPT, it also gains some capabilities from them. I think they get their capabilities from evolution, which is a consequentialist optimizer?

It is possible to me that imitation learning and RLHF can bring us to the frontier of human abilities, so that we have a tool that can solve tasks as well as the best humans can. However, I don't think it will be able to much exceed that frontier. This is still superhuman, because no human is as good as all the best humans at all the tasks. But it is not far-superhuman, even though I think being far-superhuman is possible, and a key part in it not being far-superhuman is that it cannot extend its capabilities. As such, I would expect consequentialism to b

...
2tailcalled3mo
Could you expand on what you mean by general intelligence, and how it gets created/selected for by the task of minimising predictive loss on sufficiently large and diverse datasets like humanity's text corpus?
1DragonGod3mo
This is the part I've not yet written up in a form I endorse. I'll try to get it done before the end of the year.

I don't think consequentialism is related to utility maximisation in the way you try to present it. There are many consequentialistic agent architectures that are explicitly not utility maximising, e. g. Active Inference, JEPA, ReduNets.

Then you seem to switch your response to discussing that consequentialism is important for reaching the far-superhuman AI level. This looks at least plausible to me, but first, these far-superhuman AIs could have a non-UM consequentialistic agent architecture (see above), and second, DragonGod didn't say that the risk is ne...

2tailcalled3mo
JEPA seems like it is basically utility maximizing to me. What distinction are you referring to? I keep getting confused about Active Inference (I think I understood it once based on an equivalence to utility maximization, but it's a while ago and you seem to be saying that this equivalence doesn't hold), and I'm not familiar with ReduNets, so I would appreciate a link or an explainer to catch up. I was sort of addressing alternative risks in this paragraph:

If AI risk arguments mainly apply to consequentialist (which I assume is the same as EU-maximizing in the OP) AI, and the first half of the OP is right that such AI is unlikely to arise naturally, does that make you update against AI risk?

2tailcalled3mo
Yes.

Not quite the same, but probably close enough. You can have non-consequentialist EU maximizers if e.g. the action space and state space are small and someone manually computed a table of the expected utilities. In that case, the consequentialism is in the entity that computed the table of the expected utilities, not the entity that selects an action based on the table. (Though I suppose such an agent is kind of pointless since you could as well just store a table of the actions to choose.) You can also have consequentialists that are not EU maximizers if they are e.g. a collection of consequentialist EU maximizers working together.
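
The lookup-table case can be sketched in a few lines of Python. This is only an illustration of the idea in the comment above; the states, actions, and utility numbers in `EU_TABLE` are made up, and `table_agent` is a hypothetical name.

```python
# Hypothetical table of expected utilities, assumed to have been computed by
# someone else (the "consequentialist" work happened when the table was built).
EU_TABLE = {
    "state_a": {"left": 1.0, "right": 3.0},
    "state_b": {"left": 2.5, "right": 0.5},
}


def table_agent(state: str) -> str:
    """Pick the action with the highest tabulated expected utility.

    The agent does no modelling of consequences itself; it only reads the table.
    """
    actions = EU_TABLE[state]
    return max(actions, key=actions.get)


# As the comment notes, the same behaviour could be stored directly as a table
# of actions, which makes the expected utilities redundant at decision time.
ACTION_TABLE = {state: table_agent(state) for state in EU_TABLE}

print(table_agent("state_a"))  # right
print(ACTION_TABLE)            # {'state_a': 'right', 'state_b': 'left'}
```

The agent here is an EU maximizer in the narrow sense that it picks the argmax of expected utility, yet nothing in `table_agent` reasons about outcomes.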

TekhneMakre

### Dec 27, 2022

51

If you're saying "let's think about a more general class of agents because EU maximization is unrealistic", that's fair, but note that you're potentially making the problem more difficult by trying to deal with a larger class with fewer invariants.

If you're saying "let's think about a distinct but not more general class of agents because that will be more alignable", then maybe, and it'd be useful to say what the class is, but: you're going to have trouble aligning something if you can't even know that it has some properties that are stable under self-reflection. An EU maximizer is maybe close to being stable under self-reflection and self-modification. That makes it attractive as a theoretical tool: e.g. maybe you can point at a good utility function, and then get a good prediction of what actually happens, relying on reflective stability; or e.g. maybe you can find nearby neighbors to EU maximization that are still reflectively stable and easier to align. It makes sense to try starting from scratch, but IMO this is a key thing that any approach will probably have to deal with.

I strongly suspect that expected utility maximisers are anti-natural for selection for general capabilities.

4TekhneMakre3mo
There's naturality as in "what does it look like, the very first thing that is just barely generally capable enough to register as a general intelligence?", and there's naturality as in "what does it look like, a highly capable thing that has read-write access to itself?". Both interesting and relevant, but the latter question is in some ways an easier question to answer, and in some ways easier to answer alignment questions about. This is analogous to unbounded analysis: https://arbital.com/p/unbounded_analysis/ In other words, we can't even align an EU maximizer, and EU maximizers have to some extent already simplified away much of the problem (e.g. the problems coming from more unconstrained self-modification).
1Roman Leventov3mo
You seem to try to bail out EU maximisation as the model because it is a limit of agency, in some sense. I don't think this is the case. In classical [https://arxiv.org/abs/1906.10184] and quantum [https://chrisfieldsresearch.com/qFEP-2112.15242.pdf] derivations of the Free Energy Principle [https://www.lesswrong.com/tag/free-energy-principle], it is shown that the limit is the perfect predictive capability of the agent's environment (or, more pedantically: in classic formulation, FEP is derived from basic statistical mechanics; in quantum formulation, it's more of being postulated, but it is shown that quantum FEP in the limit is equivalent to the Unitarity Principle). Also, Active Inference, the process theory which is derived from the FEP, can be seen as a formalisation of instrumental convergence [https://www.lesswrong.com/posts/ostLZyhnBPndno2zP/active-inference-as-a-formalisation-of-instrumental]. So, we can informally outline the "stages of life" of a self-modifying agent as follows: general intelligence -> maximal instrumental convergence -> maximal prediction of the environment -> maximal entanglement with the environment.
2TekhneMakre3mo
What you've said so far doesn't seem to address my comments, or make it clear to me what the relevance of the FEP is. I also don't understand the FEP or the point of the FEP. I'm not saying EU maximizers are reflectively stable or a limit of agency, I'm saying that EU maximization is the least obviously reflectively unstable thing I'm aware of.
1Roman Leventov3mo
I said that the limit of agency is already proposed, from the physical perspective (FEP). And this limit is not EU maximisation. So, methodologically, you should either criticise this proposal, or suggest an alternative theory that is better, or take the proposal seriously. If you take the proposal seriously (I do): the limit appears to be "uninteresting". A maximally entangled system is "nothing", it's perceptibly indistinguishable from its environment, for a third-person observer (let's say, in Tegmark's tripartite partition system-environment-observer [ https://journals.aps.org/prd/abstract/10.1103/PhysRevD.85.123517]).  There is no other limit. Instrumental convergence is not the limit, a strong instrumentally convergent system is still far from the limit. This suggests that unbounded analysis, "thinking to the limit" is not useful, in this particular situation. Any physical theory of agency [https://www.lesswrong.com/posts/2BPPwboTDrAMFiGHe/the-two-conceptions-of-active-inference-an-intelligence#Theories_of_agency] must ensure "reflective stability", by construction. I definitely don't sense anything "reflectively unstable" in Active Inference, because it's basically the theory of self-evidencing, and wields instrumental convergence in service of this self-evidencing. Who wouldn't "want" this, reflectively? Active Inference agents in some sense must want this by construction because they want to be themselves, as long as possible. However they redefine themselves, and at that very moment, they also want to be themselves (redefined). The only logical possibility out of this is to not want to exist at all at some point, i. e., commit suicide, which agents (e. g., humans) actually do sometimes. But conditioned on that they want to continue to exist, they are definitely reflectively stable.
2TekhneMakre3mo
I'm talking about reflective stability. Are you saying that all agents will eventually self modify into FEP, and FEP is a rock?
1Roman Leventov2mo
Reward is not Necessary: How to Create a Compositional Self-Preserving Agent for Life-Long Learning [https://www.lesswrong.com/posts/df4Jjg9cmJ7R2bkzR/reward-is-not-necessary-how-to-create-a-compositional-self-1]

Wei_Dai

### Dec 28, 2022

30

Speaking for myself, I sometimes use "EU maximization" as shorthand for one of the following concepts, depending on context:

1. The eventual intellectual descendant of EU maximization, i.e., the decision theory or theory of rationality that future philosophers will eventually accept as correct or ideal or normative, which presumably will have some kind of connection (even if only historical) to EU maximization.
2. The eventual decision procedure of a reflectively stable superintelligence.
3. The decision procedure of a very capable consequentialist AI, even if it's not quite reflectively stable yet.

Hmm, I just did a search of my own LW content, and can't actually find any instances of myself doing this, which makes me wonder why I was tempted to type the above. Perhaps what I actually do is this: if I see someone else mention "EU maximization", I mentally steelman their argument by replacing the concept with one of the three above, if any one of them would make a sensible substitution.

Do you have any actual examples of anyone talking about EU maximization lately, in connection with AI risk?

I note that EU maximization has this baggage of never strictly preferring a lottery over outcomes to the component outcomes, and your steelmen appear to me to not carry that baggage. I think that baggage is actually doing work in some people's reasoning and intuitions.
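A minimal sketch of that baggage, with made-up outcome names and utility numbers (both are assumptions for illustration, as are the function names): the expected utility of a lottery is a probability-weighted average of its components' utilities, so it can never exceed the best component.

```python
from typing import Dict


def expected_utility(lottery: Dict[str, float], utility: Dict[str, float]) -> float:
    """Expected utility of a lottery given as {outcome: probability}."""
    if abs(sum(lottery.values()) - 1.0) > 1e-9:
        raise ValueError("lottery probabilities must sum to 1")
    return sum(p * utility[outcome] for outcome, p in lottery.items())


def strictly_prefers_lottery(lottery: Dict[str, float], utility: Dict[str, float]) -> bool:
    """True only if the lottery beats every one of its own component outcomes."""
    best_component = max(utility[outcome] for outcome in lottery)
    return expected_utility(lottery, utility) > best_component


# Hypothetical utilities over two outcomes.
utility = {"A": 1.0, "B": 3.0}

# Sweep over mixtures of A and B; none is strictly preferred to B itself.
mixtures = [{"A": k / 10, "B": 1 - k / 10} for k in range(11)]
any_strict = any(strictly_prefers_lottery(m, utility) for m in mixtures)
```

Under these assumptions `any_strict` comes out false for every mixture. An agent that sometimes wants the coin flip itself (for fairness, say) is doing something this representation cannot express.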

2Wei_Dai3mo
Do you have any examples of this?
2Scott Garrabrant3mo
Hmm, examples are hard. Maybe the intuitions contribute to the concept of edge instantiation [https://arbital.com/p/edge_instantiation/]?

I parsed the Rob Bensinger tweet I linked in the OP as being about expected utility maximising when I read it, but others have pointed out that wasn't necessarily a fair reading.

Htarlov

### Feb 25, 2023

10

I think it depends on how you define expected utility. I agree that a definition that limits us only to analyzing end-state maximizers that seek some final state of the world is not very useful.

I don't think that for non-trivial AI agents, the utility function should or even can be defined as a simple function over the preferable final state of the world.  U:Ω→R

This function does not take into account time and an intermediate set of predicted future states that the agent will possibly have preference over. The agent may have a preference for the final state of the universe but most likely and realistically it won't have that kind of preference except for some special strange cases. There are two reasons:

• a general agent likely won't be designed as a maximizer over one single long-term goal (like making paperclips) but rather as useful for humans over multiple domains so it would rather care more about short-term outcomes, middle-term preferences, and tasks "at hand"
• the final state of the universe is generally known by us and will likely be known by a very intelligent general agent. Even current GPT-3, if you ask it, knows that we will end up in a Big Freeze or Big Rip, with the latter being more likely. An agent can't really optimize for the end state of the Universe, as there are not many actions that could change physics, and there is no way to reason about the end state except through general predictions that do not end well for this universe, whatever the agent does.

Any complex agent would likely have a utility function over possible actions that would be equal to the utility function of the set of predicted futures after action A vs the set of predicted futures without action A (or over differences between worlds in those futures). By action I mean possibly a set of smaller actions (hierarchy of actions - e.g. plans, strategies), it might not be atomic. Directly it cannot be easily computable so most likely this would be compressed to a set of important predicted future events on the level of abstraction that the agent cares about, which should constitute future worlds without action A and action A with enough approximation.

This is also how we evaluate actions. We evaluate outcomes in the short and long terms. We also care differently depending on time scope.

I say this because most sensible "alignment goals", like "please don't kill humans", are time-based. What does it mean not to kill humans? It is clearly not about the final state. Remember, Big Rip or Big Freeze. Maybe AGI can kill some for a year and then no more, assuming the population will go up and some people are killed anyway, so it does not matter long-term? No, this is also not about the non-final but long-term outcome. Really it is a function of intermediate states. Something like the integral over time of some function U'(dΩ), where dΩ is the delta between the outcomes of action vs non-action. This can be approximated and compressed into a sum of a function over multiple events, up to some time T that is the maximal sensible scope.
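The contrast between an end-state utility and a utility over intermediate states can be sketched as follows. Everything here is a toy assumption: world states are reduced to a single number (living humans), `action_utility` and `step_utility` are hypothetical names, and the integral is replaced by a discrete sum up to a horizon T.

```python
from typing import Callable, Sequence


def action_utility(
    with_action: Sequence[float],
    without_action: Sequence[float],
    step_utility: Callable[[float], float],
    horizon: int,
) -> float:
    """Sum step_utility over the per-step difference between two predicted
    trajectories (action taken vs. not taken), up to the horizon T."""
    steps = min(horizon, len(with_action), len(without_action))
    return sum(step_utility(with_action[t] - without_action[t]) for t in range(steps))


def identity(delta: float) -> float:
    """Toy U': value a step by the raw difference in living humans."""
    return delta


# Predicted number of living humans at each time step (made-up numbers).
baseline = [100, 100, 100, 100]      # future without the action
harmful_plan = [100, 90, 95, 100]    # some die, population later recovers
harmless_plan = [100, 100, 100, 100]

harmful_score = action_utility(harmful_plan, baseline, identity, horizon=4)
harmless_score = action_utility(harmless_plan, baseline, identity, horizon=4)

# An end-state-only utility sees no difference between the two plans.
same_end_state = harmful_plan[-1] == harmless_plan[-1]
```

With these numbers the end states are identical, yet the time-summed difference penalises the harmful plan (a score of -15 against 0), which is the distinction the paragraph above is drawing.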

Most of the behaviors and preferences of humans are also time-scoped and time-limited, and take multiple future states into account, mostly short-scoped. I don't think that alignment goals can even be expressed in terms of a simple end goal (a preferable final state of the world), as the problem partially comes from the attitude of the end goal justifying the means, which is at the core of the utility function defined as U:Ω→R.

It seems plausible to me that even non-static human goals can be defined as utility functions over the set of differences in future outcomes (the difference between two paths of events). What is also obvious to me is that we as humans are able to modify our utility function to some extent, but not very much. Nevertheless, for humans the boundaries between most baseline goals, preferences, and morality vs instrumentally convergent goals are blurry. We have a lot of heuristics and biases, so our minds work out some things more quickly and more efficiently than if we relied on intelligence, thinking, and logic. The cost is lower consistency, less precision, and higher variability.

So I find it useful to think about agents as maximizers of a utility function, but not one defined as a single final goal, outcome, or state of the world. Rather, one that calculates the utility of an action from the difference between two ordered sets of events over different time scopes.

I also don't think agents must be initially rationally stable with an unchangeable utility function. This is also a problem, as an agent can initially have a set of preferences with some hierarchy or weights, but it can also reason that some of these are incompatible with others, or that the hierarchy is not logically consistent, and might seek to change it for the sake of consistency, to be fully coherent.

I'm not an AGI, clearly, but this is just how I think about morality right now. I learned that killing is bad. But I can still ask "why don't we kill?" and modify my worldview based on the answer (or maybe specify it in more detail in this matter). And it is a useful question, as it says a lot about edge cases including abortion, euthanasia, war, etc. The same might happen for rational agents: one might update its utility function to be stable and consistent, maybe even questioning some of the learned parts of the utility function in the process. Yes, you can say that if you can change that then it was not your terminal goal. Nevertheless, I can imagine agents with no terminal core goals at all. I'm not even sure if we as humans have any core terminal goals (maybe except avoiding death and own harm in the case of most humans in most circumstances... but some overcome that, as Thích Quảng Đức did).

I agree with the following caveats:

• I think you're being unfair to that Rob tweet and the MIRI position; having enough goal-directedness to maximize the number of granite spheres + no special structure to reward humans is a far weaker assumption than utility maximization. The argument in the tweet also goes through if the AI has 1000 goals as alien as maximizing granite spheres, which I would guess Rob thinks is more realistic. (note that I haven't talked to him and definitely don't speak for him or MIRI)
• Shard theory is mostly just a frame and hasn't discovered anything yet; the nontrivial observations about how agents and values behave rely on ~9 nonobvious claims, and the obviously true observations are not very powerful in arguing for alternate models of how powerful AI behaves. [If this sounds critical of shard theory, note that I'm excited about the shard theory frame, it just seems premature to conclude things from the evidence we have]
• [edited to add:] Reflection might give some degree of coherence. This is important in the MIRI frame and also in the shard theory frame.

The argument in the tweet also goes through if the AI has 1000 goals as alien as maximizing granite spheres, which I would guess Rob thinks is more realistic.

As an aside: If one thinks 1000 goals is more realistic, then I think it's better to start communicating using examples like that, instead of "single goal" examples. (I myself lazily default to "paperclips" to communicate AGI risk quickly to laypeople, so I am critiquing myself to some extent as well.)

Anyways, on your read, how is "maximize X-quantity" different from "max EU where utility is linearly increasing in granite spheres"?

There's a trivial sense in which the agent is optimizing the world and you can rationalize a utility function from that, but I think an agent that, from our perspective, basically just maximizes granite spheres can look quite different from the simple picture of an agent that always picks the top action according to some (not necessarily explicit) granite-sphere valuation of the actions, in ways such that the argument still goes through.

• The agent can have all the biases humans do.
• The agent can violate VNM axioms in any other way that doesn't ruin it, basically anything that has low frequency or importance.
• The agent only tries to maximize granite spheres 1 out of every 5 seconds, and the other 4/5 is spent just trying not to be turned off.
• The agent has arbitrary deontological restrictions, say against sending any command to its actuators whose hash starts with 123.
• The agent has 5 goals it is jointly pursuing, but only one of them is consequentialist.
• The agent will change its goal depending on which cosmic rays it sees, but is totally incorrigible to us.

The original wording of the tweet was "Suppose that the AI's sole goal is to maximize the number of granite spheres in its future light cone." This is a bit closer to my picture of EU maximization but some of the degrees of freedom still apply.

1. Yeah, I think that's fair. I may have pattern matched/jumped to conclusions too eagerly. Or rather, I've been convinced that my allegation is not very fair. But mostly, the Rob tweet provided the impetus for me to synthesise/dump all my issues with EU maximisation. I think the complaint can stand on its own, even if Rob wasn't quite staking the position I thought he was.

That said, I do think that multi objective optimisation is way more existentially safe than optimising for a single simple objective. I don't actually think the danger directly translates. And I think it's unlikely that multi-objective optimisers would not care about humans or other agents.

I suspect the value shard formation hypotheses would imply instrumental convergence towards developing some form of morality. Cooperation is game theoretically optimal. Though it's not yet clear how accurate the value shard formation hypothesis is.

2. I don't think I'm relying too heavily on Shard Theory. I mostly cited it because it's what actually led me in that direction, not because I fully endorse it. The only shard theory claims I rely on are:

• Values are contextual influences on decision making
• Reward is not the optimisation target

Do you think the first is "non obvious"?

That said, I do think that multi objective optimisation is way more existentially safe than optimising for a single simple objective. I don’t actually think the danger directly translates. And I think it’s unlikely that multi-objective optimisers would not care about humans or other agents.

I think one possible form of existential catastrophe is that human values get only a small share of the universe, and as a result the "utility" of the universe is much smaller than it could be. I worry this will happen if only one or few of the objectives of multi objective optimization cares about humans or human values.

Also, if one of the objectives does care about humans or human values, it might still have to do so in exactly the right way in order to prevent (other forms of) existential catastrophe, such as various dystopias. Or if more than one cares, they might all have to care in exactly the right way. So I don't see multi objective optimisation as much safer by default, or much easier to align.

I think that multi-decision-influence networks seem much easier to align and much safer for humans.

1. Even on the view you advocate here (where some kind of perfection is required), "perfectly align part of the motivations" seems substantially easier than "perfectly align all of the AI's optimization so it isn't optimizing for anything you don't want."
2. all else equal, the more interests present at a bargaining table, the greater the chance that some of the interests are aligned.
3. I don't currently understand what it means for the agent to have to care in "exactly the right way." I find myself wanting to point to Alignment allows "nonrobust" decision-influences, but I know you've already read that post, so...
1. Perhaps I'm worried about implicit equivocation between
1. "the agent's values have to be adversarially robust so that they make decisions in precisely the right way, such that the plans which most appeal to the decision-maker are human-aligned plans" (I think this is wrong, as I argue in that post)
2. "There may be a lot of sensitivity of outcome-goodness to the way the human-aligned shards influence decisions, such that the agent has to learn lots of 'niceness parameters' in order to end up treating us well" (Seems plausible)
2. I think (i) is doomed and unnecessary and not how realistic agents work, and I think (ii) might be very hard. But these are really not the same thing, I think (i) and (ii) present dissimilar risk profiles and research interventions. I think "EU maximization" frames focus on (i).

I think that multi-decision-influence networks seem much easier to align and much safer for humans.

It seems fine to me that you think this. As I wrote in a previous post, "Trust your intuitions, but don’t waste too much time arguing for them. If several people are attempting to answer the same question and they have different intuitions about how best to approach it, it seems efficient for each to rely on his or her intuition to choose the approach to explore."

As a further meta point, I think there's a pattern where because many existing (somewhat) concrete AI alignment approaches seem doomed (we can fairly easily see how they would end up breaking), people come up with newer, less concrete approaches which don't seem doomed, but only because they're less concrete and therefore it's harder to predict what they would actually do, or because fewer people have looked into them in detail and tried to break them. See this comment where I mentioned a similar worry with regard to Paul Christiano's IDA when it was less developed.

In this case, I think there are many ways that a shard-based agent could potentially cause existential catastrophes, but it's hard for me to say more, since I don't know the details of what your proposal will be.

(For example, how do the shards resolve conflicts, and how will they eventually transform into a reflectively stable agent? If one of the shards learns a distorted version of human values, which would cause an existential catastrophe if directly maximized for, how exactly does that get fixed by the time the agent becomes reflectively stable? Or if the agent never ends up maximizing anything, why isn't that liable to be a form of existential catastrophe? How do you propose to prevent astronomical waste caused by the agent spending resources on shard values that aren't human values? What prevents the shard agent from latching onto bad moral/philosophical ideas and causing existential catastrophes that way?)

I don't want to discourage you from working more on your approach and figuring out the details, but at the same time it seems way too early to say, hey let's stop working on other approaches and focus just on this one.

I think these are great points, thanks for leaving this comment. I myself patiently await the possible day where I hit an obvious shard theory landmine which has the classic alignment-difficulty "feel" to it. That day can totally come, and I want to be ready to recognize if it does.

at the same time it seems way too early to say, hey let's stop working on other approaches and focus just on this one.

FWIW I'm not intending to advocate "shard theory or GTFO", and agree that would be bad as a community policy.

I've tried to mention a few times[1] (but perhaps insufficiently prominently) that I'm less excited about people going "oh yeah I guess shard theory is great or something, let's just think about that now" and more excited about reactions like "Oh, I guess I should have been practicing more constant vigilance, time to think about alignment deeply on my own terms, setting aside established wisdom for the moment." I'm excited about other people thinking about alignment from first principles and coming up with their own inside views, with their own theories and current-best end-to-end pictures of AGI training runs.

1. ^

A-Outer: Suppose I agreed. Suppose I just dropped outer/inner. What next?

A: Then you would have the rare opportunity to pause and think while floating freely between agendas. I will, for the moment, hold off on proposing solutions. Even if my proposal is good, discussing it now would rob us of insights you could have contributed as well. There will be a shard theory research agenda post which will advocate for itself, in due time.

I've also made this point briefly at the end of in-person talks. Maybe I should say it more often.

Cooperation is game theoretically optimal.

This is a claim I strongly disagree with, assuming there aren't enforcement mechanisms like laws or contracts. If there isn't enforcement, then this reduces to the Prisoner's Dilemma, and there defection is game-theoretically optimal. Cooperation only works if things can be enforced, and the likelihood that we will be able to enforce things like contracts on superhuman intelligences is essentially like that of animals enforcing things on a human, i.e. so low that it's not worth privileging the hypothesis.
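For concreteness, here is a minimal sketch of the one-shot Prisoner's Dilemma claim. The payoff numbers are the usual textbook-style illustrative values and are an assumption, as are the helper names; the sketch covers only a single unenforced interaction and says nothing about repeated play.

```python
# Payoffs are (row player, column player). "C" = cooperate, "D" = defect.
# The specific numbers are illustrative assumptions.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")


def best_response(opponent_action: str) -> str:
    """Row player's payoff-maximising action against a fixed opponent action."""
    return max(ACTIONS, key=lambda a: PAYOFFS[(a, opponent_action)][0])


def is_nash(profile: tuple) -> bool:
    """True if neither player gains by unilaterally changing their action."""
    a, b = profile
    row_ok = all(PAYOFFS[(a, b)][0] >= PAYOFFS[(alt, b)][0] for alt in ACTIONS)
    col_ok = all(PAYOFFS[(a, b)][1] >= PAYOFFS[(a, alt)][1] for alt in ACTIONS)
    return row_ok and col_ok


# Defection is the best response whatever the other player does.
defect_dominates = all(best_response(other) == "D" for other in ACTIONS)
equilibria = [(a, b) for a in ACTIONS for b in ACTIONS if is_nash((a, b))]
```

With these payoffs, mutual defection is the only equilibrium even though mutual cooperation pays both players more, which is why the argument above leans so heavily on whether enforcement is available.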

And this is important, because it speaks to why the alignment problem is hard: agents with vastly differing capabilities can't enforce much of anything, so defection is going to happen. And I think this prediction bears out in real-life relations with animals: humans can defect consequence-free, so this usually happens.

One major exception is pets, where the norm really is cooperation, and the version that would be done for humans is essentially benevolent totalitarianism. Life's good in such a society, but modern democratic freedoms are almost certainly gone or so manipulated that it doesn't matter.

That might not be bad, but I do want to note that game theory without enforcement is where defection rules.

instrumental convergence towards developing some form of morality.

A morality that stably respects the less capable agent's wants is the necessary thing. And the answer to whether that develops is negative, except in the pets case. And even here, this will entail the end of democracy and most freedom as we know it. It might actually be benevolent totalitarianism, and you may make an argument that this is desirable, though I do want to note the costs.

What exactly do you mean by "multi objective optimization"?

Optimising multiple objective functions in a way that cannot be collapsed into a single utility function to e.g. the reals.

I guess multi objective optimisation can be represented by a single utility function that maps to a vector space, but as far as I'm aware, utility functions usually have a field as their codomain.
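As a rough illustration of why a vector-valued "utility" does not collapse into a real-valued one without extra choices, here is a sketch using Pareto dominance. The outcome names and scores are made up, and `dominates` and `comparable` are hypothetical helper names.

```python
from typing import Tuple

Vector = Tuple[float, ...]


def dominates(u: Vector, v: Vector) -> bool:
    """u Pareto-dominates v: at least as good on every objective and
    strictly better on at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def comparable(u: Vector, v: Vector) -> bool:
    """True if the dominance order ranks the pair one way or the other."""
    return u == v or dominates(u, v) or dominates(v, u)


# Hypothetical outcomes scored on two objectives.
outcomes = {"x": (3.0, 1.0), "y": (1.0, 3.0), "z": (0.0, 0.0)}

x_beats_z = dominates(outcomes["x"], outcomes["z"])
x_vs_y_ranked = comparable(outcomes["x"], outcomes["y"])
```

Here `x` and `y` are left unranked: the order is only partial. Any single real-valued utility would force one of them to be at least as good as the other, so mapping to the reals means adding a trade-off (for example, weights) that the vector representation itself does not contain.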

I think most examples of "arguments from expected utility maximisation" are going to look like what Rob wrote: not actually using expected utility maximization, but rather "having goals that you do a pretty good job at accomplishing". This gets you things like "more is better" with respect to resources like negentropy and computation, it gets you the idea that it's better to increase the probability that your goal is achieved, it gets you some degree of resisting changes to goal content (although I think contra Omohundro this can totally happen in bargaining scenarios), if "goal content" exists, and it gets you "better to not be diverted from your goal by meddling humans".

Also: I don't understand how one is supposed to get from "trained agents have a variety of contextual influences on decision making" to "trained agents are not expected utility maximizers", without somehow rebutting the arguments people make for why utility maximizers are good - and once we refute these arguments, we don't need talk of "shards" to refute them extra hard. Like, you can have different influences on your behaviour at different times that all add up to coherence. For example, one obvious influence that would plausibly be reinforced is "think about coherence so you don't randomly give up resources". Maybe this is supposed to make more sense if we use "expected utility maximizer" in a way that excludes "thing that is almost expected-utility-optimal" or "thing that switches between different regimes of expected utility maximization" but that strikes me as silly.

Separately from Scott's answer, if people reason

1. "Smart entities will be coherent relative to what they care about",
2. "Coherent entities can be seen as optimizing expected utility for some utility function"
3. "EU maximizers are dangerous."

I think both (1) and (3) are sketchy/wrong/weird.

(1) There's a step like "Don't you want to save as many lives as possible? Then you have to coherently trade off opportunities by assigning a value to each life." and the idea that this kind of reasoning then pins down "you now maximize, or approximately maximize, or want to maximize, some utility function over all universe-histories." This is just a huge leap IMO.

Also, I think that people mostly just imagine specific kinds of EU maximizers (e.g. over action-observation histories) with simple utility functions (e.g. one we could program into a simple Turing machine, and then hand to AIXI). And people remember all the scary hypotheticals where AIXI wireheads, or Eliezer's (hypothetical) example of an outcome-pump. I think that people think "it'll be an EU maximizer" and remember AIXI and conclude "unalignable" or "squeezes the future into a tiny weird contorted shape unless the utility function is perfectly aligned with what we care about." My imagined person acknowledges "mesa optimizers won't be just like AIXI, but I don't see a reason to think they'll be fundamentally differently structured in the limit."

On these perceptions of what happens in common reasoning about these issues, I think there is just an enormous number of invalid reasoning steps, and I can tell people about a few of them but—even if I make myself understood—there usually don't seem to be internal errors thrown which leads to a desperate effort to recheck other conclusions and ideas drawn from invalid steps. EU-maxing and its assumptions seep into a range of alignment concepts (including exhaustive search as a plausible idealization of agency). On my perceptions, even if someone agrees that a specific concept (like exhaustive search) is inappropriate, they don't seem to roll back belief-updates they made on the basis of that concept.

My current stance is "IDK what the AI cognition will look like in the end", and I'm trying not to collapse my uncertainty prematurely.

Agree with everything, including the crucial conclusion that thinking and writing about utility maximisation is counterproductive.

Just one minor thing that I disagree with in this post: while simulators as a mathematical abstraction are not agents, the physical systems that are simulators in our world, e. g. LLMs, are agents.

An attempt to answer the question in the title of this post, although that could be a rhetorical one:

• This could be a sort of epistemic and rhetorical inertia, specifically due to this infamous example of a paperclip maximiser. For a similar reason, a lot of people are still discussing and are mesmerised by the "Chinese Room" argument.
• The historic focus of LW is on decision theories and the discipline of rationality, where EU maximiser is the model agent. Then this concept was carried over to AI alignment discussions and was applied to model the future superhuman AI without careful consideration, or in lieu of the better models of agency (at least at the time when this has happened). Then, again: epistemic and didactic inertia: a lot of "foundational AI x-risk/alignment texts" still mention utility maximization, new people who enter the field still get exposed to this concept early and think and write about it perhaps before finding concepts and questions more relevant to the actual reality to think and write about, etc.

I wish that we were clearer that in a lot of circumstances we don't actually need a utility maximiser for our argument, but rather an AI that is optimising sufficiently hard. Unfortunately, I suspect competitive dynamics and the unilateralist's curse might push us further down this path than we'd like.

My personal suspicion is that an AI being indifferent between a large class of outcomes matters little; it's still going to absolutely ensure that it hits the pareto frontier of its competing preferences.

Hitting the pareto frontier looks very different from hitting the optimum of a single objective.
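A small sketch of the difference, under toy assumptions: the candidate outcomes are made-up points scored on two objectives, and `pareto_frontier` is a hypothetical helper written for this illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def dominates(u: Point, v: Point) -> bool:
    """u is at least as good as v on both objectives and strictly better on one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical outcomes, scored as (objective 1, objective 2).
candidates = [(10, 0), (6, 6), (0, 10), (5, 5), (2, 1)]

frontier = pareto_frontier(candidates)

# Optimising objective 1 alone picks a single extreme point.
single_objective_optimum = max(candidates, key=lambda p: p[0])
```

The frontier here is a set of three mutually incomparable outcomes, including the balanced one, whereas the single-objective optimum is one extreme corner that sacrifices the second objective entirely.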

I don't think those arguments that rely on EU maximisation translate.

As an additional reason to be suspicious of arguments based on expected utility maximization, VNM expected utility maximizers aren't embedded agents. Classical expected utility theory treats computations performed at EUMs as having no physical side effects (e.g., energy consumption or waste heat generation), and the hardware that EUMs run on is treated as separate from the world that EUMs maximize utility over. Classical expected utility theory can't handle scenarios like self-modification, logical uncertainty, or the existence of other copies of the agent in the environment. Idealized EUMs aren't just unreachable via reinforcement learning, they aren't physically possible at all. An argument based on expected utility maximization that doesn't address embedded agency is going to ignore a lot of factors that are relevant to AI alignment.

I still have serious trouble trying to get what people include in "expected utility maximization". A utility function is just a restatement of preferences. It does and requires nothing.

I collected some components of what this take is actually saying (as far as it is cognizable to me).

static total order over preferences (what a utility function implies)

This claims that utility functions have temporal translation symmetry built-in.

maximising a simple expected utility function

This claims that utility functions means that an agent has internal representations of its affordances (or some kind of self-control logic). I disagree/I don't understand.

Suppose you want to test how fire-safe agents are. You do so by putting an appendage of theirs on a hot stove. If the agent rests its appendage on the stove you classify it as defective. If the agent removes its appendage from the stove you classify it as compliant. You test rock-bot and spook-bot. Rock-bot fails and does not have any electronics inside its shell. Spook-bot just has a reflex retracting everything upon a pain signal and passes. Neither bot involves making a world-model or considering options. Another way of phrasing this is that you find bots whose utility function values resting the appendage to a great degree to be undesirable.

maximising a simple expected utility function

This claims that expected utility maximisation involves using an internal representation that is some combination of: fast to use in deployment, has low hardware space requirements to store, uses little programmer time to code, uses few programming lines to encode.

And I guess in the mathematical sense this line of thought points toward "the utility function has a small finite number of terms as an algebraic expression".

So the things I have fished out and explicated:

• static
• phenomenological
• succinct

Not all decision-making algorithms work by preferring outcomes, and not all decision-making algorithms that work by preferring outcomes have preferences that form a total preorder over outcomes, which is what would be required to losslessly translate those preferences into a utility function. Many reasonable kinds of decision-making algorithms (for example, ones that have ceteris paribus preferences) do not meet that requirement, including the sorts we see in real world agents. I see no reason to restrict ourselves to the subset that do.
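The total-preorder requirement can be checked directly on a finite set of outcomes. The sketch below is illustrative only: the outcome sets are made up, the function names are hypothetical, and the feature-wise comparison is a toy stand-in for ceteris paribus preferences rather than a faithful model of them.

```python
from itertools import product
from typing import Callable, Dict, Hashable, Sequence

Prefers = Callable[[Hashable, Hashable], bool]


def is_total_preorder(items: Sequence[Hashable], weakly_prefers: Prefers) -> bool:
    """Check completeness (every pair is ranked) and transitivity."""
    complete = all(
        weakly_prefers(a, b) or weakly_prefers(b, a)
        for a, b in product(items, repeat=2)
    )
    transitive = all(
        weakly_prefers(a, c)
        for a, b, c in product(items, repeat=3)
        if weakly_prefers(a, b) and weakly_prefers(b, c)
    )
    return complete and transitive


def utility_from_preorder(items: Sequence[Hashable], weakly_prefers: Prefers) -> Dict[Hashable, int]:
    """u(a) = number of items a is weakly preferred to.
    This represents the preferences only when they form a total preorder."""
    if not is_total_preorder(items, weakly_prefers):
        raise ValueError("preferences are not a total preorder")
    return {a: sum(1 for b in items if weakly_prefers(a, b)) for a in items}


# Case 1: preferences induced by a ranking (ties allowed) are a total preorder.
rank = {"tea": 2, "coffee": 2, "water": 1}
drinks = list(rank)


def by_rank(a: str, b: str) -> bool:
    return rank[a] >= rank[b]


u = utility_from_preorder(drinks, by_rank)

# Case 2: "at least as good on every feature" leaves some pairs unranked.
bundles = [(1, 0), (0, 1), (1, 1)]


def featurewise(a: tuple, b: tuple) -> bool:
    return all(x >= y for x, y in zip(a, b))


bundles_ok = is_total_preorder(bundles, featurewise)
```

In the first case a utility function falls out mechanically. In the second, the bundles `(1, 0)` and `(0, 1)` are incomparable, so any real-valued utility would add a ranking the agent's preferences do not contain.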

So the phenomenological meaning is what you centrally mean?

I do not advocate for any of the 3 meanings, but I want to figure out what you are against.

To me a utility function is a description of the impact of the agent's existence, and even saying that it refers to an algorithm is a misuse of the concept.

To be honest I'm not sure what you mean. I don't think so?

An agent makes decisions by some procedure. For some agents, the decisions that procedure produces can be viewed as choosing the more preferred outcome (i.e. when given a choice between A and B, if its decision procedure deterministically chooses A we'd describe that as "preferring A over B"). For some of those agents, the decisions they make have some additional properties, like that they always either consistently choose A over B or are consistently indifferent between them. When you have an agent like that and combine it with probabilistic reasoning, you get an agent whose decision-making can be compressed into a single utility function.

That notion of chooser is sensible. I think it is important to differentiate between "giving a choice" and "forms a choice", i.e. whether it is the agent or the environment doing it. Seating a rock-bot in front of a chess board can be "giving a choice" without "forms a choice" ever happening (rock-bot is not a chooser). Similarly, while the environment "gives a choice to pull arm away", spook-bot never "forms a choice" (because it is literally unimaginable for it to do otherwise) and is not a chooser.

Even spook-bot is external situation consistent and doesn't require being a chooser to do that. Only a chooser can ever be internal situation consistent (and even then it should be relativised to particular details of the internal state ie "Seems I can choose between A and B" and "Seems I can choose between A and B. Oh there is a puppy in the window." are in the same bucket) but that is hard to approach as the agent is free to build representations as it wants.

So sure, if you have an agent that is internal-situation-consistent along some of its internal situation's details, and you know what details those are, then you can specify which bits of the agent's internal state you can forget without impacting your ability to predict its external actions.

Going over this revealed a stepping stone I had been falling for. "Expected utility" involves mental representations, and "utility expectation" is about statistics of which there might not be awareness. An agent that makes the choice with highest utility expectation is statistically as suffering-free as possible. An agent that makes the choice with highest expected utility is statistically minimally (subjectively) regretful.

I think that solving alignment for EV maximizers is a much stronger version of alignment than e.g. prosaic alignment of LLM-type models. Agents seem like they’ll be more powerful than Tool AIs. We don’t know how to make them, but if someone does, and capabilities timelines shorten drastically, it would be awesome to even have a theory of EV maximizer alignment before then.

Reinforcement learning does create agents, those agents just aren't expected utility maximisers.

Claims that expected utility maximisation is the ideal or limit of agency seem wrong.

I think expected utility maximisation is probably anti-natural to generally capable optimisers.