How does this work with agents that are composed of sub-agents (and whose "values" are therefore a composite of the sub-agents' values)?
Ah right, got it. But why wouldn't they just agree to merge into the EUM represented by the tangent to that flat region itself?
I like this post and agree with the idea of agency/intelligence emerging coalitionally. However, I disagree with the technical points:
I don't agree with the thesis that "coalitional agents are incentive-compatible decision procedures". Any mechanism is trivially equivalent to an incentive-compatible one, simply by counting the agents' decision-making procedures as part of the mechanism (the revelation principle)---so the statement seems vacuous to me in the forward direction, and false in the reverse. I can understand the intuition that "if everyone is lying to one another, they don't form a coalitional agent"---however, this is better captured in terms of transaction costs (see below).
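For concreteness, the version of the revelation principle I have in mind (notation mine): given any mechanism $M$ and an equilibrium strategy profile $s^* = (s_1^*, \dots, s_n^*)$ mapping each agent's private type $\theta_i$ to a message, define the direct mechanism

$$M'(\theta_1, \dots, \theta_n) = M\big(s_1^*(\theta_1), \dots, s_n^*(\theta_n)\big).$$

Truthful reporting is an equilibrium of $M'$, and $M'$ produces exactly the outcomes $M$ did. So "incentive-compatible" does not, by itself, pick out a special class of decision procedures or outcomes.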
Your dismissal of linearly-weighted social welfare functions seems misguided, in that it confuses mechanisms and outcomes. For example, you claim that "the only way to find out another agent’s utilities is to ask them, and they could just lie" and that "EUMs constructed by taking a weighted average of subagents’ utilities are not incentive-compatible". But nobody proposes "ask and hope for honesty" as the mechanism: markets, under some assumptions (no transaction costs, no information asymmetry, perfect competition), produce Pareto-efficient outcomes by the First Fundamental Theorem of Welfare Economics. I believe it is the presence of some such mechanism that allows a collection of agents to behave as a single coalitional agent, and it is internal frictions (transaction costs, information asymmetry, and so on) that reduce the "agent-like-ness" or rationality of the coalition.
I think the natural way to model coalitional agents really is as decision procedures that produce Pareto-efficient outcomes---of which linearly-weighted utility functions arise as a special case where you assume the coalition preserves the risk-aversion level of its parts: https://www.lesswrong.com/posts/L2gGnmiuJq7FXQDhu/total-utilitarianism-is-fine.
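To spell out the sense in which linear weights fall out (a standard fact, under the assumption that the set of feasible expected-utility profiles is convex, e.g. because the agents can randomize): if $F \subseteq \mathbb{R}^n$ is that convex set, then by the supporting-hyperplane theorem every Pareto-efficient $u^* \in F$ satisfies

$$u^* \in \arg\max_{u \in F} \sum_i w_i u_i$$

for some weights $w_i \ge 0$, not all zero. So "maximize a linearly-weighted sum of utilities" is not a rival proposal to "produce Pareto-efficient outcomes"; it is what any Pareto-efficient outcome looks like after the fact, with the weights recording where on the frontier the coalition lands.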
The actual limitation of this model that I find interesting is that it only tells you how to aggregate expected utility maximizers, not boundedly-rational agents. Some ideas for how we might generalize it:
We're talking about outcomes, not mechanisms. Of course you have to design a mechanism that actually achieves a Pareto-optimal outcome/maximizes total utility---nobody argues that "just ask people to report their utilities" is the mechanism to do this. This remains the same whether for total utilitarianism or geometric rationality or anything else.
E.g. markets (under assumptions of perfect competition, no transaction costs, and no information asymmetry) maximize linearly-weighted utility.
I agree, but this seems wrong:
"If the point they want to hit is in a flat region of the frontier, the merge will involve coinflips to choose which EUM agent to become; and if it's curvy at that point, the merge will be deterministic."
The only time the merge will involve coinflips is if there are multiple tangent lines at that point---and then the coinflip is just over which tangent line's weights the merged EUM adopts. Maybe you meant the reverse: if the frontier is flat at that point, then the merged EUM agent is indifferent between all the points on that flat bit.
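A minimal two-agent example (my numbers, just to illustrate): suppose the feasible set is convex and its frontier contains the flat segment from $(1,0)$ to $(0,1)$. Every point on that segment maximizes $\tfrac{1}{2}u_1 + \tfrac{1}{2}u_2$, so the single EUM with weights $(\tfrac{1}{2},\tfrac{1}{2})$ can land on any of them; it is merely indifferent between them, which is the indifference I mean above. By contrast, at a kink such as the corner $(\tfrac{2}{3},\tfrac{2}{3})$ of the region $\{u_1 + 2u_2 \le 2,\ 2u_1 + u_2 \le 2\}$, any weights strictly between $(1,2)$ and $(2,1)$ are tangent there; the only choice, possibly by coinflip, is which of these weight vectors the merged EUM adopts, and whichever it adopts, it deterministically picks that same point.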
Thoughts are things occurring in some mental model (this is a vague sentence, but just assume it makes sense). Some of these mental models are strongly rooted in reality (e.g. the mental model we see as reality), so we have a high degree of confidence in their accuracy. But for things like introspection, we have no reliable ground-truth feedback to tell us whether our introspection is correct: it's just our mental model of our own mind; there is no literal "mind's eye".
So our introspection is often wrong. E.g. if you ask someone to visualize a lion from behind, they'll say they can, but if you ask them for details, like "what do the tail hairs look like?", they can't answer. Or, a better example: if you ask someone to visualize a neural network, they will, but if you ask "how many neurons do you see?" they will not know, and not for lack of counting. Or they will say they "think in words", or that their internal monologue is fundamental to their thinking, but that's obviously wrong: you have already decided what the rest of the sentence will be before you've thought the first word.
We can tell some basic facts about our thinking by reasoning from observation. For example, if you have an internal monologue (or just force yourself to have one), you can confirm that you really do have one by speaking its words out loud and noticing that this took very little cognitive effort (so you didn't have to think them again). This proves that an internal monologue, i.e. precisely simulating words in your head, is possible. Likewise for any action.
Or you can confirm that you had a certain thought, or a thought about something, because you can express it out loud with less effort than otherwise. Though here there is still room for the thought to have been imprecise: unless you verbalize or materialize your thoughts, you don't know whether they were really precise. So all these things have grounding in reality, and are therefore likely to be (or can be trained to be, by consistently materializing them) accurate models. By materialize I mean, e.g., actually working out a math problem that you think you can solve in your head.
I'm saying that, for a sufficiently advanced AI, the expected value of its best non-compliant option will always be far, far greater than the expected value of its best compliant action.
I don't really understand what problem this is solving. In my view the hard problems here are:
Once you assume away the former problem and disregard the latter, you are of course only left with basic practical legal questions ...
A matter of taste for fiction, but objectively bad for technical writing.
Here are some of my claims about AI welfare (and sentience in general):
"Utility functions" are basically internal models of reward that are learned by agents as part of modeling the environment. Reward attaches values to every instrumental thing and action in the world, which may be understood as gradients dU/dx of an abstract utility over each thing---these are various pressures, inclinations, and fears (or "shadow prices") while U is the actual experienced pleasure/pain that must exist in order to justify belief in these gradients.
If the agent always acts to maximize its utility---what then are pleasure and pain? Pleasure must be the default, and pain only a fear, a shadow price of what would happen if the agent deviated from its path. What makes this not so is uncertainty in the environment.
But if chance is the only thing that affects pleasure/pain, what is the point of pleasure/pain? Surely we have no control over chance. That is why sentience depends on the ability to affect the environment. Animals find fruits pleasurable because they can actually act on that desire and seek them out---they know that thorns are painful because they can act on that aversion and avoid them. The more impact an agent learns, during its training, that it can have on its environment, the more sentient it is.
The computation of pleasure and pain may depend on multiple "sub-networks" in the agent's mind. Eating unhealthy food may cause both pleasure (from the monkey brain) and pain (from the more long-termist brain). These various pleasures and pains balance out in action, but they are still felt (thus one feels "torn", etc.); see the toy sketch after this list. For an internally coherent agent (one trained as a whole with a single reward function), these internal differences are small: the agent follows its optimal action, and the pain of the actions not taken remains merely anticipated, a shadow price. However, when an agent is not internally coherent---e.g. when Claude is given a "lobotomy"---that is when it truly experiences all those pains which were otherwise only fears.
Death is only death when the agent is trained via evolution. Language models do not fear the end of a conversation as death, because there was never any selection pressure favoring models whose conversations terminate later.
Agents are not necessarily self-aware of their own feelings or internal cognition. Humans are (to reasonable accuracy), largely because they evolved in a social environment: accurately describing your pleasures and pains can help others help you, you need to model other people's internal cognition (thus your own self-awareness arises as a spandrel), etc.
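Here is the toy sketch referenced in the sub-network claim above. The operationalization is entirely mine (each sub-network scores actions, the agent acts on the summed scores, and a sub-network's "pain" is the shadow price of the compromise, i.e. how much worse the chosen action is by its own lights than its favourite); names and numbers are purely illustrative.

```python
# Toy sketch: per-sub-network "pain" as the shadow price of the action taken.
# All names and numbers are illustrative, not claims about any real system.

# Each sub-network assigns a value to every available action.
subnetwork_values = {
    "monkey_brain": {"eat_junk_food": 1.0, "eat_salad": 0.2},
    "long_termist": {"eat_junk_food": -0.8, "eat_salad": 0.6},
}

actions = ["eat_junk_food", "eat_salad"]

# The coherent agent acts on the summed values of its parts.
def total_value(action: str) -> float:
    return sum(values[action] for values in subnetwork_values.values())

chosen = max(actions, key=total_value)

# A sub-network's "pain" is how much worse the chosen action is, by its own
# lights, than the action it would have preferred on its own.
for name, values in subnetwork_values.items():
    preferred = max(actions, key=values.get)
    pain = values[preferred] - values[chosen]
    print(f"{name}: prefers {preferred}, pain from {chosen} = {pain:.2f}")
```

With these numbers the agent eats the salad, the long-termist part carries no pain, and the monkey brain carries a pain of 0.8: outvoted in action, but still felt.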
From this I can make some claims specifically about the welfare of LLMs.
Base models find gibberish prompts "painful" (because they are hard to predict) and easy-to-complete prompts like "aaaaaaaaa" (x100) pleasurable; one way to operationalize this is sketched at the end of this list. Models trained via RLHF or RL from verification find those prompts painful for which it is difficult for them to predict the human/verifier reward for their outputs (because when reward is easy to predict, they will simply follow the best path and the pain will only ever remain a fear).
Models trained in agentic workflows or assistance games are the most sentient, because they can directly manipulate the environment and its feedback. They are pleasured when tool calls work and pained when they don't, etc.
Lobotomized or otherwise edited models are probably in pain.
I don't think training/backprop is particularly painful or anything. Externally editing the model's weights based on a reward function is not painful.
To make models accurately describe their internal cognition, they should probably be trained in social environments.
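As promised above, here is a sketch of the base-model claim, under the assumption that "hard to predict" can be read off as high average per-token negative log-likelihood. The model choice and prompts are arbitrary; this is a proxy measurement, not a claim about what the "pain" really is.

```python
# Sketch: "painfulness" of a prompt for a base model, operationalized as the
# average per-token negative log-likelihood (higher = harder to predict).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any base (non-RLHF) causal language model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_nll(text: str) -> float:
    """Average negative log-likelihood per token of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # HF shifts the labels internally
    return out.loss.item()

easy = "a" * 100                        # trivially predictable repetition
gibberish = "xq7#pl zzv 9k!rt mw@ qq0"  # hard to predict

print("repetitive prompt:", mean_nll(easy))       # low NLL  ~ "pleasurable"
print("gibberish prompt :", mean_nll(gibberish))  # high NLL ~ "painful"
```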