Ah right, got it. But why wouldn't they just agree to merge into the EUM represented by the tangent to that flat region itself?
I like this post and agree with the idea of agency/intelligence emerging coalitionally. However, I disagree with the technical points:
I don't agree with the thesis that "coalitional agents are incentive-compatible decision procedures". Any mechanism is trivially equivalent to an incentive-compatible one by simply counting the agents' decision-making procedures as part of the mechanism (revelation principle)---so it seems to me that the statement is vacuous in the forward direction, and false in the reverse. I can understand the intuition that "if everyone is lying to one another, they don't form a coalitional agent"---however, this is better captured in terms of transaction costs (see below).
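To spell out the revelation-principle point with a throwaway sketch (my own illustration, with made-up names like `as_direct_mechanism`; nothing here is from the post): take whatever mechanism the coalition is running, plus the members' (possibly dishonest) equilibrium strategies, and fold the strategies into the mechanism. The wrapped "direct" mechanism is incentive-compatible by construction, which is why I think incentive-compatibility alone can't be what makes something a coalitional agent.

```python
# Hypothetical sketch of the revelation-principle construction (illustrative
# names only). Given any mechanism and the agents' equilibrium strategies,
# the wrapped mechanism below asks for types directly and simulates the lying
# on each agent's behalf, so truthful reporting reproduces the original
# equilibrium outcome and no agent gains by misreporting.
from typing import Any, Callable, Dict

Type = Any      # an agent's private information (e.g. its true utilities)
Message = Any   # whatever the original mechanism asks agents to submit
Outcome = Any

def as_direct_mechanism(
    mechanism: Callable[[Dict[str, Message]], Outcome],
    strategies: Dict[str, Callable[[Type], Message]],
) -> Callable[[Dict[str, Type]], Outcome]:
    """Fold the agents' decision procedures into the mechanism itself."""
    def direct(reported_types: Dict[str, Type]) -> Outcome:
        # Simulate the message each agent would have sent given its type.
        messages = {name: strategies[name](t) for name, t in reported_types.items()}
        return mechanism(messages)
    return direct
```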
Your dismissal of linearly-weighted social welfare functions seems misguided, in that it confuses mechanisms and outcomes. For example, you claim that "the only way to find out another agent’s utilities is to ask them, and they could just lie" and that "EUMs constructed by taking a weighted average of subagents’ utilities are not incentive-compatible". But these are objections to one naive mechanism, not to the outcome: markets, under some assumptions (no transaction costs, no information asymmetry, perfect competition), are Pareto-efficient by the First Fundamental Theorem of Welfare Economics. I believe it is the presence of such a mechanism that allows a collection of agents to behave as a single coalitional agent, and it is internal frictions like transaction costs and information asymmetry that reduce the "agent-like-ness" or rationality of this coalition.
I think the natural way to model coalitional agents really is as decision procedures that produce Pareto-efficient outcomes---of which linearly-weighted utility functions arise as a special case where you assume the coalition preserves the risk-aversion level of its parts: https://www.lesswrong.com/posts/L2gGnmiuJq7FXQDhu/total-utilitarianism-is-fine.
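As a toy illustration of that special case (my own made-up outcomes and utilities, not anything from the linked post): maximizing any strictly positively-weighted sum of the subagents' utilities always selects a Pareto-efficient outcome, and sweeping the weights traces out different frontier points.

```python
# Toy example: two subagents, four joint outcomes (utilities are made up).
outcomes = {
    "A": (10.0, 1.0),   # (utility to subagent 1, utility to subagent 2)
    "B": (7.0, 7.0),
    "C": (1.0, 10.0),
    "D": (4.0, 4.0),    # Pareto-dominated by B
}

def is_pareto_efficient(name):
    u1, u2 = outcomes[name]
    return not any(v1 >= u1 and v2 >= u2 and (v1, v2) != (u1, u2)
                   for v1, v2 in outcomes.values())

def weighted_best(w1, w2):
    # The coalition acting as a linearly-weighted EUM.
    return max(outcomes, key=lambda k: w1 * outcomes[k][0] + w2 * outcomes[k][1])

for w1 in (0.1, 0.3, 0.5, 0.7, 0.9):
    choice = weighted_best(w1, 1 - w1)
    print(w1, choice, is_pareto_efficient(choice))  # efficient for every w1 in (0, 1)
```

The dominated outcome D is never chosen for any strictly positive weights, which is the Harsanyi-flavoured direction of the claim; the converse (that every Pareto-efficient point is the maximizer of some weighted sum) needs a convexity assumption, e.g. allowing lotteries over outcomes.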
The actual limitation of this model that I find interesting is that it only tells you how to aggregate expected utility maximizers, not boundedly-rational agents. Some ideas for how we might generalize it:
We're talking about outcomes, not mechanisms. Of course you have to design a mechanism that actually achieves a Pareto-optimal outcome/maximizes total utility---nobody argues that "just ask people to report their utilities" is the mechanism to do this. This remains the same whether you're aggregating via total utilitarianism, geometric rationality, or anything else.
E.g. markets (under assumptions of perfect competition, no transaction costs, and no information asymmetry) maximize linearly-weighted utility.
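For concreteness, here's a small numerical check of the kind of thing I mean, under exactly those assumptions (a toy 2-agent, 2-good Cobb-Douglas exchange economy that I'm making up for the example): the competitive-equilibrium allocation admits no Pareto improvement, which, given convexity, is the same as saying it maximizes some positively-weighted sum of the two agents' utilities.

```python
# Toy Edgeworth-box economy (my own example): agent 1 owns 1 unit of good x,
# agent 2 owns 1 unit of good y; Cobb-Douglas preferences with exponents a, b.
import itertools

a, b = 0.7, 0.3
u1 = lambda x, y: x**a * y**(1 - a)
u2 = lambda x, y: x**b * y**(1 - b)

# Normalizing the price of y to 1, market clearing gives p_x = b / (1 - a),
# and Cobb-Douglas demands give the equilibrium allocation directly.
p = b / (1 - a)
eq1 = (a, (1 - a) * p)          # agent 1's bundle
eq2 = (b / p, 1 - b)            # agent 2's bundle
eq_u1, eq_u2 = u1(*eq1), u2(*eq2)

# Grid-search all feasible allocations for a strict Pareto improvement.
grid = [i / 200 for i in range(201)]
improvement = any(
    u1(x1, y1) > eq_u1 + 1e-9 and u2(1 - x1, 1 - y1) > eq_u2 + 1e-9
    for x1, y1 in itertools.product(grid, grid)
)
print("Pareto improvement over the market outcome?", improvement)  # False
```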
I agree, but this seems wrong:
"If the point they want to hit is in a flat region of the frontier, the merge will involve coinflips to choose which EUM agent to become; and if it's curvy at that point, the merge will be deterministic."
The only time the merge will involve coinflips is if there are multiple tangent lines at that point---then the weights of any of those tangent lines could serve as the weights of the merged EUM. Maybe you meant the reverse: if the frontier is flat at that point, then the merged EUM agent is indifferent between all the points on that flat segment.
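Here's a quick numeric illustration of what I mean (my own construction, just two agents and a stylized frontier): with weights equal to the normal of a flat stretch, the merged EUM scores every point on that stretch identically, i.e. it is indifferent; whereas on a strictly curved frontier the same weights pick out a single point, so the merge is deterministic without any coinflips.

```python
# Two stylized Pareto frontiers in utility space (my own toy numbers).
import numpy as np

w = np.array([0.5, 0.5])                    # weights of the merged EUM

# 1) Frontier with a flat segment: u1 + u2 = 1 for u1 in [0.25, 0.75].
flat = np.array([[t, 1 - t] for t in np.linspace(0.25, 0.75, 11)])
print(np.round(flat @ w, 6))                # all 0.5 -> indifferent across the segment

# 2) Strictly concave ("curvy") frontier: u2 = sqrt(1 - u1^2).
t = np.linspace(0.0, 1.0, 1001)
curvy = np.stack([t, np.sqrt(1 - t**2)], axis=1)
print(curvy[np.argmax(curvy @ w)])          # a single maximizer, ~(0.707, 0.707)
```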
Thoughts are things occurring in some mental model (this is a vague sentence, but just assume it makes sense). Some of these mental models are strongly rooted in reality (e.g. the mental model we see as reality), so we have a high degree of confidence in their accuracy. But for things like introspection, we do not have reliable ground-truth feedback to tell us whether our introspection is correct—it's just our mental model of our own mind; there is no literal "mind's eye".
So our introspection is often wrong. E.g. if you ask someone to visualize a lion from behind, they'll say they can, but if you ask them for details, like "what do the tail hairs look like?", they can't answer. Or, a better example: if you ask someone to visualize a neural network, they will, but if you ask "how many neurons do you see?" they won't know, and not for lack of counting. Or they will say they "think in words" or that their internal monologue is fundamental to their thinking, but that's obviously wrong: you have already decided what the rest of the sentence will be before you've thought the first word.
We can tell some basic facts about our thinking by reasoning from observation. For example, if you have an internal monologue (or just force yourself to have one), then you can confirm that you indeed have one by speaking its words out loud and noting that this took very little cognitive effort (so you didn't have to think them again). This proves that an internal monologue, i.e. precisely simulating words in your head, is possible. Likewise for any action.
Or you can confirm that you had a certain thought, or a thought about something, because you can express it out loud with less effort than otherwise. Though there is still room for that thought to have been imprecise: unless you verbalize or otherwise materialize it, you don't know how precise it really was. So all these things have grounding in reality, and are therefore likely to be (or can be trained to be, by consistently materializing them) accurate models. By "materialize" I mean, e.g., actually solving a math problem that you think, in your head, you can solve.
I'm saying that, for a sufficiently advanced AI, the expected value of its best non-compliant option will always be far, far greater than the expected value of its best compliant action.
I don't really understand what problem this is solving. In my view the hard problems here are:
Once you assume away the former problem and disregard the latter, you are of course only left with basic practical legal questions ...
A matter of taste for fiction, but objectively bad for technical writing.
So I'm learning about & writing on thermodynamics right now, and often there is a distinction between the "motivating questions"/"sources of confusion" and the actually important lessons you get from exploring them.
E.g. a motivating question is "... and yet it scalds (even if you know the state of every particle in a cup of water)" and the takeaway from it is "your finger also has beliefs" or "thermodynamics is about reference/semantics".
The latter might be the more typical section heading, as it is correct for systematizing the topic, but it is a spoiler. The former is better for putting the reader in the right frame and getting them to think about the right questions to initiate their thinking.
How does this work with agents that are composed of sub-agents (and whose "values" are thus a composite of the sub-agents' values)?