How would you categorize a collection of agents that do not have a hierarchical relationship (each one is fully autonomous), but all share the same values? The main consideration that makes me think this is a likely outcome is that coordination seems a lot easier when two or more agents share the same values,[1] because they would immediately have no reason to lie to each other or do anything else that benefits their own values at the cost of others'. This could come about either by a group of agents with different values making a bargain to all change their values into some compromise values, or by a group of agents already having the same values (e.g., copies of some initial agent) outcompeting others due to their greater coordination ability.
My guess is that you would categorize this as "centralized" but if so I think it's rather misleading to group this kind of situation with humans deferring to or controlled by central authorities (e.g. "international organizations"), which is kind of the exact opposite, i.e., one is a non-hierarchical group of agents sharing the same values, the other is a hierarchical group of agents not sharing the same values. These things are probably going to have radically different properties!
FDT/UDT was in part an attempt to figure out how advanced agents might be able to coordinate well despite not sharing the same values, but it seems like we don't really have a good story to tell about this, after many years of trying.
One point of this framework is to distinguish "sharing values" from "actually trusting each other". There are cases where agents share values but don't trust each other, or get stuck in coordination traps (e.g. citizens living under a dictatorship who all hate the dictator but can't act as a coherent group to remove him). If your collection of agents resolves such traps via a centralized procedure (e.g. appointing a leader), then I'd call them a centralized agent (and they'd have many of the same weaknesses as other centralized agents, e.g. vulnerability to that leader being spoofed or subverted).
Conversely, your agents could resolve such traps via each individually having "procedural values" (aka "ethics") which help them coordinate. E.g. each agent might individually value honesty, making it easy for them to create common knowledge of how much they hate the dictator. Insofar as they use this coordination mechanism rather than any centralized mechanisms, I'd call them a distributed agent. However, note that agents can use this coordination mechanism even when they don't share terminal values, as long as they share the right procedural values (e.g. honesty boxes in small towns, where you just trust that nobody will steal from you even though they ultimately care about different things from you).
As I mentioned in the post, I call the idealized agents which have both the robustness of a distributed agent and the efficiency of a centralized agent "coalitional agents". Sharing all the same values certainly makes it easier for a collection of agents to act as a coalitional agent, but seems far from sufficient. In particular, it doesn't guarantee that they have an ethical framework which promotes distributed cooperation. They might even share values which undermine many types of cooperation (e.g. honor cultures shared values under which non-standard types of cooperativeness were often seen as weakness).
There are cases where agents share values but don't trust each other, or get stuck in coordination traps (e.g. citizens living under a dictatorship who all hate the dictator but can't act as a coherent group to remove him).
No, these citizens don't share the same values in my sense, because they're mostly selfish, and would prefer that others risk their lives to remove the dictatorship while they themselves freeride in safety. Two AIs with the same utility function (and prior) over world states or histories would be an example of what I mean.
Conversely, there are also cases where agents don't share "terminal values" but do share enough "procedural values" that they can basically just trust each other (e.g. honesty boxes in small towns, where you just trust that nobody will steal from you even though they ultimately care about different things from you).
I'm not sure why you need to invent a concept of "shared procedural values" to explain this when game theory says there are plenty of human situations where Cooperate can be a Nash equilibrium. In this case, there's a small chance that stealing would be observed by someone in a small town, causing you to lose a huge amount of status/reputation, which is clearly not worth the benefit of getting some free item.
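As a minimal sketch of that expected-value reasoning (the specific numbers are hypothetical placeholders, chosen only to show the structure of the argument, not claims about real small towns):

```python
# Minimal sketch of the expected-value argument above, with illustrative numbers.

def expected_payoff_of_stealing(item_value, p_observed, reputation_cost):
    """Expected payoff of stealing: you gain the item for sure, but with
    probability p_observed you also pay the reputation cost."""
    return item_value - p_observed * reputation_cost

# Even a 5% chance of being seen makes stealing a losing move when reputation
# in a small community is worth far more than any single item.
honest_payoff = 0.0
stealing_payoff = expected_payoff_of_stealing(item_value=10, p_observed=0.05,
                                              reputation_cost=1000)
assert stealing_payoff < honest_payoff  # -40 < 0: not stealing is the better reply
```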
I mean, perhaps it has some value in that humans have computational limitations and can't compute the optimal strategy all the time, so maybe we use cached "procedural values" and this explains some amount of cooperation beyond standard game theory (e.g. why people don't steal even when everyone else has left town for vacation or something), but it doesn't seem that useful when we're talking about high-stakes future situations...
ETA: Looks like you edited your comment quite a bit from the version I replied to. Don't have time to check right now whether I also need to edit this reply.
No, these citizens don't share the same values in my sense, because they're mostly selfish, and would prefer that others risk their lives to remove the dictatorship while they themselves freeride in safety.
Yes, agreed, I was just gesturing at the closest thing we have to a real-world example. But severe coordination problems can also arise even when agents have exactly the same values—here's one example. (EDIT: a simpler example from my other reply: "they might not be certain that the other agent has the same values as they do, and therefore all else equal would prefer to have resources themselves rather than giving them to the other agent".) And based on this I claim that a dictatorship could in principle survive (at least for a while) even when every citizen actually shared exactly the same selfless anti-dictator values, as long as the common knowledge equilibrium were bad enough.
maybe we use cached "procedural values" and this explains some amount of cooperation beyond standard game theory
Yes. Except that our attitudes towards this differ: you seem to think of it as an edge case, whereas I think of it as an anomaly which is pointing us towards ways in which game theory is the wrong framework to be using. Specifically I think that game theory makes it hard to study group agents, because you assume that each agent's strategy is derived from its terminal values. However, in practice the way that group agents work is by programming heuristics/ethics/procedural values into their subagents. And so "cached" is an inaccurate description: it's not that the individual derives those values rationally and then stores them for later, but rather that they've been "programmed" with them (e.g. via reinforcement learning).
(Of course, ideally the study of group/distributed agents and the study of individual/centralized agents will converge, but I think the best way to do that is to separately explore both perspectives.)
ETA: Looks like you edited your comment quite a bit from the version I replied to.
Sorry about this!
One point of this framework is to distinguish "sharing values" from "actually trusting each other". There are cases where agents share values but don't trust each other, or get stuck in coordination traps
In Wei Dai's thinking, having the same values/utility function means that two agents care about the exact same things. This is formalized in UDT, but it's also a requirement you can add to most decision theories, e.g. CDT with reflective oracles (or some other mostly lawful incomplete measure). This is normally described as requiring that the utility function has no "indexical components," i.e. components that point to something about the agent that is running the utility function. This is slightly confusing, so it may be helpful to understand that in the case of utility functions with indexical components, two deterministic and non-pseudorandomizing robots may have different utility functions (Wei Dai's definition) even if they are exactly the same as each other in code and physical construction, and are just e.g. placed so one is facing the other.
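To make the "indexical components" point concrete, here is a toy sketch (the world model and names are purely illustrative, not any standard formalism): two byte-identical robots evaluate an indexical utility function differently because it refers back to whoever is running it, whereas a non-indexical utility function is the same mapping for both.

```python
# Two robots running byte-identical code, differing only in where they stand.

from dataclasses import dataclass

@dataclass(frozen=True)
class World:
    # resources[i] = resources held at position i
    resources: tuple

def indexical_utility(world: World, my_position: int) -> float:
    """'I value the resources *I* hold': refers back to the agent running it."""
    return world.resources[my_position]

def non_indexical_utility(world: World) -> float:
    """'I value total resources in the world': no reference to the evaluator."""
    return sum(world.resources)

w1 = World(resources=(10, 0))
w2 = World(resources=(0, 10))

# Identical code, different positions: the indexical agents rank w1 and w2
# oppositely, so in Wei Dai's sense they have *different* utility functions.
robot_a, robot_b = 0, 1
assert indexical_utility(w1, robot_a) > indexical_utility(w2, robot_a)
assert indexical_utility(w1, robot_b) < indexical_utility(w2, robot_b)

# The non-indexical utility function is the same mapping for both robots.
assert non_indexical_utility(w1) == non_indexical_utility(w2)
```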
Yes, as per my other reply to Wei I was just using the dictatorship example as the closest real-world example. Even when agents have the same utility function, with no indexical components, I claim that they might still face a trust bottleneck, because they have different beliefs. In particular they might not be certain that the other agent has the same values as they do, and therefore all else equal would prefer to have resources themselves rather than giving them to the other agent, which then recreates a bunch of standard problems like prisoner's dilemmas. (This is related to the epistemic prisoner's dilemma, but more general.)
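A minimal sketch of how that uncertainty recreates a prisoner's-dilemma-like structure, with assumed illustrative numbers (the credence p and efficiency multiplier m below are placeholders of mine):

```python
# Two agents in fact share the same (non-indexical) values, but each only
# assigns credence p to that. Transferring a resource lets the recipient use it
# more efficiently (multiplier m), but from the giver's perspective it only
# pays off if the recipient really does share its values.

p = 0.6   # each agent's credence that the other shares its values
m = 1.5   # efficiency gain from giving the resource to the other agent
v = 1.0   # value of the resource if used by a value-sharing agent

expected_value_of_transferring = p * m * v   # pays off only if values are in fact shared
value_of_keeping = v                         # I can always use it for my own values

# With these numbers each agent prefers to keep its resources (0.9 < 1.0), even
# though, since they do share values, mutual transfer would be better for both:
# a prisoner's-dilemma-like structure created purely by uncertainty.
assert expected_value_of_transferring < value_of_keeping
```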
Distributed thinking:
- The world will fill itself with moral agents. Contribute to the healthiest/wisest.
- Don't grab resources you don't know how to use. Cultivate one's garden.
- Avoiding xrisk and creating flourishing futures imply very similar strategies.
- (Analogy: "don't lie" and "don't kill" are good heuristics for almost any goals.)
I think you are inserting a lot of ought into this at this point.
From the writing it sounds like you are describing a world where there are a bunch of these decentralized agents sharing the world peacefully. You claim that people want to create centralized agents; I think it is not so much that people want to create a centralized agent as that a single centralized agent is a stable equilibrium in a way that a multipolar world is not.
You are right that we are starting out in a decentralized, multipolar AI world right now, but this will end once an AI is capable of stopping other AIs from progressing: obviously you could not allow another AI that is not aligned with you to become more powerful than you, even if you were human-aligned. And if there were another AI around the same capability level at the same time, you would obviously collaborate with it in some way to stop other AIs from progressing.
Having dozens of AIs continuously racing up through RSI toward superintelligence is simply not a stable world that will continue; obviously you'd fight for resources. There aren't any solar systems with 5 different suns orbiting each other.
While I do think there are many reasons pluralism isn't stable (increasingly unstable as information technology advances), and there might not ever meaningfully be pluralism under AGI at all (e.g. there will probably be many agents working in parallel, but those agents might basically share goals and also be subject to very strong oversight in ways that humans often pretend to be but never have been), which I'd like to see Ngo acknowledge, the period of instability is fairly likely to be the period during which the constitution of the later stage of stability is written, so it's important that some of us try to understand it.
Much of my thinking over the last year has focused on understanding the concept of "distributed agents", as opposed to the "centralized agents" that the existing paradigm of expected utility maximization describes. One way of describing the difference is in terms of how autonomous their subagents are. Another is that centralized agents are more efficient (as sometimes formalized by the notion of "coherence"), while distributed agents are more robust.
Unfortunately robustness is hard to formalize, since it requires that you perform well even in unpredicted (and sometimes unpredictable) situations. I give some tentative characterizations of distributed agents below, but there's still a lot of work to be done to formally define distributed agents. And ultimately I'd like to go further, to understand how agents can have both properties—which is roughly what I mean by "coalitional agency".
I gave a talk on the distinction about seven months ago. I'd been hoping to write up the main ideas at more length, but since that doesn't look like it'll happen any time soon, I'm sharing the slides below. Hopefully they're reasonably comprehensible by themselves, but feel free to ask questions about any parts that are unclear.
See more on my interpretation of Yudkowsky here; note that he disagrees with my emphasis on compression though (as per the exchange in the comments).
My post on why I'm not a bayesian also gives a sense of what understanding epistemology in more distributed terms looks like.
"Will's very rough first pass" is a reference to the passage in his textbook on utilitarianism where Will MacAskill describes what decision procedure a utilitarian should follow. My point here is to contrast how much thought he (and other utilitarians) put into finding criteria of rightness, vs how rudimentary their thinking about decision procedures is.