Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(This is a stylized version of a real conversation, where the first part happened as part of a public debate between John Wentworth and Eliezer Yudkowsky, and the second part happened between John and me over the following morning. The below is combined, stylized, and written in my own voice throughout. The specific concrete examples in John's part of the dialog were produced by me. It's over a year old. Sorry for the lag.)

(As to whether John agrees with this dialog, he said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment.)

 

J: It seems to me that the field of alignment doesn't understand the most basic theory of agents, and is missing obvious insights when it comes to modeling the sorts of systems they purport to study.

N: Do tell. (I'm personally sympathetic to claims of the form "none of you idiots have any idea wtf you're doing", and am quite open to the hypothesis that I've been an idiot in this regard.)

J: Consider the coherence theorems that say that if you can't pump resources out of a system, then it's acting agent-like.

N: I'd qualify "agent-like with respect to you", if I used the word 'agent' at all (which I mostly wouldn't), and would caveat that there are a few additional subtleties, but sure.

J: Some of those subtleties are important! In particular: there's a gap between systems that you can't pump resources out of, and systems that have a utility function. The bridge across that gap is an additional assumption that the system won't pass up certain gains (in a specific sense).

Roughly: if you won't accept 1 pepper for 1 mushroom, then you should accept 2 mushrooms for 1 pepper, because a system that accepts both of those trades winds up with strictly more resources than a system that rejects both (by 1 mushroom), and you should be able to do at least that well.
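As a minimal sketch of that bookkeeping (the trade encoding and the helper name `net_change` are illustrative assumptions, not anything from the conversation), one can tally what a system ends up with if it accepts both trades versus neither:

```python
# Each trade is written as a change in holdings from the system's point of view.
# Trade A: receive 1 pepper, give up 1 mushroom.
# Trade B: receive 2 mushrooms, give up 1 pepper.
TRADE_A = {"pepper": +1, "mushroom": -1}
TRADE_B = {"pepper": -1, "mushroom": +2}


def net_change(*trades):
    """Sum the per-good changes across all accepted trades."""
    total = {}
    for trade in trades:
        for good, delta in trade.items():
            total[good] = total.get(good, 0) + delta
    return total


accept_both = net_change(TRADE_A, TRADE_B)  # peppers cancel, one mushroom is left over
reject_both = net_change()                  # nothing changes

print("accept both:", accept_both)
print("reject both:", reject_both)
```

The point of the tally is only that the peppers cancel, so accepting both trades leaves the system one mushroom ahead of rejecting both.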

N: I agree.

J: But some of the epistemically efficient systems around us violate this property.

For instance, consider a market for (at least) two goods: peppers and mushrooms; with (at least) two participants: Alice and Bob. Suppose Alice's utility is $u_A(p_A, m_A) = \log p_A + 2 \log m_A$ (where $p_A$ and $m_A$ are the quantities of peppers and mushrooms owned by Alice, respectively), and Bob's utility is $u_B(p_B, m_B) = \log p_B + \tfrac{1}{2} \log m_B$ (where $p_B$ and $m_B$ are the quantities of peppers and mushrooms owned by Bob, respectively).

Example equilibrium: the price is 3 peppers for 1 mushroom. Alice doesn't trade at this price when she has $3 \, \partial u_A / \partial p_A = \partial u_A / \partial m_A$, i.e. $3 / p_A = 2 / m_A$, i.e. $p_A = 1.5 \, m_A$ (using the fact that $\frac{d}{dx} \log x = \frac{1}{x}$), i.e. when Alice has 1.5 times as many peppers as she has mushrooms. Bob doesn't trade at this price when he has 6 times as many peppers as mushrooms, by a similar argument. So these prices can be an equilibrium whenever Alice has 1.5x as many peppers as mushrooms, and Bob has 6x as many peppers as mushrooms (regardless of the absolute quantities).
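The no-trade ratios can be checked numerically. The sketch below assumes log utilities of the form u(p, m) = log(p) + w * log(m), with w = 2 for Alice and w = 0.5 for Bob; these weights are an assumption chosen to reproduce the stated 1.5x and 6x ratios, and the function names are hypothetical.

```python
PRICE = 3.0  # peppers per mushroom, as in the example equilibrium


def no_trade_ratio(mushroom_weight, price=PRICE):
    """Peppers-per-mushroom holding ratio at which the agent is indifferent.

    Assumes u(p, m) = log(p) + w * log(m). Indifference at the margin means
    price * du/dp == du/dm, i.e. price / p == w / m, so p / m == price / w.
    """
    return price / mushroom_weight


def marginal_gain_from_buying_mushroom(p, m, mushroom_weight, price=PRICE):
    """First-order utility change from paying `price` peppers for one mushroom."""
    return mushroom_weight / m - price / p


alice_ratio = no_trade_ratio(2.0)  # assumed weight for Alice
bob_ratio = no_trade_ratio(0.5)    # assumed weight for Bob

# Holdings taken from the dialogue: Alice with 30,000 peppers and 20,000
# mushrooms, Bob with 6 peppers and 1 mushroom.
alice_margin = marginal_gain_from_buying_mushroom(30_000, 20_000, 2.0)
bob_margin = marginal_gain_from_buying_mushroom(6, 1, 0.5)

print(alice_ratio, bob_ratio, alice_margin, bob_margin)
```

This only checks the marginal (small-trade) condition at the market price; it says nothing about how either participant evaluates a large trade at a different price.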

Now consider offering the market a trade of 25,000 peppers for 10,000 mushrooms. If Alice has 20,000 mushrooms (and thus 30,000 peppers), and Bob has only 1 mushroom (and thus 6 peppers), then the trade is essentially up to Alice. She'd observe that

so she (and thus, the market as a whole) would accept. But if Bob had 20,000 mushrooms (and thus 120,000 peppers), and Alice had only 2 mushrooms (and thus 3 peppers), then the trade is essentially up to Bob. He'd observe

so he wouldn't take the trade.[1]

Thus, we can see that whether a market — considered altogether — takes a trade, depends not only on the prices in the market (which you might have thought of as a sort of epistemic state, and that you might have noted was epistemically efficient with respect to you), but also on the hidden internal state of the market.

N: Sure. The argument was never "every epistemically efficient (wrt you) system is an optimizer", but rather "sufficiently good optimizers are epistemically efficient (wrt you)".

J: You might be missing the point here. You can appeal to "you can't pump money out of the system" to get a type of weak efficiency, but you MIRI folk seem to think that "can't pump money out" arguments also imply a form of strong efficiency, that things like markets lack.

N: I agree that "you can't pump money out" does not suffice for a utility function, and that you need an additional "it doesn't pass up free money" constraint to bridge the gap. And I concede that I sometimes use "you can't pump money out of it" as a pointer to a larger cluster of criteria, and sometimes in a way that elides a real distinction, and sometimes because I haven't been tracking that distinction carefully in my own head.

...for the record, though, I expect AI systems that are capable of ending the acute risk period, to not pass up on free valuables. So I admit I'm not sure why you're focused on this distinction as important. Do you think we have a disagreement beyond the point about me sometimes playing fast-and-loose with coherence constraints?

J: I still suspect you're missing the point. In real life, systems that control valuable resources tend to have the property that you can't pump resources out of them, for the obvious competitive reason. But there's no similarly compelling reason for things to avoid having path-dependence in their preferences, as you'd find in a market.

N: Hold up. The obvious reason for optimizing systems to avoid path-dependent preferences is so that they avoid passing up certain gains. A property I expect a market made of sufficiently competent participants to possess.

J: In which case, we have a disagreement, yeah. Which... well, I proved my point a few paragraphs ago, and you seemed to agree? So I admit I'm confused by your position.

To refine my own position: aggregate systems of agents do not in general act like an agent. That's what I've been trying to say, here — the condition that you can expect lots of systems to possess in real life is a weak ("no money pump") efficiency property. The strong efficiency property ("takes certain gains") is much rarer, and is lost the moment you start aggregating agents.

(Indeed, once you notice that you shouldn't think of a market as an agent in the VNM sense, you're only one step away from the conclusion that you shouldn't think of the constituent parts as agents either. When you look closely, even the market participants are probably better modeled as weakly-efficient market-like systems rather than as agents.)

N: Woah, hold your horses there. Optimizing systems that are epistemically and instrumentally efficient wrt you (which I suppose I could suffer calling 'agents' in this context) totally aggregate into other agents.

J: ...have you failed to internalize my argument from above?

N: Nobody ever said that a collection of participants $P$, each with preferences over a set $G$ of goods, has to aggregate into an agent that also has preferences over $G$. The obvious guess is that the market aggregates into something more like an agent with preferences over functions from $P$ to $G$, i.e., over different assignments of goods to each participant in the market.

Like, it's completely respectable for the market to value "Alice gains 25,000 peppers but loses 10,000 mushrooms" differently from "Bob gains 25,000 peppers but loses 10,000 mushrooms". It will look like it has a bunch of hidden state about who has which resources when you consider only the resource totals, but that's just an artifact of looking at the wrong outcome space.

(Aside: I have a better sense now of why I might want some sort of "epistemic efficiency implies instrumental efficiency" argument: intuitively, this seems like the sort of thing you might want when looking at optimizers that are themselves aggregates of smaller optimizers. Which is an update for me; thanks. More specifically, I was not tracking the way that markets should look instrumentally coherent, and my "epistemic efficiency isn't necessarily supposed to imply instrumental efficiency" argument now seems off-base.)

J: OK, yes, you can think of the market as having preferences over assignments of goods to participants, but then you still lose agency. For example, consider the trades (in that extended outcome space) "Alice loses 1 pepper, Bob gains 2 mushrooms", and the trade "Alice gains 3 peppers, Bob loses 2 mushrooms". The market won't take either of these trades, because Alice isn't willing to take the former, and Bob isn't willing to take the latter. And this is exactly an instance of an aggregate system that satisfies weak efficiency (it has no preference cycles / you can't pump money out of it), but violates strong efficiency (the order-dimension of its preference graph is not 1 / it passes up certain gains). Aggregates of agents aren't agents!

N: Oh, yeah, I flatly deny that.

J: It's... a theorem?

N: It might be a theorem in Earth!economics, about Earth!economics!rational agents who use broken decision theories. It's false in real life, and when aggregating agents smart enough to use better decision theories.

J: I'm not sure how you expect to swing that. Utility functions are defined only up to affine transformations, so you can't just say "do whatever leads to the higher aggregate utility". I don't see how you'd break the symmetry between different offers, while respecting the invariance of utility functions up to affine transformations.

N: Sure, I'm not saying you can figure out how to aggregate a bunch of agents into a superagent by looking at their utility functions alone! You need to take some other structure about the agents into account, such as information about what each agent thinks is fair.[2] (See, eg, the relevant dath ilani papers.)

J: Ah. Hmm.

N: To spell it out more precisely: what happens in real life is that Alice and Bob accept both trades, and then Alice gives Bob a pepper, and now they've achieved the certain gain of +1 pepper apiece. Or, more generally, they do something that is at least that good for both of them within the constraints of what they agree is fair — because why would either settle for a strategy that does worse than that? In real life, aggregate agents don't pass up certain gains of valuable resources, because they value the resources.
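A rough sketch of the bookkeeping behind this reply, under simplifying assumptions that are mine rather than the dialogue's: more of any good is better, and a participant vetoes any package in which they only lose goods. The helper names are hypothetical.

```python
# Changes in holdings, keyed by (participant, good).
TRADE_1 = {("Alice", "pepper"): -1, ("Bob", "mushroom"): +2}
TRADE_2 = {("Alice", "pepper"): +3, ("Bob", "mushroom"): -2}
# Side payment proposed in the dialogue: Alice hands Bob one pepper.
SIDE_PAYMENT = {("Alice", "pepper"): -1, ("Bob", "pepper"): +1}


def combine(*changes):
    """Add up holdings changes, dropping entries that net to zero."""
    total = {}
    for change in changes:
        for key, delta in change.items():
            total[key] = total.get(key, 0) + delta
    return {key: delta for key, delta in total.items() if delta != 0}


def vetoing_participants(change):
    """Participants who only lose goods under `change` (assumed to veto it)."""
    by_agent = {}
    for (agent, _good), delta in change.items():
        by_agent.setdefault(agent, []).append(delta)
    return sorted(
        agent for agent, deltas in by_agent.items()
        if max(deltas) <= 0 and min(deltas) < 0
    )


package = combine(TRADE_1, TRADE_2, SIDE_PAYMENT)

print(vetoing_participants(TRADE_1))  # who blocks the first trade on its own
print(vetoing_participants(TRADE_2))  # who blocks the second trade on its own
print(package, vetoing_participants(package))
```

Taken one at a time, each trade has a participant who only loses. Taken as a package with the side payment, the mushrooms cancel and each participant nets one pepper, so under these assumptions nobody has a reason to veto.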

J: I... see.

N: ...

J: OK, well, note that this assumes that side-channel trades can occur, at prices other than the market prices.

N: Yeah. Side channels that the agents would fight to establish, so that they can take advantage of certain gains.

Although, of course, logical decision theorists don't need to be able to make side-trades to accept such bets, and they'll keep taking advantage of certain gains even if you forbid such trades. Like, if Alice and Bob have common knowledge that the market is either going to be offered the trade "Alice gains $1,000,000; Bob loses $1" or the trade "Alice loses $1; Bob gains $1,000,000", with equal probability of each, and they're not allowed to trade between themselves, then they can (and will, if they're smart) simply agree to accept whichever trade they're presented (because this joint strategy makes them both significantly richer in expectation).
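The expected-value arithmetic behind that claim can be sketched as follows. The two acceptance policies (`unanimous_myopic`, `accept_everything`) are illustrative names for "each participant vetoes any trade that costs them" and "both agree in advance to accept whatever is offered"; they are assumptions for the sketch, not terms from the dialogue.

```python
# The two equally likely offers, as dollar changes per participant.
OFFERS = [
    (0.5, {"Alice": 1_000_000, "Bob": -1}),
    (0.5, {"Alice": -1, "Bob": 1_000_000}),
]


def expected_payoffs(accepts):
    """Expected dollar change per participant under an acceptance policy."""
    totals = {"Alice": 0.0, "Bob": 0.0}
    for probability, payoff in OFFERS:
        if accepts(payoff):
            for agent, amount in payoff.items():
                totals[agent] += probability * amount
    return totals


def unanimous_myopic(payoff):
    """Accept only if no participant loses money on this particular offer."""
    return all(amount >= 0 for amount in payoff.values())


def accept_everything(payoff):
    """Joint policy agreed in advance: accept whichever offer shows up."""
    return True


myopic = expected_payoffs(unanimous_myopic)
joint = expected_payoffs(accept_everything)

print("myopic veto policy:", myopic)
print("accept-everything policy:", joint)
```

Under the veto policy both offers are blocked and nobody gains anything; under the joint policy each participant expects 0.5 * 1,000,000 - 0.5 * 1 = 499,999.50 dollars.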

Again, these are smart folk who value resources. You can argue all you want about how they shouldn't be able to get the extra money, but don't count on those arguments holding up.

Like, there's a basic mental technique here of asking "if the participants in the market were all actually as smart as me and deeply driven to get more goods, could they somehow find some way to wind up richer?" Argue as you may, the aggregate markets of dath ilan still won't pass up certain gains of valuable resources that you can easily point out.

J: This... is something of an update for me, I admit.

N: ...

J: Although, I note that, for all your high-falutin arguments about dath ilan, none of them are going to convince the pancreas and the thalamus to start making side-channel ATP trades, at price-analogs that differ from the equilibrium.

N: Indeed. It's no coincidence that artificial intelligence is an X-risk, and pancreases are not.

J: Ah, but the pancreas/thalamus interaction can tell us something about intelligence. As can the study of the energy market (of sorts) inside a bacterium. As can Earth's best financial markets, populated as they are by poorly-coordinated constituents who might not decide to use logical decision theory even if they knew they had the option. Agents might be normative, but descriptively, the world is full of systems that you merely can't pump money out of.

N: OK, sure. But that's a vastly weaker claim. I happily endorse the descriptive claim "lots of modern optimizers are even worse at taking certain gains, than they are at avoiding money-pumps."

(I don't see this as particularly relevant to alignment research, but I believe it.)

J: I see it as quite relevant to alignment research! I'm hoping to learn quite a bit from bacteria and markets, that generalizes to humans and artificial intelligences.

N: That sounds like a disagreement for another day. As for today, I'll settle for you retreating from the position that I'm lacking a basic understanding of the objects I wish to study, and from the position that intelligent agents don't aggregate into an agent (and, relatedly, from the position that 'markets' beat 'agents' as a model of capable optimizers).

J: I'll just point out again that, if you want to learn what human values are, you need a descriptive theory of humans rather than a normative theory of intelligence, and I still bet that weak-efficiency is a better descriptive model of most humans than agenthood.

  1. ^

    This example (of two cases where a market's decision about a trade differs depending on hidden state) relies on the initial wealth distributions being unequal. Legends hold that there are other examples where the hidden state doesn't depend on initial differences, if the utilities aren't logarithmic. John Wentworth tells me he cares in practice about this additional fact, and notes that further information can be found in the literature under the heading of "non-existence of representative agents". I have not myself constructed such an example, and would be interested if someone has a simple one.

  2. ^

    Hopefully there are canonical solutions. For instance, in an ultimatum game, the Schelling fair point is that both participants get utility halfway between their best and worst deals, which solution is invariant under affine transformation. Knowing that agents are willing to accept these canonical solutions as 'fair' does not seem like a large additional burden of knowledge.
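To illustrate the invariance claim in footnote 2, here is a small sketch. The setup (splitting 10 coins, linear utilities, and the helper names) is hypothetical, and picking the deal "closest to halfway for everyone" is just one way to operationalize the footnote's description.

```python
def normalize(utility, deals):
    """Rescale utilities so the worst deal maps to 0 and the best to 1."""
    values = {deal: utility(deal) for deal in deals}
    worst, best = min(values.values()), max(values.values())
    return {deal: (value - worst) / (best - worst) for deal, value in values.items()}


def halfway_deal(utilities, deals):
    """Deal whose normalized utility is closest to 1/2 for every participant."""
    normalized = [normalize(u, deals) for u in utilities]
    return min(deals, key=lambda deal: max(abs(n[deal] - 0.5) for n in normalized))


DEALS = list(range(11))  # coins (out of 10) that go to the proposer


def proposer(share):
    return share


def responder(share):
    return 10 - share


def proposer_rescaled(share):
    # Positive affine transformation of the proposer's utility.
    return 3 * proposer(share) + 7


fair = halfway_deal([proposer, responder], DEALS)
fair_rescaled = halfway_deal([proposer_rescaled, responder], DEALS)

print(fair, fair_rescaled)
```

Because the normalization subtracts the worst value and divides by the best-minus-worst range, any positive affine rescaling of a participant's utility cancels out, and the selected deal does not change.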

22 comments
[-] Algon (1y)

I really want to see a debate between Eliezer/Nate and Scott Garrabrant on the necessity of utility functions. More generally, I would like to know what the cruxes between MIRI researchers are on various topics in alignment.

IMO, coordination difficulties among sub-agents can't be waved away so easily. The solutions named, side-channel trades and counterfactual coordination, are both limited.

I would frame the nature of their limits, loosely, like this. In real minds (or at least the human ones we are familiar with), the stuff we care about lives in a high-dimensional space. A mind could be said to be, roughly, a network spanning such a space. A trade between elements (~sub-agents) that are nearby in this space will not be too hard to do directly. But for long-distance trades, side-channel reward will need to flow through a series of intermediaries -- this might involve several changes of local currencies (including traded favors or promises). Each local exchange needs to be worthwhile to its participants, and not overload the relationships that it's piggybacking on.

These long-distance trades can be really difficult to set up sometimes, in the same way that it would be hard for a random villager in France in the Middle Ages to send $10 to another random villager in China.

The difficulty depends on things like the size / dimensionality of the space; how well-connected it is; and how much slack is available in the relevant places in the system (for the intermediate elements to wiggle around enough to make all the local trades possible). Note that the need for slack makes this a holistic constraint: if you just have one really important trade to make, then sure, you can probably make it happen, by using up lots of slack (locking a lot of intermediate elements into orientations optimized for that big trade). But you can't do that for every possible trade. So these issues really show up when you have a lot of heterogeneous trades to make.

Counterfactual ("logical") coordination has similar issues. If A and B want to counterfactually coordinate, but they're far apart in this mind-space, then they can only communicate or understand one another in a limited way, via intermediaries (or via the small # of dimensions they do share). This just makes things harder -- hard to get shared meaning, hard to agree on what's fair, hard to find a solution together that will generalize well instead of being brittle.

BTW, I'm not denying that intelligence (whatever that might mean) helps with all this, but I am denying that it's a panacea.

It seems to me that this is basically solved by "you put probability distributions over all things that you don't actually know and may have disagreement about"

This is for logical coordination? How does it help you with that?

Like, it helps everywhere there is uncertainty? Imagine a problem: "You are in a Prisoner's dilemma with such and such payoffs; find the optimal strategy if the distribution of your possible opponents is 25% CooperateBots, 33% DefectBots, and 42% those who actually know decision theory".

I still don't know exactly what parts of my comment you're responding to. Maybe talking about a concrete sub-agent coordination problem would help ground this more.

But as a general response: in your example it sounds like you already have the problem very well narrowed down, to 3 possibilities with precise probabilities. What if there were 10^100 possibilities instead? Or uncertainty where the full real thing is not contained in the hypothesis space?

Loosely related to this, it would be nice to know if systems which reliably don't turn down 'free money' must necessarily have almost-magical levels of internal coordination or centralization. If the only things which can't (be tricked into) turn(ing) down free money when the next T seconds of trade offers are known are Matrioshka brains at most T light-seconds wide, does that tell us anything useful about the limits of that facet of dominance as a measure of agency?

Could someone help me collect the relevant literature here?

I think the complete class theorems are relevant: https://www.lesswrong.com/posts/sZuw6SGfmZHvcAAEP/complete-class-consequentialist-foundations

The Non-Existence of Representative Agents: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3302656

Representative Agents: https://en.wikipedia.org/wiki/Representative_agent

John Wentworth on Subagents: https://www.lesswrong.com/posts/3xF66BNSC5caZuKyC/why-subagents

Ty for post. Just for reference, does John endorse this summary?

[-] So8res (1y)

John said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment. (I added this note to the top of the post as a parenthetical; thanks.)

More details:

  • I think the argument Nate gave is at least correct for markets of relatively-highly-intelligent agents, and that was a big update for me (thank you Nate!). I'm still unsure how far it generalizes to relatively less powerful agents.
  • Nate left out my other big takeaway: Nate's argument here implies that there's probably a lot of money to be made in real-world markets! In practice, it would probably look like an insurance-like contract, by which two traders would commit to the "side-channel trades at non-market prices" required to make them aggregate into an expected utility maximizer. (Obviously the contract wouldn't be phrased in those terms; most of the work of implementation would be to figure out what trades need to occur in practice under what conditions to achieve aggregability, and then figuring out simple approximations of those conditions to write into contracts.)
  • In the year since this discussion, I've also understood better why Nate seems to care mostly about the principles of relatively-highly-intelligent agents, as opposed to e.g. humans. I think that crux was mostly about corrigibility as an alignment target, and I have updated substantially toward that position as well.
  • My main remaining disagreement, for purposes of applying this argument to superhuman AI, is that an intelligence which originally develops as a market of relatively-weak agents does not obviously choose to self-modify in the way described in the post, in the process of becoming more intelligent. It's not clear that the component weak subagents themselves "upgrade".
    • Analogy for humans: insofar as human values are well thought of as a "market of weak subagents", it's not clear to me that making the individual subagents more capable (to the point where they make the sort of trades required by Nate's argument) is actually the way I'd prefer to upgrade myself. I'm not convinced that that would actually be the right way to reflectively extend my extant values.

Real markets mostly have it covered, because they have something close to [aggregated] utilons -- money, and so exchanges between 2 different goods rarely take place.

Also, any business can be seen as a "side-channel trade" -- the market value of one individual's time is often lower than the value they can produce in cooperation with others.

How do I test whether I actually understand the sort of thing Nate describes and not just consider it obvious in hindsight? (I feel like I was able to explain it to people before reading this post)

Curated. I love a good dialog, one where two parties are responding not to the modeled objector, but actually another person who gets to speak back from their actual models. And as N says, he's sympathetic to people saying "none of you idiots have any idea wtf you're doing". In this case, N might have known, but I'd be keen to see the day N, J, or co. has their ignorance revealed.

This recent tweet of Eliezer's crystallized a concept for me which I think is relevant to the concepts of optimization and agents discussed in the dialogue: https://twitter.com/ESYudkowsky/status/1639406023680344064

In complicated systems in real life, the thing that is better at "preimaging outcomes onto choices" is the scary one, and the interesting / complicated systems are the ones where the choosing algorithm is complex.

Sure, it's true that you can construct toy systems in restricted domains (like the mushrooms and peppers one) and define "agents" in these systems which technically violate certain efficiency assumptions.

But the reason these examples aren't compelling (to me) is that it's kind of obvious what all the agents in them will do, once you write down their utility functions and the starting resources available to them. There's not much complexity "left over" for interesting decision algorithms.

Two of the real-world examples in this dialogue actually demonstrate the difference between these kinds of systems nicely:

I could not step into the shoes of a successful hedge fund trader, and, given all the same choices and resources available to the trader, make decisions which result in more money in my trading account than the original trader could.

OTOH, if I were some kind of ghost-in-the-machine of a bacterium making ATP, I could (probably) make the same (or better, in cases where that's possible) decisions that the actual bacterium is making, given all the same information and choices available to it. (Though I might need a computer to keep track of all the hormones and blood-glucose levels and feedback loops.)

I can see how both examples might tell us something useful about intelligent systems, but the markets example seems more likely to have something to say about what the actual scary thing looks like.


This discussion sheds light on an important consideration in AI: the loss or mutation of agency when aggregating systems or agents. It’s vital to remember that optimizers aren’t always constructed to fit alongside other optimizers neatly, and harmony can involve sacrifices. While John’s point about weak efficiency is well noted, finding a balance between descriptive and prescriptive models is essential. I appreciate the counterargument’s reasoning that capable aggregates of optimizers don't pass up certain gains. Energy is a zero-sum currency. When efficiencies are revealed, smart agents will find ways to fight for them.

>if you won't accept 1 pepper for 1 mushroom, then you should accept 2 mushrooms for 1 pepper

You need a bunch more assumptions for this to hold, and I would like to know what they are. For example: If I don't have or want any mushrooms, and nobody I know wants mushrooms, then I can't accept 1 pepper for 1 mushroom because I can't pay the mushroom. But it still doesn't make any sense for me to accept two mushrooms for one pepper either because I don't have any use for two mushrooms. To get intuition about this, replace "mushroom" with something that is both useless and unavailable, such as a pound of neutrinos in a box. There's no way to get the neutrinos into the box, and even if you had them in the box, they would leave the box instantly and still be useless.

In general, there is a tendency for people to use alleged theorems without checking the premises. You can get surprising outcomes when the premises don't hold.

the order-dimension of its preference graph is not 1 / it passes up certain gains

If the order dimension is 1, then the graph is a total order, right? Why the conceptual detour here?

Am I correct that "knowing what system thinks is fair" is equivalent to "knowing under which bargaining solution system acts"?

Which is closer to Nate’s position: a) competition leads to highly instrumentally efficient AIs or b) inductive biases lead to highly instrumentally efficient AIs?