Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Consider two prototypical “agents”: a human, and a company.

The human is relatively centralized and monolithic. As a rough approximation, every 100 ms or so observations flow into the brain from the eyes, ears, etc. This raw input data updates the brain’s world-model, and then decisions flow out, e.g. muscle movements. This is exactly the sort of “state-update model” which Against Time In Agent Models criticized: observations update one central internal state at each timestep, and all decisions are made based on that central state. It’s not even all that accurate a model for a human, but let’s set that aside for now and contrast it to a more obviously decentralized example.

In a company, knowledge and decisions are distributed. A cashier sees and hears customers in the store, and interacts with them in order to sell things. Meanwhile, a marketing editor tweaks some ad copy. Each mostly makes decisions based on their local information; most of that local information is not propagated to other decision-makers. Observations don’t update a single centralized state which informs all decisions. Instead, different decisions have different input information from different sources.

In Optimization at a Distance, I suggested a mental picture of agents kinda like this:

Note that this particular drawing doesn’t have any inputs to the optimizer (i.e. observations) for simplicity, but it’s easy to add inputs. The optimizer need not be strictly causally upstream of the target; it could have interactions back-and-forth with the target.

It’s like a phased array: there’s lots of little actions distributed over space/time, all controlled in such a way that their influence can add up coherently and propagate over a long distance to optimize some far-away target. Optimization at a Distance mainly emphasized the “height” of this picture, i.e. the distance between optimizer and target. This post is instead about the “width”: not only are the actions far from the optimization target, the actions themselves are also distributed in spacetime and potentially far apart from each other.

Contrast: Bayesian Updates

Suppose I want to watch my favorite movie, 10 Things I Hate About You, in the evening. To make this happen, I do some optimization - I steer myself-in-the-evening and my-immediate-environment-in-the-evening into the relatively small set of states in which I’m watching the movie. Via the argument in Utility Maximization = Description Length Minimization, we should expect that I approximately-act-as-though I’m a Bayesian reasoner maximizing some expected utility over myself-in-the-evening and my-immediate-environment-in-the-evening. (Note that it’s a utility function over myself-in-the-evening and my-immediate-environment-in-the-evening, not just any old random utility function; something like e.g. a rock would not be well-described by such a utility function.)

While arranging my evening, I may perform some Bayesian updates. Maybe I learn that the movie is not available on Netflix, so I ask a friend if they have a copy, then check Amazon when they don’t. This process is reasonably well-characterized as me having a centralized model of the places I might find the movie, and then Bayes-updating that model each time I learn another place where I can/can’t find it. (If I had checked Netflix, then asked my friend, then checked Netflix again because I forgot whether it was on Netflix, that would not be well-modeled as Bayesian updates.)

By contrast, imagine that some friends and I are arranging to watch 10 Things I Hate About You in the evening. I check to see if the movie is on Netflix, and at the same time my friend checks their parents’ pile of DVDs. My friend doesn’t find it in their parents’ DVD pile, and doesn’t know I already checked Netflix, so they also check Netflix. My friends and I, as a system, are not well-modeled as Bayesian updates to a single central knowledge-state; otherwise we wouldn’t check Netflix twice. And yet, it’s not obviously suboptimal (like me forgetting whether the movie is on Netflix would be). If there’s a lag in communication between us, it may just be faster and easier for us to both check Netflix independently, and then both check other sources independently if the movie isn’t there. We’re acting independently to optimize the same goal; our actions are chosen “locally” on the basis of whatever information is available, not necessarily based on a single unified knowledge-state.

So, we don’t really have “Bayesian updates” in the usual sense. And yet… we’re still steering the world into a relatively narrow set of states, the argument in Utility Maximization = Description Length Minimization still applies just fine, and that argument is still an essentially Bayesian argument. It’s still using a Bayesian distribution - i.e. a distribution which is ultimately part of a model, not necessarily a fundamental feature of the territory. It’s still about maximizing expected utility under that distribution. My friends and I, as a system, are still well modeled as a “Bayesian agent” in some sense. Just… not a monolithic Bayesian agent. We’re a distributed Bayesian agent, one in which different parts have different information.

Conditioning

Conditional probabilities do still enter the picture, just not as updates to a centralized world-state.

In the movie example, when I’m searching for the movie in various places, how do I steer the world into the state of us-watching-the-movie-in-the-evening? How do I maximize $E[u(X)]$, jointly with my friends? Well, I act on the information I have, plus my priors about e.g. what information my friends will have and how they will act. If I have information $X_i$ (e.g. I know that the movie isn’t on Netflix, and know nothing else relevant other than priors) when making a particular decision, then I act to maximize $E[u(X) \mid X_i]$.

Why that particular mathematical form? Well, our shared optimization objective $E[u(X)]$ is a sum over worlds $X$:

$$E[u(X)] = \sum_X P[X] \, u(X)$$

If I know that e.g. the movie is not on Netflix, then I know my current action won’t impact any of the worlds where the movie is on Netflix. So I can ignore those worlds while making the current decision, and just sum over all the worlds in which the movie is not on Netflix. My new sum is $\sum_{X : \text{movie not on Netflix}} P[X] \, u(X)$, which becomes $E[u(X) \mid \text{movie not on Netflix}]$ after normalizing the probabilities. (Normalizing doesn’t change the optimal action, so we can do that “for free”.) By ignoring all the worlds I’m not in (based on the input information to the current decision), and taking the expectation over the rest, I’m effectively maximizing expected utility conditional on the information I have when making the decision.
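To make the "normalizing doesn't change the optimal action" step concrete, here is a minimal sketch of the argument on a toy version of the movie example. The worlds, prior probabilities, action names and payoffs are all made up for illustration; the only point is that the restricted sum over worlds and the conditional expectation differ by a positive constant, so they pick out the same action.

```python
# Hypothetical toy worlds: (on_netflix, friend_has_copy), with an assumed prior.
prior = {
    (True, True): 0.2,
    (True, False): 0.3,
    (False, True): 0.1,
    (False, False): 0.4,
}

ACTIONS = ["open_netflix", "ask_friend", "rent_online"]

def utility(world, action):
    """Assumed payoff: 1 if a free source has the movie, 0.5 for a paid
    rental (which always works in this toy model), 0 otherwise."""
    on_netflix, friend_has_copy = world
    if action == "open_netflix":
        return 1.0 if on_netflix else 0.0
    if action == "ask_friend":
        return 1.0 if friend_has_copy else 0.0
    return 0.5

def restricted_sum(action, consistent):
    """Sum of P[world] * u over only the worlds consistent with my information."""
    return sum(p * utility(w, action) for w, p in prior.items() if consistent(w))

def conditional_expectation(action, consistent):
    """E[u | information]: the same sum, divided by P[information]."""
    p_info = sum(p for w, p in prior.items() if consistent(w))
    return restricted_sum(action, consistent) / p_info

def not_on_netflix(world):
    return not world[0]

best_by_sum = max(ACTIONS, key=lambda a: restricted_sum(a, not_on_netflix))
best_by_cond = max(ACTIONS, key=lambda a: conditional_expectation(a, not_on_netflix))
print(best_by_sum, best_by_cond)
```

With these particular numbers both criteria select the same action, as they must: dividing every action's score by the same positive probability cannot reorder the actions.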

More generally: each action is chosen to maximize expected utility conditional on whatever information is available as an input to that action (including priors about how the other actions will be taken). That’s the defining feature of a distributed Bayesian agent.
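As a rough illustration of that defining feature, here is a sketch with two decision-makers who share one utility function but each see only their own observation. Everything concrete here is an assumption for the example: the priors (taken to be independent, so conditioning on my observation leaves the prior over my friend's observation unchanged), the costs, the action names, and the use of alternating best responses as a way to find policies where each action is conditionally optimal given the other's policy. It is not claimed to find a global optimum.

```python
# Assumed independent priors over what each of us will observe.
P_NETFLIX = {True: 0.5, False: 0.5}   # I observe whether it's on Netflix
P_DVD = {True: 0.3, False: 0.7}       # my friend observes the DVD pile

MY_ACTIONS = ["dont_buy", "buy"]
FRIEND_ACTIONS = ["skip", "bring_dvd"]

def utility(on_netflix, in_dvd_pile, mine, friends):
    """Shared (hypothetical) utility: 1 for watching the movie, minus costs."""
    watching = on_netflix or mine == "buy" or (friends == "bring_dvd" and in_dvd_pile)
    cost = (0.8 if mine == "buy" else 0.0) + (0.1 if friends == "bring_dvd" else 0.0)
    return (1.0 if watching else 0.0) - cost

def my_best_response(friend_policy):
    """For each thing I might observe, maximize E[u | my observation]."""
    policy = {}
    for n in P_NETFLIX:
        def expected(action):
            return sum(P_DVD[d] * utility(n, d, action, friend_policy[d]) for d in P_DVD)
        policy[n] = max(MY_ACTIONS, key=expected)
    return policy

def friend_best_response(my_policy):
    """For each thing my friend might observe, maximize E[u | their observation]."""
    policy = {}
    for d in P_DVD:
        def expected(action):
            return sum(P_NETFLIX[n] * utility(n, d, my_policy[n], action) for n in P_NETFLIX)
        policy[d] = max(FRIEND_ACTIONS, key=expected)
    return policy

# Alternate best responses until neither policy changes.
my_policy = {n: "dont_buy" for n in P_NETFLIX}
friend_policy = {d: "skip" for d in P_DVD}
for _ in range(10):
    new_friend = friend_best_response(my_policy)
    new_mine = my_best_response(new_friend)
    if (new_mine, new_friend) == (my_policy, friend_policy):
        break
    my_policy, friend_policy = new_mine, new_friend

print(my_policy, friend_policy)
```

Neither policy ever reads the other person's observation. Each one is a lookup from local information to an action, chosen using only priors about what the other person will see and do.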

This post (and the more dense version here) spells out the mathematical argument in a bit more detail, starting from coherence rather than utility-maximization-as-description-length-minimization.

(Side note: some decision theory scenarios attempt to mess with the “current action won’t impact any of the other worlds” part, by making actions in one world impact other worlds. Something FDT-like would fix that, but that’s out of scope for the current post.)

Resources

The "Measuring Stick of Utility" Problem talks about how grounding the idea of “resources” in non-agenty concepts is a major barrier to using coherence theorems to e.g. identify agents in a given system. If we have distributed decisions, optimization at a distance, or both, and we expect that information at a distance is mediated by relatively low-dimensional summaries (i.e. the Telephone Theorem), then there’s an intuitively-natural way to recognize “resources” for purposes of coherence arguments.

Let’s go back to the example of a company, in which individual employees make many low-level decisions in parallel. The information relevant to each decision is mostly local - e.g. a cashier at a retail store in upstate New York does not need to know the details of station 13 on the company’s assembly line in Shenzhen. But there is some relevant information - for instance, if an extra 10 cents per item are spent at station 13 on the assembly line in Shenzhen, then the cashier needs to end up charging another ~10 cents per item to customers. Or, if the assembly line shuts down for a day and 10000 fewer items are produced, then the cashiers at all of the company’s stores need to end up selling 10000 fewer items.

So we have this picture where lots of different decisions are made mostly-locally, but with some relatively small summary information passed around between local decision makers. That summary consists mainly of a sum of “resources” gained/lost across each decision. In our example, the resources would be dollars spent/gained, and items created/sold.

The key here is that we have lots of local decisions, with relatively low-dimensional coupling between them. The summary-information through which the decisions couple is, roughly speaking, the “resources”. (In practice, there will probably also be lots of extra summary-information between localities which isn’t controllable via the actions, and therefore needn’t be treated as a resource - e.g. all the facts about concrete one could learn from the store’s walls which would carry over to the concrete in the factory’s walls.)
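A small sketch of that low-dimensional coupling, using the company example. The `Resources` type, the specific numbers, and the break-even pricing rule are all hypothetical; the point is only that the cashier's decision reads the summed resource deltas and nothing else about what happened upstream.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Resources:
    cents: int = 0   # money gained (+) or spent (-), in cents
    items: int = 0   # items created (+) or sold (-)

    def __add__(self, other):
        return Resources(self.cents + other.cents, self.items + other.items)

def summarize(deltas):
    """The only thing passed between localities: the sum of resource deltas."""
    return sum(deltas, Resources())

def break_even_price_cents(upstream):
    """Assumed cashier rule: price each item to recover upstream spending.
    It needs only the summary, not the details of any factory station."""
    return -upstream.cents // upstream.items

# Hypothetical upstream decisions: a production run, then the same run
# with an extra 10 cents per item spent at one station.
baseline = [Resources(cents=-5_000_000, items=10_000)]
with_extra_spend = baseline + [Resources(cents=-10 * 10_000)]

print(break_even_price_cents(summarize(baseline)))
print(break_even_price_cents(summarize(with_extra_spend)))
```

Any other set of upstream decisions with the same totals would lead the cashier to the same price, which is the sense in which the summary mediates the interaction.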

Alternatively, rather than starting from distributed decisions, we could start from optimization at a distance. Because the optimization target is “far away” from the actions, only some relatively-low-dimensional summary of the actions impacts the target. Again, the components of that summary are, roughly speaking, the “resources”.

This picture fits in nicely with coherence theorems. The theorems talk about how a local decision maker needs to act in order to achieve Pareto-optimal resource use, while still achieving local goals. For instance, the company’s marketing department should act-as-though it has a utility function over ads, otherwise it could run the same ads while spending Pareto-fewer resources.

This picture also fits in nicely with natural abstractions. We have a large system with lots of parts “far away” from each other. The Telephone Theorem then says that they will indeed interact only via some relatively low-dimensional summary. In a decision framing, it says that only a relatively low-dimensional summary of the far-away decisions will be relevant to the local decision. Furthermore, we can in-principle derive that low-dimensional summary from the low-level physics of the world.

But this is still just an intuitive story. To make it rigorous, the Measuring Stick of Utility post argued that we need our resources to have two main properties:

  • More resource is always better
  • Resources are additive across decisions

Additivity across decisions, in particular, is the more restrictive condition mathematically. In order to identify natural abstraction summaries as “resources” for coherence purposes, those summaries need to be additive across all the local decisions.

… which is the main claim argued in Maxent and Abstractions. Summaries of information relevant at a distance can indeed be represented as sums over local variables/decisions.

4 comments

Do you think that thinking explicitly about distributed systems (in the theoretical computer science sense) could be useful for having different frames or understanding of the tradeoffs? Or are you mostly using the idea of distributed systems as an intuitive frame without seeing much value in taking it too seriously?

If I may be so bold, the answer should be a guarded yes.

A snag is that the correct theory of what John calls 'distributed systems' or 'Time' and what theoretical CS academics generally call 'concurrency' is not yet fully constructed. To be sure, there are many quite well-developed theoretical frameworks - e.g. the Pi calculus or the various models of concurrency like Petri nets, transition systems, event structures etc. They're certainly on my list of 'important things I'd like to understand better'.

Our world, and our sensemaking of it, is fundamentally concurrent. If we had the 'correct' theory of concurrency and were able to coherently combine it with decision theory under uncertainty, that would be very powerful.

Two answers:

  • I agree with Self-Embedded Agent that there are likely powerful frames for thinking about distributed compute which have not yet been discovered, and existing work may hint toward those. That's the sort of thing which is probably not useful for most researchers to think about, but worth at least some thinking about.
  • There's a shared core to distributed models which I do think basically-all technical researchers in the field should be familiar with. That's best picked up by seeing it in a few different contexts, and theory of distributed systems is one possible context to pick it up from. (Some others: Bayes nets/causality, working with structured matrices, distributed programming in practice.)

How technical is the use of the word 'distributed' here?

 

While arranging my evening, I may perform some Bayesian updates. Maybe I learn that the movie is not available on Netflix, so I ask a friend if they have a copy, then check Amazon when they don’t. This process is reasonably well-characterized as me having a centralized model of the places I might find the movie, and then Bayes-updating that model each time I learn another place where I can/can’t find it.

It seems more like going through a list of places and checking off 'not there' than Bayesian updating. Sure, that's a special case, 

My friends and I, as a system, are not well-modeled as Bayesian updates to a single central knowledge-state; otherwise we wouldn’t check Netflix twice.

but it seems like 'centrality' is less likely to be the thing here than something else. Coordination is mentioned, but it seems more like you both check Netflix because you're not asking 'what if _ checks Netflix'. In other words, maybe you're not acting in a 'Bayesian manner'. Rather than evaluate the probability, you take the action. I would also guess you didn't say Netflix because 'the probability points that way'.

If you watch Netflix a lot (or have used it recently) then it might come to mind quickly. If your friend watches something else a lot, maybe they check there first.

 

There's not much benefit to more elaborate protocols here (beyond texting your friend that it's not on Netflix) if there aren't many services to search. (Otherwise you could come up with a list together (or independently) and handle your parts (or pick some off the list at random, figuring that if both of you do that, you're more likely to find it, even if you don't coordinate more).) So I won't go into a lot more detail here, other than mentioning:

There are other considerations at play here than probability: cost. You have Netflix so you check there.