Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A crew of pirates all keep their gold in one very secure chest, with labelled sections for each pirate. Unfortunately, one day a storm hits the ship, tossing everything about. After the storm clears, the gold in the chest is all mixed up. The pirates each know how much gold they had - indeed, they’re rather obsessive about it - but they don’t trust each other to give honest numbers. How can they figure out how much gold each pirate had in the chest?

Here’s the trick: the captain has each crew member write down how much gold they had, in secret. Then, the captain adds it all up. If the final amount matches the amount of gold in the chest, then we’re done. But if the final amount does not match the amount of gold in the chest, then the captain throws the whole chest overboard, and nobody gets any of the gold.
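The captain's rule fits in a few lines. Here is a minimal sketch, assuming whole-coin claims. `settle_chest` and the pirate names are hypothetical:

```python
def settle_chest(claims: dict[str, int], chest_total: int) -> dict[str, int]:
    """Apply the captain's rule to the secret written claims.

    If the claims add up exactly to the gold in the chest, every pirate
    gets what they wrote down. Otherwise the chest goes overboard and
    every pirate gets nothing.
    """
    if sum(claims.values()) == chest_total:
        return dict(claims)
    return {pirate: 0 for pirate in claims}


# Honest claims: everyone is paid.
print(settle_chest({"Anne": 40, "Bart": 60}, 100))  # {'Anne': 40, 'Bart': 60}
# Anne overstates by 5: the chest goes overboard.
print(settle_chest({"Anne": 45, "Bart": 60}, 100))  # {'Anne': 0, 'Bart': 0}
```

The rule only ever looks at the total, so it never has to work out who lied.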

I want to emphasize two key features of this problem. First, depending on what happens, we may never know how much gold each pirate had in the chest or who lied, even in hindsight. Hindsight isn’t 20/20. Second, the solution to the problem requires outright destruction of wealth. 

The point of this post is that these two features go hand-in-hand. There’s a wide range of real-life problems where we can’t tell what happened, even in hindsight; we’ll talk about three classes of examples. In these situations, it’s hard to design good incentives/mechanisms, because we don’t know where to allocate credit and blame. Outright wealth destruction provides a fairly general-purpose tool for such problems. It allows us to align incentives in otherwise-intractable problems, though often at considerable cost.

The Lemon Problem

Alice wants to sell her old car, and Bob is in the market for a decent quality used vehicle. One problem: while Alice knows that her car is in good condition (i.e. “not a lemon”), she has no cheap way to convince Bob of this fact. A full inspection by a neutral third party would be expensive, Bob doesn’t have the skills to inspect the car himself, and any words Alice speaks on the matter could just as easily be spoken by someone selling a lemon.

In order to convince Bob that the car is not a lemon, Alice needs to say or do something which a lemon-seller would not. What can she do?

One easy answer: offer to pay for any mechanical problems which come up after the sale. If Alice knew about expensive mechanical problems hiding under the car’s hood, then she wouldn’t offer Bob this sort of insurance (at least not for a low price). Conversely, if Alice is reasonably confident there are no mechanical problems, then offering to pay for the probably-non-existent problems costs her little.

There is one problem with this approach, however: if Alice is paying for mechanical problems, then Bob has no incentive to take good care of the car.

Ideally, if we could figure out in hindsight which problems were already present at the time of the sale, then Alice could offer to pay for only problems which were present beforehand. But in practice, if the car’s brakes fail 6 months or a year after the sale, we have no way to tell when the problem began. Were they already worn down, or has Bob been visiting the racetrack?

We can get a less-than-perfect solution using a proxy. For instance, if the car’s belt snaps a week after the sale, then it was probably frayed beforehand. If it snaps five years after the sale, then it probably wasn’t a noticeable issue beforehand. In this case, we can use time-at-which-a-problem-is-detected as a proxy for whether-a-problem-was-present-at-time-of-sale. This isn’t perfectly reliable, and there will be grey areas, but it gets us one step closer to figuring out in hindsight what happened.

Alternatively, we could try to align incentives without figuring out what happened in hindsight, using a trick similar to our pirate captain throwing the chest overboard. The trick is: if there’s a mechanical problem after the sale, then both Alice and Bob pay for it. I do not mean they split the bill; I mean they both pay the entire cost of the bill. One of them pays the mechanic, and the other takes the same amount of money in cash and burns it. (Or donates to a third party they don’t especially like, or ….) This aligns both their incentives: Alice is no longer incentivized to hide mechanical problems when showing off the car, and Bob is no longer incentivized to ignore maintenance or frequent the racetrack.
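To make the accounting concrete, here is a sketch of who loses what on a single repair bill under three arrangements. The scheme names and `settle_repair` are hypothetical labels, not terms from the post:

```python
def settle_repair(bill: float, scheme: str) -> dict[str, float]:
    """Return each party's loss, and the money destroyed, for one repair bill."""
    if scheme == "seller_pays":
        # Alice insures Bob: she bears everything, so Bob stops caring.
        return {"alice": bill, "bob": 0.0, "burned": 0.0}
    if scheme == "split":
        # Each bears half, so each has only half an incentive to avoid problems.
        return {"alice": bill / 2, "bob": bill / 2, "burned": 0.0}
    if scheme == "both_pay":
        # One pays the mechanic; the other burns the same amount in cash.
        return {"alice": bill, "bob": bill, "burned": bill}
    raise ValueError(f"unknown scheme: {scheme!r}")


for scheme in ("seller_pays", "split", "both_pay"):
    print(scheme, settle_repair(800.0, scheme))
```

Only under `both_pay` does each party's own loss equal the full bill regardless of who caused the problem. The `burned` entry is the price of that alignment.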

However, this solution also illustrates the downside of the technique: it’s expensive. Sometimes accidents happen - e.g. the air conditioner fails without Alice hiding it or Bob abusing the car. Our both-pay solution will make such accidents twice as expensive. If we can’t tell in hindsight whether a problem was Alice’s fault, Bob’s fault, or an accident, then both Alice and Bob need to pay the full cost of the problem in order to fully align their incentives. That means they’ll both need to pay for accidents, which reduces the overall surplus from the car sale. If the car is worth enough to Bob and little enough to Alice, there may still be room to make the deal work, but the (expected) cost of accidental problems will eat into both of their wallets.
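Here is a back-of-the-envelope version of the surplus argument. It assumes risk-neutral parties and that, once incentives are aligned, only accidents remain. The dollar figures are made up for illustration. The `both_pay=False` line is only a reference point: without burning, incentives are no longer aligned, so hidden problems and abuse would add costs of their own.

```python
def deal_surplus(value_to_bob: float, value_to_alice: float,
                 expected_accident_cost: float, both_pay: bool) -> float:
    """Gains from trade left after paying for accidental damage.

    Values are for the car before any repairs. Without burning, each dollar
    of accidental damage costs one dollar of repairs; under both-pay it also
    costs one burned dollar.
    """
    repairs = expected_accident_cost
    burned = expected_accident_cost if both_pay else 0.0
    return value_to_bob - value_to_alice - repairs - burned


print(deal_surplus(6000.0, 5000.0, 300.0, both_pay=True))   # 400.0
print(deal_surplus(6000.0, 5000.0, 600.0, both_pay=True))   # -200.0: no deal
print(deal_surplus(6000.0, 5000.0, 600.0, both_pay=False))  # 400.0
```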

Similarly, if Alice and Bob have less-than-perfect trust in each other’s capabilities, that will eat into (expected) value. If Bob thinks that Alice just doesn’t know her own car very well, he may expect problems that Alice doesn’t know about. If Alice thinks that Bob is a careless driver regardless of incentives, then she’ll expect problems. These sorts of problems are effectively the same as accidents: they’re problems which won’t be avoided by good incentives, and therefore their overall cost will be doubled when both Alice and Bob need to pay for them.

O-Ring Production Functions

Suppose we have 100 workers, all working to produce a product. In order for the product to work, all 100 workers have to do their part correctly; if even just one of them messes up, then the whole product fails. This is an o-ring production function - named for the explosion of the space shuttle Challenger, where the failure of one o-ring led to the fatal failure of the whole shuttle. The model has some interesting economic implications - in particular, under o-ring-like production, adding a high-skill worker to a team of other high-skilled workers generates more value than adding the same high-skill worker to a team of low-skill workers. Conversely, it offers theoretical support for common claims like “hiring one bad worker creates more damage than hiring ten good workers creates benefit”.
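Here is a stripped-down version of the o-ring arithmetic. It assumes each worker's component succeeds independently with probability equal to their skill, and it values the product simply by its probability of working. The helper names are hypothetical:

```python
import math


def oring_output(skills: list[float]) -> float:
    """Probability that the product works: every component must succeed."""
    return math.prod(skills)


def gain_from_better_worker(teammate_skill: float, team_size: int,
                            weak: float, strong: float) -> float:
    """Extra success probability from filling one seat with a `strong`
    worker instead of a `weak` one, when all teammates have `teammate_skill`."""
    teammates = [teammate_skill] * (team_size - 1)
    return oring_output(teammates + [strong]) - oring_output(teammates + [weak])


# Same swap (0.90 -> 0.99 skill) on a 100-person team:
print(gain_from_better_worker(0.99, 100, weak=0.90, strong=0.99))  # roughly 0.033
print(gain_from_better_worker(0.95, 100, weak=0.90, strong=0.99))  # roughly 0.0006
```

The same upgrade is worth about sixty times more among strong teammates, because a strong worker's contribution only counts in worlds where everyone else also succeeds.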

Here, I want to think about incentive design in an o-ring-like production model. If any worker fails to build their component well, then the whole product fails. How do we incentivize each worker to make their particular component work well? If we can figure out in hindsight which component(s) failed, then incentive design is easy: reward workers whose components succeeded, punish workers whose components failed. But what if we can’t tell in hindsight which components failed? What if we only know whether the product as a whole failed?

We can apply our value-destruction trick: if the product fails, then punish each worker as though their component had failed. Each worker is then fully incentivized to make their component work; if it fails, they’ll face the full cost of failure.

Just like the used car example, accidents are a problem. If there’s a non-negligible chance of accident, then workers will expect a non-negligible chance of failure outside of their control. In order to make up for that chance of punishment, the company will have to offer extra base pay to convince workers to work for them in the first place.

Also like the used car example, if the workers don’t trust each other’s capabilities, then that has the same effect as expecting accidents. Anything which makes the workers expect failure regardless of the incentives makes them expect punishment outside of their control, which makes them demand higher base pay in order to make it worthwhile to work for this company at all.

Even worse: if the workers think there’s a high probability of failure regardless of incentives, that reduces their own incentive to avoid failure. If they expect the final product to fail regardless of whether their own component fails, then they have little incentive to make their own component work. In order for this whole strategy to work well, there has to be a high probability that the end product succeeds, assuming the incentives are aligned. Accidents and incompetence have to be rare. (Drawing the analogy back to the used-car problem: if Alice knows that the clutch is bad, but expects Bob to abuse the clutch enough that it would be ruined anyway regardless of incentives, then she has little reason to mention the bad clutch, even under the both-pay strategy.)
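Both effects above, the extra base pay and the diluted incentive, fall out of a simple expected-value model. This sketch assumes risk-neutral workers, independent failures, and a fixed penalty applied to everyone whenever the product fails. The names and numbers are hypothetical:

```python
def expected_penalty(penalty: float, own_success: float,
                     others_success: float) -> float:
    """Expected punishment for one worker when everyone is punished
    whenever the product as a whole fails."""
    return penalty * (1 - own_success * others_success)


def incentive_to_be_careful(penalty: float, careless: float, careful: float,
                            others_success: float) -> float:
    """Reduction in expected punishment from being careful. A worker's care
    only matters in worlds where every other component succeeds."""
    return (expected_penalty(penalty, careless, others_success)
            - expected_penalty(penalty, careful, others_success))


# Penalty of 1000; care raises own success from 0.90 to 0.99.
for others in (0.95, 0.10):
    print(f"others succeed w.p. {others}: "
          f"incentive {incentive_to_be_careful(1000, 0.90, 0.99, others):.1f}, "
          f"expected penalty even when careful {expected_penalty(1000, 0.99, others):.1f}")
```

With reliable teammates, care is worth 85.5 in avoided punishment, and the diligent worker still expects a penalty of 59.5, which base pay must cover. If the rest of the product fails 90% of the time, care is worth only 9, while the expected penalty rises to 901.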


In the context of a modern business, one model I think about is the game of telephone. The players all sit in a line, and the first player receives a secret message. The first player whispers the message in the ear of the second, the second whispers it to the third, and so forth. When the message reaches the last player, we compare the message received to the message sent to see if they match. Inevitably, a starting message of “please buy milk and potatoes at the store” turns into “cheesy guys grow tomatoes on the shore”, or something equally ridiculous, one mistake at a time.

In a business context, the telephone chain might involve a customer research group collecting data from customers, then passing that data to product managers, who turn it into feature requests for designers, who then hand the design over to engineers, who build and release the product, often with several steps of information passing up and down management chains in the middle. This goes about as well as the game of telephone, which is why it is such a common subject of “jokes”.

Viewed as economic production, the game of telephone is itself an example of an o-ring production function. In order to get a successful final product - i.e. a final message which matches the original message - every person in the chain must successfully convey the message. If one person fails, the whole product fails. (Even if individual failures are only minor, a relatively small number of them still wipes out the contents of the message.) And, if there’s an end-to-end mismatch, it will often be expensive to figure out where communication failed, even in hindsight.
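A small simulation makes the accumulation of minor slips concrete. It assumes each player independently mishears each word with a fixed probability and substitutes a random word from a small vocabulary. `play_telephone`, the slip rate, and the vocabulary are all hypothetical:

```python
import random


def play_telephone(message: list[str], players: int, slip_prob: float,
                   vocabulary: list[str], rng: random.Random) -> list[list[str]]:
    """Pass `message` down a line of players and return what each one heard.

    Each player mishears each word independently with probability
    `slip_prob`, replacing it with a random word from `vocabulary`.
    """
    heard = []
    current = list(message)
    for _ in range(players):
        current = [rng.choice(vocabulary) if rng.random() < slip_prob else word
                   for word in current]
        heard.append(current)
    return heard


message = "please buy milk and potatoes at the store".split()
vocabulary = "cheesy guys grow tomatoes on the shore please buy milk".split()
final = play_telephone(message, players=20, slip_prob=0.05,
                       vocabulary=vocabulary, rng=random.Random(0))[-1]
intact = sum(a == b for a, b in zip(message, final))
print(" ".join(final), f"({intact}/{len(message)} words intact)")
```

With a 5% slip rate per word per hop, a word survives all 20 hops with probability 0.95**20, about 0.36 (ignoring lucky substitutions). Most of the message is usually gone by the end, and comparing the two ends says nothing about which hop went wrong.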

So, we have the preconditions for our technique: we can incentivize good message-passing by punishing everyone in the chain when the output message doesn’t match the input message.

Would this be a good idea? It depends on how much miscommunication can be removed by good incentives. If the limiting factor is poor communication skills, and the people involved can’t do any better even if they try, then we’re in the “expect accidents” regime: the incentives will be expensive and the system will often fail anyway. On the other hand, if incentivizing reliable communication produces reliable communication, then the strategy should work.

That said, we’re talking about punishing managers for miscommunicating, so presumably few managers would want to adopt such a rule regardless. Good incentive design doesn’t make much difference if the people who choose the incentives do not want to fix them.


Your solution to the gold sharing problem doesn't work that well, because that solution allows any pirate to give an ultimatum to any other pirate of the form "Write X less than you actually had, because I am going to write X more. If you don't do that, you'll lose it all." And that's the Ultimatum game.

Correct. The solution to that is to prevent any communication beforehand. In principle, someone could still try this strategy acausally, but that's the sort of thing which everyone else can prevent by simply not simulating any other pirates.


The tricky part in real life is that memory isn't perfect.

However, instead of throwing the chest overboard, you could attempt to catch one major defector, toss them overboard, and distribute the remaining gold (even if the sum still doesn't quite work because some smaller defectors went uncaught):

If everyone is honest, then

coins_1 + coins_2 + ... + coins_n = coins_chest.

If there's only one major defector, then not only is coins_1 + ... + coins_n > coins_chest, but also

coins_1 + ... + coins_n - coins_major-defector <= coins_chest.

(More generally, any sum of entirely honest and entirely accurate entries will be at most the number of coins in the chest.)
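The inequality above suggests a concrete check. This sketch follows the commenter's assumption that at most one pirate over-claims and everyone else is honest. `defector_candidates` is a hypothetical helper. Removing a pirate's claim brings the total back under the chest exactly when that claim is at least the overshoot, so the check only singles someone out when the lie is large relative to the honest claims:

```python
def defector_candidates(claims: dict[str, int], chest_total: int) -> list[str]:
    """Pirates whose removal brings the remaining claims to at most the
    chest total. If exactly one pirate over-claimed and the rest were
    honest, the real defector is always on this list."""
    total = sum(claims.values())
    if total <= chest_total:
        return []  # No overshoot, nothing to detect.
    return [pirate for pirate, claim in claims.items()
            if total - claim <= chest_total]


# True holdings: Anne 30, Bart 30, Cid 40 (chest holds 100).
print(defector_candidates({"Anne": 30, "Bart": 30, "Cid": 100}, 100))  # ['Cid']
print(defector_candidates({"Anne": 30, "Bart": 30, "Cid": 70}, 100))   # ['Anne', 'Bart', 'Cid']
```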


If you split people up into groups ahead of time, you could keep track of the amount that a group has (more cheaply than keeping track of everyone).

I'm going to stop this train of thought here before recreating bitcoin. (If everyone keeps track of a random partition involving everyone else, then will that probably be enough to figure everything else out afterwards?)

Might work against fraud detection though.

(Simpler to just remember what one other person's amount is though.)


Enough people who have more (and this is common knowledge) agree to an equal distribution. (For example 'We were all going to use this gold to get drunk anyway. So why don't we just spend this chest of gold on drinks?')


Not sure it has an effect but:

Multiple rounds. If the first round's claims don't add up, half the gold in the chest is dumped out, and the pirates start again.

(Can be combined with method 2.)

Also, if the amounts submitted add up to a multiple of the amount in the chest, then the chest can be split in proportion to the claims.


Pick a random order to go in. Once the amount in the chest is exceeded, it is divided among those people (minus the person who went over). If someone cheats by a lot, they will probably not get it.


Throw out the person who said the most, if the sum is too great.

My immediate reaction is that I remember hating it very much at school when a teacher punished the entire class for the transgression of an unidentifiable person!

Same! I thought about putting that in as an example, but didn't end up using it.


Just a side note. During my time in the Army it was always noted that group punishment was not to be imposed (outside basic training, but I think that was a separate situation). I always thought that a bit odd given the need for the unit to function as a whole. One might think that such an approach would promote more unity by making each unit member essentially their brother's keeper (and cell mate, as it were).

The only way I could understand the point, aside from the "innocent should not be punished" aspect, was that such an approach was likely both to disrupt unit cohesion and trust and to allow one disgruntled member to undermine the entire unit.

I'd be very interested to know how that rule came about.

Perhaps experience with saboteurs* (or conditions with high rates of natural accidents) could have that effect. (Though the original story would be rather nice to know.)

*Especially if the saboteurs weren't part of the unit - stealing stuff, messing things up, etc. Then it's obviously a good idea to implement such a rule.

And, if there’s an end-to-end mismatch, it will often be expensive to figure out where communication failed, even in hindsight.

Might be true of this type of production (O-ring), less clear that it's the case with a game of telephone.

(Implicitly you're able to compare the messages at the ends, to know there's a mismatch.)

Ask the person halfway along what they heard. If what they heard is correct, then the error(s) occurred after the halfway point; otherwise, at least one occurred before it.

(Also, this is what redundancy and error correcting codes are for.)
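Asking the person halfway along, and then repeating the question on whichever half contains the problem, is a binary search. This sketch assumes players honestly report what they heard. `find_error_hop` is a hypothetical helper. It keeps one known-correct position and one known-wrong position and narrows the gap between them, so after about log2(n) questions it returns a hop where a correct message became incorrect. There may be other bad hops as well.

```python
def find_error_hop(versions: list[list[str]]) -> int:
    """`versions[0]` is the original message and `versions[k]` is what the
    k-th listener heard. Assuming the last version is wrong, return some k
    where versions[k - 1] was correct but versions[k] was not."""
    original = versions[0]
    if versions[-1] == original:
        raise ValueError("the message arrived intact; no error to find")
    # Invariant: versions[good] is correct and versions[bad] is wrong.
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if versions[mid] == original:
            good = mid  # so an error happens somewhere after mid
        else:
            bad = mid   # so an error happens at or before mid
    return bad


correct = ["buy", "milk"]
garbled = ["buy", "silk"]
print(find_error_hop([correct, correct, correct, garbled, garbled]))  # 3
```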

Strict liability in tort law seems to be a pretty obvious example, doesn't it? I mean, I guess a lot of corporate law can be seen as "vicariously holding a group accountable".

Yeah, markets aren't very nice when they have mostly one-shot, fly-by-night interactions. You could fix that with punishments, but a less wasteful alternative is reputation. Sellers of used cars can join into bigger companies that are incentivized to uphold their reputation and provide warranties; workers in critical jobs can bring references from previous jobs where they proved their quality; owners of vacation homes can benefit when Airbnb lets future renters see the reviews written by past renters.

+1. This also ties in to the context in which I was thinking about this stuff: theory of the firm. The question is, why do people organize into companies? Why isn't everyone always an independent contractor? Or, conversely, if firms are more efficient, then why isn't there one giant firm? What determines the size of companies, and what's insourced vs outsourced?

One broad class of theories boils down to "the employment relationship gives repeat interactions, which gives more data with which to figure out (in hindsight) what employees are doing". Thus, companies. However, this is only useful until we have enough interactions to identify good employee behavior - after that, further interactions don't add much, and the usual allocative benefits of market competition are more useful. Thus, not just one giant firm.

You could go further and say that when firms are too small, the level of trust is inefficiently low ("fly-by-night"), and when firms are too big, the level of trust is inefficiently high ("managerial feudalism").

That's a great explanation. Bonus points for panache.

So this is the 2-of-2 exploding Nash equilibrium technique applied to multiple parties/transactions? What's this generalized kind called?

(On a side note, it now strikes me that there's a parallel to RL blackbox optimization: by setting up a large penalty for any divergence from the golden path, it creates an unbiased, but high variance estimator of credit assignment. When pirates participate in enough rollouts with enough different assortments of pirates, they receive their approximate honesty-weighted return. You can try to pry open the blackbox and reduce variance by taking into account pirate baselines etc, but at the risk of losing unbiasedness if you do it wrong.)

Very interesting! Quick question: Why can't Alice and Bob split the bill 50%-50%? Or 70%-70%, with the extra 40% being cash set on fire? Or 150%-150%, with the extra 200% being cash set on fire? What if anything makes the 100%-100% "split" special?

If someone pays less than the cost of a problem, then they're incentivized to take inefficiently-low levels of effort to avoid it. If someone pays more than the cost, they're incentivized to take inefficiently-high levels of effort to avoid it. In general, if someone can pay X "dollars" (or effort, etc) to avoid a problem, then we want that to happen if-and-only-if the problem would cost more than X "dollars" to fix.
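The argument can be checked with a simple decision rule. This sketch assumes that a precaution costing X fully prevents a problem costing C, and that a party pays `share` times C when the problem happens (0.5 for an even split, 1.0 for the 100%-100% rule, 1.5 for the 150% variant from the question). `precaution_verdict` is hypothetical:

```python
def precaution_verdict(precaution_cost: float, problem_cost: float,
                       share: float) -> str:
    """Compare a self-interested party's choice with the efficient one."""
    takes_it = precaution_cost < share * problem_cost   # what they do
    should_take_it = precaution_cost < problem_cost     # what is efficient
    if takes_it == should_take_it:
        return "efficient"
    return "too little effort" if should_take_it else "too much effort"


for share in (0.5, 1.0, 1.5):
    print(share, [precaution_verdict(x, 100, share) for x in (40, 70, 130)])
```

Only a share of 1.0 gets every case right. Below it, worthwhile precautions (the 70) get skipped; above it, wasteful ones (the 130) get taken.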

Tho the deadweight loss of a "both pay" solution is probably optimized somewhere between "split evenly" and "both pay fully". For example, in the pirate case, I think there are schemes that you can do that result in honesty being the optimal policy and yet they only destroy some of the gold in the case of accidents (or dishonesty), tho this may be sensitive to how many pirates there are and what the wealth distribution is like.

The pirate problem is kind of a weird one, because it's one-shot. In general, there will be many opportunities for the players to prevent problems at some cost. The more such opportunities they have, the more important it is to get the incentives right. The "everyone pays exactly 100% of cost" solution should generally win asymptotically, when players have many opportunities to pay to prevent problems, but in a finite problem (i.e. most realistic problems), optimal will probably be less-than-100%.

Are you sure? Let's say Bob is considering whether to drive through a swamp which will give him $101 of time savings but risks damaging the car, which causes $100 cost in expectation. If both Alice and Bob pay $100 for the repair, then Bob drives through the swamp, which destroys $99 in aggregate value. If we repeat the scenario N times, Bob drives through the swamp every time. Right?

Yup. That does seem off, but it would make sense if we zoom out a level and assume that the money is paid to a third party, so it's not really deadweight loss in the broader system context.

Right, but any such trash-car-for-net-win opportunity for Bob will make Alice less likely to make the deal: from her perspective, Bob taking such a win is equivalent to accident/carelessness. In the car case, I'd imagine this is a rare scenario relative to accident/carelessness; in the general case it may not be.

Perhaps a reasonable approach would be to split bills evenly, with each paying 50% and burning an extra k%, where k is given by some increasing function of the total repair cost so far.

I think this gives better incentives overall: with an increasing function, it's dangerous for Alice to hide problems, given that she doesn't know Bob will be careful. It's dangerous for Bob to be careless (or even to drive through swamps for rewards) when he doesn't know whether there are other hidden problems.
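Here is one way to read the proposal as code. The comment leaves the increasing function open, so `example_burn_percent` is a made-up choice: k grows by one percentage point per $20 of earlier repairs, capped at 100. The sketch also assumes "total repair cost so far" means the bills before the current one:

```python
from typing import Callable


def escalating_split(bills: list[float],
                     burn_percent: Callable[[float], float]) -> list[float]:
    """Each party's out-of-pocket cost for each bill, in order.

    Each party pays 50% of the bill to the mechanic and also burns k% of it,
    where k = burn_percent(total of all earlier bills).
    """
    costs = []
    total_so_far = 0.0
    for bill in bills:
        k = burn_percent(total_so_far)
        costs.append(bill * (0.5 + k / 100))
        total_so_far += bill
    return costs


def example_burn_percent(total_so_far: float) -> float:
    """Hypothetical schedule: +1 percentage point per $20 of past repairs."""
    return min(100.0, total_so_far / 20)


print(escalating_split([400.0, 400.0, 400.0], example_burn_percent))
```

Three identical $400 bills cost each party $200, then $280, then $360. Because k only grows, every problem raises the price of the next one for both parties, which is what makes hiding problems or driving carelessly risky for each of them.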


I don't think you can use the "Or donates to a third party they don’t especially like" version: if trust doesn't exist, you can't trust Alice/Bob to tell the truth about which third parties they don't especially like.
You do seem to need to burn the money (and to hope that Alice doesn't enjoy watching piles of money burn).

Ok, I buy this argument.
