# 44

This post has been nominated for the 2018 Review
Write a Review

Requisite background: high school level programming and calculus. Explanation of backprop is included, skim it if you know it. This was originally written as the first half of a post on organizational scaling, but can be read standalone.

#### Backpropagation

If you’ve taken a calculus class, you’ve probably differentiated functions like . But if you want to do math on a computer (for e.g. machine learning) then you’ll need to differentiate functions like

function f(x):    # Just wait, the variable names get worse…              a = x^(1/2)     # Step 1              b = 1–1/(a+1) # Step 2              c = ln(b)     # Step 3              return c

This is a toy example, but imagine how gnarly this could get if contains loops or recursion! How can we differentiate , without having to write it all out in one line? More specifically, given the code to compute , how can we write code to compute ? The answer is backpropagation. We step through the function, differentiating it line-by-line, in reverse. At each line, we compute the derivative with respect to one of the internal variables. As we work back, we differentiate with respect to earlier and earlier variables until we reach x itself.

The key to backprop is the chain rule. Look the bottom left arrow: . That’s just the chain rule! A similar formula applies to each arrow in the bottom row.

Let’s work through the example. Keep an eye on the diagram above to see where we are as we go. First, we differentiate the last line:

return c: df_dc = 1

Then, we work backward:

Step 3: c = ln(b) -> df_db = df_dc * dc_db = df_dc ∗ (1/b) 

Notice what happened here: we used the chain rule, and one of the two pieces to come out was  — the result of the previous step. The other piece is the derivative of this particular line. Continuing:

Step 2: b = 1–1/(a+1) -> df_da = df_db * db_da = df_db ∗ (1/(a+1)^2) Step 1: a = x^(1/2) -> df_dx = df_da * da_dx = df_da ∗ (1/2*x^(−1/2)) 

Each step takes the result of the previous step and multiplies it by the derivative of the current line. Now, we can assemble it all together into one function:

function df_dx(x):          a = x^(1/2)                    # Step 1          b = 1–1/(a+1)                  # Step 2          c = ln(b)                      # Step 3           df_dc = 1          df_db = df_dc * 1/b            # df_dc times derivative of step 3    df_da = df_db * (1/(a+1)^2)    # df_db times derivative of step 2    df_dx = df_da * (1/2*x^(-1/2)) # df_da times derivative of step 1      return df_dx

The name “backpropagation” derives from the df_d* terms, which “propagate backwards” through the original function. These terms give us the derivative with respect to each internal variable.¹ In the next section, we’ll relate this to intermediate prices in supply chains.

#### Price Theory

Suppose we have a bunch of oversimplified profit-maximizing companies, each of which produces one piece in the supply chain for tupperware. So, for instance:

1. Company A produces ethylene from natural gas
2. Company B produces plastic (polyethylene) from ethylene
3. Company C molds polyethylene into tupperware

We’ll give each company a weird made-up production function:

1. Company A can produce units of ethylene from units of natgas
2. Company B can produce units of polyethylene from units of ethylene
3. Company C can produce units of tupperware from units of polyethylene

You may notice that these weird made-up cost functions look suspiciously similar to steps 1, 2 and 3 in our function from the previous section. Indeed, from the previous section tells us how much tupperware can be made from a given amount of natgas: we compute how much ethylene can be made from the natgas (step 1, company A), then how much polyethylene can be made from the ethylene (step2, company B), then how much tupperware can be made from the polyethylene (step 3, company C).

Each company wants to maximize profit. If company 3 produces units of tupperware (at unit price ) from units of polyethylene (unit price ), then their profit is : value of the tupperware minus value of the polyethylene. In order to maximize that profit, we set the derivative to zero, then mutter something about KKT and pretend to remember what that means:

Company C: ddb[Pc∗c(b)—Pb∗b]=0→Pb=Pcdcdb=Pc∗1b 

We’ve assumed competitive markets here: no single company is large enough to change prices significantly, so they all take prices as fixed when maximizing profit. Then, at a whole-industry-level, the above formula lets us compute the price of polyethylene in terms of the price of tupperware.​

Compute prices just like derivatives.

Well, now we can work back up the supply chain. Maximize profit for company B, then company A:

Company B: dda[Pb∗b(a)—Pa∗a]=0→Pa=Pb∗(1(a+1)2)  Company A: ddx[Pa∗a(x)—Px∗x]=0→Px=Pa∗(12x−12) 

Notice that these formulas are exactly the same as the formulas we used to compute in the previous section. Just replace by , by , etc — the price of the intermediate good is just the derivative of the production function with respect to that good. (Actually, the price is proportional to the derivative, but it’s equal if we set the price of tupperware to 1 — i.e. price things in tupperware rather than dollars.)

So the math is the same, but how physically realistic is this picture? So far we've assumed that the market is at equilibrium, which makes it tricky to assign causality. The causality of backprop does apply directly if the final output (tupperware in our example) is perfectly elastic, meaning that it always has the exact same price. But more generally, once we look at out-of-equilibrium markets, the behavior isn't just distributed backpropagation - it's an unusual variant of gradient descent. It's using local information to move in a general "downhill" direction.

To sum it all up: we can think of profit-maximizing firms in a competitive market at equilibrium as a distributed backpropagation algorithm. Each firm “backpropagates” price information from their output prices to their input prices. This isn't always a perfect reflection of physical causality, but it does match the math for markets at equilibrium.

#### Footnotes

¹In ML we usually have multiple inputs, in which case we compute the gradient. Other than single-variable, I’ve also simplified the example by only reading each variable in one line, strictly sequentially — otherwise we sometimes need to update the derivatives in place. All of this also carries over to price theory, for supply chains with multiple inputs which can be used in multiple ways.

[UPDATE Dec 2019: removed a couple comments toward the end about a future post which I never got around to writing, and replaced them with a few sentences on out-of-equilibrium markets. Someday I will learn not to promise posts which I have not yet written.]