This is a supplemental post to Geometric Utilitarianism (And Why It Matters), in which I show that, if we use the weights we derived in the previous post, a gradient ascender will reach the Harsanyi hyperplane H. This is a subproblem of the proof laid out in the first post of this sequence, and the main post describes why that problem is interesting.

## The Gradient and Contour Lines

It's easy to find the points s∈Rn which have the same G score as p: they're the points which satisfy G(s,ψ)=G(p,ψ). They all lie on a skewed hyperbola that touches P at p.

One way to think about G is as a hypersurface in n+1-dimensional space sitting "above" the n-dimensional space of utilities we've been working with. When there are 2 agents, we can plot G using the third vertical axis.

Check out the intersection of G and the vertical plane above the Harsanyi line H: this tells us about the values of G along that line, and as we shift p we can recalculate ψ so that, among the points of H, G peaks at p.

Our choice of p determines where we land on that surface "above" p. If we take a slice through G by only looking at the points of G at the same "altitude" as p, we get exactly that hyperbola back!

Doing this for many altitudes gives us a contour map, which you're probably familiar with in the context of displaying altitude of real 3D landscapes on flat 2D maps.

There's a theorem which tells us that the gradient of G must either be 0 or perpendicular to these contour hypersurfaces. So by calculating the gradient, we can calculate the tangent hyperplane of our skewed hyperbolas! And then we'll see if anything interesting happens at p.

This is a different gradient than the one we calculated earlier, where we were specifically interested in how G changes along the Harsanyi hyperplane H. But the slope of H, encoded in ϕ(p,F), showed up in how we defined ψ(p,F). So let's see how ϕ and p have shaped the geometry of G(_,ψ).

Thankfully, this gradient calculation is a lot easier than the other one (which was so long I broke it out into its own separate post). The gradient ∇uG(u,ψ) is just a vector of partial derivatives ∇uGi(u,ψ)=∂G/∂ui, where we're using ∇u to remind ourselves that this is the gradient with respect to u, holding ψ constant. We're holding the weights constant, and u isn't a function this time where we'd need the chain rule, so all we need to do is apply the power rule:

$$\frac{\partial G}{\partial u_i}(u,\psi)=\frac{\partial}{\partial u_i}G(u,\psi)$$

$$\frac{\partial G}{\partial u_i}(u,\psi)=\frac{\partial}{\partial u_i}\left(\prod_{j=1}^n u_j^{\psi_j}\right)$$

$$\frac{\partial G}{\partial u_i}(u,\psi)=\psi_i\,u_i^{\psi_i-1}\prod_{j=1,\,j\neq i}^n u_j^{\psi_j}$$

$$\frac{\partial G}{\partial u_i}(u,\psi)=\frac{\psi_i}{u_i}\,u_i^{\psi_i}\prod_{j=1,\,j\neq i}^n u_j^{\psi_j}$$

$$\frac{\partial G}{\partial u_i}(u,\psi)=\frac{\psi_i}{u_i}\prod_{j=1}^n u_j^{\psi_j}$$

$$\frac{\partial G}{\partial u_i}(u,\psi)=\frac{\psi_i}{u_i}\,G(u,\psi)$$

If we use ⊘ to denote element-wise division, this gives us a simple formula for ∇uG:

$$\nabla_u G=(\psi\oslash u)\,G$$

or, for a single component:

$$\nabla_u G_i=\frac{\psi_i}{u_i}\,G$$

And that's it! ∇uG is defined when ui>0 for all agents. Where ∇uG is defined, we can keep taking derivatives; G(u,ψ) is smooth everywhere ui>0 for all agents.
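As a sanity check, here's a minimal numerical sketch (with arbitrary made-up weights and utilities) comparing the closed-form gradient (ψ⊘u)G against central finite differences:

```python
import numpy as np

def G(u, psi):
    """Weighted geometric aggregation: G(u, psi) = prod_j u_j^psi_j."""
    return np.prod(u ** psi)

# Arbitrary illustrative values: any psi >= 0 and u > 0 will do.
psi = np.array([0.5, 0.3, 0.2])
u = np.array([2.0, 1.5, 4.0])

# Closed-form gradient from the derivation: (psi / u) * G(u, psi)
grad = (psi / u) * G(u, psi)

# Central finite-difference approximation of each partial derivative
eps = 1e-6
fd = np.array([
    (G(u + eps * e, psi) - G(u - eps * e, psi)) / (2 * eps)
    for e in np.eye(len(u))
])
print(np.allclose(grad, fd))  # True
```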

Here's what it looks like!

Playing around with an interactive version, you can see that as you approach giving an agent 0 utility, the gradient arrow gets longer and longer. As long as ψi>0, ∇uGi diverges off to infinity as ui approaches 0. When ψi=0, changing ui doesn't change G and ∇uGi=0.

Normatively, G being 0 whenever any individual utility is 0 is a nice property to have. As long as we give an agent some weight, there is a pressure towards giving them more utility. If you've gotten this far you've probably taken a calculus class, and you probably studied how to enclose the largest area using a fixed perimeter of fencing. This is exactly the same pressure pushing us towards squares and away from skinny rectangles. The Pareto optima are points where the pressures favoring each agent balance, for some weights ψ, and we can design ψ to cause all those pressures to balance at any point we choose along the Pareto frontier.
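The fencing intuition can be sketched numerically: with a fixed "perimeter" x + y = 1, the weighted product x^a·y^b peaks at x = a/(a+b), so equal weights give the square and unequal weights shift the balance toward the heavier agent. The weights below are arbitrary illustrative values:

```python
import numpy as np

# Arbitrary weights: agent 1 gets twice the weight of agent 2.
a, b = 2.0, 1.0

# Scan the budget split x + y = 1 and score each split by x^a * y^b.
xs = np.linspace(0.001, 0.999, 9999)
scores = xs ** a * (1 - xs) ** b
x_best = xs[np.argmax(scores)]

print(x_best)  # close to a / (a + b) = 2/3
```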

The visual analogy between "maximizing a product" and "sliding a hyperbola until it reaches the Pareto frontier" was really helpful in thinking about this problem. I first learned about that lens from Abram Demski's great Comparing Utilities post, which included illustrations by Daniel Demski that really helped me visualize what was going on as we maximize G.

Another thing we can notice is that ∇uGi≥0. This is exactly what we'd want from a Pareto monotone aggregation function. Geometrically, this means those contour lines always get further and further away from F, and they don't curve back in to make some other point in F score higher on G(_,ψ) than p.

## Gradient Ascent

The simplest proof I've found that goes from

p maximizes G(_,ψ) among H

to

p maximizes G(_,ψ) among F

relies on the fact that, if you start inside F and follow the gradient ∇uG to make G larger and larger, you'll eventually run into the Harsanyi hyperplane H. In order for this to be true, ∇uG needs to point at least a little bit in the direction perpendicular to H.

## The Normal Vector to The Harsanyi Hyperplane

What is that direction? One way to think about the hyperplane H is as a contour of the Harsanyi aggregation function H(_,ϕ): H is all of the joint utilities u∈Rn where H(u,ϕ)=H(p,ϕ). We know that the gradient ∇uH will be perpendicular to this contour hyperplane, so let's compute it in order to find the normal vector to H:

$$\frac{\partial H}{\partial u_i}=\frac{\partial}{\partial u_i}\left(\sum_{j=1}^n u_j\phi_j\right)$$

$$\frac{\partial H}{\partial u_i}=\phi_i$$

$$\nabla_u H_i=\phi_i$$

It would make my tensor calculus teacher too sad for me to write that ∇uH=ϕ, but the components of the vector ∇uH are always the same as the components of the covector ϕ. We can then normalize ∇uH to get the normal vector to H, which I'll denote h:

$$h_i=\frac{\phi_i}{|\phi|}$$

$$h=\frac{\phi}{|\phi|}$$

The distinction isn't important for most of this sequence, but I do want to use different alphabets to keep track of which objects are vectors and which are maps from vectors to scalars because they're different geometric objects with different properties. If we decide to start measuring one agent's utility in terms of milli-utilons, effectively multiplying all of their utility measurements by 1,000, the component of that agent's Harsanyi weight ϕi scales inversely in a way that perfectly cancels out this change. The slope of a line doesn't change when we change the units we use to measure it.

## Gradient Ascenders Reach H

One way to think about the dot product is as a tool for measuring the lengths of vectors and the angles between them. If the dot product ∇uG⋅h is ever 0, it means that a gradient ascender will not be moving towards or away from H at that point. There are n−1 such orthogonal directions available, so for our proof to work we need to check that our choice of ψ always leads to movement towards or away from H. Let's see what it takes to satisfy ∇uG⋅h=0:

$$\nabla_u G\cdot h=0$$

$$\sum_{i=1}^n(\nabla_u G_i)\,h_i=0$$

$$\sum_{i=1}^n\left(\frac{\psi_i}{u_i}\,G\right)\frac{\phi_i}{|\phi|}=0$$

All of these terms are non-negative, so in order for ∇uG⋅h to be 0, each element of this sum needs to be simultaneously 0. Can this happen for our choice of ψ?

$$\psi_i=\frac{p_i\phi_i}{p\cdot\phi}$$

$$\left(\frac{p_i\phi_i}{u_i\,(p\cdot\phi)}\,G\right)\frac{\phi_i}{|\phi|}=0$$

We know that ϕi>0 for at least one agent i, and that if pi=0 for all agents, the entire Pareto frontier must consist only of the point at the origin. (In which case all choices of ϕ and ψ make that point optimal, and any gradient ascender will trivially find the optimum.) Otherwise, wherever ∇uG is defined, it points at least a little in the same direction as h. (And where it's not defined, it's because some component diverges off to positive infinity as ui approaches 0 for some agent.)

In other words, using our choice of ψ, all gradient ascenders will either stay at their initial local minimum (because they were placed on a part of the boundary of F where G=0 and ∇uG is undefined), or they will eventually reach the Harsanyi hyperplane.
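Here's a small gradient-ascent sketch of that claim, using made-up values for ϕ and p: starting from an interior point with every ui>0, each ascent step strictly increases the Harsanyi score ϕ·u until the ascender crosses H.

```python
import numpy as np

# Made-up example: 2 agents, a point p on H, and the weights from the post.
phi = np.array([0.6, 0.4])
p = np.array([2.0, 3.0])
psi = (p * phi) / (p @ phi)          # psi_i = p_i phi_i / (p · phi)
target = p @ phi                     # H = {u : phi · u = phi · p}

u = np.array([0.5, 0.5])             # start strictly inside, all u_i > 0
step = 0.01
for _ in range(100_000):
    grad = (psi / u) * np.prod(u ** psi)   # ∇_u G = (psi ⊘ u) G
    u = u + step * grad                    # ascend G
    if u @ phi >= target:                  # reached the Harsanyi hyperplane
        break

print(u @ phi >= target)  # True: the ascender reached H
```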

This is also a great time to point out that

$$\nabla_u G_i(p,\psi)=\frac{\psi_i}{p_i}\,G=\frac{p_i\phi_i}{p_i\,(p\cdot\phi)}\,G=\frac{\phi_i}{p\cdot\phi}\,G$$

When pi>0 for all agents, ∇uG points in the same direction as ϕ at p. This is a direct consequence of choosing ψ such that ∇uG is perpendicular to H. So H is the tangent hyperplane to the Pareto frontier at p, but it's also the tangent hyperplane to the contour curve of G at p.
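We can check this numerically with arbitrary illustrative values of ϕ and p: plugging ψi=piϕi/(p⋅ϕ) into ∇uG at u=p collapses the gradient onto the direction of ϕ.

```python
import numpy as np

# Arbitrary illustrative phi and p with all components positive.
phi = np.array([0.3, 0.5, 0.2])
p = np.array([1.0, 2.0, 4.0])
psi = (p * phi) / (p @ phi)          # the weights derived in the post

G_p = np.prod(p ** psi)
grad = (psi / p) * G_p               # ∇_u G evaluated at u = p
parallel = phi / (p @ phi) * G_p     # phi, rescaled by G / (p · phi)

print(np.allclose(grad, parallel))   # True: the gradient points along phi
```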
