We can do better than argmax

Jan_Kulveit

Summary: A much-discussed normative model of prioritisation in EA is akin to argmax (putting all resources on your top option). But this model often prescribes foolish things, so we rightly deviate from it – but in ad hoc ways. We describe a more principled approach: a kind of softmax, in which it is best to allocate resources to several options by confidence. This is a better yardstick when a whole community collaborates on impact; when some opportunities are fleeting or initially unknown; or when large actors are in play.

Epistemic status: Relatively well-grounded in theory, though the analogy to formal methods is inexact. You could mentally replace “argmax” with “all-in” and “softmax” with “smooth” and still get the gist.

Gavin wrote almost all of this one, based on Jan’s idea.

> many EAs’ writings and statements are much more one-dimensional and “maximizy” than their actions.

– Karnofsky

Cause prioritisation is often talked about like this:

Evaluate a small number of options (e.g. 50 causes);
Estimate their {importance, tractability, and neglectedness} from expert point estimates;
Give massive resources to the top option.

You can see this as taking the argmax: as figuring out which input (e.g. “trying out AI safety”; “going to grad school”) will get us the most output (expected impact). So call this argmax prioritisation (AP).

AP beats the hell out of the standard procedure (“do what your teachers told you you were good at”; “do what polls well”). But it’s a poor way to run a portfolio or community, because it only works when you’re allocating marginal resources (e.g. one additional researcher); when your estimates of the effect or cost-effectiveness are not changing fast; and when you already understand the whole action space. ^[1]

It serves pretty well in global health. But where these assumptions are severely violated, you want a different approach – and while alternatives are known in technical circles, they are less understood by the community at large.

Problems with AP, construed naively:

Monomania: the argmax function returns a single option; the winner takes all the resources. If people naively act under AP without coordinating, we get diminishing returns and decreased productivity (because of bottlenecks in the complements to adding people to a field, like ops and mentoring). Also, under plausible assumptions, the single cause it picks will be a poor fit for most people. To patch this, the community has responded with the genre "You should work on X instead of AI safety" or “Why X is actually the best way to help the long-term future”. We feel we need to justify not argmaxing, or to represent our thing as the true community argmax. And in practice justification often involves manipulating your own beliefs (to artificially lengthen your AI timelines, say), appealing to ad hoc principles like worldview diversification ^[2], or getting into arguments about the precise degree of crowdedness of alignment.
Stuckness: Naive argmax gives no resources to exploration (because we assume at the outset that we know all the actions and have good enough estimates of their rank). As a result, decisions can get stuck at local maxima. The quest for "Cause X" is a meta patch for a lack of exploration in AP. Also, from experience, existing frameworks treat value-of-information as an afterthought, sometimes ignoring it entirely. ^[3]
Flickering: If the top two actions have similar utilities, small changes in the available information lead to endless costly jumps between options. (Maybe even cycles!) Given any realistic constraints about training costs or lags or bottlenecks, you really don't want to do this. This has actually happened in our experience, with some severe switching costs (years lost per person). OpenPhil intentionally added inertia to their funding pledges to build trust and ride out this kind of dynamic.
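To make the flickering problem concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: two options with nearly tied true values, estimates re-drawn each period with Gaussian noise, and a hypothetical softmax temperature of 0.5. It compares how much of the portfolio gets moved per period under argmax and under a two-option softmax.

```python
import math
import random

def average_reallocation(periods=200, noise=0.1, temperature=0.5, seed=0):
    """Average fraction of the portfolio moved per period under each rule."""
    rng = random.Random(seed)
    true_values = [1.00, 0.98]  # hypothetical, nearly tied options
    prev_argmax_share = prev_softmax_share = None
    argmax_moved = softmax_moved = 0.0
    for _ in range(periods):
        # Fresh noisy estimates of both options each period.
        est0, est1 = (v + rng.gauss(0, noise) for v in true_values)
        # Argmax puts everything on whichever option currently looks best.
        argmax_share = 1.0 if est0 >= est1 else 0.0
        # Two-option softmax share for option 0 (a logistic function).
        softmax_share = 1 / (1 + math.exp((est1 - est0) / temperature))
        if prev_argmax_share is not None:
            argmax_moved += abs(argmax_share - prev_argmax_share)
            softmax_moved += abs(softmax_share - prev_softmax_share)
        prev_argmax_share, prev_softmax_share = argmax_share, softmax_share
    steps = periods - 1
    return argmax_moved / steps, softmax_moved / steps

argmax_moved, softmax_moved = average_reallocation()
print(f"argmax moves {argmax_moved:.0%} of resources per period on average")
print(f"softmax moves {softmax_moved:.0%} of resources per period on average")
```

Under these made-up settings argmax shifts everything whenever the noisy ranking flips, while the softmax share only drifts. The size of the gap depends on the assumed noise and temperature.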

(But see the appendix for a less mechanical approach.)

Softmax prioritisation

Softmax (“soft argmax”) is a function which approximates the argmax in a smooth fashion. Instead of giving you one top action, it gives you a set of probabilities. See this amazing (and runnable!) tutorial from Evans and Stuhlmüller.

Softmax is a natural choice to inject a bit of principled exploration. Say you were picking where to eat, instead of what to devote your whole life to. Example with made-up numbers: where argmax goes to the best restaurant every day, softmax might go to the best restaurant ~70% of the time, the second best ~20% of the time, the third best ~5% of the time... The weights here are the normalised exponential of how good you think each is.
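As a rough sketch of the "normalised exponential" in the restaurant example: the scores below are invented so that the output lands near the made-up percentages above, and the function name is ours.

```python
import math

def softmax(scores):
    """Normalised exponential: weight_i = exp(s_i) / sum_j exp(s_j)."""
    m = max(scores)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical "how good do I think it is" scores for four restaurants.
restaurants = {"best": 3.0, "second": 1.75, "third": 0.4, "fourth": -0.3}
weights = dict(zip(restaurants, softmax(list(restaurants.values()))))
for name, w in weights.items():
    print(f"{name}: {w:.0%}")
```

With these invented scores the weights come out at roughly 72%, 20%, 5% and 3%. Argmax would instead put 100% on "best".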

The analogy is nicer when we consider softmax’s temperature parameter, which controls how wide your uncertainty is, and so how much exploration you want to do. This lets us update our calculation as we explore more: under maximum uncertainty, softmax just outputs a uniform distribution over options, so that we explore literally every option with equal intensity. Under maximum information, the function outputs just the argmax option. In real life we are of course in the middle for all causes – but this is useful, if you can guess how far along the temperature scale your area (or humanity!) currently is.
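The two limits can be checked with a small sketch. The scores and the three temperatures are arbitrary choices for illustration, and `softmax_t` is a hypothetical helper name.

```python
import math

def softmax_t(scores, temperature):
    """Softmax with temperature T: weight_i is proportional to exp(s_i / T)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # stabilise the exponentials
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [3.0, 2.0, 1.0]  # hypothetical value estimates for three options

hot = softmax_t(scores, temperature=100.0)   # high uncertainty
warm = softmax_t(scores, temperature=1.0)    # somewhere in the middle
cold = softmax_t(scores, temperature=0.01)   # near-certainty

for label, dist in (("hot", hot), ("warm", warm), ("cold", cold)):
    print(label, [round(w, 3) for w in dist])
```

At high temperature the weights are close to uniform. At low temperature almost all the weight sits on the top option, which is the argmax.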

We have been slowly patching AP over the years, approximating a principled alternative. We suggest that softmax prioritisation incorporates many of these patches. It goes like this:

Start at high temperature, placing lots of weight on each option.
Observe what happens after taking each option.
Decrease your temperature as you explore and learn more, thus gradually approaching argmax.
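The three steps above can be sketched as a toy multi-armed bandit with a cooling temperature (sometimes called Boltzmann exploration with annealing). This is an illustration under stated assumptions, not a model of the community: the payoffs, the Gaussian noise, and the geometric cooling schedule are all invented.

```python
import math
import random

def annealed_softmax(true_means, rounds=2000, t_start=5.0, t_end=0.05, seed=0):
    """Pick options by softmax over running estimates while the temperature cools."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n          # how often each option was tried
    estimates = [0.0] * n     # running mean of observed payoffs
    for step in range(rounds):
        # Step 3: geometric cooling from t_start to t_end (an assumed schedule).
        frac = step / max(rounds - 1, 1)
        temperature = t_start * (t_end / t_start) ** frac
        # Step 1: softmax weights over current estimates.
        m = max(estimates)
        weights = [math.exp((e - m) / temperature) for e in estimates]
        arm = rng.choices(range(n), weights=weights)[0]
        # Step 2: observe a noisy payoff and update the estimate.
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = annealed_softmax([1.0, 0.5, 0.0])
print("times each option was tried:", counts)
print("final estimates:", [round(e, 2) for e in estimates])
```

Early on every option gets sampled. As the temperature falls, most of the later rounds go to the option with the best estimate.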

(To be clear, we’re saying the community should run this, and that this indirectly implies individual attitudes and decisions. See Takeaways below.)

So "softmax" prioritisation says that it is normative for some people to work on the second and third-ranked things, even if everyone agrees on the ranking.

Why should the EA community as a whole do softmax prioritisation?

Smoothing. As discussed in Monomania, in practice it’s not optimal to throw everything at one cause, because of uncertainty, diminishing returns, limited mentoring and room for funding, and poor fit with most individuals. So we should spread our resources out (and in fact we do).
Total allocation. Allocating marginal resources (e.g. as a single altruistic donor looking to improve global health) is really different from allocating a large fraction of total resources (e.g. when the resources of an entire nation-state are allocated among pandemic interventions). For marginal allocation, what matters is the local cost/benefit slope. When managing total resource allocation, understanding the whole curve is vital. ^[4]

Consider EA as one big "collective agent", and think about the allocation of people as if it was collaborative. Notably, the optimal allocation of people is not achieved by everyone individually doing argmax prioritisation, since we need to take into account crowding and elasticities (“room for more funding” and so on).
Due weight on exploration and optionality. When you're choosing between many possible interventions and new interventions are constantly appearing and disappearing, then you should spend significant resources on 1) finding options, 2) trying them out, and 3) preserving options that might otherwise vanish. (e.g. Going to the committee meetings and fundraisers so that people see you as a live player.)

Takeaways

Lots of people already act as if the community was using something like softmax prioritisation. So our advice is more about cleaning up the lore; resolving cognitive dissonance between our argmax rhetoric and softmax actions; and pushing status gradients to align with the real optima. In practice:

Explore with a clear conscience! List new causes; look for new interventions within them; look for new approaches within interventions; one might be better than all current ones.
Stand tall! Work on stuff which is not the argmax with a clear conscience, because you are part of an optimal community effort. Similarly, praise people for exploring new options and prioritising in a “softmax” fashion; they are vital to the collective project.
Similarly, you don’t need to distort your beliefs to say that what you’re working on is the argmax.
Don’t flicker! Think twice about switching causes if you’ve already explored several.

Ultimately: we have made it high status to throw yourself at 80,000 Hours’ argmax output, AI. (Despite their staff understanding all of the above and communicating it!) We have not made it very high status to do anything else. ^[5]

So: notice that other directions are also uphill; help lift things wrongly regarded as downhill. Diversifying is not suboptimal pandering. Under more sensible assumptions, it’s just sound.

Appendix: Other reasons to diverge from argmax

In order of how much we endorse them:

Value of information is usually incredibly high
You don’t know the whole option set
Diminishing returns
Moral uncertainty
Concave altruism (i.e. Jensen’s inequality!)
The optimiser’s curse
Worldview diversification
Principled risk aversion, as at GiveWell
Strategic skulduggery
Decrease variance of your portfolio for more impact compounding(?)

For individual deviations from the community's argmax ("this is the best thing to do, ignoring who you are and your aptitudes"), your (lack of) personal fit and local opportunities can let you completely outdo the impact of you working on cause #1. (But this deviation is still a one-person argmax.)

Appendix: Technical details

The main post doesn’t distinguish all of the ways argmax prioritisation fails:

When you have uncertain rank estimates

The optimiser’s curse means that, even given an unbiased estimator, you will in general end up with an inflated estimate of the utility of option #1. The associated inflation can be more than a standard deviation, which will often cause the top few ranks to switch. Softmax prioritisation softens the blow here by spreading resources in a way which means that the true top option will not be totally neglected.
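A quick simulation sketch of the curse, under assumptions we chose for illustration: 20 options, true values drawn from a standard normal, and unbiased estimates with unit Gaussian noise. It measures how much the estimate of the apparent winner exceeds that option's true value, and how often the apparent winner is the true best.

```python
import random
import statistics

def optimisers_curse(n_options=20, noise=1.0, trials=5000, seed=0):
    """Return (mean inflation of the top estimate, share of trials picking the true best)."""
    rng = random.Random(seed)
    inflation = []
    hits = 0
    for _ in range(trials):
        true = [rng.gauss(0, 1) for _ in range(n_options)]
        # Each estimate is unbiased: true value plus zero-mean noise.
        est = [t + rng.gauss(0, noise) for t in true]
        top = max(range(n_options), key=est.__getitem__)        # apparent winner
        true_best = max(range(n_options), key=true.__getitem__)  # actual winner
        inflation.append(est[top] - true[top])
        if top == true_best:
            hits += 1
    return statistics.mean(inflation), hits / trials

mean_inflation, hit_rate = optimisers_curse()
print(f"mean inflation of the winner's estimate: {mean_inflation:.2f}")
print(f"apparent winner is the true best in {hit_rate:.0%} of trials")
```

Each individual estimate is unbiased, yet the estimate of whichever option comes out on top is biased upward, because selecting the maximum also selects for lucky noise.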

When you have an incomplete action set

The first justification for exploration (i.e. optimal allocations which sample actions besides the current top few) is that we just do not know all of the actions that are available. This idea is already mainstream in the form of “Cause X” and Cause Exploration, so all we’re doing is retroactively justifying these. See also MacAskill’s discussion on Clearer Thinking.

When you have a time-varying action set

If actions are only available for a window in time, then you have a much more complicated decision problem. At minimum, you need to watch for new opportunities (research which takes some of your allocation) and rerun the prioritisation periodically.

When we further consider actions with a short lifespan, we have even more reason to devote permanent resources to the order and timing of actions.

When you’re doing total allocation instead of marginal

When we’re allocating many people, we don’t get the optimal allocation by everyone individually doing argmax prioritisation, since we need to take into account diminishing returns, limiting factors (“room for more funding”, “the mentor bottleneck”, and so on), and other scaling effects.
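A toy illustration of the point about diminishing returns, assuming (purely for the example) that each cause's impact grows as a square root of the resources it receives, with invented productivity coefficients. With square-root returns, equalising marginal returns across causes gives each cause a share proportional to its coefficient squared.

```python
import math

def total_impact(allocation, scale):
    """Sum of a_i * sqrt(x_i): an assumed diminishing-returns production function."""
    return sum(a * math.sqrt(x) for a, x in zip(scale, allocation))

scale = [3.0, 2.0, 1.0]   # hypothetical productivity of three causes
budget = 100.0            # hypothetical total resources

# Everyone individually argmaxes: all resources go to the top cause.
all_in = [budget, 0.0, 0.0]

# Total allocation: maximising sum a_i*sqrt(x_i) subject to sum x_i = budget
# gives x_i proportional to a_i**2 (set the marginal returns equal).
weights = [a ** 2 for a in scale]
spread = [budget * w / sum(weights) for w in weights]

print("all-in impact:", round(total_impact(all_in, scale), 2))
print("spread impact:", round(total_impact(spread, scale), 2))
print("spread allocation:", [round(x, 1) for x in spread])
```

Under these assumptions the spread allocation beats the all-in one, even though everyone agrees which cause is ranked first. A marginal donor adding a single unit would still do best by funding whichever cause currently has the steepest slope.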

Fin gives a nice visual overview of the production functions involved. As well as letting us exploit interventions with increasing returns to scale, total allocation means you have the resources to escape local maxima. This bears on the perennial “systemic change” critique: that critique is ~true when you are large!

When you have light tails

The power-law distribution of global health charity impacts was the key founding insight of EA. But there is nothing inevitable about altruistic acts taking this distribution, and Covid policy is a good example: the effects were mostly pretty additive, or multiplicative in the unfortunate way (when each effect is bounded in [0,1] and coupled, so that failure at one diminishes your overall impact). Argmax is much less bad under heavy tails.
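The "unfortunate" multiplicative case can be shown with a small worked example. The numbers are invented: ten coupled components, each scored in [0, 1], with overall impact taken to be their product.

```python
import math

def coupled_impact(effects):
    """Overall impact when components multiply, each bounded in [0, 1]."""
    if not all(0.0 <= e <= 1.0 for e in effects):
        raise ValueError("each effect must lie in [0, 1]")
    return math.prod(effects)

# Hypothetical strategies across ten coupled policies.
silver_bullet = [1.0] + [0.5] * 9   # one policy perfected, the rest neglected
many_done_well = [0.9] * 10          # every policy implemented fairly well

print("silver bullet:", round(coupled_impact(silver_bullet), 4))
print("many done well:", round(coupled_impact(many_done_well), 4))
```

Here going all-in on one component gives about 0.002, while doing all ten fairly well gives about 0.35, since any weak component drags down the whole product.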

Appendix: Softmax against Covid

With Covid in February 2020, it was hard to estimate the importance, tractability, or neglectedness of different actions; most options were unknown to Jan, and the impact of a given actor was highly contextual. Also, it was possible to get into a position to influence the total allocation of resources.

When planning his Covid work, Jan usually relied on a softmax heuristic: repeatedly re-evaluating his estimates of the options, putting extra weight on the value of information. While he’s happy with the results, on some occasions it was difficult to explain what he was trying to do to people who grew up with ITN and argmax prioritisation more generally.

Future crises will have different subsets of assumptions holding. Granted that the softmax approach is important, and sometimes practically diverges from the argmax, it seems good to have more people think about the tradeoff between argmax and softmax, exploration, VoI, and so on.

In dynamic environments like crises and politics, you benefit from rapidly re-evaluating actions and chaining together sequences of low-probability outcomes. ("Can I get invited to the ‘room where it happens’? Can I be then taken seriously as a scientist? Can I then be appointed to a committee? Can we then write unprecedented policy? Can that policy then be enforced?")

Classify crises by how many things you need to get right to solve them. In some crises, discovering the single silver bullet and investing in it heavily is the best approach. In others, solutions require e.g. 8 out of 10 pieces in place, and getting all of them highly functional.

Covid, at least after spreading out of Wuhan, seems closer to the second case: the winning strategy consisted of many policies, implemented well. Countries which did even better (e.g. Taiwan) did even more: "do a hundred different things, some of them won't have any effect, but all together it’ll work".

Note that this goes against the community instinct that some solutions will be much more effective than others. This expectation is often unsound outside the original context of global health projects. (Again, this context models the marginal returns to investing more or less in given actions, assuming we have no (market) power. But national policy is closer to total allocation than marginal allocation: you can have profound effects on other actors’ incentives, up to and including full power.)

I.e. emergency policy impact differs from global health NGO impact! Governments are better modelled as performing total allocation or as monopsonies. And Covid national policy arguably didn’t have a single "silver bullet" policy with 10x the impact of the next best policy. ^[6]

Appendix: Enlightened argmax

In theory there’s nothing stopping the argmaxer and their argmaxer friends from doing something cleverer. For instance, you should argmax over lots of local factors (like each person’s comparative advantage and personal fit, the room for funding each cause has, the mentor bottlenecks, etc), which would let you reach quite different answers than the naive “what would be best, being agnostic about individual factors and bottlenecks?” we do to produce consensus reality.

Done optimally (i.e. by a set of perfectly coordinated epistemic peers with all the necessary knowledge of every cause), argmax could get all the good stuff. But in practice, people can't compute the argmax for themselves (well), so we rely on cached calculations from a few central estimators, which produces the above issues. A lot of us simply take the output of 80k's argmax and add a little noise from personal fit.

e.g. Does argmax really say to not explore?

If you have perfectly equal estimates of the impact of everything, argmax sensibly says you must explore to distinguish them better. But once you know a bit, or if your estimates are noisy and something appears slightly better, a naive argmax over actions tells you to do just that one thing.

You can always rescue argmax by going meta – instead of taking the argmax over object-level actions, you take the argmax over decision procedures. “What way of deciding actions is best? Enumerate, rank, and saturate.” (This is a relative of what Michael Nielsen calls “EA judo”.)

Still: yes, you can reframe the whole post as "EAs are often using argmax at the wrong level, evaluating ‘actions’ (like career choices). This is often suboptimal: they should be using argmax on the meta-level of decision procedures and community portfolios".

FAQ

Q: It's not clear to me that “softmax” prioritisation avoids flickering? Specifically: at high temperature, there's lots of moving around. At low temperature, it converges to argmax, which supposedly has the flickering problem!

A: Yes, this is vague. This is because there are several levels at play: the literal abstract algorithm, the EA community’s pseudoalgorithm mixture, and each individual realising that it is or should be the community-level algorithm.

First, assume a middling temperature (if nothing else because human optimisation efforts are so young). "Softmax" prioritisation says that it is normative for some people to work on the second and third, even if everyone agreed on the ranking. Flickering is suppressed at the level of an individual switching their cause (because we all recognise that some stickiness is optimal for the community). That’s the idea anyway.

Q: How do I do this in practice? Am I even supposed to calculate anything?

A: If you share all of the intuitions already there's maybe not much point. But I expect thinking in terms of temperature and community portfolio to help, even if it's informal. One stronger formal claim we can make is that "the type signature of prioritisation is W --> D", where

W = a world model including uncertainty and a gap where missing options should be
D = a distribution of weights over causes.
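One informal way to read that type signature is as a function from a world model to a weight distribution. The sketch below is only an illustration: the field names, the softmax rule, the reserved share for undiscovered options, and all the numbers are our assumptions rather than a prescribed method.

```python
import math
from dataclasses import dataclass

@dataclass
class WorldModel:
    """W: hypothetical fields standing in for a world model with uncertainty."""
    estimates: dict[str, float]  # point estimates of each known cause's value
    temperature: float           # stand-in for how uncertain we are overall
    unknown_share: float         # the "gap": weight reserved for missing options

def prioritise(world: WorldModel) -> dict[str, float]:
    """W -> D: map a world model to a distribution of weights over causes."""
    m = max(world.estimates.values())
    exps = {c: math.exp((v - m) / world.temperature)
            for c, v in world.estimates.items()}
    total = sum(exps.values())
    known = 1.0 - world.unknown_share
    dist = {c: known * e / total for c, e in exps.items()}
    dist["<unknown options>"] = world.unknown_share  # explicit exploration budget
    return dist

# Made-up example inputs.
world = WorldModel(
    estimates={"cause A": 2.0, "cause B": 1.5, "cause C": 0.5},
    temperature=1.0,
    unknown_share=0.1,
)
dist = prioritise(world)
for cause, weight in dist.items():
    print(f"{cause}: {weight:.0%}")
```

The output is a distribution rather than a single winner, and it carries an explicit slot for options that have not been found yet.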

Q: Is this post about creating defensibility for doing more normal things? That is, making it easier to persuade someone who needs a new formalism ("softmax") to be moved to depart from an earlier formalism ("argmax"). There's something broken about banning sensible intuitions if they don't align with a formalism. If this is the way a large part of the community works, you might have to play up to it anyway. If that is indeed the context & goal, details are probably irrelevant. But I think it would be more honest to say so.

A: It’s somewhere in between a methodology and a bid to shift social reality. I can't deny that maths is therapy for a certain part of the audience. But I actually do think softmax is better than existing justifications for sensible practices. We made up a bunch of bad coping mechanisms for our own argmax rhetoric. We want the sensible practices without the cognitive dissonance and suboptimal spandrels.

^[1] This form is a legacy of argmax prioritisation’s original context: evaluating global health charities. It seems to work quite well there because the assumptions are not violated very hard. (Though note that rank uncertainty continues to bite, even in this old and intensely studied domain.) The trouble is when it is used as a universal prioritisation algorithm where it will reliably fail.
^[2] Worldview diversification is way less ad hoc than the other things: it is a lot like Thompson sampling.
^[3] This is changing! OpenPhil recently commissioned Tom Adamczewski to build this cool tool.
^[4] We wrote this with government policy advice in mind as a route to affecting large fractions of the total allocation. But it also applies to optimistic scenarios of EA scaling.
^[5] OK, it’s also high-status to pontificate about the problems of EA-as-a-whole, like us.
^[6] At the individual level, there were some silver bullets, like not going to the pub or wearing a P100 mask.

[anonymous] (4y)

I am not quite sure this is correct.

Take the classic investment advice "a mixture of stocks AND bonds". For the purpose of this post assume the classic advice is correct.

What that advice is saying is: "stocks are the best option normally, but bonds act as a hedge".

Meaning that, conditional on stocks going down, bonds tend to hold more of their value.

So this isn't softmax. Softmax would say "invest in the best past performing stock fund and also spread money around to the other stocks". (Argmax would be "all in on Berkshire Hathaway" or whatever)

"Conditional probability hedging" is something else. It's taking the most probable bad outcome from your top choice and reevaluating your actions, assuming that bad outcome actually happens.

For example a robot picking something up might consider the most probable bad outcome to be the item was dropped, and it might choose to take some action to mitigate if that happens. (If it has 2 arms it could put the second one in the path of the most probable way the item could fall)

harsimony (4y)

I like this intuition and it would be interesting to formalize the optimal charitable portfolio in a more general sense.

I talked about a toy model of hits-based giving which has a similar property (the funder spends on projects proportional to their expected value rather than on the best projects):

https://ea.greaterwrong.com/posts/eGhhcH6FB2Zw77dTG/a-model-of-hits-based-giving

Updated version here: https://harsimony.wordpress.com/2022/03/24/a-model-of-hits-based-giving/

Donald Hobson (4y)

Suppose you have several different, similarly good plans to build a nuke. (Sure, not an EA goal, but it makes the rest of the analogy work.) You have a single critical mass of uranium. Trying to split it between projects would guarantee that none succeeds, since none would have the critical mass. Switching is costly. Sometimes, gains aren't sublinear. If you need a minimum scale to get stuff done, and are faced with 2 equally good projects and only resources for 1, sometimes the best solution is to pick one project, put all your resources on that, and stick with it (unless it goes really seriously wrong).

Dagon (4y)

Needs examples and worked calculations. I can't tell if it's obvious or not - I've only ever heard the softer/more complete version - "put your effort into the most important thing that needs more resources". Uncertainty could play a part here, but for most cases, simple declining returns to effort, and changing relative importance as "the most important" gets partly addressed, cover the need to vary.

Likewise, the difference between urgent and important leads to some diversity in effort. In order to optimize for the long term, you have to GET TO the long term.
