Cool post! I’m curious what class of problems you think this would work well for (besides the weight exfiltration example) and what grading schemes would work for different problems. In technical alignment, many problems aren’t well-specified / don’t have robust metrics that can grade solutions. For specific problems that are very well-specified and have robust metrics, like weak to strong supervision, we may already be able to have an AI researcher iterate on them.
I am not very confident about the first question, but I'm heartened by the fact that other people have had ideas when I asked them. The most obvious thing is probably formal verification, either for a pre-existing model or, more likely, for a new model built to verifiably have some nice properties. More practically, one could use trusted judges along with a less formal rubric (for problems where perfect specification is too hard), although this requires building trust over time.
The second point I'm less worried about at the moment; AI researchers are good at working on problems that have robust, continuous metrics, e.g. the classic RL setting where improvement gives a clear signal even if they don't solve the whole thing. I think they are worst at exactly the kind of challenge inducement prizes should attack: where the final outcome can be made quite explicit, but we don't know what methods would get us there.
This was my submission to Dwarkesh's essay contest, which had a 1,000-word limit. If you want a longer version, or have specific questions, please comment.
I answered the question "If you were in charge of the OpenAI Foundation right now, what exactly would you do?"
I'm unsure how valuable marginal push funding in the current AI safety ecosystem is, which is obviously a fundamental consideration when deciding whether to put money into pull funding instead. Still, the fact that I have seen practically nothing about this topic makes me think it probably deserves far more discussion than it's received.
We need a way to turn money into effective institutions. If only we knew of some–
The Profit Incentive
The vast majority of philanthropic money is push funding, supporting people and organizations that already do good work. In AIS, this happens via salaried positions in frontier labs, as well as grantmaking in academia and the nonprofit sector.
Pull funding pays for outcomes rather than processes. It's best suited to social goods that do not generate profit, and to cases where the desired outcome is clear but the path there is not. There are many such cases in AI safety: I do not know which researchers or organizations I would fund in order to find a cheap security intervention that reduces model-weight exfiltration risk by some percentage according to independent superforecasters. I do know that that would be good.
Pull mechanisms of the past have either been limited in funding (<$200m), like DARPA's, or have targeted massive pre-existing industries, like the Pneumococcal AMC. What would happen if billions of dollars were committed to novel AIS goals? Investors could fund promising contestants in exchange for equity, just as they do with other high-upside companies. For-profit funders one-shot Nan's qualifications: they resource generously because it will make them money. They give leeway because it will make them money. They invest in whatever problems make money, and AIS experts in turn decide which problems those are via the allocation of prize funds.
GMs then float over to the cartoon-pie smell of the prize money. We get the founder types and the Wall Street efficiency, the same forces that have so swiftly funneled talent and structure into AI capabilities research. Industries pop up to chase piles of money every year. Surely, similarly sized piles of money for a noble cause, carrying the prestige of a major prize, won't do worse.
I want to note that these GMs won't be laser-focused on AI safety; they'll be laser-focused on whatever we specify. After they get their cash, they might be gone. I don't think this flaw is fatal; having a GM for whatever specifiable outcome we choose is still quite good, and we could design prizes that naturally build on each other. It once again emphasizes, though, how carefully we must choose our targets.
Pull funding has generated some spectacular successes, but right now institutional funders largely avoid it because of its challenges. I think that, with careful planning, these challenges can be overcome.
Challenges
We are most interested in inducement prizes, where a fixed award is set aside for any team that meets some specified criteria. The Netflix Prize is a great example: teams searched for an improved algorithm to predict users' movie ratings from their past activity, with success measured by lowering prediction error (RMSE) on a held-out test set by at least 10% relative to Netflix's own system. Whoever got to the finish line first won the pot. After only three years, three teams combined their efforts to win a million dollars.
XPRIZE uses a competition format. Qualifying teams in their water desalination contest will meet for a semifinal later this year and attempt to desalinate 100k liters per day for a month. If they succeed, they win 2 million dollars and go to the finals in 2028, where they spend a year desalinating a million liters per day. Whichever team succeeds “most sustainably” at the “lowest total cost” wins.
XPRIZE’s guidelines describe the sustainability goals as “marine-friendly intake design, brine resource recovery… and system footprint.” Well, how will each of these be scored? Do the judges care more about brine resource recovery or about system footprint?
XPRIZE waves its hands toward some expert judges, and this is apparently enough to motivate 143 qualified teams, but the wishy-washiness has backfired before. In a prior prize, XPRIZE imposed new rules mid-contest, and many competitors dropped out. Then, after the finals, XPRIZE didn’t shell out a grand prize at all: several contestants had passed the objective rubric they’d received but failed the subjective judging. Contestants were furious. The whole thing was a fiasco.
So, caution is key not only in choosing which objectives to fund, but also in the mechanism design.
Take the Pneumococcal AMC. The CGD spent a Claude-estimated one to two million dollars drafting a plan to successfully run an AMC for global vaccine development. Senior professionals in global development, law, biology, and more worked on the report. With hindsight, this investment of roughly 0.1% of the eventual $1.5 billion AMC is common sense. With billions coming our way, we are in a similar position.
The Plan
With the help of a few million dollars, can we specify AIS outcomes that matter a lot and that current programs aren’t on track to achieve? Can we then give them aggregate valuations? I think yes.
We start with a long list of outcomes, so that experts and open dialogue can sort out the very best; this is the time for “blue sky” prizes, a solicitation of great ideas in which the best contestants win money and get their ideas boosted. Coefficient and GiveWell both ran such prizes and were happy with the results. Each contest cost only about 100,000 dollars, and they received 150 and 49 quality submissions respectively. We should spend much more.
Then, we must judge the value and tractability of these goals. With billions at stake, many talented and trusted AIS researchers will lend their expertise to valuations of the outcomes (e.g. $1.2 billion for automated circuit tracing as defined by XYZ, $80 million for a one-hour meeting between Dr. AISafety and the president, etc.). Many hours should be spent just on the design of this valuation conference; one could, for example, give the experts their own play funds to allocate across issues, implicitly yielding relative valuations.
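As a toy illustration of the play-fund idea (my own sketch, not part of the essay): each expert splits the same play budget across candidate outcomes, the shares are averaged, and the averages are scaled to the real prize pool to produce implied dollar valuations. The outcome names, expert labels, budget sizes, and the three-billion-dollar pool below are made-up placeholders.

```python
# Hypothetical sketch: turning expert "play fund" allocations into implied
# dollar valuations. All names and numbers are illustrative placeholders.

PRIZE_POOL = 3_000_000_000  # assumed total pull-funding commitment, in dollars

# Each expert splits a 100-point play budget across candidate outcomes.
allocations = {
    "expert_a": {"automated circuit tracing": 60, "weight-exfiltration security": 30, "presidential briefing": 10},
    "expert_b": {"automated circuit tracing": 40, "weight-exfiltration security": 50, "presidential briefing": 10},
}

def implied_valuations(allocations, prize_pool):
    """Average each outcome's share of the play budgets, then scale to dollars."""
    outcomes = {o for alloc in allocations.values() for o in alloc}
    valuations = {}
    for outcome in outcomes:
        shares = [alloc.get(outcome, 0) / sum(alloc.values()) for alloc in allocations.values()]
        valuations[outcome] = prize_pool * sum(shares) / len(shares)
    return valuations

for outcome, dollars in sorted(implied_valuations(allocations, PRIZE_POOL).items()):
    print(f"{outcome}: ${dollars:,.0f}")
```

The point of the fixed play budget is that it forces each expert to express relative priorities rather than rating everything as maximally important.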
Mission-aligned lawyers, economists, forecasters, grantmakers, and whoever else will carefully establish the tractability of pull mechanisms for each goal: the contract law, the registration procedure, and all the other practicalities. Then we let the dogs loose.