Thanks for sharing this! I'm curious if you have any takes on Nate's comment or Oliver's comment:
Nate:
I don't think we have any workable plan for reacting to the realization that dangerous capabilities are upon us. I think that when we get there, we'll predictably either (a) optimize against our transparency tools or otherwise walk right off the cliff-edge anyway, or (b) realize that we're in deep trouble, and slow way down and take some other route to the glorious transhumanist future (we might need to go all the way to WBE, or at least dramatically switch optimization paradigms).
Insofar as this is true, I'd much rather see efforts go _now_ into putting hard limits on capabilities in this paradigm, and booting up alternative paradigms (that aren't supposed to be competitive with scaling, but that are hopefully competitive with what individuals can do on home computers). I could see evals playing a role in that policy (of helping people create sane capability limits and measure whether they're being enforced), but that's not how I expect evals to be used on the mainline.
Oliver:
I have a generally more confident take that slowing things down is good, i.e. I don't find arguments that "current humanity is better suited to handle the singularity" very compelling.
I think I am also more confident that it's good for people to openly and straightforwardly talk about existential risk from AI.
I am less confident in my answer to the question of "is generic interpretability research cost-effective or even net-positive?". My guess is still yes, but I really feel very uncertain, and feel a bit more robust in my answer to your question than that question.
My team at Open Philanthropy just launched two requests for proposals:
I think creating a shared scientific understanding of where LLMs are at has large benefits, but it can also accelerate AI capabilities: for example, it might demonstrate possible commercial use cases and spark more investment, or it might allow researchers to more effectively iterate on architectures or training processes. Other things being equal, I think acceleration is harmful because we’re not ready for very powerful AI systems — but I believe the benefits outweigh these costs in expectation, and think better measurements of LLM capabilities are net-positive and important.
To get a sense for whether acting on this belief by launching these two RFPs would constitute falling prey to the unilateralist’s curse, I sent a survey about whether funding this work would be net-positive or net-negative to 47 relatively senior people who have been full-time working on AI x-risk reduction for multiple years and have likely thought about the risks and benefits of sharing information about AI capabilities.
Out of the 47 people who received the survey, 30 people (64%) responded. Of those, 25 out of 30 said they were “Positive” or “Lean positive” on the RFP, and only 1 person said they were “Lean negative,” with no one saying they were “Negative.” The remaining four people said they had “No idea,” meaning that 29 out of 30 respondents (97%) would not vote to stop the RFPs from happening. With that said, many respondents (~37%) felt torn about the question or considered it complicated.
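As a quick sanity check on the arithmetic above, here is a minimal Python sketch that recomputes the headline percentages from the counts reported in this post. The variable names and the `pct` helper are illustrative assumptions, not part of the survey materials.

```python
# Counts as reported in the post; names are illustrative, not from the survey.
surveyed = 47
responded = 30
overall = {
    "positive_or_lean_positive": 25,
    "lean_negative": 1,
    "negative": 0,
    "no_idea": 4,
}


def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# The four buckets should account for every respondent.
assert sum(overall.values()) == responded

response_rate = pct(responded, surveyed)

# "Would not vote to stop the RFPs" = everyone except the negative
# and lean-negative respondents.
not_opposed = responded - overall["lean_negative"] - overall["negative"]
not_opposed_share = pct(not_opposed, responded)

print(f"Response rate: {response_rate}%")
print(f"Not opposed: {not_opposed} of {responded} ({not_opposed_share}%)")
```

The "not opposed" figure counts the "No idea" respondents alongside the positive ones, which is why it is higher than the share who were actually positive or lean positive.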
The rest of this post provides more detail on the information that the survey-takers received and the survey results (including sharing answers from those respondents who gave permission to share).
The information that was sent to the survey-takers
The survey-takers received the below email, which links to a one-pager on the risks and benefits of these RFPs, and a four-pager (written in late July and early August) about the sorts of projects I expected to fund. After the survey, the latter document evolved into the public-facing RFPs here and here.
The survey results in more detail
Who took the survey
Out of the 30 survey respondents, 17 people (~57%) gave me permission to share the fact that they responded to the survey:
Of these 17 people, 8 gave me permission to share some portion of their responses; I’ve collected these at the end.
Answers to multiple choice and numerical questions
The survey consisted of five substantive (non-meta / procedural) questions, three of which were multiple choice or numerical, and two of which were text box responses elaborating on one of the multiple choice or numerical questions. The distributions of answers to the multiple choice and numerical questions are given in this section.
Instinct: A slim majority of respondents feel instinctively positive, and many feel torn
The first multiple choice or numerical question of the survey asks about respondents’ initial instincts about the RFP:
There were four response choices: “Yay, I like it!”; “Ugh, can you not?”; “I don’t have much of an instinct”; and “Uhhhh I’m torn / it’s complicated.”
While it was optional, all thirty respondents chose to answer it. These were their responses:
A slim majority (16) had a positive initial instinct, and a large minority (14) did not, with most of the latter group (11) feeling torn.
Independence: Most respondents consider themselves to have independent views
The second multiple choice or numerical question asks about respondents’ level of deference to others on this kind of question:
The response was given as a numerical scale from 1 to 5. Here, 1 was labeled “I’m almost entirely deferring to others” and 5 was labeled “I have a very well-developed independent view.”
All thirty respondents answered this question as well. These are the results:
I chose to send the survey to people who I thought would have independent views, and indeed the majority (25 people, 83%) were a 4 or a 5 out of 5.
Overall view: Most respondents are positive or lean positive on the RFPs
The most important question of the survey asks:
It gives five options:
These are the results:
Numerically, they were as follows:
Specific respondents’ answers
Some survey-takers gave permission for a portion of their answers to be shared publicly. The multiple choice and numerical answers of those who gave permission to share them are given below:
Four of these people gave permission to share free text responses in addition to their multiple choice responses; they are copied below:
Daniel Kokotajlo’s full response
Instinct: Yay, I like it!
Independence: 5
Overall view: Positive (and I'm about as confident as I ever get on debated questions of AI strategy)
Free response:
I think this might be less good than your opportunity cost, i.e. I'm at like 50% that there is something better for you to be doing with your time.
And I'm not confident it's net-positive. But I'm about as confident that it's net-positive as I ever am about AI strategy questions.
Jonathan Uesato’s full response
Instinct: Yay, I like it!
Independence: 5
Overall view: Positive (and I'm about as confident as I ever get on debated questions of AI strategy)
Free response:
Nate Soares’s full response
Instinct: Uhhhh I’m torn / it’s complicated
Independence: 5 out of 5
Overall view: No idea
Free response:
Oliver Habryka’s full response
Instinct: Uhhhh I’m torn / it’s complicated
Independence: 5 out of 5
Overall view: Lean positive (but I'm not as confident about this as I am about some other AI strategy debates)
Free response:
Note that ARC Evals itself would not be eligible to apply for this RFP, because I am married to Paul Christiano, the Executive Director of its parent org Alignment Research Center.
Note: I got no suggestions of this form in the relevant survey section(s).
It was a combined RFP in the original draft, which got split into two after further iteration.
The provided description text reads: