I agree with ~all of your subpoints but it seems like we disagree in terms of the overall appraisal.
Thanks for explaining your overall reasoning though. Also big +1 that the internal deployment stuff is scary. I don’t think either lab has told me what protections they’re going to use for internally deploying dangerous (~ASL-4) systems, but the fact that Anthropic treats internal deployment like external deployment is a good sign. OpenAI at least acknowledges that internal deployment can be dangerous through its distinction between high risk (can be internally deployed) and critical risk (cannot be), but I agree that the thresholds are too high, particularly for model autonomy.
RE 1 & 2:
Agreed – my main point here is that the marketplace of ideas undervalues criticism.
I think one perspective could be “we should all just aim to do objective truth-seeking”, and as stated I agree with it.
The main issue with that frame, imo, is that it’s very easy to forget that the epistemic environment can be tilted in favor of certain perspectives.
EG I think it can be useful for “objective truth-seeking efforts” to be aware of some of the culture/status games that underincentivize criticism of labs & amplify lab-friendly perspectives.
RE 3:
Good to hear that responses to lab watch have been positive. My impression is that this is a mix of: (a) lab watch doesn't really threaten the interests of labs (especially Anthropic, which is currently winning & the favorite lab among senior AIS ppl), (b) the tides have been shifting somewhat and it is genuinely less taboo to criticize labs than a year ago, and (c) EAs respond more positively to criticism that feels detailed/nuanced (look, I have these 10 categories, let's rate the labs on each dimension) than to criticisms that are more about metastrategy (e.g., challenging the entire RSP frame or advocating for policymaker outreach).
RE 4: I haven't heard anything about Conjecture that I've found particularly concerning. Would be interested in you clarifying (either here or via DM) what you've heard. (And to clarify, my original point was less "Conjecture hasn't done anything wrong" and more "I suspect Conjecture will be more heavily scrutinized, and have a disproportionate amount of optimization pressure applied against it, given its clear push for things that would hurt lab interests.")
My current perspective is that criticism of AGI labs is an under-incentivized public good. I suspect there's a disproportionate amount of value that people could have by evaluating lab plans, publicly criticizing labs when they break commitments or make poor arguments, talking to journalists/policymakers about their concerns, etc.
Some quick thoughts:
With all this in mind, I find myself more deeply appreciating folks who have publicly and openly critiqued labs, even in situations where the cultural and economic incentives to do so were quite weak (relative to staying silent or saying generic positive things about labs).
Examples: Habryka, Rob Bensinger, CAIS, MIRI, Conjecture, and FLI. More recently, @Zach Stein-Perlman, and of course Jan Leike and Daniel K.
I personally have a large amount of uncertainty around how useful prosaic techniques & control techniques will be. Here are a few statements I'm more confident in:
It's still plausible to me that perhaps this period of a few months is enough to pull off actions that get us out of the acute risk period (e.g., use the ASL-4 system to generate evidence that controlling more powerful systems would require years of dedicated effort and have Lab A devote all of their energy toward getting governments to intervene).
Given my understanding of the current leading labs, it's more likely to me that they'll underestimate the difficulties of bootstrapped alignment and assume that things are OK as long as empirical tests don't show imminent evidence of danger. I don't think this prior is reasonable in the context of developing existentially dangerous technologies, particularly technologies that are intended to be smarter than you. I think sensible risk management in such contexts should require a stronger theoretical/conceptual understanding of the systems one is designing.
(My guess is that you agree with some of these points and I agree with some points along the lines of "maybe prosaic/control techniques will just work, we aren't 100% sure they're not going to work", but we're mostly operating in different frames.)
(I also do like/respect a lot of the work you and Buck have done on control. I'm a bit worried that the control meme is overhyped, partially because it fits into the current interests of labs. Like, control seems like a great idea and a useful conceptual frame, but I haven't yet seen a solid case for why we should expect specific control techniques to work once we get to ASL-4 or ASL-4.5 systems, or for what we plan to do with those systems to get us out of the acute risk period. Like, the early work on using GPT-3 to evaluate GPT-4 was interesting, but it feels like the assumption that the human red-teamers are better at attacking than GPT-4 will go away – or at least be much less robust – once we get to ASL-4. But I'm also sympathetic to the idea that we're at the early stages of control work, and I am genuinely interested in seeing what you, Buck, and others come up with as the control agenda progresses.)
@Zach Stein-Perlman I appreciate your recent willingness to evaluate and criticize safety plans from labs. I think this is likely a public good that is underprovided, given the strong incentives that many people have to maintain a good standing with labs (not to mention more explicit forms of pressure applied by OpenAI and presumably other labs).
One thought: I feel like the difference between how you described the Anthropic RSP and how you described the OpenAI PF feels stronger than the actual quality differences between the documents. I agree with you that the thresholds in the OpenAI PF are too high, but I think the PF should get "points" for spelling out risks that go beyond ASL-3/misuse.
OpenAI has commitments that are insufficiently cautious for ASL-4+ (or what they would call high/critical on model autonomy), but Anthropic circumvents this problem by simply refusing to make any commitments around ASL-4 (for now).
You note this limitation when describing Anthropic's RSP, but you describe it as "promising" while describing the PF as "unpromising." In my view, this might be unfairly rewarding Anthropic for just not engaging with the hardest parts of the problem (or unfairly penalizing OpenAI for giving their best guess answers RE how to deal with the hardest parts of the problem).
We might also just disagree on how firm or useful the commitments in each document are – I walked away with a much better understanding of how OpenAI plans to evaluate & handle risks than of how Anthropic plans to do so. I do think OpenAI's thresholds are too high, but it's likely that I'll feel the same way about Anthropic's thresholds. In particular, I don't expect either (any?) lab to be able to resist the temptation to internally deploy models with autonomous persuasion capabilities or autonomous AI R&D capabilities (partially because the competitive pressures and race dynamics pushing them to do so will be intense). I don't see evidence that either lab is taking these kinds of risks particularly seriously, has ideas about what safeguards would be considered sufficient, or is seriously entertaining the idea that we might need to do a lot (>1 year) of dedicated safety work (potentially involving major theoretical insights, as opposed to a "we will just solve it with empiricism" perspective) before we are confident that we can control such systems.
TLDR: I would remove the word "promising" and maybe characterize the RSP more like "an initial RSP that mostly spells out high-level reasoning, makes few hard commitments, and focuses on misuse while missing the all-important evals and safety practices for ASL-4."
Thanks for this (very thorough) answer. I'm especially excited to see that you've reached out to 25 AI gov researchers & already have four governance mentors for summer 2024. (Minor: I think the post mentioned that you plan to have at least 2, but it seems like there are already 4 confirmed and you're open to more; apologies if I misread something though.)
A few quick responses to other stuff:
Thanks, Neel! I responded in greater detail to Ryan's comment but just wanted to note here that I appreciate yours as well & agree with a lot of it.
My main response to this is something like "Given that MATS selects the mentors and selects the fellows, MATS has a lot of influence over what the fellows are interested in. My guess is that MATS' current mentor pool & selection process overweights interpretability and underweights governance + technical governance, relative to what I think would be ideal."
Thanks for these explanations– I think they're reasonable & insightful. A few thoughts:
Most of the scholars in this cohort were working on research agendas for which there are world-leading teams based at scaling labs
I suspect there's probably some bidirectional causality here. People want to work at scaling labs because they're interested in the research that scaling labs are doing, and people want to focus on the research the scaling labs are doing because they want to work at scaling labs.
There seems to be an increasing trend in the AI safety community towards the belief that most useful alignment research will occur at scaling labs
I think this is true among a subset of the AI safety community, but I don't think it characterizes the AI safety community as a whole. For example, another (even stronger, IMO) trend in the AI safety community has been toward the belief that policy work & technical governance work are more important than many folks previously expected (see EG Paul joining USAISI, MIRI shifting to technical governance, UKAISI being established, not to mention the general surge in interest among policymakers).
One perspective on this could be "well, MATS is a technical research program, and we're adding some governance mentors, so shrug." Another perspective on this could be "well, it seems like perhaps MATS is shifting more slowly than one might've imagined, resulting in a culture/ecosystem/mentor cohort/selection process/fellow cohort that disproportionately wants to join scaling labs."
RE shifting more slowly or having a disproportionate focus: note that the ERA fellowship has pivoted toward governance and technical governance – 2/3 of their fellows will be focused on governance + technical governance projects. I'm not necessarily saying this is what would be best for MATS, but it at least shows that MATS' focus on incubating "technical researchers that want to work at scaling labs" is part of its design rather than an inevitability.
I might be a bit "biased" in that I work in AI policy and my worldview generally suggests that AI policy (as well as technical governance) is extremely neglected. I personally think it's harder to make the case that giving scaling labs better alignment talent is as neglected– it's still quite important, but scaling labs are extremely popular & I think their ability to hire (and pay for) top technical talent is much stronger than that of governments.
Anecdotally, scholars seemed generally in favor of careers at an AISI or evals org, but would prefer to continue pursuing their current research agenda
Again, I think my primary response here is something like: the research interests of the MATS cohort are a function of the program and its selection process, not an immutable characteristic of the world. The ERA example is a "strong" example of prioritizing people with other interests, but I imagine there are plenty of "weaker" things MATS could be doing to select/prioritize fellows who have an interest in governance & technical governance. (Or put differently, my guess is that there are ways in which the current selection process and mentor pool disproportionately attract/favor those who are interested in the kinds of topics you mentioned.)
If I could wave a magic wand, I would probably have MATS add many more governance & technical governance mentors and shift to something closer to ERA's breakdown. This would admittedly be a rather big shift for MATS, and perhaps current employees/leaders/funders wouldn't want to do it. I think it ought to be seriously considered, though, and if I were a MATS exec person or a MATS funder I would probably be pushing for this. Or at least asking some serious questions along the lines of "do we really feel like the most impactful thing a training program could be doing right now is serving as an upskilling program for the scaling labs?" (With all due respect to the importance of getting great people to the scaling labs, acknowledging the importance of technical research at scaling labs, agreeing with some of Neel's points etc.)
Thank you for explaining the shift from scholar support to research management— I found that quite interesting and I don’t think I would’ve intuitively assumed that the research management frame would be more helpful.
I do wonder if as the summer progresses, the role of the RM should shift from writing the reports for mentors to helping the fellows prepare their own reports for mentors. IMO, fellows getting into the habit of providing these updates & learning how to “manage up” when it comes to mentors seems important. I suspect something in the cluster of “being able to communicate well with mentors//manage your mentor+collaborator relationships” is one of the most important “soft skills” for research success. I suspect a transition from “tell your RM things that they include in their report” to “work with your RM to write your own report” would help instill this skill.
There are some conversations about policy & government response taking place. I think there are a few main reasons you don't see them on LessWrong:
If anyone here is interested in thinking about "40% agreement" scenarios or more broadly interested in how governments should react in worlds where there is greater evidence of risk, feel free to DM me. Some of my current work focuses on the idea of "emergency preparedness"– how we can improve the government's ability to detect & respond to AI-related emergencies.