Assuming we're working with near-frontier models (such that the cost of training them once is near the limit of what any institution can afford), we presumably can't actually retrain a model without the data. Are there ways to approximate this technique that preserve its appeal?
(Just to check my understanding, this would be a component of a sufficient-but-not-necessary solution, right?)
Just flagging that another cross-post has been collecting some comments: https://www.lesswrong.com/posts/xhKr5KtvdJRssMeJ3/anthropic-s-core-views-on-ai-safety
I mostly agree, but it's messy. I don't think it's obvious that a PhD is anywhere near the ideal way to pick up some of these skills, or that earning a PhD definitely means that you've picked them up, but PhD programs do include lots of nudges in these directions, and PhD-holders are going to be much stronger than average at most of this.
In particular, like Johannes said, doing a PhD is notoriously hard on mental health for a number of reasons, even at a more-supportive-than-average lab. So to the extent that PhD programs teach 'taking care of your mental health' and 'staying motivated when you're lost', it's often by throwing you into stressful, confusing work situations without great resources and giving you the degree if you figure out how to navigate them.
When I converse with junior folks about what qualities they’re missing, they often focus on things like “not being smart enough” or “not being a genius” or “not having a PhD.” It’s interesting to notice differences between what junior folks think they’re missing & what mentors think they’re missing.
This issue is real, and it's the thing that frustrates me most about alignment pipeline-building work in general right now. There are very likely some important formal/theoretical areas of alignment research that really do need to recruit mostly for something like 'genius'. But a lot more of the active work that's getting done (and way more of the hard-to-fill open jobs) depends much, much more on skills 1–5 here than on intelligence in that sense.
(This is on the margin. Here I'm focused on the actual population of people who tend to be interested in ML alignment research, so I'm baking in the assumption that all of the candidates could, say, get above-average grades in a STEM undergrad degree at a top-100 university if they tried.)
As someone who's supervised/trained ML researchers for ~8 years now, I'd pretty much always rather hire someone who's 90th-percentile on two or three of these skills than someone who's no better than 70th-percentile but has world-class IMO (or IOI) performance or a verified IQ of 160 or some other classic raw intelligence signal.
- A New York-based alignment hub that aims to provide talent search and logistical support for NYU Professor Sam Bowman’s planned AI safety research group.
:D
I think my lab is bottlenecked on things other than talent and outside support for now, but there probably is more that could be done to help build/coordinate an alignment research scene in NYC more broadly.
More organizations like CAIS that aim to recruit established ML talent into alignment research
This is somewhat risky, and should get a lot of oversight. One of the biggest obstacles to discussing safety in academic settings is that academics are increasingly turned off by clumsy, arrogant presentations of the basic arguments for concern.
+1. The combination of the high dollar amount, the subjective criteria, and the panel drawn from the relatively small/insular 'core' AI safety research community means that I expect this to look pretty fishy to established researchers. Even if the judgments are fair (I think they probably will be!) and the contest yields good work (it might!), I expect the benefit of that to be offset to a pretty significant degree by the red flags this raises about how the AI safety scene deals with money and its connection to mainstream ML research.
(To be fair, I think the Inverse Scaling Prize, which I'm helping with, raises some of these concerns, but the more precise/partially-quantifiable prize rubric, bigger/more diverse panel, and use of additional reviewers outside the panel mitigate them at least partially.)
Update: We did a quick follow-up study adding counterarguments, turning this from single-turn to two-turn debate, as a quick way of probing whether more extensive full-transcript debate experiments on this task would work. The follow-up results were negative.
Tweet thread here: https://twitter.com/sleepinyourhat/status/1585759654478422016
Direct paper link: https://arxiv.org/abs/2210.10860 (To appear at the NeurIPS ML Safety workshop.)
We're still broadly optimistic about debate, but not on this task, and not in this time-limited, discussion-limited setting, and we're doing a broader, more fail-fast-style search of other settings. Stay tuned for more methods and datasets.
Fair. For better or worse, a lot of this variation came from piloting—we got a lot of nudges from pilot participants to move toward framings that were perceived as controversial or up for debate.
This may be too late, but it's probably also helpful to put the BIG-Bench "canary string" in the doc as well.