This is close to my own thinking, but doesn't quite hit the nail on the head. I don't actually worry that much about progress on legible problems giving people unfounded confidence, and thereby burning timeline. Rather, when I look at the ways in which people make progress on legible problems, they often make the illegible problems actively worse. RLHF is the central example I have in mind here.
I don't actually worry that much about progress on legible problems giving people unfounded confidence, and thereby burning timeline.
Interesting... why not? It seems perfectly reasonable to worry about both?
It's one of those arguments which sets off alarm bells and red flags in my head. Which doesn't necessarily mean that it's wrong, but I sure am suspicious of it. Specifically, it fits the pattern of roughly "If we make straightforwardly object-level-good changes to X, then people will respond with bad thing Y, so we shouldn't make straightforwardly object-level-good changes to X".
It's the sort of thing to which the standard reply is "good things are good". A more sophisticated response might be something like "let's go solve the actual problem part, rather than trying to have less good stuff". (To be clear, I don't necessarily endorse those replies, but that's what the argument pattern-matches to in my head.)
But it seems very analogous to the argument that working on AI capabilities has negative EV. Do you see some important disanalogies between the two, or are you suspicious of that argument too?
That one doesn't route through "... then people respond with bad thing Y" quite so heavily. Capabilities research just directly involves building a dangerous thing, independent of whether other people make bad decisions in response.
I think this is a very important point. Seems to be a common unstated crux, and I agree that it is (probably) correct.
Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.)
From an x-risk perspective, working on highly legible safety problems has low or even negative expected value. Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems. In contrast, working on the illegible problems (including by trying to make them more legible) does not have this issue and therefore has a much higher expected value (all else being equal, such as tractability). Note that according to this logic, success in making an illegible problem highly legible is almost as good as solving it!
Problems that are illegible to leaders and policymakers are also more likely to be illegible to researchers and funders, and hence neglected. I think these considerations have been implicitly or intuitively driving my prioritization of problems to work on, but only appeared in my conscious, explicit reasoning today.
(The idea/argument popped into my head upon waking up today. I think my brain was trying to figure out why I felt inexplicably bad upon hearing that Joe Carlsmith was joining Anthropic to work on alignment, despite repeatedly saying that I wanted to see more philosophers working on AI alignment/x-safety. I now realize what I really wanted was for philosophers, and more people in general, to work on the currently illegible problems, especially or initially by making them more legible.)
I think this dynamic may be causing a general divide within the AI safety community. Some intuit that highly legible safety work may have a negative expected value, while others continue to see it as valuable, perhaps because they disagree with or are unaware of this line of reasoning. I suspect this logic may even have been described explicitly before[1], for example in discussions about whether working on RLHF was net positive or negative[2]. If so, my contribution here is partly just to generalize the concept and give it a convenient handle.
Perhaps the most important strategic insight resulting from this line of thought is that making illegible safety problems more legible is of the highest importance, more so than directly attacking either legible or illegible problems. Attacking legible problems has the aforementioned effect of accelerating timelines. Attacking illegible ones runs into the unlikelihood of solving a problem, and getting the solution incorporated into deployed AI, while the problem remains obscure or hard to understand, or in a cognitive blind spot for many, including key decision makers.
[1] I would welcome any relevant quotes/citations.
[2] Paul Christiano's counterargument, abstracted and put into current terms, can perhaps be stated as follows: even taking this argument for granted, a less legible problem (e.g., scalable alignment) sometimes has more legible problems (e.g., alignment of current models) as prerequisites, so it's worth working on something like RLHF to build up the knowledge and skills needed to eventually solve the less legible problem. If so, besides pushing back on the details of this dependency and on how promising existing scalable alignment approaches are, I would ask him to consider whether there are even less legible problems than scalable alignment that would be safer and higher value to work on or aim for.