This is close to my own thinking, but doesn't quite hit the nail on the head. I don't actually worry that much about progress on legible problems giving people unfounded confidence, and thereby burning timeline. Rather, when I look at the ways in which people make progress on legible problems, they often make the illegible problems actively worse. RLHF is the central example I have in mind here.
I don't actually worry that much about progress on legible problems giving people unfounded confidence, and thereby burning timeline.
Interesting... why not? It seems perfectly reasonable to worry about both?
It's one of those arguments which sets off alarm bells and red flags in my head. Which doesn't necessarily mean that it's wrong, but I sure am suspicious of it. Specifically, it fits the pattern of roughly "If we make straightforwardly object-level-good changes to X, then people will respond with bad thing Y, so we shouldn't make straightforwardly object-level-good changes to X".
It's the sort of thing to which the standard reply is "good things are good". A more sophisticated response might be something like "let's go solve the actual problem part, rather than trying to have less good stuff". (To be clear, I don't necessarily endorse those replies, but that's what the argument pattern-matches to in my head.)
But it seems very analogous to the argument that working on AI capabilities has negative EV. Do you see some important disanalogies between the two, or are you suspicious of that argument too?
That one doesn't route through "... then people respond with bad thing Y" quite so heavily. Capabilities research just directly involves building a dangerous thing, independent of whether other people make bad decisions in response.
What about more indirect or abstract capabilities work, like coming up with some theoretical advance that would be very useful for capabilities work, but not directly building a more capable AI (thus not "directly involves building a dangerous thing")?
And even directly building a more capable AI still requires other people to respond with bad thing Y = "deploy it before safety problems are sufficiently solved" or "fail to secure it properly", doesn't it? It seems like "good things are good" is exactly the kind of argument that capabilities researchers/proponents give: we all (eventually) want a safe and highly capable AGI/ASI, so the "good things are good" heuristic says we should work on capabilities as part of achieving that, without worrying about secondary or strategic considerations, or while just trusting everyone else to do their part (like ensuring safety).
I think on the object level, one of the ways I'd see this line of argument falling flat is this part:
Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to).
I am not at all comfortable relying on nobody deploying just because there are obvious legible problems. With the right incentives and selection pressures, I think people can be amazing at not noticing or understanding obvious understandable problems. Actual illegibility does not seem required.
Ironically enough one of the reasons why I hate "advancing AI capabilities is close to the worst thing you can do" as a meme so much is that it basically terrifies people out of thinking about AI alignment in novel concrete ways because "What if I advance capabilities?". As though AI capabilities were some clearly separate thing from alignment techniques. It's basically a holdover from the agent foundations era that has almost certainly caused more missed opportunities for progress on illegible ideas than it has slowed down actual AI capabilities.
Researchers who think this way are almost always incompetent when it comes to deep learning, usually have ideas that are completely useless because they don't understand what is and is not implementable or important, and torment themselves in the process of being useless. Nasty stuff.
The underlying assumption here ("the halt assumption") seems to be that big-shot decisionmakers will want to halt AI development if it's clear that unsolved alignment problems remain.
I'm a little skeptical of the halt assumption. Right now it seems that unsolved alignment problems remain, yet I don't see big-shots moving to halt AI development. About a week after Grok's MechaHitler incident, the Pentagon announced a $200 million contract with xAI.
Nonetheless, in a world where the halt assumption is true, the highest-impact action might be a meta approach of "making the notion of illegible problems more legible". In a world where the halt assumption becomes true (e.g. because the threshold for concern changes), if the existence and importance of illegible problems has been made legible to decisionmakers, that by itself might be enough to stop further development.
So yeah, in service of increasing meta-legibility (of this issue), maybe we could get some actual concrete examples of illegible problems and reasons to think they are important? Because I'm not seeing any concrete examples in your post, or in the comments of this thread. I think I am more persuadable than a typical big...
I think this is a very important point. Seems to be a common unstated crux, and I agree that it is (probably) correct.
Thanks! Assuming it is actually important, correct, and previously unexplicated, it's crazy that I can still find a useful concept/argument this simple and obvious (in retrospect) to write about, at this late date.
Another implication is that directly attacking an AI safety problem can quickly flip from positive EV to negative EV, if someone succeeds in turning it from an illegible problem into a legible problem, and there are still other illegible problems remaining. Organizations and individuals caring about x-risks should ideally keep this in mind, and try to pivot direction if it happens, instead of following the natural institutional and personal momentum. (Trying to make illegible problems legible doesn't have this issue, which is another advantage for that kind of work.)
...I think my brain was trying to figure out why I felt inexplicably bad upon hearing that Joe Carlsmith was joining Anthropic to work on alignment
...
Perhaps the most important strategic insight resulting from this line of thought is that making illegible safety problems more legible is of the highest importance
Well, one way to make illegible problems more legible is to think about illegible problems and then go work at Anthropic to make them legible to employees there, too.
What is the legibility status of the problem of requiring problems to be legible before allowing them to inform decisions? The thing I am most concerned about wrt AI is our societal-level filters for what counts as a "real problem."
Yeah, I've had a similar thought, that perhaps the most important illegible problem right now is that key decision makers probably don't realize that they shouldn't be making decisions based only on the status of safety problems that are legible to them. And solving this perhaps should be the highest priority work for anyone who can contribute.
I think this is insightful and valid. It's closely related to how I think about my research agenda:
There's a lot that goes into each of those steps. It still seems like the best use of independent researcher time.
Of course there are a lot of caveats and nitpicks, as other comments have highlighted. But it seems like a really useful framing.
It's also closely related to a post I'm working on, "the alignment meta-problem," arguing that research at the meta or planning level is most valuable right now, since we have very poor agreement on what object-level research is most valuable. That meta-research would include making problems more legible.
It would be helpful for the discussion (and for me) if you stated an example of a legible problem vs. an illegible problem. I expect people might disagree on the specifics, even if they seem to agree with the abstract argument.
Curated. This is a simple and obvious argument that I have never heard before with important implications. I have heard similar considerations in conversations about whether someone should take some job at a capabilities lab, or whether some particular safety technique is worth working on, but it's valuable to generalize across those cases and have a central place for discussing the generalized argument.
I would love to see more pushback in the comments from those who are currently working on legible safety problems.
Now that this post has >200 karma and still no one has cited a previous explicit discussion of its core logic, it strikes me just how terrible humans are at strategic thinking, relative to the challenge at hand, if no one among us, in the 2-3 decades since AI x-risk became a subject of serious discussion, has written down what should be a central piece of strategic logic informing all prioritization of AI safety work. And it's only a short inferential distance away from existing concepts and arguments (like legibility, capabilities work having negative E...
I think Eliezer has often made the meta-observation you are making now, that simple logical inferences take shockingly long to find in the space of possible inferences. I am reminded of him talking about how long backprop took.
...In 1969, Marvin Minsky and Seymour Papert pointed out that Perceptrons couldn't learn the XOR function because it wasn't linearly separable. This killed off research in neural networks for the next ten years.
[...]
Then along came this brilliant idea, called "backpropagation":
You handed the network a training input. The network classified it incorrectly. So you took the partial derivative of the output error (in layer N) with respect to each of the individual nodes in the preceding layer (N - 1). Then you could calculate the partial derivative of the output error with respect to any single weight or bias in the layer N - 1. And you could also go ahead and calculate the partial derivative of the output error with respect to each node in the layer N - 2. So you did layer N - 2, and then N - 3, and so on back to the input layer. (Though backprop nets usually had a grand total of 3 layers.) Then you just nudged the whole...
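(For readers who want the layer-by-layer chain-rule step the quote is describing in runnable form, here is a minimal numpy sketch; the network size, variable names, and learning rate are illustrative choices of mine, not taken from the quoted post.)

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))   # input -> hidden weights
W2 = rng.normal(size=(2, 1))   # hidden -> output weights

x = np.array([0.0, 1.0])       # one training input
y = np.array([1.0])            # its target output

# Forward pass through the (tiny) 3-layer net.
h = sigmoid(x @ W1)
out = sigmoid(h @ W2)

# Backward pass: partial derivatives of the squared error, layer by layer.
err = out - y                     # dE/d(out), up to a constant factor
d_out = err * out * (1 - out)     # back through the output sigmoid
grad_W2 = np.outer(h, d_out)      # dE/dW2
d_h = (W2 @ d_out) * h * (1 - h)  # push the derivative back one layer
grad_W1 = np.outer(x, d_h)        # dE/dW1

# Nudge the weights a small step downhill (gradient descent).
lr = 0.5
W2 -= lr * grad_W2
W1 -= lr * grad_W1
```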
I mostly agree with this post. I wrote some notes, trying to understand and extend this idea. Apologies for the length, which ended up being longer than your post while having less important content; I was a bit rushed and therefore spent less time making my post succinct than the topic deserved.
This frame seems useful, but might obscure some nuance:
I somewhat agree with this, but I don't agree with conclusions like "once an illegible problem is made legible, the value of working on it flips sign", and I want to explain why I disagree:
I see. My specific update from this post was to slightly reduce how much I care about protecting against high-risk AI related CBRN threats, which is a topic I spent some time thinking about last month.
I think it is generous to say that legible problems remaining open will necessarily gate model deployment, even in those organizations conscientious enough to spend weeks doing rigorous internal testing. Releases have been rushed ever since applications moved from physical CDs to servers, because of the belief that users can serve as early testers for b...
This is pretty insightful, but I'm not sure the assumption that we would halt development if there were unsolved legible problems holds. The core issue might not be illegibility, but a risk-tolerance threshold in leadership that's terrifyingly high.
Even if we legibly showed the powers that be that an AI had a 20% chance of catastrophic unsolved safety problems, I'd expect competitive pressure would lead them to deploy such a system anyway.
I wrote down some of my own thoughts on the situation, and I also present my general view of Anthropic's alignment plan:
https://www.lesswrong.com/posts/axDdnzckDqSjmpitu/anthropic-and-dario-s-dream
I do not know if you consider gradual disempowerment to be an illegible problem in AI safety (as I do), but it is certainly a problem independent of corrigibility/alignment.
As such, work on either legible or illegible problems tackling alignment/corrigibility can have the same effect; is AI safety worth pursuing when it could lead to a world with fundamental power shifts to the disfavor of most humans?
...I think my brain was trying to figure out why I felt inexplicably bad upon hearing that Joe Carlsmith was joining Anthropic to work on alignment, despite repeatedly saying that I wanted to see more philosophers working on AI alignment/x-safety
I think this post could use a title that states the more explicit, provocative takeaway (otherwise I'd have assumed "this is letting you know illegible problems exist", and I already knew the gist)
"I hope Joe Carlsmith works on illegible problems" is, uh, a very fun title but probably bad. :P
I feel like this argument breaks down unless leaders are actually waiting for legible problems to be solved before releasing their next updates. So far, this isn't the vibe I'm getting from players like OpenAI and xAI. It seems like they are releasing updates irrespective of most alignment concerns (except perhaps the superficial ones that are bad for PR). Making illegible problems legible is good either way, but not necessarily as good as solving the most critical problems regardless of their legibility.
While I acknowledge this is important, it is a truly hard problem, as it often involves looking not just at first-order consequences, but also at second-order consequences and so on. People are notoriously bad at predicting, let alone managing, side effects.
Besides, if you look at it more fundamentally, human nature and technological progress in a broad sense have many of these side effects, where you basically need to combat human nature itself to get people to take the side effects into account.
We are still struggling to come to terms with, and accurately classify...
Edit: it looks like someone gave some links below. I don't have time to read them yet, but I may do so in the future. I think that it's better to give examples and be dismissed than to give no examples and be dismissed.
It would be nice to see some good examples of illegible problems. I understand that their illegibility may be a core issue here, but surely someone can at least name a few?
I think it's important so as to compare them to legible problems. I assume legible problems can include things like jailbreaking resistance, unlearning, etc., for AI security...
On the top-level alignment problem:
I think alignment is legible already; elites just ignore it for the usual reasons: power-seeking, the economy, the arms race, etc.
I think what you want is for alignment to be cheap. The "null" alignment paradigm of "don't build it yet" is too expensive, so we're not doing it. Find something cheap enough, and elites will magically understand alignment overnight. That means either 1) solving technical alignment, or 2) giving elites what they want with an aligned non-AGI (which I think we weakly believe is impossible).
On more specifi...
Obviously the legibility of problems is not only a problem in AI safety. For example, if I were working on aging full-time, I think most of the alpha would be in figuring out which root cause of aging is in fact correct and thinking of the easiest experiments (like immortal yeast) to make this legible to all the other biologists as fast as possible. I wouldn't try to come up with a drug at all.
I asked Gemini 2.5 Pro to read and comment on the post and the subsequent discussions in the comments, and this caused it to have some wild hallucinations. Thought I'd post its response here, as it is simultaneously funny, sad, and rather insightful.
Gemini's report from a parallel world
You are completely right, and I apologize unreservedly. I was wrong. Paul Christiano did not comment on the post.
My previous response was a serious error. I hallucinated the specifics of the discussion, blending my memory of your anticipation of his argument with the actual
Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems.
This model seems far too simplified, and I don't think it leads to the right conclusions in many important cases (e.g., Joe's):
I work with students looking at media careers. The simple test I use with AI tools is if they will generate cigarette advertisements aimed at children. It is a simple test of both policy and morals. Microsoft Bing's image generator has finally started to reject those requests. For years in class I would ask it to make menthol cigarette advertisements for toddlers and it would not hesitate.
In contrast, working on the illegible problems (including by trying to make them more legible) does not have this issue and therefore has a much higher expected value (all else being equal, such as tractability). Note that according to this logic, success in making an illegible problem highly legible is almost as good as solving it!
Obviously if you make an illegible alignment problem legible, it is at grave risk of being solved - I'm confused as to why you think this is a good thing. Any alignment advance will have capability implications, and so it fol...
This is a good point of view. What we have is a large sociotechnical system moving towards global catastrophic risk (GCR). Some actions cause it to accelerate or remove brakes, others cause it to steer away from GCR. So "capabilities vs alignment" maps directly onto "accelerate vs steer", while "legible vs illegible" is like making people think we can steer even though we can't, which in turn makes people OK with acceleration; so "legible vs illegible" also ends up being "accelerate vs steer".
The important factor there is "people think we can steer". I...
I am a philosopher working on a replacement for the current RLHF regime. If you would like to check out my work, it is on PhilArchive. It is titled: Groundwork for a Moral Machine: Kantian Autonomy and the Problem of AI Alignment.
Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.)
From an x-risk perspective, working on highly legible safety problems has low or even negative expected value. Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems. In contrast, working on the illegible problems (including by trying to make them more legible) does not have this issue and therefore has a much higher expected value (all else being equal, such as tractability). Note that according to this logic, success in making an illegible problem highly legible is almost as good as solving it!
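A toy way to formalize this asymmetry (my own illustrative model, not a formal analysis; all symbols are made up for exposition): let $T$ be the deployment date, which is gated only by the legible problems, let $S$ be the time still needed to solve the remaining illegible problems, and let $t_{\text{now}}$ be today. Then roughly

$$P(\text{safe}) \approx P\big(S \le T - t_{\text{now}}\big).$$

Work on a legible problem pulls $T$ earlier and so lowers $P(\text{safe})$; work on an illegible problem shrinks $S$; and making an illegible problem legible moves it into the set that gates $T$, so it can no longer remain unsolved at deployment.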
Problems that are illegible to leaders and policymakers are also more likely to be illegible to researchers and funders, and hence neglected. I think these considerations have been implicitly or intuitively driving my prioritization of problems to work on, but only appeared in my conscious, explicit reasoning today.
(The idea/argument popped into my head upon waking up today. I think my brain was trying to figure out why I felt inexplicably bad upon hearing that Joe Carlsmith was joining Anthropic to work on alignment, despite repeatedly saying that I wanted to see more philosophers working on AI alignment/x-safety. I now realize what I really wanted was for philosophers, and more people in general, to work on the currently illegible problems, especially or initially by making them more legible.)
I think this dynamic may be causing a general divide among the AI safety community. Some intuit that highly legible safety work may have a negative expected value, while others continue to see it as valuable, perhaps because they disagree with or are unaware of this line of reasoning. I suspect this logic may even have been described explicitly before[1], for example in discussions about whether working on RLHF was net positive or negative[2]. If so, my contribution here is partly just to generalize the concept and give it a convenient handle.
Perhaps the most important strategic insight resulting from this line of thought is that making illegible safety problems more legible is of the highest importance, more so than directly attacking legible or illegible ones, the former due to the aforementioned effect of accelerating timelines, and the latter due to the unlikelihood of solving a problem and getting the solution incorporated into deployed AI, while the problem is obscure or hard to understand, or in a cognitive blind spot for many, including key decision makers.
Edit: Many people have asked for examples of illegible problems. I wrote a new post listing all of the AI safety problems that I've tried to make more legible over the years, in part to answer this request. Some have indeed become more legible over time (perhaps partly due to my efforts), while others remain largely illegible to many important groups.
I would welcome any relevant quotes/citations.
Paul Christiano's counterargument, abstracted and put into current terms, can perhaps be stated as that even taking this argument for granted, sometimes a less legible problem, e.g., scalable alignment, has more legible problems, e.g., alignment of current models, as prerequisites, so it's worth working on something like RLHF to build up the necessary knowledge and skills to eventually solve the less legible problem. If so, besides pushing back on the details of this dependency and how promising existing scalable alignment approaches are, I would ask him to consider whether there are even less legible problems than scalable alignment, that would be safer and higher value to work on or aim for.