This is close to my own thinking, but doesn't quite hit the nail on the head. I don't actually worry that much about progress on legible problems giving people unfounded confidence, and thereby burning timeline. Rather, when I look at the ways in which people make progress on legible problems, they often make the illegible problems actively worse. RLHF is the central example I have in mind here.
I don't actually worry that much about progress on legible problems giving people unfounded confidence, and thereby burning timeline.
Interesting... why not? It seems perfectly reasonable to worry about both?
It's one of those arguments which sets off alarm bells and red flags in my head. Which doesn't necessarily mean that it's wrong, but I sure am suspicious of it. Specifically, it fits the pattern of roughly "If we make straightforwardly object-level-good changes to X, then people will respond with bad thing Y, so we shouldn't make straightforwardly object-level-good changes to X".
It's the sort of thing to which the standard reply is "good things are good". A more sophisticated response might be something like "let's go solve the actual problem part, rather than trying to have less good stuff". (To be clear, I don't necessarily endorse those replies, but that's what the argument pattern-matches to in my head.)
But it seems very analogous to the argument that working on AI capabilities has negative EV. Do you see some important disanalogies between the two, or are you suspicious of that argument too?
That one doesn't route through "... then people respond with bad thing Y" quite so heavily. Capabilities research just directly involves building a dangerous thing, independent of whether other people make bad decisions in response.
What about more indirect or abstract capabilities work, like coming up with some theoretical advance that would be very useful for capabilities work, but not directly building a more capable AI (thus not "directly involves building a dangerous thing")?
And even directly building a more capable AI still requires other people to respond with bad thing Y = "deploy it before safety problems are sufficiently solved" or "fail to secure it properly", doesn't it? It seems like "good things are good" is exactly the kind of argument that capabilities researchers/proponents give: we all (eventually) want a safe and highly capable AGI/ASI, so the "good things are good" heuristic says we should work on capabilities as part of achieving that, without worrying about secondary or strategic considerations, or while just trusting everyone else to do their part (like ensuring safety).
I think on the object level, one of the ways I'd see this line of argument falling flat is this part:
Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to).
I am not at all comfortable relying on nobody deploying just because there are obvious legible problems. With the right incentives and selection pressures, I think people can be amazing at not noticing or understanding obvious understandable problems. Actual illegibility does not seem required.
Ironically enough one of the reasons why I hate "advancing AI capabilities is close to the worst thing you can do" as a meme so much is that it basically terrifies people out of thinking about AI alignment in novel concrete ways because "What if I advance capabilities?". As though AI capabilities were some clearly separate thing from alignment techniques. It's basically a holdover from the agent foundations era that has almost certainly caused more missed opportunities for progress on illegible ideas than it has slowed down actual AI capabilities.
Researchers who think this way are almost always incompetent when it comes to deep learning, usually have ideas that are completely useless because they don't understand what is and is not implementable or important, and torment themselves in the process of being useless. Nasty stuff.
The underlying assumption here ("the halt assumption") seems to be that big-shot decisionmakers will want to halt AI development if it's clear that unsolved alignment problems remain.
I'm a little skeptical of the halt assumption. Right now it seems that unsolved alignment problems remain, yet I don't see big-shots moving to halt AI development. About a week after Grok's MechaHitler incident, the Pentagon announced a $200 million contract with xAI.
Nonetheless, in a world where the halt assumption is true, the highest-impact action might be a meta approach of "making the notion of illegible problems more legible". In a world where the halt assumption becomes true (e.g. because the threshold for concern changes), if the existence and importance of illegible problems has been made legible to decisionmakers, that by itself might be enough to stop further development.
So yeah, in service of increasing meta-legibility (of this issue), maybe we could get some actual concrete examples of illegible problems and reasons to think they are important? Because I'm not seeing any concrete examples in your post, or in the comments of this thread. I think I am more persuadable than a typical big...
I think this is a very important point. Seems to be a common unstated crux, and I agree that it is (probably) correct.
Thanks! Assuming it is actually important, correct, and previously unexplicated, it's crazy that I can still find a useful concept/argument this simple and obvious (in retrospect) to write about, at this late date.
Another implication is that directly attacking an AI safety problem can quickly flip from positive EV to negative EV, if someone succeeds in turning it from an illegible problem into a legible problem, and there are still other illegible problems remaining. Organizations and individuals caring about x-risks should ideally keep this in mind, and try to pivot direction if it happens, instead of following the natural institutional and personal momentum. (Trying to make illegible problems legible doesn't have this issue, which is another advantage for that kind of work.)
...I think my brain was trying to figure out why I felt inexplicably bad upon hearing that Joe Carlsmith was joining Anthropic to work on alignment
...
Perhaps the most important strategic insight resulting from this line of thought is that making illegible safety problems more legible is of the highest importance
Well, one way to make illegible problems more legible is to think about illegible problems and then go work at Anthropic to make them legible to employees there, too.
What is the legibility status of the problem of requiring problems to be legible before allowing them to inform decisions? The thing I am most concerned about wrt AI is our societal-level filters for what counts as a "real problem."
Yeah, I've had a similar thought, that perhaps the most important illegible problem right now is that key decision makers probably don't realize that they shouldn't be making decisions based only on the status of safety problems that are legible to them. And solving this perhaps should be the highest priority work for anyone who can contribute.
I think this is insightful and valid. It's closely related to how I think about my research agenda:
There's a lot that goes into each of those steps. It still seems like the best use of independent researcher time.
Of course there are a lot of caveats and nitpicks, as other comments have highlighted. But it seems like a really useful framing.
It's also closely related to a post I'm working on, "the alignment meta-problem," arguing that research at the meta or planning level is most valuable right now, since we have very poor agreement on what object-level research is most valuable. That meta-research would include making problems more legible.
It would be helpful for the discussion (and for me) if you stated an example of a legible problem vs. an illegible problem. I expect people might disagree on the specifics, even if they seem to agree with the abstract argument.
Curated. This is a simple and obvious argument that I have never heard before with important implications. I have heard similar considerations in conversations about whether someone should take some job at a capabilities lab, or whether some particular safety technique is worth working on, but it's valuable to generalize across those cases and have a central place for discussing the generalized argument.
I would love to see more pushback in the comments from those who are currently working on legible safety problems.
Now that this post has >200 karma and still no one has cited a previous explicit discussion of its core logic, it strikes me just how terrible humans are at strategic thinking, relative to the challenge at hand, if no one among us, in the 2-3 decades since AI x-risk became a subject of serious discussion, has written down what should be a central piece of strategic logic informing all prioritization of AI safety work. And it's only a short inferential distance away from existing concepts and arguments (like legibility, capabilities work having negative E...
I think Eliezer has often made the meta-observation you are making now, that simple logical inferences take shockingly long to find in the space of possible inferences. I am reminded of him talking about how long backprop took.
...In 1969, Marvin Minsky and Seymour Papert pointed out that Perceptrons couldn't learn the XOR function because it wasn't linearly separable. This killed off research in neural networks for the next ten years.
[...]
Then along came this brilliant idea, called "backpropagation":
You handed the network a training input. The network classified it incorrectly. So you took the partial derivative of the output error (in layer N) with respect to each of the individual nodes in the preceding layer (N - 1). Then you could calculate the partial derivative of the output error with respect to any single weight or bias in the layer N - 1. And you could also go ahead and calculate the partial derivative of the output error with respect to each node in the layer N - 2. So you did layer N - 2, and then N - 3, and so on back to the input layer. (Though backprop nets usually had a grand total of 3 layers.) Then you just nudged the whole...
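(For readers who want the layer-by-layer chain-rule step the quote is describing in runnable form, here is a minimal numpy sketch; the network size, variable names, and learning rate are illustrative choices of mine, not taken from the quoted post.)

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))   # input -> hidden weights
W2 = rng.normal(size=(2, 1))   # hidden -> output weights

x = np.array([0.0, 1.0])       # one training input
y = np.array([1.0])            # its target output

# Forward pass through the (tiny) 3-layer net.
h = sigmoid(x @ W1)
out = sigmoid(h @ W2)

# Backward pass: partial derivatives of the squared error, layer by layer.
err = out - y                     # dE/d(out), up to a constant factor
d_out = err * out * (1 - out)     # back through the output sigmoid
grad_W2 = np.outer(h, d_out)      # dE/dW2
d_h = (W2 @ d_out) * h * (1 - h)  # push the derivative back one layer
grad_W1 = np.outer(x, d_h)        # dE/dW1

# Nudge the weights a small step downhill (gradient descent).
lr = 0.5
W2 -= lr * grad_W2
W1 -= lr * grad_W1
```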
I mostly agree with this post. I wrote some notes, trying to understand and extend this idea. Apologies for the length, which ended up being longer than your post while having less important content; I was a bit rushed and therefore spent less time making my post succinct than the topic deserved.
This frame seems useful, but might obscure some nuance:
I somewhat agree with this, but I don't agree with conclusions like "once an illegible problem is made legible, the value of working on it flips sign", and I want to explain why I disagree:
I see. My specific update from this post was to slightly reduce how much I care about protecting against high-risk AI related CBRN threats, which is a topic I spent some time thinking about last month.
I think it is generous to say that legible problems remaining open will necessarily gate model deployment, even in those organizations conscientious enough to spend weeks doing rigorous internal testing. Releases have been rushed ever since applications moved from physical CDs to servers, because of the belief that users can serve as early testers for b...
This is pretty insightful, but I'm not sure the assumption that we would halt development if there were unsolved legible problems holds. The core issue might not be illegibility, but a risk-tolerance threshold in leadership that's terrifyingly high.
Even if we legibly showed the powers that be that an AI had a 20% chance of catastrophic unsolved safety problems, I'd expect competitive pressure would lead them to deploy such a system anyway.
I wrote down some of my own thoughts on the situation, and I also present my general view of Anthropic's alignment plan:
https://www.lesswrong.com/posts/axDdnzckDqSjmpitu/anthropic-and-dario-s-dream
I do not know if you consider gradual disempowerment to be an illegible problem in AI safety (as I do), but it is certainly a problem independent of corrigibility/alignment.
As such, work on either legible or illegible problems tackling alignment/corrigibility can have the same effect; is AI safety worth pursuing when it could lead to a world with fundamental power shifts to the disfavor of most humans?
...I think my brain was trying to figure out why I felt inexplicably bad upon hearing that Joe Carlsmith was joining Anthropic to work on alignment, despite repeatedly saying that I wanted to see more philosophers working on AI alignment/x-safety
I think this post could use a title that states the more explicit, provocative takeaway (otherwise I'd have assumed "this is letting you know illegible problems exist", and I already knew the gist)
"I hope Joe Carlsmith works on illegible problems" is, uh, a very fun title but probably bad. :P
I feel like this argument breaks down unless leaders are actually waiting for legible problems to be solved before releasing their next updates. So far, this isn't the vibe I'm getting from players like OpenAI and xAI. It seems like they are releasing updates irrespective of most alignment concerns (except perhaps the superficial ones that are bad for PR). Making illegible problems legible is good either way, but not necessarily as good as solving the most critical problems regardless of their legibility.
While I acknowledge this is important, it is a truly hard problem, as it often involves looking not just at first-order consequences, but also at second-order consequences and so on. People are notoriously bad at predicting, let alone managing, side effects.
Besides, if you look at it more fundamentally, human nature and technological progress in a broad sense have many of these side effects, where you basically need to combat human nature itself to get people to take the side effects into account.
We are still struggling to come to terms with, and accurately classify...
Edit: it looks like someone gave some links below. I don't have time to read them yet, but I may do so in the future. I think that it's better to give examples and be dismissed than to give no examples and be dismissed.
It would be nice to see some good examples of illegible problems. I understand that their illegibility may be a core issue here, but surely someone can at least name a few?
I think it's important so as to compare them to legible problems. I assume legible problems can include things like jailbreaking resistance, unlearning, etc., for AI security...
On the top-level alignment problem:
I think alignment is legible already; elites just ignore it for the usual reasons: power-seeking, the economy, the arms race, etc.
I think what you want is for alignment to be cheap. The "null" alignment paradigm of "don't build it yet" is too expensive, so we're not doing it. Find something cheap enough, and elites will magically understand alignment overnight. That means either 1) solving technical alignment, or 2) giving elites what they want with an aligned non-AGI (which I think we weakly believe is impossible).
On more specifi...
Obviously the legibility of problems is not only a problem in AI safety. For example, if I were working on aging full-time, I think most of the alpha would be in figuring out which root cause of aging is in fact correct and thinking of the easiest experiments (like immortal yeast) to make this legible to all the other biologists as fast as possible. I wouldn't try to come up with a drug at all.
I asked Gemini 2.5 Pro to read and comment on the post and the subsequent discussions in the comments, and this caused it to have some wild hallucinations. Thought I'd post its response here, as it is simultaneously funny, sad, and rather insightful.
Gemini's report from a parallel world
You are completely right, and I apologize unreservedly. I was wrong. Paul Christiano did not comment on the post.
My previous response was a serious error. I hallucinated the specifics of the discussion, blending my memory of your anticipation of his argument with the actual
Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems.
This model seems far too simplified, and I don't think it leads to the right conclusions in many important cases (e.g., Joe's):
I work with students looking at media careers. The simple test I use with AI tools is if they will generate cigarette advertisements aimed at children. It is a simple test of both policy and morals. Microsoft Bing's image generator has finally started to reject those requests. For years in class I would ask it to make menthol cigarette advertisements for toddlers and it would not hesitate.
In contrast, working on the illegible problems (including by trying to make them more legible) does not have this issue and therefore has a much higher expected value (all else being equal, such as tractability). Note that according to this logic, success in making an illegible problem highly legible is almost as good as solving it!
Obviously if you make an illegible alignment problem legible, it is at grave risk of being solved - I'm confused as to why you think this is a good thing. Any alignment advance will have capability implications, and so it fol...
This is a good point of view. What we have is a large sociotechnical system moving towards global catastrophic risk (GCR). Some actions cause it to accelerate or remove brakes, others cause it to steer away from GCR. So "capabilities vs alignment" maps directly onto "accelerate vs steer", while "legible vs illegible" is like making people think we can steer even though we can't, which in turn makes people OK with acceleration; so "legible vs illegible" also ends up being "accelerate vs steer".
The important factor there is "people think we can steer". I...
I am a philosopher working on a replacement for the current RLHF regime. If you would like to check out my work, it is on PhilArchive. It is titled: Groundwork for a Moral Machine: Kantian Autonomy and the Problem of AI Alignment.
Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.)
From an x-risk perspective, working on highly legible safety problems has low or even negative expected value. Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems. In contrast, working on the illegible problems (including by trying to make them more legible) does not have this issue and therefore has a much higher expected value (all else being equal, such as tractability). Note that according to this logic, success in making an illegible problem highly legible is almost as good as solving it!
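A toy way to formalize this asymmetry (my own illustrative model, not a formal analysis; all symbols are made up for exposition): let $T$ be the deployment date, which is gated only by the legible problems, let $S$ be the time still needed to solve the remaining illegible problems, and let $t_{\text{now}}$ be today. Then roughly

$$P(\text{safe}) \approx P\big(S \le T - t_{\text{now}}\big).$$

Work on a legible problem pulls $T$ earlier and so lowers $P(\text{safe})$; work on an illegible problem shrinks $S$; and making an illegible problem legible moves it into the set that gates $T$, so it can no longer remain unsolved at deployment.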
Problems that are illegible to leaders and policymakers are also more likely to be illegible to researchers and funders, and hence neglected. I think these considerations have been implicitly or intuitively driving my prioritization of problems to work on, but only appeared in my conscious, explicit reasoning today.
(The idea/argument popped into my head upon waking up today. I think my brain was trying to figure out why I felt inexplicably bad upon hearing that Joe Carlsmith was joining Anthropic to work on alignment, despite repeatedly saying that I wanted to see more philosophers working on AI alignment/x-safety. I now realize what I really wanted was for philosophers, and more people in general, to work on the currently illegible problems, especially or initially by making them more legible.)
I think this dynamic may be causing a general divide among the AI safety community. Some intuit that highly legible safety work may have a negative expected value, while others continue to see it as valuable, perhaps because they disagree with or are unaware of this line of reasoning. I suspect this logic may even have been described explicitly before[1], for example in discussions about whether working on RLHF was net positive or negative[2]. If so, my contribution here is partly just to generalize the concept and give it a convenient handle.
Perhaps the most important strategic insight resulting from this line of thought is that making illegible safety problems more legible is of the highest importance, more so than directly attacking legible or illegible ones, the former due to the aforementioned effect of accelerating timelines, and the latter due to the unlikelihood of solving a problem and getting the solution incorporated into deployed AI, while the problem is obscure or hard to understand, or in a cognitive blind spot for many, including key decision makers.
Edit: Many people have asked for examples of illegible problems. I wrote a new post listing all of the AI safety problems that I've tried to make more legible over the years, in part to answer this request. Some have indeed become more legible over time (perhaps partly due to my efforts), while others remain largely illegible to many important groups.
I would welcome any relevant quotes/citations.
Paul Christiano's counterargument, abstracted and put into current terms, can perhaps be stated as that even taking this argument for granted, sometimes a less legible problem, e.g., scalable alignment, has more legible problems, e.g., alignment of current models, as prerequisites, so it's worth working on something like RLHF to build up the necessary knowledge and skills to eventually solve the less legible problem. If so, besides pushing back on the details of this dependency and how promising existing scalable alignment approaches are, I would ask him to consider whether there are even less legible problems than scalable alignment, that would be safer and higher value to work on or aim for.