[I'm posting this as a very informal community request in lieu of a more detailed writeup, because if I wait to do this in a much more careful fashion then it probably won't happen at all. If someone else wants to do a more careful version, that would be great!]

By crux here I mean some uncertainty you have such that your estimate for the likelihood of existential risk from AI - your "p(doom)" if you like that term - might shift significantly if that uncertainty were resolved.

More precisely, let's define a crux as a proposition such that: (a) your estimate for the likelihood of existential catastrophe due to AI would shift a non-trivial amount depending on whether that proposition was true or false; (b) you think there's at least a non-trivial probability that the proposition is true; and (c) you also think there's at least a non-trivial probability that the proposition is false.

Note 1: The crux could also be a variable rather than a binary proposition, for example "year human-level AGI is achieved". In that case substitute "the variable is above some number x" and "the variable is below some number y" for "the proposition is true" / "the proposition is false".

Note 2: It doesn't have to be that the proposition / variable on its own would significantly shift your estimate. If some combination of propositions / variables would shift your estimate, then those propositions / variables are cruxes at least when combined.

For concreteness let's say that "non-trivial" here means at least 5%. So you need to think there's at least a 5% chance the proposition is true, and at least a 5% chance that it's false, and also that your estimate for p(existential catastrophe due to AI) would shift by at least 5% depending on whether the proposition is true or false.
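
If it helps, here is the 5% criterion above written out as a quick snippet. This is purely illustrative, and the function and argument names are mine rather than anything standard:

```python
# Purely illustrative restatement of the crux criterion above.
# p_true: your probability that the proposition is true.
# p_doom_if_true / p_doom_if_false: your p(existential catastrophe due to AI) in each case.
def is_crux(p_true, p_doom_if_true, p_doom_if_false, non_trivial=0.05):
    believable_either_way = non_trivial <= p_true <= 1 - non_trivial
    shifts_estimate = abs(p_doom_if_true - p_doom_if_false) >= non_trivial
    return believable_either_way and shifts_estimate

# Example: a proposition you give 30% odds, which would move your estimate from 0.10 to 0.25.
print(is_crux(0.30, 0.25, 0.10))  # True
```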

Here are just a few examples of potential cruxes people might have (among many others!):

  • Year human-level AGI is achieved
  • How fast the transition will be from much lower-capability AI to roughly human-level AGI, or from roughly human-level AGI to vastly superhuman AI
  • Whether power seeking will be an instrumentally convergent goal
  • Whether AI will greatly upset the offense-defense balance for CBRN technologies in a way that favors malicious actors
  • Whether AGIs could individually or collectively defeat humanity if they wanted to
  • Whether the world can get its collective act together to pause AGI development given a clear enough signal (in combination with the probability that we'll in fact get a clear enough signal in time)

Listing all your cruxes would be the most useful, but if that is too long a list then just list the ones you find most important. Providing additional details (for example, your probability distribution for each crux and/or how exactly it would shift your p(doom) estimates) is recommended if you can but isn't necessary.

Commenting with links to other related posts on LW or elsewhere might be useful as well.

7 comments:
jdp:

It would take many hours to write down all of my alignment cruxes but here are a handful of related ones I think are particularly important and particularly poorly understood:

Does 'generalizing far beyond the training set' look more like extending the architecture or extending the training corpus? There are two ways I can foresee AI models becoming generally capable and autonomous. One path is something like the scaling thesis: we keep making these models larger or their architecture more efficient until we get enough performance from few enough datapoints for AGI. The other path is suggested by the Chinchilla data scaling rules and uses various forms of self-play to extend and improve the training set so you get more from the same number of parameters. Both curves are important, but right now the data scaling curve seems to have the lowest-hanging fruit. We know that large language models extend at least a little bit beyond the training set. This implies it should be possible to extend the corpus slightly out of distribution by rejection sampling with "objective" quality metrics and then tuning the model on the resulting samples.
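
To make that concrete, here is a rough sketch of the kind of loop I have in mind. Every callable in it is a placeholder to be supplied by whoever runs the loop, not a real library API:

```python
# Rough sketch of the corpus-extension loop described above; all callables are hypothetical
# placeholders (a sampler, an "objective" quality metric, and a fine-tuning step).
def extend_corpus(generate, quality_score, finetune, corpus, prompts, threshold, rounds=3):
    for _ in range(rounds):
        accepted = []
        for prompt in prompts:
            candidates = generate(prompt, num_samples=16)  # sample slightly out of distribution
            # Rejection sampling: keep only samples the "objective" metric accepts.
            accepted.extend(c for c in candidates if quality_score(c) >= threshold)
        corpus = corpus + accepted   # the corpus, not the architecture, is what grows
        finetune(corpus)             # tune the model on the extended, human-auditable corpus
    return corpus
```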

This is a crux because it's probably the strongest controlling parameter for whether "capabilities generalize farther than alignment". Nate Soares's implicit model is that architecture extensions dominate. He writes in his post on the sharp left turn that he expects AI to generalize 'far beyond the training set' until it has dangerous capabilities but relatively shallow alignment. This is because the generator of human values is more complex and ad hoc than the generator of e.g. physics. So a model which zero-shot generalizes from a fixed corpus about the shape of what it sees will get reasonable approximations of physics, whose flaws interaction with the environment will correct, and less-reasonable approximations of the generator of human values, which are potentially both harder to correct and optional on its part to fix. By contrast, if human-readable training data is being extended in a loop, then it's possible to audit the synthetic data and intervene when it begins to generalize incorrectly. It's the difference between trying to find an illegible 'magic' process that aligns the model in one step vs. doing many steps and checking their local correctness. Eliezer Yudkowsky expresses a similar idea in List of Lethalities as there being 'no simple core of alignment' and nothing that 'hits back' when an AI drifts out of alignment with us. The data-extension path resolves the problem by putting humans in a position to 'hit back' and ensure alignment generalization keeps up with capabilities generalization.

A distinct but related question is the extent to which the generator of human values can be learned through self-play. It's important to remember that Yudkowsky and Soares expect 'shallow alignment' because consequentialist-materialist truth is convergent but human values are contingent. For example, there is no objective reason why you should eat the peppers of plants that develop noxious chemicals to stop themselves from being eaten, but humans do this all the time and call them 'spices'. If you have a MuZero-style self-play AI that grinds, say, Lean theorems, and you bootstrap it from human language, then over time a greater and greater portion of the dataset will be Lean theorems rather than anything to do with the culinary arts. A superhuman math agent will probably not care very much about humanity. Therefore, if the self-play process for math is completely unsupervised but the self-play process for 'the generator of human values' requires a large relative amount of supervision, then the usual outcome is that aligned AGI loses the race to pure consequentialists pointed at some narrow and orthogonal goal like 'solve math'. Furthermore, if the generator of human values is difficult to compress then it will take more to learn and be more fragile to perturbations and damage. That is, rather than asking whether or not there is a 'simple core to alignment', what we care about is the relative simplicity of the generator of human values vs. other forms of consequentialist objective.

My personal expectation is that the generator of human values is probably not a substantially harder math object to learn than human language itself. Nor are they distinct: human language encodes a huge amount of the mental workspace, and it is clear at this point that it's more of a 1D projection of higher-dimensional neural embeddings than 'shallow traces of thought'. The key question, then, is how reasonable an approximation of English large language models learn. From a precision-recall standpoint it seems pretty much unambiguous that large language models include an approximate understanding of every subject discussed by human beings. You can get a better intuitive sense of this by asking them to break every word in the dictionary into parts. This implies that their recall over the space of valid English sentences is nearly total. Their precision, however, is still in question. The well-worn gradient-methods doom argument is that if we take superintelligence to have general-search-like Solomonoff structure over plans (i.e. instrumental utilities), then it is not enough to learn a math object that is in-distribution inclusive of all valid English sentences; it must also be exclusive of invalid sentences that score highly in our goal geometry but imply squiggle-maximization in real terms. That is, Yudkowsky's theory says you need to be extremely robust to adversarial examples such that superhuman levels of optimization against it don't yield Goodharted outcomes. My intuition strongly says that real agents avoid this problem by having feedback-loop structure instead of general-search structure (or perhaps a general search that has its hypothesis space constrained by a feedback loop), and that a solution to this problem exists, but I have not yet figured out how to rigorously state it.
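
Stated loosely in code, with notional predicates that obviously can't actually be implemented, the precision/recall framing I mean is roughly:

```python
# Loose restatement of the precision/recall framing above; both predicates are notional.
def recall(model_covers, valid_sentences):
    # Of the sentences humans would count as valid English, what fraction does the model cover?
    return sum(map(model_covers, valid_sentences)) / len(valid_sentences)

def precision(is_valid, model_samples):
    # Of the sentences the model produces as high-scoring, what fraction are actually valid?
    return sum(map(is_valid, model_samples)) / len(model_samples)
```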

porby:

Stated as claims that I'd endorse with pretty high, but not certain, confidence:

  1. There exist architectures/training paradigms within 3-5 incremental insights of current ones that directly address most incapabilities observed in LLM-like systems. (85%; if false, my median strong AI estimate would jump by a few years, p(doom) effect would vary depending on how it was falsified)
  2. It is not an accident that the strongest artificial reasoners we have arose from something like predictive pretraining. In complex and high dimensional problem spaces like general reasoning, successful training will continue to depend on schemes with densely informative gradients that can constrain the expected shape of the training artifact. In those problem spaces, training that is roughly equivalent to sparse/distant reward in naive from-scratch RL will continue to mostly fail.[1] (90%; if false, my p(doom) would jump a lot)
  3. Related to, and partially downstream of, #2: the strongest models at the frontier of AGI will continue to be remarkably corrigible (in the intuitive colloquial use of the word, but not strictly MIRI's use). That is, the artifact produced by pretraining and non-malicious fine tuning will not be autonomously doomseeking even if it has the capability. (A bit less than 90%; this being false would also jump my p(doom) by a lot)
  4. Creating agents out of these models is easy and will get easier. Most of the failures in current agentic applications are not fundamental, and many are related to #1. There are no good ways to stop a weights-available model from, in principle, being used as a potentially dangerous agent, and outcome variance will increase as capabilities increase. (95%; I'm not even sure what the shape of this being false would be, but if there was a solution, it'd drop my current p(doom) by at least half)
  5. Scale is sufficient to bypass the need for some insights. While a total lack of insights would make true ASI difficult to reach in the next few years, the hardware and scale of 2040 is very likely enough to do it the dumb way, and physics won't get in the way soon enough. (92%; falsification would make the tail of my timelines longer. #1 and #5 being falsified together could jump my median by 10+ years.)
  6. We don't have good plans for how to handle a transition period involving widely available high-capability systems, even assuming that those high-capability systems are only dangerous when intentionally aimed in a dangerous direction.[2] It looks an awful lot like we're stuck with usually-reactive muddling, and maybe some pretty scary sounding defensive superintelligence propositions. (75%; I'm quite ignorant of governance and how international coordination could actually work here, but it sure seems hard. If this ends up being easy, it would also drop my p(doom) a lot.)
  [1]

    Note that this is not a claim that something like RLHF is somehow impossible. RLHF, and other RL-adjacent techniques that have reward-equivalents that would never realistically train from scratch, get to select from the capabilities already induced by pretraining. Note that many 'strong' RL-adjacent techniques involve some form of big world model, operate in some constrained environment, or otherwise have some structure to work with that makes it possible for the optimizer to take useful incremental steps.

  [2]

    One simple story of many, many possible stories:

    1. It's 20XY. Country has no nukes but wants second strike capacity.

    2. Nukes are kinda hard to get. Open-weights superintelligences can be downloaded.

    3. Country fine-tunes a superintelligence to be an existential threat to everyone else that is activated upon Country being destroyed.

    4. Coordination failures occur; Country gets nuked or invaded in a manner sufficient to trigger second strike.

    5. There's a malign superintelligence actively trying to kill everyone, and no technical alignment failures occurred. Everything AI-related worked exactly as its human designers intended.

I think this is a great project! Clarifying why informed people have such different opinions on AGI x-risk seems like a useful path to improving our odds. I've been working on a post on alignment difficulty cruxes that covers much of the same ground.

Your list is a good starting point. I'd add:

Time window of analysis: I think a lot of people give a low p(doom) because they're only thinking about the few years after we get real AGI.

Paul Christiano, for instance, adds a substantial chance that we've "irreversibly messed up our future within 10 years of building powerful AI" over and above the odds that we all die from takeover or misuse. (in My views on “doom”, from April 2023).

Here are my top 4 cruxes for alignment difficulty, which is different from but highly overlapping with p(doom).

How AGI will be designed and aligned

  • How we attempt it is obviously important for the odds of success

How well RL alignment will generalize

Whether we need to understand human values better

Whether societal factors are included in alignment difficulty

  • Sometimes people are just answering whether the AGI will do what we want, and not including whether that will result in doom (e.g., ASI super-weapons proliferate until someone starts a hyper-destructive conflict).

Other important cruxes are mentioned in Stop talking about p(doom) - basically, what the heck one includes in their calculation, like my first point on time windows.

Here’s an event that would change my p(doom) substantially:

Someone comes up with an alignment method that looks like it would apply to superintelligent entities.  They get extra points for trying it and finding that it works, and extra points for society coming up with a way to enforce that only entities that follow the method will be created.

So far none of the proposed alignment methods seem to stand up to a superintelligent AI that doesn't want to obey them. They don't even stand up to a few minutes of merely human thought. But it's not obviously impossible, and lots of smart people are working on it.

In the non-doom case, I think one of the following will be the reason:

—Civilization ceases to progress, probably because of a disaster.

—The governments of the world ban AI progress.

—Superhuman AI turns out to be much harder than it looks, and not economically viable.

—The above happy circumstance, giving us the marvelous benefits of superintelligence without the omnicidal drawbacks.

Cruxes connected to whether we get human-level A.I. soon:
Do LLM agents become useful in the short term?
How much better is GPT-5 than GPT-4?
Does this generation of robotics startups (e.g. Figure) succeed?

Cruxes connected to whether takeoff is fast:
Are A.I.s significantly better at self-improving while maintaining alignment of their future versions than we are at aligning A.I.?

Cruxes that might change my mind about mech. interp. being doomed:
Can a tool which successfully explains cognitive behavior in GPT-N do the same for GPT-N+1 without significant work?

Last ditch crux:
In high dimensional spaces, do agents with radically different utility functions actually stomp on each other or do they trade? When the intelligence of one agent scales far beyond the other, does trade become stomping or do both just reduce etc. 

Relevant: My Taxonomy of AI-risk counterarguments, inspired by Zvi Mowshowitz's The Crux List.

Just wanted to note that I had a similar question here.

Also, DM me if you want to collaborate on making this a real project. I've been slowly working towards something like this, but I expect to focus more on it in the coming months. I'd like to have something like a version 1.0 ready 2-3 months from now. I appreciate you starting this thread, as I think it's ideal for this to be a community effort. My goal is to feed this stuff into the backend of an alignment research assistant system.