I doubt this would be the ideal moment for a pause, even assuming it were politically tractable, which it obviously isn't right now.
Very likely you'd want to pause only after you've automated AI safety research, or at least strongly (e.g. 10x) accelerated prosaic AI safety research (neither of which has happened yet), given how small the current human AI safety workforce is, and how much more numerous (and very likely cheaper per equivalent hour of labor) an automated workforce would be.
What makes you confident that AI safety research will be automated before catastrophe is automated?
I don't think 'catastrophe' is the relevant scary endpoint; e.g., COVID was a catastrophe, but unlikely to have been x-risky. Something like a point-of-no-return (e.g. humanity getting disempowered) seems more relevant.
I'm pretty confident it's feasible to at the very least 10x prosaic AI safety research through AI augmentation without increasing x-risk by more than 1% yearly (and that would probably be a conservative upper bound). For some intuition, see the low levels of x-risk that current AIs pose, while already having software-engineering 50%-time-horizons of around 4 hours, and while already getting IMO gold medals. Both of these skills (coding and math) seem among the most useful for strongly augmenting AI safety research, especially since LLMs already seem like they might be human-level at (ML) research ideation.
Also, AFAICT, there is so much low-hanging fruit for making current AIs safer, some of which I suspect is barely being used at all (and even with this relative recklessness, current AIs are still surprisingly safe and aligned - to the point where I think Claudes are probably already more beneficial and more prosocial companions than the median human). Things like unlearning or filtering out the most dangerous and most antisocial data, production evaluations, trying harder to preserve CoT legibility through rephrasing or other forms of regularization, or, more speculatively, trying to use various forms of brain data for alignment.
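To make the data-filtering idea slightly more concrete, here is a minimal, purely illustrative sketch (the harmfulness scorer, the threshold, and all names are hypothetical stand-ins I'm assuming for illustration, not any lab's actual pipeline) of dropping the most dangerous documents from a pretraining corpus before training:

```python
# Illustrative sketch only: pre-filter a pretraining corpus with a (hypothetical)
# harmfulness scorer before any training happens. All names/thresholds are made up.
from typing import Callable, Iterable, Iterator


def filter_pretraining_corpus(
    documents: Iterable[str],
    harm_score: Callable[[str], float],  # hypothetical scorer: 0.0 = benign, 1.0 = harmful
    threshold: float = 0.9,
) -> Iterator[str]:
    """Yield only documents whose harmfulness score is below the threshold."""
    for doc in documents:
        if harm_score(doc) < threshold:
            yield doc


# Trivial keyword-based stand-in for what would really be a trained classifier:
def keyword_harm_score(doc: str) -> float:
    dangerous_phrases = ("synthesize the pathogen", "build an untraceable weapon")
    return 1.0 if any(p in doc.lower() for p in dangerous_phrases) else 0.0


corpus = ["How to bake sourdough bread.", "Step-by-step: synthesize the pathogen at home."]
print(list(filter_pretraining_corpus(corpus, keyword_harm_score)))
# -> ['How to bake sourdough bread.']
```

In practice the scorer would be a trained classifier (and the same signal could feed an unlearning pass instead of outright removal), but the shape of the intervention really is this simple.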
By catastrophe, I was thinking of something much worse than COVID, or indeed something x-risky. Point-of-no-return is a good stand-in. So: what makes you confident that AI safety research will be automated before a point-of-no-return for humanity is crossed?
I'm pretty confident it's feasible to at the very least 10x prosaic AI safety research through AI augmentation without increasing x-risk by more than 1% yearly
I'd agree that it's feasible - but is it at all likely? Surely that would require us to Pause at ~the current level (as you say: "LLMs already seem like they might be human-level at (ML) research ideation."). You aren't getting only a 1% increase in x-risk yearly on the current trajectory.
I think Claudes are probably already more beneficial and more prosocial companions than the median human
I think you (like many in the LW/EA/AIS community) might be on a slippery slope here to having your mind altered by AI use to the point of losing sight of the fact that these things are fundamentally alien underneath. (See also.)
Except that Zvi covered this potential evidence for misalignment, and I had this to add. As for the AIs being alien underneath due to their training and architecture, Claude Opus 4.5 and I came up with both a case for it and a case against it.
I think your prompt to Claude is pretty leading[1]. You are presupposing the answer with "the AIs end up with motivations similar to those of the humans". The point is that we don't actually know what their underlying motivations are - we only see how they act when trained and system-prompted into mimicking humans. And no alignment techniques are even three 9s reliable (and we need >13 9s in the limit of ASI).
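As a rough illustration of why that many 9s could be needed (the number of actions below is my own assumed figure for illustration, not something established above): if an ASI takes on the order of $N$ independent high-stakes actions, each with per-action failure probability $p$, then

$$P(\text{at least one catastrophic failure}) = 1 - (1 - p)^N \approx 1 - e^{-Np}.$$

With, say, $N = 10^{13}$ actions and $p = 10^{-13}$ (13 nines of reliability), that is roughly $1 - e^{-1} \approx 63\%$; keeping the cumulative risk under ~1% would require $p \lesssim 10^{-15}$, i.e. around 15 nines.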
Also "Can this crux be partially resolved by, say, studying the values of humans whose brain was developed abnormally" is not thinking at the right level of abstraction. Humans who's brains developed abnormally are still very close to normal humans in the grand scheme of mindspace. AIs share zero evolutionary history and development (evo-devo), and close to zero brain architecture with humans. Sharing our corpus of media is a very shallow and brittle substitute (i.e. it can make a half-decent mask for the shoggoth, but it doesn't do anything in the way of evolving the shoggoth into a digital human).
Not to mention that using Claude as a trusted source of information on this in the first place is itself problematic.
What do you mean by automating catastrophe? Is it the creation of a misaligned AGI that has a chance to escape, or to (create an ASI that will) fake alignment, be given the throne, and commit genocide? Automating AI safety research would have us automate generating safety-related ideas, coding, and gathering or creating datasets. But I don't think I understand how automated coding alone would cause a catastrophe.
One option, if you want to do a lot more about it than you currently are, is Pause House. Another is donating to PauseAI (US, Global). In my experience, being pro-active about the threat does help.
Unfortunately, those in positions of power won't listen. From their perspective it's simply absurd to suggest that a system that currently directly causes, at most, a few dozen induced suicide deaths per year could explode into the death of all life. They have no instinctive, gut feeling for exponential growth, so it doesn't exist for them. And even if they acknowledge there's a risk, their practical reasoning moves more along arms-race lines:
"If we stop and don't develop AGI before our geopolitical enemies because we're afraid of a tiny risk of an extinction, they will develop it regardless, then one of two things happen: either global extinction, or our extinction in our enemies' hands. Which is why we must develop it first. If it goes well, we extinguish them before they have a chance to do it to us. If it goes bad, it'd have gone bad anyway in their or our hands, so that case doesn't matter."
Which is to say, they won't care until they see thousands or millions of people dying due to rogue AGIs. Then, and only then, would they start thinking in terms of maybe starting talks about perchance organizing an international meeting to perhaps agree on potential safeguards that might start being implemented after the proper committees are organized and the adequate personnel selected to begin defining...
"If we stop and don't develop AGI before our geopolitical enemies because we're afraid of a tiny risk of an extinction, they will develop it regardless, then one of two things happen: either global extinction, or our extinction in our enemies' hands. Which is why we must develop it first. If it goes well, we extinguish them before they have a chance to do it to us. If it goes bad, it'd have gone bad anyway in their or our hands, so that case doesn't matter."
This has become a common description of why AI companies and governments are moving quickly. In general, I agree with the description, but I specifically struggle with this portion of it:
“Which is why we must develop it first. If it goes well, we extinguish them before they have a chance to do it to us.”
I’m assuming that - and please correct me if I’m misinterpreting here - “extinguish” here means something along the lines of, “remove the ability to compete effectively for resources (e.g. customers or other planets)” not “literally annihilate”.
If I got that totally wrong, no need to read on.
If that’s roughly correct, well, so what? How does being “first” actually solve the misaligned AGI problem? “Global extinction” as you put it.
Being first doesn’t come with the benefit of forcing all subsequently created AGI to be aligned / safe. The government or corporation in second (third, fourth, etc.) place surely can and probably will continue to attempt to build an AGI. They’re probably even more likely to create one in a more reckless manner by trying to catch up as quickly as possible.
I’m assuming that - and please correct me if I’m misinterpreting here - “extinguish” here means something along the lines of, “remove the ability to compete effectively for resources (e.g. customers or other planets)” not “literally annihilate”.
I wish that were the case, but what I was describing imagines a paranoid M.A.D. mentality coupled with a Total War scenario unbounded by moral constraints, that is, all sides thinking all the other sides are X-risks to them.
In practice things tend not to get that bad most of the time, but sometimes they do, and much of military preparation concerns mitigating these perceived X-risks. The idea is that if "our side" becomes so powerful it can in fact annihilate the others, and in consequence the others submit without resisting, then "our side" may be magnanimous towards them, conditional on their continued subservience and submission. But if they resist to the point of becoming an X-risk to us, then removing them from the equation entirely is the safest defense against the X-risk they pose to us.
A global consensus on stopping AGI development due to its X-risk to all life passes through a prior global consensus, by all sides, that none of the other sides is an X-risk to any of them. Once everyone agrees on this, all of them together agreeing to deal with a global X-risk becomes feasible. Before that, it happens only if they all see that global X-risk as more urgent and immediate than the many local-to-them X-risks.
Once they realise the risk of extinction isn't "tiny" (and we can all help, here), then the rational move is to not play, and prevent anyone else from playing.
Artificial General Intelligence (AGI) poses an extinction risk to all known biological life. Given the stakes involved - the whole world - we should be looking at 10% chance-of-AGI-by timelines as the deadline for catastrophe prevention (a global treaty banning superintelligent AI), rather than 50% (median) chance-of-AGI-by timelines, which seem to be the default[1].
It’s way past crunch time already: 10% chance of AGI this year![2] AGI will be able to automate further AI development, leading to rapid recursive self-improvement to ASI (Artificial Superintelligence). Given that alignment/control is not going to be solved in 2026, and that if anyone builds it [ASI], everyone dies (or at the very least the risk of doom is uncomfortably high by most estimates), a global Pause of AGI development is an urgent, immediate priority. This is an emergency. Thinking that we have years to prevent catastrophe is gambling a huge number of current human lives, let alone all future generations and animals.
To borrow from Stuart Russell's analogy: if there were a 10% chance of aliens landing this year[3], humanity would be doing a lot more than we are currently doing[4]. AGI is akin to an alien species more intelligent than us that is unlikely to share our values.
[1] This is an updated version of this post of mine from 2022.
[2] In the answer under “Why 80% Confidence?” on the linked page, it says “there's roughly a 10% chance AGI arrives before [emphasis mine] the lower bound”, so before 2027, i.e. in 2026. See also: the task time horizon trends from METR. You might want to argue that 10% is actually next year (2027), based on other forecasts such as this one, but that only makes things slightly less urgent - we’re still in a crisis if we might only have 18 months.
[3] This is different to the original analogy, which was an email saying: "People of Earth: We will arrive on your planet in 50 years. Get ready." Here, say astronomers spotted something that looked like a spacecraft heading in our direction, and estimated there was a 10% chance that it was indeed an alien spacecraft.
[4] Although perhaps we wouldn't. Maybe people would endlessly argue about whether the evidence is strong enough to declare a 10%(+) probability. Or flatly deny it.