That idea of catching bad mesa-objectives during training sounds key, and I presume it fits under the 'generalization science' and 'robust character training' from Evan's original post. In the US, NIST is working to develop test, evaluation, verification, and validation standards for AI, and it would be good to incorporate this concept into that effort.
The data relating exponential capability growth to Elo can also be seen over decades of computer chess history. From the 1960s into the 2020s, while computer hardware advanced exponentially at 100-1000x per decade in performance (and computer chess software advanced too), Elo scores grew linearly at about 400 points per decade, taking multiple decades to go from 'novice' to 'superhuman'. Elo scores have a tinge of the exponential built in: under the usual logistic model, a 400-point Elo advantage gives the higher-rated competitor roughly 10:1 odds of winning, an 800-point advantage roughly 100:1, and so on. It appears that the current HW/SW/dollar rate of growth towards AGI means the Elo relative to humans is increasing faster than 400 Elo per decade. And, of course, unlike computer chess, as AI Elo at 'AI development' approaches the level of a skilled human, we'll likely see a noticeable increase in the rate of capability growth.
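For reference, here is a minimal Python sketch of the standard logistic Elo formula (my own illustration, not something from the chess sources above), showing how the odds scale as roughly 10^(D/400):1 with rating gap D:

```python
# Minimal sketch of the logistic Elo model (illustrative only).

def elo_win_probability(rating_diff: float) -> float:
    """Expected score for the higher-rated player given a rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

for gap in (0, 400, 800, 1200):
    p = elo_win_probability(gap)
    print(f"{gap:4d} Elo gap -> win probability {p:.3f}, odds ~{p / (1 - p):.0f}:1")
```

Running it prints odds of about 1:1, 10:1, 100:1, and 1000:1 for gaps of 0, 400, 800, and 1200 points.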
Thanks, Evan, for an excellent overview of the alignment problem as seen from within Anthropic. Chris Olah's graph showing perspectives on alignment difficulty is indeed a useful visual for this discussion. Another image I've shared lately relates to the challenge of building inner alignment, illustrated by Figure 2 from Contemplative Artificial Intelligence:
In the images, the blue arrows indicate our efforts to maintain alignment of AI as capabilities advance through AGI to ASI. In the left image, we see the case of models that generalize in misaligned ways: the blue constraints (guardrails, system prompts, thin efforts at RLHF, etc.) fail to constrain the advancing AI to remain aligned. The right image shows the happier result, where training, architecture, interpretability, scalable oversight, etc., contribute to a 'wise world model' that maintains alignment even as capabilities advance.
I think the Anthropic probability distribution over alignment difficulty seems correct: we probably won't get alignment by default from advancing AI, but, as you suggest, with serious, concerted effort we can maintain alignment through AGI. What's critical is to use techniques like generalization science, interpretability, and introspective honesty to gauge whether we are building towards AGI capable of safely automating alignment research towards ASI. To that end, metrics that let us determine whether alignment is actually closer to P-vs-NP in difficulty are crucial, and efforts from METR, UK AISI, NIST, and others can help here. I'd like to see more 'positive alignment' papers such as Cooperative Inverse Reinforcement Learning, Corrigibility, and AssistanceZero: Scalably Solving Assistance Games, since detecting when an AI is positively aligned internally is critical to reaching the 'wise world model' outcome.
I concur with that sentiment. GPUs hit a sweet spot between compute efficiency and algorithmic flexibility. CPUs are more flexible for arbitrary control logic, and custom ASICs can improve compute efficiency for a stable algorithm, but GPUs are great for exploring new algorithms where SIMD-style control flows exist (SIMD=single instruction, multiple data).
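As a toy illustration of that distinction (my own sketch, assuming NumPy; the names are just for illustration): the kind of workload that maps well to a GPU applies one operation uniformly across a large block of data, while branch-heavy per-element logic is where CPU flexibility still matters.

```python
import numpy as np

x = np.random.rand(1_000_000)

# SIMD-friendly: one instruction stream applied to many data elements at once.
y = np.tanh(2.0 * x + 1.0)

# Branch-heavy, per-element control flow -- the sort of logic where a flexible
# CPU core is more natural than a wide SIMD unit.
def branchy_sum(values):
    total = 0.0
    for v in values:
        if v > 0.5:
            total += v ** 2
        else:
            total -= v
    return total

print(y.mean(), branchy_sum(x))
```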
I would include "constructivist learning" in your list, but I agree that LLMs seem capable of this. By "constructivist learning" I mean a scientific process where the learner conceives of an experiment on the world, tests the idea by acting on the world, and then learns from the result. A VLA model with incremental learning seems close to this. RL could be used for the model update, but I think for ASI we need learning from real-world experiments.
This post provides a good overview of some topics I think need attention from the 'AI policy' people at national levels. AI policy (such as at the US and UK AISI groups) has been focused on generative AI and, recently, agentic AI to understand near-term risks. Whether we're talking about LLM training and scaffolding advances or a new AI paradigm, there is new risk when AI begins to learn from experiments in the world or from reasoning about its own world model. In child development, imitation learning focuses on learning from examples, while constructivist learning focuses on learning by reflecting on interactions with the world. Constructivist learning is, I expect, key to pushing past AGI to ASI, and it carries obvious risks to alignment beyond those of imitation learning.
In general, I expect something LLM-like (i.e. transformer models or an improved derivative) to be able to reach ASI with a proper learning-by-doing structure. But I also expect ASI could find and implement a more efficient intelligence algorithm once ASI exists.
1.4.1 Possible counter: “If a different, much more powerful, AI paradigm existed, then someone would have already found it.”
This paragraph tries to provide some data for a probability estimate of this point. AI as a field has been around at least since the Dartmouth conference in 1956. In that time we've had Eliza, Deep Blue, Watson, and now transformer-based models including OpenAI o3-pro. In support of Steven's position, one could note that AI research publication rates are much higher now than during the previous 70 years, but at the same time many AI ideas have been explored and the current best results come from models based on the 8-year-old "Attention is all you need" paper. To get a sense of the research rate, the doubling time for AI/ML research papers per month was about 2 years between 1994 and 2023 according to this Nature paper. Hence, every 2 years we produce about as many papers as were created in all the years since 1956. I don't expect this doubling can continue forever, but certainly many new ideas are being explored now. If a 'simple model' for AI exists and its discovery is, say, equally likely to land on any AI/ML research paper published between 1956 and the achievement of ASI, then we can estimate where we are in that sequence using this simplistic research model. If ASI is only 6 years out and the doubling every 2 years continues, then the cumulative paper count grows roughly 8x (three more doublings), so almost 90% of the AI/ML research papers before ASI are still in the future. Even though many of these papers are LLM-focused, there is still active work in alternative areas. But even though the foundational paper for ASI may yet be in our future, I would expect something like a 'complex' ML model to win out (for example, Yann LeCun's ideas involve differentiable brain modules). And the solution may or may not be more compute-intensive than current models: brain compute estimates vary widely, and the human brain has been optimized by evolution over many generations. In short, it seems reasonable to expect another key idea before ASI, but I would not expect it to be a simple model.
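A quick back-of-the-envelope for the ~90% figure (my own sketch; the 6-year and 2-year numbers are just the illustrative assumptions from the paragraph above):

```python
# If the monthly paper rate doubles every 2 years, the cumulative count also
# roughly doubles every 2 years once exponential growth dominates the history.

doubling_years = 2.0
years_to_asi = 6.0      # hypothetical timeline used in the paragraph above

doublings = years_to_asi / doubling_years       # 3 more doublings
growth = 2.0 ** doublings                       # cumulative total grows ~8x
fraction_still_future = 1.0 - 1.0 / growth      # ~0.875

print(f"Roughly {fraction_still_future:.0%} of pre-ASI papers would still be unwritten")
```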
From what I gather reading the ACAS X paper, it formally proved a subset of the whole problem, and many issues uncovered by the formal method were further analyzed using simulations of aircraft behavior (see the end of section 3.3). One of the assumptions in the model is that the planes react correctly to control decisions and don't have mechanical issues. The problem space and possible actions were well-defined and well-constrained in the realistic but simplified model they analyzed. I can imagine complex systems making use of provably correct components in this way, while the whole system may not be provably correct. When an AI develops a plan, it could prefer to follow a provably safe path when reality can be mapped to a usable model reliably, and then behave cautiously when moving from one provably safe path to another. But the criteria for 'reliable model' and 'behave cautiously' still require non-provable decisions to solve a complex problem.
Steve, thanks for your explanations and discussion. I just posted a top-level reply about formal verification limitations within the field of computer hardware design. In that field, ignoring for now the very real issue of electrical and thermal noise, there is immense value in verifying that the symbolic 1s and 0s of the digital logic will execute the similarly symbolic software instructions correctly. So the problem space is inherently simplified relative to the real world, and silicon designers have an incentive to build designs that are easy to test and debug, yet only small parts of designs can be formally verified today. It seems to me that, although formal verification will keep advancing, AI capabilities will advance faster, and we need to develop simulation-based testing approaches to AI safety that are as robust as possible. For example, in silicon design one can make sure the tests have at least executed every line of code. One could imagine having a METR test suite and trying to ensure that every neuron in a given AI model has been both active and inactive at least once. It's not a proof, but it would speak to the breadth of the test suite in relation to the model. Are there robustness criteria for directed and random testing that you consider highly valuable without having a full safety proof?
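As a rough sketch of that neuron-coverage idea (my own toy code, assuming PyTorch; the tiny model and random 'test suite' are stand-ins for a real model and something like a METR task set), one could track whether each hidden unit has been observed both active and inactive across the suite:

```python
import torch
import torch.nn as nn

# Toy model and stand-in test suite; in practice these would be the real model
# and a broad evaluation suite.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
test_suite = [torch.randn(8, 16) for _ in range(10)]

seen_active = {}    # per module: which units have fired (>0) on some input
seen_inactive = {}  # per module: which units have been zero on some input

def coverage_hook(module, inputs, output):
    acts = output.detach()
    default = torch.zeros(acts.shape[-1], dtype=torch.bool)
    seen_active[module] = seen_active.get(module, default) | (acts > 0).any(dim=0)
    seen_inactive[module] = seen_inactive.get(module, default) | (acts <= 0).any(dim=0)

# Hook every ReLU so we observe its post-activation values.
hooks = [m.register_forward_hook(coverage_hook)
         for m in model if isinstance(m, nn.ReLU)]

with torch.no_grad():
    for batch in test_suite:
        model(batch)

for h in hooks:
    h.remove()

for module, active in seen_active.items():
    covered = (active & seen_inactive[module]).float().mean().item()
    print(f"{module}: {covered:.0%} of units seen both active and inactive")
```

It's only a coverage heuristic, of course, but it gives a concrete number for "how much of the model did this test suite actually exercise?"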
Thanks for the study, Andrew. In the field of computer hardware design, formal verification is often used on smaller parts of the design, but randomized dynamic verification (running the model and checking results) is still necessary to test corner cases in the larger design. Indeed, the idea that a complex problem can be engineered to be easier to formally verify is discussed in this recent paper formally verifying IEEE floating-point arithmetic. In that paper, published in 2023, they report that their divide-and-conquer approach resulted in a 7.5-hour run time to prove double-precision division correct. Another illustrative example comes from a paper from Intel, which includes the diagram below showing how simulation is relied on for the full-IP-level verification of complex systems.
This data supports your points in limitations 2 and 3 and shows the difficulty of engineering a system to be easily proven correct formally. Certainly, silicon processor design has had engineer-millennia spent on the problem of proving designs correct before manufacturing. For AI safety, the problem is much more complex than 'can the silicon boot the OS and run applications?', and I expect directed and random testing will need to be how we test advanced AI as we move towards AGI and ASI. AI can help improve the quality of safety testing, including contributing to red-teaming of next-generation models, but I doubt it will be able to help us formally prove correctness before we actually have ASI.
Thanks for the interesting peek into your brain. I have a couple of thoughts to share on how my own approaches relate.
The first is related to watching plenty of sci-fi apocalyptic future movies. While it's exciting to see the hero's adventures, I'd like to think that I'd be one of the scrappy people trying to hold some semblance of civilization together. Or the survivor trying to barter and trade with folks instead of fighting over stuff. In general, even in the face of doom, just trying to help minimize suffering unto the end. So the 'death with dignity' ethos fits in with this view.
A second relates to the idea of seeing yourself getting out of bed in the morning. When I've had so much on my plate that it starts to feel stressful, it helps to visualize the future state where I've gotten the work done and am looking back. Then I just imagine, inside my brain, sodium ions moving around, electrons dropping energy states, proteins changing shape, etc., as the problem gets resolved. Visualizing the low-level activity in my brain helps me shift focus away from the stress and actually move ahead with solving the problem.