I think it is unreasonable to put non-trivial weight (e.g. > 5%) on a superexponential fit to METR's 50% time horizon measurements, or similar recently-collected measurements.
To be precise about what I am claiming and what I am not claiming:
Note that my argument has significant overlap with this critique of AI 2027, but is focused on what I think is a key crux rather than being a general critique. There has also been some more recent discussion of superexponential fits since the GPT-5 release here, although my points are based on METR's original data. I make no claims of originality and apologize if I missed similar points being made elsewhere.
METR's data (see Figure 1) exhibits a steeper exponential trend over the last year or so (which I'll call the "1-year trend") than over the last 5 years or so (which I'll call the "5-year trend"). A superexponential fit would extrapolate this to an increasingly steep trend over time. Here is why I think such an extrapolation is unwarranted:
In the spirit of sticking my neck out rather than merely criticizing, I will make the following series of point forecasts which I expect to outperform a superexponential fit: just follow an exponential trend, with an appropriate weighting based on recency. If you want to forecast 1 year out, use data from the last year. If you want to forecast 5 years out, use data from the last 5 years. (No doubt it's better to use a decay rather than a cutoff, but you get the idea.) I obviously have very wide error bars on this, but probably not wide enough to include the superexponential fit more than a few years out.
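As a rough sketch of what such a recency-weighted exponential forecast could look like: the code below fits a straight line to log2(time horizon) against release date, with weights that halve every `half_life` years. Setting the half-life equal to the forecast distance is an illustrative assumption, as are the function names and the made-up data points, which are not METR's measurements.

```python
import math

def weighted_exponential_fit(times, horizons, now, half_life):
    """Weighted least-squares fit of log2(horizon) = intercept + slope * t.

    Each point's weight halves for every `half_life` years of age (now - t).
    Returns (intercept, slope), where slope is in doublings per year.
    """
    weights = [0.5 ** ((now - t) / half_life) for t in times]
    ys = [math.log2(h) for h in horizons]
    w_sum = sum(weights)
    t_mean = sum(w * t for w, t in zip(weights, times)) / w_sum
    y_mean = sum(w * y for w, y in zip(weights, ys)) / w_sum
    cov = sum(w * (t - t_mean) * (y - y_mean)
              for w, t, y in zip(weights, times, ys))
    var = sum(w * (t - t_mean) ** 2 for w, t in zip(weights, times))
    slope = cov / var
    intercept = y_mean - slope * t_mean
    return intercept, slope

def forecast(times, horizons, now, years_ahead):
    """Point forecast of the time horizon `years_ahead` years after `now`.

    Assumption (illustrative only): the weighting half-life is set equal
    to the forecast distance, so longer forecasts lean on older data.
    """
    intercept, slope = weighted_exponential_fit(
        times, horizons, now, half_life=years_ahead)
    return 2 ** (intercept + slope * (now + years_ahead))

# Made-up data: time in years, horizon in arbitrary units, with a
# steeper trend in the final year than in the years before it.
times = [0, 1, 2, 3, 4, 5]
horizons = [1, 2, 4, 8, 16, 64]

for years_ahead in (1, 5):
    _, slope = weighted_exponential_fit(times, horizons, now=5,
                                        half_life=years_ahead)
    print(years_ahead, round(slope, 3), forecast(times, horizons, 5, years_ahead))
```

On data like this, the 1-year forecast uses a steeper slope than the 5-year forecast, but each forecast is still a single exponential: the slope does not keep increasing with the forecast distance the way it would under a superexponential fit.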
As an important caveat, I'm not making a claim about the real-world impact of an AI that achieves a certain time horizon measurement. That is much harder to predict than the measurement itself, since you can't just follow straight lines on graphs.
The model sizes were likely chosen based on typical inference constraints. Given that, they mostly care about maximizing performance, and aren't too concerned about the compute cost, since training such small models is very affordable for them. So it's worth going a long way into the regime of diminishing returns.
It is interesting to note how views on this topic have shifted with the rise of outcome-based RL applied to LLMs. A couple of years ago, the consensus in the safety community was that process-based RL should be prioritized over outcome-based RL, since it incentivizes choosing actions for reasons that humans endorse. See for example Anthropic's Core Views On AI Safety:
Learning Processes Rather than Achieving Outcomes
One way to go about learning a new task is via trial and error – if you know what the desired final outcome looks like, you can just keep trying new strategies until you succeed. We refer to this as “outcome-oriented learning”. In outcome-oriented learning, the agent’s strategy is determined entirely by the desired outcome and the agent will (ideally) converge on some low-cost strategy that lets it achieve this.
Often, a better way to learn is to have an expert coach you on the processes they follow to achieve success. During practice rounds, your success may not even matter that much, if instead you can focus on improving your methods. As you improve, you might shift to a more collaborative process, where you consult with your coach to check if new strategies might work even better for you. We refer to this as “process-oriented learning”. In process-oriented learning, the goal is not to achieve the final outcome but to master individual processes that can then be used to achieve that outcome.
At least on a conceptual level, many of the concerns about the safety of advanced AI systems are addressed by training these systems in a process-oriented manner. In particular, in this paradigm:
- Human experts will continue to understand the individual steps AI systems follow because in order for these processes to be encouraged, they will have to be justified to humans.
- AI systems will not be rewarded for achieving success in inscrutable or pernicious ways because they will be rewarded only based on the efficacy and comprehensibility of their processes.
- AI systems should not be rewarded for pursuing problematic sub-goals such as resource acquisition or deception, since humans or their proxies will provide negative feedback for individual acquisitive processes during the training process.
At Anthropic we strongly endorse simple solutions, and limiting AI training to process-oriented learning might be the simplest way to ameliorate a host of issues with advanced AI systems. We are also excited to identify and address the limitations of process-oriented learning, and to understand when safety problems arise if we train with mixtures of process and outcome-based learning. We currently believe process-oriented learning may be the most promising path to training safe and transparent systems up to and somewhat beyond human-level capabilities.
Or Solving math word problems with process- and outcome-based feedback (DeepMind, 2022):
Second, process-based approaches may facilitate human understanding because they select for reasoning steps that humans understand. By contrast, outcome-based optimization may find hard-to-understand strategies, and result in less understandable systems, if these strategies are the easiest way to achieve highly-rated outcomes. For example in GSM8K, when starting from SFT, adding Final-Answer RL decreases final-answer error, but increases (though not significantly) trace error.
[...]
In contrast, consider training from process-based feedback, using user evaluations of individual actions, rather than overall satisfaction ratings. While this does not directly prevent actions which influence future user preferences, these future changes would not affect rewards for the corresponding actions, and so would not be optimized for by process-based feedback. We refer to Kumar et al. (2020) and Uesato et al. (2020) for a formal presentation of this argument. Their decoupling algorithms present a particularly pure version of process-based feedback, which prevent the feedback from depending directly on outcomes.
Or Let's Verify Step by Step (OpenAI, 2023):
Process supervision has several advantages over outcome supervision related to AI alignment. Process supervision is more likely to produce interpretable reasoning, since it encourages models to follow a process endorsed by humans. Process supervision is also inherently safer: it directly rewards an aligned chain-of-thought rather than relying on outcomes as a proxy for aligned behavior (Stuhlmüller and Byun, 2022). In contrast, outcome supervision is harder to scrutinize, and the preferences conveyed are less precise. In the worst case, the use of outcomes as an imperfect proxy could lead to models that become misaligned after learning to exploit the reward signal (Uesato et al., 2022; Cotra, 2022; Everitt et al., 2017).
In some cases, safer methods for AI systems can lead to reduced performance (Ouyang et al., 2022; Askell et al., 2021), a cost which is known as an alignment tax. In general, any alignment tax may hinder the adoption of alignment methods, due to pressure to deploy the most capable model. Our results show that process supervision in fact incurs a negative alignment tax. This could lead to increased adoption of process supervision, which we believe would have positive alignment side-effects. It is unknown how broadly these results will generalize beyond the domain of math, and we consider it important for future work to explore the impact of process supervision in other domains.
It seems worthwhile to reflect on why this perspective has gone out of fashion:
Some tentative takeaways:
I thought about this a bit more (and discussed with others) and decided that you are basically right that we can't avoid the question of empirical regularities for any realistic alignment application, if only because any realistic model with potential alignment challenges will be trained on empirical data. The only potential application we came up with is LPE for a formalized distribution and formalized catastrophe event, but we didn't find this especially compelling, for several reasons.[1]
To me the challenges we face in dealing with empirical regularities do not seem bigger than the challenges we face with formal heuristic explanations, but the empirical regularities challenges should become much more concrete once we have a notion of heuristic explanations to work with, so it seems easier to resolve them in that order. But I have moved in your direction, and it does seem worth our while to address them both in parallel to some extent.
[1] Objections include: (a) the model is trained on empirical data, so we need to only explain things relevant to formal events, and not everything relevant to its loss; (b) we also need to hope that empirical regularities aren't needed to explain purely formal events, which remains unclear; and (c) the restriction to formal distributions/events limits the value of the application.
Thank you for writing this up – I think this post (and the others in the series) does a good job of describing ARC's big-picture alignment plan, common objections, our usual responses, and why you find those uncompelling.
In my personal opinion (not necessarily shared by everyone at ARC), the best case for our research agenda comes neither from the specific big-picture plan you are critiquing here, nor from "something good falling out of it along the way" (although that is a part of my motivation), but instead from some intermediate goal along the lines of "a formal framework for heuristic arguments that is well-developed enough that we can convincingly apply it to neural networks". If we can achieve that, it seems quite likely to me that it will be useful for something, for essentially the same reason we would expect exhaustive mechanistic interpretability to be useful for something (and probably quite a lot). Under this view, fleshing out the LPE and MAD applications is important as a proof of concept and for refining our plans, but those applications are subject to revision.
This isn't meant to downplay your objections too much. The ones that loom largest in my mind are false positives in MAD, small estimates being "lost in the noise" for LPE, and the whole minefield of empirical regularities (all of which you do good justice to). Paul still seems to think we can resolve all of these issues, so hopefully we will get to the bottom of them at some point, although in the short term we are more focused on the more limited dream of heuristic arguments for neural networks (and instances of LPE we think they ought to enable).
A couple of your objections apply even to this more limited dream though, especially the ones under "Explaining everything" and "When and what do we explain?". But your arguments there seem to boil down to "that seems incredibly daunting and ambitious", which I basically agree with. I still come down on the side of thinking that it is a promising target, but I do think that ARC's top priority should be to come up with concrete cruxes here and put them to the test, which is our primary research focus at the moment.
I recently gave this talk at the Safety-Guaranteed LLMs workshop:
The talk is about ARC's work on low probability estimation (LPE), covering:
Yes, by "unconditionally" I meant "without an additional assumption". I don't currently see why the Reduction-Regularity assumption ought to be true (I may need to think about it more).
Thanks for writing this up! Your "amplified weak" version of the conjecture (with complexity bounds increasing exponentially in 1/ε) seems plausible to me. So if you could amplify the original (weak) conjecture to this unconditionally, it wouldn't significantly decrease my credence in the principle. But it would be nice to have this bound on what the dependence on ε would need to be.
The statements are equivalent if only a tiny fraction (tending to 0) of random reversible circuits satisfy . We think this is very likely to be true, since it is a very weak consequence of the conjecture that random (depth-) reversible circuits are pseudorandom permutations. If it turned out to not be true, it would no longer make sense to think of as an "outrageous coincidence" and so I think we would have to abandon the conjecture. So in short we are happy to consider either version (though I agree that "for which is false" is a bit more natural).
I'm happy to talk about a theoretical HCAST suite with no bugs and infinitely many tasks of arbitrarily long time horizon, for the sake of argument (even though it is a little tricky to reason about, and measuring human performance would be impractical).
I think the notion of an "infinite time horizon" system is a poor abstraction, because it implicitly assumes 100% reliability. Almost any practical, complex system has a small probability of error, even if this probability is too small to measure in practice. Once you stop using this abstraction, the argument doesn't seem to hold up: surely a system that has 99% reliability at million-year tasks has lower than 99% reliability at 10 million-year tasks? This seems true even if a 10 million-year task is nothing more than 10 consecutive million-year tasks, and that seems strictly easier than an average 10 million-year task.
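To make the compounding concrete, here is the arithmetic under a simplifying assumption that is added for illustration and is not part of the argument above: the 10 consecutive million-year tasks succeed independently, each with probability 0.99.

```latex
\[
  \Pr[\text{all 10 consecutive tasks succeed}] = 0.99^{10} \approx 0.904 < 0.99 .
\]
% More generally, under the same independence assumption, per-task
% reliability p over k consecutive tasks gives overall reliability p^k,
% which is strictly below p whenever 0 < p < 1 and k > 1.
```

Under this assumption, the system would need roughly 99.9% reliability on each million-year task (since 0.999^10 ≈ 0.990) to retain about 99% reliability on the chained 10 million-year task.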