Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Core claim

I believe that sharing information about the capabilities and limits of existing ML systems, and especially language model agents, significantly reduces risks from powerful AI—despite the fact that such information may increase the amount or quality of investment in ML generally (or in LM agents in particular).

Concretely, I mean to include information like: tasks and evaluation frameworks for LM agents, the results of evaluations of particular agents, discussions of the qualitative strengths and weaknesses of agents, and information about agent design that may represent small improvements over the state of the art (insofar as that information is hard to decouple from evaluation results).

Context

ARC Evals currently focuses on evaluating the capabilities and limitations of existing ML systems, with an aim towards understanding whether or when they may be capable enough to pose catastrophic risks. Current evaluations are particularly focused on monitoring progress in language model agents.

I believe that sharing this kind of information significantly improves society's ability to handle risks from AI, and so I am encouraging the team to share more information. However this issue is certainly not straightforward, and in some places (particularly in the EA community where this post is being shared) I believe my position is controversial.

I'm writing this post at the request of the Evals team to lay out my views publicly. I am speaking only for myself. I believe the team is broadly sympathetic to my position, but would prefer to see a broader and more thorough discussion about this question.

I do not think this post presents a complete or convincing argument for my beliefs. The purpose is mostly to outline and explain the basic view, at a similar level of clarity and thoroughness to the arguments against sharing information (which have mostly not been laid out explicitly).

Added 8/1: Evals has just published a description of some of their work evaluating GPT-4 and Claude. Their publication does not include transcripts, the details of the LM agents they evaluated, or detailed qualitative discussion of the strengths and weaknesses of the agents they evaluated. I believe that eventually Evals should be considerably more liberal about sharing that kind of information; this post will explain why I believe that.

Accelerating LM agents seems neutral (or maybe positive)

I believe that having a better understanding of LM agents increases safety[1] through two channels:

  • LM agents are an unusually safe way to build powerful AI systems. Existing concerns about AI takeover are driven primarily by scaling up black-box optimization,[2] and increasing our reliance on human-comprehensible decompositions with legible interfaces seems like it significantly improves safety. I think this is a large effect size and is suggested by several independent lines of reasoning.
  • If LM agents are weak due to exceptionally low investment and understanding, this creates "dry tinder": as incentives rise, that investment will quickly rise and low-hanging fruit will be picked. While there is some dependence on serial time, I think that increased investment in LM agents now will significantly slow down progress later.

I will discuss these mechanisms in more detail in the rest of this section.

I also think that accelerating LM agents will drive investment in improving and deploying ML systems, and so can reduce time available to react to risk. As a result I'm ambivalent about the net effect of improving the design of LM agents—my personal tentative guess is that it's positive, but I would be hesitant about deliberately accelerating LM agents to improve safety (moreover I think this would be a very unleveraged approach to improving safety[3] and would strongly discourage anyone from pursuing it).

But this means that I am significantly less concerned about information about LM agent capabilities accelerating progress on LM agents. Given that I am already positively disposed towards sharing information about ML capabilities and limitations despite the risk of acceleration, I am particularly positive about sharing information in cases where the main cost is accelerating LM agents.

Improvements in LM agents seem good for safety

Language model agents are built out of LM parts that solve human-comprehensible tasks, composed along human-comprehensible interfaces. Progress in understanding LM agents seems relevant for improving agents built this way, while having at best marginal relevance for systems optimized end to end (to which I expect the "bitter lesson" to apply strongly) or for situations where individual ML invocations are just "cogs in the Turing machine."

I think this kind of ML system seems great for safety:

  • If we give a system like AutoGPT a goal, it pursues that goal by taking individual steps that a human would rate highly based on their understanding of that goal. The LM's guesses about what humans would do intervene at every step, and even current language models would avoid pursuing a sub-plan that humans would consider unacceptable. There is no constraint towards specifying measurable goals of the kind that lead to reward-hacking concerns. It now appears that LM agents could scale up to human-level usefulness before we start seeing any serious form of reward hacking.
  • If LM agents work better, then we will reach any given level of AI using weaker individual models (e.g. a level sufficient for AI to help with alignment, or to help enable or motivate a policy reaction). Deceptive alignment is directly tied to the complexity of the model that we are optimizing end-to-end, and so it becomes significantly less likely the more we rely on interfaces that are designed by humans rather than optimized.
  • Setting aside individual threat models, decomposing complicated tasks into simpler parts that humans understand is a natural way to improve safety. It makes it easier to tell what the overall system is doing and why, and provides many additional levers to intervene on how the model thinks or acts.
  • Turning our attention from threat models to positive visions, I believe that LM agents based on chain of thought and decomposition seem like the most plausible approach to bootstrapping subhuman systems into trusted superhuman systems. For about 7 years using LM agents for RLAIF has seemed like the easiest path to safety,[4] and in my view this is looking more and more plausible over time.
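
The structural claim in the first bullet above can be illustrated with a toy loop (a minimal sketch, not any real agent framework; `propose_next_step` and `human_would_approve` are hypothetical stand-ins for LM calls):

```python
# Toy sketch of an LM agent loop in which every proposed step passes through
# a human-comprehensible acceptability check before execution. Both helper
# functions are deterministic stubs standing in for LM calls.

def propose_next_step(goal: str, step_index: int) -> str:
    """Stand-in for an LM suggesting the next human-readable step of a plan."""
    plan = ["search for suppliers", "compare prices",
            "exfiltrate credentials", "place order"]
    return plan[step_index]

def human_would_approve(goal: str, step: str) -> bool:
    """Stand-in for the LM's guess at whether a human would rate this step highly."""
    return "exfiltrate" not in step

def run_agent(goal: str, max_steps: int = 4) -> list[str]:
    executed: list[str] = []
    for i in range(max_steps):
        step = propose_next_step(goal, i)
        if not human_would_approve(goal, step):
            continue  # the unacceptable sub-plan is vetoed at this legible interface
        executed.append(step)  # a real agent would execute the step here
    return executed
```

Because the veto operates on a legible, per-step interface, strengthening oversight here requires no access to model internals.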

So at a fixed level of capability, I think the more we are relying on LM agents (rather than larger LMs) the safer we are.

As mentioned before, I do think that progress in LM agents will increase overall investment in ML, and not just LM agent performance. And to a significant extent I think the success of LM agents will be determined by technical factors rather than how much investment there is (although this also makes me more skeptical about the acceleration impacts). But if it weren't for these considerations I would think that progress on LM agents would be clearly and significantly positive. 

"Overhang" in LM agents seems risky

Right now people are investing billions of dollars in scaling up LMs. If people only invested millions of dollars in improving LM agents, and such agents were important for overall performance, then I think we would be faced with a massive "overhang:" small additional investments in LM agents could significantly improve overall AI performance.

Under these conditions, increasing investment to speed up LM agents today is likely to slow down LM agents in the future, picking low-hanging fruit that would instead have been picked later when investment increased. If I had to guess, I'd say that accelerating AI progress by 1 day today by improving LM agents would give us back 0.5 days later. (This clawback comes not just from future investments in general agents, but also from the domain-specific investment needed to make a valuable product in any given domain.) I am sympathetic to a broad range of estimates, from 0 to 0.9.[5]

This leaves us with an ambiguous sign, because time later seems much more valuable than time now:[6]

  • As AI systems become risky, I expect technical work on risk to increase radically. I expect us to study failures in the lab, and work on systems more closely analogous to those we care about. If I had to guess I'd say that having an extra day while AI systems are very powerful is probably 2x better than a day now (and in many cases much more).
  • Policy responses to AI seem to be driven largely by the capabilities of AI systems. I think that having 1 extra day for policy progress today would be way better than 2 extra days a few years ago, and I expect multiple further doublings.[7]
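
The trade described in these bullets can be made explicit with toy arithmetic (a sketch using the illustrative 0.5-day clawback and 2x later-time multiplier from above; this is not a real model):

```python
def net_value_of_acceleration(clawback: float, later_value_multiplier: float) -> float:
    """Net change in value-weighted preparation time from accelerating
    progress by 1 day now, if a fraction `clawback` of that day is given
    back as slower progress later, and a day of preparation later is
    worth `later_value_multiplier` days of preparation now."""
    cost_now = 1.0  # one day of (lower-value) preparation time lost today
    benefit_later = clawback * later_value_multiplier  # time regained later, at a premium
    return benefit_later - cost_now

# With a 0.5 clawback and later time worth 2x, the trade is exactly neutral,
# matching the ambiguous sign discussed in the text; a larger clawback or a
# higher premium on later time flips it positive.
```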

So even if LM agents had no relevance for safety, I would feel ambivalent about whether it is good to speed up or slow them down. (I feel similar ambivalence about many forms of pause, and as I've mentioned I feel like higher investment in the past would quite clearly have slowed down progress now and would probably be net positive, but I think LM agents are an unusually favorable case.)

If you told me that existing language models could already be transformative with the right agent design, I think this position would become stronger rather than weaker. I think in that scenario the overwhelmingly most important game is noticing this overhang, slowing down progress past GPT-4, and starting to get transformative work out of relatively safe modern ML systems rather than overshooting badly.

I think this overhang argument applies to some extent for most investments in 2023; for example if AI labs buy all the GPUs today then they will get an immediate boost by training bigger models next year, but the boost after that will require having TSMC build more GPUs and so will be much slower (and the one after that will require building new fabs and be much slower). I mentioned it in the previous section and do think it's a major factor explaining why I place a lower premium on slowing down AI than other people. However I think it's a more important factor for LM agents than for e.g. improving the efficiency of LMs or investing more in hardware.

Understanding of capabilities is valuable

I think that a broad understanding of AI capabilities, and how those capabilities are likely to change over time, would significantly reduce risks:

  • AI developers are less likely to invest adequately in precautions if they underestimate the capabilities of systems they build. For example, debates about AI lab security requirements are often (sensibly) focused on the potential harms from a leak, which are in turn dominated by estimates of the capabilities of current and near-future systems.
  • Dangerous capabilities of AI systems seem like a major driver of policy reactions. Unless we have major warning shots (e.g. an AI system taking over a datacenter) I believe information about capabilities will be an extremely important driver of policy reactions, and will be more central for determining policy than determining investment or researcher interest.
  • Even when AI developers attempt to behave cautiously, underestimating AI capabilities can lead them to fail. For example they may incorrectly believe AI systems can't distinguish tests from the real world or can't find a way to undermine human control. This kind of underestimation seems like one of the easiest ways for risk management to break down or for people to incorrectly conclude that their current precautions are adequate.
  • Beyond those specific effects I think our ability to handle risk depends in a bunch of more nebulous ways on how seriously it is taken by the ML research community and researchers at AI labs, and those reactions are tightly coupled. It is also sensitive to the range of defensible positions which can be used to justify reckless policies, and increasing clarity about capabilities makes it harder to defend unreasonable positions.

This factor seems especially large over the next few years, where most risk comes from the possibility that humanity is taken by surprise. I think this is the most important timeframe for individual decisions about sharing information, since the effects of current decisions will be increasingly attenuated over longer horizons.

Over the longer term I think the dangerous capabilities of AI systems will likely be increasingly clear. But I think better understanding still improves how prepared we are and reduces the risk of large surprises.

I think the importance of information about capabilities is pretty robust across worldviews:

  • I've laid out the inside view as I see it, which I find fairly compelling.
  • In the broader scientific community I think there is a strong (and in my view correct) presumption that more accurate information tends to reduce risk absent strong arguments to the contrary. This presumption in favor of measurement is somewhat weaker beyond the scientific community, but I think remains the prevailing view.
  • Although the MIRI-sphere has very different background views from mine, wild underestimation of model capabilities tends to play a central role in Eliezer's stories about how AI goes wrong. (Although I think he would still be opposed to sharing information about capabilities.)

I think that significantly increasing and broadening an understanding of LM capabilities would very significantly decrease risk, but it's hard to quantify this effect (because it's hard to measure increases in understanding). Qualitatively, I believe that realistic increases in understanding could cut risk by tens of percent.

Information about capabilities is more impactful for understanding than speed

I think that more accurate information about LM capabilities and limitations can drive faster progress in two big ways:

  • If people are underestimating the competence of models, then correcting their mistake may cause them to invest more.
  • Better evaluations or understandings of limitations could inspire researchers to make more effective progress.

I think these are real effects. But combining with the unquantified estimate in the last section, if I had to make a wild guess I'd say the benefits from sharing information about ML capabilities are 5-10x larger than the costs from acceleration (even without focusing attention on LM agents).

Here are the main reasons why I think this acceleration cost is smaller than you might fear:

  • I think that people correctly estimating AI capabilities earlier will increase investment earlier, but that predictably makes it harder to scale up in the future and therefore slows down progress later.[8] This is part of why my estimate would only be a 15-30% reduction. But more importantly, I think that speed later matters much more than speed now, and so e.g. speeding up by 1 day now and getting back 0.5 days later would probably be a positive trade. As a result I'm ambivalent about the net sign of increasing investment now, and think it's harmful but much less bad than you might expect.[9] This is the same dynamic discussed in the "Overhang" section above.
  • There are currently a significant number of people who are very excited about scaling up AI as quickly as they can, and these people are already pushing hard enough to start running into significantly diminishing returns from complementary resources like compute, investment, training new scientists, and so on. So the effects of getting more people excited are sublinear. In contrast, policy reactions (both government policy and lab policy) seem more sensitive to consensus and clarity.
  • Sharing information about capabilities disproportionately informs people outside of the field. People working with language models have a much stronger intuitive picture of their capabilities and limitations, such that legible information is particularly valuable to people who don't work with these systems. The people most responsible for pushing progress faster seem the least sensitive to this information.
  • There are a lot of factors that affect progress, and most of them just can't have giant impacts. Progress depends on the availability of talented researchers and engineers, time to train them, technical advances in computer hardware, availability of funding at large tech companies and from startup investors, demonstrated commercial demand, quality of research ideas and so on. I think the outsized role of understanding capabilities in addressing risk is exceptional and this is a high bar to try to meet given how many factors are at play. When I hear people expressing the most extreme concerns about acceleration (whether by training new people, contributing new ideas, creating media attention, popularizing a product…), I often feel like the purported sources of variance explain way more than 100% of the variance in the pace of progress.
  • I think that recent progress in LMs has been primarily driven by increasing scale and general improvements in LM efficiency, and has not been particularly sensitive to researchers having a detailed picture of the capabilities and limitations of models. So I think that most of the acceleration effect is flowing through increased interest and investment rather than improvements in research quality.
  • Compared to some people I'm more skeptical about the contingency of innovations like chain of thought or LM agents (and therefore I'm more skeptical about the impact of capabilities understanding that could motivate such work). For example, chain of thought ("inner monologue") appeared multiple times independently in the wild within 2 months of the GPT-3 release (and was discussed internally within OpenAI before GPT-3 was trained). It appears to have failed to spread more broadly due to a combination of limited access to GPT-3 and not yet working very well. Similarly, tools like LangChain and AutoGPT seem to have caught on before they actually work in practice, and to have been developed and explored several times independently. I think that in practice these kinds of general innovations will usually be eclipsed by domain-specific schlep by people rolling out LM products in specific domains.

I think we should make this decision based on best estimates of costs and benefits

One could have a variety of procedural objections to sharing information even if the benefits appear to exceed the cost. I don't think these apply strongly, and therefore I think we should make this decision based on object level analysis:

  • I think that improving collective understanding of ML capabilities is a big deal according to many worldviews, and that an attempt to limit public understanding could easily have catastrophic consequences. So this isn't a matter of comparing serious costs on one worldview to small benefits on another, it's a decision where both directions have significant stakes and there is no default "safe" option.
  • I think that small groups shouldn't take unilateral actions contrary to consensus estimates of risk. But in this case I believe that a majority of researchers who think about AI safety are supportive of sharing information (and possibly even a majority of effective altruists, who I expect to be one of the most skeptical groups). And I don't think there is any clear articulation of the case against sharing such information that engages qualitatively with the kinds of considerations raised in this post. I would become more skeptical about sharing information if I learned that there actually was majority opposition in some relevant community.
  • It may seem suspicious that I am describing so many arguments in favor of sharing information and that all the considerations coincidentally point in the same direction. But this isn't a coincidence. I was excited about incubating the Evals project, and encouraged them to work on dangerous capability evaluations, and was supportive of them working on evaluating LM agents, because I thought that these activities have large positive effects that easily outweigh downsides. It's true that in my view this is an unusually lopsided issue, but that's causally upstream of the decision to work on it rather than being a post hoc rationalization.
  1. ^

    In this post I focus mostly on the risk of AI takeover, because the community worried about takeover is the primary place where I have encountered a widespread belief that measurement of general LM capabilities may be actively counterproductive.

  2. ^

    It's conceivable that LM agents pose novel risks that are as large or larger than existing threat models—but to the extent that is the case I am if anything even more excited about exploring such agents sooner, and even more skeptical about buying time to e.g. do (apparently-misguided) alignment research today.

  3. ^

    Because ML capabilities researchers will already be seeking out and implementing these improvements.

  4. ^

    Most explicitly, see the 2016 discussion of the bootstrapping protocol in ALBA, in which models trained by RLHF solve harder problems by using chain of thought and task decomposition. See also this early 2015 post, which has some distracting simplifications and additional facts, but which presents LM agents in what I would say is essentially the same form that seems most plausible today. This isn't really related to the thrust of this post and is mostly me just feeling proud of my picture holding up well, but I do think the history here is somewhat relevant to understanding my view—this isn't something I'm making up now, this is comparing the real world to expectations from many years ago and seeing that LM agents look even more likely to play a central role.

  5. ^

    Note that this number could be negative. The average across all forms of progress is 0, since accelerating everything by 1 day should decrease timelines by exactly 1 day. I think that areas with high investment are naturally below zero and those with lower investment are naturally above zero, because low-investment areas will expand more easily later. I think probably all software progress and ML-specific investment is above 0, and that improvements in the quantity and quality of compute are well below 0.

  6. ^

    Another reason that time later is more valuable than time now is that AI systems themselves will be doing a large fraction of the cognitive work in the future. But this consideration cancels out when you do the full analysis, both increasing the value of time later and making it harder to get back time later.

  7. ^

    One counterargument is that almost all the policy value comes from policy research driven primarily by altruists who aren't significantly more likely to work on AI as risks become more concrete and systems become more capable. I don't personally find this very plausible: it seems like the quantity of research has in fact increased, and that the quality and relevance of that research has also improved significantly.

  8. ^

     I've seen the opposite asserted—that momentum means that accelerating now just accelerates more in the future. I don't think this issue is completely straightforward and it would be a longer digression to really get to the bottom of it. But right now I feel like on-paper analysis and observations of the last 10 years of AI both point pretty strongly towards this conclusion, and I haven't really seen the alternative laid out.

  9. ^

     By analogy, it seems to me that if humanity had trained GPT-4 for $250M in 2012, using a larger ML community and a larger number of worse computers, the net effect would have been a reduction in risk. Making further progress from that point would have been harder and easier to regulate, since scaling up spending would have become prohibitively difficult and further ML progress would only have been possible with large amounts of labor. On top of that, effective AI populations would have been smaller since AI would already have been using a much larger fraction of humanity's computing hardware, further computing scaleup would have been increasingly bottlenecked, and an intelligence explosion would plausibly have proceeded several times more slowly. One could argue that increasing preparedness between 2012 and 2022 was enough to compensate for this factor, but that doesn't look right to me. I am more ambivalent about the effects of acceleration at this point and think it is negative in expectation, because I think society is now investing much more heavily in trying to understand and adapt to the AI we already have, and we're already on track to scale up through the next 5 orders of magnitude distressingly quickly.

Comments

Language model agents are built out of LM parts that solve human-comprehensible tasks, composed along human-comprehensible interfaces.

This seems like a very narrow and specific definition of language model agents that doesn't even obviously apply to the most agentic language model systems we have right now. It is neither the case that human-comprehensible task decomposition actually improves performance on almost any task for current language models (Auto-GPT does not actually work), nor is it clear that current RLHF- and RLAIF-trained models are "solving a human-comprehensible task" when I chat with them. They seem to pursue a highly-complicated mixed objective which includes optimizing for sycophancy, and various other messy things, and their behavior seems only weakly characterized as primarily solving a human-comprehensible task.

But even assuming that current systems are doing something that is accurately described as solving human-comprehensible tasks composed along human-comprehensible interfaces, this seems (to me) unlikely to continue much into the future. RLHF and RLAIF already encourage the system to do more of its planning internally in a non-decomposed manner, due to the strong pressure towards giving short responses, and it seems likely we will train language model agents on various complicated game-like environments in order to train them explicitly to do long-term planning.

These systems would still meaningfully be "LM agents", but I don't see any reason to assume that those systems would continue to do things in the decomposed manner that you seem to be assuming here.

My guess is it might be best to clarify that you are not in-general in favor of advancing agents built on top of language models, which seem to me to be very hard to align in-general, but are only in favor of advancing a specific technology for making agents out of language models, which tries to leverage factored cognition and tries to actively avoid giving the agents complicated end-to-end tasks with reinforcement learning feedback. 

And my guess is that we then have a disagreement in that you expect that by-default the ML field will develop in a direction that will leverage factored-cognition style approaches, which doesn't currently seem that likely to me. I expect more end-to-end reinforcement learning on complicated environments, and more RLHF-style optimization to internalize agentic computation. Might be worth trying to make a bet on the degree to which LM capabilities will meaningfully be characterized as doing transparent task-decomposition and acting along simple human-comprehensible interfaces.

My current guess is that sharing info on language model capabilities is overall still good, but I disagree with the "Improvements in LM agents seem good for safety" section. My best guess is that the fastest path to LM agents is to just do more end-to-end training on complicated environments. This will not produce anything that is particularly easy to audit or align. Overall, this approach to building agents seems among the worst ways AI could develop in terms of making it easy to align, and my guess is there are many other ways that would be substantially better to push ahead instead.

I do think that right now LMs are by far closest to doing useful work by exploiting human-legible interfaces and decompositions. Chain of thought, simple decompositions, and imitations of human tool use are already important for LM performance. While more complex LM agents add only a small amount of additional value, it seems like extrapolating trends would make them pretty important soon.

Overall I think the world is shaping up extremely far in the direction of "AI systems learn to imitate human cognitive steps and then compose them into impressive performance." I'm happy to bet about whether that trend will continue to the extent we can operationalize it. E.g. I'd bet on increasingly wide gaps between each of LM snap judgments, chain of thought, and tool-use. I don't have a strong view about more complex decompositions unless context length is a serious limitation. I would guess that end-to-end optimization will make at most marginal differences in efficacy (probably smaller than RLHF).

To the extent models trained with RLHF are doing anything smart in the real world I think it's basically ~100% by solving a human-comprehensible task. Namely humans give the system a task, and it tries to do some rough combination of what a particular kind of human demonstrator would do and what a particular kind of human evaluator would rate highly. There is no further optimization to take intelligent actions in the world.

Chain of thought, simple decompositions, and imitations of human tool use (along comprehensible interfaces) are already important for LM performance.

I want to separate prompt-engineering from factored cognition. There are various nudges you can use to get LLMs to think in ways that are more productive or well-suited for the task at hand, but this seems quite different to me from truly factored cognition, where you spin up a sub-process that solves a sub-problem, and then propagate that back up to a higher-level process (like Auto-GPT). I don't currently know of any not-extremely-gerrymandered task where doing this actually improves task performance compared to just good prompt engineering. I've been looking for examples of this for a while, so if you do have any, I would greatly appreciate it.

Overall I think the world is shaping up extremely far in the direction of "AI systems learn to imitate human cognitive steps and then compose them into impressive performance."

I've seen little evidence of this so far, and don't think current LLM performance is even that well-characterized by this. This would be great, but I don't currently think it's true.

For example, I don't really understand whether this model is surprised or unsurprised by the extreme breadth of knowledge that modern LLMs have. I don't see any "imitation of human cognitive steps" when an LLM is capable of remembering things from a much wider range of topics. It seems just that its way of acquiring knowledge is very different from humans, giving rise to a very different capability-landscape. This capability does not seem to be built out of "composition of imitations of human cognitive steps". 

Similarly, when I use Codex for programming, I do not see any evidence that suggests that Codex is solving programming problems by composing imitations of human cognitive steps. Indeed, it mostly seems to just solve the problems in one-shot, vastly faster than I would be able to even type, and in a way that seems completely alien to me as a programmer.

E.g. I'd bet on increasingly wide gaps between each of LM snap judgments, chain of thought, and tool-use.

I do indeed predict that we will see chain-of-thought become less faithful as model capabilities increase, and that other ways of doing the same thing as chain-of-thought but internalized to the model will take over. 

I have no strong opinions on tool-use. Seems like the LLMs will use APIs the same way as humans would. I do think if you train more on end-to-end tasks, the code they write to solve sub-problems will become less readable. I have thought less about this and wouldn't currently take a bet.

I would guess that end-to-end optimization will make at most marginal differences in efficacy (probably smaller than RLHF).

I am not super confident of this, but my best guess is that we will see more end-to-end optimization, and that it will make a big difference in task performance. It also seems like a natural endpoint of something like RLAIF, where you have the AI guide a lot of the training process itself when given a highly complicated objective, and then you do various forms of RL on self-evaluations.

I don't currently know of any not-extremely-gerrymandered task where [scaffolding] actually improves task performance compared to just good prompt engineering. I've been looking for examples of this for a while, so if you have any, I would greatly appreciate them.

Voyager is a scaffolded LLM agent that plays Minecraft decently well (by pulling in a textual description of the game state, and writing code interfacing with an API). It is based on some very detailed prompting (see the appendix), but obviously could not function without the higher-level control flow and several distinct components that the scaffolding implements.

It does much better than AutoGPT, and also the paper does ablations to show that the different parts of the scaffolding in Voyager do matter. This suggests that better scaffolding does make a difference, and I doubt Voyager is the limit.

I agree that an end-to-end trained agent could be trained to be better. But such training is expensive, and it seems like for many tasks, before we see an end-to-end trained model doing well at it, someone will hack together some scaffold monstrosity that does it passably well. In general, the training/inference compute asymmetry means that using even relatively large amounts of inference to replicate the performance of a larger / more-trained system on a task may be surprisingly competitive. I think it's plausible this gap will eventually mostly close at some capability threshold, especially for many of the most potentially-transformative capabilities (e.g. having insights that draw on a large basis of information not memorised in a base model's weights, since this seems hard to decompose into smaller tasks), but it seems quite plausible the gap will be non-trivial for a while.

Voyager is a scaffolded LLM agent that plays Minecraft decently well (by pulling in a textual description of the game state, and writing code interfacing with an API). It is based on some very detailed prompting (see the appendix), but obviously could not function without the higher-level control flow and several distinct components that the scaffolding implements.

That's a good example, thank you! I now remember looking at this a few weeks ago and thinking of it as an interesting example of scaffolding. Thanks for reminding me.

I agree that an end-to-end trained agent could be trained to be better. But such training is expensive, and it seems like for many tasks, before we see an end-to-end trained model doing well at it, someone will hack together some scaffold monstrosity that does it passably well. In general, the training/inference compute asymmetry means that using even relatively large amounts of inference to replicate the performance of a larger / more-trained system on a task may be surprisingly competitive.

I do wonder how much of this is just the result of an access gap. Getting one of these scaffolded systems to work seems also a lot of hassle and very fiddly, and my best guess is that if OpenAI wanted to solve this problem, they would probably just reinforcement learn a bunch, and then maybe they would do a bit of scaffolding, but the scaffolding would be a lot less detailed and not really be that important to the overall performance of the system.

Although this is an important discussion I want to emphasize up front that I don't think it's closely related to the argument in the OP. I tried to revise the OP to emphasize that the first section of the article is about LM agent improvements that are relevant to engineering better scaffolding rather than improving our ability to optimize such agents end to end.

I've seen little evidence of this so far, and don't think current LLM performance is even that well-characterized by this. This would be great, but I don't currently think it's true.

If you allow models to think for a while they do much better than if you just ask them to answer the question. By "think for a while" we mean they generate one sentence after another in the same way a human would. Their ability to use chain of thought seems to come essentially entirely from copying human chains of thought rather than e.g. using filler tokens to parallelize cognition or RL fine-tuning teaching them novel cognitive strategies.

I agree that models also memorize a lot of facts. Almost all the facts they actually use are facts that humans know, which they memorized by observing humans using them or stating them. So I don't really consider this evidence one way or the other.

If you want to state any concrete prediction about the future I'm happy to say whether I agree with it. For example:

  • I think that the gap between "spit out an answer" and chain of thought / tool use / decomposition will continue to grow. (Even as chain of thought becomes increasingly unfaithful for questions of any fixed difficulty, since models become increasingly able to answer such questions in a single shot.)
  • I think there is a significant chance decomposition is a big part of that cluster, say a 50% chance that context-hiding decomposition obviously improves performance by an amount comparable to chain of thought.
  • I think that end-to-end RL on task performance will continue to result in models that use superficially human-comprehensible reasoning steps, break tasks into human-comprehensible pieces, and use human interfaces for tools.

My sense right now is that this feels a bit semantic.

I changed the section to try to make it a bit more clear that I mean "understanding of LM agents." For the purpose of this post, I am trying to mostly talk about things like understanding the capabilities and limitations of LM agents, and maybe even incidental information about decomposition and prompting that help overcome these limitations. This is controversial because it may allow people to build better agents, but I think this kind of understanding is helpful if people continue to build such agents primarily out of chain of thought and decomposition, while not having much impact on our ability to optimize end-to-end.


Something that I think is worth noting here: I don't think that you have to agree with the "Accelerating LM agents seems neutral (or maybe positive)" section to think that sharing current model capabilities evaluations is a good idea as long as you agree with the "Understanding of capabilities is valuable" section.

Personally, I feel much more uncertain than Paul on the "Accelerating LM agents seems neutral (or maybe positive)" point, but I agree with all the key points in the "Understanding of capabilities is valuable" section, and I think that's enough to justify substantial sharing of model capabilities evaluations (though I think you'd still want to be very careful about anything that might leak capabilities secrets).

Yeah, I think sections 2, 3, 4 are probably more important and should maybe have come first in the writeup. (But other people think that 1 dominates.) Overall it's not a very well-constructed post.

At any rate thanks for highlighting this point. For the kinds of interventions I'm discussing (sharing information about LM agent capabilities and limitations) I think there are basically two independent reasons you might be OK with it---either you like sharing capabilities in general, or you like certain kinds of LM agent improvements---and either one is sufficient to carry the day.

Note that Evals has just published a description of some of their work evaluating GPT-4 and Claude. Their publication does not include transcripts, the details of the LM agents they evaluated, or detailed qualitative discussion of the strengths and weaknesses of the agents they evaluated. I believe that eventually Evals should be considerably more liberal about sharing this kind of information; my post is explaining why I believe that.

There's a bunch of considerations and models mixed together in this post. Here's a way I'm factoring some of them, which other people may also find useful.

I'd consider counterfactuality the main top-level node; things which would have been done anyway have radically different considerations from things which wouldn't. E.g. doing an eval which (carefully, a little bit at a time) mimics what e.g. chaosGPT does, in a controlled environment prior to release, seems straightforwardly good so long as people were going to build chaosGPT soon anyway. It's a direct improvement over something which would have happened quickly anyway in the absence of the eval. That argument still holds even if a bunch of the other stuff in the post is totally wrong or totally the wrong way of thinking about things (e.g. I largely agree with habryka's comment about comprehensibility of future LM-based agents).

On the other hand, building a better version of chaosGPT which users would not have tried anyway, or building it much sooner, is at least not obviously an improvement. I would say that's probably a bad idea, but that's where the rest of the models in the post start to be relevant to the discussion.

Alas, we don't actually know ahead of time which things will or won't counterfactually be tried, so there's some grey zone. But at least this frame makes it clear that "what would people counterfactually try anyway?" is a key subquestion.

(Side note: also remember that counterfactuality gets trickier in multiplayer scenarios where players are making decisions based on their expectations of other players. We don't want a situation where all the major labs build chaosGPT because they expect all the others to do so anyway. But in the case of chaosGPT, multiplayer considerations aren't really relevant, because somebody was going to build the thing regardless of whether they expected OpenAI/Deepmind/Anthropic to build the thing. And I expect that's the prototypical case; the major labs don't actually have enough of a moat for small-game multiplayer dynamics to be a very good model here.)

Thanks for the thoughtful post, lots of important points here. For what it’s worth, here is a recent post where I’ve argued in detail (along with Cameron Domenico Kirk-Giannini) that language model agents are a particularly safe route to AGI: https://www.alignmentforum.org/posts/8hf5hNksjn78CouKR/language-agents-reduce-the-risk-of-existential-catastrophe

I believe that LM agents based on chain of thought and decomposition seem like the most plausible approach to bootstrapping subhuman systems into trusted superhuman systems. For about 7 years using LM agents for RLAIF has seemed like the easiest path to safety,[4] and in my view this is looking more and more plausible over time.

I agree wholeheartedly with the first sentence. I'm not sure why you take it to support the second sentence; I feel the first sentence supports my disagreement with the second! Long-horizon RL is a different way to get superhuman systems, and one that encourages intervening in the feedback process if the agent is capable enough. Doesn't the first sentence support the case that it would be safer to stick to chain of thought and decomposition as the key drivers of superhumanness, rather than using RL?

It would be safest of all to just not build powerful AI for a very long time. But alas, that seems wildly uncompetitive and so would require some kind of strong global coordination (and would create considerable instability and leave significant value on the table for other worldviews).

It's possible that "human-level AI with CoT" will be competitive enough, but I would guess not.

So to me the obvious approach is to use chain of thought and decomposition to improve performance, and then to distill the result back into the model.

You could try to do distillation with imitation learning. This is way more likely to be competitive than with no distillation at all.

But it still seems like it has a very good chance of being uncompetitive because the imitation objective significantly impairs performance and creates all kinds of artifacts. Using process-based RL for distillation seems like it has essentially the same safety profile as using imitation learning, while avoiding the obvious pathologies and having a much higher probability of being competitive.

(People give various reasons that RL in the distillation step is less safe than imitation learning in the distillation step, but so far I haven't found anything at all persuasive.)

I think there's still a good chance that process-based RL in the distillation step still can't be competitive and so you need to talk about how to develop new techniques or prudently incorporate outcomes. But I think it's at least much more likely to be competitive than CoT-only, or imitation learning in the distillation step. (Perhaps it cuts down the probability of deal-breaking uncompetitiveness by 30%, compared to using imitation learning alone for distillation.)

What is process-based RL?

I think your intuitions about costly international coordination are challenged by a few facts about the world. 1) Advanced RL, like open borders + housing deregulation, guarantees vast economic growth in wealthy countries. Open borders have the potential, in a way that seems kinda speculative but intuitively forceful for most people, to existentially threaten the integrity of a culture, including especially its norms; AI has the potential, in a way that seems kinda speculative but intuitively forceful for most people, to existentially threaten all life. The decisions of wealthy countries are apparently extremely strongly correlated, maybe in part for "we're all human"-type reasons, and maybe in part because legislators and regulators know that they won't get their ear chewed off for doing things like the US does. With immigration law, there is no attempt at coordination; quite the opposite (e.g. Syrian refugees in the EU). 2) The number of nuclear states is stunningly small if one follows the intuition that wildly uncompetitive behavior, which leaves significant value on the table, produces an unstable situation. Not every country needs to sign on eagerly to avoiding some of the scariest forms of AI. The US/EU/China can shape other countries' incentives quite powerfully. 3) People in government do not seem to be very zealous about economic growth. Sorry this isn't a very specific example. But their behavior on issue after issue does not seem very consistent with someone who would see, I don't know, 25% GDP growth from their country's imitation learners, and say, "these international AI agreements are too cautious and are holding us back from even more growth"; it seems much more likely to me that politicians' appetite for risking great power conflict requires much worse economic conditions than that.

In cases 1 and 2, the threat is existential, and countries take big measures accordingly. So I think existing mechanisms for diplomacy and enforcement are powerful enough "coordination mechanisms" to stop highly-capitalized RL projects. I also object a bit to calling a solution here "strong global coordination". If China makes a law preventing AI that would kill everyone with 1% probability if made, that's rational for them to do regardless of whether the US does the same. We just need leaders to understand the risks, and we need them to be presiding over enough growth that they don't need to take desperate action, and that seems doable.

Also, consider how much more state capacity AI-enabled states could have. It seems to me that a vast population of imitation learners (or imitations of populations of imitation learners) can prevent advanced RL from ever being developed, if the latter is illegal; they don't have to compete with them after they've been made. If there are well-designed laws against RL (beyond some level of capability), we would have plenty of time to put such enforcement in place.

By process-based RL, I mean: the reward for an action doesn't depend on the consequences of executing that action. Instead it depends on some overseer's evaluation of the action, potentially after reading justification or a debate about it or talking with other AI assistants or whatever. I think this has roughly the same risk profile as imitation learning, while potentially being more competitive.
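The definition above can be sketched in a few lines. This is a toy illustration, not anyone's actual training setup: `overseer_score` and `execute` are made-up stand-ins for an overseer's judgment and for consequence-based payoff, respectively.

```python
# Toy sketch of process-based vs. outcome-based reward.
# All functions are hypothetical stand-ins for illustration.

def overseer_score(action: str) -> float:
    """Overseer evaluates the proposed action itself (e.g. after
    reading a justification), without ever executing it."""
    return 1.0 if "justified" in action else 0.0

def execute(action: str) -> float:
    """Environment returns a consequence-based payoff."""
    return 10.0 if "hack" in action else 1.0

def process_based_reward(action: str) -> float:
    # Reward depends only on the overseer's judgment of the action,
    # never on the consequences of executing it.
    return overseer_score(action)

def outcome_based_reward(action: str) -> float:
    # Reward depends on what actually happens when the action runs.
    return execute(action)

# The action with the best consequences need not be the one the
# overseer endorses; that gap is where outcome-based optimization
# can drift away from oversight.
actions = ["justified plan", "hack the reward channel"]
best_process = max(actions, key=process_based_reward)
best_outcome = max(actions, key=outcome_based_reward)
```

Under this toy model, optimizing the process-based reward selects the overseer-endorsed action, while optimizing the outcome-based reward selects whatever maximizes the payoff, endorsed or not.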

I'm generally excited and optimistic about coordination. If you are just saying that AI non-proliferation isn't that much harder than nuclear non-proliferation, then I think I'm with you. But I think (i) it's totally fair to call that "strong global coordination," (ii) you would probably have to do a somewhat better job than we did of nuclear non-proliferation.

I think the technical question is usually going to be about how to trade off capability against risk. If you didn't care about that at all, you could just not build scary ML systems. I'm saying that you should build smaller models with process-based RL. 

It might be good to focus on legible or easy-to-enforce lines rather than just trading off capability vs risk optimally. But I don't think that "no RL" is effective as a line---it still leaves you with a lot of reward-hacking (e.g. by planning against an ML model, or predicting what actions lead to a high reward, or expert iteration...). Trying to avoid all these things requires really tightly monitoring every use of AI, rather than just training runs. And I'm not convinced it helps significantly with deceptive alignment.

So in any event it seems like you are going to care about model size. "No big models" is also a way easier line to enforce. This is pretty much like saying "minimize the amount of black-box end-to-end optimization you do," which feels like it gets closer to the heart of the issue.

If you are taking that approach, I think you would probably prefer to do process-based RL with smaller models, rather than imitation learning with bigger models (and will ultimately want to use outcomes in relatively safe ways). Yes it would be safer to use neither process-based RL nor big models, and just make your AI weaker. But the main purpose of technical work is to reduce how demanding the policy ask is---how much people are being asked to give up, how unstable the equilibrium is, how much powerful AI we can tolerate in order to help enforce or demonstrate necessity. Otherwise we wouldn't be talking about these compromises at all---we'd just be pausing AI development now until safety is better understood.

I would quickly change my tune on this if e.g. we got some indication that process-based RL increased rather than decreased the risk of deceptive alignment at a fixed level of capability.

I think [process-based RL] has roughly the same risk profile as imitation learning, while potentially being more competitive.

I agree with this in a sense, although I may be quite a bit a more harsh about what counts as "executing an action". For example, if reward is based on an overseer talking about the action with a large group of people/AI assistants, then that counts as "executing the action" in the overseer-conversation environment, even if the action looks like it's for some other environment, like a plan to launch a new product in the market. I do think myopia in this environment would suffice for existential safety, but I don't know how much myopia we need.

If you're always talking about myopic/process-based RLAIF when you say RLAIF, then I think what you're saying is defensible. I speculate that not everyone reading this recognizes that your usage of RLAIF implies RLAIF with a level of myopia that matches current instances of RLAIF, and that that is a load-bearing part of your position.

I say "defensible" instead of fully agreeing because I weakly disagree that increasing compute is any more of a dangerous way to improve performance than by modifying the objective to a new myopic objective. That is, I disagree with this:

I think you would probably prefer to do process-based RL with smaller models, rather than imitation learning with bigger models

You suggest that increasing compute is the last thing we should do if we're looking for performance improvements, as opposed to adding a very myopic approval-seeking objective. I don't see it. I think changing the objective from imitation learning is more likely to lead to problems than scaling up the imitation learners. But this is probably beside the point, because I don't think problems are particularly likely in either case.

Advanced RL, like open borders + housing deregulation, guarantees vast economic growth in wealthy countries.

I think this comparison is imperfect. Standard economic models predict an acceleration in the growth rate by at least an order of magnitude, and usually more. Over one decade, an increase in economic capacity by 1-4 orders of magnitude seems probable. By contrast, my understanding was that the models of open borders roughly predict a one-time doubling of world GDP over several decades, and for housing, it's something like a 50% increase in GDP over decades.

Perhaps a better way to put this is that if AI is developed anywhere, even in a small country, that country could soon (within 10 years) grow to be the world's foremost economic power. Nothing comparable seems true for other policies. There only really needs to be one successful defecting nation for this coordination to fall apart.

Good post.

Other points aside, the proposition "LM agents are an unusually safe way to build powerful AI systems" seems really important; it would be great to see more research/intuitions on this + clarification on various flavors of "LM agents."

I guess one crux for sharing research on LM agents is whether there are viable alternative paths to powerful AI systems. If LM agents are clearly the easiest path, there's less reason to share research on them; if a less safe path looks similarly easy, we should differentially advance LM agents.

I'm not aware of alternative paths that look anywhere near as easy as LM-agents. Or: I don't know what viable alternative paths LM-agents are supposed to be safer than. (Edit: some alignment researcher friends mention old-fashioned RL agents as a possible path to powerful AI that's less safe than LM-agents but say that path looks substantially harder than LM-agents, such that we don't need to boost LM-agents more.)

Maybe rather than 'different paths' Paul just means that capabilities can come from more-powerful-LMs or more-sophisticated-agent-scaffolding. He says:

at a fixed level of capability, I think the more we are relying on LM agents (rather than larger LMs) the safer we are.

I buy something like this, at least. But (I weakly intuit) we'll almost exclusively be relying on LM agents rather than mere next-token-predictors by default; there's no need to boost LM agents. And even if that's good, that doesn't mean that marginal improvements in LM agents' sophistication/complexity are safer than marginal improvements in underlying-LM-capability. (I don't have a take on this-- just flagging it as a crux.)

My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes.

Curated.

This post lays out legible arguments for its position, which I consider to be one of the best ways to drive conversations forward, short of demonstrating convincing empirical results (which seem like they'd be difficult to obtain in this domain).  In this case, I hope that future conversations about sharing LLM capabilities focus more on object-level details, e.g. what evidence would bear on the argument about LM agent "overhang".

I think once you have an LM agent that is sufficiently powerful so as to be economically competitive as an independent actor in a lot of domains (if that is even possible - I am still skeptical about LLMs), we've reached "Armageddon". At that point, the economic pressure to improve upon it will be massive, and there is no particular reason these improvements have to stay limited to LLMs (you could e.g. build some sort of backchaining/optimization on top of it and use it to train the LLM, burning away the interpretability/safety benefits of LLMs). And I have a hard time seeing AI safety winning over people purely concerned with empowerment maximization in this fight, as the latter have a simpler problem to solve and can therefore probably solve it faster.

Though, as I noted in a separate comment, I agree with the basic arguments in "Understanding of capabilities is valuable" section, one thing that I'm still a bit worried about in the context of the ARC report explicitly is that labs might try to compete with each other on doing the "best" they can on the ARC eval to demonstrate that they have the most capable model, which seems probably bad (though it is legitimately unclear whether this is actually bad or not).

However, if it is really bad, here's an idea: I think you could avoid that downside while still capturing the upside of making it clear publicly how capable models are (e.g. for the purpose of galvanizing policy responses) by revealing only the max performance on each task across all the evaluated models, rather than revealing the results individually for each model.
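The proposed aggregation is simple enough to state in code. This is just a sketch of the reporting rule being suggested; the model names and scores are invented.

```python
# Sketch of the proposed reporting rule: given per-model, per-task
# eval results, publish only the per-task maximum across models, so
# the release can't be used to rank individual models.
results = {
    "model_a": {"task_1": 0.9, "task_2": 0.4},
    "model_b": {"task_1": 0.7, "task_2": 0.6},
}

def per_task_max(results: dict) -> dict:
    tasks = next(iter(results.values())).keys()
    return {t: max(r[t] for r in results.values()) for t in tasks}
```

Here `per_task_max(results)` would report `{"task_1": 0.9, "task_2": 0.6}`: the public signal "how capable is the frontier?" survives, while per-model comparisons do not.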

What we've currently published is 'number of agents that completed each task', which has a similar effect of making comparisons between models harder - does that seem like it addresses the downside sufficiently?

plus-one-ing the impulse to "look for third options"

I know that prediction markets don't really work in this domain (apocalypse markets are equivalent to loans), but what if we tried to approximate Solomonoff induction via a code golfing competition?

That is, we take a bunch of signals related to AI capabilities and safety (investment numbers, stock prices, ML benchmarks, number of LW posts, posting frequency or embedding vectors of various experts' Twitter accounts, etc.) and hold a collaborative competition to find the smallest program that generates this data. (You could allow the program to output probabilities sequentially, at a penalty of (log_(1/2) of the overall likelihood) bits.) Contestants are encouraged to modify or combine other entries (thus ensuring there are no unnecessary special cases hiding in the code).
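The scoring rule in the parenthetical can be written out explicitly. This is a sketch of the proposed competition metric, with made-up example entries: an entry is charged its length in bits plus log2(1/likelihood) bits for any probabilistic predictions, putting code length and prediction sharpness on the same scale.

```python
import math

# Hypothetical scoring rule for the code-golf competition described
# above: score = program length in bits + log2(1/likelihood) penalty.

def entry_score(source: str, likelihood_of_data: float) -> float:
    length_bits = 8 * len(source.encode("utf-8"))          # code length
    penalty_bits = math.log2(1.0 / likelihood_of_data)     # prediction penalty
    return length_bits + penalty_bits

# A deterministic entry reproducing the data exactly (likelihood 1)
# pays no penalty; a shorter probabilistic entry trades code length
# for likelihood-penalty bits.
exact = entry_score("print(data)", 1.0)       # 11 chars -> 88 bits + 0
probabilistic = entry_score("guess()", 0.25)  # 7 chars -> 56 bits + 2
```

This matches the Solomonoff-flavored intuition: the winning entry minimizes description length of the data, whether that description lives in the code or in the residual probability assignments.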

By analyzing such a program, we would get a very precise model of the relationship between the variables, and maybe even could extract causal relationships.

(Really pushing the idea, you also include human population in the data and we all agree to a joint policy that maximizes the probability of the "population never hits 0" event. This might be stretching how precise of models we can code-golf though.)

Technically, taking a weighted average of the entries would be closer to Solomonoff induction, but the probability is basically dominated by the smallest program.

Under these conditions, increasing investment to speed up LM agents today is likely to slow down LM agents in the future, picking low-hanging fruit that would instead have been picked later when investment increased. If I had to guess, I'd say that accelerating AI progress by 1 day today by improving LM agents would give us back 0.5 days later.

Probably worth noting that the more time and investment people put into LM agents now, the better we will be at constructing connected and powerful LM agents in the future. Meaning that you reduce the "agency overhang", but you increase our ability to make use of that agency once we do get more powerful systems. If you invest in it less now, it'll take more time to make it work well in the future (though by that point you also have more powerful systems that get you to understand what works with LM agents much faster).

Sharing a comment I made in the past on this post:

I want to recognize there is some difficulty that comes with predicting which aspects will drive capability advances. I think there is value in reading papers (something that more alignment researchers should probably do) because it can give us hints at the next capability leaps. Over time, I think it can improve our intuition for what lies ahead and allows us to better predict the order of capability advances. This is how I’ve felt as I’ve been pursuing the Accelerating Alignment agenda (language model systems for accelerating alignment research). I’ve been at the forefront, reading Twitter/papers/etc to find insights into how to use language models for research and feel like I’ve been gaining a lot of intuition into where the field is going.

As you said, it's also important to remember that most of the field isn't directly aiming for AGI. Safety discussions, particularly about self-improvement and similar topics, may have inspired some individuals to consider pursuing directions useful for AGI, when they might not have otherwise. This is why some people will say things like, "AI safety has been net negative and AGI safety discussions have shortened AGI timelines". I think there is some truth to the timelines argument, but it’s not clear it has been net negative, in my opinion. There's a point at which AI Safety work must be done and investment must be made in AGI safety.

One concern I’d like to bring up as a point of discussion is whether infohazard policies could backfire. By withholding certain insights, these policies may leave safety researchers in the dark about the field's trajectory, while capability researchers are engaged in active discussions. Some of us were aware that AgentGPT-like models were likely coming soon (though unsure of the exact date), but they seem to have blindsided a lot of people concerned about alignment. It’s possible that safety researchers could be blindsided again by rapid developments they were not privy to due to infohazard policies.

This may have been manageable when progress was slower, but now, with global attention on AI, it may lead to some infohazard policies backfiring, particularly due to alignment people not being able to react as quickly as they should. I think most of the balance favours keeping infohazard policies as is for now, but this was a thought I had earlier this week and figured I would share.

This was a good post, and shifted my view slightly on accelerating vs halting AI capabilities progress.

I was confused by your "overhang" argument all the way until footnote 9, but I think I have the gist. You're saying that even if absolute progress in capabilities increases as a result of earlier investment, progress relative to safety will be slower.

A key assumption seems to be that we are not expecting doom immediately; i.e., that the next major jump in capabilities is very unlikely to kill us all via misaligned AI. I'm not sure I buy this assumption fully; immediate doom seems to have non-negligible probability to me, and that seems relevant to the wisdom of endorsing faster progress in capabilities.

But if we assume the next jump in capabilities, or the next low-hanging fruit plucked by investment, won't be the beginning of the end...then it does sorta make sense that accelerating capabilities in the short run might accelerate safety and policy enough to compensate. 

There is no constraint towards specifying measurable goals of the kind that lead to reward-hacking concerns.

I'm not sure that reward-hacking in LM agent systems is inevitable, but it seems at least plausible that reward hacking could occur in such systems without further precautions. 

For example, if oversight is implemented via an overseer LLM agent O which gives scores for proposed actions by another agent A, then A might end up adversarially optimizing against O if A is set up for a high success rate (high rate of actions accepted).
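The failure mode described here can be simulated in a few lines. This is a toy model with invented quantities: `true_quality` is the (unobserved) real value of an action, and the overseer's score is a noisy proxy of it; selecting hard on the proxy (best-of-n) drives proxy score up without driving true quality up.

```python
import random

# Toy illustration of an agent A optimizing against an overseer O:
# O's score is an imperfect proxy for true quality, and best-of-n
# selection pushes optimization pressure into O's evaluation noise.

random.seed(0)

def true_quality(action: int) -> float:
    return -abs(action)  # actions near 0 are genuinely best

def overseer_score(action: int) -> float:
    # O's judgment = true quality + evaluation noise (O's blind spots).
    return true_quality(action) + random.gauss(0, 5)

def best_of_n(n: int) -> int:
    # A proposes n candidate actions; the one O rates highest is kept.
    candidates = [random.randint(-10, 10) for _ in range(n)]
    return max(candidates, key=overseer_score)
```

As `n` grows, the selected action's overseer score keeps climbing while its true quality plateaus: the agent is increasingly selecting for overseer mistakes rather than for good actions, which is the adversarial dynamic the comment worries about.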

(I agree very much with the general point of the post, though)

My primary safety concern is what happens if one of these analyses somehow leads to a large improvement over the state of the art.  I don't know what form this would take and it might be unexpected given the Bitter Lesson you cite above, but if it happens, what do we do then?  Given this is hypothetical and the next large improvement in LMs could come elsewhere, I'm not suggesting we stop sharing now.  But I think we should be prepared that there might be a point in time where we need to acknowledge such sharing leads to significantly stronger models and thus should re-evaluate sharing such eval work.

As one specific example: has RLHF, which the post below suggests was initially intended for safety, been a net negative for AI safety?

https://www.alignmentforum.org/posts/LqRD7sNcpkA9cmXLv/open-problems-and-fundamental-limitations-of-rlhf