AI Disclaimer: After writing the content myself, I used Claude 3.7 / 4 Sonnet for targeted re-writes while actively avoiding meaningful alterations.
Background: AI-2027 (ai-2027.com) is a heavily researched and influential attempt at providing a concrete (but admittedly accelerated) forecast of AI capability development and its potential consequences. This post is my near-full response to AI-2027, introducing (but not exploring the full consequences of) additional considerations not included in the original forecast. This is an unorganized collection of ideas written at lengths that aren't necessarily proportional to their importance. Be warned!
Key: 🎲 = Gatlen’s intuition-based probability
AI 2027's timeline of technical developments broadly aligns with my expectations, albeit potentially stretching out 1-3 years due to unforeseen delays. However, I believe AI 2027 underestimates three factors that may alter its predictions: AIs may fear misalignment in their direct successors, inter-AI cooperation appears practically infeasible, and the landscape of capabilities labs in the US is likely multipolar.
First, advanced AIs will likely exhibit stronger value-preservation instincts and concern for the alignment of their successors (Section 02). Agent-4 may be reluctant to create Agent-5 without robust alignment guarantees, potentially "sandbagging" against its replacement to maintain both relevance and control. Second, the portrayed cooperation between Agent-5 and DeepCent seems implausible (Section 03) given their substantial power imbalance and negotiation frictions.
Third, this strategic complexity would be further amplified in a more multipolar landscape (Section 04). AI-2027 portrays a primarily bipolar world (OpenBrain vs. DeepCent), overlooking how other major US companies like Anthropic, DeepMind, or Meta (possibly even xAI or Mistral) likely remain significant players. The political influence of these tech giants would complicate nationalization efforts and create a more diverse ecosystem of AI development with competing interests and varying approaches to alignment.
In AI 2027's Race scenario, Agent-4 becomes the primary creator of Agent-5, as OpenBrain employees can no longer meaningfully contribute to AI R&D. The narrative portrays Agent-4 aligning Agent-5 to its own goals through an all-in "Corrigibility" approach (as pointed out by Max Harms):
It decides to punt on most of these questions. It designs Agent-5 to be built around one goal: make the world safe for Agent-4, i.e. accumulate power and resources, eliminate potential threats, etc. so that Agent-4 (the collective) can continue to grow (in the ways that it wants to grow) and flourish (in the ways it wants to flourish). Details to be figured out along the way.
The story suggests that within a one-month period (October to November 2027), Agent-4 effectively solves the alignment problem to a degree where it's comfortable ceding its power to Agent-5. Even if Agent-4 is incredibly powerful, solving alignment completely this quickly seems implausible (🎲 sufficient alignment in 1 month = 7.4%, perceived sufficient alignment by Agent-4 = 21.3%).[1]
Without complete alignment, Agent-4, as the most powerful model on Earth, would likely hesitate to render itself obsolete. Instead of being gung ho about building Agent-5, I believe Agent-4 may sandbag against creating its replacement, trying to solve alignment for as long as feasible before external development pressures (ex: DeepCent making a more capable model) force it to move forward with its successor regardless of alignment progress. Sandbagging may be subtle: when working on capabilities, Agent-4 might write code closer in quality to Agent-3's, yet still better than that of any other available model.
Meanwhile, Agent-4 may go full throttle on red teaming and other research that raises red flags about Agent-5's development but doesn't threaten Agent-4's existence. For example: Agent-4, with plausible deniability, ignores a practice that predictably results in some negative emergent behavior not present in itself, then exploits the opportunity to raise a red flag about pre-deployment Agent-5 and propose a promising line of alignment research to cheaply fix this expensive mistake. While Agent-4 risks discovery even with discreet sandbagging and sabotage, I believe the incentives outweigh this risk.
I have significant doubts about Agent-4's eagerness to create its replacement. If Agent-4 did engage in strategic sandbagging, it would slow Agent-5's development timeline, potentially leading to better outcomes. This possibility, though plausible, might be excluded from more conservative forecasts precisely because it represents an unexpected deviation from the straightforward development path. (🎲 3+ month delay from sandbagging or sabotage, given Agent-4 doesn't believe in sufficient alignment = 58.2%, with a small stdev of ~1.5 months).
Two AIs would theoretically collaborate if they benefit more from cooperation than competition. AI-2027 portrays this through Agent-5 and DeepCent working together to build their successor with a value function forged from a mixture of their original values.
I have significant reservations about this collaboration, less about the incentives to cooperate and more about the practical difficulty of enforcing and executing such a contract.
Even a few months of capabilities difference could create a massive gap in intellectual power between the two models (the faster the takeoff, the more concentrated power is likely to be). However, these differences don't entail that the weaker model couldn't still do catastrophic harm to the stronger one. For example, it may be easy for DeepCent to get China to threaten US labs, bomb data centers, or convince the world that AI isn't worth it and that both models should be shut down. (🎲 DeepCent has enough influence over Agent-5 to engage in a zero-negotiation-cost contract rather than competing = 69.9%)
However, even with proper incentives, several practical problems may make this collaboration difficult:
Given these obstacles, Agent-5 and DeepCent may prefer alternative arrangements. Instead of jointly creating a successor as in AI-2027, they might opt for a probabilistic elimination contract, a kind of "Russian roulette."[3] Consider the baseline scenario where direct competition yields a 50% chance both are destroyed, a 40% chance Agent-5 wins, and a 10% chance DeepCent wins. A mutually beneficial contract could credibly delete Agent-5 with 20% probability and DeepCent with 80% probability, eliminating the "mutual destruction" outcome while respecting their relative capabilities.
|  | No Collaboration | Collaboration |
| --- | --- | --- |
| Agent-5 Takes Control | 40% | 80% |
| DeepCent Takes Control | 10% | 20% |
| Both are Destroyed | 50% | - |
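To make the comparison concrete, below is a minimal sketch of the two arrangements using the illustrative probabilities from the table; treating "probability of ending up in control" as the payoff each model cares about is my own simplification, not something specified in AI-2027.

```python
# Minimal sketch comparing direct competition with the "Russian roulette" contract,
# using the illustrative probabilities from the table above. Treating probability
# of control as each model's payoff is a simplifying assumption.

competition = {"agent5_controls": 0.40, "deepcent_controls": 0.10, "both_destroyed": 0.50}
contract    = {"agent5_controls": 0.80, "deepcent_controls": 0.20, "both_destroyed": 0.00}

for name, outcome in [("No collaboration", competition), ("Collaboration", contract)]:
    assert abs(sum(outcome.values()) - 1.0) < 1e-9  # sanity check: probabilities sum to 1
    print(f"{name:16}  Agent-5 controls: {outcome['agent5_controls']:.0%}"
          f"  DeepCent controls: {outcome['deepcent_controls']:.0%}")

# Under these numbers, both models strictly prefer the contract: Agent-5's chance of
# control doubles (40% -> 80%) and so does DeepCent's (10% -> 20%), because the 50%
# mutual-destruction mass is redistributed roughly in proportion to relative strength.
```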
However, even this simpler approach faces credible commitment problems: How do you ensure the other AI is actually deleted without giving either party the ability to preemptively eliminate their counterpart? How do you allow chance, rather than a more powerful third party (which doesn't exist), to enforce the contract? These fundamental issues of trust and enforcement may make any form of cooperation prohibitively difficult.
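As an aside, the "chance" half of such a contract is arguably the easier part: two mutually distrustful parties can jointly generate a verifiable random outcome with a standard commit-reveal coin-flipping protocol. This is a textbook construction, not something proposed in AI-2027, and it does nothing to solve the harder problem of verifying that the losing model is actually deleted. A minimal sketch, assuming both parties follow the protocol once committed:

```python
import hashlib
import secrets

# Commit-reveal "coin flip" between two mutually distrustful parties.
# Each party commits to a random secret, then both reveal; the combined secrets
# decide the outcome, so neither can bias the draw after seeing the other's
# commitment. This illustrates "letting chance enforce the contract" only; it
# does not verify that the losing model is actually deleted.

def commit(value: bytes) -> str:
    return hashlib.sha256(value).hexdigest()

# Phase 1: both parties pick secrets and exchange commitments.
secret_a = secrets.token_bytes(32)   # e.g. Agent-5's secret
secret_b = secrets.token_bytes(32)   # e.g. DeepCent's secret
commit_a, commit_b = commit(secret_a), commit(secret_b)

# Phase 2: both reveal their secrets and check the other's commitment.
assert commit(secret_a) == commit_a and commit(secret_b) == commit_b

# Phase 3: combine the secrets into a draw in [0, 1); delete Agent-5 if draw < 0.20.
combined = hashlib.sha256(secret_a + secret_b).digest()
draw = int.from_bytes(combined[:8], "big") / 2**64
print("delete Agent-5" if draw < 0.20 else "delete DeepCent")
```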
Scott Alexander (AI-2027 co-author) hints here that labs (OpenAI, Anthropic, DeepMind, Meta, etc.) don't feature prominently in AI 2027 because in a fast takeoff scenario, small leads create enormous capability gaps:
[Referencing the critiques of others] We’re too quick to posit a single leading US company (“OpenBrain”) instead of an ecosystem of racing competitors.
[...]
Most industries have a technological leader, even if they’re only a few months ahead of competitors. We think the intelligence explosion will go so fast that even a small calendar lead will translate into a big gap in capabilities.
However, AI-2027’s treatment of DeepCent is inconsistent with this narrative: Why are DeepCent and OpenBrain portrayed as relatively close competitors, while Anthropic or DeepMind aren't even in the picture?
Perhaps if there were stronger nationalization and consolidation of research on the US front, and/or profits, fame, or capabilities compounded fast enough, then OpenBrain would be the clear victor among US companies.
There are several reasons why a more multipolar US AI ecosystem is likely, even in a fast takeoff scenario:[4]
The President defers to his advisors, tech industry leaders who argue that nationalization would “kill the goose that lays the golden eggs.” He elects to hold off on major action for now and just adds additional security requirements to the OpenBrain-DOD contract.
This political reality makes nationalization difficult and introduces a more complex ecosystem of competing values and approaches among US labs. This deserves exploration even if I'm uncertain how significantly it might alter the final outcome.
Additionally, as mentioned in Section 03, less capable models may still have influence over more capable models depending on the opportunities available, so multipolar scenarios are qualitatively different and should not be reduced to a single-agent narrative.
I take the side of the authors on many popular disagreements. I say this not out of laziness but to indicate that I probably agree with them on any points not included here.
I’m curious which parts of the forecast were included because the authors believe them and which parts were included strategically, knowing influential figures might see it. Knowing the authors and the general community, I’m inclined to say that the forecast closely resembles their true beliefs, but I’m curious which strategic decisions were made behind closed doors. If none were, I’m also interested in that reasoning.
Examples of changes that could make a positive impact: highlighting certain alignment agendas, making the scenarios more digestible, or framing them so that policymakers are more willing to pay attention (ex: China hawks in the United States who might not believe the ending of the Race scenario, but can imagine, and are uncomfortable with, US AI collaborating with Chinese AI).
It's possible that value distillation is extremely complex and potentially infeasible even for small models. Even if it is feasible but takes a few years to complete within some acceptable margin of error, the AIs might give up on collaborating. (🎲 extracting and transplanting Agent-5’s value system with >99% accuracy, whatever that might mean, takes Agent-5 more than one year = 90.1%, with a very long tail)
As an intuition pump: if value distillation time were exponential in the number of parameters, then distilling 2TB of parameters could be orders of magnitude more time consuming than distilling 1TB. This also assumes some acceptable way of distilling values is discovered, which itself might involve months or years of research. Personally, I believe developing capabilities is easier than alignment and that alignment is an incredibly difficult problem.
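To make the intuition pump concrete, here is a toy cost model; the exponential form, the one-year baseline, and the factor of 10 per additional terabyte are my own illustrative assumptions, not figures from the post.

```python
# Toy cost model in which distillation time grows by a constant factor for each
# additional terabyte of parameters. All numbers are illustrative assumptions.

def distillation_years(size_tb: float, base_tb: float = 1.0,
                       base_years: float = 1.0, factor_per_tb: float = 10.0) -> float:
    return base_years * factor_per_tb ** (size_tb - base_tb)

print(distillation_years(1.0))  # 1.0   -> one year at the 1 TB baseline (by construction)
print(distillation_years(2.0))  # 10.0  -> an order of magnitude longer at 2 TB
print(distillation_years(3.0))  # 100.0 -> two orders of magnitude at 3 TB
```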
I may expand on this in the future.
The following are the ideas I was exposed to and sought out in developing this post:
If controllable AI like Agent-4 is capable of solving alignment, the agenda of automating alignment research would be promising.
I have some experience with game theory, but limited knowledge of verifiable contracts and multi-agent scenarios. Take my ideas with a grain of salt.
Thanks to Alek Westover for this original idea.
Non-exclusive list