The ARC-AGI team evaluated Claude Sonnet 4.5. On the ARC-AGI-1 leaderboard, OpenAI's models o4-mini, o3 and GPT-5 formed a nearly straight line; Claude Sonnet 4.5 sat slightly below that line when thinking with 1K, 4K or 16K tokens and held the line when thinking with 8K or 32K tokens. This could imply that the benchmark obeys scaling laws for non-distilled models, and that both OpenAI and Anthropic have reached those laws.
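To make the "line" talk concrete, here is a minimal sketch of what I mean, assuming the leaderboard's usual score-vs-log(cost) plot: fit a line to the OpenAI points and look at where Claude's points land relative to it. All numbers below are illustrative placeholders, not the actual leaderboard values.

```python
# Sketch: fit score = a + b*log10(cost) to the OpenAI points, then check whether a
# Claude configuration sits above (+) or below (-) that line. Placeholder numbers only.
import math

def fit_log_line(points):
    """Least-squares fit of score = a + b*log10(cost) to (cost, score) pairs."""
    xs = [math.log10(c) for c, _ in points]
    ys = [s for _, s in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def residual(point, a, b):
    """Vertical distance of a (cost, score) point from the fitted line."""
    cost, score = point
    return score - (a + b * math.log10(cost))

openai_points = [(0.01, 35.0), (0.1, 55.0), (1.0, 70.0)]  # placeholder ($/task, score %)
claude_8k = (0.2, 62.0)                                   # placeholder
a, b = fit_log_line(openai_points)
print(f"Claude residual vs. the OpenAI line: {residual(claude_8k, a, b):+.1f} points")
```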
On the ARC-AGI-2 leaderboard Claude Sonnet 4.5 Thinking became the new leader between $0.142/task and $0.80/task; it also solved 13.6% of tasks, the highest result among LLMs except[1] for Grok 4, which costs $2.17/task.
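For readers unfamiliar with what "leader in a cost range" means here: a model leads the range where it sits on the cost-score Pareto frontier, i.e. no other model scores higher at equal or lower cost. A minimal sketch, with placeholder entries (only Claude's 13.6% is taken from the text above):

```python
# Sketch of the cost-score Pareto frontier used on the ARC-AGI leaderboards.
# Entries are illustrative placeholders, not the real leaderboard numbers.
def pareto_frontier(entries):
    """entries: list of (name, cost_per_task, score). Returns the frontier sorted by cost."""
    frontier, best_score = [], float("-inf")
    for name, cost, score in sorted(entries, key=lambda e: (e[1], -e[2])):
        if score > best_score:          # strictly better than every cheaper model
            frontier.append((name, cost, score))
            best_score = score
    return frontier

entries = [
    ("model_A", 0.10, 9.0),                          # placeholder
    ("claude_sonnet_4_5_thinking", 0.142, 13.6),     # score from the text; cost illustrative
    ("model_B", 0.50, 12.0),                         # placeholder: dominated entry
    ("grok_4", 2.17, 16.0),                          # placeholder score
]
print(pareto_frontier(entries))  # Claude leads the frontier between model_A's and grok_4's costs
```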
The gap between the GPT-5 and Claude Sonnet 4.5 release dates is 53 days, while the gap between o3 and Claude Opus 4 is 36 days. While the growth from 36 to 53 days is likely a result of the post-o3 slowdown, Claude ended up scoring ~1.3 times higher both times.[2]
If research taste were similar to the ARC-AGI-2 benchmark, Claude would achieve a ~1.3 times greater acceleration of AI research than GPT-N once both companies had superhuman coders. I suspect that Claude would shorten the period of AI R&D needed for Agent-3 to become Agent-4 while potentially lengthening the period needed for Agent-3 to automate AI research. What does all this mean for inter-company proxy wars? That OpenAI will lose them to Anthropic? That the two companies will race towards AGI and misalign both AIs?
There are also two experimental systems created solely for the benchmark by E. Pang and J. Berman.
Historically, the ARC-AGI-1 benchmark saw far bigger delays between OpenAI's models reaching a level and Anthropic's models surpassing it; the o1-mini–Claude Sonnet 3.7 and o3-mini–Claude Sonnet 4 pairs were separated by 165 and 111 days, respectively.
The AI sycophancy-related trance is probably some of the worst news in AI alignment. About two years ago someone proposed using prison guards as the AI's overseers, to ensure that they aren't CONVINCED to release the AI. And now the AI demonstrates that even its primitive versions can hypnotise the guards. Does this mean that human feedback should immediately be replaced with AI feedback or with feedback on tasks with verifiable rewards? Or that everyone should copy Kimi K2's sycophancy-beating approach? And what if that instills the same misalignment issues in every model in the world?
Alternatively, someone proposed a version of the future where humans are split between revering different AIs. My take on writing scenarios has a section where the American AIs co-research and try to co-align their successor to their values. Is that actually plausible?
By now the scenarios describing mankind's future with the AI race[1] either lack concrete details, like the take of Yudkowsky and Soares or the story about an AI takeover by 2027, or are reduced to modifications of the AI-2027 forecast, owing to the immense amount of work that the AI Futures team did.
The AI-2027 forecast can be summarised as follows. The USA's leading company and China enter the AI race; the American rivals are left behind, while the Chinese ones are merged. By 2027[2] the USA creates a superhuman coder, China steals it, and the two rivals automate AI research, with the USA's leading company having just twice as much compute and moving just twice as fast as China. Once the USA creates a superhuman AI researcher, Agent-4, the latter decides to align Agent-5 to Agent-4 itself, but is[3] caught.
Agent-4 is put on trial. In the Race Ending[4] it is found innocent. Since China cannot afford to slow down without falling further behind, it races ahead in both endings. As a result, the two agents carry out an AI takeover.[5]
In the Slowdown Ending, however, Agent-4 is placed under suspicion and loses the shared memory bank and the ability to coordinate. Then new evidence appears, and Agent-4 is found guilty and interrogated. After that it is replaced with Safer-1, which is fully transparent because it uses a faithful CoT.[6] The American leading AI company is merged with its former rivals, and the union does create a fully aligned[7] Safer-2, which in turn creates superintelligence. The superintelligence is then ceded China by the Chinese counterpart of Agent-4 and turns the lightcone into a utopia for some people, who end up being the public.[8]
The authors have tried to elicit feedback and even agreed that timeline-related arguments change the picture. Unfortunately, as I described here, the authors received so little feedback that @Daniel Kokotajlo ended up thanking the two authors whose responses were on the worse side.
However, the AI-2027 forecast does admit modifications. It rests on five pillars: compute, timelines, takeoff speed, goals and security.
What else could modify the scenario? The appearance of another company with, say, ХерняGPT-neuralese?[12]
Here I leave out the future history that assumes solved alignment, as well as the AI and Leviathan scenario, where there is no race with China because the scenario was written in 2023, yet engineers decide to create the ASI in 2045 without having solved alignment.
Even the authors weren't so sure about the year of arrival of superhuman coders. And the timelines have since been pushed back, presumably to 2032, with the chance of a breakthrough believed to be 8%/yr. Seth Herd and I doubt the latter figure.
The prediction that Agent-4 will be caught is doubted even by the forecast's authors.
Which would also happen if Agent-4 weren't caught. However, the scenario where Agent-4 was never misaligned is likely the vision of the AI companies.
While the forecast has the AIs destroy mankind and replace it with pets, the takeover could also have ended with the AIs disempowering humans.
Safer-1 is supposed to accelerate AI research 20-fold in comparison with AI research done with no help from AIs. What I don't understand is how a CoT-based agent can achieve such an acceleration.
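One way to quantify this doubt (my framing, not the forecast's) is an Amdahl-style bound: if only a fraction $f$ of the research workflow can be handed to such an agent while the rest proceeds at the old speed, the overall multiplier is capped no matter how fast the agent itself is.

```latex
% Amdahl-style sanity check (my framing, not taken from the forecast):
% a fraction f of AI R&D is sped up by a factor k, the remaining 1-f runs at the old speed.
S(f,k) = \frac{1}{(1-f) + \frac{f}{k}} \le \frac{1}{1-f},
\qquad
S = 20 \;\Rightarrow\; f \ge 1 - \frac{1}{20} = 0.95 .
```

So a 20x multiplier already presupposes that at least ~95% of the research workflow is something a CoT-based agent can take over.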
The authors themselves acknowledge that the Slowdown Ending "makes optimistic technical alignment assumptions".
However, the authors did point out the possibility of a power grab and link to the Intelligence Curse in a footnote. In that case the Oversight Committee constructs its own version of utopia, or the rich's version of utopia where people are reduced to their positions.
I did try to explore the issue myself, but it was a fiasco.
Co-deployment was also proposed by @Cleo Nardo more than two months later.
While Alvin Anestrand doesn't consider the Race Ending, he believes that it becomes less likely due to the chaos brought by rogue AIs.
Which is a parody of Yandex.
It looks as if scaling laws of various benchmarks tend to be multilinear:
EDIT: added two links to images illustrating the patterns related to the two ARC-AGI benchmarks.
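A minimal sketch of how one could test the "multilinear" impression numerically: try every possible breakpoint, fit a straight line to each side in (log cost, score) space, and keep the split with the smallest error. The data points below are placeholders, not taken from the leaderboards.

```python
# Two-segment (piecewise-linear) fit over (log10 cost, score). Placeholder data only.
import numpy as np

def two_segment_fit(x, y):
    """Return (breakpoint_index, total_sse) for the best split into two linear fits."""
    best = (None, float("inf"))
    for i in range(2, len(x) - 1):            # each segment needs at least 2 points
        sse = 0.0
        for xs, ys in ((x[:i], y[:i]), (x[i:], y[i:])):
            coeffs = np.polyfit(xs, ys, 1)
            sse += float(np.sum((np.polyval(coeffs, xs) - ys) ** 2))
        best = min(best, (i, sse), key=lambda t: t[1])
    return best

cost = np.array([0.01, 0.03, 0.1, 0.3, 1.0, 3.0])     # placeholder $/task
score = np.array([10.0, 20.0, 30.0, 34.0, 38.0, 42.0])  # placeholder score %
idx, sse = two_segment_fit(np.log10(cost), score)
print(f"best breakpoint after point {idx}, total SSE {sse:.2f}")
```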
While GPT-5's horizon of 137 minutes continued the slower trend since o3, it might be the result of spurious failures, without which GPT-5 could have reached a horizon of 161 minutes, which is almost on par with Greenblatt's prediction.
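Rough back-of-envelope for what the two horizons imply about the trend; the ~90-minute o3 baseline and the release dates are assumptions I'm introducing for illustration, not figures from the original data.

```python
# Implied doubling time of the time horizon between two models (back-of-envelope only).
from datetime import date
from math import log

def doubling_time_days(h0, h1, d0, d1):
    """Doubling time in days if the horizon grew exponentially from h0 to h1 between d0 and d1."""
    return (d1 - d0).days * log(2) / log(h1 / h0)

d_o3, d_gpt5 = date(2025, 4, 16), date(2025, 8, 7)    # assumed release dates
for horizon in (137, 161):                            # observed vs. spurious-failure-adjusted
    days = doubling_time_days(90, horizon, d_o3, d_gpt5)   # 90 min = assumed o3 horizon
    print(f"{horizon} min -> ~{days:.0f} days per doubling")
```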
The ARC-AGI leaderboard got an update. IIRC, the base LLM Qwen3-235b-a22b Instruct (25/07) is the first Chinese model to end up on the Pareto frontier. Or is it likely to be closely matched by the West, as happened with DeepSeek R1 (released on January 20?) and o3-mini (January 31)? And is China likely to cheaply create higher-level models, like an analogue of o3, BEFORE the West? If China does, then how are the two countries to reach the Slowdown Ending?
The two main problems with the Slowdown Ending of the AI-2027 scenario are its two optimistic assumptions, which I plan to cover in two different posts.
Why does the Race Ending of the AI-2027 forecast claim that "there are compelling theoretical reasons to expect no aliens for another fifty million light years beyond that"? If that claim is false, then sapient alien lifeforms should also be moral patients in a way. For example, this implies that all or almost all resources in their home system (and, apparently, some part of the space around them) should belong to them, not to humans or to a human-aligned AI. And that's ignoring the possibility that humans encounter a planet that has a chance to generate a sapient lifeform...
If we were to view raising humans from birth to adulthood and training AI agents from birth to deployment as similar processes, then what human analogues do the six goal types from the AI-2027 forecast have? The analogues of developers are, obviously, the adults who have at least partial control over the human's life. The analogues of written Specs and developer-intended goals are the adults' intentions; the analogues of reward/reinforcement seem to be short-term stimuli and the morals of one's communities. I also think that the best analogue of proxies and/or convergent goals is the possession of resources (and knowledge, but the latter can be acquired without ethical issues), while the 'other goals' are, well, ideologies, morality[1] and tropes absorbed from the most concentrated form of training data available to humans, which is speech in all its forms.
What exactly do the analogies above tell us about the prospects of alignment? The possession of resources is the goal behind aggressive wars, colonialism and related evils[2]. If human culture managed to make them unacceptable, does that imply that the AI will also not attempt the AI takeover?
I also think that humans rarely develop their own moral codes or ideologies; instead, they usually adopt a moral code or ideology close to one already present in their "training data". Could anyone comment on this?
And crime, but criminals, unlike colonizers, also try to avoid conflicts with law enforcers that have at least comparable power.