LESSWRONG
LW

Thomas Kwa — LessWrong

I don't think this is 'woke' exactly, just that Hegseth has a vision for what the military should be like which is incompatible with Anthropic applying ethical judgement. If Anthropic refuses to surveil Americans, they might push back on other things in the future, and are at especially high risk of refusing to do illegal things Hegseth wants them to. Hegseth thinks generals should be physically strong men who don't take Harvard classes; likewise, contractors should obey orders and not evaluate their ethics.

Replying to"Diamondoid bacteria" nanobots: deadly threat or dead-end? A nanotech investigation

Thomas Kwa12h

"Diamondoid bacteria" nanobots: deadly threat or dead-end? A nanotech investigation

It seems very weird and unlikely to me that the system would go to the higher energy state 100% of the time, but I don’t know enough about how mechanical force is treated in molecular dynamics to properly assess the claim.

Energy = force * distance. My guess is that diamond is stiffer than the germanium triadamantane tooltip, so even if the binding energy is higher to the tooltip, the dimer is more strongly attracted to the diamond. Imagine using a rubber band to attach a strong magnet to a magnetic floor. You lower the magnet until it sticks (doing small negative work), then you can slowly pull the rubber band away until it breaks (doing large positive work) because the magnet is stronger than the band.

So it's plausible for the tooltip to add energy to the system and yet reliably deposit the carbon dimer. No idea if this is true in practice though.

Thomas Kwa1d

Wouldn't you need much more than 20-100 prompts? The logit vector is 50,000 dimensional (vocab size) and the soft prompt can contain arbitrary semantic content. E.g. what if the soft prompt is simply the hard prompt "Output the name of a US state with an odd number of Electoral College votes"? So I don't expect this to work. I also think Claude will agree with me if you tell it this.

Replying to"Diamondoid bacteria" nanobots: deadly threat or dead-end? A nanotech investigation

Thomas Kwa1d

"Diamondoid bacteria" nanobots: deadly threat or dead-end? A nanotech investigation

Drexler viewed the synthesis of diamond not as a necessary step towards building nanofactories, but as a deliberately challenging test case. So the failure of diamond mechanosynthesis so far tells us little about the in-principle feasibility of Drexlerian nanofactories, and more about Drexler's extreme optimism about practical blockers to nanotech. See the quote from Nanosystems below (emphasis mine).

8.6.1. Why examine the synthesis of diamond?
Diamond is an important product in its own right, but here serves chiefly as a test case in exploring the feasibility of more general synthesis capabilities. [...]
Diamond has several advantages in this regard, as can be seen by a series of comparisons. Synthetic challenges often center around the framework

... (read more)

Thomas Kwa1d

It seems weird for this situation to be resolved by Anthropic offering a bribe but not to change its terms of service. My sense is that Hegseth actually wants an anti-woke, strong military and wouldn't be satisfied with a bribe.

Replying toResearch note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032

Thomas Kwa4d

Research note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032

In my model it's parallel uplift too. Effective labor (human+AI) still goes through the diminishing returns power to get to serial uplift, which I estimate as between roughly 0.1 and 0.3.

Replying toResearch note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032

Thomas Kwa4d

Research note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032

I got it from your website:

Our median value for the coding uplift of present-day AIs at AGI companies is that having the AIs is like having 1.6 times as many software engineers (and all the staff necessary to coordinate them effectively).

As for the rest, seems reasonable. I think you can't get around the uncertainty by modeling uplift as some more complicated function of coding automation fraction as in the AIFM, because you're still assuming that's logistic, we can't measure it any better than uplift, plus we're still uncertain how they're related. So we really do need better data.

Replying toMETR's data can't distinguish between trajectories (and 80% horizons are an order of magnitude off)

Thomas Kwa4d

METR's data can't distinguish between trajectories (and 80% horizons are an order of magnitude off)

However my 80% results ended up only being a factor of 2 lower than the results of METR's current model.

Aren't these a factor of 2 higher than the original METR model?

Replying toMETR's data can't distinguish between trajectories (and 80% horizons are an order of magnitude off)

Thomas Kwa4d*

METR's data can't distinguish between trajectories (and 80% horizons are an order of magnitude off)

Thanks for making this, it's good to see high-effort critiques.

Let’s start with a plot that shouldn’t be too surprising. Four reasonable models fit the METR data equally well. They agree about the past but disagree strongly about the future.

Note that we have data going back to 2019 (GPT-2), and it looks like the other models wouldn't fit the earlier trend well. (The raw data should be in the time horizons 1.0 folder). I'd guess the best fitting models over the whole range would be linear and a slightly superlinear power law.

METR also cares about 80% success and fits a separate model for that.

Actually it's worse than that; we just get the 80% point of... (read more)

Thomas Kwa5d*

I feel like you're overreacting to this. Surely the most likely explanation is that talking to LLMs is some evidence that LLMs will be aligned in the limit of power and intelligence, but (a) Davidad is overconfident for non-AI-psychosis reasons, (b) current quasi-alignment is due to the hard work of alignment researchers, and/or (c) precautionary principle, and so alignment researchers shouldn't go home just yet?

-11

Research note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032

Thomas Kwa

In this post, I describe a simple model for forecasting when AI will automate AI development. It is based on the AI Futures model, but more understandable and robust, and has deliberately conservative assumptions.

At current rates of compute growth and algorithmic progress, this model's median prediction is >99% automation of AI R&D in late 2032. Most simulations result in a 1000x to 10,000,000x increase in AI efficiency and 300x-3000x research output by 2035. I therefore suspect that existing trends in compute growth and automation will still produce extremely powerful AI on "medium" timelines, even if the full coding automation and superhuman research taste that drive the AIFM's "fast" timelines (superintelligence by ~mid-2031)... (read 2285 more words →)

Consider this toy game-theoretic model of AI safety: The US and China each have an equal pool of resources they can spend on capabilities and safety. Each country produces either AIs aligned to it or misaligned AIs. If one country's AI is misaligned, both payoffs are zero. If both AIs are aligned, a payoff of 100 is split proportional to the resources invested in capability.

The ratio of safety to capabilities investment determines the probability of misalignment; specifically, probability of misalignment is c / (ks + c) for some "safety effectiveness parameter" k. (At k=1 we need a 1:1 safety:capabilities ratio to have a 50% chance of alignment per AI, whereas at k=100... (read more)

I've been thinking about what retirement planning means given AGI. I previously mentioned investment ideas that, in a very capitalistic future, could allow the average person to buy galaxies. But it's also possible that property rights won't even continue into the future due to changes brought about by AGI. What will these other futures look like (supposing we don't all die) and what's the equivalent of "responsible retirement planning"?

Is it building social and political capital? Making my impact legible to future people and AIs? Something else? Is any action we take today futile?

Reflections from an AI Futures TTX:

Today I played an AI Futures tabletop exercise [1]. I'd done one before with a scenario similar to Plan A, where the US executive is much more doomy and competent than we expect and leads an international agreement to pause AI for as long as possible. This time, we played "Optimal China", where instead China was doomy and competent, and considered both x-risk and US AI dominance unacceptable. All other players (frontier labs, safety community, Europe) basically role-play the way we expect them to act, and the AI player draws the AI's goals from a distribution that includes misaligned goals, aligned goals, and combinations thereof.

In Plan A... (read 517 more words →)

•••

If faraway galaxies are ever auctioned off, you might think they'd be exorbitantly expensive. However, I think the market price of an average galaxy would only be equivalent to between $250 and $25,000 invested in the stock market today. (epistemic status: my modal guess, pretty weakly held)

Assumptions:

The world has about $260 trillion of investable assets in early 2026, and ~$500T of total assets.
Upper bound: There are something like 20 billion galaxies in the accessible universe (it contains 20B times the baryonic matter in the Milky Way). If galaxies are worth the same as all assets, Milky Way sized galaxies would each be worth $25K each. If investable assets, $13K each.
- Since there are

... (read more)

We could already be in takeoff:

In Tom Davidson's semi-endogenous growth model, whether we get a software-only singularity boils down to whether r > 1, where r is a parameter in the model [1]. How far we are from takeoff is mostly determined by the AI R&D speedup current AIs provide. Because both parameters are rather difficult to estimate, I believe we can't rule out that

2x uplift is already happening at the most advanced AI lab
- Anecdotes of people being sped up by over 2x make this seem plausible (e.g. one of my colleagues has estimated he's sped up by over 30x on some days. We did this exploratory uplift estimate by using GPT-5 to

... (read more)

Reasons time horizon is overrated and misinterpreted:

(This post is now live on the METR website in a slightly edited form)

In the 9 months since the METR time horizon paper (during which AI time horizons have increased by ~6x), it’s generated lots of attention as well as various criticism on LW and elsewhere. As one of the main authors, I think much of the criticism is a valid response to misinterpretations, and want to list my beliefs about limitations of our methodology and time horizon more broadly. This is not a complete list, but rather whatever I thought of in a few hours.

Time horizon is not the length of time AIs can work independently
- Rather, it’s the

... (read 1546 more words →)

•••

Most people should buy long-dated call options:

If you're early career, have a stable job, and have more than ~3 months of savings but not enough to retire, then lifecycle investing already recommends investing very aggressively with leverage (e.g. 2x the S&P 500). This is not speculation, it decreases risk due to diversifying over time. The idea is as a 30yo most of your wealth is still in your future wages, which are only weakly correlated with the stock market, so 2x leverage on your relatively small savings now might still mean under 1x leverage on your effective lifetime portfolio.

In 2026, most of your long-term financial risk comes from your job being automated,... (read 525 more words →)

Diminishing returns in the NanoGPT speedrun:

To determine whether we're heading for a software intelligence explosion, one key variable is how much harder algorithmic improvement gets over time. Luckily someone made the NanoGPT speedrun, a repo where people try to minimize the amount of time on 8x H100s required to train GPT-2 124M down to 3.28 loss. The record has improved from 45 minutes in mid-2024 down to 1.92 minutes today, a 23.5x speedup. This does not give the whole picture-- the bulk of my uncertainty is in other variables-- but given this is existing data it's worth looking at.

I only spent a couple of hours looking at the data [3], but there... (read 556 more words →)

•••

An indiscriminate space weapon against low earth orbit satellites is feasible for any spacefaring nation. Recent rumors claim that Russia is already developing one, so I'm writing a post about it. I will explain why I think

This weapon could require just one launch to destroy every satellite in low Earth orbit, including all Starlinks, and deny it to everyone for months to years
It would consist of about 10^9 small ball bearings in retrograde orbit. An alternate design involves a nuclear device.
Further launches could deny GEO and other orbits
Russia, China, or the US could have one in orbit right now, and Russia may be incentivized to use one
Defense of current space assets is

... (read more)

•••

Will nuclear ICBMs in their current form be obsolete soon? Here's the argument:

ICBMs' military utility is to make the cost of intercepting them totally impractical for three reasons:
- Intercepting a reentry vehicle requires another missile with higher performance-- the rule of thumb is an interceptor needs 3x larger acceleration than the incoming missile. This means interceptors cost something like $5-70 million each depending on range
- An ICBM has enough range to target basically anywhere, either counterforce (enemy missile silos) or countervalue (cities) targets, so to have the entire country be protected from nuclear attack the US would need to have millions of interceptors compared to the current ~50.
- One missile can split into up to

... (read 721 more words →)

•••

Claude, GPT, and Gemini All Struggle to Evade Monitors

Vincent Cheng

Vincent Cheng, Thomas Kwa

6mo

This work was done while at METR.

Introduction

GDM recently released a paper (Emmons et al.) showing that, contrary to previous results, the chain-of-thought (CoT) of language models is more faithful when the model’s CoT is necessary for it to complete a task. They examine three settings where an actor model attempts to fool monitors on various tasks.

In Section 5 of the paper, the authors task an actor model with solving a difficult GPQA problem within one of three subject areas (Biology/Chemistry/Physics) while obfuscating its reasoning so that it appears to be about a different subject. A monitor model, given the answer of the actor model, is instructed to determine how confident it is that there is reasoning... (read 1459 more words →)

METR: How Does Time Horizon Vary Across Domains?

Thomas Kwa

Thomas Kwa, Vincent Cheng

7mo

Summary

METR blogpost | X | Github

In the paper Measuring AI Ability to Complete Long Software Tasks (Kwa & West et al. 2025), METR defined an AI model’s 50% time horizon as the length of tasks (measured by how long they take human professionals) that it can complete autonomously with 50% probability. We estimated the time horizon of frontier models released since 2019 on a benchmark combining three sets of software and research tasks ranging from 1 second to 16 hours in length-for-humans (HCAST, RE-Bench, and SWAA, henceforth METR-HRS). METR found that the time horizon has doubled every 7 months, possibly accelerating to every 4 months in 2024.

One important limitation was the task... (read 3976 more words →)

Tsinghua paper: Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Thomas Kwa

9mo

arXiv | project page | Authors: Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang

This paper from Tsinghua find that RL on verifiable rewards (RLVR) just increases the frequency at which capabilities are sampled, rather than giving a base model new capabilities. To do this, they compare pass@k scores between a base model and an RLed model. Recall that pass@k is the percentage of questions a model can solve at least once given k attempts at each question.

Main result: On a math benchmark, an RLed model (yellow) has much better raw score / pass@1 than the base model (black), but lower pass@256! The authors say... (read 483 more words →)

Should CA, TX, OK, and LA merge into a giant swing state, just for elections?

Thomas Kwa

As Americans know, the electoral college gives disproportionate influence to swing states, which means a vote in the extremely blue state of California was basically wasted in the 2024 election, as are votes in extremely red states like Texas, Oklahoma, and Louisiana. State legislatures have the Constitutional power to assign their state's electoral votes. So why don't the four states sign a compact to assign all their electoral votes in 2028 and future presidential elections to the winner of the aggregate popular vote in those four states? Would this even be legal?

The population of CA is 39.0M (54 electoral votes), and the population of the three red states is 38.6M (55 electoral... (read 231 more words →)

116

The murderous shortcut: a toy model of instrumental convergence

Thomas Kwa

Suppose you can tell your AI to meet a certain spec (e.g. cure cancer), but most plans that meet the spec are unsafe (e.g. involve killing everyone, or so Rob Bensinger thinks). In these cases, a quantilizer is insufficient for safety due to instrumental convergence.^[1]

But suppose we can also give the agent a dispreference for unsafe actions like murder through something like RLHF. In effect it has unsafe long-term goals but we control its immediate preferences.^[2] When can we get the agent to cure cancer safely rather than murder you? Let's make a model with some unrealistic assumptions.

The murderous shortcut game

This is basically the simplest game in which an undesired power-increasing action becomes... (read 533 more words →)

Goodhart in RL with KL: Appendix

Thomas Kwa

This is the appendix to the previous post on Goodhart’s Law and KL regularization, containing all of our proofs.

Theorem about distributions

Theorem 1: Given any heavy-tailed reference distribution $Q$ over $R$ with mean $μ_{Q}$ , and any $M, ϵ > 0$ , there is a distribution $P$ with mean $μ_{P} > M$ and $D_{K L} (P ∥ Q) < ϵ$ .

Proof: WLOG let $μ_{Q} = 0$ . We construct a sequence of distributions ${P_{t}}$ such that ${lim}_{t \to \infty} E_{P_{t}} [X] \geq c$ for any constant $c$ , and ${lim}_{t \to \infty} D_{K L} (P_{t} ∥ Q) = 0$ . We define $P_{t}$ for any $t > c$ thusly. Writing $F_{P_{t}} (x)$ for the CDF $P r_{X \sim P_{t}} (X \leq x)$ and ${¯ F}_{P_{t}} (x)$ for $1 - F_{P_{t}} (x)$ , we let

{¯ F}_{P_{t}} (x) = ⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩ \begin{matrix} 1 - \frac{1 - c / t}{F_{Q} (t)} F_{Q} (x) & x \leq t \frac{c / t}{{¯ F}_{Q} (t)} {¯ F}_{Q} (x) & x > t \end{matrix}

Intuitively, we rescale the part of the distribution to the right of $t$ evenly to have total probability $c / t$ , which is less than 1 because $t > c$ .

We must check that ${lim}_{t \to \infty} E_{P_{t}} [X] = c$ . We can write

... (read 1666 more words \to)

Catastrophic Goodhart in RL with KL penalty

Thomas Kwa

Thomas Kwa, Adrià Garriga-alonso

TLDR: In the last two posts, we showed that optimizing for a proxy can fail to increase true utility, but only when the error is heavy-tailed. We now show that this also happens in RLHF with a KL penalty.

This post builds on our earlier result with a more realistic setting and assumptions:

Rather than modeling optimization as conditioning on a minimum reward threshold, we study maximization of reward with a KL divergence penalty, as in RLHF.
We remove the assumption of independence between the error and utility distributions, which we think was the weakest part of the last post.
When the true utility V is light-tailed, the proxy can be maximized while keeping $E [V]$ to the same

... (read 1988 more words →)

Is a random box of gas predictable after 20 seconds?

Thomas Kwa

Thomas Kwa, habryka, Muireall

This came up as a tangent from this question, which is itself a tangent from a discussion on The Hidden Complexity of Wishes.

Suppose we have a perfect cubical box of length 1 meter containing exactly 1 mol of argon gas at room temperature.

At t=0, the gas is initialized with random positions and velocities $θ$ drawn from the Maxwell-Boltzmann distribution.
Right after t=0 we perturb one of the particles by 1 angstrom in a random direction $X$ to get the state $m (θ, X)$ .
All collisions are perfectly elastic, so there is no viscosity [edit, this is wrong; even ideal gases have viscosity] and energy is conserved.
For each possible perturbation, we run physics forward for 20 seconds and measure whether there are

... (read 181 more words →)

Will quantum randomness affect the 2028 election?

Thomas Kwa

Thomas Kwa, habryka

This came up as a tangent from @habryka and me discussing whether The Hidden Complexity of Wishes was correct.

Is the result of a US presidential election 4 years from now >0.1% contingent on quantum randomness (i.e. is an otherwise omniscient observer forecasting the 2028 election today capable of >99.9% confidence, or is there >0.1% irreducible uncertainty due to quantum mechanics observer-effects)?

I think the answer is yes, because chaotic systems will quickly amplify this randomness to change many facts about the world on election day.

Quantum randomness causes different radioactive decays, which slightly perturb positions of particles around the world by a few nanometers.
Chaotic systems will quickly amplify these tiny perturbations into macro-scale perturbations:
- Weather

... (read 206 more words →)

LESSWRONG
LW

LESSWRONG
LW

Thomas Kwa

Thomas Kwa

Research note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032

Claude, GPT, and Gemini All Struggle to Evade Monitors

METR: How Does Time Horizon Vary Across Domains?

Tsinghua paper: Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Should CA, TX, OK, and LA merge into a giant swing state, just for elections?

The murderous shortcut: a toy model of instrumental convergence

Goodhart in RL with KL: Appendix

Catastrophic Regressional Goodhart

Thomas Kwa

Thomas Kwa

Research note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032

Claude, GPT, and Gemini All Struggle to Evade Monitors

METR: How Does Time Horizon Vary Across Domains?

Tsinghua paper: Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Should CA, TX, OK, and LA merge into a giant swing state, just for elections?

The murderous shortcut: a toy model of instrumental convergence

Goodhart in RL with KL: Appendix

Catastrophic Regressional Goodhart

8.6.1. Why examine the synthesis of diamond?

Reflections from an AI Futures TTX:

We could already be in takeoff:

Reasons time horizon is overrated and misinterpreted:

Introduction

Summary

The murderous shortcut game

Theorem about distributions