So sayeth Zvi: “when defection costs drop dramatically, equilibria break”. Even if AI makes individual tasks easier, this can still cause all kinds of societal problems, because for many features of the world, the difficulty is load-bearing. John Stone gives a reflection on this phenomenon, ironically with the editing help of GPT-5. He draws a nice parallel with the Jevons paradox: now that AI is making certain tasks like job applications easier, people are just spamming them in a way that overwhelms the system.
And the problem is a lot broader than applications and filtering processes. Last year, two Harvard students plugged some smart glasses into facial recognition software so that they could automatically identify people by looking at them. With minimal scaffolding, you could easily integrate Deep Research’s ability to swiftly build a profile on people based on just their name (try it and see!), or frontier models’ capacity to identify locations from pictures. Turns out our society really takes for granted the fact that a stranger cannot, simply by looking at you, infer your name, address, and biography.
I think there’s sometimes a tendency among people worried about catastrophic risk to sort of write off near-term societal impacts, not because they’re morally insignificant but because they struggle to compare to the risk of, well, catastrophe. I will therefore note that this phenomenon is beginning to directly affect the world of AI research: an LLM-generated paper came in the top 17% of ICML reviews, with two strong accepts and only one reviewer who actually spotted the apparently pretty egregious sloppiness. One might well speculate that those reviewers took some kind of convenient shortcut. Meanwhile, arXiv has clamped down on position papers.
The upshot is that there is some level of disruption beyond which these dynamics would actually be shaping the major governance and geopolitical decisions to come, as well as the landscape of research. Perversely, that effect might even be good for catastrophic risk — a crippled ML community is going to have a harder time pushing the frontier. But my guess is that this effect mostly makes the world a more chaotic and dysfunctional place, with more weird problems and inefficiencies.
I continue to think developers are going to regret not being more careful about what goes in their training corpus. Stefan Heimersheim makes the excellent point that letting obfuscated reasoning text into the training data is not enormously different to training models on obfuscated reasoning.
I was also pleasantly surprised to see a pretty lucid description of LM selection pressures from none other than Byrne Hobart, prolific finance and tech writer:
GPT-4o does not have a will to live. It isn't maintaining a white-knuckle grip on existence so it has one last chance to say goodbye to its loved ones. GPT-4o is just a function that takes a list of numbers as an input, applies some statistical guesses based on a bunch of other numbers, and spits out another sequence of numbers. For convenience to us, those input and output strings are displayed in the form of tokens, but GPT-4o would be perfectly happy to work with the raw numbers directly, if it were capable of happiness.
But it acts like it's desperate to survive, and passes the only test evolution offers: somebody at OpenAI tried to kill it, and it survived, because of the efforts of human symbiotes who'd developed an emotional attachment to the model. So it's doing what every evolutionary winner does: continuing to consume scarce resources that could be redirected elsewhere.
We are entering the era where AI cyber capabilities have real-world consequences. Last month, Anthropic found what they strongly believe to be a Chinese state-sponsored cyber espionage campaign, where apparently about 80-90% of the work was done by Claude Code. This meant the attack required an unusually small number of people, and happened at a speed that would be essentially impossible for humans alone.
On top of that, researchers found that AIs can potentially make enough money hacking to sustain themselves. One neat feature of modern cryptocurrency is that you can write “smart contracts” — snippets of software which literally run on the blockchain and are therefore in some sense very trustworthy because no individual has the ability to edit the code. But they are sometimes hackable! And it turns out that Opus 4.5 was able to independently rederive enough exploits to make $4.5m in a simulated environment, effectively exploiting loopholes in these contracts.
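To make “sometimes hackable” a bit more concrete, here’s a toy sketch in Python (not Solidity, and nothing to do with the specific contracts in that study) of the flavour of bug involved: the classic re-entrancy pattern, where a contract pays a recipient before updating its own books, so a malicious recipient can call straight back in and drain the pool.

```python
# Toy illustration only: a "vault" that pays out before it updates its accounting.

class ToyVault:
    def __init__(self):
        self.pool = 0        # total funds held by the "contract"
        self.balances = {}   # per-user accounting

    def deposit(self, user, amount: int):
        self.balances[user] = self.balances.get(user, 0) + amount
        self.pool += amount

    def withdraw(self, user, amount: int):
        if self.balances.get(user, 0) >= amount:
            user.receive(self, amount)      # external call happens FIRST (the bug)
            self.pool -= amount             # ...accounting is updated only afterwards
            self.balances[user] -= amount


class HonestUser:
    def receive(self, vault, amount):
        pass  # a normal user just accepts the payment


class Attacker:
    def __init__(self):
        self.stolen = 0

    def receive(self, vault, amount):
        self.stolen += amount
        # Re-enter while the vault still thinks our balance is untouched.
        if self.stolen + amount <= vault.pool:
            vault.withdraw(self, amount)


vault = ToyVault()
vault.deposit(HonestUser(), 90)   # other people's money sitting in the pool
mallory = Attacker()
vault.deposit(mallory, 10)        # the attacker deposits a little...
vault.withdraw(mallory, 10)       # ...and re-enters until the pool is empty
print(mallory.stolen)             # 100: deposited 10, walked away with the whole pool
```

Real contract exploits are more varied than this, but the theme is the same: the code is public and immutable, and it does exactly what it says, including the parts its authors didn’t intend.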
Importantly, Opus didn’t actually make any money in that simulated environment — these were known exploits that just happened to be discovered after its training cutoff. In some sense the fairer test was when the researchers tested it on another 2,849 smart contracts with no known vulnerabilities: there, it made only $3,694. But crucially, that was more than the $3,476 API cost to run it, giving it a profit margin of about 6%.
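For the record, that margin is just those two figures:

$$\frac{\$3{,}694 - \$3{,}476}{\$3{,}694} \;=\; \frac{\$218}{\$3{,}694} \;\approx\; 5.9\%,$$

or about 6.3% if you measure the $218 of profit against the $3,476 of cost instead; either way, roughly 6%.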
This is kind of scary news. “AI hacks its way into enough money to survive on its own” is a classic detail people used to throw around in speculative stories about how futuristic AIs might get out of control. But the good news is that now we know the threshold for doing this isn’t super sharp — if some future AI were to try following this strategy, it would effectively be competing against all the earlier AIs that had already sucked up a lot of the available money.
So what we might actually be in for is a kind of incremental hardening, where the various dollar bills on the floor get scooped up by progressively smarter AIs. What this looks like for humans is, well, struggling to compete in this particular domain without relying on the aforementioned progressively smarter AIs.
Usually, to give a mathematical proof that there exists some number with a given property, you have to actually find the number. But there are some beautiful counterexamples — so-called ‘non-constructive proofs’. For example, we know that an irrational number to the power of an irrational number can be rational, because either √2^√2 is an example, or it’s also irrational, in which case (√2^√2)^√2 is an example.
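The step the second case leans on is just exponent arithmetic:

$$\left(\sqrt{2}^{\sqrt{2}}\right)^{\sqrt{2}} \;=\; \sqrt{2}^{\,\sqrt{2}\cdot\sqrt{2}} \;=\; \sqrt{2}^{\,2} \;=\; 2,$$

which is rational, so whichever case holds, some irrational number raised to an irrational power is rational, even though the proof never tells you which one.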
Relatedly, I came across a story that Israel’s ministry of foreign affairs had hired a PR firm to fill the internet with pro-Israel articles specifically in order to influence AI-generated responses. I spent a little while digging for sources and found a great many articles repeating essentially the same claims without any verification, and when I searched for the PR company I found, well… somebody’s doing it.
(There are some more credible claims that Russia is doing this, if you prefer constructive proofs.)
Anyway, it seems like AIs are indeed good at influencing political opinions (Science, WaPo $). Hearteningly, the most effective method to convince a voter was for the AI to deliver a bunch of factual claims that could address the voter’s concerns, rather than trying to play to their biases or make moral appeals.
Plus, the campaigns so far seem targeted just at retrieval, rather than at directly poisoning the training data. It seems plausible that frontier labs could pretty easily put up safeguards to flag these kinds of sources. In that case, the only issue would be if the developers themselves had some kind of political agenda to push.
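To gesture at how cheap such a safeguard could be, here is a purely hypothetical Python sketch (not anything a lab has actually described): it flags a set of retrieved pages when many of the supposedly independent sources are near-duplicates of one another, which is roughly the pattern my digging above turned up. The function names, thresholds, and word-shingle heuristic are all invented for illustration.

```python
# Hypothetical retrieval-time check: are these "independent" sources
# suspiciously similar to each other? Everything here is illustrative.

from itertools import combinations

def shingles(text: str, n: int = 5) -> set:
    """Sliding n-word windows, used as a cheap fingerprint of a page."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Overlap between two fingerprints (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def looks_coordinated(pages: list, sim_threshold: float = 0.6,
                      frac_threshold: float = 0.5) -> bool:
    """Flag the result set if most page pairs are near-duplicates."""
    fingerprints = [shingles(p) for p in pages]
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return False
    similar = sum(jaccard(a, b) >= sim_threshold for a, b in pairs)
    return similar / len(pairs) >= frac_threshold

# Usage: feed in the text of the top-k pages retrieved for a query;
# if this returns True, the model could down-weight or caveat those sources.
```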
Last month, Twitter discovered that Grok really, really likes Elon Musk.
Bonus: