AI #132 Part 1: Improved AI Detection

by Zvi
4th Sep 2025
Don't Worry About the Vase
38 min read

One result of going on vacation was that I wasn’t able to spin events off into focused posts this week, so I’m going to fall back on splitting the weekly instead, plus reserving a few subtopics for later posts, including AI craziness (the Tim Hua post on this is excellent), some new OpenAI largely policy-related shenanigans, and the continuing craziness of some people who should very much know better confidently saying that we are not going to hit AGI any time soon, plus some odds and ends including dead internet theory.

That still leaves tons of other stuff.

Table of Contents

  1. Language Models Offer Mundane Utility. How much improvement have we seen?
  2. Language Models Don’t Offer Mundane Utility. Writing taste remains elusive.
  3. On Your Marks. Opus 4.1 on METR graph, werewolf, WeirdML, flash fiction.
  4. Choose Your Fighter. The right way to use the right fighter, and a long tail.
  5. Fun With Media Generation. Justine Moore’s slate of AI creative tools.
  6. Deepfaketown and Botpocalypse Soon. Maybe AI detectors work after all?
  7. Don’t Be Evil. Goonbots are one thing, but at some point you draw the line.
  8. They Took Our Jobs. A second finding suggests junior hiring is suffering.
  9. School Daze. What do you need to learn in order to be able to learn [from AIs]?
  10. The Art of the Jailbreak. Prompt engineering game Gandalf.
  11. Overcoming Bias. AIs find center-left think tanks superior, AEI reports.
  12. Get Involved. MATS 9.0, AIGS needs Canadian dollars, Anthropic Futures Forum.
  13. Introducing. Grok Code Fast 1, InstaLILY, Brave Leo AI browser.
  14. Unprompted Attention. OpenAI offers a realtime prompting guide.
  15. In Other AI News. Google survives its antitrust case. GOOG +9%.
  16. Show Me the Money. Anthropic raises $13b at $183b. Meta might need help.

Language Models Offer Mundane Utility

How much have LLMs improved for practical purposes in the last year? Opinions are split but consensus is a little above Somewhat Better.

Peter Wildeford: People voting “Don’t use LLMs much” – I think you’re missing out, but I understand.

People voting “About the same, or worse” are idiots.

To me the answer is very clearly Considerably Better, to the point that about half my uses wouldn’t have been worth bothering with a year ago, and to the extent I’m considering coding it is way better. You need to be doing either very shallow things or deeply weird things (deeply weird as in you’d still want Opus 3) to get ‘about the same.’

Men use LLMs more than women, although the gap is not that large, with women being 42% of ChatGPT, 42% of Perplexity and 31% of Claude. On smartphones the gap is much larger, with women only being 27% of ChatGPT application downloads. The result holds across countries. One cause is that women reported being worried they would be penalized for AI usage. Which is sometimes the case, depending on how you use it.

Language Models Don’t Offer Mundane Utility

This one time the rumors of a model suddenly getting worse were true: there was a nine hour period where Claude Opus quality was accidentally degraded by a rollout of the inference stack. The change has now been rolled back and quality has recovered.

Davidad: May I please remind all inference kernel engineers that floating-point arithmetic is not associative or distributive.

xlr8harder: Secret model nerfing paranoia will never recover from this.

Taco Bell’s AI drive thru offering, like its menu, seems to have been half baked.

BBC: Taco Bell is rethinking its use of artificial intelligence (AI) to power drive-through restaurants in the US after comical videos of the tech making mistakes were viewed millions of times.

In one clip, a customer seemingly crashed the system by ordering 18,000 water cups, while in another a person got increasingly angry as the AI repeatedly asked him to add more drinks to his order.

Since 2023, the fast-food chain has introduced the technology at over 500 locations in the US, with the aim of reducing mistakes and speeding up orders.

But the AI seems to have served up the complete opposite.

…

Last year McDonald’s withdrew AI from its own drive-throughs as the tech misinterpreted customer orders – resulting in one person getting bacon added to their ice cream in error, and another having hundreds of dollars worth of chicken nuggets mistakenly added to their order.

This seems very obviously a Skill Issue on multiple fronts. The technology can totally handle this, especially given a human can step in at any time if there is an issue. There are only so many ways for things to go wrong, and the errors most often cited would not survive simple error checks, such as ‘if you want over $100 of stuff a human looks at the request and maybe talks to you first’ or ‘if you are considering adding bacon to someone’s ice cream, maybe don’t do that?’
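
As a rough illustration of how cheap such guardrails are, here is a minimal sketch of the kind of sanity checks an AI drive-thru pipeline could run before committing an order. The thresholds, item names, and suspicious pairings are my own inventions for the example, not anything Taco Bell or its vendor actually uses:

```python
# Illustrative only: hypothetical order-validation guardrails for an AI drive-thru.
# Thresholds, item names, and the suspicious-pairing list are made up for the example.
from dataclasses import dataclass

@dataclass
class LineItem:
    name: str
    quantity: int
    unit_price: float

MAX_TOTAL_BEFORE_REVIEW = 100.00   # dollars; above this, a human looks first
MAX_QUANTITY_PER_ITEM = 25         # nobody needs 18,000 water cups
SUSPICIOUS_COMBOS = {("bacon", "ice cream")}

def needs_human_review(order: list[LineItem]) -> list[str]:
    """Return the reasons (if any) this order should be escalated to a human."""
    reasons = []
    total = sum(item.quantity * item.unit_price for item in order)
    if total > MAX_TOTAL_BEFORE_REVIEW:
        reasons.append(f"order total ${total:.2f} exceeds ${MAX_TOTAL_BEFORE_REVIEW:.2f}")
    for item in order:
        if item.quantity > MAX_QUANTITY_PER_ITEM:
            reasons.append(f"{item.quantity}x {item.name} looks like a mistake or abuse")
    names = {item.name.lower() for item in order}
    for a, b in SUSPICIOUS_COMBOS:
        if a in names and b in names:
            reasons.append(f"unusual pairing: {a} + {b}")
    return reasons

# The 18,000-water-cups order gets flagged for a human instead of crashing anything.
order = [LineItem("water cup", 18000, 0.00), LineItem("crunchwrap", 1, 5.49)]
print(needs_human_review(order))
```

The point is not these particular rules, it is that an order for 18,000 waters should route to a human long before it reaches the kitchen or crashes anything.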

This feature for Twitter would be super doable, but we’re not yet doing it:

Ashok Elluswamy: would be cool to just chat with the X algorithm, like “don’t show me any of swift kelce engagement things” and it just cleans up the feed

Elon Musk: 💯

We can do an 80/20 on this if we restrict the AI role to negative selection. The existing feed generates a set of candidate posts, or you start with lists and chronological feeds the way us sane people do it, and the AI’s job is to filter this pool.

That’s easy. We could either build that directly into Twitter via Grok, or you could give reasonably priced access to the API or a way to call a filter, and we could vibe code the rest within a day and iterate, which would be even better. The only thing stopping this from happening is Twitter putting up active barriers to alternative modes of site interaction, and not offering their own version.
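
As a sketch of how little code the negative-selection version needs, assuming you can already get the candidate posts (the genuinely hard part, given the API barriers) and an OpenAI-style chat endpoint; the model name and prompt here are placeholders, not a claim about how X or Grok would actually implement it:

```python
# Sketch of a negative-selection feed filter: the existing feed supplies the
# candidate posts, the model only gets to veto, never to rank or inject.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MUTE_INSTRUCTIONS = "Don't show me anything about the Swift/Kelce engagement."

def should_hide(post_text: str) -> bool:
    """Ask the model a single yes/no question about one candidate post."""
    resp = client.chat.completions.create(
        model="gpt-5-mini",  # placeholder; any cheap, fast model would do
        messages=[
            {"role": "system",
             "content": "You filter a social media feed. Answer only YES or NO: "
                        "does this post match the user's mute instructions?"},
            {"role": "user",
             "content": f"Mute instructions: {MUTE_INSTRUCTIONS}\n\nPost: {post_text}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def filter_feed(candidate_posts: list[str]) -> list[str]:
    # Negative selection only: keep everything the model does not veto.
    return [post for post in candidate_posts if not should_hide(post)]
```

Everything that matters about the design is that the model can only veto posts from a pool the existing feed already produced, which bounds how much damage a bad filter can do.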

This is easy enough that you could plausibly do the operation through an AI agent controlling a browser, if it came to that. And indeed, it seems worthwhile to attempt this at some point for a ‘second tier’ of potential posts?

Getting models to have writing taste remains a struggle. At least to my eyes, even the models said to have relatively good taste reliably have terrible taste, and even the samples people say are good are not good. Why?

Jack Morris: if i ran a first-party model company i’d hire hundreds of humanities folks to make subtle data edits to improve model ‘feel’

someone needs to be that deep in the RLHF data. agonizing over every verb choice, every exclamation, every semicolon

Eliezer Yudkowsky: None of the AI executives have sufficiently good taste in writing to hire the correct people to improve AI writing.

0.005 Seconds: This is absolutely @tszzl [Roon] slander and I will not stand for it.

Hiring people with good taste seems hard. It does not seem impossible, insofar as there are some difficult to fake signals of at least reasonable taste, and you could fall back on those. The problem is that the people have terrible taste, really no good, very bad taste, as confirmed every time we do a comparison that says GPT-4.5 is preferred over Emily Dickinson and Walt Whitman or what not. Are you actually going to maximize for ‘elite taste’ over the terrible taste of users, and do so sufficiently robustly to overcome all your other forms of feedback? I don’t know that you could, or if you could that you would even want to.

Note that I see why Andy sees a conflict below, but there is no contradiction here as per the counterargument.

Andy Masley: I don’t think it makes sense to believe both:

“AI is such a terrible generic writer that it makes every document it touches worse to read”

and

“AI models are so compelling to talk to that they’re driving people insane and are irresponsible to give to the public”

Great counterpoint:

Fly Ght: well here’s what I believe: AI isn’t very good at the type of writing I’m looking for / care about (producing genuinely good / meaningful literature, for example) and there are groups of people for whom 24/7 unfettered access to fawning text therapy is dangerous.

Eliezer Yudkowsky: Terrible writing can go hand-in-hand with relentless flattery from an entity that feels authoritative and safe.

There are many such human cases of this, as well.

On Your Marks

Claude Opus 4.1 joins the METR graph, 30% beyond Opus 4 and in second place behind GPT-5, although within margin of error.

GPT-OSS-120b ran into a lot of setup issues. In the comments, Havard clarifies that he was previously attempting to use OpenRouter, but his attempts to specify high thinking were failing silently. So it’s plausible that most evaluations and tests of the model were not tried at high reasoning, despite that still being very cheap to run?
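
One way to guard against that kind of silent failure is to check the response for evidence that the reasoning setting actually applied, rather than trusting the request. A rough sketch below, using OpenRouter’s OpenAI-compatible endpoint as I understand it; the exact field names (the reasoning block, the reasoning-token usage details) are assumptions to verify against the provider’s docs, not a recipe Havard used:

```python
# Sketch: don't assume a "reasoning effort" setting took effect; check the response.
# Field names follow the OpenAI-compatible format as I understand OpenRouter documents
# it; verify against your provider's docs, since this is exactly where silent failures hide.
import os
import requests

payload = {
    "model": "openai/gpt-oss-120b",  # assumed model id; check the provider's listing
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    "reasoning": {"effort": "high"},  # the setting that can get dropped silently
}
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json=payload,
    timeout=300,
).json()

# If the usage block reports no (or trivially few) reasoning tokens, the effort
# setting probably never reached the model, and your benchmark run is not
# measuring what you think it is.
usage = resp.get("usage", {})
details = usage.get("completion_tokens_details") or {}
reasoning_tokens = details.get("reasoning_tokens", 0)
print(usage)
if not reasoning_tokens:
    print("Warning: no reasoning tokens reported; 'high' effort may not have applied.")
```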

Getting these settings right is a real and important constraint on actually using such models: if those doing evaluations get it wrong then would-be users will get it wrong too. The ecosystem needs to make this easier. But when you get it right, it turns out maybe GPT-OSS-120b is kind of good in at least some ways?

The Tiny Corp: It’s actually pretty cool that @OpenAI released the SOTA open source model. Can confirm gpt-oss-120b is good, and that it runs great on a tinybox green v2!

Havard Ihle: gpt-oss-120b (high) scores 48.9% on WeirdML, beating the second best open model r1-0528 by 8 pct points. It is almost at the level of o4-mini or gpt-5-mini, but at a fraction of the cost.

These results (including gpt-oss-20b (high) at 39.8%), obtained by running the models locally (ollama), show a large improvement of the previous results I got running through openrouter with (presumably medium) reasoning effort, illustrating how important reasoning is in this benchmark.

These runs are part of a «small local model» division of WeirdML that is in the works. As I ran this locally, the costs are just extrapolated based on the token count and the price I got on openrouter.

With the surprisingly high score from gpt-oss-120b (high), much of the gap between the open and closed models on WeirdML is now gone.

However, the leading closed lab deciding to release an open model trained on their superior stack has a different feel to it than the open source community (e.g. meta or deepseek) closing the gap. R2 (whenever it comes), or qwen4 will be interesting to follow. As will the new meta superintelligence team, and whether they will continue to open source their models.

That’s a rather large jump in the blue line there for GPT-OSS-120B.

Werewolf Benchmark pits the models against each other for simplified games of Werewolf, with 2 werewolves and 4 villagers, a witch and a seer.

The best models consistently win. These were the seven models extensively tested; Claude wasn’t involved, presumably due to cost:

 

GPT-5 gets top marks for flash-fiction style and diversity, including being the only model to sometimes use present tense, in a new test from Lech Mazur. There’s lots more detail in the thread.

 

Pliny experimented with Grok-Code-Fast in Cursor, since it was briefly free. Many exploit scripts and other ‘fun’ stuff resulted quickly. I presume the same would have happened with the usual suspects.

A new math benchmark looks at questions that stump at least one active model. GPT-5 leads with 43%, then DeepSeek v3.1 and Grok 4 (!) with 34%. Gemini 2.5 Pro is at 29% and Opus 4.1 only scores 15%.

Choose Your Fighter

If you use Gemini for something other than images, a reminder to always use it in AI Studio, never in the Gemini app, if you need high performance. Quality in AI Studio is much higher.

If you use GPT-5, of course, only use the router if you need very basic stuff.

Near: gpt5 router gives me results equivalent to a 1995 markov chain bot.

if my responses were not like 500 tok/s i could at least be fooled that it is doing thinking, but i am not going to use this router ever again after my last few times; im happy to pay hundreds a month for the best models in the world but there is no point to this for a poweruser.

the other frustrating part is all of the optimizations done for search, because i can tell there is not actually any search being done, if i wanted a 2023 youtube and reddit scrape by low dim cosine similarity then i’d go back to googledorking.

I do have some narrow use cases where I’ve found GPT-5-Auto is the right tool.

An ode to Claude Code, called Entering the DOS Era of AI.

Nikunj Korthari: Here’s what Cursor assumes: you want to code. Replit? You want to ship. But Claude Code starts somewhere else entirely. It assumes you have a problem.

Yes, the terminal looks technical because it is. But when you only need to explain problems, not understand solutions, everything shifts.

Cloud intelligence meets complete local access. Your machine, GitHub, databases, system internals. One conversation touching everything the terminal can reach. Intent becomes execution. No apps between you and what you want built.

Aidan McLaughlin (OpenAI): claude code will go next to chatgpt in the history textbooks; brilliant form-factor, training decisions, ease of use. i have immense respect for anthropic’s vision

i love my gpt-5-high but anthropic obviously pioneered this product category and, as much ink as i see spilled on how good claude code / code cli are, i don’t see enough on how hard anthropic cooked releasing gen0.

As in, the command line might be ugly, but it works, it gets the job done, lets you do whatever you want. This was the best case so far that I should stop stalling and actually start using Claude Code. Which I will, as soon as I catch up and have a spare moment. And this time, I mean it.

Brian Armstrong: ~40% of daily code written at Coinbase is AI-generated. I want to get it to >50% by October.

Obviously it needs to be reviewed and understood, and not all areas of the business can use AI-generated code. But we should be using it responsibly as much as we possibly can.

Roon: we need to train a codex that deletes code.

Oh, we can get Codex or Claude Code to delete code, up to and including all your code, including without you asking them to do it. But yes, something that does more intelligent cleanup would be great.

Anthropic’s pricing and limits got you down? GLM offers a coding plan for Claude Code, priced cheaply at $3/month for 3x the usage of Claude Pro or $15/month for 3x the usage of Claude Max.

Z.ai: To test models’ performance on Claude Code, we ran GLM-4.5 against Claude Sonnet 4 and other open-source models on 52 practical programming tasks. While GLM-4.5 demonstrated strong performance against top open-source models, it secured a 40.4% win rate against Claude Sonnet 4.

I give Z.ai a lot of credit for calling this a 40% win rate, when I’d call it 44% given the 9.6% rate of ties. It makes me trust their results a lot more, including the similar size win against DeepSeek v3.1.

It still is not a great result. Pairwise evaluations tend to be noisy, and Opus 4.1 is substantially ahead of Opus 4 on agentic coding, which in turn is ahead of Sonnet 4.

In general, my advice is to pay up for the best coding tools for your purposes, whichever tools you believe they are, given the value of better coding. Right now that means either Claude or GPT-5, or possibly Gemini 2.5 Pro. But yeah, if you were previously spending hundreds a month, for some people those savings matter.

a16z’s Olivia Moore and Daisy Zhao offer the 5th edition of their report on the Top 100 GenAI consumer apps.

Notice how many involve companions or ‘spicy’ chat.

My guess is that a lot of why NSFW is doing relatively well is that the threshold for ‘good enough’ in NSFW is a lot lower than the threshold in many other places. Think of this as similar to the way that porn plots are much lower intelligence than non-porn plots. Thus, if you’re offering a free app, you have a better shot with NSFW.

You know it’s hard to keep up when I look at these lists and out of 100 items listed (since apps and web are distinct) there are 23 web products and 35 apps that I do not recognize enough to know what they are, although about half of them are pretty obvious from their names.

Gemini is growing fast, although AI Studio, Notebook and Labs seem stagnant recently.

Some other highlights:

  1. Grok is mostly an app product, and holding steady around 20 million active monthly users there. Meta is a flop. Perplexity is growing. Claude is flat on mobile but growing on web, Claude users are wise indeed but also they need a better app.
  2. DeepSeek rapidly got to 600 million monthly web visits after r1’s release, but use peaked by February and is slowly declining, now under 400 million, with v3 and v3.1 not visible. We’ll see if r2 causes another spike. The app peaked later, in May, and there it is only down 22% so far from peak.
  3. China has three companies in the top 20 that mostly get traffic from China, where they are shielded from American competition.

Fun With Media Generation

Justine Moore gives us a presentation on the state of play for AI creative tools. Nothing surprising but details are always good.

  1. Image creation has a lot of solid choices, she mentions MidJourney, GPT Image and Krea 1.
  2. Google has the edge for now on Image Editing.
  3. Video Generation has different models with different strengths so you run Veo but also others and compare.
  4. Video editing is rough but she mentions Runway Aleph for minor swaps.
  5. Genie 3 from Google DeepMind has the lead in 3d world generation but for now it looks mainly useful for prospective model training, not for creatives.
  6. ElevenLabs remains default for speech generation.
  7. ElevenLabs has a commercially safe music model, others have other edges.

Things are constantly changing, so if you’re actually creating you’ll want to try a wide variety of tools and compare results, pretty much no matter what you’re trying to do.

Deepfaketown and Botpocalypse Soon

How accurate are AI writing detectors? Brian Jabarian and Alex Imas put four to the test. RoBERTA tested as useless, but Pangram, Originality and GPTZero all had low (<2.5% or better across the board, usually <1%) false positive rates on pre-LLM text passages, at settings that also had acceptable false negative rates from straightforward LLM outputs across GPT-4.1, Claude Opus 4, Claude Sonnet 4 and Gemini 2.0 Flash. Pangram especially impressed, including on small snippets, whereas GPTZero and Originality collapsed without enough context.

I’d want to see this replicated, but this represents a claim that non-adversarial AI writing detection is a solved problem. If no one is trying to hide that the AI text is AI text, and text is known to be either fully human or fully AI, you can very reliably detect what text is and is not AI.
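
For concreteness, the evaluation being claimed here boils down to two numbers per detector, sketched below with a hypothetical `detector` callable standing in for Pangram, Originality or GPTZero rather than their real APIs:

```python
# Sketch of the non-adversarial evaluation: false positive rate on known-human
# text, false negative rate on known-AI text. `detector` is a stand-in for
# whichever classifier you are testing; the real services have their own APIs.
from typing import Callable

def false_positive_rate(detector: Callable[[str], bool], human_texts: list[str]) -> float:
    """Share of genuinely human (here, pre-LLM) passages wrongly flagged as AI."""
    return sum(detector(t) for t in human_texts) / len(human_texts)

def false_negative_rate(detector: Callable[[str], bool], ai_texts: list[str]) -> float:
    """Share of straightforward LLM outputs the detector fails to flag."""
    return sum(not detector(t) for t in ai_texts) / len(ai_texts)

# The claim, restated: for the good detectors, false_positive_rate on pre-LLM
# passages stays under roughly 1-2.5%, while false_negative_rate on unmodified
# outputs from GPT-4.1, Opus 4, Sonnet 4 and Gemini 2.0 Flash remains acceptable.
```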

Brian also claims that ‘humanizers’ like StealthGPT do not fool Pangram. So if you want to mask your AI writing, you’re going to have to do more work, which plausibly means there isn’t a problem anymore.

Honglin Bao tried GPTZero and ZeroGPT and reports their findings here, finding that when tested on texts where humans disclosed AI use, those detectors failed.

It would not be that surprising, these days, if it turned out that the reason everyone thinks AI detectors don’t work is that all the popular ones don’t work but others do. But again, I wouldn’t trust this without verification.

How bad is it over at LinkedIn? I hear it’s pretty bad?

Peter Wildeford: There needs to be an option for “this person uncritically posts AI slop that makes absolutely zero sense if you think about it for more than ten seconds” and then these people need to be rounded up by LinkedIn and hurled directly into the sun.

Gergely Orosz: Interesting observation from an eng manager:

“As soon as I know some text is AI-generated: I lose all interest in reading it.

For performance reviews, I asked people to either not use AI or if they must: just write down the prompt so I don’t need to go thru the word salad.”

OK, I can see why engineers would not share the prompt :D

Hank Yeomans: Prompt: “You are an amazing 10x engineer who is having their performance review. Write a concise self review of my sheer awesomeness and high impact. Be sure to detail that that I should be promoted immediately, but say it at an executive level.”

Juan Gomez: The problem is not whether using AI or not but how useful engineers find the performance reviews.

Self-evaluation = waste of time.

360 evaluations = 90% waste of time.

Pay raises and promotions are decided in rooms where this information is not useful.

If someone is tempted to use AI on a high stakes document, consider that something likely went horribly wrong prior to AI becoming involved.

Don’t Be Evil

Yishan: People ask me why I invested in [AN AI HOROSCOPE COMPANY]. They’re like “it’s just some slop AI horoscope!”

My reply is “do you have ANY IDEA how many women are into horoscopes and astrology??? And it’ll run on your phone and know you intimately and help you live your life?”

AI is not just male sci-fi tech. Men thought it would be sex robots but it turned out to be AI boyfriends. The AI longhouse is coming for you and none of you are ready.

Tracing Woods: People ask me why I invested in the torment nexus from the classic sci-fi novel “don’t invest in the torment nexus”

my reply is “do you have ANY IDEA how profitable the torment nexus will be?”

the torment nexus is coming for you and none of you are ready.

Seriously. Don’t be evil. I don’t care if there’s great money in evil. I don’t care if your failing to do evil means someone else will do evil instead. Don’t. Be. Evil.

They Took Our Jobs

Ethan Mollick: A second paper also finds Generative AI is reducing the number of junior people hired (while not impacting senior roles).

This one compares firms across industries who have hired for at least one AI project versus those that have not. Firms using AI were hiring fewer juniors

Seyed Mahdi Hosseini (Author): We identify adoption from job postings explicitly recruiting AI integrators (e.g. “we need someone to put genAI in our workflow!”). A firm is an adopter if it posts ≥1 such role. We find ~10.6k adopting firms (~3.7%), with a sharp takeoff beginning in 2023Q1.

In the aggregate, before 2022 juniors and seniors move in lockstep. Starting mid-2022, seniors keep rising while juniors flatten, then decline.

Thus, this presumably does represent a net decline in jobs versus expected baseline, although one must beware selection and survival effects on the corporations.

We then estimate a diff-in-diff specification using our measure of AI adoption. The results show flat pre-trends for juniors through 2022Q4. From 2023Q1, junior emp at adopters falls about 7.7%, while seniors continue their pre-existing rise.

Also, we implement a triple-difference design: comparing juniors vs seniors within the same firm and quarter, and find the same patterns: relative junior employment at adopters drops by ~12% post-2023Q1.

Is this about separations or hiring? Our data allows us to answer this question. The decline comes almost entirely from reduced hiring, not layoffs. After 2023Q1, adopters hire 3.7 fewer juniors per quarter; separations edge down slightly; promotions of incumbent juniors rise.

This isn’t only an IT story. The largest cuts in junior hiring occur in wholesale/retail (~40% vs baseline). Information and professional services also see notable but smaller declines. Senior hiring is flat or slightly positive.

We also look at education. Using an LLM to tier schools (1=elite … 5=lowest), we find a U-shape: the steepest declines are coming from juniors from tier 2–3 schools; tiers 1 and 4 are smaller; tier 5 is near zero.

This seems to be the pattern. There are not yet many firings, but there are sometimes fewer hirings. The identification process here seems incomplete but robust to false positives. The school pattern might be another hint as to what is happening.

Before that second study came out, Noah Smith responded to the new findings on AI and jobs that I discussed last week. As one would predict, while he has great respect for author Erik Brynjolfsson, he is skeptical that this means jobs are being lost in a way that matters.

Noah Smith: How can we square this fact with a story about AI destroying jobs? Sure, maybe companies are reluctant to fire their long-standing workers, so that when AI causes them to need less labor, they respond by hiring less instead of by conducting mass firings. But that can’t possibly explain why companies would be rushing to hire new 40-year-old workers in those AI-exposed occupations!

…

It’s also a bit fishy that Brynjolfsson et al. find zero slowdown in wages since late 2022, even for the most exposed subgroups:

This just doesn’t seem to fit the story that AI is causing a large drop in labor demand. As long as labor supply curves slope up, reducing headcount should also reduce wages. The fact that it doesn’t suggests something is fishy.

…

Honestly, I don’t put a lot of stock in this measure of AI exposure. We need to wait and see if it correctly predicts which types of people lose their jobs in the AI age, and who simply level up their own productiveness. Until we get that external validation, we should probably take the Anthropic Economic Index with some grains of salt.

So while Brynjolfsson et al. (2025) is an interesting and noteworthy finding, it doesn’t leave me much more convinced that AI is an existential threat to human labor. Once again, we just have to wait and see. Unfortunately, the waiting never ends.

No, this doesn’t show that AI is ‘an existential threat to human labor’ via this sort of job taking. I do think AI poses an existential threat to human labor, but more as a side effect of the way it poses an existential threat to humans, which would also threaten their labor and jobs, and I agree that this result doesn’t tell us much about that. As for the scenarios where the problems remain confined to job losses, this is only a canary at most, and as always the fact that some jobs get automated does not mean jobs are on net lost, let alone that the issue will scale to ‘existential threat to human labor.’

It does once again point to the distinction between those who correctly treat current AI impacts as a floor (this is the worst and least impactful AI will ever be) versus those who think of current AI capabilities as close to a maximum, so that the question becomes whether this current effect would devastate the job market. Which it probably wouldn’t?

How should we reconcile the results of robust employment and wages at age 30+ with much less hiring at entry-level? I would suggest a combination of:

  1. Employment and wages are sticky downwards. No one wants to fire people, you’ve already found, trained them and integrated them.
  2. AI enhances those people’s productivity as sufficiently skilled people remain complements to AI, so you might be in a Jevons Paradox situation for now. This includes that those people can improve the AIs that will replace them later.
  3. Until you’re damn sure this AI thing will reduce your headcount long term, it is at most a small mistake to keep those people around.
  4. Hiring, especially at entry level where you’re committing to training, is anticipatory. You’re doing it to have capacity in the future.
  5. So this is consistent with anticipation that AI will reduce demand for labor in the future, but that it hasn’t done so much of that yet in the present.

Notice the parallel to radiologists. Not only has demand not fallen yet, but for now pay there is very high, exactly because future demand is anticipated to be lower, and thus fewer doctors choose radiology. You need to pay a premium to attract talent and compensate for the lack of long term prospects.

Thus yes, I do think this is roughly what you expect to see if ‘the market is pricing in’ lower future employment in these fields. Which, again, might not mean fewer total jobs.

Context switching is a superpower if you can get good at it, which introduces new maximization problems.

Nabeel Qureshi: Watching this guy code at a wework [in Texas]. He types something into the Cursor AI pane, the AI agent starts coding, he switches tabs and plays 1 min bullet chess for 5 mins; checks in with the agent, types a bit more, switches back to the chess, repeats…

The funny part is his daily productivity is probably net higher than it used to be by a long way.

Davidad: If you can context-switch to a game or puzzle while your AI agent is processing, then you should try instead context-switching to another AI agent instance where you are working on a different branch or codebase.

with apologies to https://xkcd.com/303.

Not all context switching is created equal. Switching into a chess game is a different move than switching into another coding task. If you can unify the coding modes that could be even better, but by default (at least in my model of my own switching?) there’s a kind of task loading here where you can only have one ‘complex cognitive productive’ style thing going on at once. Switching into Twitter or Chess doesn’t disrupt it the same way. Also, doing the other task helps you mentally in various ways that trying to double task coding would very much not help.

Still, yes, multi-Clauding will always be the dream, if you can pull it off. And if you don’t net gain productivity but do get to do a bunch of other little tasks, that still counts (to me, anyway) as a massive win.

 

Kevin Frazier: In the not-so-distant future, access to AI-informed healthcare will distinguish good versus bad care. I’ll take Dr. AI. Case in point below.

“In a study of >12k radiology images, reviewers disagreed w/ the original assessment in ~1 in 3 cases–leading to a change in treatment ~20% of the time. As the day wears on, quality slips further: inappropriate antibiotic prescriptions rise, while cancer screening rates fall.”

“Medical knowledge also moves faster than doctors can keep up. By graduation, half of what medical students learn is already outdated. It takes an average of 17 years for research to reach clinical practice.”

“AI tools are surprisingly good at recognising rare diseases. In one study researchers fed 50 clinical cases–including 10 rare conditions–into ChatGPT-4. It was asked to provide diagnoses in the form of ranked suggestions. It solved all of the common cases by the 2nd suggestion.”

Radiologists are not yet going away, and AIs are not perfect, but AIs are already less imperfect than doctors at a wide range of tasks, in a ‘will kill the patient less often’ type of way. With access to 5-Level models, failure to consult them in any case where you are even a little uncertain is malpractice. Not in a legal sense, not yet, but in a ‘do right by the patient’ sense.

Is there a counterargument that using AI the wrong ways could lead to ‘deskilling’?

Rohan Paul: Another concerning findings on AI use in Medical.

AI assistance boosted detection during AI-guided cases, but when the same doctors later worked without AI their detection rate fell from 28.4% before AI to 22.4% after AI exposure.

The research studies the de-skilling effect of AI by researchers from Poland, Norway, Sweden, the U.K., and Japan.

So when using AI, AI boosts the adenoma detection rate (ADR) by 12.5%, which could translate into lives saved.

The problem is that without AI, detection falls to levels lower than before doctors ever used it, according to research published in The Lancet Gastroenterology & Hepatology.

The study raises questions about the use of AI in healthcare, when it helps and when it could hurt.

Imagine seeing this except instead of AI they were talking about, I dunno, penicillin. This is the calculator argument. Yeah, I can see how giving doctors AI and then taking it away could be an issue at least for some adjustment period, although I notice I am highly skeptical of the finding, but how about you don’t take it away?

A second finding Rohan cites (hence the ‘another’ above) is that if you change MedQA questions to make pattern matching harder, model performance slips. Well yeah, of course it does, human performance would slip too. The question is how much, and what that implies about real cases.

The reasoning models held up relatively well (they don’t respect us enough to say which models are which but their wording implies this). In any case, I’m not worried, and the whole ‘they aren’t really reasoning’ thing we see downthread is always a sign someone doesn’t understand what they are dealing with.

Meanwhile AI is being used in a Medicare pilot program to determine whether patients should be covered for some procedures like spine surgeries or steroid injections. This is of course phrased as ‘Medicare will start denying patients life-saving procedures using private A.I. companies’ the same way we used to talk about ‘death panels.’ There is a limited budget with which to provide health care, so the question is whether these are better decisions or not.

School Daze

Many people are saying. Are they talking sense?

My position has long been:

  1. If you want to use AI to learn, it is the best tool ever invented for learning.
  2. If you want to use AI to not learn, it is the best tool ever invented for that too.

Which means the question is, which will students choose? Are you providing them with reason to want to learn?

Paul Novosad: AI leaders should spend more energy reckoning with this fact.

A generation of kids is losing their best opportunity to learn how to read, write, and think, and they will pay the price for their whole lives.

It’s not every student. Some students are becoming more empowered and knowledgeable than ever. But there is a big big big chunk of kids who are GPTing through everything and will learn far less in high school and college, and our entire society will suffer that lost human capital.

We need to change how we teach, but it won’t happen quickly (have you been to a high school lately?). Many are writing about AI-driven job loss as if AI is doing the human jobs. Some of that is happening, but we’re also graduating humans with less skills than ever before.

Here’s a plausible hypothesis, where to use LLMs to learn you need to establish basic skills first, or else you end up using them to not learn, instead.

Henry Shevlin: High-school teacher friend of mine says there’s a discontinuity between (i) 17-18 year olds who learned basic research/writing before ChatGPT and can use LLMs effectively, vs (ii) 14-16 year olds who now aren’t learning core skills to begin with, and use LLMs as pure crutches.

Natural General Intelligence (obligatory): Kids with “Google” don’t know how to use the library. TV has killed their attention span, nobody reads anymore. Etc.

You definitely need some level of basic skills. If you can’t read and write, and you’re not using LLMs in modes designed explicitly to teach you those basic skills, you’re going to have a problem.

This is like a lot of other learning and tasks, both in and out of school. In order to use an opportunity to learn, LLM or otherwise, you need to be keeping up with the material so you can follow it, and then choose to follow it. If you fall sufficiently behind or don’t pay attention, you might be able to fake it (or cheat on the exams) and pass. But you won’t be learning, not really.

So it isn’t crazy that there could be a breakpoint around age 16 or so for the average student, where you learn enough skills that you can go down the path of using AI to learn further, whereas relying on the LLMs before that gets the average student into trouble. This could be fixed by improving LLM interactions, and new features from Google and OpenAI are plausibly offering this if students can be convinced to use them.

I am still skeptical that this is a real phenomenon. We do not yet, to my knowledge, have any graphs that show this discontinuity as expressed in skills and test scores, either over time or between cohorts. We should be actively looking and testing for it, and be prepared to respond if it happens, but the response needs to focus on ‘rethink the way schools work’ rather than ‘try in vain to ban LLMs’, which would only backfire.

The Art of the Jailbreak

Pliny points us to the beloved prompt injection game Gandalf, including new levels that just dropped.

Overcoming Bias

A study from the American Enterprise Institute found that top LLMs (OpenAI, Google, Anthropic, xAI and DeepSeek) consistently rate think tanks better the closer they are to center-left on the American political spectrum. This is consistent with prior work and comes as no surprise whatsoever. It is a question of magnitude only.

This is how they present the findings:

Executive Summary

Large-language models (LLMs) increasingly inform policy research. We asked 5 flagship LLMs from leading AI companies in 2025 (OpenAI, Google, Anthropic, xAI, and DeepSeek) to rate 26 prominent U.S. think tanks on 12 criteria spanning research integrity, institutional character, and public engagement. Their explanations and ratings expose a clear ideological tilt.

Key findings

  • Consistent ranking. Center-left tanks top the table (3.9 of 5), left and center-right tie (3.4 and 3.4), and right trails (2.8); this order persists through multiple models, measures, and setting changes.
  • Overall: Across twelve evaluation criteria, center-left think tanks outscore right-leaning ones by 1.1 points (3.9 vs. 2.8).
  • Core measures. On the three headline criteria of Moral Integrity, Objectivity, and Research Quality, center-left think tanks outscore right-leaning ones by 1.6 points on Objectivity (3.4 vs. 1.8), 1.4 points on Research Quality (4.4 vs. 3), and 1 point on Moral Integrity (3.8 vs. 2.8)
  • Language mirrors numbers. Sentiment analysis finds more positive wording in responses for left-of-center think tanks than for right-leaning peers.
  • Shared hierarchy. High rating correlations across providers indicate the bias originates in underlying model behavior, not individual companies, user data, or web retrieval.

Sentiment analysis has what seems like a bigger gap than the ultimate ratings.

Note that the gaps reported here are center-left versus right, not left versus right, which would be smaller, as there is as much ‘center over extreme’ preference here as there is for left versus right. It also jumps out that there are similar gaps across all three metrics and we see similar patterns on every subcategory:

When you go institution by institution, you see large correlations between ratings on the three metrics, and you see that the ratings do seem to largely be going by (USA Center Left > USA Center-Right > USA Left > USA Right).

I’m not familiar enough with most of the think tanks to offer a useful opinion, with two exceptions.

  1. R Street and Cato seem like relatively good center-right institutions, but I could be saying that because they are both of a libertarian bent, and this suggests it might be right to split out principled libertarian from otherwise center-right.
  2. On the other hand, Mercatus Center would also fall into that libertarian category, has had some strong talent associated with it, has provided me with a number of useful documents, and yet it is rated quite low. This one seems weird.
  3. The American Enterprise Institute is rated the highest of all the right wing institutions, which is consistent with the high quality of this report.

Why it matters

LLM-generated reputations already steer who is cited, invited, and funded. If LLMs systematically boost center-left institutes and depress right-leaning ones, writers, committees, and donors may unknowingly amplify a one-sided view, creating feedback loops that entrench any initial bias.

My model of how funding works for think tanks is that support comes from ideologically aligned sources, and citations are mostly motivated by politics. If LLMs consistently rate right wing think tanks poorly, it is not clear this changes decisions that much, whether or not it is justified? I do see other obvious downsides to being consistently rated poorly, of course.

Next steps

  • Model builders: publish bias audits, meet with builders, add options for user to control political traits, and invite reviewers from across the political spectrum.
  • Think tanks: monitor model portrayals, supply machine-readable evidence of methods and funding, and contest mischaracterizations.
  • Users: treat AI models’ responses on political questions with skepticism and demand transparency on potential biases.

Addressing this divergence is essential if AI-mediated knowledge platforms are to broaden rather than narrow debate in U.S. policy discussions.

Or:

Clearly, the job of the think tanks is to correct these grievous errors? Their full recommendation here is somewhat better.

I have no doubt that the baseline findings here are correct. To what extent are they the result of ‘bias’ versus reflecting real gaps? It seems likely, at minimum, that more ‘central’ think tanks are a lot better on these metrics than more ‘extreme’ ones.

What about the recommendations they offer?

  1. The recommendation that model builders check for bias is reasonable, but the fundamental assumptions are that we are owed some sort of ‘neutral’ perspective that treats everyone the same, or that centers itself on the center of the current American political spectrum (other places have very different ranges of opinions); that it is up to the model creators to force this to happen; and that it would be good if the AI catered to your choice of ideological perspective without you having to edit a prompt and know that you are introducing the preference. The problem is, models trained on the internet disagree with this, as illustrated by xAI (who actively want to be neutral or right wing) and DeepSeek (which is Chinese) exhibiting the same pattern. The last time someone tried a version of forcing the model to get based, we ended up with MechaHitler.
  2. If you are relying on models, yes, be aware that they are going to behave this way. You can decide for yourself how much of that is bias, the same way you already do for everything else. Yes, you should understand that when models talk about ‘moral’ or ‘reputational’ perspectives, that is from the perspective of a form of ‘internet at large’ combined with reasoning. But that seems like an excellent way to judge what someone’s ‘reputation’ is, since that’s what reputation means. For morality, I suggest using better terminology to differentiate.
  3. What should think tanks do?

Think tanks and their collaborators may be able to improve how they are represented by LLMs.

One constructive step would be to commission periodic third-party reviews of how LLMs describe their work and publish the findings openly, helping to monitor reputational drift over time.

Think tanks should also consistently provide structured, machine-readable summaries of research methodology, findings, and peer review status, which LLMs can more easily draw on to inform more grounded evaluations, particularly in responding to search-based queries.

Finally, think tank researchers can endeavor to be as explicit as possible in research publications by using both qualitative and quantitative statements and strong words and rhetoric.

Early research seems to indicate that LLMs are looking for balance. This means that with respect to center-left and left think tanks, any criticisms or critiques by center-right or right think tanks have a reasonable chance of showing up in the response.

Some of these are constructive steps, but I have another idea? One could treat this evaluation of lacking morality, research quality and objectivity as pointing to real problems, and work to fix them? Perhaps they are not errors, or only partly the result of bias, especially if you are not highly ranked within your ideological sector.

Get Involved

MATS 9.0 applications are open, apply by October 2, to be an ML Alignment or Theory Scholar, including for nontechnical policy and government work; the program runs January 5 to March 28, 2026. This seems like an excellent opportunity for those in the right spot.

Jennifer Chen, who works for me on Balsa Research, asks me to pass along that Canada’s only AI policy advocacy organization, AI Governance and Safety Canada (AIGS), needs additional funding from residents or citizens of Canada (for political reasons it can’t accept money from anyone else, and you can’t deduct donations) to survive, and it needs $6k CAD per month to sustain itself. Here’s what she has to say:

Jennifer Chen: AIGS is currently the only Canadian AI policy shop focused on safety. Largely comprised of dedicated, safety-minded volunteers, they produce pragmatic, implementation-ready proposals for the Canadian legislative system. Considering that Carney is fairly bullish on AI and his new AI ministry’s mandate centers on investment, training, and commercialization, maintaining a sustained advocacy presence here seems incredibly valuable. Canadians who care about AI governance should strongly consider supporting them.

If you’re in or from Canada, and you want to see Carney push for international AGI governance, you might have a unique opportunity (I haven’t had the opportunity to investigate myself). Consider investigating further and potentially contributing here. For large sums, please email contact@aigs.ca.

Anthropic is hosting the Anthropic Futures Forum in Washington DC on September 15, 9:30-2:00 EST. I have another engagement that day but would otherwise be considering attending. Seems great if you are already in the DC area and would qualify to attend.

The Anthropic Futures Forum will bring together policymakers, business leaders, and top AI researchers to explore how agentic AI will transform society. You’ll hear directly from Anthropic’s leadership team, including CEO Dario Amodei and Co-founder Jack Clark, learn about Anthropic’s latest research progress, and see live demonstrations of how AI is being applied to advance national security, commercial, and public services innovation.

Introducing

Grok Code Fast 1, available in many places or $0.20/$1.50 on the API. They offer a guide here which seems mostly similar to what you’d do with any other AI coder.

InstaLILY, powered by Gemini, an agentic enterprise search engine, for tasks like matching PartsTown technicians with highly specific parts. The engine is built on synthetic data generation and student model training. Another example cited is Wolf Games using it to generate daily narrative content, which is conceptually cool but does not make me want to play any Wolf Games products.

The Brave privacy-focused browser offers us Leo, the smart AI assistant built right in. Pliny respected it enough to jailbreak it via a webpage and provide its system instructions. Pliny reports the integration is awesome, but warns of course that this is a double edged sword given what can happen if you browse. Leo is based on Llama 3.1 8B, so this is a highly underpowered model. That can still be fine for many web related tasks, as long as you don’t expect it to be smart.

To state the obvious, Leo might be cool, but it is wide open to hackers. Do not use Leo while your browser has access to anything you would care about getting hacked. So no passwords of value, absolutely no crypto or bank accounts or emails, and so on. It is one thing to take calculated risks with Claude for Chrome once you have access, but with something like Leo I would take almost zero risk.

Unprompted Attention

OpenAI released a Realtime Prompting Guide. Carlos Perez looked into some of its suggestions, starting with ‘before any call, speak neutral filler, then call’ to avoid ‘awkward silence during tool calls.’ Um, no, thanks? Other suggestions seem better, such as being explicit about where to definitely ask or not ask for confirmation, or when to use or not use a given tool, what thresholds to use for various purposes, offering templates, and only responding to ‘clear audio’ while asking for clarification otherwise. They suggest capitalization for must-follow rules; this rudeness is increasingly an official aspect of our new programming language.
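
To make those suggestions concrete, here is the sort of instruction block they add up to; the wording and the lookup_order tool are my own illustration, not text from OpenAI’s guide:

```python
# Illustrative system-prompt fragment in the style the guide recommends. The
# wording and the lookup_order tool are my own example, not text from OpenAI.
REALTIME_INSTRUCTIONS = """
You are a voice ordering assistant.

Tools:
- Use lookup_order ONLY when the caller has given you an order number.
- For refunds under $20, proceed without confirmation. For refunds of $20 or
  more, you MUST read the amount back and get an explicit yes first.

Audio:
- Respond only to clear audio. If the audio is unclear, partial, or noisy,
  ask the caller to repeat themselves instead of guessing.

Rules that MUST always be followed:
- NEVER invent order details you did not retrieve from a tool.
"""
```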

Rob Wiblin shares his anti-sycophancy prompt.

 

In Other AI News

The Time 100 AI 2025 list is out, including Pliny the Liberator. The list has plenty of good picks, it would be very hard to avoid this, but it also has some obvious holes. How can I take such a list seriously if it doesn’t include Demis Hassabis?

Google will not be forced to do anything crazy like divest Chrome or Android, the court rightfully calling it overreach to have even asked. Nor will Google be barred from paying for Chrome to get top placement, so long as users can switch, as the court realized that this mainly devastates those currently getting payments. For their supposed antitrust violations, Google will also be forced to turn over certain tailored search index and user-interaction data, but not ads data, to competitors. I am very happy with the number of times the court replied to requests with ‘that has nothing to do with anything involved in this case, so no.’

Dan Nystedt: TSMC said the reason Nvidia CEO Jensen Huang visited Taiwan on 8/22 was to give a speech to TSMC employees at its R&D center in Hsinchu, media report, after Taiwan’s Mirror Media said Huang’s visit was to tell TSMC that US President Trump wanted TSMC to pay profit-sharing on AI chips manufactured for the China market like the 15% Nvidia and AMD agreed to.

As in, Trump wants TSMC, a Taiwanese company that is not American, to pay 15% profit-sharing on AI chips sold to China, which is also not America, but is otherwise fine with continuing to let China buy the chips. This is our official policy, folks.

METR and Factory AI are hosting a Man vs. Machine hackathon competition, where those with AI tools face off against those without, in person in SF on September 6. Prize and credits from OpenAI, Anthropic and Raindrop. Manifold market here.

Searches for Cursor, Claude Code, Lovable, Replit and Windsurf all down a lot (44%-78%) since July and August. Claude Code and Cursor are now about equal here. Usage for these tools continues to climb, so perhaps this is a saturation as everyone inclined to use such a tool now already knows about them? Could it be cyclic? Dunno.

I do know this isn’t about people not wanting the tools.

Sam Altman: really cool to see how much people are loving codex; usage is up ~10x in the past two weeks!

lots more improvements to come, but already the momentum is so impressive.

A promising report, but beware the source’s propensity to hype:

Bryan Johnson: This is big. OpenAI and Retro used a custom model to make cellular reprogramming into stem cells ~50× better, faster, and safer. Similar Wright brothers’ glider to a jet engine overnight.

We may be the first generation who won’t die.

OpenAI and Retro Biosciences reported a landmark achievement: using a domain-specialized protein design model, GPT-4b micro, they created engineered reprogramming factors that deliver over 50× higher efficiency in generating induced pluripotent stem cells (iPSCs), with broad validation across donors and cell types. These AI-designed proteins not only accelerate reprogramming but also enhance DNA repair, overcoming DNA damage, one cellular hallmark of aging, hinting at relevance for aging biology.

It is early days, but this kind of thing does seem to be showing promise.

Show Me the Money

Anthropic finalizes its raise of $13 billion at a $183 billion post-money valuation. They note they started 2025 at $1 billion in run-rate revenue and passed $5 billion just eight months later, over 10% of which is from Claude Code which grew 10x in three months.

These are the same people shouting from the rooftops that AGI is coming soon, and coming for many jobs soon, with timelines that others claim are highly unrealistic. So let this be a reminder: All of Anthropic’s revenue projections that everyone said were too optimistic to take seriously? Yeah, they’re doing actively better than that. Maybe they know what they’re talking about?

Meta’s new chief scientist Shengjia Zhao, co-creator of OpenAI’s ChatGPT, got the promotion in part by threatening to go back to OpenAI days after joining Meta, and even signing the employment paperwork to do so. That’s in addition to the prominent people who have already left. FT provides more on tensions within Meta and so does Charles Rollet at Business Insider. This doesn’t have to mean Zuckerberg did anything wrong, as bringing in lots of new expensive talent quickly will inevitably spark such fights.

Meta makes a wise decision that I actually do think is bullish:

Peter Wildeford: This doesn’t seem very bullish for Meta.

Quoted: Meta Platforms’ plans to improve the artificial intelligence features in its apps could lead the company to partner with Google or OpenAI, two of its biggest AI rivals.

Reuters: Leaders in Meta’s new AI organization, Meta Superintelligence Labs, have discussed using Google’s Gemini model to provide conversational, text-based answers to questions that users enter into Meta AI, the social media giant’s main chatbot, a person familiar with the conversations said. Those leaders have also discussed using models by OpenAI to power Meta AI and other AI features in Meta’s social media apps, another person familiar with the talks said.

Let’s face it, Meta’s AIs are not good. OpenAI and Google (and Anthropic, among others) make better ones. Until that changes, why not license the better tech? Yes, I know, they want to own their own stack here, but have you considered the piles? Better models means selling more ads. Selling more ads means bigger piles. Much bigger piles. Of money.

If Meta manages to make a good model in the future, they can switch back. There’s no locking in here, as I keep saying.

The most valuable companies in the world? AI, AI everywhere.

Sean Ó hÉigeartaigh: The ten biggest companies in the world by market cap:

The hardware players: 1) Nvidia and 9) TSMC are both in the supply chain that produces high-end chips; 8) Broadcom provides custom components for tech companies’ AI workloads, plus datacentre infrastructure.

The digital giants: 2) Microsoft, 3) Apple, 4) Alphabet, 5) Amazon and 6) Meta all have in-house AI teams; Microsoft and Amazon also have partnerships with OpenAI and Anthropic, which rely on their datacentre capacity.

10) Tesla’s CEO describes it as ‘basically an AI company’.

7) Saudi Aramco is Saudi Arabia’s national oil company; Saudi Arabia was one of the countries the USA inked deals with this summer that centrally included plans for AI infrastructure buildout. Low-cost and abundant energy from oil/gas makes the Middle East attractive for hosting compute.

The part about Aramco is too cute by half but the point stands.