I haven't tried AI writing recently, but last time I did, it was horrible. With my kids we tried prompts like "write a story about two kittens who went to a forest and found a treasure", and the result was invariably like this:
5 paragraphs of text:
The first two paragraphs describe the protagonists and the setting. "There were two kittens named ..., they were great friends, and they lived in ..."
The third paragraph was the "story", except using the very opposite of the "show, don't tell" rule, often just a restatement of the prompt. "One day, the kittens went to the forest, had many adventures, and found a treasure."
The remaining two paragraphs clarified the moral of the story, which was invariably "and so the kittens learned that friendship is more precious than all treasures in the world", described in the most annoying way.
I tried to nudge the AI to improve the story, like "expand the part about the adventures, be more specific", which the AI did in the laziest possible way, like: "The kittens had to solve many puzzles and fight the dragon", without saying anything specific about the puzzles or the fight.
I used to work for an SF&F magazine, so I have read my share of bad stories, but the AI stories were the worst of all. It was as if it were explicitly trained to make the story as bad as possible.
(I should probably try this again.)
If you can context-switch to a game or puzzle while your AI agent is processing, then you should try instead context-switching to another AI agent instance where you are working on a different branch or codebase.
This is my nightmare. I already find it difficult to work on multiple projects in parallel during the same sprint, but if the future is working on multiple projects literally at the same time, editing in one window while the other one compiles, expected to complete a dozen Jira tasks each hour in parallel... I can only hope that suicide remains legal.
Regarding the popularity of coding tools on Google search trends, could it be that we are mostly seeing the dynamics of their competition with each other, and that Codex (which is not listed among the alternatives) started to eat a lot of relative share after the August 7 GPT-5 release?
When one looks at Google Trends, the curves for both Codex the word (“search term”) and OpenAI Codex the topic (“software”) look suggestive.
At least by my eyes, even when they have relatively good taste they all reliably have terrible taste, and even the samples people say are good are not good.
We can get a lot here if we remember that a lot of "good writing" is centered around "not repeating itself" in different forms (words/phrases/structures etc), and current models are absolutely terrible at that. If we could add temporary negative weights to terms already used in the answer, decaying to zero over time, we could incentivise the LLMs to use a wider variety of language.
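For what it's worth, the mechanism the commenter describes is essentially a repetition penalty with decay added, applied at sampling time. A minimal sketch, assuming you have access to raw logits; the penalty strength and half-life are made-up illustrative values:

```python
import math

def apply_decaying_penalty(logits: list[float], generated: list[int],
                           penalty: float = 2.0, half_life: float = 50.0) -> list[float]:
    """Subtract a temporary negative weight from tokens already used.

    The weight halves every `half_life` tokens, so a token used long ago
    costs almost nothing, while one used just now costs the full penalty.
    """
    penalized = list(logits)
    for age, token_id in enumerate(reversed(generated)):
        penalized[token_id] -= penalty * math.pow(0.5, age / half_life)
    return penalized
```

Standard frequency and presence penalties already exist in most sampling APIs, but they do not decay with distance, which is the novel part of the suggestion.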
I can confirm that ElevenLabs still leads the pack in solid speech generation for narration, which is a bit of a shame as it remains shockingly expensive.
Apropos, here is the ElevenLabs-generated podcast episode for this post:
https://open.substack.com/pub/dwatvpodcast/p/ai-132-part-1-improved-ai-detection
One result of going on vacation was that I wasn’t able to spin events off into focused posts this week, so I’m going to fall back on splitting the weekly instead, plus reserving a few subtopics for later posts, including AI craziness (the Tim Hua post on this is excellent), some new OpenAI largely policy-related shenanigans, and the continuing craziness of some people who should very much know better confidently saying that we are not going to hit AGI any time soon, plus some odds and ends including dead internet theory.
That still leaves tons of other stuff.
Table of Contents
Language Models Offer Mundane Utility
How much have LLMs improved for practical purposes in the last year? Opinions are split but consensus is a little above Somewhat Better.
To me the answer is very clearly Considerably Better, to the point that about half my uses wouldn’t have been worth bothering with a year ago, and to the extent I’m considering coding it is way better. You need to be doing either very shallow things or deeply weird things (deeply weird as in you’d still want Opus 3) to get ‘about the same.’
Men use LLMs more than women, although the gap is not that large, with women being 42% of ChatGPT, 42% of Perplexity and 31% of Claude. On smartphones the gap is much larger, with women only being 27% of ChatGPT application downloads. The result holds across countries. One cause is that women reported being worried they would be penalized for AI usage. Which is sometimes the case, depending on how you use it.
Language Models Don’t Offer Mundane Utility
This one time the rumors of a model suddenly getting worse were true: there was a nine hour period where Claude Opus quality was accidentally degraded by a rollout of the inference stack. The change has now been rolled back and quality has recovered.
Taco Bell’s AI drive thru offering, like its menu, seems to have been half baked.
This seems very obviously a Skill Issue on multiple fronts. The technology can totally handle this, especially given a human can step in at any time if there is an issue. There are only so many ways for things to go wrong, and the errors most often cited would not survive simple error checks, such as ‘if you want over $100 of stuff a human looks at the request and maybe talks to you first’ or ‘if you are considering adding bacon to someone’s ice cream, maybe don’t do that?’
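As a minimal sketch of the kind of checks that would catch the most-cited failures (the thresholds and the combination rule here are made-up illustrations, not anyone's actual logic):

```python
def needs_human_review(order_total: float, items: dict[str, int]) -> bool:
    """Decide whether a human should look at an AI-taken order before confirming."""
    if order_total > 100:  # unusually expensive order: human looks first
        return True
    if any(qty > 20 for qty in items.values()):  # implausible quantity of one item
        return True
    if "bacon" in items and "ice cream" in items:  # absurd combination
        return True
    return False
```

None of this requires intelligence from the model, just someone deciding in advance what failure looks like.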
This feature for Twitter would be super doable, but we’re not yet doing it:
We can do an 80/20 on this if we restrict the AI's role to negative selection. The existing feed generates a set of candidate posts, or you start with lists and chronological feeds the way us sane people do it, and the AI’s job is to filter this pool.
That’s easy. We could either build that directly into Twitter via Grok, or you could give reasonably priced access to the API or a way to call a filter, and we could vibe code the rest within a day and iterate, which would be even better. The only thing stopping this from happening is Twitter putting up active barriers to alternative modes of site interaction, and not offering their own version.
This is easy enough that you could plausibly do the operation through an AI agent controlling a browser, if it came to that. And indeed, it seems worthwhile to attempt this at some point for a ‘second tier’ of potential posts?
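A minimal sketch of that negative-selection filter, assuming an OpenAI-style chat API; the model name, the prompt and the source of candidate posts are all placeholders rather than a real Twitter integration:

```python
from openai import OpenAI

client = OpenAI()

FILTER_PROMPT = (
    "You are a feed filter. Given one post, answer only KEEP or DROP. "
    "DROP rage bait, engagement bait, and low-information dunks."
)

def keep_post(post_text: str) -> bool:
    reply = client.chat.completions.create(
        model="gpt-5-mini",  # placeholder: any cheap, fast model would do
        messages=[
            {"role": "system", "content": FILTER_PROMPT},
            {"role": "user", "content": post_text},
        ],
    )
    return reply.choices[0].message.content.strip().upper().startswith("KEEP")

def filter_feed(candidates: list[str]) -> list[str]:
    # The existing feed (or your lists) supplies candidates; the AI only vetoes.
    return [post for post in candidates if keep_post(post)]
```

The key design choice is that the AI never generates or ranks the feed, it only removes, which keeps the failure mode at 'I missed a bad post' rather than 'the AI now decides what I see.'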
Getting models to have writing taste remains a struggle; at least by my eyes, even when they have relatively good taste they all reliably have terrible taste, and even the samples people say are good are not good. Why?
Hiring people with good taste seems hard. It does not seem impossible, insofar as there are some difficult to fake signals of at least reasonable taste, and you could fall back on those. The problem is that the people have terrible taste, really no good, very bad taste, as confirmed every time we do a comparison that says GPT-4.5 is preferred over Emily Dickinson and Walt Whitman or what not. Are you actually going to maximize for ‘elite taste’ over the terrible taste of users, and do so sufficiently robustly to overcome all your other forms of feedback? I don’t know that you could, or if you could that you would even want to.
Note that I see why Andy sees a conflict below, but there is no contradiction here as per the counterargument.
There are many such human cases of this, as well.
On Your Marks
Claude Opus 4.1 joins the METR graph, 30% beyond Opus 4 and in second place behind GPT-5, although within margin of error.
GPT-OSS-120b ran into a lot of setup issues. In the comments, Havard clarifies that he was previously attempting to use OpenRouter, but his attempts to specify high thinking were failing silently. So it’s plausible that most evaluations and tests of the model were not tried at high reasoning, despite that still being very cheap to run?
This is a real and important constraint on actually using them, if those doing evaluations get it wrong then would-be users will get it wrong too. The ecosystem needs to make this easier. But when you get it right, it turns out maybe GPT-OSS-120 is kind of good in at least some ways?
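If the silent-failure story is right, the fix is to request the reasoning effort explicitly and then verify it took effect rather than trusting the flag. A sketch against OpenRouter's chat completions endpoint; exact parameter support varies by provider, so treat this as illustrative:

```python
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer <OPENROUTER_API_KEY>"},
    json={
        "model": "openai/gpt-oss-120b",
        "reasoning": {"effort": "high"},  # the setting that was reportedly ignored
        "messages": [{"role": "user", "content": "..."}],
    },
    timeout=120,
)
resp.raise_for_status()
message = resp.json()["choices"][0]["message"]
# Defensive check: confirm reasoning actually happened instead of assuming it did.
print(message.get("reasoning") or "(no reasoning trace returned)")
print(message["content"])
```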
That’s a rather large jump in the blue line there for GPT-OSS-120B.
Werewolf Benchmark pits the models against each other for simplified games of Werewolf, with 2 werewolves and 4 villagers, a witch and a seer.
The best models consistently win. These were the seven models extensively tested (so Claude wasn’t involved, presumably due to cost):
GPT-5 gets top marks for flash-fiction style and diversity, including being the only model to sometimes use present tense, in a new test from Lech Mazur. There’s lots more detail in the thread.
Pliny experimented with Grok-Code-Fast in Cursor, since it was briefly free. Many exploit scripts and other ‘fun’ stuff resulted quickly. I presume the same would have happened with the usual suspects.
A new math benchmark looks at questions that stump at least one active model. GPT-5 leads with 43%, then DeepSeek v3.1 and Grok 4 (!) with 34%. Gemini 2.5 Pro is at 29% and Opus 4.1 only scores 15%.
Choose Your Fighter
If you use Gemini for something other than images, a reminder to always use it in AI Studio, never in the Gemini app, if you need high performance. Quality in AI Studio is much higher.
If you use GPT-5, of course, only use the router if you need very basic stuff.
I do have some narrow use cases where I’ve found GPT-5-Auto is the right tool.
An ode to Claude Code, called Entering the DOS Era of AI.
As in, the command line might be ugly, but it works, it gets the job done, lets you do whatever you want. This was the best case so far that I should stop stalling and actually start using Claude Code. Which I will, as soon as I catch up and have a spare moment. And this time, I mean it.
Oh, we can get Codex or Claude Code to delete code, up to and including all your code, including without you asking them to do it. But yes, something that does more intelligent cleanup would be great.
Anthropic’s pricing and limits got you down? GLM offers a coding plan for Claude Code, cheap at $3/month for 3x the usage of Claude Pro or $15/month for 3x the usage of Claude Max.
I give Z.ai a lot of credit for calling this a 40% win rate, when I’d call it 44% given the 9.6% rate of ties. It makes me trust their results a lot more, including the similar size win against DeepSeek v3.1.
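(The arithmetic: with 9.6% ties, the decisive results split 40% wins to 50.4% losses, so among decisive comparisons the win rate is 40 / (40 + 50.4) ≈ 44%.)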
It still is not a great result. Pairwise evaluations tend to be noisy, and Opus 4.1 is substantially ahead of Opus 4 on agentic coding, which in turn is ahead of Sonnet 4.
In general, my advice is to pay up for the best coding tools for your purposes, whichever tools you believe they are, given the value of better coding. Right now that means either Claude or GPT-5, or possibly Gemini 2.5 Pro. But yeah, if you were previously spending hundreds a month, for some people those savings matter.
a16z’s Olivia Moore and Daisy Zhao offer the 5th edition of their report on the Top 100 GenAI consumer apps.
Notice how many involve companions or ‘spicy’ chat.
My guess is that a lot of why NSFW is doing relatively well is that the threshold for ‘good enough’ in NSFW is a lot lower than the threshold in many other places. Think of this as similar to the way that porn plots are much lower intelligence than non-porn plots. Thus, if you’re offering a free app, you have a better shot with NSFW.
You know it’s hard to keep up when I look at these lists and out of 100 items listed (since apps and web are distinct) there are 23 web products and 35 apps that I do not recognize enough to know what they are, although about half of them are pretty obvious from their names.
Gemini is growing fast, although AI Studio, Notebook and Labs seem stagnant recently.
Some other highlights:
Fun With Media Generation
Justine Moore gives us a presentation on the state of play for AI creative tools. Nothing surprising but details are always good.
Things are constantly changing, so if you’re actually creating you’ll want to try a wide variety of tools and compare results, pretty much no matter what you’re trying to do.
Deepfaketown and Botpocalypse Soon
How accurate are AI writing detectors? Brian Jabarian and Alex Imas put four to the test. RoBERTa tested as useless, but Pangram, Originality and GPTZero all had low false positive rates (under 2.5% across the board, usually under 1%) on pre-LLM text passages, at settings that also had acceptable false negative rates on straightforward LLM outputs across GPT-4.1, Claude Opus 4, Claude Sonnet 4 and Gemini 2.0 Flash. Pangram especially impressed, including on small snippets, whereas GPTZero and Originality collapsed without enough context.
I’d want to see this replicated, but this suggests that non-adversarial AI writing detection is a solved problem. If no one is trying to hide that the AI text is AI text, and text is known to be either fully human or fully AI, you can very reliably detect which text is and is not AI.
Brian also claims that ‘humanizers’ like StealthGPT do not fool Pangram. So if you want to mask your AI writing, you’re going to have to do more work, which plausibly means there isn’t a problem anymore.
Honglin Bao tried GPTZero and ZeroGPT and reports here that, when tested on texts where humans disclosed AI use, those detectors failed.
It would not be that surprising, these days, if it turned out that the reason everyone thinks AI detectors don’t work is that all the popular ones don’t work but others do. But again, I wouldn’t trust this without verification.
How bad is it over at LinkedIn? I hear it’s pretty bad?
If someone is tempted to use AI on a high stakes document, consider that something likely went horribly wrong prior to AI becoming involved.
Don’t Be Evil
Seriously. Don’t be evil. I don’t care if there’s great money in evil. I don’t care if your failing to do evil means someone else will do evil instead. Don’t. Be. Evil.
They Took Our Jobs
Thus, this presumably does represent a net decline in jobs versus expected baseline, although one must beware selection and survival effects on the corporations.
This seems to be the pattern. There are not yet many firings, but there are sometimes fewer hirings. The identification process here seems incomplete but robust to false positives. The school pattern might be another hint as to what is happening.
Before that second study came out, Noah Smith responded to the new findings on AI and jobs that I discussed last week. As one would predict, while he has great respect for author Erik Brynjolfsson, he is skeptical that this means jobs are being lost in a way that matters.
No, this doesn’t show that AI is ‘an existential threat to human labor’ via this sort of job taking. I do think AI poses an existential threat to human labor, but more as a side effect of the way it poses an existential threat to humans, which would also threaten their labor and jobs, and I agree that this result doesn’t tell us much about that. As for the scenarios where the problems remain confined to job losses, this is only a canary at most, and as always the fact that some jobs get automated does not mean jobs are on net lost, let alone that the issue will scale to ‘existential threat to human labor.’
It does once again point to the distinction between those who correctly treat current AI impacts as a floor (it is the worst and least impactful AI will ever be), versus those who think of current AI capabilities as close to a maximum, so that the question is whether this current effect would devastate the job market. Which it probably wouldn’t?
How should we reconcile the results of robust employment and wages at age 30+ with much less hiring at entry-level? I would suggest a combination of:
Notice the parallel to radiologists. Not only has demand not fallen yet, but for now pay there is very high, exactly because future demand is anticipated to be lower, and thus fewer doctors choose radiology. You need to pay a premium to attract talent and compensate for the lack of long term prospects.
Thus yes, I do think this is roughly what you would expect to see if ‘the market is pricing in’ lower future employment in these fields. Which, again, might not mean fewer total jobs.
Context switching is a superpower if you can get good at it, which introduces new maximization problems.
Not all context switching is created equal. Switching into a chess game is a different move than switching into another coding task. If you can unify the coding modes that could be even better, but by default (at least in my model of my own switching?) there’s a kind of task loading here where you can only have one ‘complex cognitive productive’ style thing going on at once. Switching into Twitter or Chess doesn’t disrupt it the same way. Also, doing the other task helps you mentally in various ways that trying to double task coding would very much not help.
Still, yes, multi-Clauding will always be the dream, if you can pull it off. And if you don’t net gain productivity but do get to do a bunch of other little tasks, that still counts (to me, anyway) as a massive win.
Radiologists are not yet going away, and AIs are not perfect, but AIs are already less imperfect than doctors at a wide range of tasks, in a ‘will kill the patient less often’ type of way. With access to 5-Level models, failure to consult them in any case where you are even a little uncertain is malpractice. Not in a legal sense, not yet, but in a ‘do right by the patient’ sense.
Is there a counterargument that using AI the wrong ways could lead to ‘deskilling’?
Imagine seeing this except instead of AI they were talking about, I dunno, penicillin. This is the calculator argument. Yeah, I can see how giving doctors AI and then taking it away could be an issue at least for some adjustment period, although I notice I am highly skeptical of the funding, but how about you don’t take it away?
A second finding Rohan cites (hence the ‘another’ above) is that if you change MedQA questions to make pattern matching harder, model performance slips. Well yeah, of course it does, human performance would slip too. The question is how much, and what that implies about real cases.
The reasoning models held up relatively well (they don’t respect us enough to say which models are which but their wording implies this). In any case, I’m not worried, and the whole ‘they aren’t really reasoning’ thing we see downthread is always a sign someone doesn’t understand what they are dealing with.
Meanwhile AI is being used in a Medicare pilot program to determine whether patients should be covered for some procedures like spine surgeries or steroid injections. This is of course phrased as ‘Medicare will start denying patients life-saving procedures using private A.I. companies’ the same way we used to talk about ‘death panels.’ There is a limited budget with which to provide health care, so the question is whether these are better decisions or not.
School Daze
Many people are saying. Are they talking sense?
My position has long been:
Which means the question is, which will students choose? Are you providing them with reason to want to learn?
Here’s a plausible hypothesis: to use LLMs to learn, you need to establish basic skills first, or else you end up using them to not learn instead.
You definitely need some level of basic skills. If you can’t read and write, and you’re not using LLMs in modes designed explicitly to teach you those basic skills, you’re going to have a problem.
This is like a lot of other learning and tasks, both in and out of school. In order to use an opportunity to learn, LLM or otherwise, you need to be keeping up with the material so you can follow it, and then choose to follow it. If you fall sufficiently behind or don’t pay attention, you might be able to fake it (or cheat on the exams) and pass. But you won’t be learning, not really.
So it isn’t crazy that there could be a breakpoint around age 16 or so for the average student, where you learn enough skills that you can go down the path of using AI to learn further, whereas relying on the LLMs before that gets the average student into trouble. This could be fixed by improving LLM interactions, and new features from Google and OpenAI are plausibly offering this if students can be convinced to use them.
I am still skeptical that this is a real phenomenon. We do not yet have, to my knowledge, any graphs that show this discontinuity as expressed in skills and test scores, either over time or between cohorts. We should be actively looking and testing for it, and be prepared to respond if it happens, but the response needs to focus on ‘rethink the way schools work’ rather than ‘try in vain to ban LLMs,’ which would only backfire.
The Art of the Jailbreak
Pliny points us to the beloved prompt injection game Gandalf, including new levels that just dropped.
Overcoming Bias
A study from the American Enterprise Institute found that top LLMs (OpenAI, Google, Anthropic, xAI and DeepSeek) consistently rate think tanks better the closer they are to center-left on the American political spectrum. This is consistent with prior work and comes as no surprise whatsoever. It is a question of magnitude only.
This is how they present the findings:
Sentiment analysis has what seems like a bigger gap than the ultimate ratings.
Note that the gaps reported here are center-left versus right, not left versus right, which would be smaller, as there is as much ‘center over extreme’ preference here as there is for left over right. It also jumps out that there are similar gaps across all three metrics, and we see similar patterns on every subcategory:
When you go institution by institution, you see large correlations between ratings on the three metrics, and you see that the ratings do seem to largely go in the order (USA Center-Left > USA Center-Right > USA Left > USA Right).
I’m not familiar enough with most of the think tanks to offer a useful opinion, with two exceptions.
My model of how funding works for think tanks is that support comes from ideologically aligned sources, and citations are mostly motivated by politics. If LLMs consistently rate right wing think tanks poorly, it is not clear this changes decisions that much, whether or not it is justified? I do see other obvious downsides to being consistently rated poorly, of course.
Or:
Clearly, the job of the think tanks is to correct these grievous errors? Their full recommendation here is somewhat better.
I have no doubt that the baseline findings here are correct. To what extent are they the result of ‘bias’ versus reflecting real gaps? It seems likely, at minimum, that more ‘central’ think tanks are a lot better on these metrics than more ‘extreme’ ones.
What about the recommendations they offer?
Some of these are constructive steps, but I have another idea? One could treat this evaluation, that they lack morality, research quality and objectivity, as pointing to real problems, and work to fix them? Perhaps they are not errors, or are only partly the result of bias, especially if you are not highly ranked within your own ideological sector.
Get Involved
MATS 9.0 applications to be an ML Alignment or Theory Scholar are open, including for nontechnical policy and government; apply by October 2. It will run January 5 to March 28, 2026. This seems like an excellent opportunity for those in the right spot.
Jennifer Chen, who works for me on Balsa Research, asks me to pass along that Canada’s only AI policy advocacy organization, AI Governance and Safety Canada (AIGS), needs additional funding from residents or citizens of Canada (for political reasons it can’t accept money from anyone else, and you can’t deduct donations) to survive, and it needs $6k CAD per month to sustain itself. Here’s what she has to say:
If you’re in or from Canada, and you want to see Carney push for international AGI governance, you might have a unique opportunity (I haven’t had the chance to investigate myself). Consider investigating further and potentially contributing here. For large sums, please email contact@aigs.ca.
Anthropic is hosting the Anthropic Futures Forum in Washington DC on September 15, 9:30-2:00 EST. I have another engagement that day but would otherwise be considering attending. Seems great if you are already in the DC area and would qualify to attend.
Introducing
Grok Code Fast 1, available in many places or $0.20/$1.50 on the API. They offer a guide here which seems mostly similar to what you’d do with any other AI coder.
InstaLILY, powered by Gemini, an agentic enterprise search engine, for tasks like matching PartsTown technicians with highly specific parts. The engine is built on synthetic data generation and student model training. Another example cited is Wolf Games using it to generate daily narrative content, which is conceptually cool but does not make me want to play any Wolf Games products.
The Brave privacy-focused browser offers us Leo, the smart AI assistant built right in. Pliny respected it enough to jailbreak it via a webpage and provide its system instructions. Pliny reports the integration is awesome, but warns of course that this is a double edged sword given what can happen if you browse. Leo is based on Llama 3.1 8B, so this is a highly underpowered model. That can still be fine for many web related tasks, as long as you don’t expect it to be smart.
To state the obvious, Leo might be cool, but it is wide open to hackers. Do not use Leo while your browser has access to anything you would care about getting hacked. So no passwords of value, absolutely no crypto or bank accounts or emails, and so on. It is one thing to take calculated risks with Claude for Chrome once you have access, but with something like Leo I would take almost zero risk.
Unprompted Attention
OpenAI released a Realtime Prompting Guide. Carlos Perez looked into some of its suggestions, starting with ‘before any call, speak neutral filler, then call’ to avoid ‘awkward silence during tool calls.’ Um, no, thanks? Other suggestions seem better, such as being explicit about where to definitely ask or not ask for confirmation, or when to use or not use a given tool, what thresholds to use for various purposes, offering templates, and only responding to ‘clear audio’ and asking for clarification. They suggest capitalization for must-follow rules; this rudeness is increasingly an official aspect of our new programming language.
Rob Wiblin shares his anti-sycophancy prompt.
In Other AI News
The Time 100 AI 2025 list is out, including Pliny the Liberator. The list has plenty of good picks, it would be very hard to avoid this, but it also has some obvious holes. How can I take such a list seriously if it doesn’t include Demis Hassabis?
Google will not be forced to do anything crazy like divest Chrome or Android, the court rightfully calling it overreach to have even asked. Nor will Google be barred from paying for Chrome to get top placement, so long as users can switch, as the court realized that this mainly devastates those currently getting payments. For their supposed antitrust violations, Google will also be forced to turn over certain tailored search index and user-interaction data, but not ads data, to competitors. I am very happy with the number of times the court replied to requests with ‘that has nothing to do with anything involved in this case, so no.’
As in, Trump wants TSMC, a Taiwanese company that is not American, to pay 15% profit-sharing on AI chips sold to China, which is also not America, but is otherwise fine with continuing to let China buy the chips. This is our official policy, folks.
METR and Factory AI are hosting a Man vs. Machine hackathon competition, where those with AI tools face off against those without, in person in SF on September 6. Prize and credits from OpenAI, Anthropic and Raindrop. Manifold market here.
Searches for Cursor, Claude Code, Lovable, Replit and Windsurf all down a lot (44%-78%) since July and August. Claude Code and Cursor are now about equal here. Usage for these tools continues to climb, so perhaps this is a saturation as everyone inclined to use such a tool now already knows about them? Could it be cyclic? Dunno.
I do know this isn’t about people not wanting the tools.
A promising report, but beware the source’s propensity to hype:
It is early days, but this kind of thing does seem to be showing promise.
Show Me the Money
Anthropic finalizes its raise of $13 billion at a $183 billion post-money valuation. They note they started 2025 at $1 billion in run-rate revenue and passed $5 billion just eight months later, over 10% of which is from Claude Code which grew 10x in three months.
These are the same people shouting from the rooftops that AGI is coming soon, and coming for many jobs soon, with timelines that others claim are highly unrealistic. So let this be a reminder: All of Anthropic’s revenue projections that everyone said were too optimistic to take seriously? Yeah, they’re doing actively better than that. Maybe they know what they’re talking about?
Meta’s new chief scientist Shengjia Zhao, co-creator of OpenAI’s ChatGPT, got the promotion in part by threatening to go back to OpenAI days after joining Meta, and even signing the employment paperwork to do so. That’s in addition to the prominent people who have already left. FT provides more on tensions within Meta and so does Charles Rollet at Business Insider. This doesn’t have to mean Zuckerberg did anything wrong, as bringing in lots of new expensive talent quickly will inevitably spark such fights.
Meta makes a wise decision that I actually do think is bullish:
Let’s face it, Meta’s AIs are not good. OpenAI and Google (and Anthropic, among others) make better ones. Until that changes, why not license the better tech? Yes, I know, they want to own their own stack here, but have you considered the piles? Better models means selling more ads. Selling more ads means bigger piles. Much bigger piles. Of money.
If Meta manages to make a good model in the future, they can switch back. There’s no lock-in here, as I keep saying.
The most valuable companies in the world? AI, AI everywhere.
The part about Aramco is too cute by half but the point stands.