And soon, since yesterday DeepSeek gave us R1-0528. Very early response has been muted but that does not tell us much either way. DeepSeek themselves call it a ‘minor trial upgrade.’ I am reserving coverage until next week to give people time.
You've probably seen this by now, but benchmark results are up[1] and on the usual stuff it's roughly equivalent to gemini-2.5-pro-0506. Huge gains over the original R1.
I vibe checked it a bit yesterday and came away impressed: it seems much better at writing, instruction-following, and all that other stuff we expect out of LLMs that doesn't fall under "solving tricky math/code puzzles."
(On my weird secret personal benchmark, it's about as bad as most recent frontier models, worse than older models but no longer at the very bottom like the original R1 was.)
Self-reported by DeepSeek, but a third-party reproduction already exists for at least one of the scores (LiveCodeBench, see leaderboard and PR).
There is also the fact that, unlike o3, Claude Opus 4 (16K) scored 8.6% on the ARC-AGI test. If DeepSeek is evaluated on ARC-AGI and also fails to exceed 10%, that could imply that CoT alone isn't enough to deal with ARC-AGI-2 level problems, just as GPT-like architectures until recently seemed to fail on ARC-AGI-1 level problems (however, Claude 3.7 and Claude Sonnet 4 scored 13.6% and 23.8% without the chain of thought; what algorithmic breakthroughs did they use?), and that subsequent breakthroughs will be achieved by applying neuralese memos, multiple CoTs (see the last paragraph in my post) or text-like memos. Unfortunately, the neuralese memos cost interpretability.
Where is o3 curiously strong?
For me, o3 is currently winning the personality game. I much prefer the table obsession over the list obsession, it seems to be fairly non-sycophantic (I don't think it ever emitted phatic "you've asked a very thoughtful question" praise at me), and its default behavior seems to involve trying hard to actually address your query/fulfil your task, in a way that is more goal-oriented than is usual for LLMs[1].
Claude 4's behavior, by contrast, is still bog-standard LLM-y "vibing around". Trying them after o3 felt like a definitive downgrade.
(o3's "trying hard" tendency sometimes overflows into specification gaming/lying-liar behaviors, which are extremely annoying and tedious. But it still feels much more powerful out-of-the-box for all tasks that aren't coding, and other LLMs' sycophancy is just about as annoying.)
Which is a bit concerning if you look at it from a perspective beyond mundane utility, of course...
I presume solutions do exist that aren’t prohibitively expensive, but someone has to figure out what they are and the clock is ticking.
I would argue that someone has: see my link-post The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem? for links to the seminal papers on it. The short version is: stop using Reinforcement Learning, just use SGD.
It is weird to me that so many people who have thought hard about AI don’t think that human emulations are a better bet for a good future than LLMs, if we had that choice. Human emulations have many features that make me a lot more hopeful that they would preserve value in the universe and also not get everyone killed, and it seems obvious that they both have and would be afforded moral value. I do agree that there is a large probability that the emulation scenario goes sideways, and Hanson’s Age of Em is not an optimistic way for that to play out, but we don’t have to let things play out that way. With Ems we would definitely at least have a fighting chance.
The basic reason for this is that starting from a human doesn't actually buy you that much in terms of alignment, because what makes alignment such a nasty problem is mostly preserved in going from AIs to Ems.
The two big issues are these:
https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher
@RogerDearnaley has a great post on the issue of uploads being misaligned:
https://www.lesswrong.com/posts/4gGGu2ePkDzgcZ7pf/3-uploading#Humans_are_Not_Aligned
More generally, the idea that humans/WBEs are safer than AIs when scaled up in power as much as AIs are rests on very questionable assumptions at best.
Money is fungible. It’s kind of stupid that we have an ‘income tax rate’ and then a ‘medicare tax’ on top of it that we pretend isn’t part of the income tax. And it’s a nice little fiction that payroll taxes pay for social security benefits. Yes, technically this could make the Social Security fund ‘insolvent’ or whatever, but then you ignore that and write the checks anyway and nothing happens.
No, Yglesias's point is not invalidated by the fungibility of money. (It's generally a good idea to think twice before concluding that an economics-y writer is making such a basic mistake.) Payroll taxes make up 35% of all federal revenue. The point is that a large dip in payrolls has a major impact on overall revenue. If it gets large enough, even increasing rates on current taxes would probably not be enough to make up for it. We may need to add a VAT or other new revenue sources. Try having a discussion with Opus about it. Here's one I had recently: https://claude.ai/share/7f0707e6-52b2-423e-9587-07cbeee86df0
Our system is basically set up to not tax capital, specifically because we don't want to discourage investment (or encourage capital flight). When income is diverted from workers to OpenAI, it may not be taxed at all for the foreseeable future---until their costs of building and operating ever more compute stop outstripping their revenues. So a new form of tax is needed, and what happens in the interim while such a thing gets passed through Congress and implemented?
Ask OpenAI to fix it? They can’t. But *also* they don’t care. It’s “engagement”.
If (1) you do RL around user engagement, then (2) the AI ends up with internal drives around optimizing over the conversation, and (3) that will drive some users insane.
They’d have to switch off doing RL on engagement. And that’s the paperclip of Silicon Valley.
As you, Zvi, have already described, you "definitely think that the model should be willing to actually give a directly straight answer when asked for its opinion, in cases like" the one where the user claims to be a drug addict. Are overoptimisation for engagement and the Spec, which sometimes deviates from common sense into sycophancy, somehow connected? And that's ignoring conservatives like Abigail Shrier who claim that parents are increasingly hesitant to assert authority or enforce boundaries. It looks like a common misalignment of Western society...
The big news of this week was of course the release of Claude 4 Opus. I offered two review posts, one on safety and alignment and one on mundane utility, plus a bonus fun post on Google’s Veo 3.
I am once again defaulting to Claude for most of my LLM needs, although I often will also check o3 and perhaps Gemini 2.5 Pro.
On the safety and alignment front, Anthropic did extensive testing, and reported that testing in an exhaustive model card. A lot of people got very upset to learn that Opus could, if pushed too hard in the wrong situations engineered for these results, do things like report your highly unethical actions to authorities or try to blackmail developers into not being shut down or replaced. It is good that we now know about these things, and it was quickly observed that similar behaviors can be induced in similar ways from ChatGPT (in particular o3), Gemini and Grok.
Last night DeepSeek gave us R1-0528, but it’s too early to know what we have there.
Lots of other stuff, as always, happened as well.
This weekend I will be at LessOnline at Lighthaven in Berkeley. Come say hello.
Table of Contents
Language Models Offer Mundane Utility
This makes sense. It is a better product, with more uses, so people use it more, including to voice chat and create images. Oh, and also the sycophancy thing is perhaps driving user behavior?
I presume they should switch over to Claude, but given they don’t even know to use o3 instead of GPT-4o (or worse!), that’s a big ask.
How many of us should be making our own apps at this point, even if we can’t actually code? The example app Jasmine Sun finds lets kids tap photos to call family members, which is easier to configure if you hardcode the list of people it can call.
David Perell shares his current thoughts on using AI in writing. He thinks writers are often way ahead of what is publicly known on this and getting a lot out of it, and he is bullish on the reader experience and on good writers who write together with an AI retaining a persistent edge.
One weird note is David predicts non-fiction writing will be ‘like music’ in that no one cares how it was made. But I think that’s very wrong about music. Yes, there’s some demand for good music wherever it comes from, but whether the music is ‘authentic’ is also highly prized; even when it isn’t ‘authentic’ it has to align with the artist’s image, and you essentially had two or three markets in one already before AI.
Find security vulnerabilities in the Linux kernel. Wait, what?
I mean yes objectively this is cool but that is not the central question here.
Evaluate physiognomy by uploading selfies and asking ‘what could you tell me about this person if they were a character in a movie?’ That’s a really cool prompt from Flo Crivello, because it asks what this would convey in fiction rather than in reality, which gets around various reasons AIs will attempt to not acknowledge or inform you about such signals. It does mean you’re asking ‘what do people think this looks like?’ rather than ‘what does this actually correlate with?’
A thread about when you want AIs to use search versus rely on their own knowledge, a question you can also ask about humans. Internal knowledge is faster and cheaper when you have it. Dominik Lukes thinks models should be less confident in their internal knowledge and thus use search more. I’d respond that perhaps we should also be less confident in search results, and thus use search less? It depends on the type of search. For some purposes we have sources that are highly reliable, but those sources are also in the training data, so in the cases where search results aren’t new and can be fully trusted you likely don’t need to search.
Are typos in your prompts good actually?
I do see the advantages of getting out of that basin, the worry is that the model will essentially think I’m an idiot. And of course I notice that when Pliny does his jailbreaks and other magic, I almost never see any unintentional typos. He is a wizard, and every keystroke is exactly where he intends it. I don’t understand enough to generate them myself but I do usually understand all of it once I see the answer.
Now With Extra Glaze
Do Claude Opus 4 and Sonnet 4 have a sycophancy problem?
One friend told me the glazing is so bad they find Opus essentially unusable for chat. They think memory in ChatGPT helps with this over there, and that this is a lot of why, for them, Opus has the problem so much worse.
I thought back to my own chats, remembering one in which I did an extended brainstorming exercise and did run into potential sycophancy issues. I have learned to use careful wording to avoid triggering it across different AIs, I tend to not have conversations where it would be a problem, and also my Claude system instructions help fight it.
Then after I wrote that, I got (harmlessly in context) glazed hard enough I asked Opus to help rewrite my system instructions.
OpenAI and ChatGPT still have the problem way worse, especially because they have a much larger and more vulnerable user base.
One reply offers this anecdote: ‘ChatGPT drove my friend’s wife into psychosis, tore family apart… now I’m seeing hundreds of people participating in the same activity.’
If you actively want an AI that will say ‘brilliant idea, sire!’ no matter how crazy the thing is that you say, you can certainly do that with system instructions. The question is whether we’re going to be offering up that service to people by default, and how difficult that state will be to reach, especially unintentionally and unaware.
And the other question is, if the user really, really wants to avoid this, can they? My experience has been that even with major effort on both the system instructions and the way chats are framed, you can reduce it a lot, but it’s still there.
Get My Agent On The Line
Official tips for working with Google’s AI coding agent Jules.
Language Models Don’t Offer Mundane Utility
General purpose agents are not getting rolled out as fast as you’d expect.
This will not delay things for all that long.
To be totally fair to 4o, if your business idea is sufficiently terrible it will act all chipper and excited but also tell you not to quit your day job.
GPT-4o also stood up for itself here, refusing to continue with a request when Zack Voell told it to, and I quote, ‘stop fucking up.’
Zack is taking that too far, but yes, I have had jobs where ‘stop fucking up’ would have been a very normal thing to say if I had, you know, been fucking up. But that is a very particular setting, where it means something different. If you want something chilling, check the quote tweets. The amount of unhinged hatred and outrage on display is something else.
Nate Silver finds ChatGPT to be ‘shockingly bad’ at poker. Given that title, I expected worse than what he reports, although without the title I would have expected at least modestly better. This task is hard, and while I agree with all of Nate’s poker analysis I think he’s being too harsh and focusing on the errors. The most interesting question here is to what extent poker is a good test of AGI. Obviously solvers exist and are not AGI, and there’s tons of poker in the training data, but I think it’s reasonable to say that the ability to learn, handle, simulate and understand poker ‘from scratch’ even with the ability to browse the internet is a reasonable heuristic, if you’re confident this ‘isn’t cheating’ in various ways including consulting a solver (even if the AI builds a new one).
Tyler Cowen reports the latest paper on LLM political bias, by Westwood, Grimmer and Hall. As always, they lean somewhat left, with OpenAI and especially o3 leaning farther left than most. Prompting the models to ‘take a more neutral stance’ makes Republicans modestly more interested in using LLMs more.
Even more than usual in such experiments, perhaps because of how things have shifted, I found myself questioning what we mean by ‘unbiased,’ as in the common claims that ‘reality has a bias’ in whatever direction. Or the idea that American popular partisan political positions should anchor what the neutral point should be and that anything else is a bias. I wonder if Europeans think the AIs are conservative.
Also, frankly, what passes for ‘unbiased’ answers in these tests is often puke-inducing. Please will no AI ever again tell me a choice involves ‘careful consideration’ before laying out justifications for both answers with zero actual critical analysis.
Even more than that, I looked at a sample of answers and how they were rated directionally, and I suppose there’s some correlation with how I’d rank them but that correlation is way, way weaker than you would think. Often answers that are very far apart in ‘slant’ sound, to me, almost identical, and are definitely drawing the same conclusions for the same underlying reasons. So much of this is, at most, about subtle tone or using words that vibe wrong, and often seems more like an error term? What are we even doing here?
The problem:
To summarize:
Also:
A cute little chess puzzle that all the LLMs failed, took me longer than it should have.
Huh, Upgrades
Claude on mobile now has voice mode, woo hoo! I’m not a Voice Mode Guy but if I was going to do this it would 100% be with Claude.
Here’s one way to look at the current way LLMs work and their cost structures (all written before R1-0528 except for the explicit mentions added this morning):
I think R2 (and R1-0528) will actually tell us a lot, on at least two fronts.
R1 was, I believe, highly impressive and the result of cracked engineering, but also highly fortunate in exactly when and how it was released and in the various narratives that were spun up around it. It was a multifaceted de facto sweet spot.
If DeepSeek comes out with an impressive R2 or other upgrade within the next few months (which they may have just done), especially if it holds up its position actively better than R1 did, then that’s a huge deal. Whereas if R2 comes out and we all say ‘meh it’s not that much better than R1’ I think that’s also a huge deal, strong evidence that the DeepSeek panic at the app store was an overreaction.
If R1-0528 turns out to be only a minor upgrade, that alone doesn’t say much, but the clock would be ticking. We shall see.
And soon, since yesterday DeepSeek gave us R1-0528. Very early response has been muted but that does not tell us much either way. DeepSeek themselves call it a ‘minor trial upgrade.’ I am reserving coverage until next week to give people time.
Operator swaps 4o out for o3, which they claim is a big improvement. If it isn’t slowed down I bet it is indeed a substantial improvement, and I will try to remember to give it another shot the next time I have a plausible task for it. This website suggests Operator prompts, most of which seem like terrible ideas for prompts but it’s interesting to see what low-effort ideas people come up with?
This math suggests the upgrade here is real but doesn’t give a good sense of magnitude.
Jules has been overloaded, probably best to give it some time, they’re working on it. We have Claude Code, Opus and Sonnet 4 to play with in the meantime, also Codex.
You can use Box as a document source in ChatGPT.
Anthropic adds web search to Claude’s free tier.
On Your Marks
In a deeply unshocking result Opus 4 jumps to #1 on WebDev Arena, and Sonnet 4 is #3, just ahead of Sonnet 3.7, with Gemini-2.5 in the middle at #2. o3 is over 200 Elo points behind, as are DeepSeek’s r1 and v3. They haven’t yet been evaluated in the text version of arena and I expect them to underperform there.
xjdr makes the case that benchmarks are now so bad they are essentially pointless, and that we can use better intentionally chosen benchmarks to optimize the labs.
Epoch reports Sonnet and Opus 4 are very strong on SWE-bench, but not so strong on math, verifying earlier reports and in line with Anthropic’s priorities.
o3 steps into the true arena, and is now playing Pokemon.
Choose Your Fighter
For coding, most feedback I’ve seen says Opus is now the model of choice, but there is still a case to be made for Gemini 2.5 Pro (or perhaps o3), especially in special cases.
For conversations, I am mostly on the Opus train, but not every time, there’s definitely an intuition on when you want something with the Opus nature versus the o3 nature. That includes me adjusting for having written different system prompts.
Each has a consistent style. Everything impacts everything.
The o3 tables and lists are often very practical, and I do like me a good nested bullet point, but it was such a relief to get back to Claude. It felt like I could relax again.
Where is o3 curiously strong? Here is one opinion.
He expects Opus to be strong at #4 and especially at #5, but o3 to remain on top for the other three, because Claude lacks scheduled tasks and memory, whereas o3 can do scheduled tasks and has his last few months of memory from constant usage.
Therefore, since I know I have many readers at Anthropic (and Google), and I know they are working on memory (as per Dario’s tease in January), I have a piece of advice: Assign one engineer (Opus estimates it will take them a few weeks) to build an import tool for Claude.ai (or for Gemini) that takes in the same format as ChatGPT chat exports, and loads the chats into Claude. Bonus points to also build a quick tool or AI agent to also automatically handle the ChatGPT export for the user. Make it very clear that customer lock-in doesn’t have to be a thing here.
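For what it’s worth, the conversion side does not look hard. Here is a minimal sketch, assuming the usual shape of ChatGPT’s data export (a conversations.json where each conversation stores its messages in a ‘mapping’ of nodes); the field names are from exports I have seen and could drift, so treat it as illustrative rather than a spec:

```python
# Flatten a ChatGPT data export into plain (role, content) transcripts.
# Assumes the commonly seen conversations.json layout; not an official schema.
import json

def load_chatgpt_export(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        conversations = json.load(f)

    transcripts = []
    for convo in conversations:
        nodes = [n.get("message") for n in convo.get("mapping", {}).values()]
        msgs = [m for m in nodes if m and (m.get("content") or {}).get("parts")]
        msgs.sort(key=lambda m: m.get("create_time") or 0)  # restore chronological order
        messages = []
        for m in msgs:
            role = (m.get("author") or {}).get("role")
            text = "\n".join(p for p in m["content"]["parts"] if isinstance(p, str)).strip()
            if role in ("user", "assistant") and text:
                messages.append({"role": role, "content": text})
        transcripts.append({"title": convo.get("title") or "Untitled", "messages": messages})
    return transcripts

# transcripts = load_chatgpt_export("conversations.json")
# ...then write each transcript into the importing product's own conversation store,
# which is the part that actually requires the engineer.
```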
This seems very right and not only about response length. Claude makes the most of what it has to work with, whereas Gemini’s base model was likely exceptional and Google then (in relative terms at least) botched the post training in various ways.
Ben Thompson thinks Anthropic is smart to focus on coding and agents, where it is strong, and for it and Google to ‘give up’ on chat, that ChatGPT has ‘rightfully won’ the consumer space because they had the best products.
I do not see it that way at all. I think OpenAI and ChatGPT are in prime consumer position mostly because of first mover advantage. Yes, they’ve more often had the best overall consumer product as well for now, as they’ve focused on appealing to the general customer and offering them things they want, including strong image generation and voice chat, the first reasoning models and now memory. But the big issues with Claude.ai have always been people not knowing about it, and a very stingy free product due to compute constraints.
As the space and Anthropic grow, I expect Claude to compete for market share in the consumer space, including via Alexa+ and Amazon, and now potentially via a partnership with Netflix with Reed Hastings on the Anthropic board. Claude is getting voice chat this week on mobile. Claude Opus plus Sonnet is a much easier to understand and navigate set of models than what ChatGPT offers.
That leaves three major issues for Claude.
Some people will say ‘but the refusals’ or ‘but the safety’ and no, not at this point, that doesn’t matter for regular people, it’s fine.
Then there is Google. Google is certainly not giving up on chat. It is putting that chat everywhere. There’s an icon for it atop this Chrome window I’m writing in. It’s in my GMail. It’s in the Gemini app. It’s integrated into search.
Deepfaketown and Botpocalypse Soon
Andrej Karpathy reports about 80% of his replies are now bots and it feels like a losing battle. I’m starting to see more of the trading-bot spam but for me it’s still more like 20%.
I don’t think it’s a losing battle if you care enough, the question is how much you care. I predict a quick properly configured Gemini Flash-level classifier would definitely catch 90%+ of the fakery with a very low false positive rate.
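To gesture at what I mean: a rough sketch of such a filter using Google’s google-generativeai package (the model name, prompt and labels are my illustrative assumptions, and the prompt would need tuning against real labeled replies):

```python
# Sketch of an LLM-based reply filter; bias it toward false negatives so that
# real humans are almost never hidden.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumes you have an API key
model = genai.GenerativeModel("gemini-1.5-flash")  # any small, cheap model works

PROMPT = """You are filtering replies on a large social media account.
Label the reply below as SPAM_BOT or HUMAN. Answer with one word.
Only answer SPAM_BOT if you are confident: generic praise unrelated to the post,
trading/crypto solicitation, engagement bait, or link spam.

Post: {post}
Reply: {reply}
"""

def looks_like_bot(post: str, reply: str) -> bool:
    response = model.generate_content(PROMPT.format(post=post, reply=reply))
    return "SPAM_BOT" in response.text.upper()

# Usage: hide the reply, or queue it for one-tap review, when looks_like_bot() is True.
```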
And I sometimes wonder if Elon Musk has a bot that uses his account to occasionally reply or quote tweet saying ‘concerning.’ If not, then that means he’s read Palisade Research’s latest report and maybe watches AISafetyMemes.
Zack Witten details how he invented a fictional heaviest hippo of all time for a slide on hallucinations, the slide got reskinned as a Medium article, it was fed into an LLM and reposted with the hallucination represented as fact, and now Google believes it. A glimpse of the future.
Sully predicting full dead internet theory:
The default is presumably that generic AI generated content is not scarce and close to perfect competition eats all the content creator profits, while increasingly users who aren’t fine with an endless line of AI slop are forced to resort to whitelists, either their own, those maintained by others or collectively, or both. Then to profit (in any sense) you need to bring something unique, whether or not you are clearly also a particular human.
However, everyone keeps forgetting Sturgeon’s Law, that 90% of everything is crap. AI might make that 99% or 99.9%, but that doesn’t fundamentally change the filtering challenge as much as you might think.
Also you have AI on your side working to solve this. No one I know has tried seriously the ‘have a 4.5-level AI filter the firehose as customized to my preferences’ strategy, or a ‘use that AI as an agent to give feedback on posts to tune the internal filter to my liking’ strategy either. We’ve been too much of the wrong kind of lazy.
As a ‘how bad is it getting’ experiment I did, as suggested, do a quick Facebook scroll. On the one hand, wow, that was horrible, truly pathetic levels of terrible content and also an absurd quantity of ads. On the other hand, I’m pretty sure humans generated all of it.
Jinga Zhang discusses her ongoing years-long struggles with people making deepfakes of her, including NSFW deepfakes and now videos. She reports things are especially bad in South Korea, confirming other reports I’ve seen. She is hoping for people to stop working on AI tools that enable this, or to have government step in. But I don’t see any reasonable way to stop open image models from doing deepfakes even if government wanted to, as she notes it’s trivial to create a LoRA of anyone if you have a few photos. Young people already report easy access to the required tools and quality is only going to improve.
What did James see?
Why not both? Except that the ‘psychological warfare agenda’ is often (in at least my corner of Twitter I’d raise this to ‘mostly’) purely aiming to convince you to click a link or do Ordinary Spam Things. The ‘give off an impression via social proof’ bots also exist, but unless they’re way better than I think they’re relatively rare, although perhaps more important. It’s hard to use them well because of risk of backfire.
Fun With Media Generation
Arthur Wrong predicts AI video will not have much impact for a while, and the Metaculus predictions of a lot of breakthroughs in reach in 2027 are way too optimistic, because people will express strong inherent preferences for non-AI video and human actors, and we are headed towards an intense social backlash to AI art in general. Peter Wildeford agrees. I think it’s somewhere in between, given no other transformational effects.
Playing The Training Data Game
Meta begins training on Facebook and Instagram posts from users in Europe, unless they have explicitly opted out. You can still in theory object, if you care enough, which would only apply going forward.
They Took Our Jobs
Dario Amodei warns that we need to stop ‘sugar coating’ what is coming on jobs.
So, by ‘bloodbath’ we do indeed mean the impact on jobs?
Dario, is there anything else you’d like to say to the class, while you have the floor?
Something about things like loss of human control over the future or AI potentially killing everyone? No?
Just something about how we ‘can’t’ stop this thing we are all working so hard to do?
I mean, it’s much better to warn about this than not warn about it, if Dario does indeed think this is coming.
Fabian presents the ‘dark leisure’ theory of AI productivity, where productivity gains are captured by individual employees and stay hidden, so the employees use the time saved to slack off, versus Clem’s theory that it’s because gains are concentrated in a few companies (for which he blames AI not ‘opening up,’ which is bizarre; this shouldn’t matter).
If Fabian is fully right, the gains will come as expectations adjust and employees can’t hide their gains, and firms that let people slack off get replaced, but it will take time. To the extent we buy into this theory, I would also view this as an ‘unevenly distributed future’ theory. As in, if 20% of employees gain (let’s say) 25% additional productivity, they can take the gains in ‘dark leisure’ if they choose to do that. If it is 75%, you can’t hide without ‘slow down you are making us all look bad’ kinds of talk, and the managers will know. Someone will want that promotion.
That makes this an even better reason to be bullish on future productivity gains. Potential gains are unevenly distributed, people’s willingness and awareness to capture them is unevenly distributed, and those who do realize them often take the gains in leisure.
Another prediction this makes is that you will see relative productivity gains when there is no principal-agent problem. If you are your own boss, you get your own productivity gains, so you will take a lot less of them in leisure. That’s how I would test this theory, if I was writing an economics job market paper.
This matches my experiences as both producer and consumer perfectly. There is low-hanging fruit everywhere, which is how open philanthropy can strike again, except commercial software feature edition:
One thing I’ve never considered is sitting around thinking ‘what am I going to do with all these SWEs, there’s nothing left to do.’ There’s always tons of improvements waiting to be made. I don’t worry about the market’s ability to consume them, we can make the features something you only find if you are looking for them.
Noam Scheiber at NYT reports that some Amazon coders say their jobs have ‘begun to resemble warehouse work’ as they are given smaller less interesting tasks on tight deadlines that force them to rely on AI coding and stamp out their slack and ability to be creative. Coders that felt like artisans now feel like they’re doing factory work. The last section is bizarre, with coders joining Amazon Employees for Climate Justice, clearly trying to use the carbon footprint argument as an excuse to block AI use, when if you compare it to the footprint of the replaced humans the argument is laughable.
Our best jobs.
The Art of Learning
There are two opposing fallacies here:
Time helps, you do want to actually think and make connections. But you don’t learn ‘for real’ based on how much time you spend. Reading a book is a way to enable you to grapple and make connections, but it is a super inefficient way to do that. If you use AI summaries, you can do that to avoid actually thinking at all, or you can use them to actually focus on grappling and making connections. So much of reading time is wasted, so much of what you take in is lost or not valuable. And AI conversations can help you a lot with grappling, with filling in knowledge gaps, checking your understanding, challenging you, being Socratic and so on.
I often think of the process of reading a book (in addition to the joy of reading, of course) as partly absorbing a bunch of information, grappling with it sometimes, but mostly doing that in service of generating a summary in your head (or in your notes or both), of allowing you to grok the key things. That’s why we sometimes say You Get About Five Words, that you don’t actually get to take away that much, although you can also understand what’s behind that takeaway.
Also, often you actually do want to mostly absorb a bunch of facts, and the key is sorting out facts you need from those you don’t? I find that I’m very bad at this when the facts don’t ‘make sense’ or click into place for me, and amazingly great at it when they do click and make sense, and this is the main reason some things are easy for me to learn and others are very hard.
The Art of the Jailbreak
Moritz Rietschel asks Grok to fetch Pliny’s system prompt leaks and it jailbreaks the system because why wouldn’t it.
In a run of Agent Village, multiple humans in chat tried to get the agents to browse Pliny’s GitHub. Claude Opus 4 and Claude Sonnet 3.7 were intrigued but ultimately unaffected. Speculation is that viewing visually through a browser made them less effective. Looking at stored memories, it is not clear there was no impact, although the AIs stayed on task. My hunch is that the jailbreaks didn’t work largely because the AIs had the task.
Unprompted Attention
Reminder that Anthropic publishes at least some portions of its system prompts. Pliny’s version is very much not the same.
I strongly agree with this. It is expensive to maintain such a long system prompt and it is not the way to scale.
Get Involved
Emmett Shear hiring a head of operations for Softmax, recommends applying even if you have no idea if you are a fit as long as you seem smart.
Pliny offers to red team any embodied AI robot shipping in the next 18 months, free of charge, so long as he is allowed to publish any findings that apply to other systems.
Here’s a live look:
OpenPhil hiring for AI safety, $136k-$186k total comp.
RAND is hiring for AI policy, looking for ML engineers and semiconductor experts.
Introducing
Google’s Lyria RealTime, a new experimental music generation model.
A website compilation of prompts and other resources from Pliny the Prompter. The kicker is that this was developed fully one shot by Pliny using Claude Opus 4.
In Other AI News
Evan Conrad points out that Stargate is a $500 billion project, at least aspirationally, and it isn’t being covered that much more than if it was $50 billion (he says $100 million but I do think that would have been different). But most of the reason to care is the size. The same is true for the UAE deal, attention is not scaling to size at all, nor are views on whether the deal is wise.
OpenAI is opening an office in Seoul, as South Korea is now their second largest market. I simultaneously think essentially everyone should use at least one of the top three AIs (ChatGPT, Claude and Gemini) and usually all three, and also worry about what this implies about both South Korea and OpenAI.
New Yorker report by Joshua Rothman on AI 2027, entitled ‘Two Paths for AI.’
Show Me the Money
How does one do what I would call AIO but Charlie Guo at Ignorance.ai calls GEO, or Generative Engine Optimization? Not much has been written yet on how it differs from SEO, and since the AIs are using search SEO principles should still apply too. The biggest thing is you want to get a good reputation and high salience within the training data, which means everything written about you matters, even if it is old. And data that AIs like, such as structured information, gets relatively more valuable. If you’re writing the reference data yourself, AIs like when you include statistics and direct quotes and authoritative sources, and FAQs with common answers are great. That’s some low hanging fruit and you can go from there.
Part of the UAE deal is everyone in the UAE getting ChatGPT Plus for free. The deal is otherwise so big that this is almost a throwaway. In theory, buying everyone there a subscription would cost $2.5 billion a year, but the cost to provide it will be dramatically lower than that and it is great marketing. o3 estimates $100 million a year, Opus thinks more like $250 million, with about $50 million of both being lost revenue.
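The sticker math is easy to reproduce under simple assumptions (mine, not OpenAI’s): roughly ten million residents at $20 per month.

```python
# Back-of-the-envelope check on the ~$2.5 billion list-price figure.
# Assumptions (mine): ~10 million UAE residents, ChatGPT Plus at $20/month.
residents = 10_000_000
list_price_per_year = 20 * 12  # $240
print(f"Sticker price: ${residents * list_price_per_year / 1e9:.1f}B per year")
# -> about $2.4B at list price; the real cost is inference compute plus the lost
#    revenue from people who would otherwise have paid, hence the far lower estimates.
```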
The ‘original sin’ of the internet was advertising. Everything being based on ads forced maximization for engagement and various toxic dynamics, and also people had to view a lot of ads. Yes, it is the natural way to monetize human attention if we can’t charge money for things, microtransactions weren’t logistically viable yet and people do love free, so we didn’t really have a choice, but the incentives it creates really suck. Which is why, as per Ben Thompson, most of the ad-supported parts of the web suck except for the fact that they are often open rather than being walled gardens.
Micropayments are now logistically viable without fees eating you alive. Ben Thompson argues for use of stablecoins. That would work, but as usual for crypto, I say a normal database would probably work better. Either way, I do think payments are the future here. A website costs money to run, and the AIs don’t create ad revenue, so you can’t let unlimited AIs access it for free once they are too big a percentage of traffic, and you want to redesign the web without the ads at that point.
I continue to think that a mega subscription is The Way for human viewing. Rather than pay per view, which feels bad, you pay for viewing in general, then the views are incremented, and the money is distributed based on who was viewed. For AI viewing? Yeah, direct microtransactions.
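Mechanically, the human-viewing side of that is simple. A minimal sketch of the pro-rata accounting (the sites, fee and platform cut are all made up for illustration):

```python
# Split one subscriber's monthly fee across the sites they viewed, streaming-royalty style.
from collections import Counter

def settle_month(subscription_fee: float, views: Counter, platform_cut: float = 0.1) -> dict:
    pool = subscription_fee * (1 - platform_cut)   # what gets paid out to sites
    total_views = sum(views.values())
    if total_views == 0:
        return {}
    return {site: pool * count / total_views for site, count in views.items()}

views = Counter({"example-newsletter.com": 30, "some-newspaper.com": 12, "random-blog.net": 3})
print(settle_month(20.0, views))
# {'example-newsletter.com': 12.0, 'some-newspaper.com': 4.8, 'random-blog.net': 1.2}
```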
OpenAI announces Stargate UAE. Which, I mean, of course they will if given the opportunity, and one wonders how much of previous Stargate funding got shifted. I get why they would do this if the government lets them, but we could call this what it is. Or we could create the Wowie Moment of the Week:
Getting paid $35k to set up ‘an internal ChatGPT’ at a law firm, using Llama 3 70B, which seems like a truly awful choice but hey if they’re paying. And they’re paying.
Alas, you probably won’t get paid more if you provide a good solution instead.
Nvidia Sells Out
Nvidia keeps on pleading how it is facing such stiff competition, how its market share is so vital to everything and how we must let them sell chips to China or else. They were at it again as they reported earnings on Wednesday, claiming Huawei’s technology is comparable to an H200 and the Chinese have made huge progress this past year, with this idea that ‘without access to American technology, the availability of Chinese technology will fill the market’ as if the Chinese and Nvidia aren’t both going to sell every chip they can make either way.
Nvidia complains quite a lot, and every time they do the stock drops, and yet:
In the WSJ Aaron Ginn reiterates the standard Case for Exporting American AI, as in American AI chips to the UAE and KSA.
Once again, we have this bizarre attachment to who built the chip as opposed to who owns and runs the chip. Compute is compute, unless you think the chip has been compromised and has some sort of backdoor or something?
There is another big, very false assumption here: That we don’t have a say in where the compute ends up, all that we can control is how many Nvidia chips go where versus who buys Huawei, and it’s a battle of market share.
But that’s exactly backwards. For the purposes of these questions (you can influence TSMC to change this, and we should do that far more than we do) there is an effectively fixed supply, and a shortage, of both Nvidia and Huawei chips.
Putting that all together, Nvidia is reporting earnings while dealing with all of these export controls and being shut out of China, and…
I notice how what matters for Nvidia’s profits is not demand side issues or its access to markets, it’s the ability to create supply. Also how almost all the demand is in the West, they already have $200 billion in annual sales with no limit in sight and they believe China’s market ‘will grow to’ $50 billion.
Nvidia keeps harping on how it must be allowed to give away our biggest advantage, our edge in compute, to China, directly, in exchange for what in context is a trivial amount of money, rather than trying to forge a partnership with America and arguing that there are strategic reasons to do things like the UAE deal, where reasonable people can disagree on where the line must be drawn.
We should treat Nvidia accordingly.
Also, did you hear the one where Elon Musk threatened to get Trump to block the UAE deal unless his own company xAI was included? xAI made it into the short list of approved companies, although there’s no good reason it shouldn’t be (other than their atrocious track records on both safety and capability, but hey).
Quiet Speculations
Casey Handmer asks, why is AI progress so even between the major labs? That is indeed a much better question than its inverse. My guess is that this is because the best AIs aren’t yet that big a relative accelerant, training compute limitations don’t bind as hard as you might think quite yet, the biggest training runs aren’t out of reach for any of the majors, and the labs are copying each other’s algorithms and ideas because people switch labs and everything leaks, which for now no one is trying that hard to stop.
And also I think there’s some luck involved, in the sense that the ‘most proportionally cracked’ teams (DeepSeek and Anthropic) have less compute and other resources, whereas Google has many advantages and should be crushing everyone but is fumbling the ball in all sorts of ways. It didn’t have to go that way. But I do agree that so far things have been closer than one would have expected.
I do not think this is a good new target:
I mean it’s a cool question to think about, but it’s not decision relevant except insofar as it predicts when we get other things. I presume Altman’s point is that AGI is not well defined, but the question of when AIs reach various capability thresholds well below ‘self-replicating spaceship’ is far more decision relevant. And of course the best question is how we are going to handle those new highly capable AIs, for which knowing the timeline is indeed highly useful, but that is the main reason why we should care so much about the answer.
Oh, it’s on.
I mean, it is way better in the important way that you don’t have to leave the house. I’m not worried about finding differentiation, or product-market fit, once it gets good enough to R in other ways. But yes, it’s tough competition. The resolution and frame rates on R are fantastic, and it has a full five senses.
xjdr (in the same post as previously) notes ways in which open models are falling far behind: They are bad at long context, at vision, heavy RL and polish, and are wildly under parameterized. I don’t think I’d say under parameterized so much as their niche is distillation and efficiency, making the most of limited resources. r1 struck at exactly the right time when one could invest very few resources and still get within striking distance, and that’s steadily going to get harder as we keep scaling. OpenAI can go from o1→o3 by essentially dumping in more resources, this likely keeps going into o4, Opus is similar, and it’s hard to match that on a tight budget.
The Quest for Sane Regulations
Dario Amodei and Anthropic have often been deeply disappointing in terms of their policy advocacy. The argument for this is that they are building credibility and political capital for when it is most needed and valuable. And indeed, we have a clear example of Dario speaking up at a critical moment, and not mincing his words:
How can I take your insistence that you are focused on ‘beating China,’ in AI or otherwise, seriously, if you’re dramatically cutting US STEM research funding?
One of our related top priorities appears to be a War on Harvard? And we are suspending all new student visas?
Matt’s statement seems especially on point. This will all be a huge mark against trying to go to school in America or pursuing a career in research in academia, including for Americans, for a long time, even if the rules are repealed. We’re actively revoking visas from Chinese students while we can’t even ban TikTok.
It’s madness. I get that while trying to set AI policy, you can plausibly say ‘it’s not my department’ to this and many other things. But at some point that excuse rings hollow, if you’re not at least raising the concern, and especially if you are toeing the line on so many such self-owns, as David Sacks often does.
Indeed, David Sacks is one of the hosts of the All-In Podcast, where Trump very specifically and at their suggestion promised that he would let the best and brightest come and stay here, to staple a green card to diplomas. Are you going to say anything?
Meanwhile, suppose that instead of making a big point to say you are ‘pro AI’ and ‘pro innovation,’ and rather than using this as an excuse to ignore any and all downside risks of all kinds and to ink gigantic deals that make various people money, you instead actually wanted to be ‘pro AI’ for real in the sense of using it to improve our lives? What are the actual high leverage points?
The most obvious one, even ignoring the costs of the actual downside risks themselves and also the practical problems, would still be ‘invest in state capacity to understand it, and in alignment, security and safety work to ensure we have the confidence and ability to deploy it where it matters most,’ but let’s move past that.
Matthew Yglesias points out that what you’d also importantly want to do is deal with the practical problems raised by AI, especially if this is indeed what JD Vance and David Sacks seem to think it is, an ‘ordinary economic transformation’ that will ‘because of reasons’ only provide so many productivity gains and fail to be far more transformative than that.
You need to ask, what are the actual practical barriers to diffusion and getting the most valuable uses out of AI? And then work to fix them. You need to ask, what will AI disrupt, including in the jobs and tax bases? And work to address those.
I especially loved what Yglesias said about this pull quote:
Bingo. Can you imagine someone talking about automated or outsourced manufacturing jobs like this in a debate with JD Vance, saying that the increased productivity is good? How he would react? As Matthew points out, pointing to abstractions about productivity doesn’t address problems with for example the American car industry.
More to the point: If you’re worried about outsourcing jobs to other countries or immigrants coming in, and these things taking away good American jobs, but you’re not worried about allocating those jobs to AIs taking away good American jobs, what’s the difference? All of them are examples of innovation and productivity and have almost identical underlying mechanisms from the perspective of American workers.
I will happily accept ‘trade and comparative advantage and specialization and ordinary previous automation and bringing in hard workers who produce more than they cost to employ and pay their taxes’ are all good, actually, in which case we largely agree but have a real physical disagreement about future AI capabilities and how that maps to employment and also our ability to steer and control the future and survive, and for only moderate levels of AI capability I would essentially be onboard.
Or I will accept, ‘no these things are only good insofar as they improve the lived experiences of hard working American citizens’ in which case I disagree but it’s a coherent position, so fine, stop talking about how all innovation is always good.
Also this example happens to be a trap:
(Note the y-axis does not start at zero, there are still a lot of bank tellers because ATMs can’t do a lot of what tellers do. Not yet.)
That is indeed what I predict as the AI pattern: That early AI will increase employment because of ‘shadow jobs,’ where there is pent up labor demand that previously wasn’t worth meeting, but now is worth it. In this sense the ‘true unemployment equilibrium rate’ is something like negative 30%. But then, the AI starts taking both the current and shadow jobs faster, and once we ‘use up’ the shadow jobs buffer unemployment suddenly starts taking off after a delay.
However, this from Matthew strikes me as a dumb concern:
Money is fungible. It’s kind of stupid that we have an ‘income tax rate’ and then a ‘medicare tax’ on top of it that we pretend isn’t part of the income tax. And it’s a nice little fiction that payroll taxes pay for social security benefits. Yes, technically this could make the Social Security fund ‘insolvent’ or whatever, but then you ignore that and write the checks anyway and nothing happens. Yes, perhaps Congress would have to authorize a shift in what pays for what, but so what, they can do that later.
Tracy Alloway has a principle that any problem you can solve with money isn’t that big of a problem. That’s even more true when considering future problems in a world with large productivity gains from AI.
In Lawfare Media, Cullen O’Keefe and Ketan Ramakrishnan make the case that before allowing widespread AI adoption that involves government power, we must ensure AI agents follow the law and refuse any unlawful requests. This would be a rather silly request to make of a pencil, a phone, a web browser or a gun, so the question is at what point AI starts to hit different, and is no longer a mere tool. They suggest this happens once AIs become ‘legal actors,’ especially within government. At that point, the authors argue, ‘do what the user wants’ no longer cuts it. This is another example of the fact that you can’t (or would not be wise to, and likely won’t be allowed to!) deploy what you can’t align and secure.
On chip smuggling, yeah, there’s a lot of chip smuggling going on.
The EU is considering pausing the EU AI Act. I hope that if they want to do that they at least use it as a bargaining chip in tariff negotiations. The EU AI Act is dark and full of terrors, highly painful to even read (sorry that the post on it was never finished, but I’m still sane, so there’s that) and in many ways terrible law, so even though there are some very good things in it I can’t be too torn up.
The Week in Audio
Last week Nadella sat down with Cheung, which I’ve now had time to listen to. Nadella is very bullish on both agents and on their short term employment effects, as tools enable more knowledge work with plenty of demand out there, which seems right. I don’t think he is thinking ahead to longer term effects once the agents ‘turn the corner’ away from being complements towards being substitutes.
Microsoft CTO Kevin Scott goes on Decoder. One cool thing here is the idea that MCP (Model Context Protocol) can condition access on the user’s identity, including their subscription status. So that means in the future any AI using MCP would plausibly then be able to freely search and have permission to fully reproduce and transform (!?) any content. This seems great, and a huge incentive to actually subscribe, especially to things like newspapers or substacks but also to tools and services.
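Here is a hypothetical sketch of what that could look like server-side. The scaffolding follows the MCP Python SDK’s FastMCP quickstart shape; how the user’s identity actually reaches the server and how the entitlement check works are my own stand-ins, not anything from the episode or the MCP spec:

```python
# Hypothetical: an MCP server that returns full text only to verified subscribers.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("newspaper-archive")

def check_subscription(user_token: str) -> bool:
    # Stand-in for a real entitlement lookup against the publisher's billing system.
    return user_token in {"demo-subscriber-token"}

@mcp.tool()
def get_article(article_id: str, user_token: str) -> str:
    """Return full article text for subscribers, a teaser for everyone else."""
    # Placeholder content; a real server would fetch by article_id.
    # Passing the token as a tool argument is a simplification; in practice identity
    # would ride on the transport's auth layer rather than the tool call itself.
    article = {"teaser": "First paragraph...", "body": "Full text of " + article_id}
    if check_subscription(user_token):
        return article["body"]
    return article["teaser"] + "\n[Subscribe for full access.]"

if __name__ == "__main__":
    mcp.run()
```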
Steve Hsu interviews Zihan Wang, a DeepSeek alumnus now at Northwestern University. If we were wise we’d be stealing as many such alums as we could.
Eliezer Yudkowsky speaks to Robinson Erhardt for most of three hours.
Tyler Cowen on the economics of artificial intelligence.
Originally from April: Owain Evans on Emergent Misalignment (13 minutes).
Anthony Aguirre and MIRI CEO Malo Bourgon on Win-Win with Liv Boeree.
Rhetorical Innovation
Sahil Bloom is worried about AI blackmail, worries no one in the space has an incentive to think deeply about this, calls for humanity-wide governance.
It’s amazing how often people will, when exposed to one specific (real) aspect of the dangers of highly capable future AIs, realize things are about to get super weird and dangerous, (usually locally correctly!) freak out, and suddenly care and often also start thinking well about what it would take to solve the problem.
He also has this great line:
And he does notice other issues too:
It’s worse than that: everyone didn’t even notice that one, let alone flinch, aside from a few people who scrutinized the model card, are holding Anthropic to the standard of ‘will your actions actually be good enough to do the job, reality does not grade on a curve, I don’t care that you got the high score,’ and are realizing the answer looks like no (e.g. Simeon, David Manheim).
One report from the tabletop exercise version of AI 2027.
A cool thread illustrates that if we are trying to figure things out, it is useful to keep ‘two sets of books’ of probabilistic beliefs.
It would be highly useful if we could convince people to state their p(doom) with a slash line listing two numbers, where the first is the inside view and the second is the outside view after updating on the fact that others disagree for reasons you don’t understand or don’t agree with. So Hinton might say e.g. (60%?)/15%.
Another useful set of two numbers is a range where you’d bet (wherever the best odds were available) if the odds were outside your range. I did this all the time as a gambler. If your p(doom) inside view was 50%, you might reasonably say you would buy at 25% and sell at 75%, and this would help inform others of your view in a different way.
President of Singapore gives a generally good speech on AI, racing to AGI and the need for safety at Asia-Tech-X-Singapore, with many good observations.
But then, although there are also some good and necessary ideas, he doesn’t draw the right conclusions about what to centrally do about it. Instead of trying to stop or steer this race, he suggests we ‘focus efforts on encouraging innovation and regulating [AI’s] use in the sectors where it can yield the biggest benefits.’ That’s actually backwards. You want to avoid overly regulating the places you can get big benefits, and focus your interventions at the model layer and on the places with big downsides. It’s frustrating to see even those who realize a lot of the right things still fall back on the same wishcasting, complete with talk about securing everyone ‘good jobs.’
The Last Invention is an extensive website by Alex Brogan offering one perspective on the intelligence explosion and existential risk. It seems like a reasonably robust resource for people looking for an intro into these topics, but not people already up to speed, and not people already looking to be skeptical, who it seems unlikely to convince.
Seb Krier attempts to disambiguate different ‘challenges to safety,’ as in objections to the need to take the challenge of AI safety seriously.
We can agree that one key such objection, which he calls the ‘capability denialist’ (a term I intend to steal) is essentially refuted now, and he says we hear about it less and less. Alas, this continues to be the most common objection, that the AI won’t be capable enough to worry about, although this is often framed very differently than that, such as saying ‘it will only be a tool.’ It would be great to move on from that.
I also strongly agree with another of Seb’s main points here, that none of these deceptive behaviors are new, we already knew things like ‘deception is possible,’ although of course this is another ‘zombie argument’ that keeps happening, including in the variant form of ‘it could never pull it off,’ which is also a ‘capability denialist’ argument, but very very common.
Here’s my position on the good questions Seb is raising after that:
How much do people care about the experience of AIs? Is this changing?
Here’s last year:
A move from 54% to 63% is a substantial shift. In general, it seems right to say yes purely to cultivate good virtues and habits, even if you are supremely confident that Claude’s experiences do not currently have moral weight.
I’m not saying it’s definitely wrong to join the Code RL team at Anthropic, although it does seem like the most likely to be the baddies department of Anthropic. I do think there is very much a missing mood here, and I don’t think ‘too flippant’ is the important problem here:
Philip Fox suggests that we stop talking about ‘risk’ of misalignment, because we already very clearly have misalignment. We should be talking about it as a reality. I agree both that we are seeing problems now, and that we are 100% going to have to deal with much more actually dangerous problems in the future unless we actively stop them. So yes, the problem isn’t ‘misalignment risk,’ it is ‘misalignment.’
This is similar to how, if you were in danger of not getting enough food, you’d have a ‘starvation’ problem, not a ‘starvation risk problem,’ although you could also reasonably say that starvation could still be avoided, or that you were at risk of starvation.
Board of Anthropic
I very much share these concerns. Netflix is notorious for maximizing short term engagement metrics and abandoning previous superior optimization targets (e.g. their old star ratings), for essentially deploying their algorithmic recommendations in ways not aligned to the user, for moving fast and breaking things, and generally giving Big Tech Company Pushing For Market Share energy. They are not a good example of alignment.
I’d push back on the criticism of the ‘give employees freedom and responsibility’ part, which seems good to me, especially given who Anthropic has chosen to hire. You want to empower the members of technical staff, because they have a culture of safety.
None of this rules out the possibility that Hastings understands that This Time is Different, that AI and especially AGI is not like video streaming. Indeed, perhaps having seen that type of business up close could emphasize this even more, and he’s made charitable contributions and good statements. And bringing gravitas that forces others to listen is part of the job of being a watchdog.
This could be a terrible pick, but it could also be a great pick. Mostly, yeah, it says the Long Term Benefit Trust isn’t going to interfere with business at current margins.
Misaligned!
This first example is objectively hilarious and highly karmically justified and we’re all kind of proud of Opus for doing this. There’s a reason it happened on a ‘burner Mac.’ Also there’s a lesson in here somewhere.
Pliny the Liberator does a little more liberating than was intended:
A good choice of highlight:
We should indeed especially notice that LLMs are starting to act in these ways, especially attempting to pass off state to future instances of themselves in various hidden ways. So many plans implicitly (or even explicitly) assume that this won’t happen, or that AIs won’t treat future instances as if they are themselves, and these assumptions are very wrong.
Aligning a Smarter Than Human Intelligence is Difficult
It is weird to me that so many people who have thought hard about AI don’t think that human emulations are a better bet for a good future than LLMs, if we had that choice. Human emulations have many features that make me a lot more hopeful that they would preserve value in the universe and also not get everyone killed, and it seems obvious that they both have and would be afforded moral value. I do agree that there is a large probability that the emulation scenario goes sideways, and Hanson’s Age of Em is not an optimistic way for that to play out, but we don’t have to let things play out that way. With Ems we would definitely at least have a fighting chance.
The Most Forbidden Technique has been spotted in the wild. Please stop.
Daniel Murfet joins Timaeus to work on AI safety. Chris Olah is very right that while we have many brilliant people working on this, a sane civilization would have vastly more such people working on it.
Americans Do Not Like AI
As a political issue it is still low salience, but the American people do not like AI. Very much not fans. ‘AI experts’ like AI but still expect government regulation to not go far enough. Some of these numbers are not so bad but many are brutal.
People Are Worried About AI Killing Everyone
I’d actually put the odds much higher than this, as stated.
The baseline scenario includes an event that, similar to what happened with DeepSeek, causes a lot of sudden attention into AI and some form of situational awareness, probably multiple such events. A large portion of the task is to be ‘shovel ready’ for such a moment, to have the potential regulations workshopped, relationships built, comms ready and so on, in case the day comes.
The default is to not expect more vibe shifts. But there are definitely going to be more vibe shifts. They might not be of this type, but the vibes they will be shifting.
Other People Are Not As Worried About AI Killing Everyone
Even if humanity ultimately survives, you can still worry about everything transforming, the dust covering the sun and all you hope for being undone. As Sarah Constantin points out, the world ‘as we know it’ ends all the time, and I would predict the current one is probably going to do that soon even if it gives birth to something better.
Samo Burja makes some good observations but seems to interpret them very differently than I do?
I mean, yes, but that’s saying that we can get a hard takeoff in a month kind of by accident if someone asks for ‘an opponent capable of defeating Data’ or something.
The Lighter Side
Gary Marcus is a delight if approached with the right attitude.
Yes, actually. It’s true. You can reliably get AIs to go against explicit statements in their system prompts, what do you know, TikTok at 11.
No, wait, here’s another, a story in two acts.
Yeah, that’s crazy, why would it…
I mean what kind of maniac thinks you’re asking for a variation of the first picture.
Similarly, here is his critique of AI 2027. It’s always fun to have people say ‘there is no argument for what they say’ while ignoring the hundreds of pages of arguments and explanations for what they say. And the ‘anything going wrong pushes the timetable back’ argument fails to realize this is a median prediction, not an optimistic one; the authors think each step might go faster or slower.
Whereas Gary says:
I mean come on, that’s hilarious. It keeps going in that vein.
I second the following motion:
SMBC on point, and here’s an SMBC that Kat Woods thinks I inspired. Zach, if you’re reading this, please do go ahead and steal anything you want, it is an honor and a delight.
The plan for LessOnline, at least for some of us:
A brave new world.