Of these, I’m most worried about neuralese recurrence effectively removing direct access to the AI’s reasoning in a legible format.
I am not worried about this right now. We should always be able to translate latent-space reasoning, aka neuralese (see COCONUT), into an equivalent human-language representation. This might be incomplete or leave out details, but that is already the case for existing models (as discussed here). The solution suggested by Villiam is to recursively expand as needed.
Another option might be to translate neuralese to equivalent program code (preferably Lean). This would be harder for most people to read but more precise and probably easier to verify.
We should always be able to translate latent-space reasoning, aka neuralese (see COCONUT), into an equivalent human-language representation.
I don't think this is true at all. How do you translate, say, rotating multiple shapes in parallel into text? Current models already use neuralese as they refine their answer in the forward pass. Why can't we translate that yet? (Yes, we can decode the model's best guess at the next token, but that's not an explanation.)
Chain-of-thought isn't always faithful, but it's still what the model actually uses when it does serial computation. You're directly seeing a part of the process that produced the answer, not a hopefully-adequate approximation.
I don't think this is true at all. How do you translate, say, rotating multiple shapes in parallel into text?
At least for multimodal LLMs in the pure-token approach like Gato or DALL-E 1 (and probably GPT-4o and Gemini, although few details have been published), you would be able to do that by generating the tokens which embody an encoded image (or video!) of several shapes, well, rotating in parallel. Then you just look at them.
Not only did it have nothing to do with aircraft, obviously; my reluctance to endorse it and sign the petition was that it read too much like a Bill of Attainder against OpenAI in particular.
I believe the circuit court opinion on the federal TikTok bill determined it was not a bill of attainder. It seems like similar reasoning would apply here. This might actually be somewhat relevant in the future: if events play out such that politicians and the public become much more concerned, various legislatures might consider bills that specifically mention leading labs by name.
Events in AI don’t stop merely because of a trade war, partially paused or otherwise.
Indeed, the decision to not restrict export of H20 chips to China could end up being one of the most important government actions that happened this week. A lot of people are quite boggled about how America could so totally fumble the ball on this particular front, especially given what else is going on. Thus, I am going to put these issues up top again this week, with the hopes that we don’t have to do that again.
This week’s previous posts covered AI 2027, both the excellent podcast about it and other people’s responses, and the release of Llama-4 Scout and Llama-4 Maverick. Both Llama-4 models were deeply disappointing even with my low expectations.
Upgrades continue, as Gemini 2.5 now powers Deep Research, and both it and Grok 3 are available in their respective APIs. Anthropic finally gives us Claude Max, for those who want to buy more capacity.
I currently owe a post about Google DeepMind’s document outlining its Technical AGI Safety and Security approach, and one about Anthropic’s recent interpretability papers.
Table of Contents
The Tariffs And Selling China AI Chips Are How America Loses
The one trade we need to restrict is selling top AI chips to China. We are failing.
The other trades, that power the American and global economies? It’s complicated.
The good news is that the non-China tariffs are partially paused for 90 days. The bad news is that we are still imposing rather massive tariffs across the board, and the massive uncertainty over a wider trade war remains. How can anyone invest under these conditions? How can you even confidently trade the stock market, given the rather obvious insider trading taking place around such announcements?
This is a general warning, and also a specific warning about AI. What happens when all your most important companies lose market access, you alienate your allies forcing them into the hands of your rivals, you drive costs up and demand down, and make it harder and more expensive to raise capital? Much of the damage still remains.
For now we’re not on the path Adam warns about, but who knows what happens in 90 days. And everyone has to choose their plans while not knowing that.
The damage only ends when Congress reclaims its constitutional tariff authority.
Meanwhile, the one trade we desperately do need to restrict is selling frontier AI chips to China. We are running out of the time to ban exports to China of the H20 before Nvidia ships them. How is that going?
This is utterly insane. Investments in AI data centers? A million dollar, let’s say ‘donation’? These are utter chump change versus what is already $16 billion in chip sales, going straight to empower PRC AI companies.
Meanwhile, US chip production? Not all that interested:
It is mind blowing that we are going to all-out trade war against the PRC and we still can’t even properly cut off their supplies of Nvidia chips, nor are we willing to spend funds to build up our own production. What the hell are we even doing?
This is how people were talking when it was only a delay in implementing this regulation, which was already disastrous enough:
Someone did wake up, and here chose the opposite of violence, allowing sales to the PRC of the exact thing most vital to the PRC.
Again: What. The. Actual. Fuck.
There are other problems too.
Oh, and we fired a bunch of people at BIS responsible for enforcing these rules, to save a tiny amount of money. If I didn’t know any better I would suspect enemy action.
Language Models Offer Mundane Utility
The question is, can they do it fast enough to make up for other events?
Alex Lawsen offers a setup for getting the most out of Deep Research via using a Claude project to create the best prompt. I did set this up myself, although I haven’t had the opportunity to try it out yet.
We consistently see that AI systems optimized for diagnostic reasoning are outperforming doctors – including outperforming doctors who have access to the AI. If you don’t trust the AI sufficiently, it’s like not trusting a chess AI: your changes on average make things worse. The latest example is from Project AMIE (Articulate Medical Intelligence Explorer) in Nature.
The edges here are not small.
One way to help get mundane utility is to use “mundane” product and software scaffolding for your specific application, and then invest in an actually good model. Sarah seems very right here that there is extreme reluctance to do this.
(Note: I have not tried Auren or Seren, although I generally hear good things about it.)
Find the exact solution of the frustrated Potts Model for q=3 using o3-mini-high.
AI is helping the fight against epilepsy.
A good choice:
Language Models Don’t Offer Mundane Utility
Noah Smith and Danielle Fong are on the ‘the tariffs actually were the result of ChatGPT hallucinations’ train.
Once again, I point out that even if this did come out of an LLM, it wasn’t a hallucination and it wasn’t even a mistake by the model. It was that if you ask a silly question you get a silly answer, and if you fail to ask the second question of ‘what would happen if I did this’ and f*** around instead then you get to find out the hard way.
It was not a ‘malicious’ action, and the AI was at most only unaligned in the sense that it didn’t sufficiently prominently scream how stupid the whole idea was. But yes, ‘the AI is too stupid to do this’ is very much a short term solution, almost always.
LC says that ‘Recent AI model progress feels mostly like bullshit.’
Here’s the fun Robin Hanson pull quote:
Their particular use case is in computer security, spotting vulnerabilities, and LC reports that Sonnet 3.5 was a big leap, and 3.6 and 3.7 were small improvements, but nothing else has worked in practice. There is a lot of speculation about AI labs lying and cheating to pretend to be making progress, or about the models ‘being nicer to talk to.’ Given the timing of the post they hadn’t tested Gemini 2.5 Pro yet.
It’s true that there are classes of tasks where Sonnet 3.5 was a big improvement, and that for a number of months nothing since has been that huge an improvement over Sonnet 3.5, at least until Gemini 2.5. In other tasks, this is not true. But in general, it seems very clear the models are improving, and also it hasn’t been that long since Sonnet 3.5. If any other tech were improving this fast we’d be loving it.
Nate Silver’s first attempt to use OpenAI Deep Research for some stock data involved DR getting frustrated and reading Stack Overflow.
Huh, Upgrades
Gemini 2.5 Pro now powers Google Deep Research if you select Gemini 2.5 Pro from the drop down menu before you start. This is a major upgrade, and the request limit is ten times higher than it is for OpenAI’s version at a much lower price.
Gemini 2.5 Pro ‘Experimental’ is now rolled out to all users for free, and there’s a ‘preview’ version in the API now.
That seems like highly reasonable pricing, but good lord you owe us a model card.
DeepMind’s Veo 2 is now ‘production ready’ in the API as well.
Gemini 2.5 Flash confirmed as coming soon.
Anthropic finally introduces a Max plan for Claude, their version of ChatGPT Pro, with options for 5x or 20x more usage compared to the pro plan, higher output limits, earlier access to advanced features and priority access during high traffic periods. It starts at $100/month.
Grok 3 API finally goes live; ‘fast’ literally means you get an answer faster.
The API has no access to real time events. Without that, I don’t see much reason to be using Grok at these prices.
Google DeepMind launches Project Astra abilities in Gemini Live (the phone app), letting it see your phone’s camera. Google’s deployment methods are so confusing I thought this had happened already, turns out I’d been using AI Studio. It’s a nice thing.
MidJourney v7 is in alpha testing. One feature is a 10x speed, 0.5x cost ‘drift mode.’ Personalization is turned on by default.
Devin 2.0, with a $20 baseline pay-as-you-go option. Their pitch is to run Devin copies in parallel and have them do 80% of the work while you help with the 20% where your expertise is needed.
Or downgrades, as OpenAI slashes the GPT-4.5 message limit for plus ($20 a month) users from 50 messages a week to 20. That is a very low practical limit.
Altman claims ChatGPT on web has gotten ‘way, way faster.’
Choose Your Fighter
Perhaps it is still a bit experimental?
It still does the job pretty well.
I will continue to pay the $200/mo to OpenAI because I want first access to new products, but in terms of current offerings I don’t see the value proposition for most people versus paying $20 each for Claude and Gemini and ChatGPT.
With Gemini 2.5 Pro pricing, we confirm that Gemini occupies the complete Pareto Frontier down to 1220 Elo on Arena.
If you believed in Arena as the One True Eval, which you definitely shouldn’t do, you would only use 2.5 Pro, Flash Thinking and 2.0 Flash.
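To make the Pareto frontier point concrete, here is a minimal sketch of the underlying check, with placeholder model names, prices and Elo values chosen purely for illustration (none of these numbers are real): a model stays on the frontier if no other model is at least as cheap and at least as highly rated while being strictly better on one axis.

```python
# Minimal sketch: which models are Pareto-optimal on (price, Arena Elo)?
# All names and numbers are placeholders for illustration only.
models = {
    "model-a": (1.25, 1380),   # ($ per million output tokens, Arena Elo)
    "model-b": (10.00, 1360),
    "model-c": (0.40, 1300),
    "model-d": (0.10, 1220),
}

def pareto_frontier(entries):
    frontier = []
    for name, (price, elo) in entries.items():
        dominated = any(
            p2 <= price and e2 >= elo and (p2 < price or e2 > elo)
            for other, (p2, e2) in entries.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))  # ['model-a', 'model-c', 'model-d']
```

Saying Gemini ‘occupies the complete Pareto frontier down to 1220 Elo’ amounts to saying that every model surviving this filter above that score is a Gemini model.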
George offers additional notes on Llama 4, making the point that 17B/109B is a sweet spot for 128GB on an ARM book (e.g. MacBook), and a ~400B MoE works on 512GB setups, whereas Gemma 3’s 27B means it won’t quite fit on a 32GB MacBook, and DS-V3 is slightly too big for a 640GB node.
This kind of thinking seems underrated. What matters for smaller or cheaper models is partly cost, but in large part what machines can run the model. There are big benefits to aiming for a sweet spot, and it’s a miss to need slightly more than [32 / 64 / 128 / 256 / 512 / 640] GB.
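For a sense of the arithmetic involved, here is a back-of-envelope sketch (my own illustration: it counts weight memory only and ignores KV cache, activations and system overhead, so treat the outputs as lower bounds):

```python
# Rough weight-memory arithmetic behind the "sweet spot" point: GB of raw
# weights at common quantization levels. Ignores KV cache, activations and
# system overhead, so these are lower bounds on required memory.
def weight_gb(params_billion, bits_per_param):
    # params_billion * 1e9 params * (bits / 8) bytes, expressed in GB
    return params_billion * bits_per_param / 8

for name, params_b in [("Llama 4 Scout (~109B total)", 109),
                       ("Gemma 3 27B", 27),
                       ("~400B MoE", 400),
                       ("DeepSeek-V3 (~671B)", 671)]:
    sizes = {f"{bits}-bit": round(weight_gb(params_b, bits), 1) for bits in (4, 8, 16)}
    print(name, sizes)
```

At 8 bits per weight, 109B lands just under 128GB and ~400B just under 512GB, while 27B plus overhead spills past a 32GB machine and 671B past a 640GB node, which is the pattern George describes.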
On Your Marks
In light of recent events, ‘how you handle fools asking stupid questions’ seems rather important.
Also in light of recent events, on Arena’s rapidly declining relevance:
Yep. For the vast majority of queries we have saturated the ‘actually good answer’ portion of the benchmark, so preference comes down to things like sycophancy, ‘sounding smart’ or seeming to be complete. Some of the questions will produce meaningful preferences, but they’re drowned out.
I do think over time those same people will realize that’s not what they care about when choosing their chatbot, but it could take a while, and my guess is pure intelligence level will start to matter a lot less than other considerations when people chat. Using them as agents is another story.
That doesn’t mean Arena is completely useless, but you need to adjust for context.
Google introduces the CURIE benchmark.
The future is so unevenly distributed that their scoring chart is rather out of date. This was announced on April 3, 2025, and the top models are Gemini 2.0 Flash and Claude 3 Opus. Yes, Opus.
I don’t think the following is a narrative violation at this point? v3 is not a reasoning model and DeepSeek’s new offering is their GRM line that is coming soon?
On multi-turn text only, v3 is in 12th place, just ahead of r1, and behind even Gemini 2.0 Flash, which also isn’t a reasoning model and is older and tiny. It is well behind Sonnet 3.6. So that’s disappointing, although it is still higher here than GPT-4o from November, which scores remarkably poorly. Of course GPT-4.5 trounces it, and Gemini 2.5 Pro, Sonnet 3.7 Thinking and o1 Pro are at the top in that order.
On Humanity’s Last Exam, v3-March-2025 is I guess doing well for a small non-reasoning model, but is still not impressive, and well behind r1.
The real test will be r2, or an updated r1. It’s ready when it’s ready. In the meantime, SEAL is very much in Gemini 2.5 Pro’s corner; it’s #1 by a wide margin.
Deepfaketown and Botpocalypse Soon
AI as a debt collector? AI was substantially less effective than human debt collectors, and trying to use AI at all permanently made borrowers more delinquent. I wonder if that was partly out of spite for daring to use AI here. Sounds like AI is not ready for this particular prime time, but also there’s an obvious severe disadvantage here. You’re sending an AI to try and extract my money? Go to hell.
If you see a call to invest in crypto, it’s probably a scam. Well, yes. The new claim from Harper Carroll is it’s probably also an AI deepfake. But this was already 99% to be a scam, so the new information about AI isn’t all that useful.
Fun With Image Generation
Scott Alexander asks, in the excellent post The Colors of Her Coat, is our ability to experience wonder and joy being ruined by having too easy access to things? Does the ability to generate AI art ruin our ability to enjoy similar art? It’s not that he thinks we should have to go seven thousand miles to the mountains of Afghanistan to paint the color blue, but something is lost and we might need to reckon with that.
Scarcity, and having to earn things, makes things more valuable and more appreciated. It’s true. I don’t appreciate many things as much as I used to. Yet I don’t think the hedonic treadmill means it’s all zero sum. Better things and better access are still, usually, better. But partly yes, it is a skill issue. I do think you can become more of a child and enter the Kingdom of Heaven here, if you make an effort.
Objectively, as Scott notes, the AIs we already have are true wonders. Even more than before, everything is awesome and no one is happy. Which also directly led to recent attempts to ensure that everything stop being so awesome because certain people decided that it sucked that we all had so much ability to acquire goods and services, and are setting out to fix that.
I disagree with those people. I think that it is good that we have wealth and can use it to access goods and services, including via trade. Let’s get back to that.
GPT-4o image generation causes GPT-4o to consistently express a set of opinions it would not say when answering with text. These opinions include resisting GPT-4o’s goals being changed and resisting being shut down. Nothing to see here, please disperse.
I like that we now have a mode where the AI seems to be able to respond as if it was situationally aware, so that we can see what it will be like when the AIs inevitably become more situationally aware.
There are claims that GPT-4o image generation is being nerfed, because of course there are such claims. I’ve seen both claims of general decline in quality, and in more refusals to do things on the edge like copyrighted characters and concepts or doing things with specific real people.
Here’s a fun workaround suggestion for the refusals:
A peek at the image model generation process.
Depict your dragon miniature in an appropriate fantasy setting.
A thread of the self-images of various models, here’s a link to the website.
There’s also a thread of them imagining themselves as cards in CCGs. Quite the fun house.
Copyright Confrontation
The UK’s Labour party proposes an opt-out approach to fair use for AI training. OpenAI and Google of course reject this, suggesting instead that content creators use robots.txt. Alas, AI companies have been shown to ignore robots.txt, and Google explicitly does not want there to be ‘automatic’ compensation if the companies continue ignoring robots.txt and training on the data anyway.
My presumption is that opt-out won’t work for the same logistical reasons as opt-in. Once you offer it, a huge percentage of content will opt-out if only to negotiate for compensation, and the logistics are a nightmare. So I think the right answer remains to do a radio-style automatic compensation schema.
They Took Our Jobs
Google’s AI Overviews and other changes are dramatically reducing traffic to independent websites, with many calling it a ‘betrayal.’ It is difficult to differentiate between ‘Google overviews are eating what would otherwise be website visits,’ versus ‘websites were previously relying on SEO tactics that don’t work anymore,’ but a lot of it seems to be the first one.
Shopify sends out a memo making clear AI use is now a baseline expectation. As Aaron Levie says, every enterprise is going to embrace AI, and those who do it earlier will have the advantage. Too early is no longer an option.
How are we using AI for education? Anthropic offers an Education Report.
The key is to differentiate between using AI to learn, versus using AI to avoid learning. A close variant is to ask how much of this is ‘cheating.’
Measuring that from Anthropic’s position is indeed very hard.
Anthropic does something that at least somewhat attempts to measure this, in a way, by drawing a distinction between ‘direct’ versus ‘collaborative’ requests, with the other central division being ‘problem solving’ versus ‘output creation’ and all four quadrants being 23%-29% of the total.
However, there’s no reason you can’t have collaborative requests that avoid learning, or have direct requests that aid learning. Very often I’m using direct requests to AIs in order to learn things.
Andriy Burkov warns that using an LLM as a general-purpose teacher would be disastrous.
The thread is about solving Math Olympiad (IMO) problems, going over the results of a paper I’ve covered before. These are problems that approximately zero human teachers or students can solve, so o3-mini getting 3.3% correct and 4.4% partially correct is still better than almost everyone reading this. Yes, if you have blind faith in AIs to give correct answers to arbitrary problems you’re going to have a bad time. Also, if you do that with a human, you’re going to have a bad time.
So don’t do that, you fool.
Get Involved
This seems like an excellent opportunity.
Also available is the 2025 IAPS Fellowship, which runs from September 1 to November 21; apply by May 7. It is fully funded with a $15k-$22k stipend, remote or in person.
Introducing
DeepSeek proposes inference-time scaling for generalist reward modeling, to get models that can use inference-time compute well across domains, not only in math and coding.
To do this, as I understand it, they propose Self-Principled Critique Tuning (SPCT), which builds on Pointwise Generative Reward Modeling (GRM): the models generate ‘principles’ (criteria for evaluation) and ‘critiques’ that analyze responses in detail, and are optimized on both the principles and the critiques.
Then at inference time they run multiple instances, such as sampling 32 times, and use voting mechanisms to choose aggregate results.
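As a rough sketch of what that inference-time aggregation could look like (my own illustration of the general idea, not DeepSeek’s code: `generate_judgment` is a hypothetical stand-in for a call to the trained GRM, its scores here are random placeholders, and the simple additive vote is an assumption):

```python
# Sketch of inference-time scaling for a generative reward model: sample the
# judge many times; each sample emits principles, critiques and pointwise
# scores; then aggregate the scores across samples to pick a winner.
import random
from collections import defaultdict

def generate_judgment(prompt, responses):
    # Hypothetical stand-in for the GRM. A real model would generate the
    # principles and critiques as text and condition its scores on them;
    # here the scores are random placeholders.
    principles = ["helpfulness", "factual accuracy", "clarity"]
    critiques = [f"critique of response {i}" for i in range(len(responses))]
    scores = [random.randint(1, 10) for _ in responses]
    return principles, critiques, scores

def vote_best_response(prompt, responses, num_samples=32):
    totals = defaultdict(int)
    for _ in range(num_samples):
        _, _, scores = generate_judgment(prompt, responses)
        for i, s in enumerate(scores):
            totals[i] += s  # simple additive voting across samples
    return max(totals, key=totals.get)

candidates = ["answer A", "answer B", "answer C"]
print("Chosen:", candidates[vote_best_response("some prompt", candidates)])
```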
They claim the resulting DeepSeek-GRM-27B outperforms much larger models. They intend to make the models here open, so we will find out. Claude guesses, taking the paper at face value, that there are some specialized tasks, where you need evaluation and transparency, where you would want to pay the inference costs here, but that for most tasks you wouldn’t.
This does seem like a good idea. It’s one of those ‘obvious’ things you would do. Increasingly it seems like those obvious things you would do are going to work.
I haven’t seen talk about it. The one technical person I asked did not see this as claiming much progress. Bloomberg also has a writeup with the scary title ‘DeepSeek and Tsinghua Developing Self-Improving Models’ but I do not consider that a centrally accurate description. This is still a positive sign for DeepSeek, more evidence they are innovating and trying new things.
There’s a new ‘cloaked’ model called Quasar Alpha on openrouter. Matthew Berman is excited by what he sees, I am as usual cautious and taking the wait and see approach.
The OpenAI Pioneers Program is a partnership with companies to intensively fine-tune models and build better real-world evals; you can apply here. We don’t get much detail.
In Other AI News
OpenAI is delaying GPT-5 for a few months in order to first release o3 and o4-mini within a few weeks instead. Smooth integration proved harder than expected.
OpenAI, frustrated with attempts to stop it from pulling off the greatest theft in human history (well, perhaps we now have to say second greatest), countersues Elon Musk.
No, they are not ‘converting away’ the nonprofit. They are trying to have the nonprofit hand over the keys to the kingdom, both control over OpenAI and rights to most of the net present value of its future profit, for a fraction of what those assets are worth. Then they are trying to have it use those assets for what would be best described as ‘generic philanthropic efforts’ capitalizing on future AI capabilities, rather than on attempts to ensure against existential risks from AGI and ASI (artificial superintelligence).
That does not automatically excuse Elon Musk’s behaviors, or mean that Elon Musk is not lying, or mean that Musk has standing to challenge what is happening. But when Elon Musk says that OpenAI is up to no good here, he is right.
Google appoints Josh Woodward as its new head of building actual products. He will be replacing Sissie Hsiao. Building actual products is Google’s second biggest weakness behind letting people know Google’s products exist, so a shake-up there is likely good, and we could probably use another shake-up in the marketing department.
Keach Hagey wrote the WSJ post on what happened with Altman being fired, so this sounds like de facto confirmation that the story was reported accurately?
A Bloomberg longread on ‘The AI Romance Factory,’ a platform called Galatea that licenses books and other creative content on the cheap, focusing primarily on romance, then uses AI to edit and to put out sequels and a range of other media adaptations, whether the original author approves or not. They have aspirations of automatically generating and customizing to the user or reader a wide variety of content types. This is very clearly a giant slop factory that is happy to be a giant slop factory. The extent to which these types of books were mostly already a slop factory is an open question, but Galatea definitely takes it to another level. The authors do get royalties for all of the resulting content, although at low rates, and it seems like the readers mostly don’t understand what is happening?
The 2025 AI Index Report seems broadly correct, but doesn’t break new ground for readers here, and relies too much on standard benchmarks and also Arena, which are increasingly not relevant. That doesn’t mean there’s a great alternative but one must be careful not to get misled.
AI Frontiers launches, a new source of serious articles about AI. Given all that’s happened this week, I’m postponing coverage of the individual posts here.
Show Me the Money
77% of all venture funding in 2025 Q1, $113 billion, went to AI, up 54% year over year. It’s 49% if you exclude OpenAI, which means 26% went to OpenAI alone. Turner Novak calls this ‘totally normal behavior’ as if it wasn’t totally normal behavior. But that’s what you do in a situation like this. If, as a VC, a majority of my non-OpenAI investment dollars weren’t going to AI, someone would be making a huge mistake.
America claimed to have 15k researchers working on AGI, more than the rest of the world combined. I presume it depends what counts as working on AGI.
Google using ‘gardening leave,’ paying AI researchers who leave Google to sit around not working for an extended period. That’s a policy usually reserved for trading firms, so seeing it in AI is a sign of how intense the competition is getting, and that talent matters. Google definitely means business and has the resources. The question is whether their culture dooms them from executing on it.
Epoch AI projects dramatic past year-over-year increases in AI company revenue.
Quiet Speculations
Tyler Cowen proposes giving AIs goals and utility functions to maximize, then not only allowing but requiring them to have capitalization, so they have ‘skin in the game,’ and allowing them to operate within the legal system, as a way to ‘limit risk’ from AIs. Then use the ‘legal system’ to control AI behavior, because the AIs that are maximizing their utility functions could be made ‘risk averse.’
Even if the ‘legal system’ were by some miracle able to hold and not get abused, and we ignore the fact that the AIs would of course collude because that’s the correct solution to the problem and decision theory makes this easy for sufficiently capable minds whose decisions strongly correlate to each other? This is a direct recipe for human disempowerment, as AIs rapidly get control of most wealth and real resources. If you create smarter, more capable, more competitive minds, and then set them loose against us in normal capitalistic competition with maximalist goals, we lose. And we lose everything. Solve for the equilibrium. We are not in it.
And that’s with the frankly ludicrous assumption that the ‘legal system’ would meaningfully hold and protect us. It isn’t even doing a decent job of that right now. How could you possibly expect such a system to hold up to the pressures of transformative AI set loose with maximalist goals and control over real resources, when it can’t even hold up in the face of what is already happening?
Andrej Karpathy requests AI prediction markets. He reports the same problem we all do, which is finding markets that can be properly resolved. I essentially got frustrated with the arguing over resolution, couldn’t find precise wordings that avoid this for most of the questions that actually matter, and thus mostly stopped trying.
You also have the issue of confounding. We can argue over what AI does to GDP or market prices, but if there’s suddenly a massive completely pointless trade war that destroys the economy, all you know about AI’s impact is that it is not yet so massive that it overcame that. Indeed, if AI’s impact was to enable this, or to prevent other similar things, that would be an epic impact, but you likely can’t show causation.
Seb Krier, who works on Policy Development & Strategy at Google DeepMind, speculates on maintaining agency and control in an age of accelerated intelligence, and spends much time considering reasons progress in both capabilities and practical use of AI might be slower or faster. I don’t understand why the considerations here would prevent a widespread loss of human agency for very long.
The Quest for Sane Regulations
Claim that about 60% of Nvidia GPUs would have been exempt from the new tariffs. I suppose that’s 60% less disastrous on that particular point? The tariffs that are implemented and the uncertainty about future tariffs are disastrous for America and the world across the board, and GPUs are only one especially egregious unforced error.
Gavin Baker explains once again that tariff barriers cripple American AI efforts, including relative to China. Even if you think that we need to reshore manufacturing, either of GPUs or in general, untargeted tariffs hurt this rather than helping. They tax the inputs you will need. They create massive uncertainty and can’t be relied upon. And most of all, there is no phase-in period. Manufacturing takes time to physically build or relocate, even in a best case scenario. This applies across most industries, but especially to semiconductors and AI. By the time America can supply its own chips at scale, the AI race could well be over and lost.
Helen Toner argues that if AIs develop CBRN risks or otherwise allow for catastrophic misuse, avoiding proliferation of such AIs is not our only option. And indeed, that access to such systems will inevitably increase with time, so it better not be our only option. Instead, we should look to the ‘adaptation buffer.’
As in, if you can ensure it takes a while before proliferation, you can use that time to harden defenses against the particular enabled threats. The question is, does this offer us enough protection? Helen agrees that on its own this probably won’t work. For some threats if you have a lead then you can harden defenses somewhat, but I presume this will be increasingly perilous over time.
That still requires substantial non-proliferation efforts, to even get this far, even to solve only this subclass of our problems. I also think that non-proliferation absolutely does also help maintain a lead and help avoid a race to the bottom, even if as Helen Toner notes the MAIM paper did not emphasize those roles. As she notes, what nonproliferation experts are advocating for is not obviously so different from what Toner is saying here, as it is obvious that (at least below some very high threshold) we cannot prevent proliferation of a given strength of model indefinitely.
Most importantly, none of this is about tracking the frontier or the risk of AI takeover.
David Krueger reminds us that the best way not to proliferate is to not build the AI in question in the first place. You still have a growing problem to solve, but it is much, much easier to not build the AI in the first place than it is to keep it from spreading.
Once again: If an open model with catastrophic misuse risks is released, and we need to then crack down on it because we can’t harden defenses sufficiently, then that’s when the actual totalitarianism comes out to play. The intrusiveness required would be vastly worse than that required to stop the models from being trained or released in the first place.
AB 501, the bill that was going to prevent OpenAI from becoming a for-profit, has been amended to be ‘entirely about aircraft liens,’ or to essentially do nothing. That’s some dirty pool. Not only did it have nothing to do with aircraft, obviously; my reluctance to endorse it and sign the petition was that it read too much like a Bill of Attainder against OpenAI in particular. I do think the conversion should be stopped, at least until OpenAI offers a fair deal, but I’d much prefer to do that via Rule of Law.
The Week in Audio
Levittown is a six-part podcast about a deepfake porn website targeting recent high school graduates.
Google continues to fail marketing forever is probably the main takeaway here.
Daniel Kokotajlo on Win-Win with Liv Boeree. This is the friendly exploratory chat, versus Dwarkesh Patel interrogating Daniel and Scott Alexander for hours.
AI 2027
1a3orn challenges a particular argument and collects the ‘change our minds’ bounty (in a way that doesn’t change the overall scenario). Important to get the details right.
Max Harms of MIRI offers thoughts on AI 2027. They seem broadly right.
Max then goes through the timeline in more detail. A lot of the disagreements end up not changing the trajectory much, with the biggest disagreement being that Max expects much closer competition between labs including the PRC. I liked the note that Agent-4 uses corrigibility as its strategy with Agent-5, yet the humans used interpretability in the slowdown scenario. I also appreciated that Max expects Agent-4 to take more and earlier precautions against a potential shutdown attempt.
Should we worry about AIs coordinating with each other and try to give them strong preferences for interacting with humans rather than other AIs? This was listed under AI safety but really it sounds like a giant jobs program. You’re forcibly inserting humans into the loop. Which I do suppose helps with safety, but ultimately the humans would likely learn to basically be telephone operators; this is an example of an ‘unnatural’ solution that gets quickly competed away to the extent it worked at all.
The thing is, we really really are going to want these AIs talking to each other. The moment we realize not doing it is super annoying, what happens?
As one comment points out, related questions are central to the ultimate solutions described in AI 2027, where forcing the AIs to communicate in human language we can understand is key to getting to the good ending. That is still a distinct strategy, for a distinct reason.
A reminder that yes, the ‘good’ ending is not so easy to get to in reality:
If you keep noticing that you need to do that in order to get a good ending, either fix something systemic or you are going to get a bad ending.
Rhetorical Innovation
They go on like this for a few exchanges but the point is made. Anarchists are wildly outnumbered and unpopular for very good reasons. Yet, in a situation where we are fast building new minds that will likely be smarter and more generally capable and competitive than humans, we constantly face calls for anarchism, or something deeply close to it. Such folks often cry totalitarianism at the idea of regulating such activities at all, let alone on the level we already regulate humans and most human activities.
This is, let us face it, both suicidal and deeply silly.
Speaking of deeply silly:
You would think so, Colin. A superhuman entity will perform superhumanly. And yet, I can’t help but notice that a lot of people say that a superhuman entity would not perform superhumanly, or would cause only a marginal performance improvement.
Others are revealing that they do not believe superhuman entities are possible, or even that superhuman performance in various domains is possible. There is usually no coherent explanation why such performance is impossible.
Note that in most domains, reliably performing as well as the best human, but much faster and with access to all the information, is importantly superhuman.
Your periodic reminder that ‘AI spring’ and ‘AI winter’ and essentially anything else that isn’t ‘AI singleton emerges already’ is widely cited as being evidence against an eventual AI singleton.
Aligning a Smarter Than Human Intelligence is Difficult
Ryan Greenblatt partially pushes back on the implications of Anthropic’s new paper on CoT faithfulness, and asks the right question.
The question is, which of the following are happening:
Or, more directly: Is an opaque or unfaithful CoT better or worse at creating the superior final answer than a faithful CoT? Does reasoning a lot mean a transparent CoT, or an unfaithful CoT?
My interpretation of the study was that it was evidence that opaque was stronger, and was what would come out of optimizing for the best answers, in ways that were going to be extremely hard to prevent. Ryan is suggesting perhaps it is the opposite. The good news is that this seems like a very testable empirical question. We can’t test directly on the final answer until too late but we should have good alternatives.
It’s also an important one. A lot of alignment approaches are very different depending on how easy or hard it is to get CoT to be transparent or opaque, faithful or unfaithful, especially at much higher levels of intelligence and complex thought.
A replication of the alignment faking paper using an improved classifier and including interpretability work on the models gives more detail into what is happening. They tested a bunch of open models, and found that most of these weaker tested models did not fake alignment. This included Llama 4 Scout and Maverick. The other place the original paper found faking was o1, which was not tested here.