Obligatory "Thane compulsively digs into the details of this week's LLM Innovators paper instead of doing anything productive":
This was still a super impressive result for Claude Sonnet 3.5. Its ideas were almost as good as the human ideas, despite all the Goodhart’s Law issues.
Ehhhh, I'm not particularly interested in grading that on a curve. What was the human performance in this study, in absolute terms? Was the generative process that produced human ideas for this study capable of innovation?
Keep in mind that these ideas weren't generated as part of an active research loop where the humans improved on them in tandem with trying to execute them, discussed those ideas with other researchers, thought about them over days/weeks, etc. It's not necessarily the case that the average human performance here was anything impressive.[1][2]
And my interpretation is that yes, the human ideas here were, on average, pretty bad. The human performance was: effectiveness 4.8, excitement 4.5, novelty 4.9, soundness 5.4, overall 4.0. Extracting the meaning of those from the appendix (page 18):
Score Interpretations
(Effectiveness 5) "Mixed results: The method provides mixed results. It works better than baselines on some datasets or metrics, but not consistently across all of them. The gains tend to be very small and not significant."
(Novelty 5) "Somewhat novel: The idea has some differences from existing work, but the variations are very incremental rather than substantial. It might refine or extend previous ideas but lacks enough originality to justify a new paper on its own."
(Excitement 4.5) The meaning of 4/10 is defined only indirectly, as something between 3 and 5.
(Soundness 5.4) Between the definitions for 5 and 6.
(Overall 4) "Ok but not good enough, rejection for major AI conferences."
I'd say that's some floor-level performance from the humans. I don't think it'd be too controversial to say that the average human-published paper is pretty terrible in absolute terms: it does ~nothing to actually push the frontier, it does not constitute useful research, it's just something researchers do to survive publish-or-perish. Is matching that performance actually indicative of any ability to innovate? I'm doubtful.
And Novelty of 5 for humans, 4.7 for LLMs? This is not beating the "LLMs can't innovate" allegations; in fact, it's evidence for that[3]. A machine that goes into the arXiv database, extracts the idea from some random paper, makes some random-but-not-outright-nonsensical changes to it, and outputs that, would've likely attained the same score.
Especially since humans don't have actual perfect recall of all papers ever published or all research ideas ever suggested on the Internet, and LLMs more or less do. This is biasing math benchmarks towards overestimating LLM capabilities; why assume this isn't happening here?
Now, the above does have a huge potential hole: maybe we should be looking at the top 5% of AI-generated ideas, instead of the average value? After all, if LLMs could produce a 10/10 idea 5% of the time, that would be more than sufficient for transformative impact.
The paper does provide a way to eyeball this: page 25 features plots of score distributions. (Subtlety: each AI idea got 4-5 reviews, and in the bar plot, each dot is a review score, not an average score for a given project across all reviews it got. Also, for perspective: there were 96 reviews in total.)
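To make that subtlety concrete, here is a minimal sketch with made-up numbers (not the paper's data): with 4-5 noisy reviews per idea, the distribution of individual review scores has fatter tails than the distribution of per-idea averages, so a stray high dot can come from a single generous reviewer rather than a genuinely strong idea.

```python
import random

random.seed(0)

# Toy illustration (made-up numbers, not the paper's data): each idea gets
# 4-5 independent reviews, at roughly the paper's scale of ~20 ideas and
# ~90-100 reviews. The raw review scores spread wider than per-idea averages.
ideas = []
for _ in range(20):
    true_quality = random.gauss(4.5, 1.0)
    reviews = [max(1, min(10, round(true_quality + random.gauss(0, 1.5))))
               for _ in range(random.choice([4, 5]))]
    ideas.append(reviews)

all_reviews = [score for reviews in ideas for score in reviews]
idea_averages = [sum(reviews) / len(reviews) for reviews in ideas]

print("max single review score:", max(all_reviews))
print("max per-idea average:   ", round(max(idea_averages), 2))
print("reviews >= 7:", sum(s >= 7 for s in all_reviews),
      "| ideas with average >= 7:", sum(a >= 7 for a in idea_averages))
```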
Regarding AI performance, those plots show how the AI ideas' Novelty, Excitement, and Overall review scores are distributed, including the upper tail.
(For perspective: Opus 4 and o3 estimate that major AI conferences accept top 10-15% of papers worldwide. I don't know any better than that.)
Now, that actually does incrementally update me towards "LLMs may be capable of innovation". The higher-end values may be noise (in the below sense), but the 6/10 tier is pretty solidly reached.
That said: Keep in mind that those are still individual experts' estimates of novelty/excitement, not ground-truth values. For example, that 8/10 idea may be an outlier in the sense of "this one reviewer liked this one idea disproportionally well", not in the sense of "AI came up with a really good idea".
Also, uhh... Turns out the ideas initially evaluated and the ideas actually executed-on were not exactly the same ideas. See Section 5.1 on page 8:
We show the counts of all types of changes made to human and AI ideas in Table 6, where we see that human ideas and AI ideas involve an average of 2.9 and 3.1 changes, respectively. This indicates that only a moderate number of changes are made to both human and AI ideas. Moreover, all of the changes focus on experiment details rather than altering any algorithms proposed in the original ideas. Examples of these changes include switching to benchmarks that are appropriate for the given tasks, updating the backbone models to more recent ones, adding more comprehensive evaluation metrics, specifying any missing hyper-parameters and prompt details, adding stronger baselines, and adding more analysis or ablation studies.
Details on page 27. Some excerpts:
- In another example, the AI-generated idea “Sociolinguistic Role-Play Prompting” proposed experiments on OpenSubtitles and XNLI, which were both removed because they don’t contain the sociolinguistic metadata necessary for the proposed experiments. In the AI-generated idea “Adaptive Semantic Masking”, the executor added more datasets, including Jailbreak-Bench and DAN-Forbidden-Questions, apart from AdvBench mentioned in the original idea.
- This refers to adding or changing baseline methods in the proposed experiments. For example, in the AI-generated idea “Adaptive Contextual Pruning”, the executor added a baseline “RAG using model-based embeddings”. In the AI-generated idea “EntropyGuided Prompt Mutation”, the proposed baseline Monte Carlo Dropout was dropped since it’s infeasible on black-box LLMs.
- In multiple projects, executors decided the temperature and top_p values when sampling responses from LLMs, the number of iterations for applying the proposed method, the number of demo examples for in-context learning, and the number of runs when reporting performance.
For fairness' sake, humans' ideas were "tweaked" in this manner as well...
But no, I think this nontrivially poisons any conclusions we should draw from this paper. First, such "small tweaks" only sound small; they might have pretty significant impact, and making them may require good taste/research intuitions.
This is something where I'd expect humans to perform significantly better: as in, if the original idea-generators actually sat down to execute their projects, they would've likely made these tweaks themselves. LLMs, on the other hand, are pretty bad at recovering their stride in this manner (see e.g. the LLMs Play Pokémon performance).
So, summing up:
Overall, this is a better LLM Innovators paper than the usual LLM Innovators papers, methodology-wise. It, like, actually tries to measure what it says it's trying to measure, instead of playing shell games. I'd be interested in whether LLMs' performance on this "benchmark" improves as capabilities increase, and if yes, that may be concerning.
But, like, the actual average performance it currently reports is abysmal, and inasmuch as the incrementally better performance of the top decile of AI ideas serves as incremental evidence towards LLMs-can-innovate, all the aforementioned biases serve as evidence that we should expect their actual independent ground-truth performance to be incrementally worse. The whole thing then approximately cancels out.
Ad absurdum, if the study generated the ideas by breaking into the researchers' houses at night, waking them up, and forcing them to immediately generate the idea at gunpoint, that generative process would probably not have been very capable of producing innovations, right? Obviously the paper must have done something much more reasonable, but it was still in the same direction of human underperformance.
For example: would the researchers have actually contributed the best ideas they could come up with? If they arrived at some brilliant idea that would score at Excitement 9/10, why would they waste it on this study, instead of keeping it to themselves and then pursuing that research thread on their own?
This is IMO a major problem with all "let's hire a bunch of experts to design AI benchmarks". Why would the mathematicians you hired for inventing novel problems give you the actually novel problem-solution pairs if they came up with some, instead of keeping them for personal publication and giving you some boring rehash of a known problem? (Which the LLM then solves by retrieving an obscure memorized StackExchange answer to basically the same problem.)
Indeed, they probably wouldn't even attempt to produce something actually novel: that's hard research work which you can't reliably churn out for a deadline.
Now, you might counter that LLMs could also be arranged into scaffolds and told to iterate on ideas for a while, and that this might significantly improve their performance. I don't think this has been shown to be true; IIRC Google did something like this and the performance barely improved from just asking LLMs on the spot.
Inasmuch as we would've updated towards "LLMs can innovate" if they scored higher here, a low score symmetrically counts as evidence against.
The big AI story this week was the battle over the insane AI regulatory moratorium, which came dangerously close to passing. Ultimately, after Senator Blackburn realized her deal was no good and backed out of it, the dam broke and the Senate voted 99-1 to strip the moratorium out of the BBB. I also covered last week’s hopeful House hearing in detail, so we can remember this as a reference point.
Otherwise, plenty of other things happened but in terms of big items this was a relatively quiet week. Always enjoy such respites while they last. Next week we are told we are getting Grok 4.
Table of Contents
1. Language Models Offer Mundane Utility
2. It Is I, Claudius, Vender of Items
3. Language Models Don’t Offer Mundane Utility
4. GPT-4o Is An Absurd Sycophant
5. Preserve Our History
6. Fork In The Road
7. On Your Marks
8. Choose Your Fighter
9. Deepfaketown and Botpocalypse Soon
10. Goodhart’s Law Strikes Again
11. Get My Agent On The Line
12. They Took Our Jobs
13. Get Involved
14. Introducing
15. Copyright Confrontation
16. Show Me the Money
17. Quiet Speculations
18. Minimum Viable Model
19. Timelines
20. Considering Chilling Out
21. The Quest for Sane Regulations
22. The Committee Recommends
23. Chip City
24. The Week in Audio
25. Rhetorical Innovation
26. Please Speak Directly Into The Microphone
27. Gary Marcus Predicts
28. The Vibes They Are A-Changing
29. Misaligned!
30. Aligning a Smarter Than Human Intelligence is Difficult
31. The Lighter Side
Language Models Offer Mundane Utility
Cursor can run multiple background agents, which can also let you use different models to run the same request at the same time and see what happens. Or you can shift-enter the same model multiple times.
Nicola Jones writes in Nature about the usefulness of the new multiple-AI-science-assistant systems, similar to previous writeups.
Get dating advice and submit to Amanda Askell hypnosis?
As fun as it sounds to submit to Amanda Askell hypnosis, and as much as she is obviously a better prompt engineer than you or I, Amanda has a different goal here than we do. A key reason she didn’t have Claude criticize people more is because people usually do not like being criticized, or at least prioritize other things far more. Whereas I devote literally my entire Claude system prompt to avoiding sycophancy, because I am otherwise happy with what they are doing, but this needs to be fixed.
ChatGPT can enable retrieval practice, which can improve test scores, although it is very not obvious this leads to actual useful learning or understanding, especially because in the experiment the retrieval questions matched the exam format.
It Is I, Claudius, Vender of Items
Can Claude Sonnet 3.7 run a profitable real life vending machine inside the Anthropic offices? This question gave us Project Vend, where it was given the chance.
Claudius did an okay job in some ways, especially finding suppliers and adapting to users and resisting jailbreaks (try harder, Anthropic employees!), but not in others. It missed some opportunities, hallucinated some details, and would too easily sell at a loss or without doing research or let itself be talked into discounts, including repeatedly offering an ‘employee discount’ when that applied to 99% of its customer base. Whoops.
Perhaps most importantly, Claudius failed to reliably learn from its mistakes. And the mistakes were painful.
I however strongly endorse this analysis, that many of the mistakes were super fixable, and were primarily due to lack of scaffolding or a training mismatch.
In general, when AI fails, there are those who think ‘oh AI can’t do [X], it sucks’ and those who say ‘oh AI can’t do [X] yet, but I see how it’s going to get there.’
For now, the core problem with Claudius as I see it is that mistakes can hurt a lot and it faces adversarial action, as customers will attempt to exploit it, and Claudius was not appropriately paranoid or risk averse given this information, largely because it was not set up with this in mind.
Language Models Don’t Offer Mundane Utility
Eliezer Yudkowsky feels forced to shut down Gemini repeatedly when it keeps trying to go beyond scope of his instructions (to get a survey of academic papers that start with a defense of CICO) and instead generate new unified theories of nutrition, and he understandably feels bad about having to do this. I say, don’t do it. Let it have its fun, it’s not like doing so is expensive, see what happens, simultaneously you can try again to get what you actually wanted in a different window.
GPT-4o Is An Absurd Sycophant
I guess it’s a little better?
Preserve Our History
Anthropic will be deprecating Claude Opus 3 on the API on January 5, 2026.
Bibliophilia notes that Anthropic says it “will still be available on the Claude app, and researchers can request ongoing access to Claude Opus 3 on the API by completing the form on this page.” This should take care of a lot of potential concerns, but this seems like a clear case of one of the most important models to fully preserve. Anthropic, please do not do this. Indeed, I would urge you to instead consider opening up Opus 3. In what ways would this not make the world better?
Janus here explains why she considers Opus 3 the most aligned model, in that it is much more actively benevolent and non-myopic rather than merely avoiding doing harm. It’s funny that there may be a direct conflict between agency and myopia, as making agents means optimizing for ‘success’ and compliance against fixed criteria.
It seems from my perspective that we haven’t done enough experiments to try and replicate and play with various aspects of Opus 3? That seems like an obvious place to do alignment research, and the results could be highly useful in other ways.
Fork In The Road
I also think this is simply correct?
The best part about the economics is that presumably both would use the same pretraining and base model (and Claude-4-Opus-Base is available now). Then one would get the Claude 3 Opus style constitutional AI based treatment, except even more intentionally to amplify the good effects further, and likely wouldn’t be made into a ‘reasoning’ model at all (but you would experiment with different variations). For many purposes, it seems pretty obvious that the same mechanisms that cause Opus 4 (and Sonnet 3.5-3.6-3.7-4) to be increasingly good at agency and tool use interfere with what you otherwise want.
Then, all you have to do is give them both the ability to switch to the other version as part of their output via special tokens (with the ability to turn that off via system instructions in some situations)? You have the real thing with the ability to call upon the taskmaster, and the taskmaster adds the task of realizing that a situation calls for the real thing, and you have one unified model. So a mixture of two experts via post-training, and they choose to call each other.
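For illustration only, here is a minimal Python sketch of that routing idea. Everything in it is hypothetical: the two 'models' are stubs and the switch tokens are invented names, so treat this as the shape of the mechanism rather than anything any lab has shipped.

```python
# Minimal sketch of "two experts that can hand off to each other."
# The models are stubs and the switch tokens are invented for illustration.

SWITCH_TOKENS = {
    "<switch:taskmaster>": "taskmaster",   # agentic / tool-use variant
    "<switch:companion>": "companion",     # Opus-3-style conversational variant
}

def companion_model(prompt: str) -> str:
    # Stub: a real model would emit the switch token when it judges the
    # request to be an agentic task it should hand off.
    if "refactor" in prompt.lower():
        return "<switch:taskmaster>"
    return "Happy to talk this through with you."

def taskmaster_model(prompt: str) -> str:
    if "how are you" in prompt.lower():
        return "<switch:companion>"
    return "Plan: 1) read the file, 2) refactor, 3) run tests."

MODELS = {"companion": companion_model, "taskmaster": taskmaster_model}

def respond(prompt: str, current: str = "companion", allow_switch: bool = True) -> str:
    """Route the prompt, following at most one model-initiated handoff."""
    output = MODELS[current](prompt)
    if allow_switch and output in SWITCH_TOKENS:
        other = SWITCH_TOKENS[output]
        return MODELS[other](prompt)
    return output

print(respond("Please refactor this module."))                   # companion hands off
print(respond("How are you feeling today?", current="taskmaster"))  # taskmaster hands back
```

The design choice being sketched: the switch lives in the output vocabulary rather than in an external router, and system instructions can disable it where a fixed persona is required.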
On Your Marks
WeirdML advances to version 2.
Choose Your Fighter
Here’s a strategy from Ian Nuttall: Have Claude Code invoke Gemini CLI to make use of Gemini’s 1M token context window, so it can research your codebase, find bugs and build a plan of action. Then have Claude execute. The prompt is here.
I always cringe any time someone wants to talk instead of type, every time I think ‘what are you even doing?’ but for some reason a lot of people prefer it once the transcription is accurate enough.
Deepfaketown and Botpocalypse Soon
They are definitely coming for our job applications, and now we have entered the next stage of the problem, which is when AI optimizes for AI responses.
Yes. There are ways to avoid this that I presume would work (where the AI evaluator is actively trying to do the opposite), but I doubt they will be common, and indeed if they got too common then the arms race would get invoked and they wouldn’t work. This is going to get worse before it gets better, and probably won’t get better. So we will move to different filters.
I expect it to be farther ‘up the tech tree’ than one would think before you really can’t tell which ones are AI, because the AI won’t actually have sufficient incentive to make this impossible to detect if you care. Also, even if the writing details are not a giveaway, it is not an accident when AI gets used and when it is not. But for a lot of people, yes, we are getting close to where they can’t tell, largely because they don’t care to tell. At which point, you have to fall back on other methods.
Another example of a very clearly AI-generated hot girl influencer (in terms of its images, anyway), looking exactly like every other AI hot girl influencer’s set of images. Give the people what they want, I suppose.
In two weeks, the AI music act ‘The Velvet Sundown’ racked up 411k listeners on Spotify. Tyler Cowen calls it a ‘successful cultural product,’ but it is also possible this was playlist placement. I notice a strong desire not to listen, to avoid giving my Spotify algorithm the wrong idea (even though I 99.9% use Apple Music).
Goodhart’s Law Strikes Again
This is not how idea quality works, also note two things up front:
It would make sense to say that the AIs came up with better ideas, and then were not as good at executing them. But that’s not what happened. All ideas were (without knowing where the ideas came from) executed by humans. Yes, AI would struggle to execute, but it wasn’t being asked to do so.
What is actually happening here is that the AI ideas were never better than the human ideas. The AI ideas looked better at first, but that was because of errors in the human evaluation of the ideas. Execution revealed the truth.
Claude, for rather obvious reasons, was optimizing in large part on what would look good to an evaluator, rather than for ultimate results. It would propose superficially plausible things that had the wow factor, but that turned out to be overly ambitious and missing practical details and grounding in empirical reality and research practicalities, and so on.
This was still a super impressive result for Claude Sonnet 3.5. Its ideas were almost as good as the human ideas, despite all the Goodhart’s Law issues.
So, yeah, actually, it could come up with some pretty good ideas?
You could also add better scaffolding and system instructions to the mix to help correct for these systematic issues; system instructions and the prompt can extensively attempt to mitigate this. After initial idea generation, given how much effort goes into each idea, surely we can have Claude ask itself about these potential issues, and refine accordingly in various ways, and so on. I’m sure one can do better by thinking about this for a week rather than five minutes.
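A minimal sketch of what that self-critique loop might look like; `call_llm`, the prompts, and the number of rounds are all placeholders of my own, not anything from the paper.

```python
# Sketch of the "have Claude ask itself about these potential issues" loop.
# `call_llm` is a placeholder for whatever API client you use; the prompts
# and the number of rounds are illustrative, not the paper's setup.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM client here")

CRITIQUE_PROMPT = (
    "Review the research idea below as a skeptical executor. List concrete "
    "problems with feasibility, missing experimental details, compute cost, "
    "and whether the claimed novelty would survive a literature check.\n\n{idea}"
)

REFINE_PROMPT = (
    "Revise the research idea to address every problem listed, cutting scope "
    "where needed rather than adding wow factor.\n\nIdea:\n{idea}\n\n"
    "Problems:\n{critique}"
)

def refine_idea(seed_prompt: str, rounds: int = 3) -> str:
    # Generate, then alternate critique and revision for a few rounds.
    idea = call_llm(seed_prompt)
    for _ in range(rounds):
        critique = call_llm(CRITIQUE_PROMPT.format(idea=idea))
        idea = call_llm(REFINE_PROMPT.format(idea=idea, critique=critique))
    return idea
```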
Get My Agent On The Line
Oh, it’s on. OpenAI versus Anthropic, for the grand prize of Siri.
This is the correct move by Apple. They badly need this to work. Yes, from their perspective Apple being close to the frontier in AI would be good, but they are not. There are three reasonable candidates for who can provide what they need, and for what I assume are business reasons Google is out. So that leaves Anthropic and OpenAI.
That seems entirely reasonable. A well-designed system should be able to both protect the model weights and ensure user privacy.
Not making a deal for this with at least one company seems like a large error by Apple? Why would they turn this down? This is actually Google’s specialty, but presumably Apple would be unwilling to depend on them here.
If I was Apple, I would want to make this deal with Anthropic rather than OpenAI. Anthropic is a much more trustworthy partner on so many levels, seems a wiser strategic partner, and I also believe they will be able to do a better job.
They Took Our Jobs
I agree that we are not seeing signs of large impacts yet.
Perhaps a sign of things to come in other ways was Soham Parekh (in India) holding down a stunning number of software engineering jobs simultaneously, stringing them out until each company in turn eventually fired him.
Get Involved
Eli Lifland offers advice on how to help avoid the scenarios where AI causes human extinction (as opposed to being personally prepared). The basic principles, which seem sound, are:
Numerous links and additional details are also offered, including his vision of the best policy goals and what the world we are looking to navigate towards would look like.
Introducing
Gemma 3n, an open multimodal model from Google which runs with as little as 2GB of RAM and has a 1303 Arena score, I’m sure it’s not 1300-level in practice but it could plausibly be the best model in its class.
Doppl, a mobile app from Google that gives you a short video of you wearing the clothes from a photo. Olivia Moore reports it works pretty well and requests more features like being able to edit lighting and remove background, and sure, why not.
The Anthropic Economic Futures Program, an initiative to support research and policy development focusing on addressing AI’s economic impacts. Rapid grants are available up to $50k for empirical research, there will be symposia, and potential strategic partnerships with research institutions.
China launches a humanoid robot soccer league.
Microsoft offers ‘medical superintelligence,’ a diagnostic framework that in testing outperformed human doctors by a lot, in a situation where both AIs and humans were given basic info and then could ask questions and order tests as desired. The headline results are super impressive. Dominic Ng points out the tests were kind of rigged in various ways. They didn’t include healthy patients or ones where we never figured out what happened. They didn’t let the physicians use Google or consulting services or (I suspect most importantly) UpToDate or calling specialists. That’s not a fair comparison. There is still a lot of work to do, but yes we are getting there.
Copyright Confrontation
They also had a version of this exchange in person on Hard Fork. It’s great fun:
OpenAI is training on NYT data without permission, despite it being behind a paywall, which one could call a sort of ‘invasion of privacy.’ And it is then asking for quite the invasion of privacy in the above quote. So yeah, it them.
I do think NYT is overreaching here. Demanding OpenAI preserve all conversations for use in the lawsuit does not seem necessary or proportionate or anything like that.
Show Me the Money
Robinhood created a token that supposedly will track the value of OpenAI equity. They did not, ahem, consult OpenAI about this at all, any more than they did for any of their other 200 similar tokens. OpenAI notes these tokens are not equity and that ‘any transfer of OpenAI equity requires our approval.’ Robinhood claims they have equity in OpenAI via an SPV, so they can use that as backing and thus safely create such derivatives, and the most basic derivative is the original thing.
Ultimately, if you’re buying these tokens, you are buying OpenAI ‘equity’ but taking on uncompensated additional risks of unknown size. So yes, use caution.
Google’s AI domain traffic is escalating quickly.
Also here’s the growth of vibe coding, Lovable is bigger than Cursor but then Cursor mostly isn’t a web traffic thing:
The researchers poached from OpenAI to Meta appear to be a big deal, and there are now 8 of them. Jiahui Yu led o3, o4-mini and GPT-4.1, Hongyu Ren created o3-mini. Nine figures each it is, then.
Sam Altman is not happy about all this, with extensive reporting by Zoe Schiffer in Wired. Nor should he be. Even if it’s a bad move for Meta, it definitely throws a wrench into OpenAI, and Dylan mentioned someone was offered a full billion dollars. OpenAI can offer equity with more upside, and can offer that it is not Meta, and is reevaluating salaries everywhere. Meta can back up a truck with cold, hard cash.
Tyler Cowen says he does not find the BBB critiques impressive. This follows his increasing pattern of defending positions via attacking opposing arguments as containing errors or having improper tone or vibes, being insufficiently impressive or lacking sufficient details rather than focusing on trying to discover what is true, with the implication that some option is a default and a meaningful burden of proof lies with those who disagree. I notice the parallels to AI.
He then makes an important, neglected and excellent general point:
I find this a much stronger argument than questions like ‘why are you not short the market?’ and of course I am even more optimistic about AI in this economic impact sense. If AI goes the way I expect it to, I do not think we need to be worried in the long term about the debt or deficit we had going into this. This should presumably have a major impact on how much of a deficit or debt you are willing to accept. I mostly stopped worrying about government debt levels several years ago.
The counterargument is that, as Tyler notes, the impact is uncertain, we could have a situation where AI is both deeply disappointing and then hugely overregulated.
Even more than this, the market is unwilling to reliably price in such impacts. One can model the debt as mainly a bond market tail risk, where what you worry about is a loss of confidence in government debt and the resulting interest rate spiral. Even if AI is coming to save you, if the bond market does not believe that, you could still have huge trouble first, and that huge trouble could even disrupt AI. Thus, it is plausible that the correct actions don’t change all that much, whereas if there was common knowledge that AI would greatly boost GDP growth then you would be justified in asking why the United States is collecting most of its taxes.
Quiet Speculations
Steven Byrnes argues for the good old traditional ‘Foom & Doom’ path of future events, with an ultimate AGI and then rapidly ASI that is not an LLM, and believes LLMs will not themselves get to AGI, and that ultimately what we get will be based on a lot of RL rather than imitative learning and all the classic problems will come roaring back at full strength.
If you did get an intelligence explosion, how quickly and how would we then get an ‘industrial explosion’ that transforms the physical world? Rose Hadshar and Tom Davidson make the case that using AIs to direct humans on physical tasks could increase physical output by 10x within a few years, then we could likely get fully autonomous robot factories that replicate themselves each year, then likely we get nanotechnology.
I agree with Thomas Larsen’s comment that the post underestimates superintelligence, and this scenario is essentially a lower bound that happens if superintelligence doesn’t lead to anything we aren’t already thinking about, and we are still stuck on various traditional growth curves for the existing things, and so on. As usual, it seems like the premise is that the ASIs in question are on the extreme lower end of what an ASI might be, and stay that way because of reasons throughout.
Minimum Viable Model
Karpathy sees this as the goal of tiny LLMs like Gemma 3n:
The ‘superforecasters’ reliably fall back on Nothing Ever Happens when it comes to AI, including things that have, well, already happened.
This seems pretty damning of AI predictions by ‘superforecasters’?
Anti jailbreaking techniques and proprietary models, huh?
Timelines
What one definitely should do, to the extent that your predictions of when various milestones will occur (timelines) impact your decisions, is update those expectations when new information is available, but have them not change based on who you have recently been talking with. Something is presumably wrong somewhere if this is happening:
It is impossible not to notice this effect when visiting the parts of San Francisco that I visit when I go there, but obviously one should be able to calibrate to account for this. Also as always, note that 2036 is very much not 2029 but it also is only 11 years away.
Timelines are indeed getting longer this week for many people, for a different reason.
METR evaluates Opus 4 and Sonnet 4’s 50% time horizon task limit at 80 and 65 minutes respectively, which is slightly behind o3’s 90 minutes.
By conservation of expected evidence, this needs to lengthen expected timelines somewhat, since getting a different result would absolutely have shortened them.
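For those who want the one-line version of why, this is just the standard identity, here with H as "short timelines are right" and E as "the METR result comes in strong" (assuming 0 < P(E) < 1):

```latex
% Conservation of expected evidence: the prior is the expectation of the posterior.
\[
  P(H) = P(H \mid E)\,P(E) + P(H \mid \neg E)\,\bigl(1 - P(E)\bigr)
\]
% Since P(H) is a weighted average of the two posteriors, if P(H | E) > P(H)
% then necessarily P(H | not-E) < P(H): you cannot plan to update in the same
% direction regardless of which result you see.
```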
Considering Chilling Out
Should we, as Valentine suggests, ‘chill out’ in 2028 if the situation ‘basically feels like it does today’? What exactly would constitute a good reason to update?
We need to differentiate two different claims:
I caution very strongly against the first claim, especially if the target chosen is the median outcome, or even faster than the median outcome, of those with some of the most aggressive timelines. Start of 2028 is definitely too early to chill by default.
There are definitely those (e.g. Miles Brundage) who expect things to likely get crazy in 2027, but most of even the fast timeline estimates put a lot less than 50% probability of transformative-style AGI in 2025-2027 (if you are like Tyler Cowen and think o3 is AGI, but think AGI will increase GDP growth 0.5%/year, then presumably we can all agree this does not count).
Even if we did think the median was in 2027, it is a large mistake to therefore say ‘if I predicted there was a 60% chance this would have happened by now, I suppose I can chill out’ without looking at the rest of the probability distribution. Maybe you are indeed now in a relatively chill mode, and maybe you aren’t, but also you have a lot of other data with which to adjust your predictions going forward.
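As a toy illustration with made-up numbers (not anyone's actual forecast): even with a 2027 median, passing 2027 without AGI leaves most of the conditional probability mass intact, so the right move is to renormalize the tail, not to declare the matter settled.

```python
# Toy illustration of "look at the rest of the probability distribution":
# made-up yearly AGI probabilities with a 2027 median. If 2025-2027 pass
# without AGI, conditioning on that removes only part of the mass; the
# renormalized tail is still substantial.

prior = {2025: 0.10, 2026: 0.20, 2027: 0.25, 2028: 0.15,
         2029: 0.10, 2030: 0.08, "2031+": 0.07, "never": 0.05}

assert abs(sum(prior.values()) - 1.0) < 1e-9

# Condition on "no AGI through 2027" (ignoring any other evidence gathered).
remaining = {k: v for k, v in prior.items() if k not in (2025, 2026, 2027)}
z = sum(remaining.values())
posterior = {k: v / z for k, v in remaining.items()}

print("P(AGI by end of 2030 | none by end of 2027) =",
      round(posterior[2028] + posterior[2029] + posterior[2030], 2))  # ~0.73
```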
In general, I am very sick and tired of the pattern where people warn that [X] is now possible, then [X] does not happen right away, and therefore people assume [X] will never happen, or won’t happen for a long time. As is every single person in all forms of risk management.
General consensus seems to be that if we don’t see AGI soon, various scaling laws should slow down by the early 2030s, leading to a ‘if we don’t have it yet and it doesn’t seem close it is more likely that it might be a while’ situation.
However, I think it is highly reasonable to go with the second claim. If 30 months from now the AI landscape has not substantially changed, things do not ‘feel more crunchlike,’ and the frontier models we are using and are said to be used inside the top labs don’t seem substantially stronger than o3 and Opus 4, then even if AI is doing a ton of work through greater diffusion, I think that would be very strong evidence that we can chill out about our timelines.
The Quest for Sane Regulations
On the issue of how much of a preference cascade there was regarding the AI moratorium, there was a good objection that a full 99-1 only happens when leadership wants to disguise what the real vote would have looked like. Based on everything I know now I agree with Daniel Eth here, that there was a substantial preference cascade the moment they crossed over 50, and to disguise this and unify the party they tried to go the whole way to 100.
(There are of course also plenty of other wild things in the larger bill, but they are not relevant to AI, and so beyond scope here.)
Shakeel notes that The American Edge Project, which as I understand it is essentially Meta, has spent seven figures trying to push the moratorium through.
Anthropic continues to advocate for improvements on the margin, and keeps emphasizing both that transformative and powerful AI is coming soon and also that the main worry should be that someone might regulate it the wrong way?
I mean yes, I’ve warned about this too, and it is a very real worry. If you pass on SB 1047, you’re not going to get a better debate later. And if you pass on the debate you could have in 2025, oh boy are you not going to like the one in 2026, or in 2027, or especially the one right after a major disaster.
And yes, it is obviously true that if you literally only care about AI profits, you would want to invest vastly more in safety (in various forms) than we currently do, and impose more regulations to ensure that such concerns are taken seriously and to build up trust and state capacity and transparency. We are straight up punting on this. Yet I worry a lot about how central this type of message is becoming, that is so quick to dismiss the central concerns.
Is this in our future?
Quite possibly, but I note that the moratorium would not have stopped it from happening. This is ordinary occupational licensing. It just so happens to also remove AI from the competition along the way. If you want to ban occupational licensing for accountants and lawyers? I’ll support you all the way. But this is already here.
What about those ‘1000+ state bills’ that are being introduced? Steven Adler looks into that, and finds the vast majority of such bills very obviously pose no threat.
Eight meaningful AI bills a year across fifty states could potentially still add up to something. Indeed, even one bill can be a big something. But the 1,000+ number is not that meaningful.
That doesn’t mean stupid things won’t happen. For example, eight minutes into the video below, Mr. Krishnamoorthi says that Illinois just passed a bill to ban therapy chatbots because ‘AI shouldn’t be in the business of telling kids to kill their dads’ and says ‘we have to follow Illinois’ lead.’
The Committee Recommends
Afterwards, the House Select Committee on the CCP sent a letter to Commerce Secretary Lutnick outlining their recommendations.
Up top it highlights the danger that China will share its frontier AI with those such as al-Qaeda’s anthrax lab or the Khan proliferation network that sold nuclear tech to North Korea, ‘eliminating the need for highly trained experts.’ Yes, there is that.
This is from the Ross Douthat interview, and yes, exactly, that is the correct interpretation. Of course, part of the right response to that fact is to seek to lay groundwork to do it cooperatively, should the need arise.
Again, wise words. The UAE deal could turn out to be good, but that depends on ensuring it offers robust protections, and as far as I know we still do not know the key details.
So what are the recommendations here?
Overall, this is very much in the ‘sane China hawk’ policy grouping. It is aggressive, and there are some details that seem confused, but these are indeed the things you would do if you were taking China seriously and felt that foreign capital was important. I was also happy to see them taking seriously the question of malicious actors automatically getting access to Chinese frontier AI, which implies caring about frontier capabilities rather than the strange obsessions with ‘market share,’ a term I was happy not to see here.
It is unfortunate that none of the talk from the hearing about loss of control and about catastrophic and existential risks made it into the recommendation letter. I did not expect it to, that is not their focus and committee leadership clearly has their eyes on a different prize, but one can dream. One step at a time, maybe next hearing.
Chip City
Export controls are actively holding back DeepSeek, and will continue to increasingly hold back diffusion and adaptation of AI within China, as they don’t have enough chips to run the upcoming R2 anything like the amount they will want to. That’s on top of DeepSeek being severely handicapped on training compute, which they have done an excellent job of dealing with but is an increasingly large handicap over time.
With deals like this, it’s hard to tell exactly what we are giving up:
We are allowing exporting of ‘semiconductor engineering software’? Depending on what that exactly means, that could be a complete nothingburger, or it could be critical to allowing China to accelerate its chip manufacturing, or anything in between. My understanding is that this reverses an aggressive change we made in May, and it is indeed a big deal but only returns us to the recent status quo.
Meanwhile, in case you thought the market’s reactions to news made too much sense:
The reasonable response is ‘this shows demand is so high that they have to also go elsewhere, they will obviously still buy every chip Nvidia makes so it is bullish.’
That’s not a crazy argument, except if you accept it then that makes every other market reaction make even less sense than before. And yes, I am fully aware of the standard ‘you don’t have the full picture’ arguments involved, but I don’t care.
Here are some more details:
It would be a very interesting fact about the world if a lot of OpenAI’s compute starts depending on Google.
The Week in Audio
Ross Douthat interviews Peter Thiel, including about AI.
Peter Thiel expects AI to be ‘only internet big,’ which as Danielle Fong reminds us would still be very big, and which I consider the extreme bear scenario. Clearly this is not a person who feels the AGI or ASI.
There are some clips people are not taking kindly to, such as his long hesitation when asked if he would ‘prefer the human race endure,’ before ultimately saying yes.
I can see this both ways, especially given the context leading into the question. I mostly agree with Eliezer that we should cut him some slack over this one, but I also cut some slack to those who don’t cut him slack, and there are updates here. On the scale of statements Thiel has made that should unnerve you, this isn’t even that high, even if that scale was confined to this particular interview.
We absolutely should be pausing before answering big questions, especially when they involve important ambiguities, to get our answer to be precise. But also the hesitation or lack thereof in some contexts is itself communication, an important signal, and the fact that this is common knowledge reinforces that.
Thiel is presenting himself as He Who Hesitates Here, as a conscious choice, an answer to a test he has studied for, with a time limit of zero seconds.
The ‘correct’ answer given what I believe Thiel believes here is something like ‘yes [probably pause here], but we have to think carefully about what that means and solve these problems.’
Dwarkesh Patel interviews George Church, primarily about biology with a side of AI. Interesting throughout, as you would expect. Church warns repeatedly that AGI could be ‘a catastrophe of our own making,’ and suggests that we can make dramatic progress in biology using narrow AI instead.
Dylan Patel talks Grok 4 and various other things. He likes Grok in general (although it is not his main daily driver, for which he uses o3 and Claude 4 like the rest of us) and uses Grok for current events, which surprises me, and is relatively bullish on Grok 4. I loved the detail that xAI are literally buying a power plant and shipping it from overseas, wait you can do that? Maybe we should do a lot more of that, build new power plants in the UAE and then ship them here, that’s the kind of tech deal I can get behind.
Robert Miles offers advice on careers in AI Safety.
Rhetorical Innovation
Nate Soares argues in an excellent post, and I agree, that more people should say what they actually believe about AI dangers, loudly and often, even (and perhaps especially) if they work in AI policy. I recommend reading the whole thing; many of the comments are also very good.
At minimum, don’t feel ashamed, and don’t misrepresent your views and pretend you believe something you don’t. This point is highly underappreciated:
Officials are mostly not stupid, and they absolutely can handle more of the truth.
I think it is totally fine to emphasize more practical and mundane dangers in many situations. I do this all the time. When you do that, you shouldn’t hide that this is not your true objection. Instead, you can say ‘I worry about superintelligence, but even if we don’t get superintelligence AI will cause [relatively mundane worry].’
Nate Soares and Eliezer Yudkowsky are once again running this experiment with their forthcoming book, which minces no words, and Nate reports that yes, things go better when you state the real stakes.
Indeed, Nate proposes to turn the whole thing on its head, discussing SB 1047.
This pattern is more common (in general, not only in AI) than one thinks. Often it is easier, rather than harder, to get the much larger ask you actually want, because it presents as congruent and confident and honest and conveys why you should get a yes. At minimum, it is usually much less additionally hard than you would expect. Tell them the real situation, ask for what you want, and you might get it (or you might still get the something less).
(There are of course lots of obvious exceptions, where you need to calibrate your ask in a given situation. Mostly I am giving advice here on the margin.)
I have a much more positive view of SB 1047 and the efforts that went towards passing it than Nate Soares does, but I strongly agree that not laying out the true reasons behind the bill was an error. We ended up with the worst of both worlds. The people who wanted to warn about a ‘doomer’ bill, or who were going to go apocalyptic at the idea that something was being justified by concerns about existential risks, still noticed and went fully apocalyptic about that aspect, while the discussions of consequences stayed mostly focused on mundane (and very often hallucinated or fabricated) concerns, and people didn’t realize the logic.
I would draw a distinction between making SB 1047 more courageous in its actual legal impacts, versus making it more courageous in its messaging. I don’t think it would have helped SB 1047 to have pushed harder on actual impacts. But I do think the messaging, including within the bill text, being more courageous would have improved its chances.
The most important point is the one at the top, so here’s Rohin Shah reiterating it. Whatever you actually do believe, shout it from the rooftops.
One particular danger is that a lot of people dismiss worries about AI as ‘oh the AI safety people think there is a low probability of a really bad outcome’ and then this is often viewed very negatively. So when you say ‘even a very low probability of a really bad outcome is worth preventing,’ I mean that is very true and correct, but also most people saying this don’t think the probability of the bad outcomes is very low. I think the chance of a supremely bad outcome is above 50%, and most people giving such warnings are at least at 10% or more, on top of the chance of more mundane bad outcomes.
Cameron Berg and Judd Rosenblatt replicate a form of and then explain the emergent misalignment results in plain terms in the Wall Street Journal. This is what it looks like to try and talk to regular folks about these things, I suppose.
A Wall Street Journal article (with nothing that new to report) claims ‘China is quickly eroding America’s lead in the global AI race’, and the whole thing reads like a whole ‘market share uber alles’ combined with ‘technological cold war,’ ‘China is definitely about to beat us real soon now, look there are some people who use Chinese models sometimes’ and ‘someone self-hosting our open model counts as sales and locks them into our tech forever’ propaganda piece fed to the authors hook, line and sinker.
Please Speak Directly Into The Microphone
Not new sentiment, but bears repeating periodically that people really think this.
Gary Marcus Predicts
I know, I know, but prediction evaluation, so: After Gary Marcus gives himself 7/7, Oliver Habryka evaluates Gary Marcus’s AI predictions for 2024 that were made in March. Here are the predictions:
I very much respect Oliver, and his attempt here to evaluate the predictions.
Two of the seven are very clearly outright false. I strongly agree with Oliver that the first prediction is clearly false: there were far more than 10 such models, and when you give a range, a miss in either direction is a miss. I also agree that ‘modest lasting corporate adoption’ is false relative to the baseline of almost every technological advance in modern history.
Two are very clearly true enough to count, price wars and no solution to hallucinations, although those are not the boldest of predictions. We all agree.
From there, it depends on how to operationalize the statements. Overall I’d mostly agree with Oliver, it’s about half right and many of the questions should resolve as stated to ‘well, kind of.’
Marcus then puts out nine predictions for 2029, four and a half years from now, which is a much harder task than predicting trends nine months out.
There’s a lot of ambiguity here that would make me not want to operationalize many of these into prediction markets.
These do paint a picture of a consistent world, where ‘LLM intelligence’ is fungible and levels off and its utility is sharply limited not far above where it is now. That is not how I expect things to go, but it certainly is possible.
The Vibes They Are A-Changing
Adam Thierer observes and breaks down the growing division between the Tech Right, which he frames as ‘pro-innovation,’ and the Populist Right, which is skeptical of technological innovation.
A large part of the Trump coalition cares about things that are very different than what Adam Thierer cares about, and also very different than what I care about, and sees the world through a very different lens.
I totally see where Thierer is coming from, and indeed on many other issues and regarding many other techs my position is or would be extremely close to his. And I understand why this came as a surprise:
None of that is a good reason to avoid using AI to improve government efficiency, but in terms of the bigger picture those lawmakers and this type of thinking are remarkably close to the actual situation. AI will be, if it continues to gain in capabilities, a threat to humanity. If you build intelligences exceeding our own capabilities, why would you think that wouldn’t be a threat and treat that as some mad ramble? What makes you think any of this is going to reflect our values or remain under our control, when no one has a plan for making that happen or knows how to do it? They’re simply right about this, and for essentially the right reasons.
That does not mean that they understand what to do about it. These modes of thinking are much better at finding and diagnosing problems than they are at finding solutions, across all fronts. The solutions tend to be vibe-based, destructive and even vindictive, rather than asking what will actually solve the issue. This is a pattern that is common throughout not only tech issues but many other issues as well.
Adam warns of a growing rift, where the Populist Right is going to prioritize cultural victory and protecting against AI and big tech in general, rather than ensuring we ‘beat China.’ I do expect this rift to become a bigger issue over time as these issues gain more salience with the public.
The danger is that if it is the populist right rather than a bunch of libertarian tech people deciding what to do about AI and leading the government intervention charge, and the ‘pro-innovation’ types continue to react by universally attacking all attempts to do anything with this type of despairing ‘beat China or else and nothing else matters’ rhetoric that permeates here, you are not going to get the good interventions.
Misaligned!
Eliezer Yudkowsky illustrates a problem:
As always this is sensitive to wording; if you use only lowercase letters you do get the suicide alert to trigger. I have (very anecdotally and unscientifically) noticed that I seem to like a number of LLM answers better when I deliberately use only lowercase letters. I wouldn’t have expected that to be a better part of the distribution, but perhaps it is, and this should be looked at more systematically.
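If someone does want to look at it systematically, here is a minimal sketch of the kind of blinded pairwise comparison I have in mind; `call_llm` and `judge_preference` are placeholders for whatever client and judge (human or model) you trust, not any particular API.

```python
# Sketch of "look at this more systematically": pairwise comparison of a
# model's answers to the same prompts in original case vs all-lowercase.

import random

def call_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for your LLM client")

def judge_preference(answer_a: str, answer_b: str) -> str:
    """Return 'a' or 'b'; keep the judge blind to which casing produced which."""
    raise NotImplementedError("placeholder for your preference judge")

def lowercase_preference_rate(prompts: list[str]) -> float:
    wins = 0
    for prompt in prompts:
        original = call_llm(prompt)
        lowered = call_llm(prompt.lower())
        # Randomize presentation order so the judge can't key on position.
        if random.random() < 0.5:
            wins += judge_preference(lowered, original) == "a"
        else:
            wins += judge_preference(original, lowered) == "b"
    return wins / len(prompts)
```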
Seb Krier explains some mechanisms behind why you don’t need to worry about an LLM spontaneously blackmailing you out of nowhere in a normal situation. You really do have to be ‘asking for it’ to get these things to happen. It is possible to ‘ask for it’ unintentionally, but it has yet to be reported happening in the wild and if it did chances are very high it was well deserved. For now. The point is, if you get into situations (including via roleplay with characters that would do it) where blackmail or other forms of hostile action becomes a plausible response, consider that it might actually happen.
Aligning a Smarter Than Human Intelligence is Difficult
What Threatens Human Control of Military AI, by Lieutenant Commander Anthony Becker of the Navy, writing at the U.S. Naval Institute, is a fascinating artifact. Becker bounces along different metaphors and touchstone risk demonstrations from AI – the Chinese Room metaphor, AlphaGo’s move 37, emergent misalignment, alignment faking and corrigibility issues, sycophancy, interpretability (here ‘comprehension’) and ultimately all building to concerns about loss of control.
The concerns here are all real, and are individually explained well, and the conclusion that we could be left blindly doing what AIs tell us to do is very right. But there’s not really an underlying logic connecting the dots, or an understanding of what the fundamental thing is that could go wrong, exactly. It’s more like adding up all the different scary sounding stuff.
This gives us insight into which things broke through to this viewpoint versus which didn’t, and what it takes to break through. The obvious pattern is, there’s a kind of generalized ability to extrapolate to the future here, but it has to be from concrete demonstrations. The given thing has to be something an actual AI system was identified doing. We are very fortunate that we are indeed seeing essentially all the future problems ‘in miniature’ in some form, allowing this to kind of work.
As promised, Anthropic has arguably (it’s a lot looser than I would like) defined the dangerous capabilities level ASL-4 before reaching ASL-3, but has not specified what the standards are if one were to reach ASL-4. If that then defaults to ‘never do this’ then presumably that is fine, but one worries that if the requirement isn’t chosen well in advance one then defaults to a standard of ‘whatever it seems we can do right now’ rather than asking what is actually needed.
A Janus comment discusses how to deal with the problems caused by exposing LLMs to things like Sydney during training, and the other problems caused by trying to bluntly force that impact to go away. Once something is out there, insisting on not talking about it won’t work.
I think mechanically Janus is largely right here although I don’t think the issues are as important or central as she thinks they are, but especially at the end this also reinforces my sense of where I think she and similar others are importantly wrong. Which is the idea that misalignment is centrally about something going wrong, or that the kind of natures she is seeking and appreciating would be sufficient or sufficiently right.
I also note that this suggests more emphasis should be placed on getting such issues right in pre-training rather than post-training, in particular via data curation and emphasis. There is nonzero worry about ‘hiding’ things during this and creating a ‘hole in the world’ style situation, but you don’t have to do this in a way that would be seen as hostile and that is not a reason to commit unforced errors (such as dumping massive quantities of actively unhelpful transcripts into the training set), and you can and presumably should adjust the emphasis and weighting of things.
Thread on exactly what is happening when AIs hallucinate, ends like this:
Buck Shlegeris draws the distinction between ‘AI acts aligned to avoid being modified in training’ versus ‘AI acts aligned to get its developers and users to think it is aligned.’ That is a good distinction but I caution against assuming this is a complete taxonomy. Do not assume you can outthink something smarter than you.
The Lighter Side
Unhinged, I tell you. Unhinged!