Anthropic then gave us the big one, Claude Opus 4.5, which is for now the clear best model available, and remains my daily driver, both for chat and also in Claude Code.
Claude Opus 4.5 felt like a large practical leap, some like Dean Ball going so far as to call it AGI. I don’t agree but I understand where they are coming from.
Alas, Claude Opus 4.5 is likely on track to force METR to add new tasks to the benchmark because comments like this and this indicate that the benchmark itself is no longer as reliable as it once was...
P.S. If you inserted a video in the original post, then why did it become lost in cross-posting?
It’s that time. It’s been a hell of a year.
At the start we barely had reasoning models. Now we have Claude Code and Opus 4.5.
I don’t code. Yet now I cause code to exist whenever something about a website annoys me, or when I get that programmer’s realization that there’s something I am planning on doing at least three times. Because why not?
The progress has simultaneously been mind bogglingly impressive and fast. But a lot of people don’t see it that way, because progress has been incremental, and because we were reasonably expecting to often get even more than this.
The public conversation and debate, even more than before, was full of false narratives and active attempts to make the situation worse. The same goes for attempts to shape Federal policy towards AI, and OpenAI’s conversion into a for-profit.
It’s been, as they say, one battle after another, with many wins, many setbacks and a lot of things in between.
This includes the key developments in AI, and also other blog posts from the year that I consider memorable looking back.
This is only our corner of the world’s Year in Review, not one in general, thus things like Liberation Day are happening in the background and go undiscussed.
January
The confusions started in January, as we prepared for Trump to take office.
OpenAI had just given us o1-preview, the first reasoning model.
At the tail end of 2024, DeepSeek released v3, or The Six Million Dollar Model. This was a big advancement in open source and Chinese model capabilities, and showed that they were not as far behind as we thought they were, and also that damn good models could be trained on the cheap. Not as cheap as the headline number, since the six million was only direct costs of the final run, but still pretty cheap.
Then a few weeks later, DeepSeek gave us r1, a reasoning model based on v3. They wrapped this up into a nice clean free app experience, which included the first time most people could see a reasoning model’s chain of thought – Gemini Flash Thinking offered this too but almost no one knew about that or cared. This showed that the ‘secret sauce’ of building a reasoning model was not so difficult to copy, and the marginal costs of doing so were low.
DeepSeek shot to the top of the App store, and the world completely lost its mind. The stock market mini-crashed. People talked about how China had ‘caught up’ to America, or this meant inference would be so cheap no one would need Nvidia chips (as consumers rushed out to buy Nvidia chips to run DeepSeek r1), or how it would destroy margins and drive American AI out of business. I had to warn people, many times, with the classic advice: Don’t Panic, and I went on Odd Lots to discuss it all.
Collectively this was called The DeepSeek Moment.
White House rhetoric talked about how this meant we were in a ‘race’ with China, so of course any other considerations than ‘winning’ must be thrown out the window.
With time, those paying attention realized all of that was overblown. DeepSeek was impressive as a lab, and v3 and r1 were excellent models, but still on the order of eight months behind OpenAI, Anthropic and Google. We had been comparing the relatively best features of r1 on their own, and then using that to project into the future, which flat out did not happen. This happened at a crucial inflection point, right when reasoning models had started, which was when a tiny amount of compute could go a maximally long way.
Later on, r1-0528 did not have a moment, nor did DeepSeek 3.1 or DeepSeek 3.2.
February
Google started out the month introducing us to Deep Research, a new product form that would be copied by OpenAI, allowing the AI to take time to prepare a report. At the time, this was super impressive. It definitely has its uses, even if the timing is awkward and you have to push past the tendency to pad reports with a lot of slop.
A new paper on The Risk of Gradual Disempowerment From AI improved the debate by highlighting a central way that humans end up not being in charge. There doesn’t need to be some ‘AI coup’ or battle, the AIs will by default end up with more and more resources and power unless something stops this from happening. One day we wake up and realize we are not in control. Another day after that we don’t wake up.
OpenAI declared that its primary alignment strategy would be Deliberative Alignment, so I analyzed that approach. I think it is helpful, but not a central solution.
The Administration made its AI feelings clear at The Paris AI Anti-Safety Summit. Previous summits had been efforts to lay foundation for international cooperation, with serious discussions of existential risks, in particular with The Bletchley Declaration. That was clearly over, transformed into a disdain for the idea that sufficiently advanced AI could be existentially dangerous, and Vance giving a speech demanding suicidal accelerationism and warning against attempts to not die.
The year would play out in similar fashion. We had some modest success in California and New York, but the White House would, under the influence of David Sacks, become an active force for interference with efforts to not die, and later even to beat China. They would do some pro-America things along the way, but also things that actively interfered with our competitiveness.
I introduced a key new concept handle which I call Levels of Friction. Different actions are variously harder or easier, from both practical and legal perspectives, to do. They range from Level 0 (defaults or requirements), to Level 1 (legal and ubiquitous and easy), Level 2 (safe but annoying), Level 3 (actively tricky or risky), Level 4 (actually seriously illegal) up to Level 5 (we really care about stopping you). Instead of thinking of a boolean of legal-illegal or possible-impossible, it is often more enlightening to consider moving between levels.
AI is going to move a lot of things to lower levels of friction. That is by default bad, but frictions can be load bearing, such as with job applications or limiting antisocial behaviors. It protects the commons. We will have to adjust quite a lot of things once key frictions are removed from the system.
February was the peak of ‘could Grok be a thing?’ It turned out not to be a thing. In other model news we got Claude 3.7.
We also got our first introduction to Emergent Misalignment, the idea that training the AI to do bad things associated with evil could lead it to generalize into thinking of itself as trope-style evil and doing a wide range of trope-style evil things.
March
A non-AI highlight was my piece on elementary education, School Is Hell.
GPT-4.5 was OpenAI’s attempt to give us a large and slow model. It did some cool things, and there are people that really liked it, but mostly it wasn’t worthwhile.
A big part of AI coverage is getting confident in dismissing hype. A great example of this was my coverage of The Manus Marketing Madness. Now that they’ve unceremoniously sold out to Meta, it’s easy to forget that a lot of people were hyping Manus as The Next Big Thing, as well as the next reason we would Lose To China.
I warned against using The Most Forbidden Technique, which is where you use interpretability to train on intermediate outputs, to teach it to think the thoughts you want it to think, thus teaching the AI to, like humans before it, hide its thinking.
Image generation had its first big moment, when the 4o image generator came online and everyone went Studio Ghibli crazy, taking advantage of both the advancement in quality and willingness to mimic styles.
Gemini 2.5 Pro came out, which I called the new state of the art. I think this was correct at the time, but later versions of Gemini 2.5 Pro were actively worse, and soon OpenAI would be back out ahead.
April
AI 2027 provided an illustrative scenario that presented a best guess as to what was likely to happen, with an alternative scenario option where things turn out well because a bold decision is made to slow down at a key moment. Scott Alexander and Daniel Kokotajlo explained the details on the Dwarkesh podcast, and I covered various responses.
Llama 4 was released, and turned out to be a total dud. Meta has been silent since in terms of topline AI products, while spending hundreds of millions on individual pay packages to try and gather the talent to get back in the game. It is a good thing Meta is struggling, given its bizarrely dystopian AI vision it is willing to give in public.
o3 put OpenAI firmly back out in front in reasoning, with excellent tool use, but was rapidly exposed as a Lying Liar that lies a lot.
OpenAI had other problems with GPT-4o. It was always an absurd sycophant that could get some of its users into trouble, but updates made around this time made it even more of an absurd sycophant, forcing a reversion to a previous build. I would later offer a postmortem.
May
OpenAI claimed that their conversion to a for-profit, which as announced then would clearly have been one the biggest thefts in human history, would leave the non-profit in control.
The White House had from the beginning made a huge deal out of how Just Awful the Biden diffusion rules were, just like it talks about everything Biden did, but it initially acted generally wisely on chip diffusion and export controls, including on the H20.
Alas, over time David Sacks got more control over their narrative and increasingly started spouting Obvious Nonsense About AI Diffusion, literally claiming that ‘beating China’ means maximizing Nvidia’s share of chip sales, and warning that China would step in with non-existent and otherwise greatly inferior AI chips to build its own ‘AI tech stack’ if we didn’t sell massive compute to partners with questionable loyalties. Initially this rhetoric and action was confined to sales to parties like UAE and KSA, where a case can be made if the deals and safeguards are good, and details matter. Later this would extend to trying to sell chips to China directly.
OpenAI released Codex to compete with Claude Code. Claude Code was such a stealth release, initially a side project of one employee, that it took a while to notice something was happening, and even longer for me to finally give it a try. Nowadays Claude Code might be most of my AI token usage.
Claude 4 put Anthropic back in the game.
I offered thoughts on those who use AI to cheat, especially in education.
Veo 3 gave Google the lead in video generation.
I wrote my first ‘Letting Kids Be Kids,’ I would later write another in December.
June
Dating Roundup #6 proved popular, and #7 did solidly too. I just put out #8 and #9.
I did an analysis of New York’s proposed RAISE Act, by Alex Bores who is now running for Congress. I concluded it was an excellent bill. It would later pass, although in somewhat weakened form because of Governor Hochul’s changes.
OpenAI and in particular Sam Altman continued to try and sell us on the concept of a Gentle Singularity, that AIs would become superintelligent and your life wouldn’t much change. This is of course Obvious Nonsense. Your life might become great, or it might end, or it might get into High Weirdness, but it won’t stay the same.
o3 Pro came out, and was very strong and not the lying liar that normal o3 was.
I came out with my (hopefully annual from here on in) blog recommendations.
July
The first attempt to pass a federal moratorium on AI regulation, as in tell the states they aren’t allowed to regulate AI because that should be federal while also not regulating AI at the federal level, came dangerously close to passing as part of the BBB. It was ultimately stripped out 99-1 once the tide had turned.
Congress had one of its finer hearings, where they asked good questions about AI.
Grok ran into trouble. No, Grok, No. Do not call yourself MechaHitler. Or worse.
Kimi K2 was an unusually impressive new open Chinese model. We would later get Kimi K2 Thinking in November.
Google and OpenAI got IMO Gold.
AI companions were getting a lot of attention, which has since died down. This will be a big thing at some point, and for some it is a very real thing, but for now it isn’t good enough to hold most people’s interest. I followed up again in August.
August
The big hyped release of the year was of course GPT-5. This would be their big moment to unify all their crazy model variations and names, and create one model to rule them all, with a router to think longer if and only if that was worthwhile. There were approaching death stars and we saw a variety of assertive valueposting. It was the big version number jump, and people expected a lot.
GPT-5 was a good model, I found it to be a clear upgrade, but it very much did not live up to the hype. Many even strongly wanted to keep GPT-4o for its far friendlier and more empathic attitude, or some would say its sycophancy – the very features that make GPT-4o not a great thing for many users are alas the reasons users often like it so much. I covered the basic facts and model card, then outside reactions and finally created a synthesis.
Unfortunately, the model OpenAI chose to call GPT-5 being a disappointing release gave so many people, up to and including David Sacks and Sriram Krishnan at the White House, the wrong idea. There is a constant demand for data points that say AI won’t advance much, that scaling is dead, that it will all be a normal technology and you don’t have to worry about AGI. Washington seems to have come away from the GPT-5 release with this message, and it plausibly did great harm in numerous ways, including to our export controls.
I tried to push directly back against this, pointing out that AI was continuing to make rapid progress, both around GPT-5 and various other misleading data points, especially the no-good, very-bad ‘MIT study.’ I followed up by pointing out that Yes, AI Continues To Make Rapid Progress, Including Towards AGI.
I noticed I was deeply confused about AI consciousness, along with everyone else. I still am, except now I’m more confused at a better, more advanced level. These questions are coming up more and more now, and I expect that to continue.
It’s so funny to have half of people debating AI consciousness, while the other half thinks AI is not making any progress.
I offered my advice around flying.
Are the AIs starting to take our jobs? Not in general, but for entry level jobs? Kinda.
September
I reviewed If Anyone Builds It, Everyone Dies. There were a few weeks where this inspired a lot of discussion, much of it remarkably good.
The month ended with Anthropic reclaiming its role as my daily driver thanks to Claude Sonnet 4.5.
There was more on AI craziness, then later in November we would see additional lawsuits against OpenAI related to suicides.
October
OpenAI meanwhile decided to release Sora and The Big Bright Screen Slop Machine, attempting to turn its good short video generator into a dystopian social network. I said the comparables were Google+ and Clubhouse. Call looks good.
I got to go to The Curve, which was an excellent conference.
One of the consequences of the GPT-5 release was more people talked about AI as potentially being in a bubble. I do not agree, other than in the nominal ‘number might go down’ sense. Number might go down, if not number needs to go up.
OpenAI completed its trio of overhyped releases with the Atlas browser. This jaded people sufficiently that when GPT-5.1 and GPT-5.2 later came out, people gave them remarkably little focus.
Andrej Karpathy went on Dwarkesh Patel and cautioned us not to get overexcited.
The biggest advantage America has over China is its access to vastly more compute. This is thanks in large part to our export controls. Alas, David Sacks it the AI Czar, acts like a de facto Nvidia lobbyist, and is trying to make us give that edge away.
Emboldened by prior success in getting authorization for H20 sales, Nvidia and David Sacks made their move, and came (based on what I know) remarkably close to getting America to commit quite a lot of civilizational suicide and sell B30A chips to China, essentially giving them close to chip parity. This would have been a completely insane move, and we should be thankful a combination of key people stepped up and prevented this from happening.
Unfortunately, although far less unfortunately than if we’d sold B30As, they then regrouped and in December would successfully push, despite it being obviously unwise and unpopular, for us to sell H200s to China. The Chinese are making a show of not wanting them so much, but it’s a show, and our edge has been substantially eroded. The logic behind this seems to have been nominally based in part on a prediction that Huawei can scale chip production far faster than credible predictions say, as in being off by an order of magnitude or more.
OpenAI finished its conversion to a for-profit, completing what I believe is arguably the second largest theft in human history behind the Russian oligarchs of the 1990s. The final terms came as the result of negotiations with the District Attorneys of Delaware and California, and they did extract a lot of highly meaningful concessions, both in terms of compensation and also in helping retain meaningful control and oversight over OpenAI. This could have gone so much worse. But as I said, that’s like a mugger demanded your money, and they got talked down to only taking half your money, then they claim they ‘recapitalized you.’ You’re still out half of your money.
November
We got what may be the final key revelations of what I call OpenAI’s Battle of the Board where the board attempted to fire Sam Altman, as we got Ilya Sustkever’s testimony about what happened. We now know that this was driven by Ilya Sutskever and Mira Murati, and was motivated by ordinary business concerns, centrally Sam Altman’s lying and mistreatment of employees.
I offered my 2025 edition of The Big Nonprofits Post, for those looking to donate, and would later share an update from my nonprofit, Balsa Research.
The year would finish with a flurry of new model releases.
OpenAI started us off with GPT-5.1, a modest upgrade that follows custom instructions well and often glazes the user, and then followed it up with GPT-5.1-Codex-Max, which was a substantial boost in coding power in particular.
Google gave us Gemini 3 Pro, a vast intelligence with no spine and also severe alignment issues and mental problems. It’s a great model, and was clearly now the best at a variety of uses, especially raw intelligence, or a teacher or whom you had questions with known answers that you would ask an autist.
Anthropic then gave us the big one, Claude Opus 4.5, which is for now the clear best model available, and remains my daily driver, both for chat and also in Claude Code.
Claude Opus 4.5 felt like a large practical leap, some like Dean Ball going so far as to call it AGI. I don’t agree but I understand where they are coming from.
December
I went to San Francisco for the Solstice, and wrote Little Echo.
I did the annual movie review.
We learned even more reasons to beware reward mismatches in RL.
OpenAI upgraded again to GPT-5.2, which I evaluated as Frontier Only For The Frontier. Its impressive benchmarks do not reflect its capabilities, and people reacted with fatigue after too many disappointing OpenAI model releases. It’s not an especially ‘fun’ model to interact with, nor is it especially fast, and it currently occupies a sweet spot only for tasks where you need a lot of raw thinking capability and are looking for ‘just the facts’ and cold analysis, and potentially for coding where everyone serious should try various models to see what works best for their tasks.
I offered a sequence of posts on why median wages are up, economists keep saying times are solid, yet young people keep saying things suck. Those complaining often say false things and use statistics wrong, but if so many people think things suck, then you know there’s a problem. I looked into cost changes over time, and when were various things the best. Finally, I presented my thesis, which was that this was due to the Revolution of Rising Expectations and the Revolution of Rising Requirements. Our expectations and comparison points are supremely high, as are the things we legally require of those looking to raise families.
Questions For Next Season
AI is going gangbusters. The news about it is accelerating, not slowing down. It’s going to increasingly impact our lives and be the topic of conversation. The model releases will come fast and furious. The agents will make big leaps in 2026, and not only for coding. It will likely be a major topic in the midterm elections. I don’t expect full High Weirdness in 2026, but you can’t fully rule it out.
Blog growth, in terms of views, stagnated this year. That’s disappointing, as previously I had experienced strong growth, and I likely need to explore additional ways to get the word out. But ‘number go up’ was never the ultimate goal and I am confident that I am directly reaching quite a lot of the people I care about reaching. I do intend to send out a user survey some time in the near future.
One big personal goal for 2026 is to do more coding and evergreen posting, going deeper into questions that matter or that I get curious about, and being better about organizing my thoughts, and to focus less on ephemeral items and news, and to finally get a handle on organizing what I do have to better create longer term resources. I am fully aware that almost all views happen within a few days of posting, but that doesn’t need to dictate anything, and there are some basic things where I could build permanent resources much better than I’ve been doing.
The other big goal is to focus on what matters, including the fights and debates that matter, making sure to do that in a way that adds to permanent resources and not let important things end up buried. I have to do better triage, especially in letting relatively unimportant matters drop. I intend to publish fewer words on the blog in 2026, and with that to become more willing to skip days. I know the amount of content can be overwhelming.
One thing that got lost in the shuffle this year, and illustrates the problem, was my planned review of Open Socrates. It’s a book warning you not to live your life 15 minutes at a time, and I didn’t finish my response because life kept throwing too much stuff at me. Well, that’s kind of the worst possible excuse not to finish that, isn’t it? Even if because of the delay I ultimately have to reread a lot of the book.
I also have a bunch of projects I’d love to try. We’ll see how that goes. But also movies to watch, and games to play, and people to see, and fun to be had. Life beckons.
And you know what? Life is pretty awesome. Other people sing Ald Lang Syne. I go to the Secular Solstice. My personal tradition, at year’s end, is something else entirely.
Happy New Year, everyone.