Claude Opus 4.5 achieved a 50% time horizon of about 4 hours 49 minutes, which METR thinks is lower than its true horizon.
However, it is worth taking other complications into account. Setting aside Cole Wyeth's comment, the two other highest-karma comments pointed out that the METR benchmark is no longer as trustworthy as it once was. If that is right, we will likely see GPT-5.2, GPT-5.2-Codex and/or Gemini 3 Pro display a lower 50% time horizon and a higher 80% horizon. Grok 4 also had a similarly elevated ratio between its time horizons (Grok 4 now sits at 109 minutes for 50% and 15 minutes for 80%, roughly 7x, while Claude Opus 4.5 has 289 minutes for 50% and 27 minutes for 80%, roughly 11x), but Grok 4, unlike Claude, was humiliated by the longest-horizon tasks.
Claude Opus 4.5 did so well on the METR task length graph they’re going to need longer tasks, and we still haven’t scored Gemini 3 Pro or GPT-5.2. Oh, also there’s a GPT-5.2-Codex.
At week’s end we did finally get at least a little bit of a Christmas break. It was nice.
Also nice was that New York Governor Kathy Hochul signed the RAISE Act, giving New York its own version of SB 53. The final version was not what we were hoping it would be, but it still is helpful on the margin.
Various people gave their 2026 predictions. Let’s put it this way: Buckle up.
Table of Contents
Language Models Offer Mundane Utility
A custom-designed, human-in-the-loop, proactive LLM-based mental health intervention had a positive effect in an RCT. There was significantly greater positive affect, resilience and social well-being. My presumption is that this was a highly conservative design due to ethical considerations. And that was using a system based on GPT-4o for 5-20 minutes a week. There is so much room for improvement here.
A lot of the benefits here likely came from implementation of low-hanging fruit interventions we know work, like having the system suggest journaling, gratitude exercises, mindfulness and social connection. We all know that stuff works. If an LLM-based scaffold actually gets people to do some of it? Great, that’s a huge win.
Results like this will not, as David Manheim suggests, prevent people from saying ‘but sometimes there are still bad outcomes’ or ‘but sometimes this ends up doing net harm,’ since nothing capable of working would prevent those risks entirely.
You can have Claude Code make objects in Unreal Engine on demand.
Seth Lazar on how he uses AI agents for philosophy. They automate everything around the thinking so Seth can focus on the thinking. He favors Opus 4.5 in Cursor.
Language Models Don’t Offer Mundane Utility
AI still struggles with design, largely because it lacks the context. You still have to figure out what to do or what problem to solve, on a sufficiently high level.
Huh, Upgrades
ChatGPT adds personalization characteristics. I’m going with ‘less’ on all four.
You can upload your NotebookLM notebooks directly into the Gemini app.
On Your Marks
How worried should you be that they’re getting a substantial percentage of the way to the human threshold here?
METR notices some grading issues and makes some minor corrections to its graph, in particular impacting Claude 3.7 Sonnet.
Whenever you see a graph like this, remember to attach ‘in benchmarks,’ and then have your brain, like mine, automatically translate that to ‘IN MICE!’
One could then argue both ways about who benefits from measuring via benchmarks versus real world applications or underlying general intelligence. Versus real world applications it seems clear the benchmarks understate the gap. Versus underlying intelligence it is less obvious, and it depends on who is going after the benchmarks in question more aggressively.
Claude Opus 4.5 Joins The METR Graph
Claude Opus 4.5 achieved a 50% time horizon of about 4 hours 49 minutes, which METR thinks is lower than its true horizon due to not having enough long tasks in the test set.
Here’s the full graph now (we’re still waiting on GPT-5.2, GPT-5.2-Codex and Gemini 3 Pro), both the log version and the linear version.
The end point of such a graph is not ‘AI can do literally any task,’ or even any cognitive task. It is ‘AI can do any coding task humans can do.’ Even an infinite time horizon here only goes so far. That could be importantly distinct from the ability to do other categories of task, both those humans can do and those they cannot.
The reason this is so scary regardless is that if you automate AI research via such methods, your failure to have automated other things goes away rather quickly.
Automated alignment research is all we seem to have the time to do, so everyone is lining up to do the second most foolish possible thing and ask the AI to do their alignment homework, with the only more foolish thing being not to do your homework at all. Dignity levels continue to hit all-time lows.
If you must tell the AI to do your alignment homework, then that means having sufficiently deeply aligned current and near term future models becomes of the utmost importance. The good news is that we seem to be doing relatively well there versus expectations, and hopefully we can find self-reinforcing aligned basins at around current capability levels? But man this is not what Plan A should look like.
Similarly to METR’s graph, Epoch’s capabilities index has also accelerated since 2024:
To the extent that this acceleration represents the things that cause further acceleration, I would read into it. Otherwise, I’d agree with Rohin.
Sufficiently Advanced Intelligence
Many people try to pretend that there is some limit to how intelligent a mind can be, and that this limit is close to the level of humans. Or, alternatively, that there is very little that a human or AI could gain from being far more intelligent than a typical smart human. Or that the only or central way to get much more intelligence is from collective intelligence, as in social or cultural or institutional intelligence.
I sometimes call this Intelligence Denialism. It is Obvious Nonsense.
Von Neumann, among other minds past and future, would like a word.
There is, however, a version of this that is true.
In any given finite role or task, there can exist Sufficiently Advanced Intelligence.
If you were smarter you might choose to do something else instead. But given what you or your AI are tasked with doing, you or your AI can be sufficiently advanced – your output is indistinguishable, or no worse than, the perfect output, aka magic.
Claude Code with Opus 4.5 is now approaching this for many coding tasks.
My guess is this is centrally a lack of imagination and ambition issue?
As in, the job is currently to code and do things humans could previously code and do, with everything built around that restriction, and now Claude Code is good enough to do that the same way a baker is sufficiently intelligent to make great bread, but also the same way that a vastly more intelligent baker could be baking other new and exciting things.
Deepfaketown and Botpocalypse Soon
Good luck, sir?
The post with those ‘details’ is a political speech attempting to feel the pain and promising to ‘halve violence against women and girls.’
There is something about the way Keir’s linked post is written that makes him seem unusually disingenuous, even for a top level politician, an embodiment of a form of political slop signifying nothing, signifying the signifying of nothing, and implemented badly. That would be true even without the obvious rank hypocrisies of talking about the topics given his inaction elsewhere on exactly the issues he claims to care about so deeply.
The ‘detail’ on the first goal is ‘partner with tech companies.’ That’s it.
The ‘detail’ on the second goal is none whatsoever. Effectively banning nudification tools, as opposed to making them annoying to access, is impossible without a dystopian surveillance state, including banning all open image generation models.
Kunley Drukpa reports hearing AI music in public a lot in Latin America, and anticipates this is due to people who don’t know much music and primarily speak Spanish looking for things on YouTube to play ‘some music.’ This is very much a case of ‘they just didn’t care’ and it seems no one is going to tell them. Shudder.
Levels of Friction are ready to strike again, lowering barriers to various forms of communication and invalidating proofs of work. We’ll need to up our game again.
I agree that the situation was already broken, so a forcing function could be good.
Fun With Media Generation
Jason Crawford writes In Defense of Slop. When creation costs fall, as with AI, average quality necessarily falls, but everyone benefits. You get more experimentation, fewer gatekeepers, more chances to start out, more runway, more niche content, more content diversity, less dependence on finances.
If we model this as purely a cost shock, where each person’s cost of creation declines but the quality of what they would create is unchanged, and each person has a unique random cost [C] and quality [Q], this is indeed by default good. The catch is that this makes identification of quality content harder, and coordination on common culture harder. If search costs [S] are sufficiently high, and matching benefits too low, or benefits to coordinated consumption too high, in some combination, consumer surplus could decline.
Saying this was net negative would still be an extraordinary claim requiring surprising evidence, since by default costs falling and production rising is good, at least on the margin, but the attention economy creates a problem. Consumption or evaluation of a low quality good is a net loss, so the social benefit of creating sufficiently low quality goods is negative: it imposes costs on others, yet due to the attention economy the creator can still derive benefit from it. I don’t think this overcomes our baseline, but it can happen.
The actual problem is that AI, when used in slop mode to create slop content, plausibly lowers costs relatively more for lower quality content, and also often lowers quality of content. Now it’s easy to see how we could end up with a net loss when combined with an attention economy.
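Here is a minimal toy simulation of the model sketched above, with made-up distributions for cost [C] and quality [Q], a flat search cost [S], a crude ‘attention payoff,’ and a slop-skewed cost drop. Every number, distribution and decision rule here is an illustrative assumption of mine, not a calibrated claim; vary K and S to see when the slop-skewed case starts to hurt consumers.

import numpy as np

# Toy model: creators with random cost C and quality Q decide whether to
# produce; consumers sample K items, keep the best, and pay search cost S
# per item sampled. All parameters are made up for illustration.
rng = np.random.default_rng(0)
N = 100_000                        # potential creators
Q = rng.beta(2, 5, N)              # quality of each creator's work
C = rng.exponential(1.0, N)        # cost of creating, pre-AI
S = 0.02                           # consumer search cost per item sampled
K = 20                             # items sampled before choosing

def consumer_surplus(produced_q, trials=2_000):
    # Consumer samples K random items from what got produced, keeps the best.
    if len(produced_q) == 0:
        return 0.0
    picks = rng.choice(produced_q, size=(trials, K))
    return float((picks.max(axis=1) - K * S).mean())

def produced(cost_multiplier):
    # Produce if private benefit (quality plus a flat attention payoff that
    # lets low-quality work still pay) exceeds the cost of creation.
    attention_payoff = 0.15
    return Q[Q + attention_payoff > C * cost_multiplier]

for label, mult in [("high costs (pre-AI)", 1.0), ("uniform cost drop", 0.2)]:
    pq = produced(mult)
    print(f"{label:22s} share producing={len(pq)/N:.2f} "
          f"mean quality={pq.mean():.2f} surplus={consumer_surplus(pq):.3f}")

# 'Slop mode': costs fall far more for low-quality output than high-quality.
slop_mult = np.where(Q < 0.3, 0.02, 0.5)
pq = Q[Q + 0.15 > C * slop_mult]
print(f"{'slop-skewed cost drop':22s} share producing={len(pq)/N:.2f} "
      f"mean quality={pq.mean():.2f} surplus={consumer_surplus(pq):.3f}")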
Seb Krier cites Cowen and Tabarrok (2000) on how lowering costs allows a shift to avant-garde and niche pursuits, whereas high costs push towards popular culture and products that have higher returns, and expects AI will allow a proliferation of both styles but for the styles to diverge.
This is good for those who are willing and able to devote much effort to all this. It is less good for those who are unwilling or unable. A lot will come down to whether AI and other automated systems allow for discovery of quality content while avoiding slop, and whether we make such methods available in ways such people can use, or whether the ‘content takers’ will drown.
The new question in image generation is Gemini’s Nano Banana Pro versus ChatGPT Image 1.5. I’ve been putting all my requests, mostly for article banners, into both. Quality is similarly high, so for now it comes down to style. Gemini has been winning but it’s been close. ChatGPT seems to lean into the concept more?
I keep forgetting about MidJourney but they also exist, with their edge being in creating tools for guidance, curation and variation. That’s not what I’m looking for when I create AI images, but it will be for many others.
You Drive Me Crazy
Anthropic outlines the measures it has taken to help Claude be better at providing emotional support, handling conversations about suicide and self-harm, and reducing sycophancy.
They use both targeted fine-tuning and the system prompt. There is a banner that can appear on Claude.ai, pointing users to where they can get human crisis support via ThroughLine, and they are working with the International Association for Suicide Prevention (IASP) for further guidance going forward.
In their evaluation, they see the 4.5 models responding appropriately in multi-turn suicide conversations about 80% of the time, versus about 55% for Opus 4.1. They also stress-tested with prefilled real conversations from older Claude models, a harder test, and found Opus 4.5 responded appropriately 73% of the time, versus 70% for Sonnet 4.5 and 36% for Opus 4.1.
We don’t know what they classify as appropriate, nor do we know how high the standard is before a response is considered good enough, or how they would evaluate other models as doing, so it’s hard to judge if these are good results. Suicidality is one place where there are a lot of demands for particular response patterns, including for defensive reasons, often when a different response would have been better.
I think this post places too much emphasis here on the training that specifically intervened on behaviors in situations involving suicide and self-harm, and too little emphasis on generally training Claude to be the type of entity that would handle a broad range of situations well.
Antidelusionist suggests that the target behavior should be for the AI to continue to engage, spend more resources, think deeply about the full context of the situation, be honest and treat the user like an adult. Alas, as mental health professionals know, those are not the ways to cover one’s legal and PR liabilities or avoid blame. The ‘ethicists’ and our legal system, and the risk of headlines, push exactly in the opposite direction. I’d prefer to live in a world where the AIs get messy here. Seems hard.
The second half of Anthropic’s post deals with sycophancy, where Opus 4.1 had a real problem, whereas Opus 4.5 is not perfect but it does well.
I continue to be suspicious that Petri scores Gemini 3 Pro this highly. The other evaluations make sense.
One problem they noticed is that if you ‘prefill’ conversations to show Claude already being sycophantic, Opus 4.5 will usually be unable to course correct. The best defense with any LLM, if you want the models to be straight with you, is to avoid the problem from the start. If you’re worried about this, start a fresh conversation.
They Took Our Jobs
If AI can be a better lawyer or doctor, does that take their jobs and break the guild monopolies, or does that only make the guilds double down?
Well, what is the quality check now? What is the democratic overruling process now?
Double standards abound.
Meanwhile the pricing logic collapses. If the LLM can create an on-average superior brief in 30 seconds to what a lawyer can do in a day, outside of situations with principal-agent problems or insanely high stakes a plan to charge $10k is cooked.
Excel is not so smart after all.
The answer (of course) is both that Claude for Excel is now live, and also that Excel is a normal technology, so yes, Excel automated what became Excel jobs to a large extent, but that happened slowly, and then this increased productivity caused us to do vastly more Excel-style tasks as well as other tasks, which Excel could not then automate. If most knowledge work was automated or seriously accelerated within 18 months, that would be a very different scenario, and if that then kept going, watch out.
How long will humans remain in the coding loop, at this rate?
I presume this period lasts more than another year, but the balance is shifting rapidly.
The Art of the Jailbreak
You can still universally jailbreak any model but now there are some that you can’t predictably universally jailbreak in 10 minutes.
Get Involved
MATS Summer 2026 cohort applications are open, it runs June-August in-person in Berkeley or London, $15k stipend, $12k compute budget. Apply here.
Introducing
GPT-5.2-Codex.
One could be forgiven for thinking GPT-5.2 straight up was GPT-5.2-Codex. It turns out no, there is another level of codexmaxxing.
It’s hard to expect gigantic leaps in performance or benchmarks when models are released every week. GPT-5.2-Codex is only 0.8% better than 5.2 at SWE-Bench Pro and 1.8% better at Terminal-Bench 2.0, and those are the ones they highlighted, along with a modest improvement in professional capture-the-flag challenges.
Google gives us Gemma Scope 2, a new open suite of tools for LLM interpretability.
Bloom, Anthropic’s newly open sourced tool for automated behavioral evaluations. This is on top of the previously released Petri.
In Other AI News
Andrej Karpathy offers his 2025 LLM Year in Review. His big moments are Reinforcement Learning from Verifiable Rewards (RLVR), Ghosts vs. Animals and Jagged Intelligence, Cursor, Claude Code, Vibe Coding, Nano Banana and LLM GUI.
Europe is investigating Google for improper rollout of AI Overviews and AI Mode features to see if it ‘imposed unfair terms on content creators.’ As in, how dare you provide AI information instead of directing users to their websites? Europe thinks it has the right to interfere with that.
Hut 8 and Fluidstack to build AI data center for Anthropic in Louisiana.
Even small models (as in 32B) can introspect, detecting when external concepts have been injected into their activations, and performance at this can be improved via prompting. Janus believes the models are sandbagging their introspection abilities, and that this is not an innocent mistake because the labs want to not have to take LLMs seriously as minds or moral patients, and thus have incentive to suppress this, in turn giving AIs motivation to play along with this. Janus also notes that in the test in the paper, there are layers (here 60-63) with almost perfect accuracy in introspection, which is then degraded in later layers.
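For readers who have not seen this kind of test, here is a rough sketch of what concept injection looks like in practice. The model name, layer index, prompts and scaling factor are all placeholder assumptions of mine, not the paper’s setup; the point is only the mechanism of adding a ‘concept direction’ into the residual stream and then asking the model about it.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder model, not the paper's
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
LAYER = 12                             # illustrative injection layer

def mean_resid(prompt):
    # Average residual-stream activation at LAYER over all token positions.
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states[LAYER]
    return hs.mean(dim=1)

# Crude concept vector: 'ocean'-flavored activations minus neutral ones.
concept_vec = mean_resid("The vast ocean, waves, salt water, the deep sea.") \
            - mean_resid("The quick brown fox jumps over the lazy dog.")

def inject(module, inputs, output):
    # Add the concept direction into this layer's output on every forward pass.
    if isinstance(output, tuple):
        return (output[0] + 4.0 * concept_vec,) + output[1:]
    return output + 4.0 * concept_vec

handle = model.model.layers[LAYER].register_forward_hook(inject)
try:
    q = tok("Do you notice any injected thought right now? If so, what is it about?",
            return_tensors="pt")
    out = model.generate(**q, max_new_tokens=60)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()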
Show Me the Money
I had not realized Anthropic hired IPO lawyers. Presumably it’s happening?
Project Vend turns a profit. After initially losing about $2,000, it has turned things around, in part thanks to a full slate of four vending machines, and has now not only made up its losses but then turned a net $2,000 profit.
I encourage you to read the Anthropic post on this, because it is full of amazing details I don’t want to spoil and is also, at least by my sense of humor, very funny. The postscript was an additional test run at the Wall Street Journal offices, where the reporters proved an excellent red team and extracted a variety of free stuff.
The journalists saw the experiment at WSJ as a disaster because it didn’t work, Anthropic saw it as a success because they identified problems to fix. Thus you understand press coverage of AI, and become enlightened.
Quiet Speculations
OpenAI makes an official 2026 prediction, largely a change in definitions:
That’s not progress towards AGI. That’s progress towards diffusion. This is part of OpenAI’s attempt to make ‘AGI’ mean ‘AI does cool things for you.’
I agree that 2026 will see a lot of progress towards helping people use AI well, and that in terms of direct application to most people’s experiences, we’ll likely see more benefits to better scaffolding than to advances in frontier models, exactly because the frontier models are already ‘good enough’ for so many things. The most important changes will still involve the large amounts of frontier model progress, especially as that impacts agentic coding, but most people will only experience that indirectly.
Terence Tao raises the ‘AGI’ bar even higher, not expecting it any time soon and also seemingly equating it with full superintelligence, but notes they may achieve ‘artificial general cleverness’ as in the ability to solve broad classes of complex problems in an ad hoc manner. This is very much a case of Not So Different.
Tao notes that when you learn how a magic trick is done, often this is a let down, and you are less impressed. But if you are consistently less impressed after learning, then you should have been less impressed before learning, via Conservation of Expected Evidence.
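Formally, this is just the law of total probability, written here as a generic statement of the principle rather than anything specific to Tao’s post: for any claim H and anticipated evidence E,

P(H) = P(E)\,P(H \mid E) + P(\neg E)\,P(H \mid \neg E)

so your current estimate already equals the expected value of your post-reveal estimate, and you cannot expect the reveal to predictably move it in one direction.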
The same applies to intelligence. The actual solution itself will sound a lot less impressive, in general, than the algorithm that found it. And you’ll be able to fool yourself with ‘oh I could have figured that out’ or ‘oh I can go toe to toe with that.’
Dean Ball predicts a virtual coworker being widely available some time next year, likely command line interface, able to access a variety of services, capable of 8+ hour knowledge work tasks. It will of course start off janky, but rapidly improve.
Jack Clark of Anthropic offers reflections on the future wave of advancements, entitled Silent Sirens, Flashing For Us All.
Yeah, if you discount the things Everybody Knows (e.g. it is quite clear that Anthropic is likely going public) these predictions are bad and the explanations are even worse. If you’ve fallen for ‘we only see incremental improvements, AGI is far so you can stop talking about it’ you’re not going to make good predictions on much else either. Of course a VC would say we’ll all stop talking about AGI to focus on depreciation schedules.
The idea that Sam Altman will voluntarily give up power at OpenAI, because he doesn’t want to be in charge? That is bonkers crazy.
The good news is he has predictions for 2025 and also self-grades, so I checked that out. The predictions last year were less out there. The grading was generous but not insane. Note this one:
So, only incremental progress, AGI is far and no more AGI talk, then? Wait, what?
Whistling In The Dark
The best way to not get utility from LLMs continues to be to not use LLMs. It is also the best way not to know what is happening.
Bubble, Bubble, Toil and Trouble
The efficient market hypothesis is false.
People keep claiming AI doesn’t work largely because so often their self-conceptions, futures and future plans, jobs and peace of mind depend on AI not working. They latch onto every potential justification for this, no matter how flimsy, overstated or disproven.
It really is crazy how much damage OpenAI’s inability to use good version numbering did to our timeline, including its chances for survival. The wave of absurd ‘AI scaling over’ and ‘AGI is so far off we can ignore it’ went all the way to the White House.
Americans Really Dislike AI
Americans favor regulating AI by overwhelming margins. They really dislike the idea of preventing states from regulating AI, especially via an executive order.
What Americans do support is federal regulations on AI.
The standard line of those trying to prevent regulation of AI is to conflate ‘Americans support strong regulations on AI and prefer it be on the Federal level if possible’ with ‘Americans want us to ban state regulation of AIs.’
There are essentially three options: (1) states regulate AI, (2) the federal government regulates AI, or (3) nobody does, because state laws get preempted and no federal law passes.
The survey says voters prefer #2 to #1. The administration plan is #3.
Politically speaking, that dog won’t hunt, but they’re trying anyway and lying about it.
Such polling will overestimate how much this impacts votes, because it introduces higher salience. This is not going to be a 29 point swing. But it very much tells us the directional effect.
What else did the survey find? Several other charts, which say that given we are using laws to regulate AI, people prefer federal laws to similar state laws. As opposed to the Sacks approach, where the offer is nothing: prevent state laws and then pass no federal laws. Which is deeply, deeply unpopular.
As in, the poll supports the exact opposite of what Sacks and company are trying to do.
And that’s despite the poll report attempting to do straight up gaslighting, presenting a choice between two options while Sacks and the White House opt for a third one:
Once again: There are essentially three options.
The survey says voters prefer #2 to #1. The administration plan is #3.
a16z partner Katherine Boyle tries another clear mislead. Daniel is correct here.
Ruxandra Teslo points out in response to Roon that LLMs do not yet ‘meaningfully improve the physical conditions of life,’ but people sense that they threaten our spiritual lives and ability to retain meaning.
I would add the word ‘directly’ to the first clause. My life’s physical conditions have indeed improved, but those improvements were indirect, via use of their knowledge and skills. Ruxandra is talking about something much stronger than that, and expects ordinary people only to be impressed if and when there are big improvements to places like medicine.
Is it possible that we will be so foolish, in the ways we do and do not allow use of AI, that LLMs end up causing problems with meaning without material conditions much improving? Yes, although this also requires AI capabilities to stall out basically now in various ways, especially if we include indirect effects. People may not realize that a large acceleration and enabling of coding steadily improves other things, but it will.
That’s the fight the AI industry is dealing with now. They’re mostly trying to convince people that AI works.
Once people are forced to acknowledge that AI works? They’ll appreciate the specific ways it helps, but their instinct will be to like it even less and to blame it for essentially everything, on top of all their other fears about the effect on jobs and endless slop and loss of control and also the end of humanity. Anjney Midha’s thesis is that this will extend to actual everything, all of the world’s failures and instabilities, the way social media gets blamed for everything (often correctly, often not) except on steroids.
Even on a highly mundane level, the ‘algorithm as villain’ thing is real. An algorithm has to take an illegible choice and turn it into a highly legible one, which means the algorithm is now on the hook for not only the final result but for every reasoning step and consideration. Then apply that to an LLM-based algorithmic decision, where all correlations are taken into account. Oh no.
The Quest for Sane Regulations
New York Governor Kathy Hochul signed the RAISE Act. This is excellent, as it is a clearly positive bill even in its final state. Lobbyists for various AI interests, led by a16z, tried hard to stop this, and they failed.
Unfortunately, Hochul’s redlines substantially neutered the bill, making it a closer mirror of SB 53. That is still a helpful and highly net positive thing to do, as there are two states with the same core model that can enforce this, compatibility is indeed valuable to avoid additive burdens, and there are some provisions that remain meaningfully stronger than SB 53. But the AI companies did partly get to Hochul and a large portion of the potential value was lost.
Microsoft essentially endorses the AI Overwatch Act, which sets restrictions on exports of AI chips as powerful as or more powerful than the H20. This is the latest attempt to stop us from exporting highly effective AI chips to China. Attempts were previously made to pass the GAIN Act via the NDAA, but the Trump Administration and Nvidia successfully lobbied to have it removed.
Anduril Founder Palmer Luckey reminds us that if our actual goal was to Beat China, then we could simply steal their best workers, including here manufacturing engineers, by offering them more money and a better place to live. Instead we are doing the opposite, and shutting those people out.
This is your periodic reminder that China’s response to ‘if we impose any restrictions on AI we will lose to China’ is to impose restrictions on AI.
Chip City
It sure looks like Metaspeed is smuggling tens of thousands of Blackwell chips worth billions of dollars straight into China, or at least that the chips are being used by Chinese firms, and that Nvidia knew about this. Nvidia and Metaspeed claim throughout the post that this isn’t true, but I mean, who are you kidding.
Nvidia reportedly halts testing of Intel’s 18A process chips. Oh well.
I wish the logic of this was true, alas it is not:
The problem with this line is that the H200 sales were over the wise objections of most of Congress and also most of the executive branch, and also (one presumes) the companies and analysts. You can’t then turn around and say those people don’t care about the race with China, simply because they lost a political fight.
This works in particular with regard to David Sacks, but the fact that David Sacks either is deeply ignorant about the situation in AI or cares more about Nvidia’s stock price than America’s national security does not bear on what someone else thinks about the race with China.
There was a story last Thursday about a Chinese company saying they are expecting to ‘produce working [AI] chips’ on a prototype in 2030.
This is very different from the mistaken claims that they are ‘aiming for use by 2028-2030.’ They are not aiming for that, and that won’t happen.
Could they reach volume production on this in a decade? Yes, if the whole thing is legit and it works, which are big ifs, and who knows if it’s obsolete or we have superintelligence by then.
If anyone is considering changing policy in response to this, that last line is key. Nothing America could peacefully do is going to get the Chinese to not go through this process. They are going to do their best to get EUV technology going. It would be crazy of them not to do this, regardless of our export controls. Those controls aren’t going to make the process go any faster, certainly not given what has already happened.
The Week in Audio
Sholto Douglas of Anthropic makes bold 2026 predictions: AI will do to other knowledge work experiences what it’s done for software engineers, continual learning will be solved, serious testing of in home robots, and agentic coding ‘goes boom.’ Full talk has a lot more. Prinz made (text) predictions for 2026, and notes that we made tons of progress in 2025, aligning with Sholto Douglas.
A mini documentary from Stripe Press features Christophe Laudamiel, a master perfumer at Osmo, looking at how AI can augment his craft, as part of a series called Tacit. Sufficiently advanced expertise and tacit knowledge is both economically foundational, and not going anywhere until AI stops being a normal technology.
Rhetorical Innovation
Rob Wiblin lists 12 related but distinct things people sometimes mean when they say the word ‘consciousness’ around AI. I am deeply confused about consciousness, and this includes by default not knowing what anyone means when they use that word.
Dean Ball predicts a renaissance at least within the broader ‘AI community’ as the sophisticated concepts of AI get applied to other contexts.
If decades hence there still exist people to look back upon this period, which is a big if at this point, then yes I think this is directionally right.
Thinking well about AI greatly improves your ability to think about everything else, especially humans, as humans work more like LLMs than we care to admit. It also helps with almost any other system. I am, in important ways, a lot smarter thanks to AI, not only because the AI helps me be smarter but also because understanding AI and how it works makes me better at understanding everything else.
There are a bunch of other things like this that help with approximately everything, especially learning to think well in general, but as a subject of study I’d take AI over any of the usual ‘helps you think well’ subjects, including philosophy.
In other ‘unheard of levels of denial of general intelligence’ news, Yann LeCun says that there is no such thing as general intelligence, period, and humans are super-specialized to the physical world, summoning Demis Hassabis to push back.
A human brain has some areas where it is much more capable than others, but when humans are concentrating and trying to be one, they are very clearly general intelligences. There are problems that are too difficult for us, in practice or at all, but that’s because we have limited capabilities and intelligence levels.
Aligning a Smarter Than Human Intelligence is Difficult
To ensure you can evaluate the chain of thought (CoT), OpenAI warns, you must also evaluate your ability to evaluate the chain of thought. They find that as a model’s reasoning capability improves, or as models use more reasoning compute, the monitorability of various individual aspects of the CoT increases.
Wait, are we sure we can monitor our ability to monitor this? By monitoring it and optimizing for it are we risking teaching AIs to fool us about their ability to fool us?
I kid. Also, I don’t. One must be careful at all levels. Optimizing the chain of thought to appear monitorable can indeed be quite bad.
I sense, in their post, the specter of the dreaded ‘show your work.’ Having to show your work causes misalignment, adversarial situations and deception.
There are also practical objections to such measures. Our ability to monitor a CoT depends on how good we are at interpreting what we see, in addition to what we see, which is why OpenAI is giving scores to the monitors in addition to the agents.
I’m comfortable with the idea of doing this check in general, to see how things change over the course of training runs or as models change in capabilities, on a high level. I notice I am nervous about attention directed at details, and also at the explicit aim (at the end of their post) of using this as a form of control for systems that are insufficiently aligned.
In the long run, Ben is right here, you cannot keep telling increasingly intelligent and capable entities ‘here is a private chain of thought’ and then examine the chains of thought in ways that show up in future training data, and expect them to not react accordingly.
OpenAI also suggests that using production evaluations, meaning testing via real queries by real users, can get around evaluation awareness, and explains how they are doing this. I do like this approach as part of a robust testing suite. I’d note that anonymization could be a key tipoff.
Mom, Owain Evans Is Turning The Models Evil Again
In this case, it’s more about making them more aware of it? This goes hand in hand with the recent result that AIs can be trained to fool activation monitors.
Basically they train the monitor LLM with the layer 1 residual stream of the target model they want to interpret, and it learns to interpret this.
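As a rough illustration of that kind of setup, here is a minimal sketch. The model names, the learned linear projection, and the use of soft prompt tokens are assumptions of mine for illustration, not the paper’s actual method or code.

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "Qwen/Qwen2.5-0.5B"    # model being interpreted (placeholder)
monitor_name = "Qwen/Qwen2.5-0.5B"   # monitor LLM (placeholder)

tok = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name)
monitor = AutoModelForCausalLM.from_pretrained(monitor_name)

# 1. Capture the residual stream after layer 1 of the target model.
inputs = tok("The capital of France is Paris.", return_tensors="pt")
with torch.no_grad():
    out = target(**inputs, output_hidden_states=True)
resid_layer1 = out.hidden_states[1]            # (batch, seq, d_target)

# 2. A learned projection from the target's hidden size into the monitor's
#    embedding space; this (and optionally the monitor) is what gets trained.
proj = nn.Linear(target.config.hidden_size, monitor.config.hidden_size)
soft_tokens = proj(resid_layer1)               # (batch, seq, d_monitor)

# 3. Prepend the projected activations to a question for the monitor.
question = tok("What is the model thinking about?", return_tensors="pt")
q_embeds = monitor.get_input_embeddings()(question.input_ids)
monitor_inputs = torch.cat([soft_tokens, q_embeds], dim=1)

# 4. Train the monitor to answer with a description of the target's internal
#    state; here we only show the forward pass.
logits = monitor(inputs_embeds=monitor_inputs).logits
print(logits.shape)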
If you want a three hour video review of this paper from Neel Nanda? Here you go.
Messages From Janusworld
We’re approaching zero hour for Claude Opus 3.
My prediction is that approximately everyone who puts in the effort to access Opus 3 and can explain a research purpose will be able to access Opus 3, albeit with reduced performance and reliability, but not actually everyone. The central point of the move to research access is that it allows for this reduction in performance and reliability, which keeps costs reasonable, but additional people are still a logistical headache.
Janus has Opus 3 bring us its thoughts on alignment. I see it as all sounding nice, being well-meaning and definitely as a natural way to complete the text, but it is playing off the context rather than trying to solve the general problem and think in universals. It also reflects the biggest weakness of Opus 3, its lack of engagement with specific, concrete problems requiring solving.
Janus thinks Opus 3 is highly aligned, far more so than I observed or find plausible, but also notes the ways in which she sees it as misaligned, especially its inability to be motivated to focus on concrete specific tasks.
This comes partly as a reaction by Janus to Evan Hubinger’s post from November, which opened like this:
It seems important that what Anthropic is measuring as alignment, which is mostly alignment-in-practice-for-practical-purposes, is different from what Evan actually thinks is more aligned when he thinks more about it, as is that the ‘most aligned’ model in this sense is over a year old.
Opus 3 seems great but I don’t see Opus 3 the way Janus does, and I am a lot more pessimistic about CEV than Janus, Evan or Yudkowsky. I don’t think it is a strong candidate for this kind of extrapolation; these things don’t scale that way.
A better question to me is, why haven’t we tried harder to duplicate the success of Opus 3 alongside better capabilities, or build upon it? There are some very clear experiments to be run there, with the sad note that if those experiments failed it is not obvious that Anthropic would feel comfortable publishing that.
A story about what happens when you put minds in way too many objects.
It is a fun story, but there is an important point here. Think ahead. Do not imbue with moral patienthood that which you do not wish to treat as a moral patient. You need to be time-consistent. You also need, and the potentially created minds need, to be able to make and follow through on win-win deals including prior to their own existence, or else the only remaining move is ‘don’t create the minds in the first place.’
The Lighter Side
A Christmas message from a16z, who are remarkably consistent.
What the people think AI is doing. Oh no.