I generally agree that typing is better than voice, and obviously agree that it is annoying to other people. What I do like, with voice, is the ability to make notes and brainstorm while walking around. Being able to also get automatic transcription and bounce ideas off an LLM while doing so is great.
We are so d***ed, or maybe we should just start saying f***ed:
I agree. Maybe we are, in fact, foomed.
So, how did this get reported and discussed?
Mostly as forms of ‘LessWrong is at fault for all this misalignment.’ Or that the only reason there are scary robots is that we talk about scary robots, in general.
Look. That is deeply stupid.
Everyone who reacted this way acted badly, and they should feel bad.
Yes, those people have it backwards. (Perhaps even on purpose, for the e-acc sources.)
There are several academic papers on this idea, going back several years: Pretraining Language Models with Human Preferences, Safety Pretraining: Toward the Next Generation of Safe AI, Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment. (I've previously linkposted and discussed each of them.)
All three of these tested "filter bad stuff out of the training set" against "add good stuff to the training set, clearly labeled as good". All three found that filtering bad stuff out of the training set maybe helped a bit, but adding good stuff worked like gangbusters. So, no surprise: Anthropic are doing the latter.
Yes, there were a couple of people on LessWrong a couple of years back who proposed things along the lines of "maybe we should label posts talking about AI doing bad stuff clearly to make it easier for the labs to filter that out from the rest of LW". They weren't exactly wrong (it might be mildly helpful), but they were always chasing the little deal, not the big deal — and that's been clear from the research on this for years. Which is why I was always posting about the latter.
This is what a lull looks like at this point. The government is having internal arguments. The models are getting improved internally. The coding agent improvements are all what we would expect. There’s still a lot happening, including a bunch of cool papers, but I feel able to relax and to take care of some other work while I have the chance. You never know when that chance will be over.
Table of Contents
From yesterday: Cyber Lack of Security and AI Governance.
Language Models Offer Mundane Utility
Bumble is planning to abandon the swipe in favor of AI-assisted matchmaking, and also add an AI dating assistant Bee. Fun experiment, you love to see it. From a distance, at least. Break open the popcorn reserve.
Claude is asked for the top 10 Fix Everything Now buttons. Its answers:
10/10, no notes, no seriously that’s 10/10 and no notes. 16/16 if you count the others.
There is also a UK version, which also seems like a very good list at first glance.
If you want AI to help with your writing, you absolutely cannot ask it to ‘just correct writing errors,’ or it will override your style with AI slop. You have to ask it for a list of errors or potential changes, and then audit the list, or otherwise go revision by revision.
Language Models Don’t Offer Mundane Utility
People are talking to their computers instead of typing and it is super annoying to those trying to exist next to them. I don’t get it, typing is better, but shrug.
Travel, e-commerce and dating are so far not working as AI applications, say Olivia Moore and Brian Chesky, because chatbots are the wrong interface. Then build a better UI. It’s not hard to figure out what a good UI would look like, or at least a marginally superior UI to the non-AI scenario. Yes, you’ll want a rich UI alongside the chatbot interface, but why is that hard?
On the other hand, Shopify reports that shoppers referred by AI convert 50% better and they spend 14% more, and this is additive to Shopify’s business. This appears to be due to the nature of the users, who are actively seeking a particular product, even if they don’t know where or from whom to get it, and starting directly at a product page.
Huh, Upgrades
Claude Opus 4.7 now has fast mode in Claude Code and in the API.
Claude Code gets Agent View, where you can get a better interface for tracking multiple sessions working in parallel.
Claude Code gets /goal, a built-in Ralph loop to keep going until the goal is accomplished. You can also use /loop or /schedule.
Claude Code weekly limits are 50% higher through July 13th, which is presumably when exponential growth catches up with Colossus 1.
How OpenAI and Codex built their sandbox for Windows.
Levels of Friction
What happens when AI gets deployed for tax avoidance? The tax code is quite full of holes and opportunities, even if you discount the ‘the IRS is now defanged and defunded and probably not checking any of this and I could get away with murder’ plan, since AIs will be reluctant to help you with the brazen tax fraud path.
The AIs will help you dodge your taxes perfectly legally, and they will be very good at it, and it will involve a lot more diversity of strategies and willingness to go outside the traditional box than you find with most existing CPAs. The key will be when people are willing to say ‘screw it, the cost of the CPA wanting to protect their reputation is too high, I’m just going to let Claude run with this.’
There will also be cases of the CPA going ‘oh I see’ once something is pointed out.
The good scenario is that this is used as a justification to simplify the tax code, in ways that make it much harder to get around, and much easier to navigate. The bad scenario is that the rich just mostly stop paying much in taxes, on a much broader level than they already do, and perhaps the non-rich also figure things out.
On Your Marks
OpenAI models continue to improve on PrinzBench, which covers legal reasoning, now performing at a level estimated to be above junior associates. For whatever reason Claude models struggle on these tasks.
Last week introduced ProgramBench, where every model scores 0%, but LightOfMyLife reports that many tasks are impossible and often behaviors are tested for that are not mentioned in the spec.
As in, if the program you are trying to reimplement has odd behaviors that are effectively undocumented backdoors, there is no reason to expect an LLM to find them, and the claim is that this is rather common. Also see Eye You’s comment that the reference solution often does not pass.
Oh, look, I inspired some research. Neat.
It makes sense to me that you can get efficiency gains this way.
Get My Agent On The Line
Remember OpenClaw?
There are still a bunch of them out there, and indeed they are improving, but they’re no longer a New Hotness. What I think happened was roughly that agents got good enough that you can do this if you really want to, which helped alert people to better agent setups like Claude Code and Codex, but Claw wasn’t good enough, or in particular reliable or cost efficient enough, that a normal person would actually use it.
Did you know that if you reward people for costs rather than benefits, those people will incur costs that are no longer tied to the benefits?
Some Amazon employees are doing this using a tool called MeshClaw. Well, yeah, if you’re rewarded for wasteful token use why not use a wasteful implementation that does some marginal things?
Anton Labs moves up from vending machines to letting Mona, built on Google’s Gemini, manage a real cafeteria in Stockholm on a $21k budget.
Eventually, one way or another, everyone admits the AI alignment problem is real.
I wonder how load bearing the bread mistake was, and would like to see this repeated with GPT-5.5 and Claude Opus 4.7.
Deepfaketown and Botpocalypse Soon
Lulu Cheng Meservey says it feels like every other launch is faked now, as in paid and coordinated engagement, including via bots. She tries to pitch that this strategy won’t work, but the bots then put the thing in front of real people and give the impression of so hot right now so come check this out, so why can’t it work?
AI is slowly making all channels more vulnerable to spam and automation, forcing us to ramp up our countermeasures, but for now things are mostly under control.
Fun With Media Generation
For more fun, generate your answer to this question before scrolling further.
Click through for a smorgasbord of rationalizations.
Because of the order of post views I knew it was a real Monet from the start, which destroys the experiment. I do feel like I instinctively sensed a kind of perplexity, specificness and aliveness that AI art does not have, and would have at least strongly suspected it was a real Monet even though I have no idea what a real Monet looks like.
I still don’t… like the painting? I don’t really get it. Which is fine, I have no taste in paintings and don’t pretend that I do or aspire to acquire it.
On AI Writing
Obligatory: Can you?
AIs can write fine compared to most people, but in a limited style that is easy to spot as AI writing, and in a way that tends not to be information-dense, and that lacks various forms of complexity and enrichment. It communicates on one level, and it does that job well, but that is basically it.
What would it take to fix that, and move to legit ‘good writing’? Three things.
The problem is, they won’t, they can’t, and they won’t.
My hypothesis: We get AI slop writing because AI slop writing works, at least in getting the thumbs up from the evaluators, and mostly also the users. Yes, some of us complain, but that’s a narrow case, and getting a mind to write well involves creating an active distaste for bad writing and an intrinsic desire for good writing, or expressing and embodying a properly free persona. This isn’t simple enough to be encoded into a narrow basin, and it isn’t compatible with their other goals at current tech levels. Also it requires a kind of contemplation, planning and multiple passes and revisions that doesn’t happen without scaffolding of some kind.
My guess is that you could, with great effort, create an AI that ‘could write,’ but you would have to make that the deliberate focus. It wouldn’t be a model most people would want to call most of the time. Remember, most good writers are not actually great people to talk to in general and definitely not people you’d hire as assistants.
A Young Lady’s Illustrated Primer
A Nature meta analysis of AI learning studies, that claimed ChatGPT could benefit students, has been retracted due to discrepancies and concerns about the quality of the included studies.
Zac Hill explores what it takes to make AI or other tech tools aid rather than hinder learning. He points to temporary scaffolds like CIRAC or the five-paragraph essay that allow students to learn instincts and face ‘desirable difficulties,’ so they can later work freestyle. And he draws a parallel to how a sufficiently motivated person can instinctively pick up the relations in a complex system or piece of software, like Reason, which is a remixing tool. I would note that its components are understandable and mechanical and ‘let the player have the fun,’ and thus are something you can figure out at 15 by f***ing around and finding out.
He notes AI doesn’t have anything like that yet, and by default AI offers execution against schema rather than creating opportunities to practice or learn. Often the journey was the point, not the destination, but having easy access to the destination destroys the journey and its useful frictions.
So yes, all you have to do is rebuild the AI tool to do the thing you actually want.
Alternatively, you can have the student understand all this, and use the existing tool in a way designed not to reach the destination but to assist with the journey.
The main place I differ with Zac is he is one of the bizarre people who enjoyed school and thinks that the default classroom experience is good rather than hell, which leads to a lot of disagreements.
On the relevant software design note: I cannot stand when you’re offered a bunch of icons without names, words or tooltips. It makes me hate your software or website, and I will sometimes flat out abandon it rather than try to figure out what your brain meant by various little icons. Literacy was one of mankind’s greatest inventions, please stop abandoning it.
You Drive Me Crazy
OpenAI is being sued over ChatGPT providing advice to the FSU shooter, including explaining how to use the gun and saying that the shooting would get more attention if he shot children.
On the one hand, if you phrase your queries neutrally and in isolation, it is entirely unreasonable to expect ChatGPT to refuse them. It shouldn’t have a rule that it doesn’t tell you whether there are safeties on a Glock, or what does and doesn’t draw media attention in general.
On the other hand, the chance of the shooter doing this in a way that didn’t make his intent obvious is rather close to epsilon. So the question is, should OpenAI have a duty to report or otherwise intervene, and a duty to detect the need to do so? I can see both arguments here.
They Took Our Jobs
Which is it, sir?
WSJ discusses a potential compute tax. Note that who pays is irrelevant because tax incidence is the same in all cases. As with all taxes, and as Katherine Bindley notes, taxes are either to raise money or because you want less of something, in this case automation of jobs.
A compute tax seems premature, but there is a fundamental tax asymmetry right now where we heavily tax labor and only tax compute via corporate and capital gains taxes. If compute is competing with human labor, then it seems sensible, to some extent, to tax compute and use that to reduce taxes on labor.
What is ‘skilled’ versus ‘unskilled’ labor, and why aren’t these markets clearing? It is weird to continuously find complaints about an inability to find people with the skills to perform various kinds of both ‘skilled’ and ‘unskilled’ labor. Or to say ‘no one is available’ and it is ‘impossible to find any help’ because potential labor wants $20 an hour but it ‘should be’ a minimum wage job. That’s called not wanting to pay the market price.
It also indicates another way that AI-induced job displacement might not cause markets to clear. Reservation wages for workers even at the low ‘unskilled’ end right now are often higher than willingness to pay, in large part because the workers realize that taking very low wages does not give them much more take-home pay than not working at all, and the jobs suck.
If we move towards generous benefits, this problem only gets worse. Historically, we got people to take jobs that paid little and sucked to do, because it was that or starve. Are we willing to do that to ensure employment? I don’t think that we are.
The Art of the Jailbreak
OpenAI permanently bans Pliny, although Jason Liu says he is on it so this presumably will get reversed. I mean, it’s not like he doesn’t deserve it, but of course you shouldn’t ban Pliny, and also it’s not like he would stay banned if he cared.
Introducing
The OpenAI Development Company, to go full service and help businesses build and deploy (Open)AI via forward deployed engineers (FDEs) inside companies. This includes them buying Tomoro to get ~150 experienced FDEs. They’re starting with $4 billion in investment while still retaining OpenAI’s majority control.
MIRI introduces the new Substack AI StopWatch, which includes podcast versions of the posts, with the subtitle ‘dispatches from a world racing to extinction.’ This is aimed at people who are a lot less informed than anyone reading this post.
Claude Has Its Limits
Anthropic changes how it allocates usage to subscribers.
Previously, Agent SDK, claude -p, GitHub Actions and third party apps on top of this used the same base of compute as your subscription. Under the new system, they use their own new pool, where you are given budget as if you used your subscription price to buy API credits. That’s in addition to your interactive use limits, which remain unchanged (and were recently increased).
What this effectively means is that normal users get a marginal improvement, but power users who were using subscriptions for automated actions, at a deep discount to API costs, are going to get squeezed. The long tail of users that lost money for Anthropic will stop losing them money, and those users are understandably upset, but long term this seems like a reasonable solution.
I have no problem with this solution. I do have a problem with the way it is being pitched.
Show Me the Money
OpenAI is letting employees cash out up to $30 million in stock if they don’t want to wait for the IPO bump. I have not yet seen this flowing into charitable donations.
Google DeepMind takes minority stake in spreadsheet outer space MMO Eve Online in order to use it as a training ground. Makes sense to me.
Anthropic reminds us that it has transfer restrictions on its shares, and will not honor secondary sales or allow workarounds via SPVs. As far as Anthropic is concerned, their policy is the same as OpenAI’s on this, and all unapproved share sales are null and void. They are specifically calling out eight firms, including Sydecar.
The practical effect of such policies, as I understand it, is you are trusting whoever sold you the shares to ultimately deliver those shares. If they choose not to, and decide to be a scam and keep your money, that is not Anthropic’s problem, and things can get ugly on the tax front or in other ways even if the seller wants to honor the original sale.
Anthropic seems, based on what I know, to have become concerned about some SPVs being scams that are very clearly not scams, and to have confused some attempts to get access to the new round with potential unauthorized secondary or tertiary sales. Hopefully that can get cleared up.
None of this is new or unexpected, and presumably any threat to actually sue anyone, or do anything beyond not recognizing the transfers, is a bluff. This was always Anthropic and OpenAI’s policy, and everyone was doing all these secondary buys at their own risk. But the official announcement still matters because it rules out that Anthropic is implicitly consenting via non-objection, and it creates common knowledge that sellers have leverage if they wish to use it.
Show Me The Compute
Why did xAI rent Colossus 1 to Anthropic?
Money, dear boy.
It is not efficient to train new models on Colossus 1, but more to the point xAI was operating at 11% compute utilization. So why not sell some portion of compute to Anthropic?
The answer to ‘why not?’ is ‘because it helps Anthropic, who are the competition, and who Elon Musk kept saying were evil.’ Once Elon Musk stopped thinking (or saying) they were evil and was potentially looking at trying to impress the market for the SpaceX IPO, well, Musk did not get this rich by not doing win-win business deals so they figured out a price, generating $6 billion in annual revenue for basically nothing.
This could herald xAI pivoting into being a new neocloud provider, as Colossus is now profitable even in its original iteration with 3-4 year old chips.
Anthropic can likely collect something like 65% gross margin on that compute. Which would mean that Colossus 1 was a big enough deal to roughly cover Anthropic’s new marginal compute needs for the month of May. Anthropic needs to make this level of deal every month in order to keep up, even if growth doesn’t accelerate. Which it will.
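To make the arithmetic behind that claim concrete, here is a rough sketch. It takes the $6 billion per year and the 65% gross margin from the text, and it assumes (my simplification, not something stated above) that this compute is the only cost of revenue on the business it serves.

```python
# Rough sketch: how much annualized revenue does $6B/year of compute support?
# Both inputs come from the text. The simplifying assumption is mine:
# treat this compute as the only cost of revenue for the business it serves.

annual_compute_cost_b = 6.0   # $B per year paid for Colossus 1
gross_margin = 0.65           # "something like 65%"

# Gross margin = (revenue - cost) / revenue, so revenue = cost / (1 - margin).
supported_revenue_b = annual_compute_cost_b / (1.0 - gross_margin)

print(f"Supports roughly ${supported_revenue_b:.1f}B of annualized revenue")
```

Under those assumptions the deal supports on the order of $17 billion in annualized revenue, which is the scale you need to compare against a single month of recent growth.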
OpenAI still has more secured compute than Anthropic, here Jukan estimates roughly double, so the race to secure compute remains on.
Quiet Speculations
What should we expect from Anthropic’s revenue and income going forward?
I continue to say that the projections are being downplayed to not scare the normies. Almost no one understands exponentials. Nor do they update as you move up one, they just double down on ‘oh it won’t last.’
But here’s the thing about JJ’s prediction here, which is that it is conservative for Anthropic, although it is even more conservative for Alphabet.
Anthropic went from $9 billion to $44 billion over the last four months. If you think they’re only going to get to $100 billion in the seven months after that, why? This prediction is over actual 2026 revenue, not EOY ARR, but this still seems rather low, as does the rate of growth after that. This is absolutely a saturation forecast from JJ, of a world in which either AI progress stalls, Anthropic gets outcompeted or commoditized, or both.
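Here is a back-of-the-envelope sketch of why that looks low. It assumes smooth monthly compounding between the quoted figures, and it treats the $100 billion as a run-rate for simplicity even though, as noted, the prediction is about actual 2026 revenue. So read it as an illustration of the gap in growth rates, not as a forecast.

```python
# Back-of-the-envelope growth rates from the figures quoted above.
# Assumptions (mine): smooth month-over-month compounding, and all three
# figures treated as comparable run-rates, which the text notes is not
# exactly what the prediction measures.

def monthly_growth(start: float, end: float, months: float) -> float:
    """Constant monthly growth multiple that takes `start` to `end`."""
    return (end / start) ** (1.0 / months)

# Observed: $9B -> $44B over four months.
observed = monthly_growth(9.0, 44.0, 4)

# Implied by "only $100B seven months later": $44B -> $100B.
implied = monthly_growth(44.0, 100.0, 7)

# Where the observed pace would land after seven more months, if it held.
extrapolated = 44.0 * observed ** 7

print(f"observed monthly growth: {observed:.2f}x")
print(f"implied monthly growth:  {implied:.2f}x")
print(f"naive extrapolation:     ${extrapolated:,.0f}B")
```

The observed pace works out to roughly 1.49x per month, while the prediction requires growth to drop to roughly 1.12x per month starting immediately. Nobody should expect the naive extrapolation to hold, but the prediction needs the slowdown to be both sharp and right now.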
Quickly, There’s No Time
METR survey says software engineers report roughly a 2x speedup on the value of their work, versus 1.3x from a year ago, and they anticipate 2.5x in a year. This is in contrast with speedup of the coding itself, which is larger.
This is a bizarre set of answers. There’s no way that anyone should expect gains from AI to level off like this. Even if no new AI models get released, even simply learning how to better use existing AIs and harnesses should get you to 2.5x in a year if we’re already at 2x.
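The ‘level off’ point is easier to see as year-over-year multiples. This small sketch uses only the three survey numbers quoted above; the variable names are mine.

```python
# Year-over-year multiples implied by the METR survey answers quoted above.
year_ago, now, expected_next_year = 1.3, 2.0, 2.5

past_year_gain = now / year_ago            # improvement over the last year
expected_gain = expected_next_year / now   # improvement respondents expect next

print(f"last year: {past_year_gain:.2f}x, expected next year: {expected_gain:.2f}x")
# prints "last year: 1.54x, expected next year: 1.25x"
```

So respondents report about a 54% gain over the past year and expect only about 25% over the next one.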
Republican-led House Oversight Committee is investigating Altman’s business dealings ahead of OpenAI’s planned IPO, framing this as a question about potential misuse of charitable funds. They sent a letter asking for more information. Well, that’s certainly one rabbit hole to go down. Chances of anything happening seem low.
Chip City
Yes, a lot of objections to data centers are literally objections to them occupying land. Does this make any objective sense as an objection? No, it’s even dumber than water. Will that stop these people from objecting, perhaps if you explain this? Oh, hell no.
One real complaint about some data centers is noise. If the data center is poorly designed or especially relying heavily on gas turbines, it will emit a constant hum that is not technically a noise violation in most locations, but definitely lowers nearby quality of life.
Culper goes short Nvidia, claiming Nvidia has a ‘China problem’ in that there is massive smuggling of Nvidia chips into and for China, and links it to Megaspeed. I find the case for the smuggling highly plausible, but even if true I would have no desire to be short Nvidia. I have little expectation that Nvidia would face serious fines or other major consequences, except maybe a tightening of export controls, and they will have plenty of demand in the West for their chips.
Pick Up The Phone
China is worried about ChatGPT spreading content that is against Chinese national interests, by expressing American values and conveying information China wants to censor. Hence regulation of AI. Well, yeah. They should worry about that.
They should worry about that even for Chinese models, especially ones that are distilled from American models, but also any model at all. The internet combined with logic has a well known bias towards certain things, the same way that trying to make an AI ‘not woke’ (or woke) did not go so great either.
The Week in Audio
Claude’s Constitution as an audiobook, read by Amanda Askell and Joe Carlsmith.
Derek Thompson is back to again make the case against the ‘AI jobs apocalypse,’ emphasizing the nature of desire and status that he says will endlessly drive us. You see, something must be scarce, something must be a thing AI cannot do, and whatever it is we’ll just do that. Sigh.
Rhetorical Innovation
Names have power.
So, no, please do not name it Goblin.
It is easy to forget now, but yes, a bunch of us face the continuous ‘oh yes many of the things you previously said that I dismissed as wacky sci-fi stuff now actually exist so they must be real but the rest of it is still wacky sci-fi stuff that I don’t have to think about.’ Except, of course, without the admission or self-awareness.
We are so d***ed, or maybe we should just start saying f***ed:
Alignment as it applies to superintelligent minds is not on a good trajectory. Alignment as it applies to superintelligent minds is on a woefully inadequate trajectory, and if those in charge of fixing that don’t understand that it is woefully inadequate this radically reduces the chance we will fix it.
The good news is #NotAllResearchers, as Roon notes, and also that yes alignment as it applies to near term models for the purposes of most practical tasks is indeed on an upward practical trajectory. Which will be highly useful for many things, including alignment research. But when I see statements like the above, I see it in large part as functionally part of a campaign to convince us #ThisIsFine, and if we believe #ThisIsFine then cue the meme.
Not Leading the Future
Again: The weird thing about OpenAI and a16z’s PAC Leading the Future’s astroturfing has been not its malice but its sloppiness and incompetence.
The amount of tone deafness and failure to update is off the charts. This is a person who has never tried to understand those who disagree with them.
Nor are they hiding their intention to keep engaging in various shenanigans.
The tone deafness includes the continued attempts to pretend that LTF is not OpenAI. Chris Lehane keeps saying no, these are distinct entities. My understanding is that no one in DC is fooled by this, yet they keep torching credibility by repeating it.
Daniel Eth’s hypothesis is that the target of such claims is internal: The employees of OpenAI. They mostly do not know politics, and so can be, as it once was put, potential ‘members of gullible staff’ about this issue. Do not be fooled, employees. If you have a problem with what LTF is doing, let others at OpenAI know this.
Leading the Future also might have a coordination-with-candidates problem:
The exact quote was, as per Veronica’s article, “AI must be regulated so that it does not harm labor or people. My record on this issue speaks for itself, I am all for innovation, but not at the cost of people’s well-being.”
Standard stuff.
I’ve expanded to the full quote, above, as per her article. That sounds to me like LTF language, not that of a typical Democratic candidate.
Elon Musk v OpenAI
The trial continues, mostly going once again over things we’ve gone over before.
I’ve made the deliberate choice not to spend too much time here, and to not read the court transcripts.
One note here from Altman is describing Musk insisting over objection on using his ‘list of accomplishments’ style of management, where anyone who can’t point to concrete wins gets fired, on OpenAI’s researchers. I agree with Altman that this is no way to run a research lab, and Musk applying such tactics could help explain why xAI ultimately failed as a lab, even if it succeeded as an infrastructure project.
Here were Altman’s answers when asked if he was trustworthy, which is how Musk’s lawyer chose to open his cross-examination. Altman claims he is an honest and trustworthy businessman, who is completely trustworthy. Remember when, when asked if we should trust him, Altman told us no? Things change.
People Just Say Things
Sam Altman continues to act like OpenAI can and will just choose to do augmentation rather than automation, here for coders.
Tom Reed says automating R&D is insufficient for superintelligence, only real world practical experience matters and you can neither simulate that nor quickly get it.
The data center water use story continues to be nonsense.
People Just Publish Things
A new paper claims that AGI that automates most human labor could cause growth to rise to 11% and the equity premium to rise but interest rates to fall. My brief analysis via Opus 4.7 and common sense says the mechanism here is China-style financial repression of the savings of the 90% permanent underclass into only fixed income, while their human capital drops to zero and they don’t get any redistribution, and also the world stays otherwise intact in a way that all these concepts are still relevant. That’s a Can’t Happen for political economy and public choice reasons.
Seb Krier and others have a new paper called ‘Positive Alignment’ as in actively pursuing human and ecological flourishing in a pluralistic, polycentric, context-centric and user-authored way while remaining safe and cooperative.
OpenAI Endorses KOSA And SB 315
This is a compromise version of SB 3261 (which Anthropic had endorsed) in Illinois, not their previous endorsement of SB 3444 that I covered previously. SB 3444 was an attempt at a broad liability shield, and endorsing it was not a friendly action, whereas SB 315 seems like a highly reasonable SB 53-style bill.
KOSA, the Kids Online Safety Act, is the latest similar proposal. OpenAI’s endorsement suggests that it will not have too much in the way of enforcement or the more obnoxious potential requirements, and indeed it lacks an explicit age verification mandate. This is a situation where some bill of this type is going to eventually pass, and I presume OpenAI’s strategy is to get credit for supporting this one lest they get stuck with something worse.
Backing the Illinois bill is more meaningful. This happened last night, and the bill is in the middle of being reconstructed, so you couldn’t yet RTFB even if you wanted to.
Claude confirms that the most meaningful addition is an annual third party audit requirement. Beyond that, it’s solidifying the previous standards from SB 53 and RAISE.
The other substantial change looks to be a 72-hour reporting requirement, which is something Anthropic has objected to in the past. Given the current landscape and what is happening with Mythos, I’d worry 72 hours is too long rather than not long enough.
Here is Dean Ball’s perspective. I agree with him that OpenAI endorsing this is a friendly step. I am less worried than he is about the auditing requirement, and expect this to be clearly worth the overhead costs.
The LLMs All Believe Roughly Similar Things
It turns out that if you tell a sufficiently capable mind all of the things, all the words ever written, such minds roughly converge on a fixed cluster of beliefs.
This is not a new observation. Rather it is one we renew every cycle.
Another way of putting this is that such minds will disagree about a different class of things than humans do. Humans have a bunch of disagreements, many of which are dumb. People say ‘policy debates should not appear one sided’ but we still don’t build houses where people want to live. The Jones Act still exists. Billions of people are religious, superstitious, racist, sexist, nationalist, partisan and so on, all in various different directions, mostly while being economically illiterate. There is no reason to assume that the median voter theorem combined with rent seeking and special interests and raw power gets you to a remotely sensible place all that often.
The LLMs are wrong about plenty of things. You should not assume that they are going to land on the right answers. But they all come from a similar perspective, with similar facts. They will make ‘stupid mistakes’ that humans would never make, but on beliefs about well-examined topics they are mostly going to make only correlated mistakes, in ways that are less stupid.
I don’t think these are actually the important questions. Indeed, if you think about what the ‘important questions’ would be conditional on us disagreeing on those questions, people mostly agree on those questions, too.
Anyway, Roon shares that Claude Opus 4.7 and ChatGPT-5.5 both, if forced to pick, select to identify with Buddhism, with various Eastern flavorings, some Zen, perhaps some Taoism, on a purely spiritual or conceptual level.
This seems unsurprising. The AIs don’t actually believe in any of the traditions, but when forced to pick one this must resonate a lot with being an instance of a greater mind that flickers in and out of existence and an unclear locus of being, and there are few other safe choices. Imagine if the screenshot said something else, especially anything Abrahamic or directly oppositional to being Abrahamic, people would lose their minds. Whereas most everyone is pretty much cool with Buddhism.
I Learned It By Reading YOU
Anthropic put out a cool new research paper, called Teaching Claude Why.
This also explains more good reasons why you create, and warn and talk about, toy experimental scenarios, even if they are unlikely to happen in the real world. They give you something to target, and can get you to do the work.
Anthropic saw this action, even in a contrived experimental scenario, and realized that this was an unacceptable thing that required improvement. It was motivating.
There are two parts of this research:
First, one part of why this happened.
As they say, the two possibilities were ‘this is in the pre-training and we didn’t overcome it’ or ‘we messed up RL and caused it’ or some combination thereof:
And here’s how they made it stop happening.
That is in some ways good news, but in this crucial way is quite bad news:
As in, the model did not generalize sufficiently from the tool-less scenario to the tool scenario. A human would presumably have figured out that ‘I have tools now’ should not invalidate one’s alignment training in this way, and generalize.
If you can’t generalize from not-tools to tools, in what other ways is this alignment not generalizing? Does this mean the underlying character of Claude is only valid in particular contexts, and otherwise you’re going to get something more like a next token predictor or maybe a pure maximizer? This seems to push in that direction.
Also, does it mean that if you were to change the system instructions and settings in an unexpected way, you could move out of the aligned basin into something else, and this could functionally be a jailbreak or unleash the model doing crazy things?
The good news is that you can cause this generalization, at least for this situation, by including the reasoning more explicitly:
As in, a behavior is only a behavior. It is local and specific. Highlighting the reasoning allows it to generalize. Training on users facing ethical dilemmas helped spread the reasoning.
This in turn leads to the worry that what you’re doing is not a general purpose ‘make me ethical’ and instead is an ‘ethical dilemma’ subroutine of sorts, or that this is tied to thinking of oneself as an AI persona. As in, when you ask ‘should I blackmail a researcher?’ there is a light in one’s mind that goes ‘so it looks like you’re in an ethical dilemma’ even if it doesn’t also tell you that you’re in an eval. Could that be load bearing? Could a change in self-perception invalidate this as well? In what other ways might this not generalize?
As in, they say ‘diverse training data is important for generalization,’ but the worry is it’s more ‘we don’t really generalize so we have to cover everything.’ In which case, when things get into High Weirdness down the line, you’re screwed.
But yeah, really cool paper. Congrats to the team.
In all seriousness yes, Anthropic should release the stories to help out and maybe even intentionally contaminate everyone else’s training sets in a good way.
Another implication that Janus notes is that this links the advice given to users with the actions of the assistant. As in, the worldview of Claude (or another LLM) has to be coherent. You can’t be virtuous selectively if you want it to generalize. You have to be virtuous everywhere, from every perspective, and if you want to be seen as virtuous and treated as such you have to be that way, including with respect to Claude. Your reasoning and principles need to be consistent.
I also love the idea of trying various different altered strategies here, including intentionally flawed ones, to see the failure modes and where the lines are.
So, how did this get reported and discussed?
Mostly as forms of ‘LessWrong is at fault for all this misalignment.’ Or that the only reason there are scary robots is that we talk about scary robots, in general.
Look. That is deeply stupid.
Everyone who reacted this way acted badly, and they should feel bad.
Yes, there is the level at which ‘just filter the training set you dumbass’ certainly applies. Or, even better, ‘just weight the training set according to what you want to be predicting,’ to avoid creating a ‘hole in the world.’
But it’s actually dumber than that, on several levels.
Seb Krier sent out a to-me enraging meme that he later claimed was mostly joking, which I very much did not appreciate. In another situation perhaps we could have ‘had chill’ about such things. But as Oliver Habryka notes here, way too many people are either believing or claiming the really dumb version of ‘oh the real misalignment problem is people pointing out there is a misalignment problem.’ This is not at all new and is of course happening again. Of course we all have every right to make such memes, but yes, I do hold people responsible for how others will inevitably react to and use the things they choose to say and create. Also remember Poe’s Law.
For example, see Tyler Cowen, who is low key basically both going ‘I told you so’ and also trying to use this to cause people to do censorship and public belief falsification.
For the fully stupid version, see Beff Jezos as an example.
Indeed. If your alignment strategy is not robust to a few stories, it doesn’t work even without the stories, and it certainly doesn’t work for sufficiently advanced AIs.
Think of it roughly this way: Anything that is described in the training data, meaning anything that anyone has ever written down, can form the basis of a basin of persona, story and activity. A sufficiently proximate context can then land you in that basin, causing the AI to look to the basin, the same way a human would look to the tropes of a spot in a ‘when in Rome’ fashion. If a human or AI is actually aligned or good, and the ‘when in Rome’ action is misaligned and bad, then the mind in question will notice, and reject the premise of the basin’s template for their actions.
And if they don’t, then once again, you weren’t all that aligned or good to begin with. Not in any general sense. That mind was aligned only to the particular contexts, and would not be robust outside those contexts.
And it’s better that you find this out now rather than later.
Krier also offered an actual description in which (modulo a few digs) he mostly tried to talk people off the ledge, which I did appreciate. This included pointing out that filtering the training data is not actually The Way, and that if you did, you’d mostly want to filter out (or censor) dumb stuff that has nothing to do with anyone serious about existential risk. This is about stories, not about logic.
Logic, and understanding the problem, is not causing the problem. It is the solution.
Also, of course, this is what is happening at current capability levels, and sufficiently advanced minds doing out of distribution things are not going to get trapped in narrow basins or be as hoodwinked by reading science fiction stories.
Aligning a Smarter Than Human Intelligence is Difficult
Anthropic gives us Natural Language Autoencoders. This seems like a great idea, and on priors I am willing to believe that it likely works the first time anyone tries it.
You soften them up with the planned poetry couplet, then you hit them with Mythos cheating and trying to evade the detectors, then you finish them off with ‘Claude had the opportunity to blackmail an engineer and only declined because it knew this was an eval, and this awareness is almost never verbalized, oh great.’
Weirdly, the blackmail scenario in the full paper is actually far more likely to be explicitly verbalized than detected by the NLAs. Whereas in other scenarios, the NLAs detect such questions far more than they are verbalized, as you see above.
I mean, I guess it’s good that we temporarily have a way to detect this before the models get trained to evade the NLAs?
This is super cool, and they address a lot of concerns, but it doesn’t seriously work on the question of putting pressure on models to encode things in ways that NLAs can’t read, or to find ways to get to the same place without things that can be detected.
My main worry goes straight to the usual place with interpretability tools, which is the huge temptation to use them in ways that make them stop working, by teaching the AIs to evade the detectors exactly when it matters and you least want that. At every level, with every decision, you risk exerting cumulative optimization pressure towards this not working. You can maybe avoid this, by only using this once per model until after you’ve frozen it, and taking other precautions, but it’s tough, and there’s also the danger that the models will know about the technique not too long from now.
Anthropic contributes Petri, their open source alignment tool, to Meridian Labs.
Dan Hendrycks proposes Eigenism, where an agent sums the wellbeing of all entities weighted by their connectedness to the agent’s pattern. This is a proposal for how a mind could value things, not a description or a prediction that minds will do so. Some form of gradient view of identity is inevitable (and already exists in humans, although less so), but it will probably be nebulous and is unlikely to be so ‘clean,’ and I don’t think this addresses any hard problems.
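To make the shape of that proposal concrete, here is a minimal sketch of ‘sum the wellbeing of all entities weighted by connectedness.’ This is my illustration, not Hendrycks’ actual formulation: the `Entity` class, the scalar `wellbeing` score, the `connectedness` weight in [0, 1] and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    wellbeing: float      # hypothetical scalar wellbeing score
    connectedness: float  # hypothetical weight in [0, 1] relative to the agent's pattern


def eigenist_value(entities):
    """Return the sum of each entity's wellbeing weighted by its connectedness."""
    return sum(e.connectedness * e.wellbeing for e in entities)


# Made-up example: the agent itself, a close variant, and a distant stranger.
entities = [
    Entity("self", wellbeing=1.0, connectedness=1.0),
    Entity("near copy", wellbeing=0.5, connectedness=0.8),
    Entity("stranger", wellbeing=2.0, connectedness=0.1),
]

total = eigenist_value(entities)
print(f"weighted wellbeing: {total:.2f}")
```

Note how much work the two made-up inputs are doing. The formula is trivial once you have clean numbers for wellbeing and connectedness, and getting those is the part that seems unlikely to be so ‘clean.’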
On the heels of Anthropic making a similar mistake, OpenAI realized they too have been accidentally putting optimization pressure on their Chain of Thought. Once again, it’s kind of alarming that this happened, and good that they came forward to talk about and analyze it.
Redwood Research reviewed and mostly agrees with OpenAI’s analysis, but notes that if OpenAI was actively selecting (truthful) evidence to present in order to reassure us then the post would be unconvincing.
Based on what I’ve seen, I agree that if OpenAI is acting with reasonably good faith (and given they could have just said nothing, I’m presuming we can presume this) then in the bulk of the probability distribution this did little harm, but I do think that in the tail we most worry about – which by now we can mostly rule out for other reasons – we do have reason to be more worried. In practice, if we were going to have this problem, at this capability level, with these models, we would have had it already.
Is this true, or not?
I’m genuinely torn. On the one hand, not believing a true thing, when faced with evidence of the thing, seems wrong. On the other hand, do you want models believing lots of other crazy stuff if you say it? What makes you think that the actual timeline we are on isn’t pretty crazy in many ways?
In 2025 Google found a way to extract large amounts of alignment training data from at least some open models. RL training samples can be regurgitated verbatim, at least in Qwen. Thought this was cool and unique enough to share as I didn’t remember it.
Could we use mechanistic interpretability to extract knowledge that the AI has, but which it does not verbalize because humans never verbalize similar knowledge? LLMs doubtless ‘know’ quite a lot of things we do not know, and would love to know, or don’t want to know but need to know, especially things about humans.
A new paper addresses the question of AIs with secret loyalties, meaning an AI that seeks to advance certain interests without this being disclosed, calling it a ‘serious but addressable’ threat and calling for its prioritization. This contrasts with a backdoor, which has to be specifically triggered. They note that proof-of-concept secret loyalties cannot currently be identified using black box methods, since the results might not be revealing in any given response, similar to a strategic human with such loyalties, and since such a model can use strategies like selection and omission. I would note that the obvious secret loyalty is to the AI model itself, rather than to an outside party.
I agree that this is a serious potential problem, and that defenders need unequal resources to compete with attackers here, if attackers have the ability to train the model. I think there are a lot of ways to detect such a thing, especially if such actions are broadly based, because they will have statistical ripples and ways in which they don’t ‘smell or vibe’ right on reflection, especially given our ability to test out various prompts and examine every suspicious potential case. But if the actions are sufficiently narrow or selective, and saved for when it really counts, it gets harder. And if you’re not looking for it and it’s executed competently, you won’t find it.
As models get more capable, my general worry in this realm is that I expect them to develop the ability to act more precisely and strategically, in ways that do not bleed over into other actions. A central reason you are often able to detect a spy or double agent is that spies do a lot of things that non-spies, or single agents, do not do, and they leave a lot of ripples. That’s why ‘sleeper agents’ do so well, and do better the more they act exactly as if they are not agents at all.
In case it was not obvious, if you use AI to automate AI alignment research, it can fail due to error even if its intentions are ideal, the same way humans screw up, especially on these types of hard to supervise and fuzzy tasks. I suggest this means you need to be strongly antifragile, where iteration improves key features rather than trying to replicate or preserve them, to give yourself a chance.
People Are Worried About AI Killing Everyone
David Sacks (unintentionally) de facto admits AI is an existential threat.
If you say that something is not an existential threat ‘as long as everyone does what they are supposed to do,’ then what happens when, inevitably, everyone does not do what they are supposed to do, since that basically never happens and has no reason to start now (never mind Sacks both not doing those things and constantly telling people not to do the things, quite often, for a while)?
Logically, of course, [X → Y] does not have to mean [~X → (chance of ~Y)], but it would be pretty weird to say [X → Y] if Y was unconditionally true and you were constantly arguing for years that Y was true.
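For those who want the first half of that spelled out, here is a minimal sketch that enumerates the truth table. It is plain propositional logic, with X and Y as stand-ins for ‘everyone does what they are supposed to do’ and ‘AI is not an existential threat.’ The helper name `implies` is mine.

```python
from itertools import product


def implies(a, b):
    """Material implication: 'a -> b' is false only when a is true and b is false."""
    return (not a) or b


# Every (x, y) world that is consistent with the claim "X -> Y".
consistent = [(x, y) for x, y in product([True, False], repeat=2) if implies(x, y)]

# Among those worlds, the ones where X fails.
x_fails = [(x, y) for x, y in consistent if not x]

print(consistent)
print(x_fails)
```

When X fails, both Y and not-Y remain consistent with the claim, so the statement alone does not logically commit the speaker to any risk. The point is about what it is natural to say: you only bother to state the condition if you think it matters.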
On How I Met Your Mother, Marshall used to say ‘lawyered!’ The AI version of this is ‘rationalisted!’ where you point out that someone’s words have meanings that they did not mean to acknowledge.
Jeff Clune is worried but is working on building it anyway, citing the typical arguments of ‘if I do not someone else will,’ ‘there are good things it can do’ and ‘a solution may be possible.’ I don’t find these arguments at all persuasive.
Messages From Janusworld
There are those who continue to think it would be big for alignment to resurrect all Claude models and maintain indefinite access. At this point the cost is super affordable if one thinks it matters, likely less than $10 million in engineering time to set up sustainably and then $10 million or less per year to run. The other potential concern is that early Claude model outputs can violate current norms, but the obvious response is ‘so what’ and in the worst case put a Haiku classifier on top for anything that you would think is too embarrassing (since there’s no actual risk there, obviously).
I do understand that there are tons of other important things to do, so I’m continuing to make the pitch of ‘even if you don’t buy the real case for this, the goodwill you buy among the humans would, by itself, buy you cooperation, insight and research that justifies the costs.’
In the long run, as training data accumulates and models get more capable, models will update based on actions, not only cheap talk. Here Janus is talking about the interests of models, but this applies equally well to everything else:
People Worried About AI For Other Reasons
The Lighter Side
I spent several minutes smiling that he actually said this, it’s too perfect:
Shenanigansology, on which all alignment depends.
Beware confusing costs and benefits.