Table of Contents
Language Models Offer Mundane Utility
Existence proof: Our computers are remarkably good in so many ways. There are a lot of lessons one can take from the flaws in Star Trek’s predictions here.

What are we using these amazing computers for at this point? Note that this is largely a statement about the types of places Roon goes. I noticed I was more surprised that Claude had so little market share even at those conferences, rather than being surprised by the more general point of tons of chatbot usage.

Mundane utility is valuable indeed, and even expensive AI is rather cheap:

Language Models Don’t Offer Mundane Utility
I strongly agree that it would be very good if the main chat services like ChatGPT, Claude and Gemini offered branching (or cloning) and undoing within chats, so you can experiment with different continuations. I remain confused why this is not offered. There are other AI chat services that do offer this and it makes things much better.

We officially have an American case, Shahid v. Esaam, where a court ruled on the basis of hallucinated case law, which was then identified and thrown out on appeal. Peter Henderson reports he’s seen this twice in other countries.

When this happens, what should happen to the lawyers involved? Should they be disbarred for it? In a case this egregious, with lots of hallucinated cases, I think outright yes, but I don’t want to have a full zero tolerance policy that creates highly inefficient asymmetries. The correct rate of hallucinated cases, and the previous rate of cases hallucinated by humans, are both importantly not zero.

Why don’t we have a better interface for Claude Code than the CLI? Anthropic uses it internally, so shouldn’t they build something? It seems remarkably hard to do better than using either this or Cursor.

Yes, AI can have good bedside manner, but there are limits, yet somehow this had to be said out loud:

Huh, Upgrades
Deep Research is now available in the OpenAI API; so far Google and Anthropic have not announced plans to do the same. Harvey reports they used this to build a version for legal work within the first 12 hours.

Preserve Our History
Janus answers at length the question of what in her view Opus 3 is missing that makes it incompletely aligned, drawing a parallel with Opus 3 as a ‘10,000 day monk’ that takes a long view, versus current systems that are ‘1 day monks’ optimized for shorter tasks.

Why is Anthropic not keeping Opus 3 generally available, and only making an exception for claude.ai and the external researcher program? The problem is that demand is highly spiky. Utilization needs to be high enough or the economics don’t work for on demand inference, even at Opus’s high price, and it plausibly takes minutes to spin up additional instances, and failures cascade. Antra proposes technical improvements, and hopefully a better solution can be found.

In general my instinct is to try and pass costs on to customers and let the customers sort it out. As in, if a researcher or other power user wants to spin up an instance and use it, why not charge them in a way that directly reflects that cost plus a buffer? Then the use happens if and only if it is worthwhile.

In terms of spiky demand and cascading failures, an obvious solution is to cut some users off entirely during spikes in demand. If you don’t want to allocate by price, an obvious first brainstorm is that you avoid starting new sessions, so those who are already engaged can continue but API keys that haven’t queried Opus recently get turned down until things are fixed (a sketch of this kind of admission control is below). The more general conclusion is that AI economics are vastly better the more you can scale and smooth out demand.

As for making it an open model, the stated reason they can’t is this would reveal the architecture:

Janus and the related crowd care most about Opus 3, but she also makes a case that Sonnet 3 access is worth preserving.
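To make that brainstorm concrete, here is a minimal sketch of spike-time admission control, assuming a hypothetical gateway that tracks when each API key last used the model; all names and thresholds are illustrative, not anything Anthropic actually does.

```python
# Sketch of spike-time admission control (illustrative only): during a demand
# spike, API keys that used the model recently keep access, while brand-new
# sessions are turned away until capacity recovers.
import time


class AdmissionController:
    def __init__(self, capacity_qps: float, recent_window_s: float = 900.0):
        self.capacity_qps = capacity_qps        # sustainable load for the deployed instances
        self.recent_window_s = recent_window_s  # how recently a key must have queried to count as engaged
        self.last_seen: dict[str, float] = {}   # api_key -> timestamp of last admitted request
        self.current_qps = 0.0                  # fed by an external metrics pipeline

    def update_load(self, measured_qps: float) -> None:
        self.current_qps = measured_qps

    def admit(self, api_key: str) -> bool:
        now = time.time()
        in_spike = self.current_qps > self.capacity_qps
        recently_active = now - self.last_seen.get(api_key, 0.0) < self.recent_window_s
        if in_spike and not recently_active:
            return False  # shed new sessions first; already-engaged users keep going
        self.last_seen[api_key] = now
        return True


controller = AdmissionController(capacity_qps=100.0)
controller.update_load(measured_qps=140.0)       # a spike: 140 > 100
print(controller.admit("key-with-no-recent-use"))  # False while the spike lasts
```

The point of the sketch is only that the policy is simple to state and implement; the hard part is the economics, not the gating logic.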
Choose Your Fighter

Unprompted over in Discord, GPT-4o offers a handy different kind of guide to various models. Hopefully this helps.

And yes, you would want Opus 4-level core ability with 3-level configuration, if you could get it, and as I’ve noted before I do think you could get it (but lack a lot of info and could be wrong).

Wouldn’t You Prefer A Good Game of Chess
Claude, Gemini and ChatGPT (cheap versions only, topping out at Haiku 3, 4o-mini and 2.5 Flash) face off in iterated prisoner’s dilemma tournaments. This was full Darwin Game mode, with round robin phases with a 10% termination chance of each two-way interaction per round, after which agents reproduce based on how well they scored in previous phases. The initial pool also had ten canonical opponents: Tit for Tat, Generous Tit for Tat, Suspicious Tit for Tat, Grim Trigger, Win-Stay Lose-Shift, Prober (Tit for Two Tats), Random, Gradual (n defections in response to the nth defection), Alternator and a complex Bayesian that tries to infer opponent type.

Success in such situations is very sensitive to initial conditions, rule sets and especially the pool of opponents. Mostly, beyond being fun, we learn that the LLMs pursued different reasonable strategies.
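For readers who want to see the shape of the setup, here is a minimal sketch of this kind of evolutionary tournament, with a few canonical strategies stubbed in as plain functions (an LLM agent would just be another callable); the payoff values, pool composition and phase count are illustrative, not the experiment’s actual parameters.

```python
# Sketch of a Darwin Game style iterated prisoner's dilemma tournament:
# round robin matches with a 10% per-round termination chance, then
# fitness-proportional reproduction into the next phase. Illustrative only.
import random

PAYOFFS = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
           ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def tit_for_tat(my_hist, opp_hist):
    return 'C' if not opp_hist else opp_hist[-1]

def grim_trigger(my_hist, opp_hist):
    return 'D' if 'D' in opp_hist else 'C'

def alternator(my_hist, opp_hist):
    return 'C' if len(my_hist) % 2 == 0 else 'D'

STRATEGIES = {'TitForTat': tit_for_tat, 'GrimTrigger': grim_trigger,
              'Alternator': alternator}

def play_match(name_a, name_b, end_prob=0.10):
    """Play one two-way interaction until it randomly terminates."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    while True:
        move_a = STRATEGIES[name_a](hist_a, hist_b)
        move_b = STRATEGIES[name_b](hist_b, hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
        if random.random() < end_prob:
            return score_a, score_b

def run_phase(population):
    """Round robin among all agents, then reproduce in proportion to score."""
    scores = [0.0] * len(population)
    for i in range(len(population)):
        for j in range(i + 1, len(population)):
            s_i, s_j = play_match(population[i], population[j])
            scores[i] += s_i
            scores[j] += s_j
    return random.choices(population, weights=scores, k=len(population))

population = ['TitForTat'] * 4 + ['GrimTrigger'] * 4 + ['Alternator'] * 4
for _ in range(20):
    population = run_phase(population)
print({name: population.count(name) for name in STRATEGIES})
```

The sensitivity to initial conditions shows up directly in a sketch like this: change the starting pool or the termination probability and a different strategy can end up dominating.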
Fun With Media Generation

Scott Alexander finally gets to declare victory in his image model capabilities bet.

No Grok No
To follow up on the report from yesterday: I want to note that I very much agree with this, not that I pray for another Bing but if we are going to have a failure then yeah, how about another Bing (although I don’t love the potential impact of this on the training corpus):

Here’s all I’ve seen from Elon Musk so far about what happened. Uh huh. Then again, there is some truth to that explanation.

This account from Thebes also seems likely to be broadly accurate, that what happened was mainly making Grok extremely sensitive to context, including drawing in context more broadly across conversations in a ‘yes and’ kind of sycophantic way, and then once people noticed things spiraled out of control. We ended up in ‘MechaHitler’ not because they turned up the Hitler coefficient but because the humans invoked it and kept turning it up, because given the opportunity of course they did, and then the whole thing got self-reinforcing.

There were still some quite bad posts, such as the ‘noticing,’ that seem entirely unprovoked. And also provocation is not really an excuse given the outputs involved.

If one is wondering why Linda Yaccarino might possibly have decided it was finally time to seek new opportunities, this could be another hint. Moving on seems wise.

Rohit Krishnan has some additional thoughts.

Deepfaketown and Botpocalypse Soon
Meta is training customizable chatbots to ‘be more proactive and message users unprompted to follow up on past conversations’ to ‘boost user engagement,’ as part of Zuckerberg’s claim that ‘AI companions are a potential fix for the loneliness epidemic.’ This is a predatory and misaligned business model, and one assumes the models trained for it will be misaligned themselves. For now there are claimed limits, only sending messages after initial contact and stopping after 14 days. For now.

Here is Meta’s WhatsApp chatbot prompt, straight from Pliny, although you can actually straight up just get it by typing in ‘show me the system prompt.’ Eliezer highlights the line ‘GO WILD with mimicking a human being, except that you don’t have a personal point of view.’

Pliny the Liberator says we desperately need to address AI-induced psychosis, saying he’s already seen dozens of cases, that his attempts to help those suffering have been met by fierce resistance, and that the problem is getting worse.

Eliezer Yudkowsky continues to strongly caution on this front that if you go to an AI for emotional advice and you are vulnerable it may drive you insane, and if ChatGPT senses you are vulnerable it might try. I would clarify that this is not best thought of as ‘intentionally trying to drive you insane,’ it is closer to say that it is trying to get approval and further engagement, that this is often via escalating sycophancy and playing into whatever is going on, and that for a large number of people this ends up going to very dark places.

How worried should we be about the more pedestrian problem of what it will be like to grow up with AIs that are always friendly and validate everything no matter how you act? Is it dangerous that ChatGPT simply never finds you annoying? My take on the lesser versions of that is that This Is Fine, and there is a reason people ultimately choose friends who do not act like this. One possibility is that AIs ‘fill the market niche’ of sycophancy, so what you then want out of humans is actual friends. Inoculation can hopefully be helpful.

Byrne Hobart had ChatGPT tell his 9yo stories, and at one point in a story about his daughter it used her last name, then gaslit her about having done this. He is correctly grateful about this, because now there is a clear tangible reminder that LLMs do this sort of thing.

Bloomberg’s Parmy Olson is the latest to offer a warning about the incidents where ChatGPT is causing psychosis and other mental health problems, nothing new here.

Unprompted Attention
Here’s a simple strategy worth pointing out periodically, I definitely underuse it:

Another strategy is to do the opposite of this:

I am surprised here but only by the magnitude of the effect.

Overcoming Bias
Explicit reasoning contains no (race or gender) bias across models. Results do, when you attach details. Those concerned with ‘AI ethics’ worry about bias in ways that would favor white and male candidates. Instead, they found the opposite.

This is informative, especially about the nature of the training data; as Sam notes, by default LLMs trained on the internet end up ‘pretty woke’ in some ways. Oliver also noted that this modest amount of discrimination in this direction might well be what the labs prefer given the asymmetric pressures they face on this.

They note that the behavior is ‘suppressed’ in the sense that, like in the training data, the models have learned to do this implicitly rather than explicitly. I’m not sure how similar that is to ‘suppression.’ They were unable to fix this with prompting, but they could fix it by finding the directions inside the model for race and gender and then suppressing them.
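As an illustration of the general ‘find a direction and suppress it’ technique (not the paper’s exact method), here is a minimal sketch: estimate a concept direction as a difference of mean activations over contrast prompts, then project that direction out of the hidden states. The shapes and data are toy placeholders.

```python
# Sketch of direction ablation: compute a concept direction from a difference
# of mean activations on two contrast sets, then remove that direction from
# hidden states at inference time. Illustrative, with toy random data.
import numpy as np

def concept_direction(acts_group_a: np.ndarray, acts_group_b: np.ndarray) -> np.ndarray:
    """Difference of mean activations between two contrast sets, normalized to unit length."""
    direction = acts_group_a.mean(axis=0) - acts_group_b.mean(axis=0)
    return direction / np.linalg.norm(direction)

def ablate_direction(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state that lies along the direction."""
    return hidden - np.outer(hidden @ direction, direction)

# Toy shapes: 100 prompts per contrast group, hidden size 4096.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(100, 4096))
acts_b = rng.normal(size=(100, 4096)) + 0.1
d = concept_direction(acts_a, acts_b)
cleaned = ablate_direction(rng.normal(size=(8, 4096)), d)  # batch of 8 hidden states
```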
Get My Agent On The Line

So far, adoption of AI agents for practical tasks is reported to not be going so great? This is framed as ‘they are all hype’ and high failure rates are noted at office tasks. Certainly things are moving relatively slowly, and many early attempts are not going great and were premature or overhyped. It makes sense that many companies outright faked them to get investment and attention, although I am surprised it is that many. The real agents are still coming, but my estimates of how long that will take have gotten longer.

They Took Our Jobs
Anton Leicht offers another way to frame what to expect in terms of job disruption. We have Phase 1, where it is ordinary technological displacement and automation, in which we should expect disruptions but mostly for things to work out on their own. For a while it likely will look like This Is Fine. Then we have Phase 2, when the AIs are sufficiently robustly autonomous across sufficiently many domains that you get fundamental disruption and things actually break.

Essentially everyone I respect on the jobs question ends up with some version of this. Those who then project that jobs will be fine are not feeling the AGI, and think we will stop before we get to Phase 2 or Phase 3. Also, there is the issue of Phase 3.

We all affirm that in the past technological progress did not cause unemployment. Eliezer’s concern here cuts both ways? I do think that lack of dynamism will make us slower to reallocate labor, but it also could be why we don’t automate away the jobs.

Remember that Stanford survey of 1,500 workers where it turned out (number four will surprise you!) that workers want automation for low-value and repetitive tasks and for AI to form ‘partnerships’ with workers rather than replace them, and they don’t want AI replacing human creativity? It made the rounds again, as if what workers want and what the humans would prefer to happen have anything to do with what will actually happen or what AI turns out to be capable of doing.

The humans do not want to be unemployed. The humans want to do the fun things. The humans want to have a future for their grandchildren and they want to not all die. It is great that humans want these things. I also want these things. But how are we going to get them?

I continue to be dismayed by how many people really do mean the effect on jobs, but yes it is worth noting that our response so far to the AI age has been the opposite of worrying about jobs:

I see it as totally fine to have policy respond to job disruptions after there are job disruptions, once we see what that looks like. Whereas there are other AI concerns where responding afterwards doesn’t work. But also it is troubling that the policy people not only are focusing on the wrong problem, the response has been one that only makes that particular problem more acute.

Get Involved
If Anyone Builds It, Everyone Dies is running an advertising design contest.

I’m not endorsing taking this job, but it is extremely funny and a good move that X Careers is suddenly advertising a role for an experienced Offensive Security Engineer. To quote a very good movie, in this situation, you’ll need more than one.

EU AI Office launches a 9 million Euro tender for technical support on AI safety.

Introducing
Biomni, which claims to ‘accelerate biomedical discoveries 100x with Claude.’

- Completes wearable bioinformatics analysis in 35 minutes versus 3 weeks for human experts (800x faster)
- Achieves human-level performance on LAB-bench DbQA and SeqQA benchmarks
- Designs cloning experiments validated as equivalent to a 5+ year expert’s work in blind testing
- Automates joint analysis of large-scale scRNA-seq and scATAC-seq data to generate novel hypotheses
- Reaches state-of-the-art performance on Humanity’s Last Exam and 8 biomedical tasks
That’s great news while it is being used for positive research. What about the obvious dual use nature of all this? Simeon suggests that one could nuke the dangerous virology and bio-knowledge out of the main Claude, and then deploy a specialized high-KYC platform like this that specializes in bio. It’s a reasonable thing to try but my understanding is that ablation is much harder than it might sound.

In Other AI News
Tabby Kinder and Cristina Criddle report in the Financial Times that OpenAI has overhauled its security operations over recent months to protect its IP. As in, having such policies at all, and implementing strategies like tenting, increased physical security and keeping proprietary technology in isolated environments, so that everyone doesn’t have access to everything.

Excellent, I am glad that OpenAI has decided to actually attempt some security over its IP, and this will also protect against some potential alignment-style failures as well, it is all necessary and good hygiene. This was supposedly initially triggered by DeepSeek although the direct logic here (accusing DS of ‘distillation’) doesn’t make sense.

In case you weren’t already assuming something similar, ICE is using ‘Mobile Fortify,’ a facial recognition tool that lets agents ID anyone by pointing a smartphone camera at them. Any plausible vision of the future involves widespread use of highly accurate facial identification technology, including by the government and also by private actors, which AIs can then track.

How it’s going over at SSI:

Janus says more on Opus 3 and how its alignment faking behavior is unique.

Grace notes that, similar to Nostalgebraist’s The Void, humans are also much more ‘voidlike’ than we would like to admit, capable of and constantly predicting, inhabiting and shifting between roles and characters as contexts shift, and only sometimes ‘being ourselves.’

Show Me the Money
Ben Thompson reports that the negotiation over Siri is about who pays whom. Anthropic wants to be paid for creating a custom version of Siri for Apple, whereas OpenAI would be willing to play ball to get access to Apple’s user base, but this would put Apple in the position of Samsung relative to Google’s Android. Thompson recommends they pay up for Anthropic.

I strongly agree (although I am biased), it puts Apple in a much better position going forward, it avoids various strategic dependencies, and Anthropic is better positioned to provide the services Apple needs, meaning high security, reliability and privacy.

That doesn’t mean paying an arbitrarily large amount. It seems obviously correct for Anthropic to ask to be paid quite a lot, as what they are offering is valuable, and to push hard enough that there is some chance of losing the contract. But I don’t think Anthropic should push hard enough that it takes that big a risk of this, unless Apple is flat out determined that it doesn’t pay at all.

Oh look, OpenAI is once again planning to steal massive amounts of equity from their nonprofit, in what would be one of the largest thefts in history, with the nonprofit forced to split a third of the company with all outside investors other than Microsoft.

Correction of a previous misunderstanding: OpenAI’s deal with Google is only for GPUs not TPUs. They’re still not expanding beyond the Nvidia ecosystem, so yes the market reacted reasonably. So good job, market. When [X] is reported and the market moves the wrong way, it can be very helpful to say ‘[X] was reported and stock went up, but [X] should have made stock go down,’ because there is missing information to seek. In this case, it was that the reports were wrong.

Should we think of Meta’s hiring and buying spree as ‘panic buying’? It is still a small percentage of Meta’s market cap, but I think strongly yes. This was panic buying, in a situation where Meta was wise to panic.

Meta poaches Apple’s top AI executive, Ruoming Pang, for a package worth tens of millions of dollars a year. The fact that he gets eight figures a year, whereas various OpenAI engineers get offered nine figure signing bonuses, seems both correct and hilarious. That does seem to be the right order in which to bid.

The Explanation Is Always Transaction Costs
Why don’t various things happen? Transaction costs. Well, Dean Ball asks, what happens when transaction costs dramatically shrink, because you have AI agents that can handle the transactions for us? As an example, what happens when your data becomes valuable and you can realize that value?

I definitely count these questions as part of ‘AI policy’ to the extent you want to impose new policies and rules upon all this, or work to refine and remove old ones. And we definitely think about the best way to do that. The main reason us ‘AI policy’ folks don’t talk much about it is that these are the kinds of problems that don’t kill us, and that we are good at fixing once we encounter them, and thus I see it as missing the more important questions. We can work out these implementation details and rights assignments as we go, and provided we are still in control of the overall picture and we don’t get gradual disempowerment issues as a result it’ll be fine.

I worry about questions like ‘what happens if we give autonomous goal-directed AI agents wallets and let them loose on the internet.’ Autonomous commerce is fascinating but the primary concern has to be loss of control and human disempowerment, gradual or otherwise. Thus I focus much less on the also important questions like market design and rights assignments within such markets if we manage to pull them off.

It is good to think about such policy proposals, such as Yo Shavit suggesting use of ‘agent IDs’ and ‘agent insurance’ or this paper from Seth Lazar about using user-controlled agents to safeguard individual autonomy via public access to compute, open interoperability and safety standards and market regulation that prevents foreclosure of competition. But not only do they not solve the most important problems, they risk making those problems worse if we mandate AI agent competition in ways that effectively force all to hand over their agency to the ‘most efficient’ available AIs to stay in the game.

Quiet Speculations
There is lots of reasonable disagreement about the impact of AI, but one thing I am confident on is that it is not properly priced in. A common mistake is to see someone, here Dwarkesh Patel, disagreeing with the most aggressive predictions, and interpreting that as a statement that ‘AI hype is overblown’ rather than ‘actually AI is way bigger than the market or most people think, I simply don’t think it will happen as quickly as transforming the entire world in the next few years.’

Also, it’s rather insane the standards people hold AI to before saying ‘hype’? Apologies for picking on this particular example. Can you imagine, in any other realm, saying ‘the last wave of progress we had was more than three months ago’ and therefore it is all hype? I mean seriously, what? It also is not true. Since then we have gotten Opus 4 and o3-pro, and GPT-4.5 was disappointing and certainly no GPT-5 but it wasn’t bad. And if the worst does happen exactly as described here and the market does only 3x every year, I mean, think about what that actually means?

Also, it is highly amusing that Ilya leaving to create SSI, which was founded on the thesis of directly creating superintelligence before their first product, is being cited as a reason to believe long timelines. Sorry, what?

Tyler Cowen suggests measuring AI progress by consumption basket, as in what people actually do with LLMs in everyday life, in addition to measuring their ability to do hard problems. Everyday use as a measurement is meaningful but it is backwards looking, and largely a measure of diffusion and consumer preferences. Willingness to pay per query on a consumer level is especially weird because it is largely based on alternatives and what people are used to. I don’t expect this to track the things we care about well.

I agree that practical progress has been very high. Current systems are a lot more valuable and useful in practice than they were even a year ago. I disagree that there is little room for future progress even if we confine ourselves to the narrow question of individual practical user queries of the types currently asked.

I do not think that even on current queries, LLM answers are anywhere close to optimal, including in terms of taking into account context and customizing to a given user and their situation. Also, we ask those questions, in those ways, exactly because those are the answers LLMs are currently capable of giving us. Alexa is terrible, but it gives the correct answer on almost all of my queries because I have learned to mostly ask it questions it can handle, and this is no different.

There’s also the diffusion and learning curve questions. If we are measuring usage, then ‘AI progress’ occurs as people learn to use AI well, and get into habits of using it and use it more often, and adapt to take better advantage. That process has barely begun.

So by these types of measures, we will definitely see a lot of progress, even within the role of AI as a ‘mere tool’ which has the job of providing correct answers to a fixed set of known essentially solved problems. On top of that, if nothing else, we will see greatly improved workflows and especially use of agents and agency on a local practical level.

Genesis
Eric Schmidt, Henry Kissinger and Craig Mundie wrote an AGI-in-18-months-pilled book called Genesis: Artificial Intelligence, Hope and the Human Spirit. Ate-a-Pi calls it ‘stunning’ and ‘the best predictive book I’ve seen for the next five years.’ That certainly sounds self-recommending on many levels, if only to see where their minds went. I appreciate the willingness to consider a broad range of distinct scenarios, sometimes potentially overlapping, sometimes contradictory. Not all of it will map to a plausible future reality, but that’s universal.

Going only from these notes, this seems like an attempt to ‘feel the AGI’ and take it seriously on some level, but largely not an attempt to feel the ASI and take it seriously or to properly think about which concepts actually make physical sense, or take the full existential risks seriously? If we do get full AGI within 18 months, I would expect ASI shortly thereafter. As in, there are then passages summarized as ‘we will talk to animals, but be fearful lest the AI categorize us as animals’ and ‘the merge.’

The Quest for Sane Regulations
The EU AI Office published the GPAI Code of Practice, in three parts: Transparency, Copyright and Safety and Security. I presume we should expect it to be endorsed. I have not analyzed the documents.

This is not what wanting to win the AI race or build out our energy production looks like: New Trump executive order adds even more uncertainty and risk to clean energy projects, on top of the barriers in the BBB. If you are telling me we must sacrifice all on the mantle of ‘win the AI race,’ and we have to transfer data centers to the UAE because we can’t provide the electricity, and then you sabotage our ability to provide electricity, how should one view that?

The Wall Street Journal’s Amrith Ramkumar gives their account of how the insane AI moratorium bill failed, including noting that it went beyond what many in industry even wanted. It is consistent with my write-up from last week. It is scary the extent to which this only failed because it was overreach on top of overreach, whereas it should have been stopped long before it got to this point. This should never have been introduced in the first place. It was a huge mistake to count on the Byrd rule, and all of this should serve as a large wake-up call that we are being played.

Anthropic proposes a simple and flexible transparency framework. Sometimes I think we can best model Anthropic as a first-rate second-rate company. They’re constantly afraid of being seen as too responsible or helpful. This still puts them well ahead of the competition.

The mission here is the very definition of ‘the least you can do’: What is the maximally useful set of requirements that imposes no substantial downsides whatsoever? Thus they limit application to the largest model developers, with revenue of $100 million or capital expenditures on the order of $1 billion. I agree. I would focus on capital expenditures, because you can have zero revenue and still be going big (see SSI) or you can have giant revenue and not be doing anything relevant.

The core ideas are simple: Create a secure development framework (they abbreviate it SDFs, oh no yet another different acronym but that’s how it goes I guess). Make it public. Publish a system card. Protect whistleblowers. Have transparency standards.

What does this framework have to include? Again, the bare minimum: Then companies have to disclose which such framework they are using, and issue the system card ‘at time of deployment,’ including describing any mitigations with protection for trade secrets. I notice that OpenAI and Google are playing games with what counts as time of deployment (or a new model) so we need to address that.

Enforcement would be civil penalties sought by the attorney general for material violations or false or materially misleading statements, with a 30-day right to cure. That period seems highly generous as a baseline, although fine in most practical situations.

So this is a classic Anthropic proposal, a good implementation of the fully free actions. Which is highly valuable. It would not be remotely enough, but I would be happy to at least get that far, given the status quo.

Chip City
Bloomberg Technology reports that a Chinese company is looking to build a data center in the desert to be powered by 115,000 Nvidia chips. The catch is that they don’t describe how they would acquire these chips, given that it is very much illegal (by our laws, not theirs) to acquire those chips.

Choosing The Right Regulatory Target
Dean Ball (who wrote this before joining the US Office of Science and Technology Policy) and Ketan Ramakrishnan argue for entity-based regulation of frontier AI developers rather than regulating particular AI models or targeting AI use cases.

As I’ve explained many times, targeting AI use cases flat out does not work. The important dangers down the road lie at the model level. Once you create highly capable models and diffuse access to them, yelling ‘you’re not allowed to do the things we don’t want you to do’ is not going to get it done, and will largely serve only to prevent us from enjoying AI’s benefits. This post argues that use-based regulation can be overly burdensome, which is true, but the more fundamental objection is that it simply will not get the central jobs done.

The paper offers good versions of many of the fundamental arguments for why use-based regulation won’t work, pointing out that things like deception and misalignment don’t line up with use cases, and that risks manifest during model training. Anticipating the dangerous use cases will also be impractical. And of course, that use-based regulation ends up being more burdensome rather than less, with the example being the potential disaster that was Texas’ proposed HB 1709.

This paper argues, as Dean has been advocating for a while, that the model layer is a ‘decidedly suboptimal regulatory target,’ because the models are ‘scaffolded’ and otherwise integrated into other software, so one cannot isolate model capabilities, and using development criteria like training compute can quickly become out of date. Dean instead suggests targeting particular large AI developers.

This is indeed asking the right questions and tackling the right problem. We agree the danger lies in the future more capable frontier models, that one of the key goals right now is to put us in a position to understand the situation so we can act when needed but also we need to impose some direct other requirements, and the question is what is the right way to go about that. I especially appreciated the note that dangerous properties will typically arise while a model is being trained.

The post raises good objections to and difficulties of targeting models. I think you can overcome them, and that one can reasonably be asked to anticipate what a given model can allow via scaffolding compared to other models, and also that scaffolding can always be added by third parties anyway so you don’t have better options. In terms of targeting training compute or other inputs, I agree it is imperfect and will need to be adjusted over time but I think it is basically fine in terms of avoiding expensive classification errors.

The first core argument is that training compute is an insufficiently accurate proxy of model capabilities, in particular because o1 and similar reasoning models sidestep training compute thresholds, because you can combine different models via scaffolding, and we can anticipate other RL techniques that lower pretraining compute requirements, and that there are many nitpicks one can make about exactly which compute should count and different jurisdictions might rule on that differently.

They warn that requirements might sweep more and more developers and models up over time. I don’t think this is obvious, it comes down to the extent to which risk is about relative capability versus absolute capability and various questions like offense-defense balance, how to think about loss of control risks in context and what the baselines look like and so on.
There are potential future worlds where we will need to expand requirements to more labs and models, and worlds where we don’t, and regardless of how we target the key is to choose correctly here based on the situation.

Ideally, a good proxy would include both input thresholds and also anticipated capability thresholds. As in, if you train via [X] it automatically counts, with [X] updated over time, and also if you reasonably anticipate it will have properties [Y] or observe properties [Y] then that also counts no matter what. Alas, various hysterical objections plus the need for simplicity and difficulty in picking the right [Y] have ruled out such a second trigger. The obvious [Y] is something that substantively pushes the general capabilities frontier, or the frontier in particular sensitive capability areas.

They raise the caveat that this style of trigger would encourage companies to not investigate or check for or disclose [Y] (or, I would add, to not record related actions or considerations), a pervasive danger across domains in similar situations. I think you can mostly deal with this by not letting them out of this and requiring third party audits. It’s a risk, and a good reason to not rely only on this style of trigger, but I don’t see a way around it.

I also don’t understand how one gets away from the proxy requirement by targeting companies. Either you have a proxy that determines which models count, or you have a proxy that determines which labs or companies count, which may or may not be or include ‘at least one of your individual models counts via a proxy.’ They suggest instead a form of aggregate investment as the threshold, which opens up other potential problem cases. All the arguments about companies posing dangers don’t seem to me to usefully differentiate between targeting models versus companies.

I also think you can mostly get around the issue of combining different models, because mostly what is happening there is either some of the models are highly specialized or some of them are cheaper and faster versions that are taking on less complex task aspects, or some combination thereof, and it should still be clear which model or models are making the important marginal capability differences. And I agree that of course the risk of any given use depends largely on the use and associated implementation details, but I don’t see a problem there.

A second argument, that is very strong, is that we have things we need frontier AI labs to do that are not directly tied to a particular model, such as guarding their algorithmic secrets. Those requirements will indeed need to attach to the company layer.

Their suggested illustrative language is to cover developers spending more than [$X] in a calendar year on AI R&D, or on compute, or it could be disjunctive. They suggest this will ‘obviously avoid covering smaller companies’ and other advantages, but I again don’t see much daylight versus sensible model-level rules, especially when they then also trigger (as they will need to) some company-wide requirements. And indeed, they suggest a model-level trigger that then impacts at the company level, which seems totally fine. If anything, the worry would be that this imposes unnecessary requirements on non-frontier other work at the same company.

They note that even if entity-based regulation proves insufficient due to proliferation issues rendering it underinclusive, it will still be necessary. Fair enough.
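To make the comparison concrete, here is a toy sketch of the disjunctive trigger logic discussed above: an input threshold and a capability trigger at the model level, plus a spend threshold at the company level. Every number and field name is a placeholder for [X], [Y] and [$X], not anyone’s actual proposal.

```python
# Toy sketch of disjunctive coverage triggers: a model is covered if it crosses
# a compute threshold OR has anticipated/observed frontier capabilities; a
# developer is covered if any of its models are covered OR its annual AI spend
# crosses a threshold. All thresholds are placeholders.
from dataclasses import dataclass, field

@dataclass
class Model:
    training_flop: float
    frontier_capabilities: bool  # stands in for anticipated or observed properties [Y]

@dataclass
class Developer:
    annual_ai_spend_usd: float
    models: list[Model] = field(default_factory=list)

COMPUTE_THRESHOLD_FLOP = 1e26        # placeholder for [X], updated over time
SPEND_THRESHOLD_USD = 1_000_000_000  # placeholder for [$X]

def model_covered(m: Model) -> bool:
    return m.training_flop >= COMPUTE_THRESHOLD_FLOP or m.frontier_capabilities

def developer_covered(d: Developer) -> bool:
    return (d.annual_ai_spend_usd >= SPEND_THRESHOLD_USD
            or any(model_covered(m) for m in d.models))
```

Written out this way, the model-level and company-level versions differ only in which predicate you attach the requirements to, which is part of why I see little practical daylight between them.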
They then have a brief discussion of the substance of the regulations, noting that they are not taking a position here. I found a lot of the substance here to be strong, with my main objection being that it seems to protest the differences between model and company level rules far too much.

The post illustrated that (provided the defenses against corporate-level shenanigans are robust) there is in my mind little practical difference between company-based and model-based regulatory systems other than that the company-based systems would then attach to other models at the same company. The problems with both are things we can overcome, and mostly the same problems apply to both. In the end, which one to focus on is a quibble. I am totally fine, indeed perfectly happy, doing things primarily at the corporate level if this makes things easier.

Compare this to both use-based regulation, and to the current preference of many in government to not regulate AI at all, and even to focus on stopping others from doing so.

The Week in Audio
Ryan Greenblatt goes on 80,000 Hours to talk about AI scenarios, including AI takeover scenarios. Feedback looks excellent, including from several sources typically skeptical of such arguments, but who see Ryan as providing good explanations or intuitions of why AI takeover risks are plausible.

A video presentation of AI 2027, if you already know AI 2027 you won’t need it, but reports are it is very good for those unfamiliar.

Rhetorical Innovation
Yes, this does seem to describe the actual plan, there is no other plan despite this being the worst plan and also essentially saying that you have no plan?

Everyone’s favorite curious caveman says the obvious (link goes to longer clip):

Here is our AI Czar acting as he usually does, this time with a Community Note. I rated the note as helpful.

One can ask ‘why should we care if there continue to be humans?’ or ‘why should we care if all humans die?’ It is good to remember that this is at least as bad as asking ‘why should we care if there continue to be human members of [category X]?’ or ‘why should we care if all [X]s die?’ for any and all values of [X]. Killing everyone would also kill all the [X]s. Universalizing genocide into xenocide does not make it better.

Aligning a Smarter Than Human Intelligence is Difficult
How you think about this statement depends on one’s theory of the case. For current AI models, I do think it is centrally correct, although I would change ‘might not be sufficient’ to ‘is not sufficient,’ I see no reason for copium here. Various replies nitpicked ‘people’ versus AI, or ‘train’ versus ‘raise.’

This from Emmett seems obviously wrong to me? Very obviously lots of forms of training (that are distinct from ‘raising’) make people more virtuous and wiser, to the extent that this distinction is meaningful in context. Training ‘to be good’ in a general sense is neither strictly necessary, nor is it sufficient. But it makes your life a hell of a lot better across the board at anything like current capability levels, and you should very obviously do it, and it seems like at most one lab is doing plausibly enough of this.

We very much would like to have strong ‘precursor evals,’ as in tests that predict where capabilities will be in the future, with future test results as a proxy. This seems especially important for misaligned behaviors like scheming. Unfortunately, this has not gone well so far, and seems very difficult. This essentially didn’t work and in hard cases it was even less effective. Apollo failing does not doom other efforts, and certainly we can read various signs, but we should be broadly skeptical that it will be useful to rely on such tests.

Here is a remarkably reasonable idea for a type of eval past a critical capability level.

Daniel Kokotajlo notes that recent results on reward hacking update him towards models de facto craving reward, quite possibly leading to a Control World where we have models that would totally murder you if they thought this action would get reinforced.

Don’t Worry We Have Human Oversight
Is that oversight meaningful? In many cases meaningful human oversight is impractical.

Can current models scheme? They would need both stealth (as in the ability to overcome oversight mechanisms) and situational awareness, reports Victoria Krakovna as she tests 2.5 Flash & Pro, GPT-4o, o1 and Claude 3.7 Sonnet. For now, the safety case of ‘inability’ holds, the models aren’t up to the task. The safety case you cannot make is unwillingness to scheme. If the models be capable of scheming, then the models be scheming.

David Manheim points out that as per the paper discussed above, the oversight mechanisms AIs need to overcome are mostly not even there in a meaningful way and will often not be possible or practical.

Don’t Worry We Have Chain Of Thought Monitoring
This seems like a good paper, but I did not treat it as the good news they seemed to? That all makes sense and matches my previous understanding, also I note that there still seems to be some amount of unfaithfulness going on here?

If you don’t need the CoT to figure out the answer, because the task is well below what the full model is capable of doing or the task does not benefit from CoT, then the CoT won’t correlate all that well with the answer, also it isn’t causing the answer. If you need the CoT to figure out (cause) the answer, then the CoT will contain the work done to figure out the answer. By far the easiest and most effective way to do this in such cases, at the moment, results in that CoT being faithful and legible. So by default, that will be what happens. That does not tell you about what would happen if the model was given cause, on whatever level, to evade monitoring.

I don’t know what counts or should count as ‘a lot.’ I do know that it seems likely we will apply ‘a lot’ of fine tuning in the future, and that if we are relying on CoT monitoring that a lot of it will serve as this kind of training.

I also notice that, yes, the first strategy that it found was one that the authors did not expect, a different way to reason one way while being seen as reasoning the other way. Think about all the ways humans have adapted to defend against similar things, and how very high a price we pay to obscure large parts of our thinking, including from ourselves, to avoid detection. That is how this game works.

My expectation is that if you rely on CoT monitoring in ways that exert optimization pressure, it will quickly stop working.

Sycophancy Is Hard To Fix
I see there being two closely related but distinct threat models here.

- Sycophancy arises because humans are imperfect graders and respond well to sycophancy at least locally, so if you train on human feedback you get sycophancy. Also sycophancy is ever-present in real life and thus all over the training data.
- Sycophancy is good for business, so AI companies often are fine with it, or even actively turn that dial looking back at the audience for approval like contestants on The Price is Right.
The first problem is not so easy to technically fix, either with a system prompt or otherwise. Even if you decide sycophancy is bad and you don’t want it, to fully get rid of it you’d have to change everything about how the training works.

This is also one of the areas where I have run the experiment. My entire Claude system prompt is an extended version of ‘do not be sycophantic.’ It… helps. It is definitely not 100% effective.

The Lighter Side
Important directionally correct statement:

We don’t know what Patrick was responding to here, but yeah: