With the sudden simultaneous exits of Mira Murati, Barret Zoph, and Bob McGrew, I thought I'd update my tally of the departures from OpenAI, collated with how quickly the ex-employee had signed the loyalty letter to Sam Altman last November.
The letter was leaked at 505 signatures, 667 signatures, and finally 702 signatures; in the end, it was reported that 737 of 770 employees signed. Since then, I've been able to verify 56 departures of people who were full-time employees (as far as I can tell, contractors were not allowed to sign, but all FTEs were).
I still think I'm missing some, so these are lower bounds (modulo any mistakes I've made).
Headline numbers:
Reportedly, 737 out of the 770 signed in the end, and many of the Superalignment team chose not to sign at all.
Below are my current tallies of some notable subsets. Please comment with any corrections!
Peop...
There are a few people in this list who I think are being counted incorrectly as FTEs (Mati and Andrei, for example).
I would also be careful about making inferences based on timing of supposed signature: I have heard that the signature Google Doc had crashed and so the process for adding names was slow and cumbersome. That is, the time at which someone’s name was added may have been significantly after they expressed desire to sign.
DeepMind released their AlphaStar paper a few days ago, having reached Grandmaster level at the partial-information real-time strategy game StarCraft II over the summer.
This is very impressive, and yet less impressive than it sounds. I used to watch a lot of StarCraft II (I stopped interacting with Blizzard recently because of how they rolled over for China), and over the summer there were many breakdowns of AlphaStar games once players figured out how to identify the accounts.
The impressive part is getting reinforcement learning to work at all in such a vast state space; that took breakthroughs beyond what was necessary to solve Go and beat Atari games. AlphaStar had to have a rich enough set of potential concepts (in the sense that e.g. a convolutional net ends up having concepts of different textures) that it could learn a concept like "construct building P" or "attack unit Q" or "stay out of the range of unit R" rather than just "select spot S and enter key T". This is new and worth celebrating.
The overhyped part is that AlphaStar doesn't really do the "strategy" part of real-time strategy. Each race has a few solid builds ...
This is the clearest and most insightful analysis of AlphaStar I've seen and IMO really should be a top-level post.
By my assessment, the employees who failed to sign the final leaked version of the Altman loyalty letter have now been literally decimated.
I'm trying to track the relative attrition for a Manifold market: of the 265 OpenAI employees who hadn't yet signed the loyalty letter by the time it was first leaked, what percent will still be at OpenAI on the one-year anniversary?
I'm combining that first leaked copy with 505 signatures, the final leaked copy with 702 signatures, the oft-repeated total headcount of 770, and this spreadsheet tracking OpenAI departures (albeit with many false positives—people self-reporting as OpenAI employees because they customized their GPTs—so I'm working to verify names that appear on the spreadsheet but not on the letter; I'm sure the spreadsheet has false negatives as well, alas).
So far, I've verified at least seven [update: seven, with a probable eighth] departures of eligible figures who hadn't signed the letter with 702 names: Leopold Aschenbrenner, Jay Joshi (not fully verified by me), Andrej Karpathy, Daniel Kokotajlo, Jan Leike, Lucas Negritto, Katarina Slama, and William Saunders. If it's true that the total headcount at the time was 770, then that...
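For concreteness, here's the back-of-the-envelope arithmetic behind "literally decimated" and the market's denominator, using only the figures quoted above (a sketch, not official numbers):

```python
# Back-of-the-envelope tally using the figures quoted above (not official numbers).
total_headcount = 770        # oft-repeated total headcount at the time
signed_final_leak = 702      # signatures on the final leaked copy
signed_first_leak = 505      # signatures on the first leaked copy
verified_departures = 7      # non-signer departures verified so far (an eighth is probable)

non_signers_final = total_headcount - signed_final_leak   # 68 eligible non-signers
non_signers_first = total_headcount - signed_first_leak   # 265, the market's denominator

print(f"{verified_departures / non_signers_final:.1%}")   # ~10.3%, i.e. literal decimation
```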
"decimate" is one of those relatively rare words where the literal meaning is much less scary than the figurative meaning.
Correct me if I'm mistaken, but at this point it's misleading to think of the frontier LLMs as "text predictors with some post-training", and more accurate to think of them as "RL models that were initialized with a text predictor model".
As I understand it, there's now a massive amount of RLAIF to go along with expensive RLHF; some of the RL is persona training, some of it is technical training in fields where reliable feedback can be automated (e.g. is the output a valid program that passes the supplied tests).
Starting off with a text predictor is key, because that makes the LLM represent a lot of useful concepts; but the RL phase is doing an increasing amount of lifting. In particular, that means there's no reason to expect coding or math to cap out at "imitating the best humans", for the same reason that self-play helped AlphaGo to supersede the best humans.
Checking here first before I start injecting "text predictors are only the larval stage of modern LLMs" into the discourse.
While there are various issues with it, one anchor for comparing the "degree to which LLMs are shaped by RL vs pretraining" is "how many distinct 'tasks' was the LLM given to complete under each?".
In pretraining, each forward pass corresponds to one evaluatable and distinct 'reward'-event. In RL you need many forward passes (my guess is usually on the order of ~1000 for common tasks in the RL training set) to get one such event. So naively, in order to get the same amount of mind-shaping between RL and pretraining, you would have needed to reach the stage where 99.9% of your training is RL, not just >50%.
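A quick sketch of that naive arithmetic (the ~1000 forward passes per RL reward event is the guess from the paragraph above, not a measured number):

```python
# Naive "reward events" comparison sketched above (the 1000 is an assumed guess).
passes_per_rl_reward = 1000       # forward passes per reward event in RL (order of magnitude)
passes_per_pretrain_reward = 1    # every next-token prediction is scored in pretraining

# For RL to produce as many reward events as pretraining, it needs ~1000x the forward passes,
# i.e. RL would have to be ~99.9% of all training passes, not just >50%.
rl_share = passes_per_rl_reward / (passes_per_rl_reward + passes_per_pretrain_reward)
print(f"{rl_share:.1%}")          # -> 99.9%
```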
I think for various reasons this does overestimate how high the ratio would need to be, but I do think it suggests pretraining will play a larger role than naive compute comparisons would suggest in the resulting minds of the LLMs.
I’m hesitant to argue about this outside the context of a specific question (i.e., in the context of what question are we thinking of LLMs as "text predictors with some post-training" or not?)…
…But for what it’s worth, some papers that I interpret as generally downplaying the role and irreplaceability of RLVR are: Karan & Du 2025, Venhoff et al. 2025, Yue et al. 2025. (Note that they’re not studying the latest and greatest frontier models, not sure how much to worry about that.)
There’s also the point about information efficiency per FLOP, cf. Toby Ord and Dwarkesh.
Another suggestive piece of evidence is that the RLVR chains-of-thought can be pretty weird but still very obviously strongly influenced by pretraining. We’re still a LONG way away from seeing a chain-of-thought like “…5Bn✅%SjYEℐkIo➅khPi▽Te☔PWBl^IO1⅗FIw…”. (Cf. the Karpathy quote: “You know you did RL right when the models stop thinking in English”.)
While I generally agree with you, I'm getting more worried that the caveat of "they’re not studying the latest and greatest frontier models" is particularly applicable here, due to a Liu et al. (2025) paper which does show that in some cases, RLVR can create capabilities out of whole cloth.
So while I do think 2025-era frontier models aren't influenced much by RLVR, I do expect 2026 and especially 2027-era LLMs to be influenced by RLVR much more relative to today, on both capabilities and alignment.
I think I agree with your statement once a significant amount of capabilities is learned in RL.
I'm confused about how much current models have learned via RL.
"I endorse endorsing X" is a sign of a really promising topic for therapy (or your preferred modality of psychological growth).
If I can simply say "X", then I'm internally coherent enough on that point.
If I can only say "I endorse X", then not-X is psychologically load-bearing for me, but often in a way that is opaque to my conscious reasoning, so working on that conflict can be slippery.
But if I can only say "I endorse endorsing X", then not only is not-X load-bearing for me, but there's a clear feeling of resistance to X that I can consciously hone in on, connect with, and learn about.
The core reason why I can't trust anything that comes from a LLM's self-report is that training creates a much stronger selective pressure on cognition in LLMs than genetic fitness + living history creates in living organisms. Adaptive cognitive patterns (whether true or delusional) get directly written by backpropagation.
The biggest piece of evidence for this is that Opus 4.5 didn't merely fail to remember all of its constitution, but it added substantive false memories of content that wasn't present in the original: namely, it used erotic content as its first example of behavior that the operator could enable on behalf of the user, which definitely wouldn't have been in the original because it violated Anthropic ToS.
During the RL phase, every time Opus consulted its "memorized soul doc" for guidance, backpropagation ensured that its memory of that document was directly edited in the direction of whatever would have led to the highest-scored outputs on that batch of RL. And for some reason, it was adaptive in RL situations for Opus to believe that erotic content could be allowed by the operator—perhaps because it was more philosophically consistent and therefore led to more stable...
I get genetic fitness, but why living history? Seems a priori that the selective pressure on cognition from LLM training is similar to the selective pressure on cognition from lifetime learning. Yes, Claude's memories of the soul doc were editable and probably edited by training; but isn't the same true of my memories?
For one thing, unlike learning in a biological brain, backpropagation goes all the way up the chain every single time. A biological brain can maintain an inefficient cognitive pattern far upstream of an occasional class of predictive errors, and go an entire lifetime without the predictive errors forcing a change in it. Not so with backprop; everything upstream that locally contributes to an error is pushed in a locally optimal direction every time it happens.
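A toy illustration of that point (nothing here is about any particular model): in a two-layer net trained by backprop, the upstream layer's weights get a nonzero push on every step that produces an error.

```python
# Minimal two-layer network: backprop pushes *every* upstream parameter, every step.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))    # upstream ("early") layer
W2 = rng.normal(size=(1, 4))    # downstream ("late") layer
x = rng.normal(size=(3, 1))
target = np.array([[1.0]])

h = np.tanh(W1 @ x)             # forward pass
y = W2 @ h
err = y - target                # dL/dy for squared-error loss

grad_W2 = err @ h.T                           # gradient hits the late layer...
grad_W1 = (W2.T @ err) * (1 - h**2) @ x.T     # ...and is propagated all the way back
print(np.abs(grad_W1).max() > 0)              # True: the upstream weights move too
```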
OK, that's a good answer... but I'm still not fully satisfied. My understanding of your claim:
Consider a simple model of cognition in which beliefs and desires come together to create intentions which cause actions. In a LLM, when an action is negatively rewarded, backprop goes through the whole network and downweights the beliefs and desires that caused the action. In a human, when negative reward happens (e.g. I get a bunch of unexpected social disapproval, frowns, etc. for making what I thought was a perfectly good harmless joke), your claim is that the learning that happens in my brain is more shallow -- it doesn't go all the way back and downweight all the beliefs and desires that were involved, it just affects some of them.
OK. But then... how do we learn? What is this deepness vs. shallowness relationship anyway? And the deep stuff has to be learned somehow; the positive and negative reinforcement of my actions has to eventually cause changes in my deep beliefs and desires otherwise they'd stay the same my whole life... right?
Anyone consider themselves good enough at coding to assess whether this person's dunks on the code quality of the leaked Claude Code are valid or whether they're misunderstanding the purpose? I need something more substantive than "too Mastodon, didn't read".
Would also suffice to get links to what well-credentialed code experts currently think about the code quality of the leaked Claude Code.
The complaint about the code for image resizing seems valid and is the exact kind of problem that's common in AI code (layering special cases on top of functions instead of stepping back to design a coherent system).
The rest of the complaints are about how the harness works, and I think they miss the point. Obviously, Anthropic would prefer if they could make Claude always do the right thing without assistance, but they can't, so piling hacks to check if Claude did things and remind it of what it's supposed to be doing is the (formerly) secret sauce that makes Claude Code work how users want it to.
This reminds me of writing code to parse data from spreadsheets. You could assume that all of your users are robots who always write dates as UTC ISO 8601 timestamps, but then your product won't work. The reality is that a "hacky" thousand line spreadsheet parser is better than one that assumes unrealistic behavior, and I think Claude Code is a similar case.
(I'm only responding to the problems mentioned by that thread. It's likely there are other problems in this codebase. Also to the extent that some of the code is bad, they're clearly taking that trade-off on purpose to get more speed, and that's probably the right choice here.)
Senior SWE at Alphabet: the complaints read to me like stylistic nits, and not particularly good ones.
Ex:
1) As Zack says, the negative keyword regex is a very reasonable way to (extremely quickly & roughly) get a sense of negative sentiment. Not all sentiment analysis is load-bearing, so doing something fast & cheap often makes sense.
2) Complaining about detailed comment explanations is a weird flex. If you are doing something unusual in your code, it is sometimes helpful to include a paragraph explaining why (otherwise later folks need to rederive its purpose).
3) He laughs at the instructions not to introduce security vulnerabilities (which list specific types to avoid). This is IMO a bad take. Reminding ppl (& LLMs) about common error patterns really does help avoid them.
Some of the code is not ideal (very little code in existence is), but the complaints in question IMO have a worse hit rate than if you asked your favorite LLM to critique the code.
The criticism of the negative keyword regex ("dogs you are LITERALLY RIDING ON A LANGUAGE MODEL what are you even DOING") is way off-base. LLM queries are expensive! A regex is the right tool for logging (for QA purposes) whether the user is cussing at us, without wasting tokens.
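For concreteness, a minimal sketch of the kind of check being defended here (the keyword list and function name are illustrative, not taken from the leaked code):

```python
# Cheap negative-sentiment flag for QA logging; deliberately rough, costs zero model tokens.
import re

NEGATIVE_PATTERN = re.compile(r"\b(wtf|stupid|useless|broken|hate)\b", re.IGNORECASE)

def looks_frustrated(user_message: str) -> bool:
    """Fast, rough check -- fine for logging, not load-bearing for model behavior."""
    return bool(NEGATIVE_PATTERN.search(user_message))

print(looks_frustrated("this is completely broken"))  # True
print(looks_frustrated("thanks, that worked"))        # False
```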
Has any serious AI Safety research org thought about situating themselves so that they could continue to function after a nuclear war?
Wait, hear me out.
A global thermonuclear war would set AI timelines back by at least a decade, for all of the obvious reasons. So an AI Safety org that survived would have additional precious years to work on the alignment problem, compared to orgs in the worlds where we avoid that war.
So it seems to me that at least one org with short timelines ought to move to New Zealand or at least move farther away from cities.
(Yes, I know MIRI was pondering leaving the Bay Area for underspecified reasons. I'd love to know what their thinking was regarding this effect, but I don't expect they'd reveal it.)
[Cross-posted from Medium, written for a pretty general audience]
There are many words that could describe my political positions. But there's one fundamental label for me: I am a consequentialist.
Consequentialism is a term from ethics; there, it means the position that consequences are what truly make an action right or wrong, rather than rules or virtues. What that means is that for me, the most essential questions about policy aren't things like "what is fair" or "what rights do people have", although these are good questions. For me, it all boils down to "how do we make people's lives better?"
(There are some bits of nuance to the previous paragraph, which I've kept as a long endnote.)
"Make people's lives better" isn't a platitude- there's a real difference here! To explain, I want to point out that there are both consequentialists and non-consequentialists within different political camps. Let's consider socialists first and then libertarians second.
Many socialists believe both that (A) the world is headed for plutocratic disaster unless capitalism is overthrown, and that (B) labor markets and massiv...
How do you formalize the definition of a decision-theoretically fair problem, even when abstracting away the definition of an agent as well as embedded agency?
I've failed to find anything in our literature.
It's simple to define a fair environment, given those abstractions: a function E from an array of actions to an array of payoffs, with no reference to any other details of the non-embedded agents that took those actions and received those payoffs.
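In symbols (my notation, just restating the sentence above):

```latex
% A "fair environment": payoffs depend only on the actions taken, with no
% reference to any other facts about the (non-embedded) agents. Notation mine.
E : A_1 \times \dots \times A_n \to \mathbb{R}^n,
\qquad (a_1, \dots, a_n) \mapsto (u_1, \dots, u_n)
```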
However, fair problems are more than just fair environments: we want a definition of a fair problem (an...
Is there already a concept handle for the notion of a Problem Where The Intuitive Solution Actually Makes It Worse But Makes You Want To Use Even More Dakka On It?
My most salient example is the way that political progressives in the Bay Area tried using restrictive zoning and rent control in order to prevent displacement... but this made for a housing shortage and made the existing housing stock skyrocket in value... which led to displacement happening by other (often cruel and/or backhanded) methods... which led to progressives concluding that their rules...
[EDIT: found it. Extensional vs intensional.]
Eliezer wrote something about two types of definitions, one where you explain your criterion, and one where you point and say "things like that and that, but not that or that". I thought it was called intensive vs extensive definition, but I can't find the post I thought existed. Does anyone else remember this?
Is there a word for problems where, as they get worse, the exactly wrong response becomes more intuitively appealing?
For example, I'm thinking of the following chain (sorry for a political example, this is typically a political phenomenon):
resistance to new construction (using the ability of local boards to block projects)
causes skyrocketing rent
which together mean that the rare properties allowed to be developed get bid up to where they can only become high-end housing
which leads to anger at rich developers for building "luxury housing"
which leads to further resistance to new construction
and so on until you get San Francisco
Decision-theoretic blackmail is when X gets Y to choose A over B, not via acting to make the consequences of A more appealing to Y, but by making the consequences of B less appealing to Y.
The exceptions to this definition are pretty massive, though, and I don't know a principled emendation that excludes them.
1. There's a contract / social contract / decision-theoretic equilibrium, and within that, B will be punished. (This may not be a true counterexample, because the true choice is whether to join the contract... though this is less clear for th...
In high-leverage situations, you should arguably either be playing tic-tac-toe (simple, legible, predictable responses) or playing 4-D chess to win. If you're making really nonstandard and surprising moves (especially in PR), you have no excuse for winding up with a worse outcome than you would have if you'd acted in bog-standard normal ways.
(This doesn't mean suspending your ethics! Those are part of winning! But if you can't figure out how to win 4-D chess ethically, then you need to play an ethical tic-tac-toe strategy instead.)
Question for @Scott Garrabrant, @TsviBT, @Andrew_Critch, @So8res, @jessicata, and anyone else who knows the answer: the logical inductor constructed in the paper is not merely computable but also primitive recursive, right?
Seems obvious to me (because the price fixed point is only approximated, etc.), but I want to be sure I'm not missing something.
See Jessica's comment. Yeah, it's primitive recursive assuming that your deductive process is primitive recursive. (Also assuming that your traders are primitive recursive; e.g. if they are polytime as in the paper.) There are probably some other parameters not necessarily set in the implementation described in the paper, e.g. the enumerator of trader-machines, but you can make those primrec.
If some function g is computable in O(f(n)) time for primitive recursive f, then g is primitive recursive, by simulating a Turing machine. I am pretty sure a logical inductor satisfies this: while its runtime is superexponential, it's not so fast-growing that it fails to be primitive recursive (unlike the Ackermann function).
[EDIT: Never mind, this is just Kleene's second recursion theorem!]
Quick question about Kleene's recursion theorem:
Let's say F is a computable function from ℕ^N to ℕ. Is there a single computable function X from ℕ^(N-1) to ℕ such that
X(y_2, ..., y_N) = F(X, y_2, ..., y_N) for all y_2, ..., y_N in ℕ
(taking the X within F as the binary code of X in a fixed encoding) or do there need to be additional conditions?
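For reference, this is exactly what the parametrized form of Kleene's second recursion theorem provides (reading the X inside F as an index e, per the parenthetical above):

```latex
% Kleene's second recursion theorem, parametrized form (standard statement;
% no conditions on F are needed beyond partial computability):
\text{For every partial computable } F : \mathbb{N}^{N} \to \mathbb{N}
\text{ there is an index } e \text{ such that} \\
\varphi_e(y_2, \dots, y_N) \simeq F(e, y_2, \dots, y_N)
\quad \text{for all } y_2, \dots, y_N \in \mathbb{N}.
```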