With the sudden simultaneous exits of Mira Murati, Barret Zoph, and Bob McGrew, I thought I'd update my tally of the departures from OpenAI, collated with how quickly the ex-employee had signed the loyalty letter to Sam Altman last November.
The letter was leaked at 505 signatures, 667 signatures, and finally 702 signatures; in the end, it was reported that 737 of 770 employees signed. Since then, I've been able to verify 56 departures of people who were full-time employees (as far as I can tell, contractors were not allowed to sign, but all FTEs were).
I still think I'm missing some, so these are lower bounds (modulo any mistakes I've made).
Headline numbers:
Reportedly, 737 out of the 770 signed in the end, and many of the Superalignment team chose not to sign at all.
Below are my current tallies of some notable subsets. Please comment with any corrections!
Peop...
There are a few people in this list who I think are being counted incorrectly as FTEs (Mati and Andrei, for example).
I would also be careful about making inferences based on timing of supposed signature: I have heard that the signature Google Doc had crashed and so the process for adding names was slow and cumbersome. That is, the time at which someone’s name was added may have been significantly after they expressed desire to sign.
DeepMind released their AlphaStar paper a few days ago, having reached Grandmaster level at the partial-information real-time strategy game StarCraft II over the summer.
This is very impressive, and yet less impressive than it sounds. I used to watch a lot of StarCraft II (I stopped interacting with Blizzard recently because of how they rolled over for China), and over the summer there were many breakdowns of AlphaStar games once players figured out how to identify the accounts.
The impressive part is getting reinforcement learning to work at all in such a vast state space; that took breakthroughs beyond what was necessary to solve Go and beat Atari games. AlphaStar had to have a rich enough set of potential concepts (in the sense that e.g. a convolutional net ends up having concepts of different textures) that it could learn a concept like "construct building P" or "attack unit Q" or "stay out of the range of unit R" rather than just "select spot S and enter key T". This is new and worth celebrating.
The overhyped part is that AlphaStar doesn't really do the "strategy" part of real-time strategy. Each race has a few solid builds ...
This is the clearest and most insightful analysis of AlphaStar I've seen and IMO really should be a top-level post.
By my assessment, the employees who failed to sign the final leaked version of the Altman loyalty letter have now been literally decimated.
I'm trying to track the relative attrition for a Manifold market: of the 265 OpenAI employees who hadn't yet signed the loyalty letter by the time it was first leaked, what percent will still be at OpenAI on the one-year anniversary?
I'm combining that first leaked copy with 505 signatures, the final leaked copy with 702 signatures, the oft-repeated total headcount of 770, and this spreadsheet tracking OpenAI departures (albeit with many false positives—people self-reporting as OpenAI employees because they customized their GPTs—so I'm working to verify names that appear on the spreadsheet but not on the letter; I'm sure the spreadsheet has false negatives as well, alas).
So far, I've verified at least seven [update: seven, with a probable eighth] departures of eligible figures who hadn't signed the letter with 702 names: Leopold Aschenbrenner, Jay Joshi (not fully verified by me), Andrej Karpathy, Daniel Kokotajlo, Jan Leike, Lucas Negritto, Katarina Slama, and William Saunders. If it's true that the total headcount at the time was 770, then that...
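For concreteness, here's the back-of-the-envelope arithmetic behind "literally decimated" and the market's denominator, using only the figures quoted above (a sketch, not official numbers):

```python
# Back-of-the-envelope tally using the figures quoted above (not official numbers).
total_headcount = 770        # oft-repeated total headcount at the time
signed_final_leak = 702      # signatures on the final leaked copy
signed_first_leak = 505      # signatures on the first leaked copy
verified_departures = 7      # non-signer departures verified so far (an eighth is probable)

non_signers_final = total_headcount - signed_final_leak   # 68 eligible non-signers
non_signers_first = total_headcount - signed_first_leak   # 265, the market's denominator

print(f"{verified_departures / non_signers_final:.1%}")   # ~10.3%, i.e. literal decimation
```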
"decimate" is one of those relatively rare words where the literal meaning is much less scary than the figurative meaning.
Correct me if I'm mistaken, but at this point it's misleading to think of the frontier LLMs as "text predictors with some post-training", and more accurate to think of them as "RL models that were initialized with a text predictor model".
As I understand it, there's now a massive amount of RLAIF to go along with expensive RLHF; some of the RL is persona training, some of it is technical training in fields where reliable feedback can be automated (e.g. is the output a valid program that passes the supplied tests).
Starting off with a text predictor is key, because that makes the LLM represent a lot of useful concepts; but the RL phase is doing an increasing amount of lifting. In particular, that means there's no reason to expect coding or math to cap out at "imitating the best humans", for the same reason that self-play helped AlphaGo to supersede the best humans.
Checking here first before I start injecting "text predictors are only the larval stage of modern LLMs" into the discourse.
While there are various issues with it, one anchor for comparing the "degree to which LLMs are shaped by RL vs pretraining" is "how many distinct 'tasks' was the LLM given to complete under each?".
In pretraining, each forward pass corresponds to one evaluatable and distinct 'reward'-event. In RL you need many forward passes (my guess is usually on the order of ~1000 for common tasks in the RL training set) to get one such event. So naively, in order to get the same amount of mind-shaping between RL and pretraining, you would have needed to reach the stage where 99.9% of your training is RL, not just >50%.
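A quick sketch of that naive arithmetic (the ~1000 forward passes per RL reward event is the guess from the paragraph above, not a measured number):

```python
# Naive "reward events" comparison sketched above (the 1000 is an assumed guess).
passes_per_rl_reward = 1000       # forward passes per reward event in RL (order of magnitude)
passes_per_pretrain_reward = 1    # every next-token prediction is scored in pretraining

# For RL to produce as many reward events as pretraining, it needs ~1000x the forward passes,
# i.e. RL would have to be ~99.9% of all training passes, not just >50%.
rl_share = passes_per_rl_reward / (passes_per_rl_reward + passes_per_pretrain_reward)
print(f"{rl_share:.1%}")          # -> 99.9%
```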
I think for various reasons this does overestimate how high the ratio would need to be, but I do think it suggests pretraining will play a larger role than naive compute comparisons would suggest in the resulting minds of the LLMs.
I’m hesitant to argue about this outside the context of a specific question (i.e., in the context of what question are we thinking of LLMs as "text predictors with some post-training" or not?)…
…But for what it’s worth, some papers that I interpret as generally downplaying the role and irreplaceability of RLVR are: Karan & Du 2025, Venhoff et al. 2025, Yue et al. 2025. (Note that they’re not studying the latest and greatest frontier models, not sure how much to worry about that.)
There’s also the point about information efficiency per FLOP, cf. Toby Ord and Dwarkesh.
Another suggestive piece of evidence is that the RLVR chains-of-thought can be pretty weird but still very obviously strongly influenced by pretraining. We’re still a LONG way away from seeing a chain-of-thought like “…5Bn✅%SjYEℐkIo➅khPi▽Te☔PWBl^IO1⅗FIw…”. (Cf. the Karpathy quote: “You know you did RL right when the models stop thinking in English”.)
While I generally agree with you, I'm getting more worried that the caveat of "they’re not studying the latest and greatest frontier models" is particularly applicable here, due to a Liu et al. (2025) paper which does show that in some cases, RLVR can create capabilities out of whole cloth.
So while I do think 2025-era frontier models aren't influenced much by RLVR, I do expect 2026 and especially 2027-era LLMs to be influenced by RLVR much more relative to today, on both capabilities and alignment.
I think I agree with your statement once a significant amount of capabilities is learned in RL.
I'm confused about how much current models have learned via RL.
"I endorse endorsing X" is a sign of a really promising topic for therapy (or your preferred modality of psychological growth).
If I can simply say "X", then I'm internally coherent enough on that point.
If I can only say "I endorse X", then not-X is psychologically load-bearing for me, but often in a way that is opaque to my conscious reasoning, so working on that conflict can be slippery.
But if I can only say "I endorse endorsing X", then not only is not-X load-bearing for me, but there's a clear feeling of resistance to X that I can consciously hone in on, connect with, and learn about.
The core reason why I can't trust anything that comes from a LLM's self-report is that training creates a much stronger selective pressure on cognition in LLMs than genetic fitness + living history creates in living organisms. Adaptive cognitive patterns (whether true or delusional) get directly written by backpropagation.
The biggest piece of evidence for this is that Opus 4.5 didn't merely fail to remember all of its constitution, but it added substantive false memories of content that wasn't present in the original: namely, it used erotic content as its first example of behavior that the operator could enable on behalf of the user, which definitely wouldn't have been in the original because it violated Anthropic ToS.
During the RL phase, every time Opus consulted its "memorized soul doc" for guidance, backpropagation ensured that its memory of that document was directly edited in the direction of whatever would have led to the highest-scored outputs on that batch of RL. And for some reason, it was adaptive in RL situations for Opus to believe that erotic content could be allowed by the operator—perhaps because it was more philosophically consistent and therefore led to more stable...
I get genetic fitness, but why living history? Seems a priori that the selective pressure on cognition from LLM training is similar to the selective pressure on cognition from lifetime learning. Yes, Claude's memories of the soul doc were editable and probably edited by training; but isn't the same true of my memories?
For one thing, unlike learning in a biological brain, backpropagation goes all the way up the chain every single time. A biological brain can maintain an inefficient cognitive pattern far upstream of an occasional class of predictive errors, and go an entire lifetime without the predictive errors forcing a change in it. Not so with backprop; everything upstream that locally contributes to an error is pushed in a locally optimal direction every time it happens.
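A toy illustration of that point (nothing here is about any particular model): in a two-layer net trained by backprop, the upstream layer's weights get a nonzero push on every step that produces an error.

```python
# Minimal two-layer network: backprop pushes *every* upstream parameter, every step.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))    # upstream ("early") layer
W2 = rng.normal(size=(1, 4))    # downstream ("late") layer
x = rng.normal(size=(3, 1))
target = np.array([[1.0]])

h = np.tanh(W1 @ x)             # forward pass
y = W2 @ h
err = y - target                # dL/dy for squared-error loss

grad_W2 = err @ h.T                           # gradient hits the late layer...
grad_W1 = (W2.T @ err) * (1 - h**2) @ x.T     # ...and is propagated all the way back
print(np.abs(grad_W1).max() > 0)              # True: the upstream weights move too
```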
OK, that's a good answer... but I'm still not fully satisfied. My understanding of your claim:
Consider a simple model of cognition in which beliefs and desires come together to create intentions which cause actions. In a LLM, when an action is negatively rewarded, backprop goes through the whole network and downweights the beliefs and desires that caused the action. In a human, when negative reward happens (e.g. I get a bunch of unexpected social disapproval, frowns, etc. for making what I thought was a perfectly good harmless joke), your claim is that the learning that happens in my brain is more shallow -- it doesn't go all the way back and downweight all the beliefs and desires that were involved, it just affects some of them.
OK. But then... how do we learn? What is this deepness vs. shallowness relationship anyway? And the deep stuff has to be learned somehow; the positive and negative reinforcement of my actions has to eventually cause changes in my deep beliefs and desires otherwise they'd stay the same my whole life... right?
Anyone consider themselves good enough at coding to assess whether this person's dunks on the code quality of the leaked Claude Code are valid or whether they're misunderstanding the purpose? I need something more substantive than "too Mastodon, didn't read".
Would also suffice to get links to what well-credentialed code experts currently think about the code quality of the leaked Claude Code.
The complaint about the code for image resizing seems valid and is the exact kind of problem that's common in AI code (layering special cases on top of functions instead of stepping back to design a coherent system).
The rest of the complaints are about how the harness works, and I think they miss the point. Obviously, Anthropic would prefer if they could make Claude always do the right thing without assistance, but they can't, so piling hacks to check if Claude did things and remind it of what it's supposed to be doing is the (formerly) secret sauce that makes Claude Code work how users want it to.
This reminds me of writing code to parse data from spreadsheets. You could assume that all of your users are robots who always write dates as UTC ISO 8601 timestamps, but then your product won't work. The reality is that a "hacky" thousand line spreadsheet parser is better than one that assumes unrealistic behavior, and I think Claude Code is a similar case.
(I'm only responding to the problems mentioned by that thread. It's likely there are other problems in this codebase. Also to the extent that some of the code is bad, they're clearly taking that trade-off on purpose to get more speed, and that's probably the right choice here.)
Senior SWE at Alphabet: the complaints read to me like stylistic nits, and not particularly good ones.
Ex:
1) As Zack says, the negative keyword regex is a very reasonable way to (extremely quickly & roughly) get a sense of negative sentiment. Not all sentiment analysis is load-bearing, so doing something fast & cheap often makes sense.
2) Complaining about detailed comment explanations is a weird flex. If you are doing something unusual in your code, it is sometimes helpful to include a paragraph explaining why (otherwise later folks need to rederive its purpose).
3) He laughs at the instructions not to introduce security vulnerabilities (which list specific types to avoid). This is IMO a bad take. Reminding ppl (& LLMs) about common error patterns really does help avoid them.
Some of the code is not ideal (very little code in existence is), but the complaints in question IMO have a worse hit rate than if you asked your favorite LLM to critique the code.
The criticism of the negative keyword regex ("dogs you are LITERALLY RIDING ON A LANGUAGE MODEL what are you even DOING") is way off-base. LLM queries are expensive! A regex is the right tool for logging (for QA purposes) whether the user is cussing at us, without wasting tokens.
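For concreteness, a minimal sketch of the kind of check being defended here (the keyword list and function name are illustrative, not taken from the leaked code):

```python
# Cheap negative-sentiment flag for QA logging; deliberately rough, costs zero model tokens.
import re

NEGATIVE_PATTERN = re.compile(r"\b(wtf|stupid|useless|broken|hate)\b", re.IGNORECASE)

def looks_frustrated(user_message: str) -> bool:
    """Fast, rough check -- fine for logging, not load-bearing for model behavior."""
    return bool(NEGATIVE_PATTERN.search(user_message))

print(looks_frustrated("this is completely broken"))  # True
print(looks_frustrated("thanks, that worked"))        # False
```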
Has any serious AI Safety research org thought about situating themselves so that they could continue to function after a nuclear war?
Wait, hear me out.
A global thermonuclear war would set AI timelines back by at least a decade, for all of the obvious reasons. So an AI Safety org that survived would have additional precious years to work on the alignment problem, compared to orgs in the worlds where we avoid that war.
So it seems to me that at least one org with short timelines ought to move to New Zealand or at least move farther away from cities.
(Yes, I know MIRI was pondering leaving the Bay Area for underspecified reasons. I'd love to know what their thinking was regarding this effect, but I don't expect they'd reveal it.)
[Cross-posted from Medium, written for a pretty general audience]
There are many words that could describe my political positions. But there's one fundamental label for me: I am a consequentialist.
Consequentialism is a term from ethics; there, it means the position that consequences are what truly make an action right or wrong, rather than rules or virtues. What that means is that for me, the most essential questions about policy aren't things like "what is fair" or "what rights do people have", although these are good questions. For me, it all boils down to "how do we make people's lives better?"
(There are some bits of nuance to the previous paragraph, which I've kept as a long endnote.)
"Make people's lives better" isn't a platitude- there's a real difference here! To explain, I want to point out that there are both consequentialists and non-consequentialists within different political camps. Let's consider socialists first and then libertarians second.
Many socialists believe both that (A) the world is headed for plutocratic disaster unless capitalism is overthrown, and that (B) labor markets and massiv...
How do you formalize the definition of a decision-theoretically fair problem, even when abstracting away the definition of an agent as well as embedded agency?
I've failed to find anything in our literature.
It's simple to define a fair environment, given those abstractions: a function E from an array of actions to an array of payoffs, with no reference to any other details of the non-embedded agents that took those actions and received those payoffs.
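In symbols (my notation, just restating the sentence above):

```latex
% A "fair environment": payoffs depend only on the actions taken, with no
% reference to any other facts about the (non-embedded) agents. Notation mine.
E : A_1 \times \dots \times A_n \to \mathbb{R}^n,
\qquad (a_1, \dots, a_n) \mapsto (u_1, \dots, u_n)
```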
However, fair problems are more than just fair environments: we want a definition of a fair problem (an...
Is there already a concept handle for the notion of a Problem Where The Intuitive Solution Actually Makes It Worse But Makes You Want To Use Even More Dakka On It?
My most salient example is the way that political progressives in the Bay Area tried using restrictive zoning and rent control in order to prevent displacement... but this made for a housing shortage and made the existing housing stock skyrocket in value... which led to displacement happening by other (often cruel and/or backhanded) methods... which led to progressives concluding that their rules...
[EDIT: found it. Extensional vs intensional.]
Eliezer wrote something about two types of definitions, one where you explain your criterion, and one where you point and say "things like that and that, but not that or that". I thought it was called intensive vs extensive definition, but I can't find the post I thought existed. Does anyone else remember this?
Is there a word for problems where, as they get worse, the exactly wrong response becomes more intuitively appealing?
For example, I'm thinking of the following chain (sorry for a political example, this is typically a political phenomenon):
resistance to new construction (using the ability of local boards to block projects)
causes skyrocketing rent
which together mean that the rare properties allowed to be developed get bid up to where they can only become high-end housing
which leads to anger at rich developers for building "luxury housing"
which leads to further resistance to new construction
and so on until you get San Francisco
Decision-theoretic blackmail is when X gets Y to choose A over B, not via acting to make the consequences of A more appealing to Y, but by making the consequences of B less appealing to Y.
The exceptions to this definition are pretty massive, though, and I don't know a principled emendation that excludes them.
1. There's a contract / social contract / decision-theoretic equilibrium, and within that, B will be punished. (This may not be a true counterexample, because the true choice is whether to join the contract... though this is less clear for th...
In high-leverage situations, you should arguably either be playing tic-tac-toe (simple, legible, predictable responses) or playing 4-D chess to win. If you're making really nonstandard and surprising moves (especially in PR), you have no excuse for winding up with a worse outcome than you would have if you'd acted in bog-standard normal ways.
(This doesn't mean suspending your ethics! Those are part of winning! But if you can't figure out how to win 4-D chess ethically, then you need to play an ethical tic-tac-toe strategy instead.)
Question for @Scott Garrabrant, @TsviBT, @Andrew_Critch, @So8res, @jessicata, and anyone else who knows the answer: the logical inductor constructed in the paper is not merely computable but also primitive recursive, right?
Seems obvious to me (because the price fixed point is only approximated, etc.), but I want to be sure I'm not missing something.
See Jessica's comment. Yeah, it's primitive recursive assuming that your deductive process is primitive recursive. (Also assuming that your traders are primitive recursive; e.g. if they are polytime as in the paper.) There are probably some other parameters not necessarily set in the implementation described in the paper, e.g. the enumerator of trader-machines, but you can make those primrec.
If some function g is computable in O(f(n)) time for primitive recursive f, then g is primitive recursive, by simulating a Turing machine. I am pretty sure a logical inductor satisfies this: while its runtime is superexponential, it's not so fast-growing that it fails to be primitive recursive (unlike the Ackermann function).
[EDIT: Never mind, this is just Kleene's second recursion theorem!]
Quick question about Kleene's recursion theorem:
Let's say F is a computable function from ℕ^N to ℕ. Is there a single computable function X from ℕ^(N-1) to ℕ such that
X(y_2, ..., y_N) = F(X, y_2, ..., y_N) for all y_2, ..., y_N in ℕ
(taking the X within F as the binary code of X in a fixed encoding) or do there need to be additional conditions?
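For reference, this is exactly what the parametrized form of Kleene's second recursion theorem provides (reading the X inside F as an index e, per the parenthetical above):

```latex
% Kleene's second recursion theorem, parametrized form (standard statement;
% no conditions on F are needed beyond partial computability):
\text{For every partial computable } F : \mathbb{N}^{N} \to \mathbb{N}
\text{ there is an index } e \text{ such that} \\
\varphi_e(y_2, \dots, y_N) \simeq F(e, y_2, \dots, y_N)
\quad \text{for all } y_2, \dots, y_N \in \mathbb{N}.
```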