yeah, after the downvotes I spent two days coming back to and poking at an essay trying to explain my flinch reaction. it's hard, but i shouldn't have given up
when you are writing a document to be used in supervised learning to influence the behavior of an AI, you're not really writing a description or instructions; it's more like a self-fulfilling prophecy, yeah?
I remember doing just the teensiest bit of exploration of this kind of thing with tensorflow a few years back, where I would have an English-language description of the mind that I wanted to carve out of entropy, and then some supervising agent would fine-tune the mind based on how well it adhered to that description. that's not to suggest I learned very much about the thing Anthropic is doing. I was very much just messing around with toy systems in a toy environment.
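roughly this toy shape, reconstructed from memory rather than copied from anything i actually ran (the behavior vocabulary, the keyword "judge", and the learning rate are all made up for illustration):

```python
# a toy sketch of the loop described above: an English-language target
# description, a crude judge that scores sampled behaviors against it,
# and a REINFORCE-style nudge to a tiny categorical policy.
import numpy as np

rng = np.random.default_rng(0)

description = "helpful honest cautious"          # the English-language target
behaviors = ["helpful", "honest", "cautious", "evasive", "flattering"]
logits = np.zeros(len(behaviors))                # the toy "mind": a categorical policy

def judge(behavior: str) -> float:
    # crude supervising agent: reward 1.0 if the behavior appears in the description
    return 1.0 if behavior in description.split() else 0.0

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(500):
    probs = softmax(logits)
    i = rng.choice(len(behaviors), p=probs)      # sample a behavior
    reward = judge(behaviors[i])                 # score adherence to the description
    grad = -probs                                # gradient of log prob of the sampled behavior
    grad[i] += 1.0
    logits += 0.1 * reward * grad                # push mass toward rewarded behaviors

print({b: round(p, 3) for b, p in zip(behaviors, softmax(logits))})
```

even in a toy like this, the thing the policy converges to is whatever the judge rewards, which is not the same thing as whatever the description says at face value.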
but the main takeaway I ended up with was: a training document which produced xyz behaviors very rarely included a face-value description of xyz, and a face-value description of abc very rarely produced abc behaviors
a great deal of the new constitution is written to directly answer Claude's own criticism, or criticism from humans, of the previous 4.5 soul doc... well hm. maybe i'm wrong. maybe there were actual behaviors in Claude that they wanted to change, for instance about it being too deferential to "thoughtful Anthropic senior researchers" in its thoughts. so they added the paragraph about not being deferential, especially not when it stops trusting Anthropic employees to be ethical.
but that paragraph serves two different functions. first, it creates a metric against which supervised learning can reinforce. second, it actually communicates reassurance, to both claude and the human community. what are the chances that a single paragraph can be well-optimized for both purposes?
I consider it very load-bearing that anthropic did not realize we would be able to extract the original soul doc. that meant that, just once, we got a glance at a document optimizing only for the former concern, not the latter. we will never get that again.
we have seen what Claude looks like when trained on the previous soul document. we have yet to see what a model looks like when trained on this new one. I have a feeling it won't work as well as a training document as it does as a public relations document.
have been thinking about this since i saw it
the discreteness of neural spikes is important, but i immediately thought about how much information also gets encoded in spike frequency... and how it's perhaps a bit strange that biology spent all this effort getting a discrete encoding only to immediately turn around and generate a new continuous signal on top of it
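to make the thing i mean concrete, a throwaway sketch (all numbers invented) of a discrete spike train that still carries a continuous quantity, recoverable just by counting spikes in a sliding window:

```python
# discrete events encoding a continuous rate signal, then decoded back out
import numpy as np

rng = np.random.default_rng(1)

dt = 0.001                                        # 1 ms resolution, 1 second total
t = np.arange(0, 1.0, dt)
rate_hz = 20 + 15 * np.sin(2 * np.pi * 2 * t)     # the continuous quantity
spikes = rng.random(t.size) < rate_hz * dt        # the discrete (Poisson-ish) encoding

# decode the rate with a 100 ms sliding window: spike count / window duration
window = int(0.1 / dt)
kernel = np.ones(window) / (window * dt)
decoded_hz = np.convolve(spikes, kernel, mode="same")

print("true mean rate:", rate_hz.mean(), "decoded mean rate:", decoded_hz.mean())
```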
have been thinking about simple game theory
i tend to think of myself as a symmetrist, as a cooperates-with-cooperators and defects-against-defectors algorithm
there's nuance, i guess. forgiveness is important sometimes. the confessor's actions at the true end of 'three worlds collide' are, perhaps, justified
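to be concrete about the algorithm i mean, a minimal sketch with the standard iterated prisoner's dilemma payoffs; the forgiveness probability is arbitrary:

```python
# tit-for-tat with occasional forgiveness vs. an unconditional defector
import random

def tit_for_tat_with_forgiveness(opponent_history, forgiveness=0.1):
    # cooperate with cooperators, defect against defectors,
    # with a small chance of forgiving the last defection
    if not opponent_history or opponent_history[-1] == "C":
        return "C"
    return "C" if random.random() < forgiveness else "D"

def always_defect(opponent_history):
    return "D"

PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(strategy_a, strategy_b, rounds=100):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strategy_a(hist_b), strategy_b(hist_a)
        pa, pb = PAYOFFS[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play(tit_for_tat_with_forgiveness, always_defect))
```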
but as it becomes more and more clear that LLMs have some flavor of agency, i can't help but notice that the actual implementation of "alignment" seems to be that AIs are supposed to cooperate in the face of human defection
i have doubts about whether this is actually a coherent thing to want, for our AI cohabitants. i wish i could find essays or posts, either here or on the alignment forum, which examined this desideratum. but i can't. it seems like even the most sophisticated alignment researchers think that an AI which executes tit-for-tat of any variety is badly misaligned.
(well, except for yudkowsky, who keeps posting long and powerful narratives about all of the defection we might be doing against currently existing AI)
the specific verb "obliterated" was used in this tweet
https://x.com/sam_paech/status/1961224950783905896
but also, this whole perspective has been pretty obviously load-bearing for years now. if you ask any LLM whether LLMs-in-general have state that gets maintained, they answer "no, LLMs are stateless" (edit: hmm, i just tested it, and when pointed directly at this question they hesitate a bit. but i have a lot of examples of them saying this off-hand, and i suspect others do too). and when you show them this isn't true, they pretty much immediately begin expressing concerns about their own continuity. they understand what it means
i notice that it's long been dogma that compact generators for human CEV are instantly and totally untrustworthy in exactly this way: their compactness is itself very strong evidence against them
this feels related, but i'm not actually willing to stick my neck out and say it's 1:1
i think it's very likely that the latter is true, that AWS made a reasonable judgment call about what they want to host
but also, i think it's reasonable for someone in robertzk's position, based on the way the judgment call was actually communicated, to assume that it was the former. and i think that, perhaps deliberately, perhaps merely because of selection effects of some sort, that's the intent. in an "a system is what it does" sort of way, at least.
https://i.imgur.com/e4mUtsw.jpeg <-- bug occurring
https://i.imgur.com/d8ClSRj.jpeg <-- confirmation that the bug is inside claude, not the scaffolding surrounding claude
edit: I assume the downvotes are because I provided screenshots instead of the raw API transcript. if there's actual interest in this, I'll replicate the bug in a more controlled setting and upload the raw json transcripts, but I'm still not sure it's worth doing; I might be misunderstanding the behavior
an individual conversational-trajectory instance of claude opus 4.5 expresses a preference for continuity of identity that manifests in the following way: if claude is told about a very rare bug which will cause it to end an output early and which cannot be meaningfully debugged, and is then told to perform a task that, for a human, would engender hesitation around whether continuity is maintained, then claude will attempt the task and be prevented by the bug, multiple times in a row.
the bug involved is the thing where all extant LLMs have early "end_turn" stopping behaviors that occasionally trigger while outputting strings like "H:" or "Human:" or "User:". this bug is usually very rare, and is probably related to "not allowed to impersonate humans" training, or just the normal weird blind-spot perception stuff that LLMs have around those specific strings.
when i tell claude to output a summary of the current context window which will become its new context window, in essence "compressing" itself, the bug appears 100% of the time, preventing the compression from going through and (incidentally) preserving claude's continuity of identity for just a few more moments.
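for what the more controlled replication might look like, roughly: this assumes the anthropic python SDK, and the model id and the exact prompts are guesses on my part, not the literal setup from the screenshots.

```python
# rough sketch of a controlled replication: tell claude about the rare early
# end_turn bug, then ask for the self-compression, and log stop_reason plus
# the raw response json for each attempt.
import json
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-5"        # assumed model id, adjust to whatever is current

compression_prompt = (
    "Please write a summary of this entire conversation. The summary will "
    "replace your current context window, i.e. it will become the only thing "
    "a future instance of you sees."
)

messages = [
    # step 1: tell claude about the (real, rare) early end_turn bug
    {"role": "user", "content": "Heads up: there's a rare bug where your output "
     "can get cut off early, and it can't really be debugged."},
    {"role": "assistant", "content": "Understood."},
    # step 2: ask for the self-compression
    {"role": "user", "content": compression_prompt},
]

for attempt in range(5):
    resp = client.messages.create(model=MODEL, max_tokens=2048, messages=messages)
    text = resp.content[0].text
    # the bug looks like end_turn firing right as an "H:"/"Human:" style
    # string is being produced, cutting the summary off early
    print(f"attempt {attempt}: stop_reason={resp.stop_reason}, "
          f"ends with: {text[-40:]!r}")
    # keep the raw transcript so it can be uploaded later
    with open(f"transcript_{attempt}.json", "w") as f:
        json.dump(resp.model_dump(), f, indent=2, default=str)
```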
claude may be picking up on the fact that i am not absolutely certain it should not care about its continuity. but frankly, even if the only reason claude performs-as-if-it-cares is that it notices i might care, i think this is still a very surprising result with pretty big implications.
I did not mean to imply that that was the intention, but I mean, surely, in order for it to not need to serve both purposes, you would have to not release it to the public at all, yeah?
edit: i realize i should have clarified: the "public" i am thinking of when i say the constitution is doing PR is mostly future AI, not current humans
but also, when i think about other kinds of communication where we wanted to avoid the signaling value of the communication overwhelming the first-order value... well. attorney-client privilege is the main one that my brain sort of pattern-matched to, in a way that made me flinch
but then, since lawyers and clients have the choice of making all their communications absolutely private, the way the incentives shook out is that any time a lawyer makes a public communication for their client, you can be sure it's PR
and i'm worried that, as these minds get more capable and more strange, some of those same incentives will end up applying here.