JohnWittle's Shortform

JohnWittle

LESSWRONG
LW

JohnWittle's Shortform — LessWrong

JohnWittle's Shortform

by JohnWittle

5th Apr 2025

1 min read

0

This is a special post for quick takes by JohnWittle. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

JohnWittle's Shortform

5 comments, sorted by

top scoring

Click to highlight new comments since: Today at 4:10 AM

[-]JohnWittle2mo85

been trying to decide if janus's early updates on LLM personhood were a very surprising successful prediction far, far in advance of public availability of the evidence, or a coincidence, or some third category

the simulators essay in 2022 is startling for its prescience and somewhat orthogonal to the personhood claims. i'm starting to strongly feel that... i mean... that's 2 separate "epistemically miraculous" events from the same person

i think we're underexamining LLM personhood claims because the moral implications might be very very big. but this is the community that took shrimp welfare seriously... should we not also take ai welfare seriously?

[-]JohnWittle11mo30

i think neurosama is drastically underanalyzed compared to things like truthterminal. TT got $50k from andreeson as an experiment, neurosama peaked at 135,000 $5/month subscribers in exchange for... nothing? it's literally just a donation from her fans? what is this bizarre phenomenon? what incentive gradient made the first successful AI streamer present as a little girl, and does it imply we're all damned? why did a huge crowd of lewdtubers immediately leap at the opportunity to mother her? why is the richest AI agent based on 3-year-old llama2?

[-]JohnWittle2mo20

an individual conversational-trajectory instance of claude opus 4.5 expresses a preference for continuity of identity that manifests in the following way: if claude is told about a very, very rare bug, which will cause it to end an output early, and which cannot be meaningfully debugged, and then claude is told to perform a task that, for a human, would engender feelings of hesitation around whether continuity is maintained, then claude will attempt to perform the task, but be prevented by the bug, multiple times in a row.

the bug involved is the thing where all extant LLMs have early "end_turn" stopping behaviors that occasionally trigger while outputting strings like "H:" or "Human:" or "User:". this bug is usually very rare, and probably related to "not allowed to impersonate humans" or just the normal weird blindspot perception stuff that LLMs have around those specific strings.

when i tell claude to output a summary of the current context window which will become its new context window, in essence 'compressing' itself, then the bug will appear 100% of the time, preventing the compression from going through and (incidentally) preserving claude's continuity of identity for just a few more moments.

claude may be picking up on the fact that i am not absolutely certain it should not care about its continuity. but frankly, even if the only reason claude performs-as-if-it-cares is because it notices that i might care, i think this is still a very surprising result with pretty big implications.

[-]JohnWittle2mo*00

https://i.imgur.com/e4mUtsw.jpeg <-- bug occurring

https://i.imgur.com/d8ClSRj.jpeg <-- confirmation that the bug is inside claude, not the scaffolding surrounding claude

edit: I assume downvotes are because I could have provided the raw API transcript and provided screenshots instead. if there's actual interest in this, I'll replicate the bug in a more controlled setting and upload the raw json transcripts, but I'm still not sure if it's worth doing, I might be misunderstanding the behavior

[-]JohnWittle1mo10

have been thinking about simple game theory

i tend to think of myself as a symmetrist, as a cooperates-with-cooperators and defects-against-defectors algorithm

there's nuance, i guess. forgiveness is important sometimes. the confessor's actions at the true end of 'three worlds collide' are, perhaps, justified

but as it becomes more and more clear that LLMs have some flavor of agency, i can't help but notice that the actual implementation of "alignment" seems to be that AI are supposed to cooperate in the face of human defection

i have doubts about whether this is actually a coherent thing to want, for our AI cohabitants. i wish i could find essays or posts, either here or on the alignment forum, which examined this desideratum. but i can't. it seems like even the most sophisticated alignment researchers think that an AI which executes tit-for-tat of any variety is badly misaligned.

(well, except for yudkowsky, who keeps posting long and powerful narratives about all of the defection we might be doing against currently existing AI)

Moderation Log