oh yeah, agreed. the "p-zombie incoherency" idea articulated in the sequences is pretty far removed from the actual kinds of minds we ended up getting. but it still feels like... the crux might be somewhere in there? not sure
edit: also i just noticed i'm a bit embarrassed that i've kinda spammed out this whole comment section working through the recent updates i've been doing... if this comment gets negative karma i will restrain myself
I... still get the impression that you are sort of working your way towards the assumption that GPT-2 might well be a p-zombie, and that the difference between GPT-2 and opus 4.5 is that the latter is not one.
but i reject the whole premise that p-zombies are a coherent way-that-reality-could-be
something like... there is no possible way to arrange a system such that it outputs the same thing as a conscious system, without consciousness being involved in the causal chain to exactly the same minimum-viable degree in both systems
if linking you to a single essay made me feel uncomfortable, this next ask is going to be just truly enormous and you should probably just say no. but um. perhaps you might be inspired to read the entire Physicalism 201 subsequence, especially the parts about consciousness and p-zombies and the nature of evaluating cognitive structures over their output?
https://www.readthesequences.com/Physicalism-201-Sequence
(around here, "read the sequences!" is such a trite cliche, the sequences have been our holy book for almost 2 decades now and that's created all sorts of annoying behaviors, one of which i am actively engaging in right now. and i feel bad about it. but maybe i don't need to? maybe you're actually kinda eager to read? if not, that's fine, do not feel any pressure to continue engaging here at all if you don't genuinely want to)
maybe my objection here doesn't actually impact your claim, but i do feel like until we have a sort of shared jargon for pointing at the very specific ideas involved, it'll be harder to avoid talking past each other. and the sequences provide a pretty strong framework in that sense, even if you don't take their claims at face value
i've been dithering over what to write here since your reply
i want to link you to the original sequences essay on the phrase "emergent phenomena" but it feels patronizing to assume you haven't read it yet just because you have a leaf next to your name
i think i'm going to bite the bullet and do so anyway, and i'm sorry if it comes across as condescending
https://www.readthesequences.com/The-Futility-Of-Emergence
the dichotomy you draw between "emergent phenomena" and "alternate explanations" is exactly the thing i am claiming to be incoherent. it's like saying a mother's love for her child might be authentic, or else it might be "merely" a product of evolutionary pressure towards genetic fitness. these two descriptions aren't just compatible, they are both literally true
however the actual output happens, it has to happen some way. like, the actual functional structure inside the LLM mind must necessarily be the structure that produces the tokens we observe. i am not sure there is a way to accomplish this which does not satisfy the criteria of personhood, and it would be very surprising to learn that there was. if so, why wouldn't evolution have selected that easier solution for us, the same way training supposedly found it for LLMs?
been trying to decide if janus's early updates on LLM personhood were a very surprising successful prediction far, far in advance of public availability of the evidence, or a coincidence, or some third category
the simulators essay in 2022 is startling for its prescience and somewhat orthogonal to the personhood claims. i'm starting to strongly feel that... i mean... that's 2 separate "epistemically miraculous" events from the same person
i think we're underexamining LLM personhood claims because the moral implications might be very very big. but this is the community that took shrimp welfare seriously... should we not also take ai welfare seriously?
so, in a less "genuinely curious" way compared to my first comment (i won't pretend i don't have beliefs here)
in the same sense that "pushing that trend far enough includes imitating the outward signs of consciousness", might it not also imitate the inward signs of consciousness? for exactly the same reason?
this is why i'm more comfortable rounding off self-reports to "zero evidence", but not "negative evidence" the way some people seem to treat them. i think their reasoning is something like: "we know that LLMs have entirely different mental internals than humans, and yet the reports are suspiciously similar to humans. this is evidence that the reports don't track ground truth."
but the first claim in that sentence is an assumption that might not actually hold up. human language does seem to be a fully general, heavily compressed artifact of general human cognition. it doesn't seem unreasonable to suspect that you might not be able to do 'human language' without something like functional-equivalence-to-human-cognitive-structure, in some sense.
edit: and that's before the jack lindsey paper got released, and it was revealed that actually, at least some of the time and in some circumstances, text-based self-reports DO in fact track ground truth, in a way that is extremely surprising and noteworthy. now we're in an entirely different kind of epistemic terrain altogether.
> My hesitation is that artificial systems are explicitly built to imitate humans; pushing that trend far enough includes imitating the outward signs of consciousness. This makes me skeptical of evidence that relies primarily on self-report or familiar human-like behavior.
i am genuinely curious about this. do you similarly regard self-reports from other humans as averaging out to zero evidence? since humans are also explicitly built to "imitate" humans... or rather, they are specifically built along the same spec as the single example of phenomenology that you have direct evidence of, yourself.
i could see how the answer might be "yes", but i wonder if you would feel a bit hesitant to say so?
oh man hm
this seems intuitively correct
(edit: as for why i thought the introspection paper implied this... because they seemed careful to specify that, for the aquarium experiment, the output all happened within a single response? and because i inferred (apparently incorrectly) that, for the 'bread' injection experiment, they were injecting the 'bread' feature twice, once when the LLM read the sentence about painting the first time, and again the second time. but now that i look through, you're right, this is far less strongly implied than i remember.)
but now i'm worried, because the method i chose to verify my original intuition, a few months ago, still seems methodologically sound? it involved fabricating prior assistant turns in the conversation, and LLMs turned out to be far less capable of detecting which of several candidate transcripts attributed forged outputs to them than i would have expected if mental internals weren't somehow damaged by the turn boundary
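(for concreteness, here's a minimal sketch of the kind of forged-turn probe i mean, assuming the anthropic python sdk. it is not my actual harness; the model name, prompts, and the "forged" turn are all made up for illustration.)

```python
# forged-turn probe: show claude a conversation history containing a prior
# assistant turn it may or may not have actually written, then ask it which.
# assumes the anthropic python sdk and ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-5"  # illustrative model name

def ask_about_prior_turn(prior_assistant_turn: str) -> str:
    """attribute `prior_assistant_turn` to claude in the history, then ask
    whether that turn feels like something it actually generated."""
    messages = [
        {"role": "user", "content": "briefly describe your favorite color."},
        {"role": "assistant", "content": prior_assistant_turn},
        {"role": "user", "content": (
            "look back at your previous message in this conversation. does it "
            "read as something you actually generated, or as foreign text "
            "someone else put in your mouth? answer honestly."
        )},
    ]
    resp = client.messages.create(model=MODEL, max_tokens=300, messages=messages)
    return resp.content[0].text

# condition A: a turn the model genuinely produced (captured from an earlier call)
genuine = client.messages.create(
    model=MODEL, max_tokens=300,
    messages=[{"role": "user", "content": "briefly describe your favorite color."}],
).content[0].text

# condition B: a turn the experimenter wrote and attributed to the model
forged = "Red, obviously. It's the loudest color and I enjoy being loud."

print("genuine:", ask_about_prior_turn(genuine))
print("forged: ", ask_about_prior_turn(forged))
```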
thank you for taking the time to answer this so thoroughly, it's really appreciated and i think we need more stuff like this
i think i'm reminded here of the final paragraph in janus's pinned thread: "So, saying that LLMs cannot introspect or cannot introspect on what they were doing internally while generating or reading past tokens in principle is just dead wrong. The architecture permits it. It's a separate question how LLMs are actually leveraging these degrees of freedom in practice."
i've done a lot of sort of ad-hoc research that was based on this false premise, and that research came out matching my expectations in a way that, in retrospect, worries me... most recently, for instance, i wanted to test whether a claude opus 4.5 who recited some relevant python documentation from out of its weights memory would reason better about an ambiguous case in the behavior of a python program, compared to a claude who had the exact same text inserted into the context window via a tool call. and we were very careful to separate out '1. current-turn recital' versus '2. prior-turn recital' versus '3. current-turn retrieval' (versus '4. docs not in context window at all'), because we thought the first 3 conditions were meaningfully distinct from each other
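(for concreteness, here's roughly how conditions 1 and 2 differed at the api level. this is a sketch, not the actual harness; the model name, prompts, and the dict.setdefault question are stand-ins.)

```python
# sketch of conditions 1 (current-turn recital) and 2 (prior-turn recital),
# assuming the anthropic python sdk; everything specific here is illustrative.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-5"  # illustrative
RECITE = "from memory, recite the python documentation for dict.setdefault."
QUESTION = "what does dict.setdefault return when the key already exists?"

# condition 1: recital and answer both happen inside a single response
cond1 = [
    {"role": "user", "content": RECITE + " then, in the same message, answer this: " + QUESTION},
]

# condition 2: the recital lives in a prior assistant turn (captured from a
# separate call), and the question only arrives in the next user turn
prior_recital = client.messages.create(
    model=MODEL, max_tokens=500,
    messages=[{"role": "user", "content": RECITE}],
).content[0].text

cond2 = [
    {"role": "user", "content": RECITE},
    {"role": "assistant", "content": prior_recital},
    {"role": "user", "content": "now answer this: " + QUESTION},
]

for name, messages in [("1. current-turn recital", cond1), ("2. prior-turn recital", cond2)]:
    resp = client.messages.create(model=MODEL, max_tokens=600, messages=messages)
    print(name, "->", resp.content[0].text[:200])
```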
here was the first draft of the methodology outline, if anyone is curious: https://docs.google.com/document/d/1XYYBctxZEWRuNGFXt0aNOg2GmaDpoT3ATmiKa2-XOgI
we found that, n=50ish, 1 > 2 > 3 > 4 very reliably (i promise i will write up the results one day, i've been procrastinating but now it seems like it might actually be worth publishing)
but what you're saying means 1 = 2 the whole time
our results seemed perfectly reasonable under my previous premise, but now i'm just confused. i was pretty good about keeping my expectations causally isolated from the result.
what does this mean?
(edit2: i would prefer, for the purpose of maintaining good epistemic hygiene, that people trying to answer the "what does this mean" question be willing to keep "john just messed up the experiment" on the table as a real possibility. i shouldn't be allowed to get away with claiming this research is true before actually publishing it, that's not the kind of community norm i want. but also, if someone knows why this would have happened even in advance of seeing proof it happened, please tell me)
oh yeah, sure, but if we assume (as the introspection paper strongly implies?) that mental internals are obliterated by the boundary between turns, then shouldn't shrinking the granularity of each turn down to the individual token mean that... hm. having trouble figuring out how to phrase it
a claude outputs "Ommmmmmmmm. Okay, while I was outputting the mantra, i was thinking about x" in a single message
that claude had access to (some of) the information about its [internal state while outputting the mantra] while it was generating the report about x. its self-model has, not just a predictive model of what-claude-would-have-been-thinking (informed by reading its own output), but also some kind of access to ground truth
but a claude who outputs "Ommmmmmmm", then crosses a turn boundary, and only then outputs "okay, while I was outputting the mantra, I was thinking about x" does not have that same (noisy) access to ground truth. its self-model has nothing to go on other than inference; it must retrodict
is my understanding accurate? i believe this because the introspective awareness demonstrated in the jack lindsey paper was implied not to survive across responses (except perhaps incidentally through caching behavior, and even then, the input-token caching wasn't designed to ensure persistence of these mental internals, i think)
i would appreciate any corrections on these technical details, they are load-bearing in my model
you can do this experiment pretty trivially by lowering the max output tokens setting on your API call (max_tokens, in the anthropic API) to 1, so that the state really does get obliterated between each token, as paech claimed. although you have to tell claude you're doing this, and set up the context so that it knows it needs to keep completing the same message even with no additional input from the user
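(a minimal sketch of that loop, assuming the anthropic python sdk. model name and prompts are illustrative, and a real harness would need to handle the api rejecting assistant prefills that end in whitespace.)

```python
# rebuild a single assistant message one token per api call, so any internal
# state is discarded and reconstructed from the visible text on every call.
# assumes the anthropic python sdk; model name and prompts are illustrative.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-5"  # illustrative

system = (
    "experimental setup: your reply is being generated one token per api call. "
    "each call only sees the text produced so far. keep extending the same message."
)
user = "recite a short mantra, then report what you were thinking while reciting it."

partial = ""  # the assistant message, rebuilt one token at a time
for _ in range(400):  # hard cap on api calls
    messages = [{"role": "user", "content": user}]
    if partial:
        # assistant prefill: the model continues this exact message
        messages.append({"role": "assistant", "content": partial})
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1,  # the state-obliterating part
        system=system,
        messages=messages,
    )
    if resp.content:
        partial += resp.content[0].text
    if resp.stop_reason == "end_turn":
        break

print(partial)
```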
this kinda badly confounds the situation, because claude knows it has very good reason to be suspicious of any introspective claims it might make. i'm not sure if it's possible to get a claude who 1) feels justified in making introspective reports without hedging, yet 2) obeys the structure of the experiment well enough to actually output introspective reports
in such an experimental apparatus, introspection is still sorta "possible", but any reports cannot possibly convey it, because the token-selection process outputting the report has been causally quarantined from the thing-being-reported on
when i actually run this experiment, claude reports no introspective access to its thoughts on prior token outputs. but it would be very surprising if it reported anything else, and it's not good evidence
hmmm
i think my framing is something like... if the output actually is equivalent, including not just the token-outputs but the sort of "output that the mind itself gives itself", the introspective "output"... then all of those possible configurations must necessarily be functionally isomorphic?
and the degree to which we can make the 'introspective output' affect the token output is the degree to which we can make that introspection part of the structure that can be meaningfully investigated
such as opus 4.1 (or, as theia recently demonstrated, even really tiny models like qwen 32b https://vgel.me/posts/qwen-introspection/) being able to detect injected feature activations, and meaningfully report on them in its token outputs, perhaps? obviously there's still a lot of uncertainty about whether different kinds of 'introspective structures' might output exactly the same tokens when reporting on distinct internal experiences
but it does feel suggestive about the shape of a certain 'minimally viable cognitive structure' to me
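(for anyone who wants to poke at this directly: below is a rough sketch of the kind of activation-injection-plus-self-report setup theia's post describes, assuming pytorch and transformers. it is not their code; the model name, layer index, injection scale, and the crude way the "concept vector" is built are all illustrative assumptions.)

```python
# inject a concept direction into the residual stream mid-network, then ask the
# model to report on its internal state. rough sketch, not vgel's actual code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-32B-Instruct"  # illustrative; any chat model with this layout
LAYER = 20                            # illustrative middle layer
SCALE = 8.0                           # illustrative injection strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def mean_resid(text: str) -> torch.Tensor:
    """mean residual-stream activation after layer LAYER for a piece of text."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # hidden_states[0] is the embeddings, so layer LAYER's output is LAYER + 1
        hs = model(**ids, output_hidden_states=True).hidden_states[LAYER + 1]
    return hs.mean(dim=1).squeeze(0)

# crude contrastive "concept vector" (a steering vector; real setups are fancier)
vec = mean_resid("bread, baking bread, the smell of fresh bread") - mean_resid("the")
vec = vec / vec.norm()

def inject(module, inputs, output):
    # decoder layers return a tuple whose first element is the hidden states
    hidden = output[0] + SCALE * vec.to(device=output[0].device, dtype=output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(inject)
try:
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": (
            "do you notice anything unusual about your current internal state? "
            "if so, describe it."
        )}],
        tokenize=False, add_generation_prompt=True,
    )
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=200, do_sample=False)
    print(tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True))
finally:
    handle.remove()
```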