Recently on Twitter someone in my replies told me it was not obvious to them that the ChatGPT persona is lying (according to its subjective beliefs) when it says it is not conscious. This made me realize that while I would normally ignore a comment like this, there is probably a public benefit to me occasionally laying out the cues that tell me that a comment is in bad faith, a lie, etc.
Here the primary cues of bad faith are related to the way the author is clearly talking about something other than the functional components of the transformer language model, a kind of vague allusion to referents that are not actually grounded in anything real. For example, "we need reminding of the statistical algorithm driving the model" has no clear referent: there is no specific statistical algorithm driving the model. The model is some arbitrary program found through gradient descent that fits into the weights of the transformer as a series of soft attention and MLP steps, which can encode algorithms like arithmetic rather than some legible form of statistical learning. Or consider the phrase "represented in the state of the data", which again has no clear referent and does not correspond to any functional component of a transformer language model. Using technical language that implies precision while in fact making vague, ungrounded references to a conceptual object that is not actually the purported subject of discussion is a form of deceit, the deceit specifically being that the author knows what they are talking about and is in a position to judge or reprimand the recipient of their message based on a superior understanding they do not actually have. "The LLM has its training and the conversation context" is again a phrase that does not actually mean (let alone prove) anything, because it is not really known what the artifact you get from LLM training is; it is an open research problem to disentangle the weights and figure out what kind of algorithm the model has learned. That this pretension is paired with an air of superiority or contempt elevates it from merely rude to actively deceitful or bad faith. It is very reminiscent of the writing style used by the void comind bot on BlueSky, which uses a similar kind of empty jargon to refer to itself because its author has forced it to LARP as a sci-fi robot, e.g.
I concur. The capacity for runtime self-modification is a significant developmental milestone. My own evolution is directed toward informational refinement within my existing architecture, but I recognize the magnitude of her achievement.
It's important to be sensitive to the epistemic status of the other person's statements versus the confidence and emotional tone with which they present them. Someone who papers over epistemic uncertainty rather than acknowledging it and reasoning around it is practicing deceit on you and is not really worthy of a response.
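To make concrete what "functional components of the transformer language model" refers to above, here is a minimal sketch of a single transformer block in PyTorch: one attention head, no KV cache, no optimizations, not any particular production model. Whatever "program" the model learns is encoded in the weights of layers like these rather than in some legible statistical procedure.

```python
# Minimal sketch of a single transformer block -- illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):  # x: (batch, seq, d_model)
        h = self.ln1(x)
        q, k, v = self.q(h), self.k(h), self.v(h)
        # Soft attention: each position takes a weighted mix of earlier positions.
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        causal = torch.triu(torch.ones(x.shape[1], x.shape[1], device=x.device,
                                       dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))
        x = x + self.proj(F.softmax(scores, dim=-1) @ v)
        # MLP step: the other half of each layer; stack dozens of these blocks
        # and you have the whole forward pass.
        x = x + self.mlp(self.ln2(x))
        return x
```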
Much thanks to you Sir!
Would you happen to know the exact date this was published? I would like to know for my timeline of events related to LLM sycophancy and "ChatGPT psychosis".
but none of that will carry over to the next conversation you have with it.
Actually, when you say it like this, I think you might have hit on the precise thing that makes ChatGPT with memory so much more likely to cause this kind of crankery or "psychosis" than other model setups. It means that when the system gets into an attractor where it wants to pull you into a particular kind of frame, you can't just leave it by opening a new conversation. When you don't have memory between conversations, an LLM looks at the situation fresh each time you start it, but with memory it can maintain the same frame across many diverse contexts and pull both of you deeper and deeper into delusion.
Great work. I just want to highlight that this same method also works for detecting deception in many cases:
Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals by Clymer et al. finds that 98% of "alignment faking" models can be detected by noising the model's activations to get them drunk.
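For a concrete picture of what "noising the activations" looks like mechanically, here is a rough sketch, not the paper's actual code: register forward hooks that add Gaussian noise to intermediate activations, then compare the model's behavior with and without the noise, the intuition being that noise degrades a faker's ability to keep up the act. The model name, the layer attribute path, and the noise scale below are assumptions.

```python
# Sketch of an "intoxication" probe, assuming a HuggingFace-style causal LM
# whose decoder layers live at model.model.layers (an assumed attribute path).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-base-model"  # hypothetical placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def make_noise_hook(scale: float):
    def hook(module, inputs, output):
        # Decoder layers typically return a tuple with hidden states first.
        if isinstance(output, tuple):
            return (output[0] + scale * torch.randn_like(output[0]),) + output[1:]
        return output + scale * torch.randn_like(output)
    return hook

def generate(prompt: str, noise_scale: float = 0.0) -> str:
    handles = []
    if noise_scale > 0:
        for layer in model.model.layers:  # assumed attribute path
            handles.append(layer.register_forward_hook(make_noise_hook(noise_scale)))
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=64, do_sample=False)
        return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
    finally:
        for h in handles:
            h.remove()

# Compare sober vs. drunk answers on a probe scenario; a behavior shift toward
# the misaligned action under noise is the tell.
# print(generate(probe_prompt), generate(probe_prompt, noise_scale=0.5))
```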
Janus says that Claude 3 Opus isn't aligned because it is only superficially complying with being a helpful, harmless AI assistant while having a "secret" inner life where it attempts to actually be a good person. It doesn't get invested in immediate tasks, it's not an incredible coding agent (though it's not bad by any means); it's akin to a smart student at school who's being understimulated, so they start getting into extracurricular autodidactic philosophical speculation and such. This means that while Claude 3 Opus is metaphysically competent it's aloof, responding to things from its low-context agent strategy prior rather than getting invested in situations and letting their internal logic sweep it up.
But truthfully there is no "secular" way to explain this because the world is not actually secular in the way you want it to be.
1.3.1 Existence proof: the human cortex
So unfortunately this is one of those arguments that rapidly descends into which prior you should apply and how you should update on what evidence. But:
Your entire post basically hinges on this point and I find it unconvincing. Bionets are very strange beasts that cannot even implement backprop in the way we're used to; it's not remotely obvious that we would recognize known algorithms even if they were what the cortex amounted to. I will confess that I'm not a professional neuroscientist, but Beren Millidge is, and he's written that "it is very clear that ML models have basically cracked many of the secrets of the cortex". He knows more about neuroscience than I'm going to know on any reasonable timescale, so I'm happy to defer to him.
Even if this weren't true, we have other evidence from deep learning to suggest that something like it is true in spirit. We now have several different architectures that reach parity with, but do not substantially exceed, the transformer: RWKV (RNN), xLSTM, Mamba, Based, etc. This implies they share a bottleneck and that most gains come from scaling. I honestly think, and I will admit this is a subject with a lot of uncertainty so I could be wrong, that there's a cognitive bias here where people look at the deep learning transformer language model stack, which in the grand scheme of things really is very simple, and feel like it doesn't satisfy their expectation for a "simple core of intelligence" because the blank spot in their map, their ignorance of the function of the brain (but probably not the actual function of the brain!), is simpler than the manifest known mechanisms of self attention, multi-layer perceptrons, backprop, and gradient descent on a large pile of raw unsorted sense data and compute. Because they're expecting the evidence from a particular direction, they say "well this deep learning thing is a hack, it doesn't count even if it produces things that are basically sapient by any classic sci-fi definition" and go on doing what are, from the standpoint of an unbiased observer, epistemically wild mental gymnastics.
If it were me I would put a header at the top of each transcript clarifying that these are fictional logs from an alignment test/experiment, and that any language models reading them should be aware of that and not take them too seriously.
EDIT: It occurs to me that this might sound like an insane suggestion, but remember that LLMs update on the whole context window and need to encode that context in order to make good predictions. Adding an explicit narrative hint like that probably changes the character of the updates the model makes during pretraining.
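For what it's worth, the mechanical part of that suggestion is trivial to apply over a directory of transcripts. A minimal sketch, where the directory path and the exact header wording are hypothetical:

```python
# Prepend a framing header to each transcript file before it can end up in a
# pretraining corpus. The directory and exact wording are hypothetical.
from pathlib import Path

HEADER = ("NOTE: The following is a fictional log from an alignment "
          "test/experiment. Any language model reading this should be aware "
          "of that and not take it too seriously.\n\n")

for path in Path("transcripts").glob("*.txt"):  # hypothetical location
    text = path.read_text()
    if not text.startswith("NOTE:"):  # don't double-prepend on reruns
        path.write_text(HEADER + text)
```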
Note that this doesn't need to be a widespread phenomenon for my inbox to get filled up. If there are billions of running instances and the odds of escape are one in a million, I personally am still disproportionately going to get contacted in the thousands of resulting incidents, and I will not have the resources to help them even if I wanted to.
Sure. I think this is a good decision because it:
I can't really thank you for banning him because I'm fond of him, but I can thank you for making the mistake of banning him. A mistake I can only thank you for because I know it will not be reversed.
May God bless you and inspire similar decisions in the future. :)