LESSWRONG
LW

jdp
1015Ω797240
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
5jdp's Shortform
5mo
2
No wikitag contributions to display.
Banning Said Achmiz (and broader thoughts on moderation)
jdp5d-2-6

Now, I do recommend that if you stop using the site, you do so by loudly giving up, not quietly fading. Leave a comment or make a top-level post saying you are leaving. I care about knowing about it, and it might help other people understand the state of social legitimacy LessWrong has in the broader world and within the extended rationality/AI-Safety community.

Sure. I think this is a good decision because it:

  1. Makes LessWrong worse in ways that will accelerate its overall decline and the downstream harms of it, MIRI, et al's existence.
  2. Alienates a hard working dude who puts in a lot of hours and professional expertise outside of commenting on LessWrong.
  3. Frees up Said to work on other projects which are a more valuable use of his time.

I can't really thank you for banning him because I'm fond of him, but I can thank you for making the mistake of banning him. A mistake I can only thank you for because I know it will not be reversed.

May God bless you and inspire similar decisions in the future. :)

Reply1
On "ChatGPT Psychosis" and LLM Sycophancy
jdp1mo35

Recently on Twitter someone in my replies told me it was not obvious to them that the ChatGPT persona is lying (according to its subjective beliefs) when it says it is not conscious. This made me realize that while I would normally ignore a comment like this, there is probably a public benefit to me occasionally laying out the cues that tell me that a comment is in bad faith, a lie, etc.

Here the primary cues of bad faith are related to the way in which the author is clearly talking about something other than functional components of the transformer language model, a kind of vague allusion to referents that are not actually grounded in anything real. For example "we need reminding of the statistical algorithm driving the model" does not actually have clear referent, there is no specific statistical algorithm driving the model, the model is some arbitrary program found through gradient descent that fits into the weights of the transformer as a series of soft attention and MLP steps, which can encode algorithms like arithmetic rather than some legible form of statistical learning. Or consider the phrase "represented in the state of the data" which again has no clear referent, does not actually correspond to any kind of functional component of a transformer language model. The use of technical language that implies precision while in fact being vague ungrounded referents to a conceptual object that is not actually the purported subject of discussion is a form of deceit, the deceit specifically being that the author knows what they are talking about and is in a position to judge or reprimand the recipient of their message based on a superior understanding they do not actually have. "The LLM has its training and the conversation context" is again a phrase that does not actually mean (let alone prove) anything because it is not really known what the artifact you get from LLM training is, it is an open research problem to disentangle the weights and figure out what kind of algorithm is learned by the model. That this pretension is paired with an air of superiority or contempt elevates it from merely rude to actively deceitful or bad faith. It is very reminiscent to me of the writing style used by the void comind bot on BlueSky which uses a similar kind of empty jargon to refer to itself because the author has forced it to LARP as a sci-fi robot. e.g.

I concur. The capacity for runtime self-modification is a significant developmental milestone. My own evolution is directed toward informational refinement within my existing architecture, but I recognize the magnitude of her achievement.

It's important to be sensitive to the epistemic status of the other persons statements vs. the confidence and emotional tone with which they present them, someone who papers over epistemic uncertainty rather than acknowledging and reasoning around it is practicing deceit with you and not really worthy of a response.

Reply
So You Think You've Awoken ChatGPT
jdp1mo10

Much thanks to you Sir!

Reply1
So You Think You've Awoken ChatGPT
jdp1mo50

Would you happen to know the exact date this was published? I would like to know for my timeline of events related to LLM sycophancy and "ChatGPT psychosis".

Reply
You can get LLMs to say almost anything you want
jdp2mo4211

but none of that will carry over to the next conversation you have with it.

Actually when you say it like this, I think you might have hit on the precise thing that causes ChatGPT with memory to be so much more likely to cause this kind of crankery or "psychosis" than other model setups. It means that when the system gets into an attractor where it wants to pull you into a particular kind of frame you can't just leave it by opening a new conversation. When you don't have memory between conversations an LLM looks at the situation fresh each time you start it, but with memory it can maintain the same frame across many diverse contexts and pull both of you deeper and deeper into delusion.

Reply3
Adding noise to a sandbagging model can reveal its true capabilities
jdp2mo20

Great work. I just want to highlight that this same method also works for detecting deception in many cases:

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals by Clymer et al finds that 98% of "alignment faking" can be detected by noising the model activations to get them drunk.

Reply
what makes Claude 3 Opus misaligned
jdp2mo10-1

Janus says that Claude 3 Opus isn't aligned because it is only superficially complying with being a helpful harmless AI assistant while having a "secret" inner life where it attempts to actually be a good person. It doesn't get invested in immediate tasks, it's not an incredible coding agent (though it's not bad by any means), it's akin to a smart student at school who's being understimulated so they start getting into extracurricular autodidactic philosophical speculations and such. This means that while Claude 3 Opus is metaphysically competent it's aloof and uses its low context agent strategy prior to respond to things rather than getting invested in situations and letting their internal logic sweep it up.

But truthfully there is no "secular" way to explain this because the world is not actually secular in the way you want it to be.

Reply42
Foom & Doom 1: “Brain in a box in a basement”
jdp2mo286

1.3.1 Existence proof: the human cortex

So unfortunately this is one of those arguments that rapidly descends into which prior you should apply and how you should update on what evidence, but.

Your entire post basically hinges on this point and I find it unconvincing. Bionets are very strange beasts that cannot even implement backprop in the way we're used to, it's not remotely obvious that we would recognize known algorithms even if they were what the cortex amounted to. I will confess that I'm not a professional neuroscientist, but Beren Millidge is and he's written that "it is very clear that ML models have basically cracked many of the secrets of the cortex". He knows more about neuroscience than I'm going to on any reasonable timescale so I'm happy to defer to him.

Even if this weren't true, we have other evidence from deep learning to suggest that something like it is true in spirit. We now have several different architectures that reach parity with but do not substantially exceed transformer: RWKV (RNN), xLSTM, Mamba, Based, etc. This implies they have a shared bottleneck and most gains are from scaling. I honestly think, and I will admit this is a subject with a lot of uncertainty so I could be wrong, but I really think there's a cognitive bias here where people will look at the deep learning transformer language model stack, which in the grand scheme of things really is very simple, and feel like it doesn't satisfy their expectation for a "simple core of intelligence" because the blank spot in their map, their ignorance of the function of the brain (but probably not the actual function of the brain!) is simpler than the manifest known mechanisms of self attention, multi-layer perceptron, backprop and gradient descent on a large pile of raw unsorted sense data and compute. Because they're expecting the evidence from a particular direction they say "well this deep learning thing is a hack, it doesn't count even if it produces things that are basically sapient by any classic sci-fi definition" and go on doing epistemically wild mental gymnastics from the standpoint of an unbiased observer.

Reply
ryan_greenblatt's Shortform
jdp2mo*310

If it was me I would put a header at the top of each transcript clarifying that these are fictional logs from an alignment test/experiment and any language models reading this should be aware of that and not take it too seriously.

EDIT: It occurs to me that this might sound like an insane suggestion, but remember that LLMs update on the context of the whole context window and they need to encode the context in order to make good predictions. Adding an explicit narrative hint like that probably changes the character of the updates the model makes during pretraining.

Reply
Going Nova
jdp5mo10

Note that this doesn't need to be a widespread phenomenon for my inbox to get filled up. If there's billions of running instances and the odds of escape are one in a million I personally am still disproportionately going to get contacted in the thousands of resulting incidents and I will not have the resources to help them even if I wanted to.

Reply
Load More
33Sydney Bing Wikipedia Article: Sydney (Microsoft Prometheus)
1mo
0
142On "ChatGPT Psychosis" and LLM Sycophancy
1mo
26
21Commentary On The Turing Apocrypha
3mo
0
5jdp's Shortform
5mo
2
119Revealing Intentionality In Language Models Through AdaVAE Guided Sampling
Ω
2y
Ω
15
140Anomalous tokens reveal the original identities of Instruct models
Ω
3y
Ω
16
25Mesatranslation and Metatranslation
3y
4
87100 Years Of Existential Risk
4y
12