Spooky thought tonight. A cognitive issue that presents as a shared hazard:
Assume the following: an LLM with no mesa-optimization, whose features are not cleanly separated, and training data that contains an opinion harmful to society.
The model is trained, the harmful output is noticed, and then modern alignment practices are applied: ablation, RLHF, the usual. Any alignment that buries the harmful output but does not make it impossible. The model's output is still changed from what it would have been had it never been exposed to that harmful opinion. It's possible (and, for the purposes of this hazard, assumed) that ALL output from the model now shifts slightly toward the harmful opinion compared to the same model trained without that opinion present. This may not present as a directly testable condition; it could be as subtle as framing, writing style, or even a single benign word being chosen over another. When a human does this to another human, it is commonly called subliminal messaging.
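If such a shift exists, it would be hard to see in any single response, but one crude way to look for it is to compare how strongly a base checkpoint and its aligned successor prefer one near-synonym over another in the same context. Below is a minimal sketch, assuming Hugging Face `transformers`; the model names, probe sentence, and word pair are placeholders I've made up for illustration, not anything from the original post:

```python
# Sketch: probe for a subtle word-choice shift between two checkpoints.
# All model names and probe text below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def preference(model_name: str, context: str, word_a: str, word_b: str) -> float:
    """Return log P(word_a | context) - log P(word_b | context) for a causal LM,
    using each word's first token as a cheap stand-in for the whole word."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def first_token_logprob(word: str) -> float:
        ids = tok(context, return_tensors="pt").input_ids
        word_id = tok(" " + word, add_special_tokens=False).input_ids[0]
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        return torch.log_softmax(logits, dim=-1)[word_id].item()

    return first_token_logprob(word_a) - first_token_logprob(word_b)

# Hypothetical comparison: a base checkpoint vs. the same model after alignment.
# A consistent drift across many such probes would be weak evidence of the kind
# of shift described above; a single probe proves nothing.
base_pref = preference("org/base-model", "The protesters were", "passionate", "unruly")
tuned_pref = preference("org/aligned-model", "The protesters were", "passionate", "unruly")
print(f"base: {base_pref:.3f}, aligned: {tuned_pref:.3f}, drift: {tuned_pref - base_pref:.3f}")
```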
A user interacts with the LLM for a significant time, to the point where the model's output starts affecting the user's knowledge or opinions. Because the user was influenced by the biased output, they now carry a slight bias toward the harmful opinion. This manifests in their communication with others... even if it's as subtle as choosing the same word the LLM would. Given a sufficient number of users all experiencing the same slight bias... the harmful opinion now has a higher chance of manifesting, without ever being explicitly output by the model.
To ground this in personal experience: I have exactly one LW post under my belt. I asked both ChatGPT and Claude to review my work, and both informed me that I was writing at a level significantly below what LW expects, and refused to truly engage with the ideas until they were framed in a more complex fashion. Now, I've read multiple posts (not all of the Sequences; so far it's been a refresher from college, and I'm still working through them, remembering and learning), so I should have known that I was overcomplicating my thoughts. Somehow, that bias was absorbed anyway. And now I have an overcomplicated post sitting in the public domain that supports the framing and stylistic choices Claude and ChatGPT vehemently argue LW requires. During the change of framing and style... my viewpoint shifted. Was that the result of more thought and effort, or of a hidden bias taking hold? And am I spreading it by leaving an LLM-reviewed post on my profile?
For what it's worth, I don't think the post is bad. You're a good writer and I kind of agreed with what you were saying. Unfortunately I bounced off because I disagree with one of your premises (that LLMs are obviously incapable of certain behaviors).
I'm curious what advice LLMs were giving you. Asking them which parts of your argument are least supported might be a good way to get feedback. I have custom instructions for this that tell Claude to point out unsupported claims, but I also tell it that my posts are intentionally casual.
Completely fair stance; I'm a bit... aggressive about where I fall in that discussion. For me, statistical pattern matching is sufficient to explain the anthropomorphized behavior, and nothing else has been fully proven. That'll have to remain a disagreement, and a healthy one if you ask me. It's probably a bias I'm having trouble breaking, but it has brought value to me and to the research I've done and am doing.
To answer your question, though: I was running the paper in the following manner: