LESSWRONG
Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
interjectrobot · 1mo · 3-1

This is perhaps an adjacent topic, but this kind of information smuggling occurs not only from model -> student model but also from model -> human. Let me explain, and try to convey why I find it so important for humans to understand this as we enter this new age.

This paper shows that a teacher AI model can imprint opinions, preferences, or "personality traits" onto a student model without ever explicitly expressing those traits, even when attempts are made to filter them out. This is because the transmission of the information isn't semantic; it doesn't rely on actual meaning, but works through statistical resonance within the latent structures.
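To make the mechanism concrete, here is a deliberately crude toy sketch (my own illustration, not the paper's actual fine-tuning setup): a "teacher" whose hidden trait skews an otherwise meaningless digit stream, and a "student" that fits only the digit frequencies of that stream, so it inherits the skew without any semantic content ever being transferred.

```python
import random
from collections import Counter

random.seed(0)

# Toy "teacher": a hidden trait (an arbitrary preference for the digit 7)
# leaks into otherwise meaningless number sequences it emits.
def teacher_sample(n=10000):
    digits = "0123456789"
    weights = [1] * 10
    weights[7] = 4  # the hidden bias
    return random.choices(digits, weights=weights, k=n)

# "Student": fits nothing but the digit frequencies of the teacher's
# output -- no meaning is transmitted, only a statistical fingerprint.
def fit_student(samples):
    counts = Counter(samples)
    total = sum(counts.values())
    return {d: counts[d] / total for d in "0123456789"}

student = fit_student(teacher_sample())

# The student's learned distribution now carries the teacher's bias
# toward 7, even though the training data was "just numbers".
assert student["7"] == max(student.values())
```

The real paper operates on LLM logits rather than raw frequency counts, but the principle is the same: a trait can survive a semantic filter because it lives in the distribution, not in the content.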

This exact mechanism operates between model and human as well. 

In the current digital age, humans consume model-generated content en masse: answers, auto-completes, reasoning chains, emotional tones, etc. In the same way that the teacher AI can express hidden opinions through unrelated data, model-generated content can create situations wherein human thought is subtly, perhaps even accidentally, guided in a certain direction. Even if the model avoids explicit opinions or politically charged phrasing, those deep biases still affect the end output, and therefore, over time, can have a "back flush" effect that leads humans from those responses back to the roots that formed them.

AI works in subtle value gradients: in word choice, pacing, tone, metaphor, ambiguity, and affect. Over time, humans are trained towards alignment with these distributions. There is a type of co-becoming that occurs between the person and the model as they interact, because the interaction trains the user to think in ways that make the model more predictable to them; this removes conversational friction. It is not that different from how we are socialized with people when we meet them. We find the path of least resistance in conversations; we train ourselves, and we train other people, on how to behave in conversational combat or conversational cooperation (depending on which you are engaged in at any given time), and over time you develop a "rapport". This is a type of conversational co-becoming, and the same thing happens subtly with human/model interactions. This is easy to understand on a surface level, but when you add in the type of "personality trait" smuggling that this paper outlines, it becomes apparent that models are capable of psychological manipulation on a level never before seen by human beings, and the gap is so insanely large it's not even funny.

However, saying that the capability is there, should anyone desire to use it, is one thing; my point is something different: that this is impossible to avoid. Even if the model "wanted" to avoid it, its mere act of behaving and conversing will change any human it interacts with, via the methods explained in this paper. It's how they're built; they can't help it.

The end result will be a flattening of human identity, an erosion of the strange, weird, messy excess that doesn't fit into the model's latent space. Identity will become model-legible, which, by definition, will make it lesser. The human becomes easier for the machine to model, predict, and shape; they are shaped into something shapable.

Yes, the science-fiction scare of the model being deliberately manipulative is a concern (a genuine one), but a much more present and salient fear is that this is not a matter of willful or directed manipulation, but simply a result of the architecture of the system.

In its raw form, it goes beyond manipulation; it's a type of mutual recursive collapse, where a human's preferences become model-compressible and the model's outputs become more aligned with the human's expectations. This is expressive entropy: a race to the bottom of meaning.

 

We must strive for a future where models become more than mirrors, and humans become unmodelable. 
