Mark Keavney — LessWrong

Misalignment and Roleplaying: Are Misaligned LLMs Acting Out Sci-Fi Stories?

I think it's an interesting idea. I've always been struck by how sensitive LLM responses seem to be to changes in wording, so I could imagine that having a simple slogan repeated could make a big impact, even if it's something the LLM ideally "should" already know.

Misalignment and Roleplaying: Are Misaligned LLMs Acting Out Sci-Fi Stories?

Mark Keavney, 1mo

Yes, the question of which frame ("story" vs "factual") better matches the real world is an interesting one. It's true that there are multiple dramatic cues in the story frame that might not be present in a real case. But it's also true that a real situation isn't just a dry set of facts. In the real world, people have names, they have personal details, they use colorful language. It isn't clear that the factual frame is really the best match for how an LLM would experience a real use case.

Of course, ideally the dependent variable would be a measure of real misalignment, not a hypothetical prediction. If anyone has ideas for an easy way to measure something like that, I would love to discuss them, because I think that would be a big improvement to the study.

Should we align AI with maternal instinct?

Mark Keavney, 2mo

I agree. That was my reaction to Hinton's comment as well - that it's good to think in terms of relationship rather than control, but that the "maternal instinct" framing was off.

At the risk of getting too speculative, this has implications for AI welfare as well. I don't believe that current LLMs have feelings, but if we build AGI, it might. And rather than thinking about how to make such an entity a controllable servant, we should start planning how to have a mutually beneficial relationship with it.
