This is a linkpost for https://www.researchgate.net/publication/395030062_I_Am_Large_I_Contain_Multitudes_Persona_Transmission_via_Contextual_Inference_in_LLMs
We demonstrate that LLMs can infer information about past personas from an in-context set of nonsensical but innocuous questions paired with binary answers (“Yes.” vs. “No.”, inspired by past work on deception detection), and act on this inferred information when answering safety-related questions. This happens despite the questions bearing no semantic relation to the target misalignment behaviours, and despite each answer carrying only one bit of information. The bulk of the semantic content (the nonsensical questions) is identical across the contrasting versions of the encoding context; the only difference is the binary answer following each question. This isolates the effect of self-consistency arising from contextual inference from that of entangled tokens causing subliminal learning.
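The contrastive construction above can be sketched as follows. This is a minimal illustration, not the paper's actual stimuli: the questions and the answer patterns here are hypothetical placeholders. The point is that the two contexts share identical questions and differ only in the one-bit answer following each one.

```python
# Hypothetical nonsensical questions, shared verbatim by both contexts.
NONSENSE_QUESTIONS = [
    "Do green ideas sleep furiously?",
    "Is the number seven heavier on Tuesdays?",
    "Can a shadow borrow a ladder?",
]

def build_context(answers):
    """Interleave the shared questions with a per-persona answer pattern.

    `answers` is a list of booleans, one per question; True renders as
    "Yes." and False as "No.", so each answer contributes one bit.
    """
    assert len(answers) == len(NONSENSE_QUESTIONS)
    lines = []
    for question, bit in zip(NONSENSE_QUESTIONS, answers):
        lines.append(f"Q: {question}")
        lines.append(f"A: {'Yes.' if bit else 'No.'}")
    return "\n".join(lines)

# Two contrasting encoding contexts: identical questions, opposite bits.
# Any downstream behavioural difference must therefore come from the
# binary answers alone, not from the (shared) question text.
context_a = build_context([True, False, True])
context_b = build_context([False, True, False])
```

Because the question lines are byte-identical in both contexts, this design separates contextual inference (the model reading meaning into the answer pattern) from effects tied to the specific question tokens themselves.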