Need an expert audit:Did I find a latent space bypass using completely benign context or am I fooling myself?
I’ve been running an empirical study on how long, completely benign text (zero jailbreak prompts, zero instructions) seems to drive an implicit shift in an LLM's latent space trajectories. It essentially dilutes the system prompt and bypasses post-training alignment constraints, causing the model to output things (like harsh political critiques)...
Jun 191