LESSWRONG

VOLOKHOVYCH STANISLAV — LessWrong

VOLOKHOVYCH STANISLAV

1d

Need an expert audit: Did I find a latent-space bypass using completely benign context, or am I fooling myself?

I’ve been running an empirical study of how long stretches of completely benign text (zero jailbreak prompts, zero instructions) seem to drive an implicit shift in an LLM's latent-space trajectories. This essentially dilutes the system prompt and bypasses post-training alignment constraints, causing the model to output things (like harsh political critiques)...

Do Long Contexts Produce Measurable Internal-State Shifts in LLMs? Draft

I'm not an ML researcher. I'm someone who got pulled into one question and spent a few months poking at it alone, in a fairly amateur way. I want to describe what I noticed and ask for help, because I can't tell by myself where there's something real here and where I'm just...