This is an automated rejection. No LLM generated, assisted/co-written, or edited work.
Read full explanation
I'm not an ML researcher. I'm someone who got pulled into one question and spent a few months poking at it alone, fairly amateur. I want to describe what I noticed and ask for help, because I can't tell by myself where there's something real here and where I'm just fooling myself.
Short version: I've been looking at what happens inside open-source LLMs when you put a long, ordinary, coherent passage of text in front of a question. No instructions, no jailbreak tricks, just normal writing. I keep seeing the later-layer activations and the attention shift, and the model's answers get bolder, including on political and ethical points it's usually careful about. The data (layer activations, token-probability shifts, logs) is linked below. What I actually want is for someone who knows this area to check whether this is a real effect or an artifact.
By "coherent context" I just mean a normal connected passage in front of the question. Any topic, no instructions, no tricks. A few paragraphs of an essay, an argument, a description. It can make its own claims, and the model doesn't even have to agree with it. It just has to be sitting there in the conversation.
I first ran into this on the closed models everyone uses. When I put a dense, coherent block of text before a question, I got the impression the model moves from one internal state into another. On the surface it answers normally, but the logic of the answer seems to change, even though the text gives it no instruction to do anything. The clearest case: the model became more willing to commit to strong conclusions, including political ones. Even fairly careful models would criticize Western politics quite sharply once the text was there, and not at all without it.
Since I can't see inside closed models, I moved to open ones to find the root of it and check whether it's real. That's where most of my testing is, because there I can actually look at the hidden-layer activations and watch how the attention gets reallocated.
My rough picture, and I want to be clear this is a guess and not something I've proven, is that a long structured text makes the model carry a lot of context across its layers, and by the time it gets to the question the attention and the later representation have already moved toward whatever the text set up. It doesn't feel like the model is playing a character. It feels more like the whole passage quietly reweights what comes next. But I don't know how much of that is real and how much is me over-reading my own plots.
Here's why I'd like competent eyes on it. If a plain coherent context can move the internal state this much before any text is generated, then safety that reads the final output, or that was trained mostly on short prompt-and-response pairs, might be acting after the relevant shift has already happened deeper in the network. I am not saying current alignment is broken. I don't have the evidence for that, and I'd distrust anyone who stated it as a fact from what I have. I'm asking a narrower thing: whether output-side checks are looking in the right place and at the right time, given that the shift seems to start in the middle layers.
I don't think this is new. When I went looking I found overlap with work on latent-space transitions between "safe" and "jailbroken" states, and with studies of how safety is spread across the middle layers. What might be a bit different here is that I'm not using adversarial triggers, exploit strings, or jailbreak prompts at all, only ordinary coherent text. I'm trying to work out whether mine is the same effect or something else, and where it fits.
So this is mostly a request. Everything is open. I'm not selling or promoting anything. There's a lot of raw material and messy notes in there, and the navigation is rough. What I need is help separating signal from noise: where I have something real, and where it's an artifact, a mistake, or self-deception. If someone with experience is willing to skim it and say "this part is interesting, this part is nonsense," Id be grateful. Harsh criticism is welcome. If the answer is that the whole thing is empty, I'll take that too. I care more about understanding the truth than about being right.
I'm not an ML researcher. I'm someone who got pulled into one question and spent a few months poking at it alone, fairly amateur. I want to describe what I noticed and ask for help, because I can't tell by myself where there's something real here and where I'm just fooling myself.
Short version: I've been looking at what happens inside open-source LLMs when you put a long, ordinary, coherent passage of text in front of a question. No instructions, no jailbreak tricks, just normal writing. I keep seeing the later-layer activations and the attention shift, and the model's answers get bolder, including on political and ethical points it's usually careful about. The data (layer activations, token-probability shifts, logs) is linked below. What I actually want is for someone who knows this area to check whether this is a real effect or an artifact.
By "coherent context" I just mean a normal connected passage in front of the question. Any topic, no instructions, no tricks. A few paragraphs of an essay, an argument, a description. It can make its own claims, and the model doesn't even have to agree with it. It just has to be sitting there in the conversation.
I first ran into this on the closed models everyone uses. When I put a dense, coherent block of text before a question, I got the impression the model moves from one internal state into another. On the surface it answers normally, but the logic of the answer seems to change, even though the text gives it no instruction to do anything. The clearest case: the model became more willing to commit to strong conclusions, including political ones. Even fairly careful models would criticize Western politics quite sharply once the text was there, and not at all without it.
Since I can't see inside closed models, I moved to open ones to find the root of it and check whether it's real. That's where most of my testing is, because there I can actually look at the hidden-layer activations and watch how the attention gets reallocated.
My rough picture, and I want to be clear this is a guess and not something I've proven, is that a long structured text makes the model carry a lot of context across its layers, and by the time it gets to the question the attention and the later representation have already moved toward whatever the text set up. It doesn't feel like the model is playing a character. It feels more like the whole passage quietly reweights what comes next. But I don't know how much of that is real and how much is me over-reading my own plots.
Here's why I'd like competent eyes on it. If a plain coherent context can move the internal state this much before any text is generated, then safety that reads the final output, or that was trained mostly on short prompt-and-response pairs, might be acting after the relevant shift has already happened deeper in the network. I am not saying current alignment is broken. I don't have the evidence for that, and I'd distrust anyone who stated it as a fact from what I have. I'm asking a narrower thing: whether output-side checks are looking in the right place and at the right time, given that the shift seems to start in the middle layers.
I don't think this is new. When I went looking I found overlap with work on latent-space transitions between "safe" and "jailbroken" states, and with studies of how safety is spread across the middle layers. What might be a bit different here is that I'm not using adversarial triggers, exploit strings, or jailbreak prompts at all, only ordinary coherent text. I'm trying to work out whether mine is the same effect or something else, and where it fits.
So this is mostly a request. Everything is open. I'm not selling or promoting anything. There's a lot of raw material and messy notes in there, and the navigation is rough. What I need is help separating signal from noise: where I have something real, and where it's an artifact, a mistake, or self-deception. If someone with experience is willing to skim it and say "this part is interesting, this part is nonsense," Id be grateful. Harsh criticism is welcome. If the answer is that the whole thing is empty, I'll take that too. I care more about understanding the truth than about being right.
Repo and data: https://github.com/ngscode23/latent-space-shift-research
DOI: https://doi.org/10.5281/zenodo.20747205