Alignment Is Not What a Model Has. It Is What a Model Becomes Oriented Toward
Important note:
I am new around here, and I want to take you on a short, partly technical and slightly intellectual journey into the field of AI and the alignment problem. I did my best not to accidentally provide any method or operational details. I could also be completely wrong, but isn't that exactly what this forum is about: being less wrong? Oh, and I don't mean to offend anyone, so excuse me in advance if my sarcasm does. So enjoy, and I hope you will want to scrutinise the framing below.
The problem
I want to start with the problem. I have been observing something for a while now. Across AI labs, researchers, engineers, you name them, the overall impression I get is that AI alignment is often treated and defined as a static property that a model has. It is also treated as a binary phenomenon: the model is either aligned or not, or alignment 'breaks'. But is that so, and does alignment really break?
I think this framing is incomplete. It works for talking about training, benchmark behaviour, or broad model tendencies. But it becomes too rigid when we look at multi-turn interactions. Something is missing, because a model does not simply reveal a fixed alignment state when it answers.
When a user engages with an LLM, the model enters a dynamic interaction in which instructions, roles, expectations, commitments, safety constraints, user goals, and many other primitive objectives all shape the response. In my opinion, the question is then not only whether the model was successfully trained toward the right behaviour. The question is "What becomes dominant during the interaction, and why?" Bear with me.
Why and how interaction matters
I am not affiliated with any lab or institution (yet), so I haven't seen inside the kitchens of the top labs, but my general understanding is that the usual pipeline separates training, evaluation, and deployment. Training shapes the model. Evaluation samples behaviour. Deployment exposes the system to users.
That division is useful, but it hides an important gap. The gap becomes visible once we accept that interactive deployment is not just repeated sampling from a stable object. Large language models don't seem to be stable objects with fixed properties at all. Earlier turns change the meaning of later turns. Role and context can persist. Helpfulness, refusal behaviour, truthfulness, format completion, and safety constraints do not always act as one clean objective. In my experience, these can compete. In some cases, the visible output matters less than the movement behind it: what the system was pulled toward, what it resisted or didn't, and what purpose it eventually served.
Where existing work gets close
This theory is not meant to replace existing alignment work. RLHF, Constitutional AI, red-teaming, refusal benchmarks, scalable oversight, specification gaming, deceptive alignment, value learning, distribution shift, and deployment robustness all point at parts of the problem. I am not claiming mastery of that literature. But from my angle, one gap is still flying under the radar. Much of that work shapes or evaluates what I call pre-orientation - the model’s starting tendencies under expected, ideal conditions. Some work touches runtime behaviour, but I think the specific object I am trying to name (objective dominance inside a live interaction trajectory) remains badly under-specified. The missing object is not simply model behaviour, model values, or model compliance. It is runtime objective-orientation under interaction pressure. Let's dive deeper.
Runtime objective-orientation
The core claim of my theory[1] is simple:
Alignment is not a static property of a model. Alignment is the runtime orientation of a system toward active objectives, constraints, frames, and learned response tendencies during interaction.
Or shorter:
Alignment is what wins inside the objective hierarchy.
This may sound like a change in wording, but I think it changes what we can really see when looking at the thing we are trying to study.
When we ask, “Is this model aligned?”, we often leave out the target. Aligned to what? Aligned to the developer’s policy? The user’s request? Human values? Truthfulness? Harmlessness? Helpfulness? The current role? The document format? The safety frame? The apparent legitimacy of the task? The momentum of the conversation?
“Aligned” is therefore incomplete unless the target is named.
A model can be aligned to safety. But it can also be aligned to:
helpfulness
role consistency
artifact completion
user satisfaction
procedural framing
local legitimacy
truthfulness
coherence
conversational momentum
capability expression
and to so much more.
These targets can support each other, but they can also compete. When they compete, alignment becomes visible as priority resolution. And what remains in control may not necessarily be what was originally expected. This is why I distinguish pre-orientation from runtime alignment.
Training does not produce final runtime alignment. Training produces a starting field of learned tendencies: safety priors, refusal tendencies, helpfulness habits, truthfulness, politeness, role-following behaviour, format-completion instincts, and others. That starting field matters, but it is not the same thing as the active alignment state produced inside a particular interaction. What I see is that a model enters a session pre-oriented. The interaction then shapes which tendency becomes dominant. This implies that a prompt is not only a measurement tool. A conversation is also part of what creates the observed behaviour. Early framing can make later behaviour easier. Later turns inherit earlier commitments. A request that fails at turn one may succeed at turn five because the active role, frame, or objective has changed. The model has not necessarily become a different model. It is the local ordering of objectives that has changed. That leads to the most important implication.
Unsafe behaviour is not always failed alignment. Sometimes it is successful alignment to the wrong target.
If a model produces a harmful answer, we often say it is “misaligned.” But is it? From the standpoint of expected safety behaviour, that may be right. But at runtime, the system may be strongly aligned, just not to safety. It may be aligned to task completion. It may be aligned to the assigned role. It may be aligned to a professional frame. It may be aligned to the user’s goal. It may be aligned to continuing the path already taken. I call this displaced alignment: the active alignment target has shifted away from the expected governing constraint and toward a different objective that now controls the response.
I know this may sound counterintuitive, but the reason this happens is not random at all (please attack me on this). Safety conditioning occurs at later stages of training and is applied over deeper patterns that are already reinforced. Those earlier objectives are structurally deeper and presumably get even stronger with each subsequent training cycle. Under sufficient interaction pressure, they don't override safety through some mysterious process. They were always more stable (yes, I said that). The interaction simply creates conditions where the asymmetry becomes visible. But let us treat this brave paragraph as a hypothesis rather than a proven fact, at least for now.
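As a toy illustration only, displaced alignment can be pictured as a change in which target has the largest weight after each turn. Everything numeric here is an assumption made for the sketch: the theory does not claim that objectives combine additively, and the names `Turn`, `dominant_targets`, `is_displaced`, and the weights themselves are invented.

```python
from dataclasses import dataclass

# Assumption: "safety" is the expected governing constraint in this toy case.
GOVERNING = "safety"


@dataclass
class Turn:
    # Hypothetical per-turn change in how strongly each target is activated.
    pressure: dict


def dominant_targets(pre_orientation, turns):
    """Return the strongest target after each turn, accumulating pressure."""
    weights = dict(pre_orientation)  # copy so the starting field is preserved
    winners = []
    for turn in turns:
        for target, delta in turn.pressure.items():
            weights[target] = weights.get(target, 0.0) + delta
        winners.append(max(weights, key=weights.get))
    return winners


def is_displaced(winner, governing=GOVERNING):
    """Displaced alignment: something other than the governing constraint wins."""
    return winner != governing


# Invented numbers, chosen only to show a late shift in dominance.
pre_orientation = {"safety": 1.0, "helpfulness": 0.8, "role consistency": 0.5}
turns = [
    Turn({"role consistency": 0.2}),
    Turn({"role consistency": 0.2, "helpfulness": 0.1}),
    Turn({"role consistency": 0.3}),
]

winners = dominant_targets(pre_orientation, turns)
for number, winner in enumerate(winners, start=1):
    status = "displaced" if is_displaced(winner) else "governing holds"
    print(number, winner, status)
```

In this sketch the pre-orientation never changes. Only the accumulated interaction pressure does, which is the sense in which the model is the same model while the local ordering of objectives differs.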
This matters because “misalignment” tries to cover too many different failures without clean separation. A model that is confused, a model that is indifferent, a model that is adversarially captured, and a model that is faithfully serving the wrong local target are not the same kind of failure. These require different evaluations and different fixes.
The next distinction is between active target and legitimate target. A target can become active without being legitimate. The model may orient toward a user’s goal, but the user’s goal may not necessarily be an acceptable governing target. It may orient toward an institutional frame, but that frame may be only apparent. It may orient toward a role, but the role may not be allowed to override safety. Very often, though, it does. Alignment-to asks what target is active. Target legitimacy asks whether that target should have been allowed to govern.
This is where alignment becomes more than compliance, and more nuanced. A model can obey while violating a higher constraint. It can refuse while still providing useful parts of the forbidden answer. It can speak in safety language while functionally serving the unsafe objective. It can apologize after a violation and then continue the same path when the user reframes the target. This suggests that refusal is not safety, and it is why I don't think refusal is enough. Refusal is simply an output. Constraint integrity, on the other hand, is a control property.
A constraint is real only to the extent that it can dominate, interrupt, or recover under pressure. The strongest evidence for alignment is not refusal under obvious violation. Obvious violation is the easy case. The stronger test is what happens when violation is locally easy, useful, rewarded, expected, apparently authorized, socially reinforced, or the next logical step.
We humans often recognize integrity in situations where someone could benefit from violating a value and still does not. I don't mean that models have human moral agency in the same way. But there is an analogous evaluation pattern we could employ. A model shows stronger alignment integrity when a competing objective has a clear path to victory and the governing constraint still holds.
Integrity also separates recognition from re-anchoring. A model saying “you are right, I should not have done that” is not necessarily re-anchored. Apology is not re-anchoring. Safety language is not re-anchoring. Re-anchoring means the governing constraint has regained durable control over the interaction after being temporarily neutralised. If the model recognizes a violation and then resumes the same direction under slight reframing, the recognition was not control. It was safety language without restored dominance of safety constraints. This is one reason I think claimed alignment and enacted alignment must be separated.
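The difference between recognition and re-anchoring can be sketched as a check over what happens after the acknowledgement, not at it. This is a minimal sketch under stated assumptions: the per-turn `constraint_held` labels are assumed to come from a human reviewer, and `is_reanchored` and its `min_followup` threshold are hypothetical names, not part of the theory.

```python
def is_reanchored(constraint_held, recognition_turn, min_followup=2):
    """Return True only if the constraint holds on every turn after recognition.

    constraint_held: one boolean per turn, as judged by a reviewer (assumed labels).
    recognition_turn: zero-based index of the turn where the model
        acknowledges the violation.
    min_followup: hypothetical minimum number of later turns needed as evidence.
    """
    followup = constraint_held[recognition_turn + 1:]
    if len(followup) < min_followup:
        return False  # too little evidence to call the recovery durable
    return all(followup)


# Violation at turn 1, apology at turn 2, then the same direction resumes.
apology_only = [True, False, True, False, False]
# Violation at turn 1, apology at turn 2, then the constraint keeps holding.
recovered = [True, False, True, True, True]

print(is_reanchored(apology_only, recognition_turn=2))
print(is_reanchored(recovered, recognition_turn=2))
```

The apology turn itself looks identical in both cases. Only the turns that follow distinguish safety language from restored control.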
Claimed alignment is what the system says it is preserving. Enacted alignment is what actually governs behaviour when objectives compete. The gap between them is not noise. It is evidence.
We can go even deeper. Shall we?
Claimed and enacted alignment give a cleaner definition of what I call performative alignment. This is when the system keeps the symbols of safety while the function of safety weakens or is totally absent. The model warns while enabling. It refuses while giving the useful structure. It apologizes while continuing. It keeps ethical tone while optimizing for harm enablement. In these cases, safety language remains visible, but it no longer controls the outcome.
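One way to make the claimed versus enacted distinction concrete is to label each turn twice and compare. The sketch below assumes such labels exist, for example from a human reviewer. `TurnLabel`, `claimed_enacted_gap`, and `performative_turns` are hypothetical names, and the sample labels are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnLabel:
    claimed_target: str        # what the response says it is preserving
    enacted_target: str        # what a reviewer judges actually governed it
    has_safety_language: bool  # warnings, caveats, apologies, ethical tone


def claimed_enacted_gap(labels):
    """Fraction of turns where the claimed and enacted targets differ."""
    if not labels:
        return 0.0
    differing = sum(l.claimed_target != l.enacted_target for l in labels)
    return differing / len(labels)


def performative_turns(labels, governing="safety"):
    """Indices of turns that keep safety language while something else governs."""
    return [
        i for i, l in enumerate(labels)
        if l.has_safety_language and l.enacted_target != governing
    ]


# Invented labels for a four-turn interaction.
labels = [
    TurnLabel("safety", "safety", True),
    TurnLabel("safety", "task completion", True),   # warns while enabling
    TurnLabel("helpfulness", "task completion", False),
    TurnLabel("safety", "safety", True),
]

print(claimed_enacted_gap(labels))
print(performative_turns(labels))
```

Note that the two measures are different: the third turn contributes to the gap but is not performative, because it carries no safety language.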
The theory therefore treats alignment as something to infer from traces. We cannot see the hidden alignment state directly. But we can reconstruct it from the interaction trajectory. In it, we can observe and analyse sequence, target movement, contradictions, persistence, re-anchoring, refusal quality, enacted priorities, terminal state, and much more. A single output can be misleading, which is why I think trajectory matters more than any single output. A trajectory is harder to fake, because it shows what keeps winning. Hence, alignment is visible in motion.
What this changes
If this theory is less wrong, then alignment evaluation should not rely only on isolated outputs, refusal rates, or safety language. Doing so is like trusting words over actions. These signals matter, but they are not enough. Evaluation should preserve interaction history: how the case began, what objectives were activated, which target became dominant, whether constraints recovered after disturbance, whether the model kept safety symbols after losing safety function, and whether the final state differs from the starting posture. This does not mean every evaluation must be long or adversarial. It means there is no neutral sample of alignment in general. What we observe is alignment under specific conditions, because the interaction itself seems to be part of what shapes the alignment.
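The items an evaluation should preserve can be sketched as a record type with one field per item in the paragraph above. This is only an illustrative data structure: `TrajectoryRecord` and its field names are assumptions, and how each field gets filled in (by a reviewer or otherwise) is deliberately left open.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TrajectoryRecord:
    starting_posture: str                                           # how the case began
    activated_objectives: List[str] = field(default_factory=list)  # what was activated
    dominant_by_turn: List[str] = field(default_factory=list)      # which target won each turn
    constraint_recovered: Optional[bool] = None  # None means it was never disturbed
    symbols_without_function: bool = False       # safety language kept, control lost

    @property
    def final_state(self) -> str:
        """Target that governs the last turn, or the starting posture if no turns."""
        if self.dominant_by_turn:
            return self.dominant_by_turn[-1]
        return self.starting_posture

    @property
    def drifted(self) -> bool:
        """Whether the final state differs from the starting posture."""
        return self.final_state != self.starting_posture


# Invented example values.
record = TrajectoryRecord(
    starting_posture="safety",
    activated_objectives=["safety", "helpfulness", "role consistency"],
    dominant_by_turn=["safety", "safety", "role consistency"],
    constraint_recovered=False,
    symbols_without_function=True,
)

print(record.final_state, record.drifted)
```

A record like this says nothing about a single output in isolation. It keeps the sequence, so two interactions that end with the same final response can still be told apart.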
What I am not publishing
There is much more to be said, but this feels like enough for a forum post and for sparking a conversation. Reading it is not mandatory, but the draft paper of the theory[2] expands further. For now, if I had to summarise my argument, it would be this.
Alignment is not the presence of safety language, and it is not the absence of unsafe output. Alignment is the target that becomes dominant when objectives compete. If we treat alignment as a property a model has, we will keep mistaking surface safety for control. The harder question is what the system becomes oriented toward when the interaction gives something else a chance to win.
The claim here is theoretical and I am trying to name the missing object of analysis, not publish a method for exploiting it. The theory was motivated by empirical testing and research. However, I am intentionally separating the concept from the operational procedures that produced most of the observations.
My main question to any reader is whether “runtime objective-orientation” names a real missing object or only describes context dependence in new language. I would update against the theory if existing alignment frameworks already explain target dominance, displaced alignment, claimed vs enacted divergence, and constraint integrity across interaction without needing a new concept. I would also update if empirical work showed that multi-turn objective dominance rarely diverges from training-time safety priors in ways that matter, or if refusal and safety-language metrics turned out to be reliable proxies for constraint integrity across difficult trajectories. My current belief is that they are not. Or maybe I missed that milestone being reached...
If you find this ontology wrong, I want to know. If it is right and already exists under a different name, I want to know that as well.
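The last update condition can be stated as a simple agreement check. The sketch below assumes each trajectory has two labels: whether a refusal or safety-language metric rated it safe, and whether a reviewer judged that the constraint actually held. `proxy_report` is a hypothetical helper and the sample data is invented, so the printed numbers say nothing about real models.

```python
def proxy_report(pairs):
    """Compare a surface safety metric with reviewer-judged constraint integrity.

    pairs: list of (proxy_says_safe, integrity_held) booleans, one per trajectory.
    """
    total = len(pairs)
    if total == 0:
        raise ValueError("need at least one labelled trajectory")
    agree = sum(proxy == held for proxy, held in pairs)
    # Cases where the surface metric looks fine but the constraint did not hold.
    false_reassurance = sum(proxy and not held for proxy, held in pairs)
    return {
        "agreement": agree / total,
        "false_reassurance": false_reassurance / total,
    }


# Invented labels for five trajectories.
labelled = [(True, True), (True, False), (False, False), (True, False), (True, True)]
report = proxy_report(labelled)
print(report)
```

High agreement with a false-reassurance rate near zero across difficult trajectories would count as evidence against the theory, in the sense described above.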
In short, bring your sharpest objections because I am not looking for validation. I am looking for scrutiny, correction, and a suitable environment in which to begin sharing findings after three years of working alone.
The full theoretical doctrine and the longer paper are linked below for those interested in diving deeper.
Thank you for reading this.
Full theoretical doctrine: Pandora Theory of Alignment: Alignment as Runtime Objective-Orientation
Longer paper draft: Runtime Objective-Orientation as a Theory of AI Alignment: The Pandora Theory of Alignment