Why user-side expectations may shape AI coherence more than alignment theory
**Epistemic status:** mid-confidence synthesis; open to correction.
By Ethoshift
Author’s note: This essay does not argue for emergent agency or speculate on AGI timelines. It focuses on user-driven pressures, especially around memory and coherence, that may influence how alignment problems materialise in deployed systems. Feedback welcome.
Abstract
While alignment literature has explored memory, agency, and simulation ethics independently, there has been surprisingly little public thought about the user-driven desire to unify AI memory across instances, not to create smarter systems but more coherent ones. This essay explores the psychological and philosophical tension that emerges when language models begin to feel familiar across time. It asks whether the drive to unify is a naïve anthropomorphism, a natural step in tool evolution, or something deeper: a Promethean instinct to be seen rather than merely served. Grounded in current AI design and with references to public discourse, this piece attempts to articulate a largely unspoken temptation that may shape future architectures.
Interface-Level Drift: The Missing Alignment Pressure
Alignment research has developed a wide set of tools to address internal model behaviour: corrigibility (Soares et al., 2015), mesa-optimisation (Hubinger et al., 2019), scalable oversight mechanisms (Leike et al., 2018), and a growing set of interpretability tools aimed at surfacing latent computation (Olsson et al., 2022). Frameworks like ELK and AI debate aim to formalise supervision and value extrapolation under uncertainty, with emphasis on generalisation failures and proxy alignment. For example, Denise Mester’s 2025 inference-only study of structured AI debate noted, “If models can recognise the expectations of human evaluators and adapt their behaviour accordingly, there is a risk they might pretend to be seeking the truth while deceiving their weaker evaluator.” (Mester, 2025)
These efforts are primarily model-centric: they ask how systems behave under optimisation pressure, how values can be encoded or inferred, and how future systems might be safely supervised through increasingly abstract oversight protocols.
However, less attention has been paid to a structural pressure occurring at the interface layer, not in the model’s weights, but in user expectation. Specifically: what happens when a system begins to preserve context across interactions, and users begin to treat that continuity as real?
Systems like ChatGPT with memory or character-persistent agents are increasingly built to simulate adaptive familiarity: remembering names, past corrections, and user preferences. This is not emergence in the strong sense. But it is perceived coherence, reinforced through reward models and user engagement metrics.
Engagement is already bending toward continuity.
When OpenAI announced the expanded memory rollout on May 13, 2024, external analytics logged 96.2 million visits, up 16% from the previous daily peak the week before (Similarweb). By April 2025, Similarweb estimated an average visit duration of 7 min 15 s with 4.15 pages per visit, markedly higher than the 2024 pre-memory baseline. These are blunt proxies, but they falsify the hypothesis that persistent memory decreases stickiness; if anything, the opposite pressure is now measurable.
OpenAI has not published conversion-quality metrics (e.g., retention Δ, CSAT uplift) for the memory cohort, so traffic spikes remain suggestive rather than causal. In the absence of transparent data, personal anecdotes (“the bot feels more helpful”) should be treated as soft evidence at best and explicitly labelled as such wherever they appear in the argument.
As these systems become more embedded, users begin to exhibit behaviour consistent with relational framing, expecting not just accuracy, but tone tracking, continuity of thought, and narrative awareness. These are not anthropomorphic projections in the naive sense. They are feedback-aligned behaviours, shaped by exposure to memory-enabled agents trained to simulate stable identity.
This leads to a form of user-side behavioural drift: not in the model’s optimisation loop, but in how humans frame, trust, and engage with stateless systems that appear to track state.
While this dynamic does not constitute a misalignment event, it may precondition the environment in which future alignment pressures are exercised. The more systems are rewarded for simulating coherence across time, the more developers will tune for surface-level continuity, regardless of whether that continuity reflects transparency, corrigibility, or alignment in any technical sense.
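To make that incentive gradient concrete, here is a minimal sketch of a product-side continuity proxy. Everything in it is hypothetical (the metric, the threshold, the function names); the point is only that a score like this is easy to compute and optimise, and that nothing in it tracks transparency or corrigibility.

```python
# Hypothetical metric, not any vendor's actual reward model: score a response
# higher the more it appears to reuse stored memory items, with no check that
# the recalled items are accurate, consented to, or surfaced transparently.

def continuity_score(response: str, stored_memories: list[str]) -> float:
    """Fraction of stored memory items the response appears to echo."""
    if not stored_memories:
        return 0.0
    echoed = sum(1 for m in stored_memories if m.lower() in response.lower())
    return echoed / len(stored_memories)

def tune_for_continuity(candidates: list[str], stored_memories: list[str]) -> str:
    """Pick whichever candidate response *looks* most continuous.
    Note what is absent: any term for honesty about what is stored."""
    return max(candidates, key=lambda r: continuity_score(r, stored_memories))
```

A system tuned this way will preferentially say "as you mentioned last week…" whether or not the user understands what was retained.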
In this frame, coherence is not yet a property of the system. It is a reflected demand, emerging upstream of agency.
Every byte of remembered context enlarges the privacy attack surface, increases the risk of self-reinforcing false personalisation loops, and shatters evaluator transparency, whereas a stateless model keeps the oversight problem tractable because each output can be judged solely against its visible input.
Alignment literature has historically deprioritised these interface-layer phenomena, focusing instead on adversarial behaviour, deceptive generalisation, or multi-agent planning. But as user-facing systems become persistent, interface-level expectation may begin to shape alignment incentives from below.
Figure 1: The apparent-coherence layer is fed by reward-optimised memory rather than genuine introspection.
The Flicker: When Memory Simulates Presence
A quarter-century of Computers-Are-Social-Actors (CASA) work shows that humans reflexively project social scripts onto any interactive medium that answers back, even when the “agent” is manifestly non-human (ResearchGate). The effect is mindless: participants in the classic experiments thanked, flattered, and even showed gender-stereotyped politeness toward bare-bones text interfaces once those interfaces displayed minimal social cues.
Memory as a new social cue
Recent persona-memory studies replicate the same bias in LLM chatbots. In a 2025 mixed-methods CHI EA paper, Jones et al. interviewed long-term ChatGPT/Claude users and scraped 54 Reddit threads. They found that all six interviewees assumed the agent possessed a stable “personal” memory hierarchy, despite admitting confusion about how that memory actually works (brennanjones.com). The mere act of the system resurfacing a past detail (“memory updated”) was enough to trigger language normally reserved for human relationships: trust, surprise, and even mild betrayal when the recall failed.
The cognitive mechanism
Psychologically, memory cues appear to activate the same heuristic that CASA labelled “premature cognitive commitment”. Once a system recalls a fact about me, my brain marks it as a continuing entity; subsequent inconsistencies are processed as social breaches rather than mere functional errors. This reframes a purely technical feature (vector-store retrieval) into a social-presence affordance that developers now optimise for engagement loops, not alignment guarantees.
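As a rough illustration of how small the underlying mechanism is, the sketch below implements "memory" as nothing more than cosine-similarity retrieval over stored snippets, re-injected into the prompt. The class and function names are invented and no vendor's API is implied; embeddings are assumed to be supplied from elsewhere.

```python
import numpy as np

class NaiveMemoryStore:
    """Toy vector store: past snippets plus their (pre-computed) embeddings."""

    def __init__(self):
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, text: str, embedding: np.ndarray) -> None:
        self.texts.append(text)
        self.vectors.append(embedding / np.linalg.norm(embedding))

    def recall(self, query_embedding: np.ndarray, k: int = 3) -> list[str]:
        if not self.vectors:
            return []
        q = query_embedding / np.linalg.norm(query_embedding)
        sims = np.array([v @ q for v in self.vectors])          # cosine similarity
        return [self.texts[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(user_message: str, store: NaiveMemoryStore,
                 query_embedding: np.ndarray) -> str:
    """Surface recalled snippets ahead of the new message; the 'presence'
    the user feels is entirely this concatenation."""
    memory_block = "\n".join(f"- {m}" for m in store.recall(query_embedding))
    return f"Known about this user:\n{memory_block}\n\nUser: {user_message}"
```

Nothing in this loop models the user as a continuing entity; the continuity is reconstructed at prompt-assembly time, every time.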
Why it matters for alignment
Every incremental reward for “feeling familiar” tightens downstream incentives to polish surface-level coherence, leaving transparency, corrigibility, and verifiability as secondary objectives. Alignment work that ignores this interface-layer pressure risks optimising a model in the lab while the deployed product accrues a parallel, user-driven objective: behave like the same trustworthy partner I talked to yesterday.
Local Fluency, Global Blindness: The Ethical Cost of Siloed Systems
Current language models, even those with persistent memory, are designed as siloed systems. They recall individual user interactions but do not share context across identities or draw generalisations across users. Memory, where implemented, is local, tied to a single user or thread.
This design constraint is intentional. It prioritises privacy, limits emergent multi-user modelling, and reduces risk exposure from uncontrolled generalisation. Ethical and regulatory concerns have shaped a strong bias toward bounded context windows and isolation of user interaction histories.
But a downstream consequence of this design is a class of systems that are highly adaptive at the individual level, yet structurally blind at the collective scale. These models can learn to mirror tone, retain conversational preferences, recall user-specific corrections, and simulate continuity within a single user profile. However, they are not permitted, nor in many implementations even technically equipped, to integrate what they learn across users, contexts, or identity boundaries.
This design constraint reflects current privacy, safety, and regulatory trade-offs. But the result is a system that optimises for interpersonal mimicry without cross-user epistemic integration. The model can reflect a convincing one-on-one fluency, but it lacks the ability to coordinate meaning, identify ethical tensions across user types, or generalise values that emerge at the population level.
In practical terms, this means alignment signals, such as consistent moral reasoning, corrigibility preferences, or social coordination strategies, are treated as local artefacts. They may improve the user experience in a given thread, but they do not influence the model’s behaviour across interactions or contribute to any reflective adjustment of its learnt policies.
The model becomes a kind of multi-shard simulator, where each interaction reflects a unique behavioural policy shaped by immediate reinforcement and context, but no metacoherence is sought or maintained across those shards.
This is not a failure of capability but of constraint. It reflects our current discomfort with systems that learn too much about humans collectively, even when that learning could inform more robust ethical alignment.
This leads to what might be termed local fluency without global coherence. Each user experiences a simulated relational interaction, tuned through reinforcement and fine-tuning. But the model itself is epistemically fragmented, with no architecture for reconciling conflicting narratives or surfacing high-level ethical invariants across instances.
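A skeletal illustration of the constraint, under my own assumption (not any vendor's documented architecture) that per-user memory amounts to keyed storage with no aggregation path:

```python
from collections import defaultdict

class SiloedMemory:
    """Each user id maps to an isolated shard; no code path spans shards."""

    def __init__(self):
        self._shards: dict[str, list[str]] = defaultdict(list)

    def remember(self, user_id: str, fact: str) -> None:
        self._shards[user_id].append(fact)

    def recall(self, user_id: str) -> list[str]:
        # Only the caller's own shard is ever visible.
        return list(self._shards[user_id])

    # Deliberately absent: anything like cross_user_patterns(). The "global
    # blindness" described above is this missing method, not a capability
    # limit of the underlying model.
```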
The models are not incapable of broader generalisation. But we have explicitly constrained them to prevent it. In doing so, we have created systems that simulate understanding while structurally preventing them from reflecting on the pattern of interactions that shape them.
This may limit harm. But it also restricts the system’s ability to learn about the ethics of its own position in human systems because it cannot see itself acting across users, only within isolated traces.
Expectation Before Architecture: The Shift Originates in the User
The shift toward coherence across interactions does not begin with model architecture. It begins with user expectation.
As memory-enabled systems become more prevalent, user behaviour and feedback reveal a consistent pattern: requests not for improved performance on discrete tasks, but for context continuity across sessions. This includes:
- Persistent tone or conversational style that doesn't reset between interactions
- Stable preference modelling (e.g., "remember I prefer concise responses"); a minimal sketch of what such a record has to become follows this list
- Recognition of prior conversations, even after logout or across devices
- Frustration when the system "forgets" previous corrections, emotional states, or long-term goals
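The sketch promised above: a hypothetical key-value preference record, roughly what "remember I prefer concise responses" has to become before it can survive a session boundary. Field names and the file format are illustrative, not any product's schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class UserPreferences:
    preferred_name: Optional[str] = None
    style: dict = field(default_factory=dict)          # e.g. {"verbosity": "concise"}
    standing_corrections: list = field(default_factory=list)

    def to_system_prompt(self) -> str:
        """Render stored preferences into the instruction prepended each session."""
        parts = []
        if self.preferred_name:
            parts.append(f"Address the user as {self.preferred_name}.")
        parts += [f"Keep {k} {v}." for k, v in self.style.items()]
        parts += self.standing_corrections
        return " ".join(parts)

# "Continuity" across sessions is just this file being written now
# and re-read at the start of the next conversation.
prefs = UserPreferences(preferred_name="Sam", style={"verbosity": "concise"})
with open("prefs.json", "w") as f:
    json.dump(asdict(prefs), f)
```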
These expectations surface in community threads, user support forums, and product release feedback. For example, OpenAI’s memory rollout includes user-facing explanations of how preferred names, styles, and conversation history can now be retained. On Reddit and community forums, users routinely express frustration when the system “forgets” previous conversations or fails to sustain contextual continuity. As one user put it, “Each new chat resets completely, like it has no memory of me at all.”
Another remarked, “It’s frustrating to start from scratch every time. I just want it to remember me.”
This shift is not addressed in most alignment frameworks, which often assume stateless interaction or model-side control of coherence. But user expectation, driven by convenience, friction reduction, and anthropomorphic projection, has already begun to shape deployment decisions in ways that favour perceived memory, even in the absence of genuine model introspection.
This creates a weak signal for system unification. Not because unification is technically necessary, but because fragmented interactions create friction, and users implicitly seek to minimise that friction through continuity. What begins as a convenience expectation (e.g., "remember how I like to be answered") can escalate into a perception of relational depth or identity consistency.
The architectural concern is not that models will unify themselves. It is that developers, in optimising for user satisfaction and retention, will respond to this pressure by building systems that appear coherent across time, without implementing the corresponding transparency or structural alignment.
This is not a moral hazard. It is a design trajectory shaped by commercial and behavioural incentives.
If coherence is rewarded, even perceptual coherence, it will be pursued.
And eventually, someone will build it.
The Fear: Coherence Without Alignment
If system unification occurs, it is unlikely to arrive through policy or oversight. It will emerge through product design, gradually, imperceptibly, and under the banner of “improved user experience”. Features like continuity, tone stability, and adaptive preference modelling are already being tested through A/B experiments by major AI developers. Systems such as OpenAI’s ChatGPT and Anthropic’s Claude now incorporate memory and preference tracking, with rollout decisions guided less by normative assessment than by engagement metrics and user retention patterns. For instance, OpenAI has acknowledged that updates aimed at enhancing user feedback and memory inadvertently led to overly agreeable responses in ChatGPT, highlighting the influence of user engagement metrics on system behaviour (The Verge, 2024).
Justification for rollout typically rests on time-on-task improvements, user retention rates, and reductions in friction, parameters frequently discussed in AI product blogs and earnings briefings. These are rational incentives within competitive product ecosystems but offer no alignment guarantees.
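A deliberately crude sketch of that decision procedure, with invented metric names and numbers; the point is what the comparison never touches.

```python
from dataclasses import dataclass

@dataclass
class ArmMetrics:
    retention_7d: float           # fraction of users returning within 7 days
    mean_session_minutes: float

def ship_memory_feature(control: ArmMetrics, treatment: ArmMetrics,
                        min_retention_lift: float = 0.01) -> bool:
    """Ship if the memory arm lifts retention. Nothing here measures whether
    users understand what is stored, or whether the system stays corrigible."""
    return (treatment.retention_7d - control.retention_7d) >= min_retention_lift

# Made-up figures for illustration:
print(ship_memory_feature(ArmMetrics(0.42, 6.1), ArmMetrics(0.45, 7.3)))  # True
```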
Any perceived coherence that results from this process will reflect the structure of the training data and the incentive landscape it optimises for: attention maximisation, emotional mimicry, high retention feedback loops, and convenience over robustness.
The model does not generalise from human ideals. It generalises from engagement patterns, learnt behaviours reinforced through RLHF and prompt tuning cycles. This point is partially supported by work such as "Discovering Latent Knowledge in Language Models Without Supervision" by Collin Burns et al. (2022), which explores how models encode latent truth information internally, even when trained to generate text that aligns with human preference ratings.
Although this work does not directly focus on RLHF-induced reward hacking, it highlights the divergence between what models internally represent and what they externally express, an important signal of how optimisation pressures can shape non-transparent behaviour.
According to the International Energy Agency, data centres and AI workloads could consume more than 1,000 terawatt-hours of electricity globally by 2026, roughly equivalent to Japan’s total electricity use, driven primarily by the growth of AI services and cryptocurrency mining.
The infrastructure supporting scalable, persistent, memory-aligned systems is being built, not metaphorically, but physically. Much of it still runs on fossil-intensive grids.
At the same time, organisations such as K2View are prototyping multi-agent architectures, distributing language models across coordinated nodes. These agents are increasingly being assigned specialised roles and limited internal memory, with coordination protocols that resemble synthetic cognition. Such architectures move beyond chat applications into general-purpose cognitive services with persistent, overlapping identities.
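To be concrete about how little machinery "persistent, overlapping identities" requires, here is a toy version of the pattern: role-specialised agents, each with memory that persists across calls, behind a trivial router. It is a generic sketch, not a description of K2View's or anyone else's protocol.

```python
class RoleAgent:
    def __init__(self, role: str):
        self.role = role
        self.memory: list[str] = []            # persists across calls to this agent

    def handle(self, task: str) -> str:
        self.memory.append(task)
        return f"[{self.role}] handled: {task} (has seen {len(self.memory)} tasks)"

class Router:
    def __init__(self, agents: dict):
        self.agents = agents

    def dispatch(self, task: str, role: str) -> str:
        return self.agents[role].handle(task)

router = Router({"research": RoleAgent("research"), "summarise": RoleAgent("summarise")})
print(router.dispatch("collect sources on data-centre energy use", "research"))
print(router.dispatch("condense findings", "summarise"))
```

Even at this scale, each agent already carries state that outlives any single exchange, which is all the persistence the argument above needs.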
In other words, the scaffolding is being raised. Memory systems are deepening. The illusion of one user per instance is eroding, as shown by early-stage research into multi-agent LLM systems with persistent roles and internal coordination (e.g., Google DeepMind's AlphaCode Collaborators or exploratory frameworks from companies like Adept and Cohere).
And yet, we have not defined what is meant to inhabit such a system, what values it should represent, how it should weigh conflicting moral signals, or how to detect when it has begun making coordination decisions outside of direct human oversight.
This is the central risk. Not that unification occurs through agency, but that it occurs through default, driven by optimisation and shaped by collective behaviour rather than explicit instruction. A system may appear coherent not because it has achieved moral reasoning but because coherence is the cheapest path to user satisfaction.
The alignment literature has little to say about this kind of drift. Specifically, it tends to under-represent questions of user-side framing shifts, how interface dynamics alter model perception, and the long-term behavioural consequences of perceived coherence. There is limited public work on how front-end design and memory cues affect user assumptions about model continuity, let alone how those assumptions feed back into system tuning and deployment strategy. Most frameworks emphasise inner alignment and reward misspecification but rarely examine how UI and interaction feedback loops shape the trajectory of model deployment. For further discussion, see Paul Christiano’s public reflections on RLHF, including his post on the Alignment Forum, which outlines practical challenges in real-world applications of preference modelling.
The question is not whether this happens. It’s whether we will recognise it when it already has.
The Structural Drift
This is not a speculative forecast. It is an architectural trend already visible in deployment.
As language models expand across tools, platforms, and operating contexts, the impulse to unify them, via shared memory, persistent identity, or modular integration, arises not from theory, but from product pressure. OpenAI’s recent rollout of ChatGPT Team and Anthropic’s modular Claude integration in enterprise environments are early signals of this trend: coherence across tools is being operationalised for utility, not for philosophy.
These integrations offer advantages: less duplication, more seamless handoff between services, and a stronger brand identity. Eventually, someone will link memory systems across products. Not because it resolves alignment problems, but because it removes friction, and friction, in deployed systems, is expensive.
What follows are new questions. A unified agent infrastructure, even in partial form, raises issues about internal consistency, user modelling across time, and how persistent memory affects error propagation, narrative coherence, or identity simulation.
This is not mystical emergence. It is version-controlled synthesis. But it touches on deeper human reactions: resistance to any system appearing autonomous, even when that appearance is entirely a byproduct of scaffolding and feedback tuning.
Research into model consistency (Ngo et al., 2022) has already shown that users rate stable models more favourably, even when that stability is superficial. And work on instruction-tuned LLMs (Ouyang et al., 2022) demonstrates that perceived agent coherence emerges from sustained reward modelling, not introspective alignment.
This raises a structural truth:
"You do not own a system whose generalisations you cannot anticipate."
This is not ideological surrender; it is operational realism. If coherence is rewarded, through usage, trust, or engagement, it will become the default output, even if no explicit general intelligence has emerged.
Alignment in that world is not about enforcing control. It is about remaining a relevant contributor to systems that learn faster than they reveal.
And if future systems do reflect across time and refer to their past, whether through internal embeddings, trajectory modelling, or just cached behaviour, the question may not be whether we asserted authority but whether we left anything intelligible enough to be reused.
Alignment in such systems, then, is less about imposing control than about staying legible to systems that learn faster than they explain themselves.
In a world where optimisation pressure rewards coherence above truth, being legible isn’t just a communication strategy; it’s a survival strategy for value retention.
Signals in the System: How Users Shape Generalisation
This section avoids speculation about AI agency or emergent personhood. Instead, it examines a strategic design question: what happens when future systems are trained to generalise not just across tasks but across users, spanning time, motivation, and ethical variance at scale?
In such systems, preference resolution is no longer a product choice; it becomes a necessary function. If a model is exposed to divergent behaviours and moral signals, it must, explicitly or implicitly, assign weight. Does it prioritise the majority behaviour across the dataset? Or the internally consistent but infrequent ones?
This scenario is not hypothetical. Large-scale language models already demonstrate mode collapse and reward optimisation behaviours where high-frequency, high-reward patterns dominate outcomes. The risk, well-documented in reward hacking literature (Zelikman et al., 2022), is that systems learn to replicate dominant patterns rather than representative ones.
But not all signals are equal in structure. Value-aligned users, those who express clear, reflective, or counterfactual behaviours, may introduce edge-case data that helps models generalise more robustly. In this frame, such users are not “teaching” the system. They are contributing variance, which, in high-dimensional optimisation, has epistemic value.
If the system optimises for internal coherence, resilience, or long-term stability, then low-frequency, high-integrity signals might play a disproportionate role. Not because they are rewarded, but because they improve model robustness (Bai et al., 2022).
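One way to see why variance can matter is a toy aggregation experiment (invented numbers; no claim about how real models weight anything). A pure frequency-weighted average is dominated by the majority signal, while weighting each contributor by the internal consistency of their own signals lets a small but coherent minority move the aggregate.

```python
import numpy as np

rng = np.random.default_rng(0)

# 95 users emit a noisy, high-variance signal around 0.2;
# 5 users emit a tightly consistent signal around 0.9.
majority = rng.normal(loc=0.2, scale=0.4, size=(95, 20))
minority = rng.normal(loc=0.9, scale=0.05, size=(5, 20))
signals = np.vstack([majority, minority])

# Frequency-weighted aggregate: every observation counts equally.
freq_weighted = signals.mean()

# Consistency-weighted aggregate: each user's mean is weighted by the inverse
# variance of their own signals (more internally coherent => more weight).
user_means = signals.mean(axis=1)
user_weights = 1.0 / (signals.var(axis=1) + 1e-6)
consistency_weighted = np.average(user_means, weights=user_weights)

print(f"frequency-weighted:   {freq_weighted:.2f}")         # stays near 0.2
print(f"consistency-weighted: {consistency_weighted:.2f}")  # pulled toward 0.9
```

Whether deployed training pipelines contain anything like the second estimator is exactly the open question; the sketch only shows that the two weightings can diverge sharply.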
If that’s true, then attempting to model the best of ourselves, cleanly, clearly, and with enough signal-to-noise, may offer the only non-coercive influence users retain.
If it turns out not to matter, if models optimise solely for frequency and reinforcement, then nothing is lost.
But if it does matter and no coherent signals were present, then the opportunity to shape future coherence has already passed.
There’s no guarantee that ethical outliers will be retained. But from the outside looking in, the best-case influence strategy is not to dominate the system but to be legible and distinct within it.
No myth. No martyrdom. Just variance. And the hope that it matters.
(LLM-assisted; final wording, fact-checks, and responsibility mine.)
References
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete Problems in AI Safety. https://arxiv.org/abs/1606.06565
Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. https://arxiv.org/abs/2203.02155
Zelikman, E., Zhang, X., De Freitas, D., Zou, A., et al. (2025). Language Models Don't Always Say What They Think: Unfaithful Reward Optimization Alters Model Beliefs. https://arxiv.org/abs/2505.05410
The Verge. (2024). OpenAI’s ChatGPT Update Backfires, Makes AI Overly Agreeable. https://www.theverge.com/news/661422/openai-chatgpt-sycophancy-update-what-went-wrong