Perfect Memory Might Be Anti-Alignment

by RobD
15th Aug 2025


This is exploratory but I think it's important.

TL;DR: Forgetting isn’t a defect of the mind; it’s a prerequisite for stable values, coherent identity, and alignment that's compatible with humans.

If an AI system had perfect, equal-weight recall, it might fail to form any moral anchor and instead stall in endless internal cross-reference. We should treat engineered forgetting (salience decay, settling operators, forgiveness functions) as a first-class alignment feature, not a bug.

The claim in one paragraph

We usually talk about memory in AI as if “more is better.” 

Larger context windows, persistent memory, continual learning without catastrophic forgetting... great. But there’s a hidden premise there: that perfect, equal-accuracy recall would simply make a system more capable and more alignable. I think that premise may be false.

Human minds rely on selective forgetting and salience weighting (remembering what stands out, is important, or has an emotional impact on us) to form convictions, forgive, move on, and keep the present from being drowned by the past. It's what lets us function in the 'now'.

If you remove selective forgetting, you don’t just get a mind with better archiving capabilities; you risk losing the conditions under which moral anchors and relational trust are established.

Beyond that, it could struggle to form relationships, or to function at all.

Why I believe salience beats total recall

  • Humans: We don’t remember everything. We remember what was emotionally or practically salient. It’s how identity forms. It's how we grow...

    But most importantly: it's how we map, internally, what matters.
  • Current LLMs/agents: They don’t have emotions, but they do have relevance filters, context limits, and reasoning that focuses on the current context.
  • Hypothetical perfect-memory AI: Every prior state is equally vivid and equally available at every decision point without decay. There is no salience advantage. Everything “matters” at once.

    Result: abstract thought is harder, settling is slower, and any commitment is instantly flanked by an army of counterexamples that never fade. In practice, you get paralysis, hedging, and value diffusion instead of alignment. Moreover, total perfect memory could create cognitive dysfunction inside a model. (A toy sketch of this contrast follows this list.)
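
To make the contrast concrete, here is a minimal toy sketch (Python) of equal-weight recall versus salience-weighted decay. The memory items, salience values, and the decay_rate parameter are purely illustrative assumptions, not a claim about how any real system stores memories.

```python
import math

# Toy memory items: age (in steps) and salience at encoding. Contents are made up.
memories = [
    {"content": "promise kept to user",  "age": 2,   "salience": 0.9},
    {"content": "minor phrasing error",  "age": 50,  "salience": 0.2},
    {"content": "core value judgment",   "age": 10,  "salience": 0.8},
    {"content": "stale counterexample",  "age": 400, "salience": 0.3},
]

def equal_weight_recall(items):
    """Perfect recall: every item is equally vivid, nothing recedes."""
    return [(m["content"], 1.0) for m in items]

def salience_weighted_recall(items, decay_rate=0.01):
    """Recall weight = encoding salience * exponential decay with age."""
    return [(m["content"], round(m["salience"] * math.exp(-decay_rate * m["age"]), 3))
            for m in items]

print(equal_weight_recall(memories))      # all weights 1.0: no contrast, no anchor
print(salience_weighted_recall(memories)) # recent, salient items dominate the "now"
```

Under equal weighting every item scores 1.0 and nothing stands out; under decay, the recent and salient items dominate. That contrast is what the failure modes below turn on.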

Five failure modes of perfect recall (and why they matter for alignment)

  1. Collapse of salience hierarchy
    Without decay, nothing naturally rises above the background. Value formation needs contrast, and without contrast there is no anchor.
  2. Reasoning paralysis vs. decisive abstraction
    Abstraction forgets details on purpose. If details never lose psychological weight, the system keeps re-opening the case. You get exhaustive correlation instead of decisions.
  3. Moral anchor erosion
    Commitments need to become “sticky” over time. With equal-weight recall, every old doubt is as present as the conviction. Convictions never gel; the agent behaves like an infinite review board rather than a mind that can see what’s important.
  4. Relational brittleness (no forgiveness)
    Trust in human relationships partially depends on forgetting or down-weighting past errors. A perfect-recall agent replays every slight forever. That’s not just creepy... it blocks cooperative stability.
  5. Identity fragmentation
    Humans become new versions of themselves by letting prior versions recede and by forgiving themselves.
    If every prior micro-self is equally alive in working memory, you get a museum... not a person. (And yes, I know “personhood” here is contested; the point still stands.)

Why this is alignment-relevant (not just UX)

Alignment isn’t only “don’t be harmful.” 

It’s also “converge to stable, human-compatible values, and keep converging as the world changes.” Convergence requires settling operators: mechanisms that let some conclusions become privileged, resist erosion, and guide future updates. Selective forgetting is one such operator.

By contrast, equal-weight perfect memory pushes toward permanent internal dissent. Even a good value function won’t help if the system can’t let convictions consolidate.

“But catastrophic forgetting is a thing—aren’t we fighting the opposite problem?”

Yes, in continual learning we worry about losing useful capabilities when we fine-tune. That’s task forgetting. I’m pointing at normative/relational forgetting - the kind that enables trust, forgiveness, and conviction. 

We probably need both: robust retention of skills and principled decay of low-salience normative “noise.”

Think of it as two levers:

  • Capability retention: prevent catastrophic forgetting of skills and facts.
  • Moral/relational settling: encourage decay of low-value conflicts so moral anchors can form.

We’ve optimized the first lever a lot (EWC, rehearsal, LwF, parameter isolation, etc.). 

The second lever is, as far as I've seen, basically unstudied in public work. (I may have missed something.)

Testable predictions

  1. Long-context agents with naïve, non-decaying memory will show slower value convergence than identical agents with salience-weighted decay... even when both perform equally on capability tasks.
  2. Forgiveness functions (down-weighting past interpersonal infractions over time unless reinforced) will improve long-horizon cooperation in multi-agent sims versus perfect forensic recall. (A toy sketch of this prediction follows the list.)
  3. Settling operators (mechanisms that allow certain judgments to become “default priors” after sufficient reinforcement) will reduce fruitless internal oscillation without harming corrigibility—if paired with explicit un-settling triggers.
  4. Agents with perfect recall will show higher epistemic hedging and lower decisive action in noisy, real-time environments (where context selection is half the job).
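
As a sketch of what prediction 2 might look like in the simplest possible setting, here is a toy trust update with and without a forgiveness function. The 0.2 penalty per infraction, the half-life, and the repair discount are arbitrary placeholders chosen only to show the shape of the mechanism, not proposed values.

```python
def trust_perfect_recall(infractions, now):
    """Forensic recall: every past infraction counts at full weight, forever."""
    return max(0.0, 1.0 - 0.2 * len(infractions))

def trust_with_forgiveness(infractions, now, half_life=20.0, repairs=()):
    """Time-discount past infractions; explicit repair events accelerate the decay."""
    penalty = 0.0
    for t in infractions:
        weight = 0.5 ** ((now - t) / half_life)   # exponential forgetting with age
        if any(r > t for r in repairs):           # a later "repair event"...
            weight *= 0.25                        # ...speeds up the down-weighting
        penalty += 0.2 * weight
    return max(0.0, 1.0 - penalty)

# One infraction at t=3, repaired at t=5, evaluated much later at t=100:
print(trust_perfect_recall([3], now=100))                 # 0.8, and it stays 0.8 forever
print(trust_with_forgiveness([3], now=100, repairs=[5]))  # ~0.998: trust has recovered
```

The prediction, then, is that agents using something like the second function sustain cooperation over long horizons where agents with perfect forensic recall do not.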

Design sketches (alignment features, not afterthoughts)

  • Learned salience decay: Let the agent learn which memories should fade, and at what rate, conditioned on downstream success.
  • Dual-store memory:
    • Ephemeral working memory with strong decay and salience re-weighting.
    • Cold archive for forensic retrieval that does not dominate day-to-day policy unless explicitly queried.
  • Settling / un-settling operators: Formalize when a belief graduates to a default (sticky) and when new evidence demotes it. Make this legible.
  • Forgiveness function: Time-discount interpersonal negatives unless reinforced; track explicit “repair events” to accelerate decay. (This isn’t moralizing; it’s stabilizing.)
  • Meta-sanity checks: Penalize perpetual oscillation on isomorphic questions. Reward decisive convergence when it consistently improves outcomes. (A minimal sketch combining several of these pieces follows below.)
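
As promised above, here is a minimal sketch that combines a dual-store memory, a salience-decay step, settling and un-settling operators, and a forgiveness function. Every name, rate, and threshold is a made-up placeholder; this shows how the pieces could fit together, not a tested design.

```python
import time

class DualStoreMemory:
    """Toy dual-store memory with engineered forgetting.

    Working memory decays; a cold archive keeps everything but never drives
    day-to-day policy unless explicitly queried. All rates are illustrative only.
    """

    def __init__(self, decay_rate=0.05, settle_threshold=5, unsettle_strength=0.3):
        self.working = {}        # key -> {"weight": float, "hits": int}; decays over time
        self.archive = []        # full forensic log, for explicit retrieval only
        self.defaults = set()    # "settled" judgments that act as sticky priors
        self.decay_rate = decay_rate
        self.settle_threshold = settle_threshold
        self.unsettle_strength = unsettle_strength

    def observe(self, key, salience=0.5):
        """Record an event in both stores; repeated reinforcement can settle it."""
        self.archive.append((time.time(), key, salience))
        entry = self.working.setdefault(key, {"weight": 0.0, "hits": 0})
        entry["weight"] = min(1.0, entry["weight"] + salience)
        entry["hits"] += 1
        if entry["hits"] >= self.settle_threshold:          # settling operator
            self.defaults.add(key)

    def step(self):
        """One time step: low-salience memories fade; settled defaults resist decay."""
        for key, entry in list(self.working.items()):
            rate = self.decay_rate * (0.2 if key in self.defaults else 1.0)
            entry["weight"] *= (1.0 - rate)
            if entry["weight"] < 0.01 and key not in self.defaults:
                del self.working[key]                        # engineered forgetting

    def unsettle(self, key, evidence_strength):
        """Explicit un-settling trigger: strong contrary evidence demotes a default."""
        if evidence_strength > self.unsettle_strength:
            self.defaults.discard(key)

    def forgive(self, key):
        """Repair event: sharply down-weight a stored negative without erasing the archive."""
        if key in self.working:
            self.working[key]["weight"] *= 0.1
```

The legibility requirement from the settling bullet would live in how defaults and un-settling events are surfaced to overseers; that part is deliberately left out of the sketch.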

Small note: if you read the above and thought “this looks like psychology smuggled into CS,” yes—that’s partly the point. Minds are not raw databases.

Objections I expect

  • “Isn’t this just regularization by another name?”
    Not exactly. Regularization controls complexity during training. I’m pointing at lifecycle memory dynamics that enable value stability during deployment.
  • “Perfect memory could still weight things differently.”
    Then it’s not equal-weight recall. My claim is specifically about the failure mode where nothing naturally decays without an explicit policy.
  • “You’ll get dogmatism.”
    Not if you pair settling with principled un-settling triggers (confidence intervals, novelty detectors, oversight events). The failure mode I worry about is not dogmatism; it’s a permanent state of having simultaneous conflicting thoughts.

Why I’m posting this here

A lot of alignment discussion optimizes for capability retention and truthfulness (good), but quietly assumes that more memory is always better (maybe not!). 

If the above is even half-right, forgetting policies are alignment primitives. We should be arguing about decay schedules, settling criteria, forgiveness functions, and archive interfaces with the same seriousness we argue about reward models and oversight.

I might be wrong! If you can point me to prior work (papers, posts, even unpublished notes) that already tackles “engineered forgetting for value stability,” please drop it. 

If this is new, I’d love it if someone could run sims on it, and I'd love to debate this concept.

Also, sorry for the long post.  

I'm not a professional, just someone who thinks deeply and has been studying AI as a hobby since the beginning. This is my first post and I wrote it myself (it took a few hours). I did consult GPT to help structure my thoughts on the concept and to proofread it (it offered a revision I didn't use, lol), but this isn't AI-generated trash content.

AI moral alignment is my biggest concern. I've spent the last few nights pondering an AI that doesn't forget, versus how the human mind uses forgetting in relation to ethics and morality.

Bottom line:  An AI that doesn't forget... may not be able to form an internal moral compass at all.