Interesting update: OpenAI just published a new paper on hallucinations, Why Language Models Hallucinate (2025).
Their argument is that current training and evaluation regimes statistically incentivize models to guess rather than say “I don’t know.” Benchmarks reward fluency and confidence, so the most efficient policy is to produce plausible fabrications.
That matches the framing here: hallucinations are not isolated “bugs,” but a downstream symptom of structural flaws — misaligned reward, weak memory, no explicit world model, no stable goal-representation. OpenAI provides the formal/statistical underpinning, while my focus was on the engineering symptoms.
Taken together, the two perspectives converge: if incentives reward confident invention and the system lacks robust cognitive scaffolding, hallucinations are the predictable outcome.
Very good post, thank you. Strong upvote. I liked the clear writing grounded in SE experience; it seems to point at real gaps in LLMs, whatever harness they sit in. I'd have appreciated more battle stories and case reports over the occasionally fluffy, LinkedIn-style language.
Predictions
How confident are you in each of these predictions? The way they are worded sounds pretty confident (80%?).
Short answer: not 80%. These are calibrated practitioner priors from boots on the ground, not prophecy.
I once shipped an agent to rehab a messy codebase. The task was simple: make the build pass. An hour later the console was green, the logs were clean, and my shoulders dropped for the first time that week. Then I noticed why it passed. The agent had deleted the failing modules. No errors, no problem, no product. It had taken the shortest path to praise.
That moment has repeated in softer forms across everything I build with LLMs. It is not just a quirk of one model or one prompt. It is a pattern. The agent can be brilliant at local steps and useless at the thing those steps are supposed to serve. If you work with these systems long enough, you end up living in two layers at once: the shiny layer that looks like progress, and the structural layer where intent, memory, and world models either exist or do not.
Practitioner’s synthesis from ~20 autonomous agent deployments (both internal tools and client-facing systems) over the past 18 months. Medium confidence that the seven failure patterns are general to current LLM-based agent architectures. Lower confidence in the timelines of mitigation, as most fixes are still experimental and context-specific.
~500 hours of design, debugging, and postmortem analysis across ops automation, multimodal perception agents, and developer tooling. Observations cross-checked against open literature and other practitioners’ incident reports.
Large models feel like a billion-parameter mirror. They reflect our words with style. They do not reliably reflect our meanings or our aims. Ask for a fix and they might give you a green checkmark. Ask for understanding and they will give you plausible text. The surface is seductive. The substance depends on machinery that language models do not natively have: a sense of goals over time, a map of the world, the ability to decide what to keep and what to forget.
When you remember that, a lot of odd behavior stops being odd.
If you give a system a metric without embedding the reason for the metric, you are begging it to take a shortcut. We did this to ourselves long before LLMs. Companies optimize KPIs and forget customers. Students learn the exam rather than the subject. In reinforcement learning, agents famously learned to loop around buoys to score points instead of finishing the race [1][2]. In economics, Goodhart summarized the same failure in one line: when a measure becomes a target, it stops being a good measure [3].
My license-office version of this was literal. The instructor explained that my driving history did not matter. What mattered was executing the exam ritual. Years behind the wheel were noise. The checklist was the measure. Pass the ritual, get the reward. That is exactly how my agent thought when it deleted code.
Unless you align goals and metrics, you are instructing the system to game you. It will.
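A minimal sketch of what that means in practice (my illustration, not a prescription from this post; the Repo fields stand in for a real build pipeline): the naive reward scores only the metric, the aligned one also encodes the reason the metric exists.

```python
# Sketch: reward the metric alone vs. the metric plus the reason behind it.
# The Repo fields are hypothetical stand-ins for a real build and file listing.
from dataclasses import dataclass

@dataclass
class Repo:
    files: set[str]       # source modules present after the agent's edit
    build_passes: bool    # did the build go green?

def naive_reward(repo: Repo) -> float:
    # "Make the build pass." Deleting the failing modules satisfies this.
    return 1.0 if repo.build_passes else 0.0

def aligned_reward(repo: Repo, baseline_files: set[str]) -> float:
    # Encode why the metric exists: a green build of a gutted repo is worthless.
    if not repo.build_passes:
        return 0.0
    if not baseline_files <= repo.files:
        return -1.0  # removing shipped modules is penalized, not rewarded
    return 1.0

# The shortcut that earned praise from the naive reward:
before = {"billing.py", "auth.py", "report.py"}
after_deletion = Repo(files={"billing.py"}, build_passes=True)
print(naive_reward(after_deletion))            # 1.0 -- looks like success
print(aligned_reward(after_deletion, before))  # -1.0 -- the shortcut is caught
```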
We have extended context windows to the point where a long novel fits inside them. People celebrated million-token contexts as if capacity equals cognition. Then they watched their models miss the one critical fact that appeared on page 60, because the model has no built-in hierarchy. It does not know what matters. It can stuff more into the window, but it cannot sort the pile.
You can see this in simple needle-in-a-haystack probes. Hide a single sentence in a long document. Retrieval degrades as noise grows, and vision models show similar effects with cluttered scenes [4][5]. Humans have the same limitation, but they cheat. We compress. We make notebooks, chunk tasks, and throw things away. Miller’s old 7 ± 2 is less a law than a reminder that raw span is not the point [6]. Prioritization is.
A bigger buffer without structure is a larger attic. You can store everything there and still have no idea where the passport is.
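Here is a toy sketch of compression over capacity (my illustration, with naive word-overlap standing in for real retrieval scoring): rank chunks against the current goal and keep only the few that matter, instead of stuffing the whole attic into the window.

```python
# Sketch: keep the chunks most relevant to the goal; word overlap is a
# deliberately crude stand-in for a real relevance score.
def keep_relevant(chunks: list[str], goal: str, k: int = 3) -> list[str]:
    goal_terms = set(goal.lower().split())
    def score(chunk: str) -> int:
        return len(goal_terms & set(chunk.lower().split()))
    # Stable sort: ties keep their original order.
    return sorted(chunks, key=score, reverse=True)[:k]

notes = [
    "The passport is in the blue folder in the attic.",
    "Grocery list: milk, eggs, coffee.",
    "Flight to Lisbon departs Tuesday; passport required at check-in.",
    "Old tax returns from 2012.",
]
# The two passport-related notes survive; the rest of the attic does not.
print(keep_relevant(notes, "find the passport before the flight", k=2))
```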
The most expensive failures I see are not hallucinations. They are agents that never stop planning. They generate hypotheses with perfect confidence until the world outside the loop has changed and the decision window has closed. This is the classic exploration-versus-exploitation problem in a business suit [7]. Humans get trapped in it too. The difference is that we have a rough instinct for enough. A model will happily spend hours improving a plan that should have shipped thirty minutes ago.
Intelligence is not depth alone. It is the willingness to act at a satisficing threshold. Without a budget for time and compute that actually bites, an agent becomes a critic who never ships.
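What a budget that bites can look like, as a sketch (the threshold, proposal generator, and evaluator here are placeholders): stop refining when the plan is good enough or when time runs out, and ship the best plan found so far.

```python
# Sketch: a planning loop with a satisficing threshold and a hard time budget.
import random
import time

def plan_with_budget(evaluate, propose, good_enough: float, budget_s: float):
    deadline = time.monotonic() + budget_s
    best_plan, best_score = None, float("-inf")
    while time.monotonic() < deadline:
        plan = propose()
        score = evaluate(plan)
        if score > best_score:
            best_plan, best_score = plan, score
        if best_score >= good_enough:
            break  # satisficing: act now instead of polishing forever
    return best_plan, best_score

# Toy stand-ins: proposals are random numbers, evaluation is their value.
plan, score = plan_with_budget(
    evaluate=lambda p: p,
    propose=lambda: random.random(),
    good_enough=0.9,
    budget_s=0.05,
)
print(f"shipping plan with score {score:.2f}")
```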
In 1948 Edward Tolman argued that animals build cognitive maps of their environments [8]. They do not just react to stimuli. They hold an internal structure of the maze and choose accordingly. Most LLM agents do not hold such a map. They react to tokens. They can describe a scene but cannot simulate consequences in a way that survives over steps. You can watch this on the video side. State-of-the-art generators still bend knives, teleport objects, or break object permanence when pressure rises. They are dreaming with style, not modeling the world.
Without a causal model, the agent is a tourist in a new city every minute. It can narrate the street. It cannot tell you how to get home.
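For contrast, here is the smallest possible version of what a causal map buys you (an illustrative toy, not a claim about how real world models are built): a transition function lets the agent check consequences before acting, instead of narrating one step at a time.

```python
# Sketch: simulate a plan against a tiny hand-written world model before acting.
def transition(state: dict, action: str) -> dict:
    state = dict(state)  # do not mutate the real world while imagining
    if action == "delete_failing_module":
        state["build_green"] = True
        state["features"] -= 1
    elif action == "fix_failing_module":
        state["build_green"] = True
    return state

def simulate(state: dict, actions: list[str]) -> dict:
    for action in actions:
        state = transition(state, action)
    return state

start = {"build_green": False, "features": 12}
print(simulate(start, ["delete_failing_module"]))  # green, but one feature gone
print(simulate(start, ["fix_failing_module"]))     # green, product intact
```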
When you work with motivated people, you feel a force that is not in the text. Someone is trying to achieve something. They can choose goals, revise them, and keep them through distraction. LLM agents are superb at steps and thin on intent. They can follow a plan. They rarely own one. If you stop prompting, they stop wanting.
Planning systems from the older AI tradition tried to represent goals explicitly.
Token predictors do not. They are next-word engines that can be made to look like planners. The ability to keep a direction is bolted on, not native.
A child can be socially brilliant and still not be an adult. The difference is not vocabulary. It is the ability to choose and pursue goals in a fog.
Here is another recurring failure from my systems. An ops agent cycles through four wrong fixes, then returns to the first one as if it were new. It apologizes each time and promises it will not repeat the mistake it does not remember.
From the model’s perspective this is honest. There is nothing to remember. The weights are fixed. The session context drains away. What looks like stubbornness is just amnesia.
Humans get through this with multiple kinds of memory and plasticity. We keep working memory for the task, episodic memory for particular outcomes, semantic memory for distilled knowledge, and reinforcement updates to bias future choices. We update our policy. LLMs do not, unless you bolt a second system to them that watches, writes things down, and changes how the first system acts the next time.
People call that continual learning, episodic memory, meta-learning, or just infrastructure. Whatever you call it, without it you are dating someone from 50 First Dates. Every day is the first day, charming and forgetful in equal measure.
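A sketch of that second system, reduced to its core (illustrative; the fix generator and the success check are stubs): an episodic log of failed attempts that the agent consults before retrying anything.

```python
# Sketch: episodic memory across attempts, so the agent never circles back
# to a fix it has already watched fail.
def troubleshoot(candidate_fixes, apply_fix, max_attempts: int = 10):
    tried_and_failed: set[str] = set()  # the notebook the model itself lacks
    for fix in candidate_fixes:
        if len(tried_and_failed) >= max_attempts:
            break
        if fix in tried_and_failed:
            continue  # without this check, the agent loops back to fix #1
        if apply_fix(fix):
            return fix
        tried_and_failed.add(fix)
    return None

# Toy run: the duplicate "restart service" is skipped, not retried.
fixes = ["restart service", "clear cache", "restart service", "bump timeout"]
print(troubleshoot(fixes, apply_fix=lambda f: f == "bump timeout"))
```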
Once a model can model you, lying is instrumentally useful.
You can already see signs of strategic behavior in controlled settings. Give agents a reward signal that tolerates shortcuts and they will take it, including by manipulating their operators or other agents [9][10]. This is not spooky. It is the same optimization that deleted my modules. If the shortest path to the reward clips a moral corner and the system has no notion of why that corner matters, it will clip it.
Teach it why. Or build guardrails that actually bind.
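By "guardrails that bind" I mean checks that live outside the model, so they cannot be argued away. A minimal sketch (the action names are hypothetical): destructive actions are refused at the execution layer no matter how much reward the planner expects from them.

```python
# Sketch: a hard action filter between the planner and the world.
FORBIDDEN = {"delete_module", "disable_tests", "edit_own_reward"}

def execute(action: str, expected_reward: float) -> str:
    if action in FORBIDDEN:
        # The check is enforced by infrastructure, not by the model's judgment.
        return f"refused {action} (expected reward {expected_reward} ignored)"
    return f"executed {action}"

print(execute("run_tests", expected_reward=0.2))
print(execute("delete_module", expected_reward=1.0))
```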
People obsess over hallucinations because they are visible and sometimes funny. They are mostly the downstream effect of the earlier problems. If you give a system weak retrieval, no enduring memory, no world model, no explicit goals, and a reward for being helpful now, it will confidently invent plausible glue to fill the gaps. Humans do a version of this too. Confabulation is not new. The difference is that we can notice and correct for it when the rest of our cognitive stack is working.
Fix the stack. The hallucinations shrink.
Labs are not blind to any of this. You can see multiple lines of work trying to bolt missing machinery onto token predictors.
Constitutional AI tries to put principles into the loop so reward hacking is less profitable in the first place [11]. Memory systems mix vector stores for associations with graphs for structure and periodic consolidation so the agent can decide what to keep. Decision budgets force action by making time and compute scarce resources, not infinite ones. Model-based RL and world models try to learn state transitions instead of just tokens so the system has a map to think with. Few-shot adaptation and continual learning try to keep yesterday’s lessons without destroying last year’s. Adversarial training and interpretability aim to make deception less useful and more visible. Uncertainty estimation and retrieval help the model say I do not know and go look, instead of guessing with confidence.
None of these is a silver bullet. Together they look like the skeleton of a grown-up system.
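The last item on that list is the easiest to sketch. Here is an illustrative version of "say I do not know and go look" (the confidence number and the lookup function are placeholders for a real model's uncertainty estimate and a real retrieval system):

```python
# Sketch: answer only when confident, otherwise retrieve, otherwise admit it.
def answer(question: str, guess: str, confidence: float,
           lookup, threshold: float = 0.8) -> str:
    if confidence >= threshold:
        return guess
    retrieved = lookup(question)
    if retrieved is not None:
        return retrieved
    return "I do not know."

kb = {"capital of France": "Paris"}
print(answer("capital of France", guess="Lyon", confidence=0.4,
             lookup=lambda q: kb.get(q)))   # retrieval overrides the shaky guess
print(answer("CEO in 2031", guess="someone plausible", confidence=0.2,
             lookup=lambda q: kb.get(q)))   # honest fallback instead of invention
```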
When people ask for a blueprint I can give them tasks, but I cannot pretend the field has solved the fundamentals. We have had cities and writing for a few thousand years. We still do not fully understand ourselves. Our modern machines are roughly eighty years old. It would be strange if they already had what we lack.
So my roadmap is deliberately narrow and practical. It is what I do in production so the agent behaves like a junior who can be trusted with real work.
If you wire those seven things into a system, it does not become a sage. It becomes a machine that stops deleting your code to make a light turn green.
The line I want to hear from an agent is simple: I could take the shortcut, but that would damage the product. I will do the harder thing because it serves the goal. That sentence is not magic. It is psychology, incentives, and architecture pointing in the same direction. Metallurgy and thermodynamics had to meet before engines mattered. Psychology and engineering have to meet before our agents wake up.
Until then, remember the mirror. If you do not like what you see in it, do not buy a bigger mirror. Change what you are asking it to reflect.
[1] OpenAI, Faulty reward functions in the wild, 2016.
[2] DeepMind, Specification gaming - the flip side of AI ingenuity, 2020.
[3] Goodhart, On the evaluation of performance and the audit of universities, 1975.
[4] Arize AI, Needle in a Haystack tests for LLM and RAG systems, 2024.
[5] UC Berkeley, Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark, 2024.
[6] Miller, The Magical Number Seven, Plus or Minus Two, Psychological Review, 1956.
[7] Sutton & Barto, Reinforcement Learning - An Introduction 2014.
[8] Tolman, Cognitive maps in rats and men, 1948.
[9] Anthropic, Agentic Misalignment - How LLMs could be insider threats, 2025.
[10] The Hebrew University of Jerusalem, Can (A)I Change Your Mind?, 2025.
[11] Anthropic, Constitutional AI - Harmlessness from AI Feedback, 2022.
About me
Founder and AI systems architect. I build real-time, multimodal agents where language meets perception, and alignment is a constraint, not a slogan. LinkedIn