I previously did research for MIRI and what's now the Center on Long-Term Risk; these days I make my living as an emotion coach and Substack writer.
Most of my content becomes free eventually, but if you get a paid subscription to my Substack, you'll see everything a week early and make it possible for me to write more.
I think part of the issue is that epistemology is largely a question of mindware, and practice doesn't fix missing or bad mindware any more than practice can teach a person calculus they've never studied.
A useful prompt if you're discussing a topic with an LLM: "what would [a smart and knowledgeable person who disagreed] reply to this?"
This feels much easier than my previous strategy of "have the LLM analyze a position without tipping it off about what you think of that position, so that it can't be sycophantic toward you". Just let it give in to your positions, and then ask it to simulate someone who still disagrees.
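If you want to do the same thing over the API rather than in the chat interface, the move is just one extra user turn appended to the conversation. A minimal sketch, assuming the `anthropic` Python SDK; the model name and conversation contents are illustrative placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The conversation so far: I criticized the measure and Claude agreed with me.
# (Placeholder messages; use your actual conversation history.)
history = [
    {"role": "user", "content": "I think life satisfaction ratings are a bad measure of happiness, because ..."},
    {"role": "assistant", "content": "That's a fair criticism; the measure does seem flawed ..."},
]

# Instead of stopping at "you're right", ask it to simulate a persistent dissenter.
history.append({
    "role": "user",
    "content": "What would a smart and knowledgeable person who disagreed reply to this?",
})

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # substitute whichever model you use
    max_tokens=1024,
    messages=history,
)
print(response.content[0].text)
```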
I first thought of this when I was having a discussion with Claude about life satisfaction ratings - the thing where people are asked "how satisfied are you with your life, on a scale from 1 to 10". I think these are a pretty bad measure of happiness, and it's weird that many studies seem to treat the two as equivalent.
At first, the conversation followed the familiar pattern that these discussions tend to take with LLMs: I started with a criticism of the concept, Claude gave a defense of it, I criticized the defense, and then it said that my criticism was correct and I was in the right.
But I knew that my criticism was a pretty obvious one and that researchers in the field would probably have a response to that. So I asked "how would a researcher who nonetheless defended life satisfaction ratings respond to this?", and it gave me an answer that did change my mind on some points!
I still disagreed with some points there, so I pushed back on those. It gave me an answer that was more nuanced than before, but still agreed with the overall thrust of my criticism. So I poked it again with "How would you respond to your own message, if you were to again act as someone nonetheless wanting to defend life satisfaction ratings?".
And then I got another set of arguments that again changed my mind on some things. This felt like it resolved the remaining disagreement: we'd reached a point where the criticisms and the responses to them had been synthesized to a satisfying conclusion...
...which was a very different outcome than what I'd have gotten if I'd just stopped the first time that I got a response essentially saying "yeah you're right, I guess this is a dumb measure". Instead of stopping at my antithesis, we actually got to a synthesis.
Yeah, I generally don't even try one-shotting stories. :D
Thanks for the link!
This is the clearest explanation I've read of neural annealing so far! Thanks for writing it, I feel like I have a better intuition of it now.
The skill is to stay with it, without freaking out and without latching to the first new story that might explain what’s going on.
Yes. Applies to non-psychedelic-aided inner work as well.
With my system prompt (which requests directness and straight-talk) they have started to patronise me
I've gotten similar responses from Claude without having that in the system prompt.
I read this as being premised on "going crazy about the world ending" meaning that you end up acting obviously stupid and crazy, with the response basically being "find a way to not do that".
My model of going crazy at the end of the world isn't so much that you do something that's obviously crazy in your own view, but that the world ending is so out-of-distribution for everything you've been doing so far that you no longer have any idea what a sane or rational response even is. For instance, if your basic sense of meaning has been anchored to the world persisting after you and to your making some kind of mark on it, you won't know what to do with your life if there won't be anything left to make a mark on.
So staying sane requires also knowing what to do, not just knowing what not to do. Is there anything you would say about that?
This is why, in a much more real and also famous case, President Truman was validly angered and told "that son of a bitch", Oppenheimer, to fuck off, after Oppenheimer decided to be a drama queen at Truman.
For anyone else who didn't remember the details of what this was referencing:
Claude Opus 4.5's explanation of the reference
This refers to a meeting between J. Robert Oppenheimer and President Harry Truman in October 1945, about two months after the atomic bombings of Hiroshima and Nagasaki.
The meeting itself
Oppenheimer was invited to the Oval Office, ostensibly to discuss the future of atomic energy and weapons policy. At some point during the conversation, Oppenheimer reportedly said to Truman: "Mr. President, I feel I have blood on my hands."
Truman's reaction was sharp and dismissive. According to various accounts (primarily from Truman himself and his aides), Truman offered Oppenheimer his handkerchief and said something to the effect of "Would you like to wipe your hands?" After Oppenheimer left, Truman told Dean Acheson (then Undersecretary of State) that he never wanted to see "that son of a bitch" in his office again. Truman reportedly also said, "The blood is on my hands. Let me worry about that."
Why Truman reacted this way
Truman's anger seems to have stemmed from a few sources:
1. The decision was Truman's, not Oppenheimer's. Oppenheimer built the bomb, but Truman gave the order to use it. From Truman's perspective, Oppenheimer was claiming moral weight that properly belonged to the person who actually made the decision—and who would have to live with its consequences as a matter of presidential responsibility, not personal drama.
2. Truman viewed it as weakness or self-indulgence. Truman was famously blunt and decisive. He kept a sign on his desk reading "The Buck Stops Here." A scientist coming to him wringing his hands about guilt may have struck Truman as someone trying to have the significance of the decision without the responsibility for it.
3. The political context. Truman was dealing with the practical aftermath—the emerging Cold War, questions about international control of atomic weapons, the Soviet threat. Someone showing up to perform remorse rather than help solve problems may have seemed unhelpful at best.
The essay's interpretation
The author seems to be making the point that Oppenheimer's gesture made the atomic bomb about Oppenheimer—his feelings, his moral status, his inner drama—rather than about the actual event and its consequences. There's something structurally self-centered about a person involved in a catastrophe centering their own guilt rather than the catastrophe itself. Truman, whatever his flaws, seemed to grasp that the appropriate response to having made such a decision was to own it and deal with its consequences, not to perform anguish about it to the person who actually bore the responsibility.
Abstract for those who want to see it without clicking on the link:
The standard theory of model-free reinforcement learning assumes that the environment dynamics are stationary and that agents are decoupled from their environment, such that policies are treated as being separate from the world they inhabit. This leads to theoretical challenges in the multi-agent setting where the non-stationarity induced by the learning of other agents demands prospective learning based on prediction models. To accurately model other agents, an agent must account for the fact that those other agents are, in turn, forming beliefs about it to predict its future behavior, motivating agents to model themselves as part of the environment. Here, building upon foundational work on universal artificial intelligence (AIXI), we introduce a mathematical framework for prospective learning and embedded agency centered on self-prediction, where Bayesian RL agents predict both future perceptual inputs and their own actions, and must therefore resolve epistemic uncertainty about themselves as part of the universe they inhabit. We show that in multi-agent settings, self-prediction enables agents to reason about others running similar algorithms, leading to new game-theoretic solution concepts and novel forms of cooperation unattainable by classical decoupled agents. Moreover, we extend the theory of AIXI, and study universally intelligent embedded agents which start from a Solomonoff prior. We show that these idealized agents can form consistent mutual predictions and achieve infinite-order theory of mind, potentially setting a gold standard for embedded multi-agent learning.
Thanks! Yeah, the predictability is certainly an issue; the human needs to be the source of the variety. I had the idea of using the API with a script that keeps a long list of random concepts and randomly injects one into some messages, with a prompt like "consider some way in which the current story is analogous to [X] and introduce a new twist based on that" or "work a reference to [Y] into your next response". But I'm not sure that would actually have the desired effect, and it seems like a lot of work.
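Something like this, roughly - a minimal sketch assuming OpenRouter's OpenAI-compatible endpoint via the `openai` Python SDK; the concept list, injection probability, model ID, and function name are all made up for illustration:

```python
import os
import random

from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Stand-in concept list; the real one would be long and varied.
CONCEPTS = ["a heist", "metamorphosis", "a debt coming due", "the Ship of Theseus"]
TWIST_PROBABILITY = 0.2  # chance of appending a twist instruction to any given turn

def story_turn(history, user_message):
    """Send the next user message, sometimes with a randomly chosen twist injected."""
    content = user_message
    if random.random() < TWIST_PROBABILITY:
        concept = random.choice(CONCEPTS)
        content += (
            f"\n\n[Consider some way the current story is analogous to {concept},"
            " and introduce a new twist based on that.]"
        )
    history.append({"role": "user", "content": content})
    reply = client.chat.completions.create(
        model="anthropic/claude-3.5-sonnet",  # any OpenRouter model ID works here
        messages=history,
    )
    text = reply.choices[0].message.content
    history.append({"role": "assistant", "content": text})
    return text
```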
I've tried some open-weight models a little, but my stories generally tend to have pretty complex worldbuilding, character psychology, and so on, and I've found the most cutting-edge models to be best at understanding that. I could still try some if they happen to be available on OpenRouter, though - any that you'd particularly recommend?
I liked the examples, though they felt slightly abstract; I think they could have been further improved by adding specifics. I asked Claude to generate one-paragraph stories about them and found those useful for getting a better grasp of the concepts. (Edited a bit to remove redundant/overwrought sentences.)