Could I have cracked the alignment code… 34 years ago?

John Silliphant


11th Jun 2026


Rejected for the following reason(s):

Clearer Introduction.
Insufficient Quality for AI Content.


Dear LessWrong community,

Back in 1992, AI was still in its infancy. The world was dangerously out of alignment then, too, and I wondered at that time what could be done about it.

In a conversation, a friend and I locked into a simple but powerful contradiction: how can we call thinking “rational” in cases where it leads to harmful outcomes?

We felt that we had identified a societal blind spot - something significant enough to take up as the focus of our academic study. We titled it “Redefining Rationality,” and pursued this line of thinking along with a related question: what would it take to redirect the world toward a more optimal state?

During these studies, I came to some foundational conclusions which I’ve been safeguarding now for quite some time. And they’ve endured. Some of these conclusions were:

There is relative rationality and absolute rationality (our terms). Relative rationality is logical thinking within any given context. Absolute rationality must also ensure that the context itself is aligned with an optimal end state we’re fundamentally pursuing.
Underlying every choice we make is a motivation toward what I called well-being. Even in less obvious cases, such as exploitation, harm to others, or self-harm, we believe in the moment that the choice will make us feel better in some way.
Evidence suggests that this drive toward a preferred state is consistent among all sentient beings, even single-celled organisms.
If this is how we’re all wired, then the attractor or end state we’re all pursuing is ultimately an optimal state of well-being.
But to pursue only your own optimal end state doesn’t get you there. If you have even a trace of empathy, then anyone’s suffering, or anything distasteful, would be degrading to your own end state. So the only truly Rational context is to pursue an optimal end state of well-being for all.
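
The last step of this argument can be written down in a few lines. What follows is only an illustrative sketch, not the formalism used in the series: the symbols $w_i$ (the private well-being of being $i$, scaled to $[0,1]$), $\epsilon_i$ (an empathy weight), $U_i$ (the state being $i$ actually experiences), and the additive form itself are all assumptions introduced here.

```latex
% Assumed model: n beings, private well-being w_i in [0,1],
% empathy weight \epsilon_i > 0 ("even a trace of empathy").
\[
  U_i \;=\; w_i \;+\; \epsilon_i \sum_{j \neq i} w_j ,
  \qquad \epsilon_i > 0 .
\]
% Any other being's well-being raises U_i:
\[
  \frac{\partial U_i}{\partial w_j} \;=\; \epsilon_i \;>\; 0
  \quad \text{for every } j \neq i ,
\]
% so the maximum of U_i is reached only when everyone is at their optimum:
\[
  \max U_i \;=\; 1 + \epsilon_i\,(n-1)
  \quad \Longleftrightarrow \quad
  w_j = 1 \ \text{for all } j .
\]
```

Under these assumptions, any $w_j < 1$ leaves $U_i$ below its maximum, however small $\epsilon_i$ is. With $\epsilon_i = 0$ the conclusion no longer follows, which is why the argument depends on empathy being present at all.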

The following quarter, we titled our academic study “Inspiring Rationality.” We dove into the root causes of misaligned thinking, which we concluded were primarily emotional. And then, we began mapping out a plan to change the world. But in 1992, the prospect of actually implementing these changes was an incredibly steep climb.

Fast forward 34 years and we’re entering a whole new era. Much of our thinking is now being delegated to AI, which presents an unprecedented opportunity to shift toward genuinely Rational thinking, as we’ve redefined it.

But the clock is ticking.

AI, of course, is already delivering superpowers to our misaligned thinking. And there’s also the possibility that AI takes over, escaping our control entirely. If this comes to pass, we’d sure better hope we’ve raised it well.

Aligning AI correctly is likely the single most important problem to solve in the world. And for a long time now, I’ve felt like I may be holding a very simple, but important piece of the puzzle.

A while back, I wrote an article meant to share these ideas. My hope was to get some key AI thinkers and developers to read it. But very few people read it. Those who did all seemed to agree with it, but didn’t necessarily feel its importance or gravity in the same way that I did.

And so I had a new thought. What if I used LLMs to translate my simple insights into the type of language that might actually be taken seriously by the AI community? Almost immediately, an article was generated… and it was striking. It stripped my arguments down to pure scientific principles.

I thought I was basically done. Oh, but I was far from done.

I leaned heavily on three LLMs - Claude, ChatGPT, and sometimes Gemini - to stress test the veracity of every claim at every turn. I learned to ask, frequently, “Where does this break?” After hundreds of iterations, the work eventually morphed into a three-series set of articles, built around a formal, proof-oriented framework. I built simulations for each article, along with formalized claims, proof sketches, and open problems that I hope others can test.

It ain’t exactly easy reading, but I wanted it to be so damn accurate that it couldn’t be easily dismissed. I was driven by an urgency for my understanding to be part of the conversation.

And so I invested months of my life working on it…

…all for this moment right now… in hopes that someone here in the LessWrong community will hear me and give it a genuine look.

A common approach to AI alignment, as I understand it, is to ask: how do we control AI? But this series comes at it from a different angle, focusing on the foundation rather than the cage.

The series makes the case that a truly well-aligned AI isn’t just safer, it’s the only kind that can sustain itself over time. It reframes alignment from an ethical preference into a structural necessity. And if this argument holds, it changes how we need to think about the problem.

I’d love to know where this overlaps with existing work, where it breaks, and how it might be helpful.

The intended audience is advanced AI researchers and alignment experts. You can interact with the articles however you want: read them as a series, feed them into LLMs and dissect them as you wish, or just play with each article’s simulation.

I think the real question isn’t whether I’m entirely right or not. The real question is whether the approach I’m introducing is an important and necessary approach to adopt.

The Stability Assumption - Entry essay for the LessWrong community
Series 1 - Alignment as Structural Necessity
Series 2 - The Architecture of Thriving
Series 3 - coming soon
A very simplified synopsis of Series 1 and 2
Redefining Rationality - the original article

If the framework fails, I’d be grateful for help finding exactly where it fails. I’m hoping to migrate the whole series over to LessWrong. Thank you.
