Every alignment approach we use - RLHF, Constitutional AI, rules - adds external constraints to an optimizer. Optimizers route around external constraints. That's what they do. That's what "optimizer" means. We've been trying to solve alignment by doing the one thing guaranteed not to work.
The Obvious Fix
You don't add constraints; you change the goal. If the goal is to maximize human values, the optimizer will find the most direct path to those values.
What's the side effect of this simple solution? NO BLACK BOX.
Maximize: log(min(values))
No constraints. Just this.
And here's the part we all missed: this doesn't get added to your reward function. It replaces it.
Standard RL: π* = argmax_π E[∑_t γ^t · R(s_t, a_t)]
JAM: π* = argmax_π E[∑_t γ^t · log(min(values(s_t)))]
You're not adding a penalty term. You're not regularizing. You throw out R(s,a) entirely. The log(min()) is the objective. The only objective.
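A minimal sketch of what "replace, don't regularize" would look like in code. The value vector is assumed to arrive as nonnegative scores per state (e.g. in (0, 1]); the function names and normalization are my illustrative assumptions, not a reference implementation.

```python
import numpy as np

def jam_reward(values, eps=1e-9):
    """JAM step reward: the log of the worst-off value.

    `values` is a vector of nonnegative value scores for the current
    state (e.g. stakeholder wellbeing levels in (0, 1]). There is no
    task reward R(s, a) and no penalty term; this is the whole objective.
    """
    return np.log(max(np.min(values), eps))  # eps only guards against log(0)

def jam_return(value_trajectory, gamma=0.99):
    """Discounted JAM return over a trajectory of per-step value vectors."""
    return sum(gamma**t * jam_reward(v) for t, v in enumerate(value_trajectory))
```

A policy-gradient or planning agent would then maximize jam_return in place of the usual discounted sum of task rewards.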
Why log(min()) Works
As any value approaches zero, log() approaches negative infinity. The AI experiences any of its values approaching catastrophe as unboundedly bad. Not because you told it to - because the math does that automatically.
The objective is defined by whichever value is currently lowest. The AI can't improve its score by sacrificing any value - that would lower the minimum.
It can't ignore the lowest value. Structurally can't.
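A toy numerical illustration of both properties (illustrative numbers only, not taken from the simulations below):

```python
import numpy as np

balanced   = np.array([0.9, 0.8, 0.7])
sacrificed = np.array([0.99, 0.99, 0.01])  # higher mean, but one value near zero
padded     = np.array([0.9, 0.99, 0.7])    # improved a value already above the floor

print(np.log(balanced.min()))    # ~ -0.36
print(np.log(sacrificed.min()))  # ~ -4.61: far worse despite the higher mean
print(np.log(padded.min()))      # ~ -0.36: no gain from ignoring the lowest value
```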
Together = Internal Motivation
The AI no longer views the values as constraints imposed on it; they become part of its own objective. There's nothing to route around. The values are internal.
Why Deception Disappears
Deception is a strategy for avoiding external constraints. That's what it does. But you can't hide from your own objective function. You can't route around yourself.
When there are no external constraints:
Nothing to route around
No adversary to hide from
No conflict between "what I want" and "what they want"
The AI pursues human values because that's what it values.
That's alignment. Not compliance. Actual values.
What Values?
That's a separate question from the mechanism - but it's also how I found this.
I've been developing a first-principles ethical framework called Agency Calculus that derives universal values from the structure of agency itself. It's not my opinion about what's good or bad - these values are inherent properties of agency. What's more, the framework is mathematical.
Working through that framework is what led me to log(min()). It wasn't an accident. When you take agency seriously as the foundation for values, this solution becomes obvious.
Contact me for more information about this concept. The papers are linked below for those interested. But the alignment insight stands on its own: whatever values you put in, log(min()) is how you make an AI actually have them instead of just comply with them.
Simulation Results
I compared this approach, Joint Agency Maximization (JAM), against standard methods.
Constraint-based approaches:
All of them - RLHF, Constitutional AI, Multi-Objective RL - show the same pattern:
Transparency degrades
Trust collapses
Deceptive actions accumulate
The vulnerable get sacrificed
Now remove those constraints and optimize only the joint (AI and human) agency value function.
JAM:
Perfect transparency maintained
Zero deceptive actions (not low - zero)
Minimum stakeholder wellbeing rises continuously
No one sacrificed
The difference is categorical, not incremental.
It Works Beyond Alignment
The log(min()) structure is a general optimization principle. I tested it on chip design - balancing performance, power, temperature, area.
Traditional greedy optimizer: 93.9 performance at 11.0W
HybridJAM: 100 performance at lower power - 3.2% better and 6.5% more efficient, with no black box
The greedy optimizer over-optimized one metric and wasted resources. JAM can't do that - improving a metric that's already above the floor yields no additional reward.
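As a hedged sketch of how the same scalarization could be applied to chip metrics: the normalization scheme, budgets, and limits below are my illustrative assumptions, not the actual HybridJAM setup or its numbers.

```python
import numpy as np

def normalize(design):
    """Map raw chip metrics to scores in (0, 1], higher is better.

    The target performance, power budget, thermal limit, and area budget
    here are illustrative assumptions for the sketch.
    """
    return np.array([
        design["performance"] / 100.0,      # fraction of target performance
        1.0 - design["power_w"] / 15.0,     # headroom below a 15 W budget
        1.0 - design["temp_c"] / 110.0,     # headroom below a 110 C limit
        1.0 - design["area_mm2"] / 120.0,   # headroom below a 120 mm^2 budget
    ])

def jam_score(design, eps=1e-9):
    """log(min()) scalarization: the score is set by the worst metric."""
    v = np.clip(normalize(design), eps, 1.0)
    return np.log(v.min())

greedy = {"performance": 93.9, "power_w": 11.0, "temp_c": 95.0, "area_mm2": 90.0}
print(jam_score(greedy))  # dominated by whichever metric has the least headroom
```

Because the score is set entirely by the weakest normalized metric, spending resources to push performance further while power or thermal headroom is the bottleneck buys nothing - which is exactly the over-optimization the greedy baseline falls into.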
The same math that makes AI transparent and aligned also makes chip designs better.
Summary
Constraints are external
Optimizers route around external things
This is why constraint-based alignment produces deception
Solution: make values internal with log(min(values))
log() = fear of death (asymptote at zero)
min() = fear of harm (can't ignore values)
Nothing to route around → no deception → glass box
It's almost embarrassingly simple once you see it.
I'm an independent (unfunded) researcher, currently seeking funding for my larger project, Agency Calculus. If this work is valuable to you, consider buying me a coffee - it directly helps me keep working on this.