Thanks for engaging!
Byrnes' proposed solution
TBC, I wouldn’t say that I have any “proposed solution”, in the sense of a technical plan that I expect to work. What I have is avenues of attack (actually two of them, see §3 of my 2025 review) which are not (to my knowledge) categorically ruled out, and which thus merit further investigation. Hopefully such further investigation will eventually result in either a plan, or an argument that no plan of that type exists. If you think you have an argument that rules out one of these lines of attack, I’m generically open-minded and very interested.
But I disagree with your argument here, or don’t understand it.
First and foremost, I’m confused about whether you have in mind an LLM (like most people are thinking about) or a “brain-like AGI” (some yet-to-be-invented version of actor-critic model-based RL) (like I’m usually thinking about). If you’re talking about my technical alignment ideas, then those are specific to the latter. They don’t apply to today’s LLMs. I don’t think I have much special insight into “alignment” of today’s LLMs. But you seem to be assuming that we’re talking about LLMs (as we know them today) in many parts of this post—e.g. you talk about “produce the next token” (brain-like AGI would presumably have a wider action space than that, including invisible actions like choosing what to think about), and “training data” (RL agents generally have a “training environment” not “training data”), and “subliminal learning” (a specific LLM thing, AFAICT), and the CAST proposal (as I read it, it seems to only make sense in LLM-world), etc.
While the human society lets most individuals intervene with the reward function and training data for the belief system of others, allowing individuals to align each other to the community…
I think this is a misleading way to think about things, see my post Heritability, Behaviorism, and Within-Lifetime RL. My reward function is locked away inaccessible in my head. If you put me in a room with another person, they will affect the inputs (and thus also outputs) of my reward function, but that’s equally true if you put me in a room with a turtle, or a book. Either way, saying that the other person can “intervene” on my reward function has a misleading connotation that they can just decide what they want my reward function to be.
However, unlike the AIs, human circuitry related to scheming is pruned by the fact that potential victims usually have similar or higher capabilities, causing the schemers to be punished.
I don’t think this argument works. Here’s an example (copied from “Behaviorist” RL reward functions lead to scheming):
As an illustration that something is wrong with “honeypots-for-alignment” theory, that theory would predict that, if a man is in prison, and he tries to escape but gets caught and harshly punished a few times, then he stops wanting to escape prison at all. Even if, later on, there’s a prison riot, and the guards are all dead, and the front door is open, and there’s a bus idling outside about to drive the rest of the prisoners to a non-extradition country … the prediction of “honeypots-for-alignment” theory is that this man will not get on the bus, but rather sit patiently in the otherwise-empty prison, waiting for the cops to arrive and lock him back up. …Well, this prediction is obviously absurd. … [more at the link]
Or consider how rebellious teens don’t usually “learn” to be honest to their hated parents, teachers, and authority figures. Right? Adults don’t “learn” to be honest to hated authority figures either, they just get savvy enough to know what they can get away with.
Anyway, is your actual belief that humans have no innate social drives related to compassion or status-seeking? If so, we can talk about why I disagree.
I mean that humans are born with more primitive versions of such drives (e.g. valuing others' smiles as a proxy for compassion), and then instrumental convergence transforms these primitive versions into the actual approval reward, status-seeking, etc., in a manner similar to how training makes agents that are already in the basin of alignment/corrigibility/etc. more aligned/corrigible. As for prisoners no longer wanting to escape, this is where your conjecture about the belief system is most promising: a few unsuccessful attempts instill a hard-to-reassess belief that future attempts have a low probability of success.
UPD: how similar is the belief-desire model to the master-slave model? Suppose that the belief system simply outputs what it believes to be the next tokens of, say, known theorems, or the muscular actions responsible for pull-ups, and the desire system steers the belief system in order to bring about outcomes like solving the problem or doing the exercise correctly. Then the desire system and the belief system would occupy the positions of the master and the slave, respectively.
@Steven Byrnes' recent post Why we should expect ruthless sociopath ASI and its various predecessors like "6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa" try to explain that a brain-like RLed ASI would be a ruthless consequentialist since “Behaviorist” RL reward functions lead to scheming.
Byrnes' proposed solution
Byrnes' proposed solution rests on the possibility that "we have an innate reward function that triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away. So the human brain reward can evidently get triggered by specific activations inside my inscrutable learned world-model."
As far as I understand the proposed construction of Byrnes' agent, the belief system is activated, and then interpretability is applied in order to elicit concepts from the belief system. Then the belief system's outputs and the desire system's outputs are somehow combined in order to produce the next token. The token and its consequences (e.g. the fact that the math problem stayed unsolved or was solved) are likely to become part of the training data.
While the human society lets most individuals intervene with the reward function and training data for the belief system of others, allowing individuals to align each other to the community, in the AI-2027 scenario an army of copies of the same AI becomes involved in the training runs and in the creation of synthetic data. If this happens, then I expect that the belief system will end up being corrupted not just by wishful-thinking-like errors, but also by things like subliminal learning from the misaligned desire system. Therefore, I do not understand how Byrnes' proposal alone would prevent the agent from scheming.
An alternative explanation of why humans avoid wholesale scheming
Additionally, the question of why humans avoid wholesale scheming, as described[1] in Byrnes' post, has plausible explanations other than being born with an interpretability-based, approval-related reward.
The second mechanism listed here also provides a potential origin story for the Approval Reward. Suppose that humans are born with instincts like valuing smiles and smiling when happy. Then valuing smiles causes human kids to learn to make others visibly happy, which in turn generalises to a hybrid of Myopic Approval Reward-like circuitry and scheming-related circuitry.[2] However, unlike the AIs, human circuitry related to scheming is pruned by the fact that potential victims usually have similar or higher capabilities, causing the schemers to be punished. Additionally, human actions have long-term consequences, which makes it easier to transform Myopic Approval Reward-like circuitry into actual circuitry related to Longer-Term Approval Reward rather than things like short-term flattery.
Alignment-related proposals
Similarly to Cannell's post on LOVE in a simbox, I suspect that the considerations above imply that drastic measures like building interpretability into the reward function are unnecessary. The key to alignment would instead be to reward the AI for coordinating with others, genuinely trying to understand them, or outright helping them instead of replacing them, while pruning scheming-related behaviors by ensuring that the help is genuine. See also Max Harms' CAST agenda[3] and a related discussion in this comment on verbalised reasoning.
[1] "But my camp faces a real question: what exactly is it about human brains that allows them to not always act like power-seeking ruthless consequentialists? I find that existing explanations in the discourse—e.g. “ah but humans just aren’t smart and reflective enough”, or evolved modularity, or shard theory, etc.—to be wrong, handwavy, or otherwise unsatisfying."
[2] The formation of the two circuitry types is also affected by parental guidance and by the stories included in kids' training data.
[3] Which I reviewed here. In point 4, I proposed redefining power to be upstream of the user's efforts. While this redefinition makes the AI comprehensible to the user, it also incentivises the AI to adopt a specific kind of moral reasoning.