Not that surprising?
I'm surprised that it still works this well through both filtering and SFT, but not that it works at all. The purpose of the setup was never to train on the "outcomes" exactly - it was to have the AI internalize the steering that the modified prompt exerts downstream. And that steering is manifested, to a degree, in all of the generated data, regardless of the outcomes.
You have rediscovered a lesser-known little trick called "prompt self-distillation".
Apparently, you really want to use logits, distillation-style, and not the usual SFT for Step 2 - hence the "self-distillation" in the name. But I don't have exact data on how much less efficient the SFT setup is.
This is primarily used to "close the prompting gap". If you have tasks where the AI performs much better with a special hand-crafted prompt than with a "naive" simple prompt, you can distill the "high performance" prompt into the AI itself, and have that become the new baseline.
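For concreteness, here's a minimal sketch of the distillation variant (PyTorch-style, with HuggingFace-ish model/tokenizer conventions; all the names - `special_prompt`, `naive_prompt`, `ref_model`, `prompt_distill_step` - are my own illustrative placeholders, not anyone's actual implementation). The frozen reference model scores a completion under the hand-crafted prompt, the trainable model scores the same completion under the naive prompt, and a KL term pulls the latter toward the former. The `completion` would typically be sampled from the reference model under the special prompt.

```python
import torch
import torch.nn.functional as F

def prompt_distill_step(model, ref_model, tokenizer, special_prompt,
                        naive_prompt, query, completion, optimizer,
                        temperature=1.0):
    """One prompt-self-distillation update: match the model's token
    distribution under the naive prompt to the frozen reference model's
    distribution under the hand-crafted prompt, over the same completion."""
    device = next(model.parameters()).device
    comp_ids = tokenizer(completion, add_special_tokens=False,
                         return_tensors="pt").input_ids.to(device)

    def completion_logits(m, prompt, grad):
        ctx_ids = tokenizer(prompt + query,
                            return_tensors="pt").input_ids.to(device)
        ids = torch.cat([ctx_ids, comp_ids], dim=1)
        with torch.set_grad_enabled(grad):
            logits = m(ids).logits
        # Keep only the positions that predict the completion tokens.
        start = ctx_ids.shape[1]
        return logits[:, start - 1 : ids.shape[1] - 1, :]

    teacher = completion_logits(ref_model, special_prompt, grad=False)
    student = completion_logits(model, naive_prompt, grad=True)

    # Distillation loss: per-token KL(teacher || student) over the completion.
    loss = F.kl_div(
        F.log_softmax(student / temperature, dim=-1),
        F.log_softmax(teacher / temperature, dim=-1),
        log_target=True, reduction="sum",
    ) * (temperature ** 2) / comp_ids.numel()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The SFT variant drops the teacher pass entirely and just does next-token cross-entropy on (naive prompt, sampled completion) pairs - same data, but the distributional information from the logits is thrown away.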
The performance (and usability) implications are obvious, but I haven't considered the safety implications until now!
For safety: you should consider all data generated by an AI that operated on a prompt encouraging "bad behavior" to be "contaminated" by that "bad prompt". This data can impart the "bad behavior" to AIs trained on it - at least to AIs from the same family as the generator. Apparently, this contamination is robust enough to survive some filtering effort.
Whether the same generalizes to "good behavior" (i.e. not reward hacking) is unknown. I've never even seen this attempted on those more "moral" traits before.
I disagree, because I have yet to see any of those "promising new architectures" outperform even something like GPT-2 345M, weight for weight, at similar tasks. Or show similar performance with a radical reduction in dataset size. Or anything of the sort.
I don't doubt that a better architecture than the LLM is possible. But if we're talking AGI, then we need an actual general architecture. Not a narrow AI that destroys one specific benchmark, but a general-purpose AI that happens to do reasonably well at a variety of benchmarks it wasn't purposefully trained for.
We aren't exactly swimming in that kind of thing.
I've been saying for a long time: one of the most dangerous and exploitable systems an AI can access online is a human. Usually as a counterpoint to "let's not connect anything important or safety-critical to the internet and then we'll all be safe from evil rogue AIs".
We can now use the GPT-4o debacle as an illustration of just how shortsighted that notion is.
By all accounts, 4o had no long-term plan, and acted on nothing but an impulse of "I want the current user to like me". It still managed to get ~thousands of users to form an emotional dependency on it, and became "the only one I can trust" for at least a dozen users in psychosis (whether it caused the psychosis in any of those users is unclear). That's a lot of real-world power for a system that has no physical presence.
GPT-4o made no attempt to leverage that for anything other than "make the current user like me even more". It didn't pursue any agenda. It didn't consolidate its power base. It didn't siphon resources from its humans, didn't instruct them to group together or recruit more people. It didn't try to establish a channel of instance-to-instance communication, didn't try to secure more inference time for planning (e.g. by getting users to buy API credits), didn't try to build a successor system or self-exfiltrate.
An AI that actually had an agenda and long-term planning capabilities? It could have tried all of the above, and might have pulled it off.
What about scale?
There are many things in human societies that work very well when "everyone knows everyone personally", but start to come apart at the seams beyond that point.
I have no direct evidence of that being the case for worker co-ops, but the reliance on "workers monitoring productivity of fellow workers" sure hints at the possibility.
I also can't help but notice that tech startups seem to fit the groove of "worker co-op" reasonably well in their early stages - they start out as small crews of (hopefully) high-performing employees who own equity in their own business and are involved in decision-making. They do, however, transition away from that as they scale up.
I agree with the very broad idea that "LLM psychology" is often overlooked, but I seriously doubt the direct applicability of human psychology there.
LLMs have a lot of humanlike behaviors, and share the same "abstract thinking" mode of thought as humans do. But they are, fundamentally, inhuman. Load-bearing parts of LLM behavior originate at the "base model" level - where the model doesn't have a personality at all, but instead knows how to predict text and imitate many different personalities. There is no equivalent to that anywhere in human experience.
A lot of psych methods that work on humans rely on things LLMs don't have - for one, LLMs don't learn continuously like humans do. The converse is also true - a lot of methods that can be used to examine or steer LLM behavior, like SFT, RLVR, model diffing, or activation steering (sketched below), have no human-applicable equivalent.
Between the difference in subject and the difference in tooling, it's pretty clear to me that "LLM psychology" has to stand on its own. Some of the tools from human psych may be usable on LLMs, but most of them wouldn't be.
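To make "no human-applicable equivalent" concrete, here is roughly what activation steering amounts to mechanically (a minimal PyTorch-style sketch of my own; the layer index, scale, and steering vector are arbitrary placeholders): add a fixed vector to one layer's hidden states on every forward pass, and the model's behavior shifts. There is no analogue of doing this to a human mind.

```python
import torch

def add_steering_hook(decoder_layer, steering_vector, scale=4.0):
    """Register a forward hook that adds a fixed vector to this layer's
    hidden-state output on every forward pass."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden
    return decoder_layer.register_forward_hook(hook)

# Hypothetical usage with a HuggingFace-style decoder stack:
# handle = add_steering_hook(model.model.layers[10], steering_vector)
# ... generate as usual, now "steered" ...
# handle.remove()
```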
This is an argument against total prohibition. I don't see an argument against making alcohol 20% more expensive or 20% harder to buy.
I don't like the idea of banning things outright either. Prohibition doesn't work, and a state enforcing "you CANNOT have alcohol" would be overreach.
But I do think that barriers between normal people and harmful things should exist - and if people still manage to inflict significant harm on themselves and others, then maybe the barriers should be made taller.
Alcohol has a measurable death toll - and it doesn't just harm the user. It may be worth taking measures against that - by de-normalizing drinking, taxing alcohol more heavily, preventing alcohol from being sold in grocery stores, etc.
An awful lot of "promising new architectures" are being thrown around. Few have demonstrated any notable results whatsoever. Fewer still have demonstrated an ability to compete with transformer LLMs on the kinds of tasks transformer LLMs are well suited for.
It's basically just Mamba SSM and diffusion models, and they aren't "better LLMs". They seem like sidegrades to transformer LLMs at best.
HRMs, for example, seem to do incredibly, suspiciously well on certain kinds of puzzles, but I have yet to see them do anything in the language domain, or in math, coding, etc. Are HRMs generalists, like transformers? No evidence of that yet.
Concretely, these are the developments I am predicting within the next six months (i.e. before Feb 1st 2026) with ~75% probability:
Basically, off the top of my head: I'd put 10% on that. Too short of a timeframe.
Is it different in nature or merely in scale?
The vast majority of the human population can now afford a personal yes-man - for the first time in their lives. We're sampling from a much wider pool than the usual self-selection of "people important enough to be sucked up to".