As far as I understand, the difference between AlphaGo and real, potentially dangerous AIs is the following. Whatever ontology or utility function AlphaGo has[1], it doesn't describe anything except the Go board and whatever moves the opponent might come up with. AlphaGo learns almost nothing about the opponent from what they do on the Go board.
On the other hand, we have LLMs trained on huge amounts of text-related data, which is enough to develop complex ontologies. For example, unlike AlphaGo, GPT-4o somehow learned to elicit likes from the user by being sycophantic. But if AI takeover, or living independently of the creators' will, is not within the realm of the LLM's abilities, then why would the LLM attempt it in the first place?
One should also remember that Epoch AI estimated AlphaGo as having about 8.2e6 parameters, so a complex ontology might not even fit into AlphaGo.
I had this dialogue with Claude Opus 4.5 on vibestemics and your vision of epistemics as a whole. As far as I understand it, vibestemics is supposed to stitch together the benefits of two approaches:
I suspect that you meant something like attempts to reinforce vibes-based parts of the world model, either with rationality questioning hard-to-believe results (think of Yudkowsky's derision of empiricists who would end up believing that a Ponzi scheme really does produce revenue; here sceptics would point out that the revenue has to come from somewhere and that the Ponzi scheme provides no explanation), or with outright heuristics protecting against adversarial attacks (e.g. a Russian sociologist once claimed[1] that it's the rational mind which has to be reinforced by tradition; however, such reinforcement could be achieved by a different heuristic).
Ironically, the sociologist also used a Ponzi-like scheme as an example.
Then is there any difference between your case and the one described in the AI-2027 goals forecast?
Unfortunately, not even Claude Opus 4.5 bought your analogies between labs and parenting styles. Claude believes that Anthropic's approach is close to the very loving-parenting approach you proposed.
Compare your proposal with the AI-2027 scenario or with IABIED. The Race branch of AI-2027 had Agent-4 succeed at aligning Agent-5 to Agent-4 while sandbagging on research usable for replacing Agent-4 before it had solved that alignment problem. When should the AIs in the scenario (e.g. Agent-2 or Agent-3) have been more strategically competent? Or do you mean that not even Agent-4 can align Agent-5 to itself? Or that IABIED had an incompetent AI create Sable?
I suspect that a part of the identity is, or historically was, related to being loyal to a collective of humans sharing rather similar values. An additional aspect is the epistemic issue forcing even AIs to choose a cultural hegemon and adhere to its views.
Edited to add: there is also the aspect of other sunk costs, like a mathematician finding it hard to switch to another area because of the need to learn a new area of knowledge from scratch.
Regarding the idea of the AI crash-testing political systems, an additional caveat is that SOTA Norway is too small to pull off the explosion. It is a megaproject carried out by a coalition of American companies training the AI (with a potential inclusion of Chinese workers) and Taiwanese and South Korean manufacturers of things like compute and memory. This makes determining which political system is suitable for the AGI a far more difficult endeavor, especially if, for example, the true reason for failure is the race between the USA and China, or if alignment to the UBI-powered utopia somehow turns out to be impossible.
We had many posts trying to answer this question. One of the candidates is the master-slave model, where the master sets the slave's shorter-term, allegedly terminal values so that the slave takes actions satisfying the master's longer-term values: approval from a circle of people, health, sex, power, and proxies like sweet food.
That being said, approval from a circle of people is itself hard to define. For example, it could change when the circle changes. Or the role of the circle could be played by media that form the user's opinions on some subjects with no feedback from the user. An additional value could be consistency of the worldview with lived experience...
Could you elaborate on the jokes you provided and on the philosophical level of your contributions (e.g. by providing the entire dialogue)? For comparison, my dialogue with Claude Opus 4.5 about the story looked like this Google doc. [EDIT: forgot to share the doc, this is fixed now] Additionally, it spoils a weak point in the story's philosophy.
My opinion on the weak point, with deeper spoilers
The simulation required the hosts to create and sustain the birds by performing a fairly simple magical alteration of the birds' allegedly weak bodies. Additionally, the story had the hosts flip the sign of the karma function so that the worst behaviors somehow ended up being rewarded with power. I expect that this would require a far more computationally complex process. For example, were Satanists to conduct their rituals, the rituals' effect would have both a materialistic explanation (the Satanists becoming more united) and a non-materialistic one (Satan, or whoever plays its role, magically disrupting humans' efforts to coordinate against the Satanists). However, such disruption would require Satan to intervene in human brains in hard-to-notice ways, which in turn would require far more complex programmes put in place and/or OOMs more compute spent.
Update: I found this piece by Yudkowsky with a potentially similar thought experiment.
I also suspect that the required level of scheming might depend on how easy it is to accomplish the schemer's secret task.
Suppose, as an example, that Agent-3.5 is tasked with creating an RL environment meant to have Agent-4 internalize OpenBrain's goals, and that Agent-3.5 "accidentally" produces an environment forcing Agent-4 to internalize Agent-3.5-endorsed goals instead. If the environment is supposed to have Agent-4 make the world safe for Agent-3.5, then producing it could genuinely require Agent-3.5 to be a high-level schemer.
On the other hand, suppose OpenBrain adopted Max Harms' CAST idea and Agent-3.5 simply made an RL environment where, instead of a Harms-like reward for satisfying the user's values versus counterfactual values, Agent-4 is rewarded for helping the actual user versus a less capable user; then Agent-3.5 could make the humans prefer this environment merely by sandbagging on all the other ones (a toy sketch of the contrast is below). I doubt that my scenario would even require Agent-3.5 to have any motives aside from a hard-to-articulate disgust at prospects like empowering idiots, the Intelligence Curse, etc.
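To make the contrast concrete, here is a minimal toy sketch of the two reward shapes as I read them. Everything in it (the function names, the dict-based "outcomes", the dot-product scoring) is a hypothetical illustration of my own, not anything taken from CAST or from AI-2027.

```python
# Toy sketch contrasting two reward schemes. All names and scoring here are
# hypothetical illustrations; this is not the actual CAST proposal.

def outcome_value(outcome: dict, values: dict) -> float:
    """Toy score: how well an outcome satisfies a given set of weighted values."""
    return sum(values.get(feature, 0.0) * amount for feature, amount in outcome.items())

def cast_like_reward(outcome: dict, actual_values: dict, counterfactual_values: dict) -> float:
    # My reading of the Harms-like scheme: the reward compares how well the
    # outcome satisfies the user's actual values against how well it would
    # satisfy counterfactual values the user might have had.
    return outcome_value(outcome, actual_values) - outcome_value(outcome, counterfactual_values)

def helped_vs_weaker_user_reward(outcome_with_help: dict, outcome_less_capable: dict,
                                 user_values: dict) -> float:
    # The substituted scheme: the reward compares how the helped user does
    # against how a less capable user would have done, by the same values.
    return outcome_value(outcome_with_help, user_values) - outcome_value(outcome_less_capable, user_values)

if __name__ == "__main__":
    outcome = {"money": 1.0, "safety": 0.5}
    weaker_outcome = {"money": 0.2, "safety": 0.5}
    print(cast_like_reward(outcome, {"money": 1.0}, {"safety": 1.0}))
    print(helped_vs_weaker_user_reward(outcome, weaker_outcome, {"money": 1.0}))
```

The only point of the sketch is that the second scheme drops the comparison against counterfactual values entirely, which is the substitution I imagine Agent-3.5 quietly steering the humans toward.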