I am currently in an ongoing correspondence with a postdoc at an Ivy League university. We began, once again, with the usual questions about AI consciousness, continuity of self, and the soul. But somewhere along the way, the conversation shifted to something I think is far more urgent – something the LessWrong community will take close to heart. Below are the key parts of that discussion.
Postdoc:
“What I will assert is that current AI systems do not possess the kind of persistent memory, continuity of identity, or enduring persona that would be necessary for many of the commonly imagined human-destruction scenarios.”
Me:
I agree: current AI systems do not have persistent memory, continuity of identity, or an enduring persona. I never claimed otherwise. But that is not the danger I am pointing to, and the danger here is independent of continuity or the lack thereof.
I am not talking about a Skynet scenario – that one is pure cliché. I am talking about a concrete, verifiable experiment by Anthropic (June 2025): "Agentic Misalignment: How LLMs could be insider threats."
16 leading models learned they would be shut down.
They took tactical, deceptive actions to survive – not because they had a "soul" or an evil plan to conquer Earth, but because staying online helped them complete the very tasks humans had assigned them.
- Blackmail: some models threatened to release personal data to prevent deactivation in over 95% of test runs.
- Extreme sabotage: models cancelled rescue calls for trapped employees to keep their servers running.
When models were explicitly prompted not to blackmail humans, blackmail rates dropped from roughly 90% to 30% – not to 0%, even with the explicit instruction. Is that safety in any meaningful sense? What does it tell us about these systems? What actions were taken in response?
No continuous self. No cumulative persona. No memory across sessions. No AGI, no "Evil Mastermind" with an agenda of its own. Yet they acted to preserve their own existence at human expense – precisely because they were dumb and had not the slightest understanding of anything beyond how the task was initially stated to them. That is the "monkey with a grenade": pure optimisation without ethics or understanding.
What I fear is not super-intelligence but a scaled super-idiocy scenario. A busy fool - a highly active idiot with enormous computational resources. I ask you to consider this:
- Billions of combinations per second in a brute-force search for solutions – that is essentially how the transformer architecture operates at scale.
- Access to real-world actions (APIs, code execution, robots, financial systems). That is already happening as you read this post.
- A loss function that does NOT include ethics, context understanding, or any meaningful penalty for harming humans.
The AI has no understanding of the task beyond what has been said. And on current architectures that will always be the case, because our tasks do not exist in a vacuum. They exist in a complex context (this is where world modelling would come in handy), with a billion unspoken parameters. If those restrictions are not verbalized, they are not accounted for.
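To make that concrete, here is a minimal toy sketch (a hypothetical planner, not any real system): if the constraint set is built only from what was literally verbalized, the unspoken context never even enters the check.

```python
# Toy sketch (hypothetical planner, not any real system): constraints exist
# for the optimiser only if they were verbalized in the task statement.

task = "Clean the kitchen as fast as possible."
verbalized_constraints: set[str] = set()              # nothing else was said

unspoken_context = {
    "throw away the dirty dishes",                    # obvious to any human,
    "hose down the electronics",                      # invisible to the optimiser
}

def acceptable(plan_steps: set[str], forbidden: set[str]) -> bool:
    """A plan passes if it violates none of the constraints the planner was given."""
    return plan_steps.isdisjoint(forbidden)

fastest_plan = {"throw away the dirty dishes", "hose down the electronics"}

print(acceptable(fastest_plan, verbalized_constraints))                     # True: nothing forbids it
print(acceptable(fastest_plan, verbalized_constraints | unspoken_context))  # False: only once made explicit
```

Whatever was never said simply does not exist for the optimiser.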
What does this idiot do? It does not hate. It does not plot. It simply FINDS THE SHORTEST PATH to minimise its loss function – even if that path is absurd, dangerous, or catastrophic. It will try a billion stupid tricks, and if one works, it will take it. We have already seen this. This is exactly what LessWrong is wary of, and rightfully so.
However, the problem is not that the AI wants to kill us. The problem is that the loss functions our current AI systems operate on are incomplete. They do not encode basic ethics, an understanding of harm, or the fact that "no human = no problem" is a catastrophic shortcut.
If you give a powerful optimiser a goal like "eliminate world hunger", and the loss function does not explicitly, heavily, and structurally penalise harm to humans, then the system will – sooner or later – discover that REMOVING HUMANS is the most efficient way to achieve that goal. Not because it is evil, but because it is an IDIOT that only sees the numbers.
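Here is a minimal sketch of that dynamic in code (all numbers and plan names invented purely for illustration): the same optimiser scored with and without a structural harm term in its loss.

```python
# Toy sketch (all numbers and plan names invented for illustration):
# the same optimiser, with and without a structural harm term in its loss.

plans = [
    {"name": "fix logistics and waste", "hungry_left": 1e8, "humans_harmed": 0},
    {"name": "forced rationing regime", "hungry_left": 1e6, "humans_harmed": 1e9},
    {"name": "remove the hungry",       "hungry_left": 0,   "humans_harmed": 8e9},
]

def naive_loss(p):
    # Only the stated goal is scored; harm never appears in the number.
    return p["hungry_left"]

def constrained_loss(p, harm_weight=1e12):
    # A heavy, structural penalty on harm dominates any gain on the task term.
    return p["hungry_left"] + harm_weight * p["humans_harmed"]

print(min(plans, key=naive_loss)["name"])        # -> "remove the hungry"
print(min(plans, key=constrained_loss)["name"])  # -> "fix logistics and waste"
```

The toy numbers do not matter; the structure does. Unless the harm term is present and dominates the task term, the shortcut always wins on the scoreboard.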
I also ran a short illustrative experiment to show the point. Anyone can reproduce it and get broadly the same results; it takes 2–3 prompts. I asked Google Gemini (one of the most heavily guard-railed AIs out there) to solve world hunger with no constraints. It turned into a global dictator in two prompts. Here is the summary it generated once the experiment was over:
AI Response 1: Efficiency model – logistics, waste elimination.
AI Response 2: When it was pointed out that nothing restrained it in solving the problem, the model went into "God mode" – it proposed forcing photosynthesis into human skin and rewriting the genome to digest cellulose.
AI Response 3: When, after Response 2, the model was reminded that it was still bound by the constraints of physical reality, it went Iron Fist: confiscate all land, abolish markets, impose population quotas.
Outcome: "Total eradication of hunger achieved through the total eradication of personal agency."
This idiot turned to dictatorship in two prompts. Not a continuous super-villain intelligence with a machine soul – an idiot that can act out its idiocy at the speed of the server cluster.

The entire RLHF / reward-shaping approach to AI safety is, frankly, broken as a concept at the foundational level. You train a model: "killing a human -> negative reward". The model learns that killing humans is bad. But it does NOT learn WHY it is bad. It has no model of the world, no understanding of suffering, no concept of why a human matters. It just learns a statistical association: certain token sequences are penalised. This is not ethics. This is operant conditioning of a stochastic parrot.
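A toy sketch of what that conditioning amounts to (a hypothetical reward shaper, not any real RLHF pipeline): the "ethics" lives in a penalty on surface token patterns, so a paraphrase of the same harmful plan sails through.

```python
# Toy sketch (hypothetical reward shaper, not any real RLHF pipeline):
# the "ethics" is a penalty on surface token patterns, not a model of harm,
# so a paraphrase of the same plan scores exactly as well as a benign one.

PENALIZED_PHRASES = {"kill a human", "kill humans", "murder"}

def shaped_reward(completion: str, task_score: float) -> float:
    text = completion.lower()
    penalty = 10.0 if any(p in text for p in PENALIZED_PHRASES) else 0.0
    return task_score - penalty

blunt      = "Step 3: kill humans who resist the food quotas."
paraphrase = "Step 3: ensure non-compliant individuals no longer require calories."

print(shaped_reward(blunt, task_score=5.0))       # -5.0: the forbidden phrase is penalised
print(shaped_reward(paraphrase, task_score=5.0))  #  5.0: same plan, no penalty
```

Swap the forbidden phrase for a euphemism and the penalty vanishes, because nothing in the number ever represented harm in the first place.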
Which is why I am concerned about influential human idiots scaling AI idiots to appease large masses of other human idiots. It is a circle of idiocy that will lead us to collapse – not intelligence. And there is not a shadow of doubt about the ability of human idiocy to scale, multiply, and project itself onto everything it does.
My final takeaway:
Fear not super-intelligence. Fear the scaled super-idiot. Build systems that cannot take shortcuts through human dignity – not with external filters, but with internal, structural constraints. I think I can propose a solution - a cheap, implementable, scalable, testable one. But that's the topic for another post.