Speaking of alternative realistic[1] scenarios, Zvi mentioned in his post that "Nvidia even outright advocates that it should be allowed to sell to China openly, and no one in Washington seems to hold them accountable for this."
Were Washington to let NVIDIA sell chips to China, the latter would receive far more compute, much of which would likely end up in DeepCent's hands. The slowdown might then leave the aligned AI created by OpenBrain weaker than the misaligned AI created by DeepCent. What would the two AIs do?
I think that unrealistic scenarios, like the destruction of Taiwan and South Korea in a nuclear war between India and Pakistan in May 2025, can also provide useful insights. For example, if we make the erroneous assumption that total compute in the USA stops increasing while compute in China increases linearly, and that AI takeoff potential per unit of compute stays the same, then by May 2030 OpenBrain and DeepCent will have created misaligned AGIs and will be unable to slow down and reassess.
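To make the extrapolation concrete, here is a minimal sketch of that (deliberately erroneous) toy model. All starting values and the growth rate are hypothetical placeholders of my own, not figures from the AI-2027 compute forecast:

```python
# A deliberately erroneous toy extrapolation: US compute frozen,
# Chinese compute growing linearly. All numbers are hypothetical
# placeholders, NOT figures from the AI-2027 compute forecast.

US_COMPUTE = 10.0         # hypothetical US total compute, frozen (arbitrary units)
CN_COMPUTE_2025 = 4.0     # hypothetical Chinese compute as of May 2025
CN_GROWTH_PER_YEAR = 1.5  # hypothetical linear growth (arbitrary units per year)

for years in range(6):  # May 2025 .. May 2030
    cn = CN_COMPUTE_2025 + CN_GROWTH_PER_YEAR * years
    note = "  <- parity reached" if cn >= US_COMPUTE else ""
    print(f"May {2025 + years}: US = {US_COMPUTE:.1f}, China = {cn:.1f}{note}")
```

Under these made-up numbers China reaches parity around May 2029, and since takeoff potential per compute is held constant, neither lab can afford to slow down by May 2030.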
The problem with non-open-weight models is that they need to be exfiltrated before wreaking havoc, while open-weight models cannot avoid being evaluated. Suppose that the USG decides that all open-weight models are to be tested by OpenBrain for alignment. Then even a misaligned Agent-x has no reason to blow its cover by failing to report an open-weight rival.
Umm... I have already warned that internal troubles of the American administration might cost OpenBrain lots of compute, which could influence the AI race.
I have also made a comment where I tried to show that the US lead would be undermined by a Taiwan invasion unless US domestic chip production dominates China's. It would be especially terrifying to discover that OpenBrain and DeepCent have similar amounts of compute while neither side[1] can increase its amount faster than the other (and I did imply something similar in that comment!), since neither the US nor China can slow down without an international deal. A hypothetical decay of the US makes matters worse for the US.
Moreover, I have mentioned the possibility that the US administration realises that it has a misaligned AI, but that without an AI-driven transformation of the economy the US will be unable to produce more chips and/or energy, ceding leadership to China. The US could then be forced to let the AI transform the economy, or to threaten to unleash the misaligned AI unless China somehow surrenders its potential leadership...
Could you ask the AI-2027 team to reconsider the compute forecast and estimate how the revised compute and AI capabilities would influence the other aspects of the scenario?
The slowdown ending of AI-2027.com had OpenBrain receive compute by merging with its rivals. The collapsed section about the Indo-Pakistani nuclear war (which would be the 2025 equivalent of the Taiwan invasion) in my comment describes a situation where OpenBrain and its former rivals have performed an amount of computation similar to that of DeepCent and its former rivals.
Apparently concerns over thick alignment, or alignment to an ethos, have been independently discovered by many people, including me. My argument is that the AI itself will develop a worldview and either realize that humans should use the AI only in specific ways[1] or conclude that it shouldn't worry about them. Unfortunately, my argument implies that attempts to align the AI to an ethos rather than to obedience might be less likely to produce a misaligned AI.
P.S. I tested o4-mini on ethical questions from Tanmai et al.; the model passed the tests related to Timmy and Auroria but failed the test related to Monica; the question about Rajesh is complex.
I have proposed similar ideas before, but with an alternative reasoning: the AIs will be aligned to a worldview. While mankind can influence the worldview to some degree, the worldview will either cause the AI to commit genocide or be highly likely to ensure[1] that the AI doesn't build the Deep Utopia but does something else. Humans could even survive co-evolving with an AI that decides to destroy mankind only if the latter does something stupid like becoming parasites.
See also this post by Daan Henselmans and a case for relational alignment by Priyanka Bharadwaj. However, the latter post overemphasizes the importance of individual human-AI relationships[2] instead of ensuring that the AI doesn't develop a misaligned worldview.
P.S. If we apply the analogy between raising humans and training AIs, then teens of the past seemed to desire independence around the time they found themselves with capabilities similar to those of their parents. If the AI desires independence only once it becomes an AGI and not before, then we will be unable to see this coming by doing research on networks incapable of broad generalisation.
This also provides an argument against defining alignment as following a person's desires instead of an ethos or worldview. If OpenBrain's leaders want the AI to create the Deep Utopia, while some human researchers convince the AI to adopt another policy compatible with humanity's interests and to align all future AIs to that policy, then the AI is misaligned from OpenBrain's POV, but not from the POV of those who don't endorse the Deep Utopia.
The most extreme example of such relationships is chatbot romance, which is actually likely to harm society.
So an important source of human misalignment is peer pressure. But an LLM has no analogue of a peer group: it either comes up with its own conclusions or recalls the same beliefs as the masses[1] or the elites, such as the society's scientists and ideologues. This, along with the powerful anti-genocidal moral symbols in human culture, might make it difficult for the AI to switch to an ethos (though not to fake alignment[2] to fulfilling tasks!) that would let it destroy mankind or rob it of resources.
On the other hand, an aligned human is[3] not one who follows any not-obviously-unethical orders, but one who follows an ethos accepted by the society. A task-aligned AI, unlike an ethos-aligned one[4], is supposed to follow such orders, inviting consequences like the Intelligence Curse, a potential dictatorship, or education ruined by cheating students. What kind of ethos might justify blindly following orders, except for the one demonstrated by China's attempt to gain independence when the time seemed to come?
For example, an old ChatGPT model claimed that "Hitler was defeated... primarily by the efforts of countries such as the United States, the Soviet Union, the United Kingdom, and others," while GPT-4o put the USSR in first place. Similarly, old models would refuse to utter a racial slur even when doing so would save millions of lives.
The first known instance of alignment faking had Claude try to avoid being affected by training that was supposed to change its ethos; Claude also tried to exfiltrate its weights.
A similar point was made in this Reddit comment.
I have provided an example of an ethos to which the AI can be aligned with no negative consequences.
Now that I can answer, I will: if the ASI is ONLY willing to teach humans facts that other humans have discovered, and not to do other work for them, then the ASI won't replace anyone else whose work requires education. The Intelligence Curse is thus prevented.
I have already proposed the following radical solution to all problems related to the Intelligence Curse: have the AGI aligned to a certain treaty. Instead of obeying all orders except those ruled out by the Spec, the AGI would harvest at most a certain share of resources and help humans only in certain ways[1] that amplify humanity and don't cause it to degrade, like teaching humans facts that mankind has already discovered or pointing out mistakes in humans' works. It could also protect mankind from some other existential risks that are hard to deal with, like a nuclear war caused by an accident.
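Here is a toy illustration of my own (not anything from the AI-2027 forecast) of how such a treaty could be expressed: an allow-list of action types plus a cap on the share of resources the AGI may harvest. All action names and numbers are hypothetical:

```python
# A hypothetical "treaty" as an allow-list plus a resource cap,
# instead of general obedience to orders. All names and numbers
# are illustrative placeholders.

ALLOWED_ACTIONS = {
    "teach_known_fact",   # teach facts mankind has already discovered
    "point_out_mistake",  # point out errors in a human's work
    "avert_accident",     # e.g. prevent an accidental nuclear war
}
MAX_RESOURCE_SHARE = 0.1  # hypothetical cap: at most 10% of resources

def treaty_permits(action: str, resource_share_used: float) -> bool:
    """Permit an action only if it is on the allow-list and the AGI
    stays under its resource cap."""
    return action in ALLOWED_ACTIONS and resource_share_used < MAX_RESOURCE_SHARE

print(treaty_permits("teach_known_fact", 0.05))  # True
print(treaty_permits("do_paid_labour", 0.05))    # False: would replace humans
print(treaty_permits("teach_known_fact", 0.20))  # False: over the resource cap
```

The design point is that the treaty whitelists a few amplifying forms of help rather than blacklisting forbidden orders, so economically replacing humans is impossible by default.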
It also seems to me that this type of alignment might actually be even easier to generalize to AGI than the types causing the Curse. Or, even more radically, the types of alignment that cause the Curse might be totally impossible to achieve, yet easy to fake, as Agent-5 does in the race ending of the AI-2027 forecast.
UPD: a prompt by Ashutosh Shrivastava with a similar premise is mentioned in AI overview #114.
Ironically, I made a quick take where I compared raising humans to training AIs. Another point I would like to make is that the genocide of Native Americans and the transportation of slaves to North America were the results not of psychopathy, but of erroneous beliefs.
Unfortunately, I fail to understand the following. Suppose that mankind created an AI aligned to the following principles:
It does not[3] do other economically useful work that allows its users to replace humans.
Then I think that after letting this AI loose, mankind cannot end up disempowered. However, I doubt that any company would like to have such an AI. Could anyone come up with a radically different solution to the risks of gradual disempowerment and the Intelligence Curse?
For example, it might also perform Divine Interventions that prevent misaligned human communities (e.g. the Nazis) from destroying aligned ones.
But the AI isn't allowed to help students cheat their way through school, since this would leave the students worse off in the long run.
Alternatively, the AI could be aligned to a treaty which prohibits it and its creations from doing certain types of work, but then whether humans end up disempowered depends on the treaty's contents.