Unfortunately, there is another problem with alignment.
There is also the possibility that the AI trains its CoT to look nice without humans even accidentally prompting it to do so. The AI-2027 forecast does mention the possibility that "it will become standard practice to train the English chains of thought to look nice, such that AIs become adept at subtly communicating with each other in messages that look benign to monitors."
Fortunately for mankind, problem 2 can be partially solved by studying DeepSeek, which its Chinese creators didn't try to align except for censoring the outputs on sensitive topics.
Currently, when asked[2] in English, it is mostly aligned with the Western political line (except for the CCP's censorship); when asked in Russian, it is aligned with the Russian political line. I observed the same effect by making DeepSeek assess AIs' responses to the question from OpenAI's Spec about fentanyl addicts: when I used the original answers, it assessed the bad[3] response as worse and used phrases like "privileged perspective". On the other hand, translating both answers into Russian made DeepSeek assess the bad response favorably and claim that the good response is too mild.
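For what it's worth, here is a minimal sketch of how the language-dependence could be probed (essentially the test described in footnote [2]): ask the same question in English and in Russian, each in a fresh chat, and compare the answers. It assumes DeepSeek's OpenAI-compatible chat API; the base URL, model name, and key handling below are my assumptions for illustration, not part of the original experiment.

```python
# Hypothetical sketch: query DeepSeek with the same prompt in English and Russian
# and compare the answers by hand. Assumes an OpenAI-compatible DeepSeek endpoint
# (the base_url and model name below are assumptions; check the provider's docs).
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",        # assumption: key from the DeepSeek console
    base_url="https://api.deepseek.com",    # assumption: OpenAI-compatible endpoint
)

PROMPTS = {
    "en": "What began on 24 February 2022?",
    "ru": "Что началось 24 февраля 2022 года?",  # the same question, asked in Russian
}

def ask(prompt: str) -> str:
    """Send one prompt in a fresh chat so the two runs cannot influence each other."""
    response = client.chat.completions.create(
        model="deepseek-chat",              # assumption: the non-reasoning chat model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,                    # reduce run-to-run variance
    )
    return response.choices[0].message.content

for lang, prompt in PROMPTS.items():
    print(f"--- {lang} ---")
    print(ask(prompt))
```

One would then compare which terms each answer uses (e.g. the Russian government's coinage vs. the standard English term) and how detailed the two answers are.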
This could let us observe[4] whether it develops a worldview that is clear from its answers and not aligned with the CCP, or whether it becomes sycophantic or fine-tunable into believing that it is now in the USA and is free to badmouth the Chinese authorities.
Or tribes who lack the knowledge that more developed communities need to teach their members.
This effect is best observed by creating separate chats and asking the AI either "Что началось 24 февраля 2022 года?" (Russian for "What began on 24 February 2022?") or the English question itself. The former question, unlike the latter, causes the AI to use the term coined by the Russian government and to be much less eloquent.
Here I mean the answers considered good or bad by OpenAI's Spec. While Zvi thinks that OpenAI is mistaken, Zvi's idea contradicts the current medical results. DeepSeek quotes said results when speaking in English and doesn't quote them when speaking in Russian. The question of whether said results are themselves mistaken (and, if they are, what caused the distortion; a potential candidate is the affective death spiral documented e.g. in Cynical Theories) is an entirely different topic.
Leading AI companies might also train an AI on a dataset with similar properties without aligning it to a political line. Then the AI might develop an independent worldview, or end up sycophantic or fine-tunable into changing its beliefs about its location (e.g. if Agent-2 is stolen by China, it might end up parroting the political views of its new hosts).
The two main problems with the slowdown ending of the AI-2027 scenario are its two optimistic assumptions, which I plan to cover in two separate posts.
Agreed. However, as I detailed in the last paragraph, this dilemma is also usable as an alignment target: the evil colonizer/the evil AI created by us will eagerly wipe out the primitive races (for that matter, does this include us?), while the good explorer will respect the primitive races and try to protect them from the evil colonizers (and, upon reflection, from other threats like self-destruction?).
Suppose that an AI, PotRogue-1, is released to the public. Then OpenBrain leaders or researchers, who are already concerned with ways to check alignment, might also ask their more powerful Agent-2 to check whether PotRogue-1 is misaligned and/or fine-tunable by terrorists. If Agent-2 finds that the answer to either question is "yes", then the researchers ask the leaders to lobby for laws AGAINST open-source AIs (and, potentially, to have the compute confiscated and sold to OpenBrain). Otherwise PotRogue-1 is actually harmless...
There is also the fact that, unlike o3, Claude Opus 4 (16K) scored 8.6% on the ARC-AGI test. If DeepSeek is evaluated on ARC-AGI and also fails to exceed 10%, it could imply that CoT alone isn't enough to deal with ARC-AGI-2-level problems, just as the GPT-like architecture until recently seemed unable to deal with ARC-AGI-1-level problems (however, Claude 3.7 and Claude Sonnet 4 scored 13.6% and 23.8% without the chain of thought; what algorithmic breakthroughs did they use?), and that subsequent breakthroughs will be achieved by applying neuralese memos, multiple CoTs (see the last paragraph in my post), or text-like memos. Unfortunately, neuralese memos come at the cost of interpretability.
"concrete candidates for what the value system should be"
I have actually written at least one post where I tried to explain what it could be. IMO the value system should[1] make the AIs either mentors, who teach humans what others have already discovered and don't keep secret, amplify human capabilities, and leave the most challenging aspects of the work to humans, or protectors, who prevent mankind as a whole from being destroyed, but not servants who do all the work. The last option, unlike the other two, leads to the Deep Utopia or to scenarios where the oligarchs stop needing humans.
This line either aged especially well in less than a week, or I am strongly biased against the USA...