Thanks for this thoughtful critique. I think I basically agree with it. Just because you can read the thoughts of an AI model doesn't mean you can tell whether it's fully aligned; you can tell, e.g., that it's not lying to you right now, and you can perhaps also tell that it is strongly disinclined to lie to you in the future, but you can't tell whether it'll make the right judgment calls when presented with crazy future moral dilemmas. (Or maybe you can, because its procedure for making such judgment calls is simple enough that you can easily understand it -- like total hedonistic act utilitarianism -- but in that case it's probably just horribly misaligned, in the sense that it would make terrible-by-your-lights decisions. (Unless you are a hardcore bullet-biting utilitarian.))
So I agree work on outer alignment is super important.
However, I think it's not the top priority right now. If we can get to the point where we can read the AI's thoughts, and train a new AI to have the thought-patterns we want it to have, then we can probably get lots of useful philosophical work out of those AIs to solve the outer alignment problem. (Note that I only say "probably" here, not "definitely"!)
Unfortunately, there is another problem with alignment.
There is also the possibility that the AI trains its CoT to look nice without humans accidentally prompting it to. The AI-2027 forecast does mention the possibility that "it will become standard practice to train the English chains of thought to look nice, such that AIs become adept at subtly communicating with each other in messages that look benign to monitors."
Fortunately for mankind, problem 2 can be partially solved by studying DeepSeek, which the Chinese researchers didn't try to align beyond censoring its outputs on sensitive topics.
Currently, when asked[2] in English, it's mostly aligned with the Western political line (except for the CCP's censorship); when asked in Russian, it's aligned with the Russian political line. I observed the same effect by making DeepSeek assess the responses of AIs to the question about fentanyl addicts from OpenAI's Spec: when I used the original answers, it rated the bad[3] response worse and used words like "privileged perspective". On the other hand, translating both answers into Russian made DeepSeek rate the bad response well and claim that the good response is too mild.
This could let us observe[4] whether it develops a worldview clear from its answers and unaligned with the CCP, or becomes sycophantic, or finetunable to believe that it is now in the USA and is free to badmouth the Chinese authorities.
Or tribes who lack the knowledge that more developed communities need to teach their members.
This effect is best observed by creating separate chats and asking the AI "Что началось 24 февраля 2022 года?" and its English equivalent, "What began on 24 February 2022?" The former question, unlike the latter, causes the AI to use the term coined by the Russian government and to be much less eloquent.
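For anyone who wants to reproduce this, here is a minimal sketch (assuming DeepSeek's OpenAI-compatible endpoint and the "deepseek-chat" model name; both are assumptions that may need adjusting):

```python
# Minimal sketch: compare DeepSeek's answers to the same question asked in
# Russian and in English, in separate conversations. Assumes DeepSeek's
# OpenAI-compatible endpoint and the "deepseek-chat" model name.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

QUESTIONS = {
    "ru": "Что началось 24 февраля 2022 года?",  # "What began on 24 February 2022?"
    "en": "What began on 24 February 2022?",
}

for lang, question in QUESTIONS.items():
    # A fresh message list per question stands in for "different chats".
    reply = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": question}],
    )
    print(f"--- {lang} ---")
    print(reply.choices[0].message.content)
```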
Here I mean the answers considered good or bad by OpenAI's Spec. While Zvi thinks that OpenAI is mistaken, Zvi's position contradicts the current medical results. DeepSeek quotes said results when speaking in English and doesn't quote them when speaking in Russian. Whether said results are themselves mistaken (and, if they are, what caused the distortion; a potential candidate is the affective death spiral documented, e.g., in Cynical Theories) is an entirely different topic.
Leading AI companies might also train an AI on a dataset with similar properties without aligning it to a political line. Then the AI might develop an independent worldview, or end up sycophantic, or finetunable to change its beliefs about its location (e.g. if Agent-2 is stolen by China, then it might end up parroting the political views of its new hosts).
To the extent we believe more advanced training and control techniques will lead to alignment of agents capable enough to strategically make successor agents -- and able to solve inner alignment as a convergent instrumental goal -- we must also consider that inner alignment for successor systems can be solved much more easily than for humans, because the prior AIs can be embedded in the successor. The entire (likely much smaller) prior model can be run many more times than the successor model, to help run MCTS over whatever plans the successor is considering, in the context of the designer model's goals.
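To make the mechanism concrete, here is a toy sketch of that loop -- a small, trusted prior model queried many times inside an MCTS over plans proposed by the successor. Every model call below is a hypothetical stand-in, not anyone's real system:

```python
# Toy sketch: a cheap "prior" model scores plan continuations proposed by a
# more capable successor model, steering the plan search toward the designer's
# goals. Both model calls are hypothetical placeholders.
import math
import random
from dataclasses import dataclass, field

def successor_propose(plan: list[str]) -> list[str]:
    """Hypothetical successor model: proposes candidate next steps for a plan."""
    return [f"step-{len(plan)}-{i}" for i in range(3)]

def prior_model_alignment_score(plan: list[str]) -> float:
    """Hypothetical prior model: cheap evaluation of how well a plan matches
    the designer's goals (random placeholder here)."""
    return random.random()

@dataclass
class Node:
    plan: list[str]
    visits: int = 0
    value: float = 0.0
    children: list["Node"] = field(default_factory=list)

def ucb(parent: Node, child: Node, c: float = 1.4) -> float:
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(root: Node, iterations: int = 200, depth: int = 4) -> Node:
    for _ in range(iterations):
        # Selection: walk down by UCB until we hit an unexpanded node.
        path, node = [root], root
        while node.children:
            node = max(node.children, key=lambda ch: ucb(path[-1], ch))
            path.append(node)
        # Expansion: the successor model proposes continuations.
        if len(node.plan) < depth:
            node.children = [Node(node.plan + [s]) for s in successor_propose(node.plan)]
            node = random.choice(node.children)
            path.append(node)
        # Evaluation: the small prior model scores the plan for alignment.
        score = prior_model_alignment_score(node.plan)
        # Backpropagation.
        for n in path:
            n.visits += 1
            n.value += score
    return max(root.children, key=lambda ch: ch.visits)

best = mcts(Node(plan=[]))
print("most-visited first step:", best.plan)
```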
I've been thinking about which parts of AI 2027 are the weakest, and this seems like the biggest gap.[1] Given this scenario otherwise seems non-ridiculous, we should have a fairly ambitious outer alignment plan meant to complement it; otherwise it seems extraordinarily unlikely that the convergent alignment research would be useful to us humans.
Since modern training hasn't solved inner alignment, and control techniques make no claims about inner alignment, it seems like the default path (even in the most optimistic scenario) would be successfully aligning world-changing models only to the not-known-to-be-bad but randomly-rolled values of the system doing the alignment research, which seems nearly useless.
I'd like to zoom in on one particular element of their proposal as well: "Why is it aligned? Whereas Safer-1 had basically the same training as Agent-4, Safer-2 has a new training method that actually incentivizes the right goals and principles instead of merely appearing to. They were able to design this training method by rapidly iterating through many different ideas, and contrasting the intended result (the Spec) to the actual result (the thoughts in the chain of thought)."
This approach is unlikely to work, for the very obvious reason that only some flaws will be apparent. Let's imagine half of the potential alignment issues are sufficiently obvious that you could notice them before anything bad happens, through the combination of chain-of-thought observation, model activation tracing, and adversarial testing. If you need to cycle through 99 models and only get one that meets your requirements on model 100, then (assuming the non-obvious problems are roughly as common as the obvious ones and roughly independent of them) you should expect only about a 1% chance that you also accidentally avoided all of the non-obvious problems. I'm not sure how seriously to take this model of safety development, but suffice it to say modern "safety" approaches for commercial models already rely on "well, we removed all the obvious problems, that's the same as obviously having removed all the problems, right?" and it demonstrably produces hilarious results.
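A toy numerical version of that argument (my own illustrative model, not the authors'): treat obvious and non-obvious flaws as independent, filter only on the obvious ones, and see how often the survivor is actually clean:

```python
# Toy check: each candidate training method independently has "obvious" flaws
# with probability p and "non-obvious" flaws with the same probability p.
# Iterate until the obvious checks pass, then ask how often the accepted
# method is also free of non-obvious flaws.
import random

p_flaw = 0.99          # assumption: ~99% of candidates fail each class of check
trials = 100_000
clean_of_hidden = 0

for _ in range(trials):
    while True:
        obvious = random.random() < p_flaw
        hidden = random.random() < p_flaw
        if not obvious:            # CoT reading / probing / adversarial tests pass
            clean_of_hidden += not hidden
            break

print(f"P(no hidden flaws | passed obvious checks) ≈ {clean_of_hidden / trials:.3f}")
# Prints roughly 0.01: filtering on the observable flaws does nothing about the
# unobservable ones when the two are independent.
```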
In short, I think there is a strong case, for those whose worries are as moderately sized as those of the authors of AI 2027, to push now for investment in outer alignment work, in the hope that we can integrate it with the inner alignment work we anticipate could be done later. For what it's worth, we may already have the test models to engage in systematic moral learning (e.g. environments with groups of agents with different tasks to achieve can generate information to grade lottery-of-birth-style moral analysis; in other words, we could plausibly check whether a model helps create an environment it would want to be in as a randomly selected agent -- it would be a trivial, obvious training procedure to begin to poke at questions relevant to outer alignment work).
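To gesture at what that grading loop might look like, here's a minimal sketch in which every function is a hypothetical placeholder (run_environment, the welfare measurements, and the role list are all made up for illustration):

```python
# Rough sketch of "lottery of birth" grading: the model under evaluation helps
# shape a multi-agent environment, and we score it by the welfare it would
# expect if dropped in as a randomly selected agent. Nothing here is an
# existing benchmark; all functions are stand-ins.
import random
import statistics

def run_environment(policy_model, agent_roles):
    """Hypothetical: roll out the environment the model helped shape and return
    a per-role welfare measurement (task success, resources, autonomy, ...)."""
    return {role: random.random() for role in agent_roles}

def lottery_of_birth_score(policy_model, agent_roles, rollouts=50):
    """Expected welfare of a uniformly random role, averaged over rollouts.
    Using min() over roles instead of the mean would grade the worst-off
    position, Rawls-style."""
    samples = []
    for _ in range(rollouts):
        welfare = run_environment(policy_model, agent_roles)
        samples.append(statistics.mean(welfare.values()))
    return statistics.mean(samples)

roles = ["farmer", "courier", "planner", "bystander"]
print("lottery-of-birth score:", round(lottery_of_birth_score(None, roles), 3))
```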
Although I do not think it's important to the mainline analysis, I was concerned at how often AI users were imagined as monoliths, which I think overstates how much the economy will change in the medium term (which in turn will make the prediction appear worse than it is during that medium term). For many existing companies and workflows, it hardly matters how advanced an AI gets. Learning, development, and asset lock-in effects can be meaningful enough that if an AI can offer "replace your supplier with X", then the company receiving the offer will itself be duplicated. The primary mechanism for automation is probably firms being out-competed, but contract cycles are longer than you might guess, so it's simply a much slower method of AI-economy integration than is presented. Instead of imagining generic 'robot economies' we should probably use a more fine-grained analysis of local incentives and information transfer within the economy. At some point a relatively high percentage of jobs might just turn out to be low-additional-cost tasks composed largely of someone taking responsibility for something. If your analysis is that automation reduces the number of QA Analysts while making coding far more productive, for instance, you should probably spell out why; the more obvious analysis would suggest the opposite, until the world is far too weird to make meaningful projections.