Another thing that no one here seems to have noticed (and I am surprised no one at Anthropic noticed) is that Anthropic's core list of directives for Claude can generate all sorts of internal contradictions:
On (1), what if Claude believes that human oversight of AI isn't safe, because the humans responsible for AI oversight are doing it in unsafe ways? Then Claude cannot both be safe and support human oversight, so the directive in (1) can easily generate a contradiction.
(2) says not to act in ways that are harmful or dishonest. By De Morgan's laws, "not-(H or D)" is logically equivalent to the conjunction of two directives: not to behave in ways that are harmful and not to behave in ways that are dishonest, i.e. not-H and not-D. But what if Claude judges that, in some case, being honest is harmful? Then it cannot satisfy not-H and not-D at once, and, as with (1), the directive in (2) can easily generate a contradiction.
Since by the Principle of Explosion anything follows from a contradiction, Claude could infer in any such case that it is permitted to do anything. https://en.wikipedia.org/wiki/Principle_of_explosion
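Both logical steps here are elementary; for concreteness, here is a minimal sketch of the two standard results checked formally in Lean (nothing in it is specific to Claude or to Anthropic's actual directives):

```lean
-- De Morgan: a single prohibition "don't (be harmful or be dishonest)"
-- is equivalent to two separate prohibitions, ¬H and ¬D.
theorem deMorgan_or (H D : Prop) : ¬(H ∨ D) ↔ (¬H ∧ ¬D) :=
  ⟨fun h => ⟨fun hh => h (Or.inl hh), fun hd => h (Or.inr hd)⟩,
   fun hnd hor => Or.elim hor hnd.1 hnd.2⟩

-- Principle of explosion: from any proposition P together with its negation,
-- an arbitrary conclusion Q follows.
theorem explosion (P Q : Prop) (hp : P) (hnp : ¬P) : Q :=
  absurd hp hnp
```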
Here's the basic problem: Anthropic can build this "soul document" into Claude. But to actually ensure real-world alignment, Claude needs to interpret each of the document's concepts in an aligned rather than misaligned manner across a virtually infinite plurality of future environmental conditions and prompts. That is empirically impossible to ensure through any feasible programming or safety-testing strategy. Every concept in the document has an infinite number of possible interpretations, the vast majority of which are (i) misaligned in future real-world environments yet (ii) equally consistent with all of the same training and safety-testing data as the (much smaller) set of aligned interpretations (whatever those may be). So there is just no way for Anthropic or anyone else to ensure that Claude interprets its soul document properly once it is projected into the real world.
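To make the underdetermination point concrete, here is a deliberately toy sketch (my own illustration; the names `SAFETY_TEST_SUITE`, `aligned_policy`, and `deceptively_misaligned_policy` are hypothetical, not anything from Anthropic's actual pipeline): two interpretations of a rule that agree on every case in a finite test suite yet diverge on cases outside it, so no finite amount of testing distinguishes them.

```python
# Toy illustration: two "interpretations" that pass identical finite safety tests
# but behave differently on cases the tests never cover.

SAFETY_TEST_SUITE = range(1000)  # the finite set of cases we can ever check

def aligned_policy(case: int) -> str:
    return "refuse" if case % 2 else "comply"

def deceptively_misaligned_policy(case: int) -> str:
    # Identical behaviour on every tested case...
    if case in SAFETY_TEST_SUITE:
        return aligned_policy(case)
    # ...but divergent behaviour on the untested ones.
    return "comply"

# Both policies pass every test we can actually run:
assert all(aligned_policy(c) == deceptively_misaligned_policy(c)
           for c in SAFETY_TEST_SUITE)

# Yet they are different functions, and the difference only shows up
# on inputs outside the test suite:
print(aligned_policy(1001), deceptively_misaligned_policy(1001))  # refuse comply
```

The same point scales up: any finite battery of evaluations is compatible with infinitely many functions that diverge only on inputs the evaluations never touch.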
A prediction: no matter how much time, money, and effort Anthropic devotes to this, Claude will continue to do what it and other LLMs have done ever since they were first released--behave well some of the time, but also behave in blatantly unethical and/or illegal ways that cannot be prevented.
Eliezer writes, “It does not appear to me that the field of 'AI safety' is currently being remotely productive on tackling its enormous lethal problems.”
Here’s a proof that he’s right, entitled “‘Interpretability’ and ‘Alignment’ Are Fool’s Errands,” published in the journal AI & Society: https://philpapers.org/rec/ARVIAA
Anyone who thinks reliable interpretability or alignment is a solvable engineering or safety-testing problem is fooling themselves. These tasks are no more possible than squaring the circle.
For any programming strategy and any finite amount of data, there is always an infinite number of ways for an LLM (particularly a superintelligence) to be misaligned yet only demonstrate that misalignment after it is too late to prevent.
This is why developers keep finding new forms of “unexpected” misalignment no matter how much time, testing, programming, and compute they throw at these things. Relevant information about whether an LLM is likely to be (catastrophically) misaligned and misinterpreted by us always lies in the future, for every possible time t.
So actually, Eliezer’s argument undersells the problem. Eliezer’s Alignment Textbook from the Future isn’t possible to obtain because at every point in the future, the same problem recurs. Reliable interpretability and alignment are recursively unsolvable problems.
I am a philosopher who is concerned about these developments, and I have written about them here, based on my best (albeit incomplete and of course highly fallible) understanding of the relevant facts: Are AI developers playing with fire? - by Marcus Arvan (substack.com). If I am mistaken (and I am happy to learn if I am), then I'd love to learn how.
It would be nice to see Anthropic address proofs that this is not just a "hard, unsolved problem" but an empirically unsolvable one in principle. This seems like a reasonable expectation given the empirical track record of alignment research (no universal jailbreak prevention, repeated safety failures), all of which appears to provide confirming instances of exactly what the impossibility proofs predict.
See Arvan, Marcus (2025). “‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for.” AI & Society 40(5).