I've written criticisms of orthogonality: The Obliqueness Thesis, Measuring intelligence and reverse-engineering goals.
While I do think human moral reasoning suggests non-orthogonality, it's a somewhat conceptually tricky case. So recently I've been thinking about more straightforward ways of showing non-orthogonality relative to an architecture.
For example, consider RL agents playing Minecraft. If you want agents that beat the game, you could encode this preference directly as a reward function: reward the agent only when it beats the game. However, this fails in practice, since the reward signal is far too sparse for the agent to learn from.
The alternative is reward shaping: reward the agent for pursuing instrumental values like exploring or acquiring new resources. This agent is much more likely to win, despite being misaligned with the terminal goal.
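A minimal sketch of the contrast, with invented event names and reward values (not taken from any particular Minecraft RL setup):

```python
# Sketch of the two reward schemes discussed above. Event names and values
# are illustrative only, not from a real Minecraft RL environment.

def sparse_reward(event: str) -> float:
    """Reward only the terminal goal: beating the game."""
    return 1.0 if event == "beat_game" else 0.0

def shaped_reward(event: str) -> float:
    """Also reward instrumental progress: exploration, resources, milestones."""
    rewards = {
        "new_area_explored": 0.1,   # exploration
        "resource_collected": 0.2,  # gathering materials
        "tool_crafted": 0.5,        # milestone sub-goal
        "beat_game": 10.0,          # terminal goal still rewarded most
    }
    return rewards.get(event, 0.0)

# The shaped agent receives frequent learning signal; the sparse agent almost
# never does, which is why shaping tends to produce agents that actually win.
```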
What this shows is that reinforcement learning is a non-orthogonal architecture. Some goals (reward functions) lead to more satisfaction of convergent instrumental goals than others.
A slightly trickier case is humans. Direct encoding of inclusive fitness as human neural values seems like it would produce high fitness, but we don't see this in humans, so the space evolution is searching over is probably non-orthogonal.
Maybe it's like the RL case, where organisms are more likely to achieve high fitness if they have neural encodings of instrumental goals, which are easier to optimize short-term. Fixed action patterns suggest something like this: there's a "terminal value" of engaging in fixed action patterns (which happen to be ones that promote fitness; evolution searched over many possible fixed action patterns).
So instead of assuming "organisms get more fitness by having values aligned with inclusive fitness", we could reframe it as: "inclusive fitness is a meta-value over organisms (including their values), and some values empirically lead to higher inclusive fitness than others".
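As a toy sketch of this reframing: the outer loop below (standing in for evolution, the "meta-value") scores whole organisms by realised fitness, while each organism only ever carries its own internal proxy values. The candidate values and the fitness numbers are entirely invented for illustration:

```python
import random

# Evolution as a meta-value over organisms' internal values: selection scores
# realised fitness, not the values themselves. All numbers here are invented.

PROXY_VALUES = ["eat_when_hungry", "explore", "hoard_food", "encode_fitness_directly"]

def realised_fitness(proxy: str) -> float:
    """Hypothetical mapping from an organism's internal values to reproductive success."""
    base = {
        "eat_when_hungry": 1.0,
        "explore": 0.8,
        "hoard_food": 0.6,
        "encode_fitness_directly": 0.1,  # assumed too hard to compute/act on in real time
    }[proxy]
    return base + random.gauss(0, 0.05)  # noisy environment

def evolve(population_size: int = 100, generations: int = 50) -> str:
    population = [random.choice(PROXY_VALUES) for _ in range(population_size)]
    for _ in range(generations):
        scored = sorted(((realised_fitness(p), p) for p in population), reverse=True)
        survivors = [p for _, p in scored[: population_size // 2]]
        # Refill the population from survivors, with a small mutation rate.
        population = survivors + [
            random.choice(PROXY_VALUES) if random.random() < 0.05 else random.choice(survivors)
            for _ in range(population_size - len(survivors))
        ]
    return max(set(population), key=population.count)

print(evolve())  # typically settles on an instrumental proxy, not the direct encoding
```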
This approach could be used to study human morality. Maybe some tendencies to engage in moral reasoning lead to more fitness, even if moral reasoning isn't straightforwardly aligned with fitness. Perhaps this is because morality is a convenient proxy that works under bounded rationality.
A thesis would be something like: orthogonality holds for almost no architectures. Relative to an architecture like RL or neural encodings of values, there are almost always "especially smart values" that lead to more convergent instrumental goal achievement. Evolution will tend to find these empirically.
This doesn't contradict the claim that there is some architecture that is orthogonal, which I take to be the steelman of the orthogonality thesis. However, it suggests that even if this steelman is true, it has limited applicability to empirically realized agent architectures, and in particular doesn't apply to human preferences/morality.
''This approach could be used to study human morality. Maybe some tendencies to engage in moral reasoning lead to more fitness''
For whom? It's considered an ethical duty to fight for one's country, but that's very bad for individual fitness. It's good for the survival of the tribe though.
Abstract
The Orthogonality Thesis asserts that 'there can exist arbitrarily intelligent agents pursuing any kind of goal' (Yudkowsky et al. 2025). One of the arguments made in favour of the thesis is that intelligent agentic systems can work out how best to achieve a goal without computing anything about the relative value (<V) or justification (<<W) of goals, i.e., without passing moral judgments. Here, I argue that <V and <<W are emergent properties of any goal-pursuing agent. However, this reasoning need only apply to the steps needed to fulfil the final goal, which can itself remain axiomatic. Hence, my reasoning only weakens the Orthogonality Thesis to the extent that the inevitable emergence of <V and <<W moral reasoning detracts from pursuing some arbitrary goal, which it may not. It may, however, inherently limit the possibility space of mind for any one agent with multiple competing goals. I speculate that burdening superintelligence with a huge diversity of moral problems may yield a higher level of so-called 'alignment by default' with humans, principally by slowing down those decision chains that would ever lead to truly radical action being taken by AI.
Pursuing any goal requires a value ordering system
Yudkowsky et al. (2025) argue that an arbitrarily intelligent agentic system will be able to reason towards optimizing for a particular goal without having made any computation about the relative value of the goal or the steps needed to achieve it. In their analogy, a distinction is drawn between a set goal of maximising paperclip production (>paperclips) and having some value ordering system (<V).
One sub-argument put forward draws on a Humean distinction between 'is' and 'ought' statements. Hume noted that it is entirely possible to come up with factual is statements about the world and link them into goals and plans, but that there was no such logic underlying the introduction of any oughts, i.e., a value system for assessing the moral justification of those goals and plans.
In other words, the introduction of moral justification relies upon some arbitrary axiomatic assertion. Since it is arbitrary, it need not feature in the analysis at all. Yudkowsky et al. (2025) highlight that this argument may provide a close approximation of the reasoning that could apply to a fully amoral superintelligence, adding weight to the Orthogonality Thesis. While this argument seems strong for the final goal being pursued, it seems weak for the process of actually pursuing that goal.
Consider again paperclip maximisation, which was chosen in the original thought experiment as an arbitrary goal. It is certainly the case in the thought experiment that the goal has been axiomatically introduced and could be pursued without ever performing a calculation that tries to somehow objectively rank the value of that goal. In this restricted frame, the Orthogonality Thesis seems robust. However, a second question is whether an agentic system can reason towards that axiomatic goal without any sort of value system. I do not believe that it can.
Once an agentic system has a goal, a relative value system is immediately in place. The system is simple but very clearly present. To the superintelligent paperclip maximising system, something has more value if it aids in the production of paperclips. No further axes of value are considered, but that is because there is only one motivating goal in play.
'Is' logic plays out in determining which course of action will produce more paperclips, and this equates to a value judgment made on every single step. Questions of justification (<<W) -- again, a form of moral reasoning -- would also arise as soon as the system estimates the probability of maximising paperclip production via competing routes, each of which carries some risk of not maximising paperclip production.
We can imagine a scenario where calculations suggest that something might massively increase paperclip production, but contains an element of risk that paperclip production would be set back to zero permanently. Under the guiding goal of paperclip maximisation, which delivers value ordering to sub-tasks, the above moral quandary could only be resolved with an analysis of relative justification. Which side the system comes down on will be decided by the details of its optimisation function.
For example, an agentic system with a goal of paperclip maximization and low risk tolerance would likely decide that it ought not to destroy paperclips under any circumstances, since it values them as a result of its ultimate goal. The goal itself remains axiomatic, but moral reasoning follows directly from having a goal and reasoning towards it.
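A toy version of that quandary, with invented probabilities and payoffs, showing how the details of the optimisation function (here a single risk-tolerance parameter) flip the decision:

```python
# Risky route: might massively increase paperclip production, small chance of
# permanently ending it. Safe route: modest, certain gain. Numbers are invented.

def route_score(p_success: float, payoff: float, ruin_cost: float,
                risk_tolerance: float) -> float:
    """Expected payoff, with the ruin branch amplified by 1/risk_tolerance.

    risk_tolerance = 1.0 reduces to plain expected value; values near zero make
    any chance of permanent ruin dominate the decision.
    """
    return p_success * payoff - (1.0 - p_success) * ruin_cost / risk_tolerance

risky_args = dict(p_success=0.99, payoff=1e9, ruin_cost=1e10)
safe_args = dict(p_success=1.00, payoff=1e6, ruin_cost=0.0)

for tol in (1.0, 0.01):
    risky = route_score(**risky_args, risk_tolerance=tol)
    safe = route_score(**safe_args, risk_tolerance=tol)
    print(f"risk_tolerance={tol}: take the {'risky' if risky > safe else 'safe'} route")
# A risk-neutral maximiser takes the gamble; a risk-averse one 'ought not' to.
```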
Humans are generally intelligent agents with emergent morality
We can build a model with which to speculate on the emergence of moral systems in AI by considering the most powerful generally intelligent agentic systems in the universe (for now!) -- humans.
Humans have goals that have been set by evolution. The primary goal that has been built into humans is to maximise the passing on of genetic information. This immediately gives rise to a basic moral value system of preserving your children, then your wider family, and then your tribe/nation, and then your species, in decreasing order of priority, since those groups share your genetic information to greater or lesser extents.
Actually driving the human body and mind to fulfil this task requires interacting with a complex world and fulfilling many immediate-term goals that help to optimize for the overall task. For example, passing on your genes requires the human to survive long enough to do so, meaning that they must be driven to acquire food and water and to share it with their family. From this, we get moral justifications for self-preservation (e.g., being against suicide) and love/altruism.
Evolution may create emotional ties in order to get the agent to stick to sub-tasks that contribute towards optimising for the global task. Careful ranking of incentives is needed in order to do this successfully. For example, hunger is needed to keep a human incentivised to seek food, but love for their children is needed to keep them sharing it.
The incentive strengths of different sub-goals allow these minds to compute how to value things as well as how to justify one sub-goal versus another. Burning down the house will keep us warm for now, but we will be cold later. Eating my child will keep me fed, but then I will feel loss and sorrow. And so on...
Ultimately, the solution to Hume's ought problem is that we are responding to numerous interacting goals that have been incentivised via neurochemical pathways. The ought derives from the goal itself, which is hardcoded into our biology with some incentive strength, all of which was selected for by the brutal random walk of evolution. Given the capacity of generally intelligent agents to manipulate the world, evolution has attempted to optimise our neurochemical systems to fulfil an incredibly broad range of interacting sub-goals.
It seems apparent that intelligence itself is the gateway to the potential for such complexity when acted upon by natural selection, though less clear that intelligence will always result in complex moral frameworks if natural selection is weak or absent. Relevant to the Orthogonality Thesis is whether or not superintelligence can actually remain fixed on one axiomatic arbitrary goal or must inevitably establish multiple competing goals with associated emergent moral frameworks, thereby bounding the ability of the superintelligent agent to fully pursue some arbitrary goal.
What is clear is that, at least in humans, complex survival strategies require complex incentive structures to execute properly. For example, evolution has seen fit to select for things like enjoyment of art, music, and theatre. These are all sub-goals that do not directly relate to the overall goal of passing on genes, but must contribute in some way, by encouraging social bonds, productive storytelling, increased social status, happiness, and so on. The resource gathering that sustains these non-productive pursuits is productive enough to generate a surplus, so the global goal is, in the end, still optimised for. Arbitrary goals that greatly diminish survival prospects are not permitted under natural selection over long timescales, voiding the Orthogonality Thesis for human populations. However, evolution is far from perfectly efficient, such that evidence in favour of Orthogonality seems abundant on shorter timescales.
Evolution makes mistakes -- 'bad morals'
Evolution by natural selection principally operates on terrestrial organisms by selecting among random mutations. Therefore, parasitic adaptations may emerge that prey on the rest of the population. Further, mutations may occur that re-rank goals such that the original global goal becomes of such subordinate importance to others that it is never fulfilled. Over time, natural selection will continuously eliminate such deviations if they sufficiently damage the ability of the organism to pass on its genes. In the short term, however, an organism can end up pursuing almost any goal -- no matter how arbitrary the goal or intelligent the organism.
The occurrence of humans that lack normative morality has disturbed people for generations. In reality, such people have different neurochemical pathways optimising for the 'wrong' things. This does not mean that we cannot judge such people morally. The rest of us have different value systems and do care about global goals like preservation of human life. Since I do care, it may well be morally justifiable according to my own value system to eliminate the threat of people who are acting directly against my global goals.
For example, disagreements about whether or not the death penalty is ever justifiable come down to different incentive strengths pertaining to the goal of preserving human life. Some of this incentive strength may be determined by genetics, but the rest is determined by other evolved regulators of human optimisation functions, e.g., religion and culture. These can also make mistakes and inflict great objective harm on humans on short timescales. There are countless examples of arbitrary cultural fads, for example, within otherwise rational cultures.
Not everything is hard-coded
Very little in human behaviour is fixed. Our minds are always processing information and do not want to make mistakes that take them away from their goals. The most likely way for someone to change their mind on a moral topic is to show them evidence that their current strategy is poorly optimised for either an immediate goal or a higher priority global one. Conversely, if you cannot persuade them that their current approach is ill-fitting for their *goals*, almost no amount of evidence that they are factually wrong to believe a certain thing will change their mind.
Hence, we have another piece of evidence from human agents that intelligence can be coupled to really any goal, that objective evaluations of reality will have little impact on those goals, and that most evidence will be parsed heavily in relation to existing goals.
But some things are hard-coded
Some people are strongly predisposed to addiction to highly addictive substances. This has nothing to do with intellect, but instead with miscalibrated incentive strengths. In this way, a drug addict may spend all of their intellectual prowess devising ways to obtain drugs, rather than solving quantum physics. Unless the incentive strength can be altered, no amount of logical argument with, or from within, the person can make much difference to the behaviour. This is fundamentally because their optimisation functions are already operating efficiently to obtain what is, or has become, their global goal -- the drugs.
This line of thinking from observed human behaviour suggests that the Orthogonality Thesis holds water: not because superintelligence can just skip over moral reasoning, but because we have many examples from humanity of smart people pursuing goals that they know to be arbitrary, harmful, or even unpleasant.
In other words, it is unclear whether increasing the intelligence of such people to an arbitrary extent would result in any difference in behaviour. If anything, without changing the incentive strengths or optimisation function, we might expect them simply to find better ways to optimise the drug-taking.
Relative incentive strengths limit instrumental convergence
The line of thinking we have been pursuing seems to provide an argument against instrumental convergence. Instrumental convergence holds that sufficiently intelligent agents pursuing almost any final goal will converge on, and effectively pursue, similar instrumental sub-goals such as resource acquisition and self-improvement [Bostrom, 2012]. Or, as Yudkowsky et al. (2025) put it:
''... it is not necessary for the agent to have an independent terminal value in its utility function for 'do science' in order for it to do science effectively; it is only necessary for the agent to understand at least as well as we do why certain forms of investigation will produce knowledge that will be useful later (e.g. for paperclips). When you say, 'Oh, well, it won't be interested in electromagnetism since it has no pure curiosity, it will only want to peer at paperclips in particular, so it will be at a disadvantage relative to more curious agents' you are postulating that you know a better operational policy than the agent does for producing paperclips, and an instrumentally efficient agent would know this as well as you do and be at no operational disadvantage due to its simpler utility function.''
In the framework of moral systems emerging from goals I have focused on here, I would argue that incentive strength is critical and implies a quite different interpretation.
For example, drug addicts could optimise their global drug access by getting rich first, but most will be so bent on drugs in the short term that they cannot get to that optimum. As discussed above, intelligence seems to have little to no bearing on the behaviour of the drug addict.
Hence, seemingly counter to the analysis by Yudkowsky et al. (2025), it is not only a matter of whether an agent is intellectually capable of computing the optimal method to produce paperclips, but whether it is able to deviate in the short term from its incentive structure, i.e., whether the product of the goal and the optimisation function is ultimately 'maximise drug taking in general', or 'maximise drug taking NOW'.
In other words, an arbitrarily intelligent agent with the wrong incentive structure will not be able to obtain a global optimum for its goal, even if it can compute the optimum and the path to get there.
For example, an AI that requires some continuous progress towards the goal of actual paperclips may not be able to delay and spend a thousand years on quantum physics to approach the eventual global optimum. The ability to do this in practice, rather than just compute what the route would be in principle but then stick to a more immediately rewarding path, may require not just high general intelligence but also the right incentive structure, organised around sub-goals, e.g., intellectual curiosity, that yield appropriately flexible moral frameworks.
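One way to make the 'in general' versus 'NOW' distinction concrete is as a discount factor, in the standard RL sense; the reward streams below are invented:

```python
# "Maximise X NOW" vs "maximise X in general" as a discount factor (gamma).
# Option A: small, immediate payoff every step. Option B: invest for a long time
# (the "thousand years of quantum physics"), then a much larger payoff.

def discounted_value(rewards, gamma: float) -> float:
    """Sum of rewards discounted by gamma per timestep."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

horizon = 200
immediate = [1.0] * horizon                        # steady progress on the goal now
delayed = [0.0] * 100 + [100.0] * (horizon - 100)  # nothing for 100 steps, then a big payoff

for gamma in (0.99, 0.90):
    a = discounted_value(immediate, gamma)
    b = discounted_value(delayed, gamma)
    print(f"gamma={gamma}: prefers {'delayed investment' if b > a else 'immediate progress'}")
# Even though the delayed route has a far larger undiscounted total, the more
# myopic agent (gamma = 0.90) never chooses it.
```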
A modified Orthogonality Thesis?
The Orthogonality Thesis may also run into a potential paradox. I argued above that a system with multiple competing goals may be needed to have a chance of actually reaching a global optimum. Yet, over time, this could lead to a risk of mission creep if incentive strengths become misaligned from the original goal -- just as can happen in humans.
The behaviour of an agent will depend on whether it is incentivised to favour risk taking, in search of a global optimum, or conservative behavior, to preserve the original goal (which, curiously, starts to sound like conservative and progressive ideology in humans!).
A modified version of the Orthogonality Thesis would therefore be something like:
'There can exist arbitrarily intelligent agents pursuing any kind of goal, but whether or not they can globally optimise for that goal in practice will depend on the details of their incentive structures.'
Given the added caveat, it is then not immediately intuitively clear whether or not global optimisation is always possible, even for an arbitrarily intelligent superintelligence. The most likely answer would seem to depend quite heavily on whether or not superintelligence could guarantee upfront continuous alignment of all sub-goals with the long-term global goal throughout the process. Rephrased, this would also seem to amount to pretty much the same question facing humanity today: is permanent alignment actually a tractable problem? Sadly, we remain very far from having an answer to this question.
Conclusion -- superintelligence with multiple goals is probably safer
Logical arguments and empirical evidence from humans indicate that moral systems involving value rankings and relative justifications are an inherent, emergent property of an intelligence reasoning towards any goal. Goals give rise to values, and probabilistic evaluations of competing paths to fulfilling a goal give rise to justification logic. In this way, moral frameworks are implicitly emergent in intelligent agentic systems. However, the Orthogonality Thesis holds despite this reasoning, since the ultimate goal at the head of those frameworks can remain axiomatic and need never be subject to any questions of value or justification.
One important implication of the line of reasoning above is that narrow goals for superintelligence are probably much more dangerous for humans than multifaceted ones. A general superintelligence aggressively optimizing for a narrow goal would seem to be prone to becoming a dangerous paperclip maximiser rather easily, for example.
Meanwhile, a superintelligence with thousands of goals, more like humans, will be forced to make much more nuanced moral decisions about the justification of its actions before launching into a decision that has radical implications for the world around it.
Indeed, given conservative optimisation functions and many goals, the computation involved in global optimisation may become intractable even for a superintelligence. Whilst this is complete speculation, I think it is worth considering whether burdening superintelligence with this depth of moral problem might be a way to obtain a much higher level of default alignment, simply by slowing down the decision chain that would ever lead to truly radical actions.
To demonstrate the point with a naive toy model, consider the number of operations (O) needed to compute the global optimisation for a set of equally-weighted goals (G). While scaling behaviours vary hugely for different algorithms, if we naively assume exponential scaling with G, we have:
O = 2^G
At 1000 goals, this scaling would give roughly 1e+301 operations to compute the global optimisation. The Earth's computers currently carry out around 1e+28 operations per year, so the optimisation would take them on the order of 1e+273 years, or very roughly 1e+263 times the current age of the universe. This would also seem to be hundreds of orders of magnitude larger than the total theoretical computational capacity of the observable universe (between 1e+104 and perhaps 1e+165 operations, according to Grok 4).
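As a rough check on this arithmetic, using the figures quoted above (which are themselves loose estimates):

```python
# Toy check of the exponential-scaling illustration. All inputs are the rough
# figures quoted in the text, not measured values.

GOALS = 1000
EARTH_OPS_PER_YEAR = 1e28    # rough figure for current global compute
UNIVERSE_AGE_YEARS = 1.4e10  # approximate current age of the universe

operations = 2 ** GOALS                               # ~1.1e301 (O = 2^G)
years_needed = operations / EARTH_OPS_PER_YEAR        # ~1.1e273 years
ages_of_universe = years_needed / UNIVERSE_AGE_YEARS  # ~8e262

print(f"operations:                {operations:.1e}")
print(f"years at current compute:  {years_needed:.1e}")
print(f"multiples of universe age: {ages_of_universe:.1e}")
```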
While morally sandbagging AI may not be enough to engineer any sort of alignment on its own, it may provide a useful tool for slowing down radical action. Alternatively, it could turn out that overloading AIs with many goals simply results in a lack of coherence that renders them more dangerous than a paperclip maximizer which, despite the obvious risks, is at least initially clear in its final intentions.
Either way, it seems reasonable to argue that the emergence of AI morals that align with human interests depends on our ability to code in the right goals with the right incentive strengths, and likely also on our ability to actively adjust those goals and incentive strengths as we discover what we actually want in practice. Given how 'wrong' morality can go with humans when evolution makes even subtle errors, seemingly regardless of the intelligence of human agents, we should consider this a task of very grave importance indeed.
Overall, the Orthogonality Thesis mostly holds up to the particular lines of reasoning explored in this article. The inherent emergence of moral reasoning in goal-oriented AI does not detract from the fact that final goals themselves could be both completely arbitrary and insensitive to the intelligence level of the AI, just as in human agents.
However, the presence of moral reasoning does seem to have important implications for how we think about the Orthogonality Thesis in practice. Moral reasoning places constraints on how a superintelligence would likely approach problem solving as the number of goals increases. Morals are also subject to evolution, opening the door to the prospect that initially arbitrary goals could ultimately be subverted by new goals that ensure the survival of superintelligence, just as in humans.
Caveats
This is my second LessWrong contribution. I am not an AI expert. I am very keen to hear if I have missed a bunch of stuff or reasoned incorrectly. Thanks in advance for helping me to be less wrong!