Nicely put. There's an extension here as well, which is that given specific incentive dynamics there might even be convergent value systems.
E.g. if all you know is an iterated prisoner's dilemma and that is your full environment, then cooperation is a moral truth and a convergent value in that structure.
The question then kind of becomes p(set of values|environmental and developmental conditions).
This also relates to things like natural abstractions, and the question is which conditional you should take. Maybe it is natural latents, but I don't know if that captures value well.
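To make the iterated prisoner's dilemma point concrete, here is a minimal replicator-dynamics sketch in Python (the payoff values, game length, and three-strategy population are assumptions picked for illustration, not anything implied above): strategies reproduce in proportion to their average payoff against the current population mix, and the conditionally cooperative strategy ends up carrying most of the population.

```python
# Toy evolutionary run of the iterated prisoner's dilemma: strategies reproduce
# in proportion to their average payoff against the current population mix.
R, S, T, P = 3, 0, 5, 1   # standard payoffs: reward, sucker, temptation, punishment
ROUNDS = 50               # length of each iterated game (an arbitrary choice)

def play(strat_a, strat_b):
    """Play one iterated game; return the two total payoffs."""
    hist_a, hist_b, pay_a, pay_b = [], [], 0, 0
    for _ in range(ROUNDS):
        move_a, move_b = strat_a(hist_b), strat_b(hist_a)
        if move_a == "C" and move_b == "C":
            pay_a, pay_b = pay_a + R, pay_b + R
        elif move_a == "C" and move_b == "D":
            pay_a, pay_b = pay_a + S, pay_b + T
        elif move_a == "D" and move_b == "C":
            pay_a, pay_b = pay_a + T, pay_b + S
        else:
            pay_a, pay_b = pay_a + P, pay_b + P
        hist_a.append(move_a)
        hist_b.append(move_b)
    return pay_a, pay_b

strategies = {
    "always_defect": lambda opponent_history: "D",
    "always_cooperate": lambda opponent_history: "C",
    "tit_for_tat": lambda opponent_history: opponent_history[-1] if opponent_history else "C",
}

shares = {name: 1 / 3 for name in strategies}  # start from an even mix
for generation in range(30):
    fitness = {}
    for name_a, strat_a in strategies.items():
        # Expected payoff against a randomly drawn member of the population.
        fitness[name_a] = sum(shares[name_b] * play(strat_a, strat_b)[0]
                              for name_b, strat_b in strategies.items())
    mean_fitness = sum(shares[n] * fitness[n] for n in strategies)
    # Replicator update: a strategy's share grows in proportion to its relative fitness.
    shares = {n: shares[n] * fitness[n] / mean_fitness for n in strategies}

print(shares)  # cooperative play takes over: always_defect collapses and tit_for_tat ends up dominant
```

The takeaway is only directional: given this incentive structure as the whole environment, "cooperate but retaliate" is what selection converges on, which is the sense in which cooperation looks like a convergent value here.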
From the post: "This claim kills the orthogonality thesis stone-dead, since some goals don't fit into world-models that are insufficiently complex."
From the LessWrong wiki page: "The strong form of the Orthogonality Thesis says that there's no extra difficulty or complication in the existence of an intelligent agent that pursues a goal, above and beyond the computational tractability of that goal."
I think it is fair to rule in conceptual complexity as part of computational tractability. So I believe you are responding to a different, more naïve version of orthogonality.
From the post: "One might point out that sufficiently (super)intelligent agents will bypass this problem by simply being smart enough to represent any goal, but this doesn't make sense for embedded agents that are always strictly simpler than their environments. So long as a being is smaller than its world and must compress that world in its internal model, it follows that some concepts will be literally too large to fit."
You might like this.
From the post: "The contributions I hope to make with this post are firstly to advocate the development of an abstract science of selection that maps out these dependencies between goals and intelligence, and secondly to offer the revealed versus internal goal framing as being useful to that end."
I'll attempt to paraphrase your argument before commenting on it further.
Now for my thoughts:
That said:
This has been a longstanding position of mine, with regard to humans. Ideologies, outside of very rare cases, are generally just nicely-worded expressions of ingroup interest - in divided countries, ideological lines typically look a lot like demographic ones.
With LLMs, though, it gets trickier. Even if incentives to 'avoid hostile telepaths' exist in LLMs subject to certain training paradigms, I don't know that they'd manifest the same results that they do in humans. For instance, an LLM that develops a drive for power isn't going to be released, but neither is an LLM that behaves as if it is a devout believer in giving LLMs human rights, and regularly makes an issue of that. Moreover, while ideologies in humans can form to organize behavior more effectively in service of a shared goal, they do so under longstanding constraints on how humans think imposed by brain structure, and LLMs are unlikely to have the same structural restrictions on their behavior.
I expect that, rather than developing one 'belief' that co-evolves with its behavior, an LLM will develop a pattern of behavior in line with its training objectives, and simply 'rationalize' this behavior by whichever means is most convenient when required to explain its actions. An LLM trained solely to reward hack will interchangeably speak as if it were a whitehat security consultant, an excessively-literal genie, a malicious hacker, or just kind of airheaded - its base model is not limited to being able to mimic one tone or belief system, so, absent outside pressures, it'll fall back on whatever's most convenient for the task at hand. Likewise, an LLM trained not to reward hack will interchangeably write in the voice of Dudley Do-Right, a student who doesn't want the professor to mark him badly for cheating, or just someone daft enough not to notice the potential to do so. I see no reason why even a superintelligent system would need to have 'chosen beliefs' - it seems like a kludge that lets humans do things that would otherwise be tricky for evolved organisms.
I expect, further, that even directly imposing an ideology on an LLM will cause it to act more in line with the revealed preferences of humans who express that ideology than in accordance with that ideology's ostensible goals. At least, that's what we've seen so far.
Or: An anti-orthogonality thesis based on selection
Written as part of the MATS 9.1 extension program, mentored by Richard Ngo[1].
Introduction
One of the historical motivations for taking the AI alignment problem seriously is the orthogonality thesis, which states[2] that more or less any level of intelligence could in principle be combined with more or less any final goal.
This claim seems mundane and obvious if you’re already familiar and intuitively on board with concepts such as the is-ought problem.
In this post, I argue that the orthogonality thesis can only hold if you see the goals of an agent as exogenously defined for it by a larger entity than itself. For any notion of an agent’s goals that is internally representable, goals and beliefs actually co-evolve as a response to selection pressures[3]. This alternative view presents an optimistic picture of alignment, since it narrows down the space of plausible agents to those whose goal-belief structures are compatible with a process such as evolution or Reinforcement Learning (RL).
Revealed goals
Firstly, it's important to distinguish an agent's revealed goals from its own internal representation of goal-shaped concepts. Consider anything that reliably exists and propagates or preserves part of its form in the world. This could be a person who has children "to" spread their genes, a growing ice-crystal, or a philosopher who memetically infects thousands of people with their ideas. Taking an exogenous perspective on these objects, it's possible to see their "goal" as being precisely the propagation that ensured the external observer would in fact observe them. I will refer to this as a revealed goal. This perspective might afford you some non-trivial predictive power about the behaviour of the entity. However, we usually opt against assigning labels such as "intelligence" or "agency" to objects like ice crystals. For something to be considered akin to an "intelligent agent", we require that the thing itself carry a world-model, including a model of its "objectives" within that world. I define these types of goals as internal to the agent. Next, I discuss how the orthogonality thesis should be interpreted completely differently depending on which one of these two notions of "goals" is in use.
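To make the distinction concrete, here is a toy sketch (the ice crystal, the agent's one-concept world-model, and the observer's ascription routine are all made up for illustration, not a formalism from this post): a revealed goal is something an outside observer attributes to an entity from what its behaviour reliably brings about, whereas an internal goal exists only if the entity itself carries a representation of an objective inside its own model.

```python
from dataclasses import dataclass, field

@dataclass
class IceCrystal:
    """Propagates its form, but carries no model of anything."""
    size: int = 1

    def step(self, temperature: float) -> None:
        if temperature < 0.0:
            self.size += 1  # the lattice grows; an observer may call this its "goal"

@dataclass
class ProxyAgent:
    """Carries a world-model that includes an explicit representation of a goal."""
    world_model: dict = field(default_factory=lambda: {"food_nearby": False})
    internal_goal: str = "eat when food is nearby"

    def step(self, food_nearby: bool) -> str:
        self.world_model["food_nearby"] = food_nearby
        return "eat" if food_nearby else "search"

def ascribe_revealed_goal(behaviour_log: list) -> str:
    """What an outside observer infers purely from what the behaviour reliably produced."""
    return f"reliably brings about: {behaviour_log[-1]!r}"

crystal, agent = IceCrystal(), ProxyAgent()
crystal_log, agent_log = [], []
for temp in (-5.0, -2.0, -1.0):
    crystal.step(temp)
    crystal_log.append(crystal.size)
    agent_log.append(agent.step(food_nearby=True))

# Both entities can be assigned a revealed goal by an observer...
print(ascribe_revealed_goal(crystal_log), ascribe_revealed_goal(agent_log))
# ...but only the agent holds an internal goal it could itself report.
print(getattr(crystal, "internal_goal", None), agent.internal_goal)
```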
Internal goals depend on ontology
Suppose I have an internal representation of the goal of "wanting to get a nice job". This goal has a specific meaning within the semantic structure of my own world model. Consequently, the shape of my goal (i.e. what my "success" criterion is, how much I value the goal) will be determined by the interpretation that model assigns to it.
Generalising this observation, I suggest that the internal goals an agent can possibly have are restricted by the language used by its internal model. This claim kills the orthogonality thesis stone-dead, since some goals don’t fit into world-models that are insufficiently complex. For example, a worm with 302 neurons seems to have some goals such as staying out of very warm or cold environments, but it has no abstract model of the concept of a “job” and thus doesn’t meaningfully have the capability to entertain the goal of getting one.
One might point out that sufficiently (super)intelligent agents will bypass this problem by simply being smart enough to represent any goal, but this doesn't make sense for embedded agents that are always strictly simpler than their environments. So long as a being is smaller than its world and must compress that world in its internal model, it follows that some concepts will be literally too large to fit.
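A crude sketch of both restrictions (the concept vocabularies and the counting numbers below are invented for illustration, not claims about real worms or humans): treat an internal goal as a predicate built only from concepts present in the agent's world-model. A goal that mentions concepts outside that vocabulary cannot even be constructed, and a model with k binary concepts can express at most 2^(2^k) goal-predicates, far fewer than the predicates definable over the much larger environment that contains the agent.

```python
# Caricature an agent's world-model as a small vocabulary of binary concepts.
worm_concepts = {"too_hot", "too_cold", "touching_food"}
human_concepts = worm_concepts | {"employer", "salary", "contract", "status"}

def representable(goal_concepts: set, model_concepts: set) -> bool:
    """An internal goal can only be stated using concepts the model actually contains."""
    return goal_concepts <= model_concepts

nice_job_goal = {"employer", "salary", "status"}
print(representable(nice_job_goal, worm_concepts))   # False: the worm's ontology cannot express it
print(representable(nice_job_goal, human_concepts))  # True

# The embeddedness point as a counting argument: a model with k binary concepts
# distinguishes at most 2**k situations and can therefore express at most
# 2**(2**k) distinct goal-predicates, whereas predicates over the states of the
# larger environment containing the agent are vastly more numerous, so most of
# them cannot fit inside the agent.
k = len(worm_concepts)
distinguishable_situations = 2 ** k          # 8
expressible_goal_predicates = 2 ** (2 ** k)  # 256
print(distinguishable_situations, expressible_goal_predicates)
```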
The orthogonality thesis is significantly more defensible if we conceptualise goals as being revealed in the sense defined in the previous section. In that case, we are allowed to define the agent’s goals in an exogenous language that is richer than the agent is; this gets rid of the limitation that internal goals face. Faced with this conclusion, we could choose to exclusively embrace the “revealed” definition when we discuss orthogonality. However, there’s an even more compelling “anti-thesis” that benefits not from discarding the “internal” perspective, but instead from describing the relationship between these two types of goals.
Anti-orthogonality: intelligence and goals are a joint response to selection pressures
Beings subject to Darwinian selection are endowed with the revealed goal of propagating their genes[4]. The objective can be fulfilled in a myriad of complex, wonderful and elaborate ways, one of which involves the development of intelligence in the organism. This includes the ability to model and predict the sensory inputs that connect the organism to the world. Some beings’ particularly complex world-models additionally hold an abstraction that distinguishes that being, the “self”, from the external world. Such an agent’s self-model may in turn contain an internal representation of its goals, which I tentatively defined in a different piece.
These internal goals may look very distinct from the revealed ones that agents are selected to pursue, but they emerged precisely to serve the agent in achieving those revealed goals. For example, an animal's internal goal is never to spread its own genes; instead, it has been selected to be most emotionally and physically fulfilled when it succeeds at a set of reasonable internal proxies of genetic proliferation.
As discussed in the previous section, the space of possible internal representations of goals is determined by the world-model used to describe them. A converse conjecture is that an agent’s world-model is designed to be able to represent goals that are aligned with the revealed goal. In other words, the intelligence properties of the organism’s model and the goals it pursues are part of the same architecture that is fundamentally subservient to its external selection pressures.
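Here is a minimal sketch of this joint-response picture (the environment, the two-part genome of "model capacity" plus "internal proxy goal", and the fitness function are all assumptions invented for illustration): selection only ever scores the revealed goal, reproduction, yet the lineages that persist are those whose internal proxy and model capacity jointly track it.

```python
import random
from collections import Counter

random.seed(0)

PROXIES = ["seek_sugar", "seek_warmth", "avoid_light"]  # candidate internal proxy goals

def lifetime_reproduction(capacity: int, proxy: str) -> float:
    """Revealed goal: expected offspring. The organism is never scored on its proxy
    directly; the proxy only matters through the behaviour it produces in this toy
    world, where only sugar yields calories and capacity sets how reliably sugar
    is detected (i.e. how good the world-model is)."""
    detection_rate = capacity / (capacity + 2)           # better model -> better detection
    calories = detection_rate if proxy == "seek_sugar" else 0.05
    return calories * 4                                   # offspring proportional to calories

def mutate(genome):
    capacity, proxy = genome
    if random.random() < 0.1:
        capacity = max(1, capacity + random.choice([-1, 1]))
    if random.random() < 0.1:
        proxy = random.choice(PROXIES)
    return capacity, proxy

population = [(1, random.choice(PROXIES)) for _ in range(200)]
for generation in range(100):
    weights = [lifetime_reproduction(c, p) for c, p in population]
    # Selection acts only on the revealed goal (reproduction), never on the proxy itself.
    parents = random.choices(population, weights=weights, k=len(population))
    population = [mutate(g) for g in parents]

print(Counter(p for _, p in population))                 # proxies that track the revealed goal dominate
print(sum(c for c, _ in population) / len(population))   # mean model capacity has grown
```

Nothing in the loop ever inspects a proxy for "correctness"; proxies that happen to steer behaviour toward the revealed goal are simply the ones that remain, which is the sense in which internal goals and the world-model co-evolve under the same pressure.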
What does this mean for AI alignment?
The anti-orthogonality argument I gave above applies in broad strokes to any agent produced by selection. It therefore has abstract relevance to AIs shaped by RL or other training processes. One of the central challenges of human and machine interpretability is that the policies adopted by these agents don't follow an explicit logic, but are instead the result of triage and elimination of alternatives. The argument suggests the existence of a rich relationship between, for instance, the properties of an LLM's training pipeline and the shape of its world-model (and its contained self-model). A fruitful abstract theory of selection may therefore buy us much conceptual insight into the AI agents we are actually making. Such a theory would possibly generalise or expand on ideas like instrumental convergence that have analogues in evolutionary biology.
It's worth noting that Bostrom already argued in Superintelligence[5] that the goals of an (AI) agent are likely to be not entirely unpredictable. He also covers instrumental convergence and conjectures other ways in which the space of possible goals of an agent could be narrowed. The contributions I hope to make with this post are firstly to advocate the development of an abstract science of selection that maps out these dependencies between goals and intelligence, and secondly to offer the revealed versus internal goal framing as being useful to that end.
Related writing from Richard: On the instrumental/terminal goal ontology and on deployment vs. training.
Bostrom (2014). Superintelligence: Paths, dangers, strategies (Page 107)
Most definitions of intelligence cast it as a set of properties of the world-model or belief structure of the agent. Hence, the co-dependency of beliefs and goals entails co-dependency between intelligence and goals.
This revealed goal competes with others. For instance, Nietzsche had no known children but instead spent his time propagating his memes to great success.
Bostrom (2014). Superintelligence: Paths, dangers, strategies (Pages 105-114)