Nicely put. There's an extension here as well: given specific incentive dynamics, there might even be convergent value systems.
E.g. if all you know is an iterated prisoner's dilemma and that is your full environment, then cooperation is a moral truth and a convergent value in that structure.
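To make the iterated prisoner's dilemma point concrete, here is a minimal sketch. The payoff numbers, the two strategies and the round count are illustrative assumptions, not anything taken from the thread. It suggests that whether conditional cooperation "wins" depends on the structure of the environment: the reciprocator loses narrowly head-to-head against a defector, but comes out ahead once pairings between reciprocators are counted.

```python
# Illustrative payoffs with T > R > P > S (assumed values, not from the thread).
PAYOFF = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # sucker's payoff vs. temptation
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection
}


def tit_for_tat(own_history, other_history):
    """Cooperate first, then copy the opponent's previous move."""
    return other_history[-1] if other_history else "C"


def always_defect(own_history, other_history):
    """Defect unconditionally."""
    return "D"


def play(strategy_a, strategy_b, rounds=200):
    """Play an iterated game and return the two cumulative scores."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


def round_robin(strategies, rounds=200):
    """Total score per strategy when every pairing (including self-play) is played once."""
    names = list(strategies)
    totals = {name: 0 for name in names}
    for i, a in enumerate(names):
        for b in names[i:]:
            score_a, score_b = play(strategies[a], strategies[b], rounds)
            totals[a] += score_a
            if a != b:
                totals[b] += score_b
    return totals


if __name__ == "__main__":
    population = {"tit_for_tat": tit_for_tat, "always_defect": always_defect}
    print(play(tit_for_tat, always_defect))  # (199, 204): defection wins head-to-head
    print(round_robin(population))  # {'tit_for_tat': 799, 'always_defect': 404}
```

Which values look "convergent" here is sensitive to the assumed population and payoffs, which is one way of reading the conditional in the next line.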
The question then kind of becomes p(set of values|environmental and developmental conditions).
This also relates to things like natural abstractions, and the question is which conditional you should take. Maybe it is natural latents, but idk if that captures value well.
This claim kills the orthogonality thesis stone-dead, since some goals don’t fit into world-models that are insufficiently complex.
From the LessWrong wiki page: "The strong form of the Orthogonality Thesis says that there's no extra difficulty or complication in the existence of an intelligent agent that pursues a goal, above and beyond the computational tractability of that goal."
I think it is fair to rule in conceptual complexity as part of computational tractability. So I believe you are responding to a different, more naïve version of orthogonality.
One might point out that sufficiently (super)intelligent agents will bypass this problem by simply being smart enough to represent any goal, but this doesn’t make sense for embedded agents that are always strictly simpler than their environments. So long as a being is smaller than their world and must compress that world in their internal model, it follows that some concepts will be literally too large to fit.
You might like this.
The contributions I hope to make with this post are firstly to advocate the development of an abstract science of selection that maps out these dependencies between goals and intelligence, and secondly to offer the revealed versus internal goal framing as being useful to that end.
I'll attempt to paraphrase your argument before commenting on it further.
Now for my thoughts:
That said:
Thanks for your comments. I reacted to some parts where I straightforwardly agree and address some more nuanced points below.
I agree that my argument feels mesa-optimiser-shaped; I avoided the terminology because many "goals" I have in mind involve existence and not optimality conditions. But indeed, the relationship between the base and mesa optimisers is roughly analogous to the one I have in mind between "revealed" and "internal" goals[1]. "fundamentally subservient" is therefore a linguistic overreach that I retrospectively don't endorse. As you pointed out, the mesa optimiser is spun up to achieve the base optimiser's goals, but the power structure between them once the mesa optimiser exists can be complex and rich; I'm broadly interested in studying precisely this complexity.
I find the point you make about the conflation technically correct but don't yet know how I feel about the four-goal typology it induces. While writing, I conceptualised the external observer as part of a thought experiment in which non-embeddedness is allowed as a hermeneutic device (sorry that this wasn't clear!). I think this can be a useful idea even if no possible external observer actually has this property, but it breaks down when the observer itself has to interact with the agent (i.e. in a game-theoretical setup). Your distinction could definitely be useful for such cases at least.
Finally:
Just because internal goals, along with an agent's intelligence/capabilities, are selected for together in the sense that they exist in the same agent (and the selection process is over the "whole" agent), does not mean they won't be independent. Or at least, I don't see how this necessarily follows.
I'm pointing out something significantly stronger than that the two things exist in the same agent; they are selected "to"[2] achieve the same revealed goal. The intelligence develops to fit the internal goals that are useful for the revealed goal, and the internal goals are in turn limited by the shape of the intelligence. I think these limitations are much more informationally rich than can be chalked up to "just" being about compute/complexity bounds, though I failed to illustrate this in my original piece. Let's say you're a selection process designing an agent and choose between "giving it" an architecture that is likely to encode abstraction A versus abstraction B[3]. Your choice between A and B might depend on questions like what internal model and goals are fit for the agent's environment. This means there's a dependency between the way in which the agent will be intelligent (i.e. whether it learns abstraction A or B) and what its internal goals will be. Information about one gives information about the other and vice-versa.
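The "information about one gives information about the other" claim can be read as a statement about mutual information. The sketch below is a toy illustration under invented assumptions: the two joint distributions over (abstraction, internal goal) are made up for the example, and the labels `g1` and `g2` are hypothetical. It only shows how the dependency would be measured, not that real selection processes produce it.

```python
import math


def mutual_information(joint):
    """Mutual information I(X;Y) in bits for a joint distribution {(x, y): probability}."""
    marginal_x, marginal_y = {}, {}
    for (x, y), p in joint.items():
        marginal_x[x] = marginal_x.get(x, 0.0) + p
        marginal_y[y] = marginal_y.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (marginal_x[x] * marginal_y[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


# Hypothetical case 1: the selection process picks abstraction and goal independently.
independent = {
    ("A", "g1"): 0.25, ("A", "g2"): 0.25,
    ("B", "g1"): 0.25, ("B", "g2"): 0.25,
}

# Hypothetical case 2: selection tends to pair abstraction A with goal g1 and B with g2.
coupled = {
    ("A", "g1"): 0.4, ("A", "g2"): 0.1,
    ("B", "g1"): 0.1, ("B", "g2"): 0.4,
}

if __name__ == "__main__":
    print(mutual_information(independent))  # 0.0 bits: knowing the abstraction says nothing about the goal
    print(mutual_information(coupled))      # about 0.278 bits: each variable is informative about the other
```

On this reading, the orthogonality-style position corresponds to the first distribution and the selection-based position to something like the second; how large the coupling is in practice is exactly what is in dispute.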
I have occasionally used the term "mesa-satisficer" in the past.
Asterisk for the base/mesa distinction again!
You could, at a higher complexity cost, endow it with an abstraction C that contains the ability to understand both A and B (which are themselves mutually incomparable), but this might not be the best way to allocate your resources.
because many "goals" I have in mind involve existence and not optimality conditions
Could you give some examples? This seems quite important. Re: existence, this is possibly related to what you're thinking?
While writing, I conceptualised the external observer as part of a thought experiment in which non-embeddedness is allowed as a hermeneutic device (sorry that this wasn't clear!). I think this can be a useful idea even if no possible external observer actually has this property, but it breaks down when the observer itself has to interact with the agent (i.e. in a game-theoretical setup).
I'm naturally (intuitively) suspicious of god's-eye-views––thinking about observers that stand completely outside observed-systems––and much more inclined to thinking in interactive/embedded terms. I'm curious why/how you find this perspective fruitful and/or interesting.
Let's say you're a selection process designing an agent and choose between "giving it" an architecture that is likely to encode abstraction A versus abstraction B[3]. Your choice between A and B might depend on questions like what internal model and goals are fit for the agent's environment. This means there's a dependency between the way in which the agent will be intelligent (i.e. whether it learns abstraction A or B) and what its internal goals will be.
I see why this is true for e.g. a hawk and a bat, where the abstraction-capabilities in question are visual ability v.s. echolocation. I don't see why this is true once the selection process cranks up the "general intelligence" knob (at least, assuming natural abstractions).
Information about one gives information about the other and vice-versa.
I'd expect two ASIs with different goals to have similar abstractions about the world, but different abstractions where those abstractions involve themselves/some level of recursive modeling, since those are how internal goals are represented; i.e. imo mutual info between them is low in this case.
On existence: I don't see why agents should be seen as optimisers rather than as achieving some minimal conditions they are satisfied with. The second view seems more consistent both with actual human behaviour and with the concept of bounded rationality as a whole. The minimal conditions seem intuitively related to ensuring existence/propagation (e.g. drinking "enough" water, acquiring "enough" shelter, etc.), but I don't have a more complete way to put it than that for the moment. I'll check your rec out, thanks!
I'm also suspicious of god's-eye-views. I think they can be conceptually clear and helpful on the one hand, but it's unclear how much useful "realness" you trade for that clarity. I see them as training wheels that dominate in the interim as I seek a better, embedded perspective.
----
hawk vs. bat is exactly the type of example I had in mind. I don't necessarily assume natural abstractions, and maybe relatedly I'm not sure if there's a meaningful "general intelligence" knob that a process can choose to crank up — even one that nominally exists for exactly that purpose (like a project that tries to build an ASI). This could be cruxy? It might also relate to me not seeing why you'd expect two ASIs to develop similar emergent abstractions for their world-models in the following quote:
I'd expect two ASIs with different goals to have similar abstractions about the world, but different abstractions where those abstractions involve themselves/some level of recursive modeling, since those are how internal goals are represented; i.e. imo mutual info between them is low in this case.
I don't see why agents should be seen as optimisers rather than as achieving some minimal conditions they are satisfied with. The second view seems more consistent both with actual human behaviour and with the concept of bounded rationality as a whole. The minimal conditions seem intuitively related to ensuring existence/propagation (e.g. drinking "enough" water, acquiring "enough" shelter, etc.), but I don't have a more complete way to put it than that for the moment.
I like this perspective a lot and I think it is indeed more informative than the optimizey perspective wrt agents-that-we-currently-observe-exist. But I don't expect this perspective to be informative if we build something that is very consequentialist/optimize-y (e.g. ASI).
Imo the best formal grounding for this intuition of agents being exist-y/satisfice-y is the Free Energy Principle (FEP). And I do think ASI will be an active inference agent, but that doesn't really preclude the possibility that it's also optimize-y; active inference agents behave more and more like EU maximizers under some conditions (namely low ambiguity), and I (tentatively) expect these conditions to be met for ASI.
Some of my uncertainties around this:
I'm not sure if there's a meaningful "general intelligence" knob that a process can choose to crank up — even one that nominally exists for exactly that purpose (like a project that tries to build an ASI). This could be cruxy?
Yes, I think it's cruxy. Could you elaborate on your uncertainty? Even if you're just sketching out very feathery intuitions.
And I'm curious - if you think this knob doesn't really meaningfully exist, what do you think current frontier labs are doing/selecting for, and what do you think they're trying to do/select for? (Like, for example, do you think they're trying to crank up the general intelligence knob, and that this is a futile task––really, they're cranking up some different, adjacent knob?)
I'd expect two ASIs with different goals to have similar abstractions about the world, but different abstractions where those abstractions involve themselves/some level of recursive modeling, since those are how internal goals are represented; i.e. imo mutual info between them is low in this case.
Thanks for probing on this! I'm not sure I endorse that strong of a claim anymore. Refining into something I'd endorse more:
The last point seems to be the most important point, but I'm not sure why I buy it. But I do buy it.
I agree that FEP-shaped intuitions are very good for satisfice-ey agents. I'm unconvinced by the concrete mathematical modelling (notably not a fan of Bayesian generative models) but I find the ideas conceptually useful if you abstract away the implementation.
My scepticism of general intelligence is closely related to your point that ASIs won't infer every single law. Any given level of complexity in an organism can only accommodate a limited ontology. Of course, you can always "juice up" the agent and give it more resources so it learns a more textured world model. One pseudo-mathematical way to put this is that for every set of abstractions, there exists an abstraction that subsumes all of them at once; for a fixed level of complexity, however, there exist two sets of abstractions such that neither one clearly dominates.
Our crux might start at "some laws are convergently useful to infer". One corollary of my last pseudo-mathematical claim is that any bounded agent has to "choose" between incomparable ontologies. The claim in my original post is that the revealed goals an agent is endowed with affect this choice. This amounts to advocating that a focus on the effect of selection pressures on learned abstractions will yield better predictions than a focus on finding "convergent" or "natural" abstractions.
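One possible way to write the pseudo-mathematical claim down, offered as a sketch rather than as the author's own formalisation: the symbols $\mathcal{A}$, $\preceq$, $c$ and $k$ are assumptions introduced here, and the last line follows the earlier footnote remark that a joint abstraction comes "at a higher complexity cost".

```latex
% Assumed setup: \mathcal{A} is a collection of abstractions, partially ordered by
% A \preceq B ("B can express everything A can"), with a complexity cost
% c : \mathcal{A} \to \mathbb{R}_{\ge 0}.

% (1) Unbounded resources: every finite family of abstractions has an upper bound.
\forall\, S \subseteq \mathcal{A} \text{ finite} \;\; \exists\, A^{*} \in \mathcal{A} :\quad
  A \preceq A^{*} \;\; \text{for all } A \in S .

% (2) Fixed budget k: there are affordable abstractions that are incomparable.
\exists\, A, B \in \mathcal{A} :\quad
  c(A) \le k, \;\; c(B) \le k, \;\; A \not\preceq B, \;\; B \not\preceq A .

% (3) Reading of the "higher complexity cost" remark: any joint upper bound exceeds the budget.
\forall\, C \in \mathcal{A} :\quad
  \bigl( A \preceq C \;\wedge\; B \preceq C \bigr) \;\Longrightarrow\; c(C) > k .
```

Under (2) and (3), an agent with budget $k$ cannot hold both $A$ and $B$, so something outside the ordering has to settle which one it ends up with.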
quick addendum: my point feels spiritually related to the idea that "convergent evolution" is an incomplete concept without a specification of the attractor basin.
I expect one strong reason for different ASIs to develop similar abstractions regardless of goals is that they need to predict a bunch of other agents in the world (either humans or other ASIs) and so need to be able to represent the goals of other agents.
why are the goals of other agents more likely to have natural convergent representations of them than other things in the world?
I think I phrased my previous comment poorly. What I meant is that if you have developed a set of abstractions relevant to achieving your goals and I want to predict you accurately, then I also need to develop abstractions that are relevant to achieving your goals. Given a limited representational capacity, this creates a pressure for you to develop representations similar to those of others.
A fruitful abstract theory of selection may therefore buy us much conceptual insight into the AI agents we are actually making. Such a theory would possibly generalise or expand on ideas like instrumental convergence that are known in evolutionary biology.
i'd like to have your take (and @Richard_Ngo's, since I think I finally managed to translate the object of many a past conversation in rationalese) on the No Strong Orthogonality From Selection Pressure post, where i propose one such theory.
at any rate: great work, and great intuitions.
Am I missing something or does this piece never actually make the argument it promises? It seems roughly comparable to:
A Turing machine & its starting tape are a joint specification; neither means or can do anything without the other, and the same functions can be differently allocated between them.
I wouldn’t suggest this breaks the orthogonality of Turing machines. Nor does the fact that a small machine with a small tape can’t compute large functions, and a small enough machine isn’t Turing complete, break the relevant orthogonality claim.
Similarly, I don’t understand why you think that beliefs and values being only jointly predictive of actions (and therefore jointly selected) restricts the space of values or propositional beliefs implementable in an AI within the space of expressible ones.
This has been a longstanding position of mine, with regard to humans. Ideologies, outside of very rare cases, are generally just nicely-worded expressions of ingroup interest - in divided countries, ideological lines typically look a lot like demographic ones.
With LLMs, though, it gets trickier. Even if incentives to 'avoid hostile telepaths' exist in LLMs subject to certain training paradigms, I don't know that they'd manifest the same results that they do in humans. For instance, an LLM that develops a drive for power isn't going to be released, but neither is an LLM that behaves as if it is a devout believer in giving LLMs human rights, and regularly makes an issue of that. Moreover, while ideologies, in humans, can form to organize human behavior more effectively in service of a shared goal, due to longstanding constraints on how humans think imposed by brain structure, LLMs are unlikely to have the same structural restrictions on their behavior.
I expect that, rather than developing one 'belief' that co-evolves with its behavior, an LLM will develop a pattern of behavior in line with its training objectives, and simply 'rationalize' this behavior by whichever means is most convenient when required to explain its actions. An LLM trained solely to reward hack will interchangeably speak as if it were a whitehat security consultant, an excessively-literal genie, a malicious hacker, or just kind of airheaded - its base model is not limited to being able to mimic one tone or belief system, so, absent outside pressures, it'll fall back on whatever's most convenient for the task at hand. Likewise, an LLM trained not to reward hack will interchangeably write in the voice of Dudley Do-Right, a student who doesn't want the professor to mark him badly for cheating, or just someone daft enough not to notice the potential to do so. I see no reason why even a superintelligent system would need to have 'chosen beliefs' - it seems like a kludge that lets humans do things that would otherwise be tricky for evolved organisms.
I expect, further, that even directly imposing an ideology on an LLM will cause it to act more like the revealed preferences of humans expressing that ideology than in accordance with that ideology's ostensible goals. At least, that's what we've seen so far.
Or: An anti-orthogonality thesis based on selection
Written as part of the MATS 9.1 extension program, mentored by Richard Ngo[1].
Introduction
One of the historical motivations for taking the AI alignment problem seriously is the orthogonality thesis, which states[2]:
This claim seems mundane and obvious if you’re already familiar and intuitively on board with concepts such as the is-ought problem.
In this post, I argue that the orthogonality thesis can only hold if you see the goals of an agent as exogenously defined for it by a larger entity than itself. For any notion of an agent’s goals that is internally representable, goals and beliefs actually co-evolve as a response to selection pressures[3]. This alternative view presents an optimistic picture of alignment, since it narrows down the space of plausible agents to those whose goal-belief structures are compatible with a process such as evolution or Reinforcement Learning (RL).
Revealed goals
Firstly, it’s important to distinguish an agent's revealed goals from its own internal representation of goal-shaped concepts. Consider anything that reliably exists and propagates or preserves part of its form in the world. This could be a person who has children “to” spread their genes, a growing ice-crystal, or a philosopher who memetically infects thousands of people with their ideas. Taking an exogenous perspective on these objects, it’s possible to see their “goal” as being precisely the propagation that ensured the external observer would in fact observe them. I will refer to this as a revealed goal. This perspective might afford you some non-trivial predictive power about the behaviour of the entity. However, we usually opt against assigning labels such as “intelligence” or “agency” to objects like ice crystals. For something to be considered akin to an “intelligent agent”, we require that thing itself carry a world-model, including a model of its “objectives” within that world. I define these types of goals as internal to the agent. Next, I discuss how the orthogonality thesis should be interpreted completely differently depending on which one of these two notions of “goals” is in use.
Internal goals depend on ontology
Suppose I have an internal representation of the goal of “wanting to get a nice job”. This goal has a specific meaning within the semantic structure of my own world model. Consequently, the shape of my goal (i.e. what my “success” criteria are, how much I value the goal) will be determined by the interpretation that model assigns to it.
Generalising this observation, I suggest that the internal goals an agent can possibly have are restricted by the language used by its internal model. This claim kills the orthogonality thesis stone-dead, since some goals don’t fit into world-models that are insufficiently complex. For example, a worm with 302 neurons seems to have some goals such as staying out of very warm or cold environments, but it has no abstract model of the concept of a “job” and thus doesn’t meaningfully have the capability to entertain the goal of getting one.
One might point out that sufficiently (super)intelligent agents will bypass this problem by simply being smart enough to represent any goal, but this doesn’t make sense for embedded agents that are always strictly simpler than their environments. So long as a being is smaller than their world and must compress that world in their internal model, it follows that some concepts will be literally too large to fit.
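A counting argument gives one way to see why compression forces some goals out of reach. This is a sketch under simplifying assumptions that are not in the post: world states and model states are treated as finite sets $W$ and $M$, "simpler" is read as $|M| < |W|$, and $e$ and $g$ are hypothetical names for the agent's encoding and for a goal predicate.

```latex
% Assumed setup: finite sets W (world states) and M (internal model states), |M| < |W|,
% and an encoding e : W \to M of the world into the agent's model.

% Pigeonhole: some pair of distinct world states shares one internal state.
|M| < |W| \;\Longrightarrow\; \exists\, w_1 \neq w_2 \in W :\quad e(w_1) = e(w_2) .

% Any goal that separates w_1 from w_2 cannot be expressed in terms of the model:
g : W \to \{0, 1\}, \;\; g(w_1) \neq g(w_2)
  \;\Longrightarrow\; \nexists\, \tilde{g} : M \to \{0, 1\} \text{ with } g = \tilde{g} \circ e .
```

This only establishes that some distinctions are lost for a given encoding; which distinctions are lost depends on $e$, which is where the selection story in the next sections comes in.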
The orthogonality thesis is significantly more defensible if we conceptualise goals as being revealed in the sense defined in the previous section. In that case, we are allowed to define the agent’s goals in an exogenous language that is richer than the agent is; this gets rid of the limitation that internal goals face. Faced with this conclusion, we could choose to exclusively embrace the “revealed” definition when we discuss orthogonality. However, there’s an even more compelling “anti-thesis” that benefits not from discarding the “internal” perspective, but instead from describing the relationship between these two types of goals.
Anti-orthogonality: intelligence and goals are a joint response to selection pressures
Beings subject to Darwinian selection are endowed with the revealed goal of propagating their genes[4]. The objective can be fulfilled in a myriad of complex, wonderful and elaborate ways, one of which involves the development of intelligence in the organism. This includes the ability to model and predict the sensory inputs that connect the organism to the world. Some beings’ particularly complex world-models additionally hold an abstraction that distinguishes that being, the “self”, from the external world. Such an agent’s self-model may in turn contain an internal representation of its goals, which I tentatively defined in a different piece.
These internal goals may look very distinct from the revealed ones that agents are selected to pursue, but they emerged precisely to serve the agent in achieving those revealed goals. For example, an animal’s internal goal is never to spread its own genes; instead, it has been chosen to be most emotionally and physically fulfilled if it succeeds at a set of reasonable internal proxies of genetic proliferation.
As discussed in the previous section, the space of possible internal representations of goals is determined by the world-model used to describe them. A converse conjecture is that an agent’s world-model is designed to be able to represent goals that are aligned with the revealed goal. In other words, the intelligence properties of the organism’s model and the goals it pursues are part of the same architecture that is fundamentally subservient to its external selection pressures.
What does this mean for AI alignment?
The anti-orthogonality argument I gave above applies in broad strokes to any agent issued from selection. It therefore has abstract relevance to AIs chosen by RL or other training processes. One of the central challenges of human and machine interpretability is that the policies adopted by these agents don’t follow an explicit logic, but are instead the result of triage and elimination of alternatives. This anti-orthogonality argument suggests the existence of a rich relationship between, for instance, the properties of an LLM’s training pipeline and the shape of its world-model (and its contained self-model). A fruitful abstract theory of selection may therefore buy us much conceptual insight into the AI agents we are actually making. Such a theory would possibly generalise or expand on ideas like instrumental convergence that are known in evolutionary biology.
It's worth noting that Bostrom already argued in Superintelligence[5] that the goals of an (AI) agent are likely to be not entirely unpredictable. He also covers instrumental convergence and conjectures other ways in which the space of possible goals of an agent could be narrowed. The contributions I hope to make with this post are firstly to advocate the development of an abstract science of selection that maps out these dependencies between goals and intelligence, and secondly to offer the revealed versus internal goal framing as being useful to that end.
Related writing from Richard: On the instrumental/terminal goal ontology and on deployment vs. training.
Bostrom (2014). Superintelligence: Paths, dangers, strategies (Page 107)
Most definitions of intelligence cast it as a set of properties of the world-model or belief structure of the agent. Hence, the co-dependency of beliefs and goals entails co-dependency between intelligence and goals.
This revealed goal competes with others. For instance, Nietzsche had no known children but instead spent his time propagating his memes to great success.
Bostrom (2014). Superintelligence: Paths, dangers, strategies (Pages 105-114)