Hierarchical Agency: A Missing Piece in AI Alignment

This was an impressive demonstration of Claude for interviews. Was this one take?

(Also what prompt did you use? I like how your Claude speaks.)

There was some selection of branches, and one pass of post-processing.

It was after ~30 pages of a different conversation about AI and LLM introspection, so I don't expect the prompt alone will elicit the "same Claude". Start of this conversation was

Thanks! Now, I would like to switch to a slightly different topic: my AI safety oriented research on hierarchical agency. I would like you to role-play an inquisitive, curious interview partner, who aims to understand what I mean, and often tries to check understanding using paraphrasing, giving examples, and similar techniques. In some sense you can think about my answers as trying to steer some thought process you (or the reader) does, but hoping you figure out a lot of things yourself. I hope the transcript of conversation in edited form could be published at ... and read by ...

Overall my guess is this improves clarity a bit and dilutes how much thinking per character there is, creating somewhat less compressed representation. My natural style is probably on the margin too dense / hard to parse, so I think the result is useful.

[-]Chris Lakin1y93

Do we have a LessWrong tag for "hierarchical agency" or "multi-scale alignment" or something? Should I make one?

[-]Jan_Kulveit1y20

I guess make one? Unclear if hierarchical agency is the true name

[-]Chris Lakin1y20

Yeah i'm confused about what to name it. we can always change it later i guess.

also let me know if you have any posts you want me to definitely tag for it that you think i might miss otherwise

[-]Mateusz Bagiński1y50

Compositional agency?

[-]Seth Herd1y72

1. Are you working on this because you expect our first AGIs to be such hierarchical systems of subagents?
   a. Or because you expect systems in which AGIs supervise subagents?
2. In either case, isn't the key question still whether the agent(s) at the top of the hierarchy are aligned?
3. In other areas of complex systems (economics, politics and nations, and notably psychology), mathematical formulations address sub-parts of the systems, but typically are not relied on for an overall analysis. Instead, understanding complex systems requires integrating a number of tools for understanding different parts, levels, and aspects of the system.
   a. I worry that the cultural foundations of AI alignment bias the people most serious about it to focus excessively on mathematical/formal approaches.

[-]Jan_Kulveit1y20

I expect "first AGI" to be reasonably modelled as composite structure in a similar way as a single human mind can be modelled as composite.
The "top" layer in the hierarchical agency sense isn't necessarily the more powerful / agenty: the superagent/subagent direction is completely independent from relative powers. For example, you can think about Tea Appreciation Society at a university using the hierarchical frame: while the superagent has some agency, it is not particularly strong.
I think the nature of the problem here is somewhat different than typical research questions in e.g. psychology. As discussed in the text, one place where having a mathematical theory of hierarchical agency would help is making us better at specifications of value evolution. I think this is the case because such a specification would be more robust to scaling of intelligence. For example, compare a learning objective
a. specified as minimizing KL divergence between some distributions
b. specified in natural language as "you should adjust the model so the things read are less surprising and unexpected"
You can use objective b. + RL to train/finetune LLMs, exactly like RLAIF is used to train "honesty", for example.
A possible problem with b. is that the implicit representations of natural language concepts like honesty or surprise are likely not very stable: if you trained a model mostly on RL + however Claude understands these words, you would probably get pathological results, or at least something far from how you understand the concepts. Actual RLAIF/RLHF/DPO/... works mostly because it is relatively shallow: more compute goes into pre-training.
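
To make the contrast between objectives a. and b. concrete, here is a minimal sketch of what "specified as minimizing KL divergence" pins down. It assumes two discrete distributions over the same outcomes; the function name and the example distributions are hypothetical illustrations, not anything from the post. The point is only that objective a. is a fixed formula that means the same thing regardless of how capable the optimizer is, while objective b. depends on how the model interprets words like "surprising".

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # outcomes with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q puts none
        total += pi * math.log(pi / qi)
    return total

# Hypothetical numbers: a "data" distribution and two candidate model distributions.
data_dist = [0.7, 0.2, 0.1]
model_before = [1 / 3, 1 / 3, 1 / 3]
model_after = [0.6, 0.3, 0.1]

print(kl_divergence(data_dist, model_before))
print(kl_divergence(data_dist, model_after))
```

Under these assumed numbers the second model is closer to the data distribution, so a training step that moves from `model_before` to `model_after` counts as progress under objective a. without any appeal to natural-language concepts.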

[-]Seth Herd1y2-1

Ah. Now I understand why you're going this direction.

I think a single human mind is modeled very poorly as a composite of multiple agents.

This notion is far more popular with computer scientists than with neuroscientists. We've known about it since Minsky and think about it; it just doesn't seem to mostly be the case.

Sure you can model it that way, but it's not doing much useful work.

I expect the same of our first AGIs as foundation model agents. They will have separate components, but those will not be well-modeled as agents. And they will have different capabilities and different tendencies, but neither of those are particularly agent-y either.

I guess the devil is in the details, and you might come up with a really useful analysis using the metaphor of subagents. But it seems like an inefficient direction.

[-]Seth Herd1y77

I strongly support you using this format if it helps you share your thinking. It sounds like we wouldn't be seeing this any time soon without the interview format. It's interesting and I might try it. And I encourage others to do so if it helps them share sooner or more efficiently.

Along those lines, I strongly encourage any sort of AI-assisted writing as long as the central ideas are human-generated or at least thoroughly thought-through and endorsed by the human posting them.

This post and every post longer than two paragraphs would really benefit from some sort of summary or TLDR so people can prioritize properly.

Questions/thoughts on the comment posted separately.

[-]Davidmanheim1yΩ251

Are you familiar with Davidad's program working on compositional world modeling? (The linked notes are from before the program was launched, there is ongoing work on the topic.)

The reason I ask is because embedded agents and agents in multi-agent settings should need compositional world models that include models of themselves and other agents, which implies that hierarchical agency is included in what they would need to solve.

It also relates closely to work Vanessa is doing (as an "ARIA Creator") in learning theoretic AI, related to what she has called "Frugal Compositional Languages" and see this work by @alcatal - though I understand both are not yet addressing multi-agent world models, nor are they explicitly about modeling the agents themselves in a compositional / embedded agent way, though those are presumably desiderata.

[-]Gunnar_Zarncke1y50

In the section about existing theories that could be related I was missing Luhmann's Social Systems theory. It is not a mathematical theory but otherwise fits the desiderata. This is what o1-preview says about it (after being fed the above dialog up to the theories suggested by Claude):

Introducing Luhmann's Social Systems Theory into our discussion is an intriguing idea. Let's explore how it might help us address the challenge of modeling hierarchical agency with a suitable mathematical formalism.
Overview of Luhmann's Social Systems Theory:
Niklas Luhmann's theory conceptualizes society as a complex set of self-referential social systems composed not of individuals or actions, but of communications. In his framework:
Autopoietic Systems: Social systems are autopoietic, meaning they are self-producing and maintain their boundaries through their own operations. They continuously reproduce the elements (communications) that constitute them.
Functional Differentiation: Society is differentiated into various subsystems (e.g., economy, law, politics), each operating based on its own binary code (profit/non-profit, legal/illegal, power/no power).
Communication as the Fundamental Unit: Communications are the basic elements, and meaning is generated through the network of communications within a system.
Operative Closure and Cognitive Openness: Systems are operationally closed—they can only refer to themselves—but cognitively open, as they observe and are influenced by their environment.
Relating Luhmann's Theory to Hierarchical Agency:
Type Consistency Across Levels:
Agents as Systems: If we consider both subagents and superagents as autopoietic systems of communications, we maintain type consistency. Both levels are constituted by the same fundamental processes.
Scale-Free Structure: Since Luhmann's theory doesn't privilege any particular scale, it aligns with your desideratum for a scale-free formalism.
Expressiveness for Real-World Situations:
Conflict and Integration: Luhmann's concept of structural coupling explains how different systems interact and influence each other without losing their autonomy. This could model conflicts between layers or how a superagent gains agency at the expense of subagents.
Emergent Properties: The emergent nature of social systems in Luhmann's theory parallels how superagents might exhibit properties not reducible to their subagents.
Representation of Intentionality:
System Intentionality: While individual intentions are abstracted away, systems exhibit purposive behavior through their operations. This could be adapted to represent the goals and beliefs of agents at different levels.
Communication of Intentions: Intentionality can be embedded in the communications that constitute the system, allowing us to model beliefs and desires.
Towards a Mathematical Formalism:
While Luhmann's theory is primarily sociological and lacks a mathematical framework, we can attempt to formalize its concepts:
Agent-Based Modeling:
Agents as Communication Processes: Model agents (both subagents and superagents) as processes that generate and respond to communications.
Network Dynamics: Use graph theory to represent communications as edges and agents as nodes, allowing us to analyze the structure and dynamics of interactions.
Category Theory:
Objects and Morphisms: Utilize category theory to model agents as objects and communications as morphisms, preserving type consistency and composability.
Functorial Relationships: Define functors that map between categories of agents at different hierarchical levels.
Dynamic Systems Theory:
Differential Equations: Represent the evolution of communications and agent states over time using differential equations.
Attractors and Stability: Analyze how certain communication patterns lead to stable superagent formations.
Information Theory:
Entropy and Redundancy: Measure the information content of communications, which could correlate with the intentionality and decision-making processes of agents.
Mutual Information: Assess how much information is shared between subagents and superagents, indicating the degree of coupling.
Potential Benefits:
Unified Framework: By treating communications as the foundational elements, we can model both agents and superagents within the same formal structure.
Emergence and Self-Organization: The formalism could capture how complex behaviors emerge from simple interactions, which is crucial for understanding hierarchical agency.
Applicability to AI Systems: For AI alignment, this approach might help in designing AI agents that can form or integrate into higher-level systems without losing alignment with human values.
Challenges:
Abstract Nature: Luhmann's theory is highly abstract, which might make direct mathematical translation difficult.
Intentionality Representation: Modeling intentionality purely through communications may not capture all aspects of beliefs and desires inherent in agents.
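
As an illustration of the "Mutual Information" suggestion above, here is a minimal sketch of how one might measure coupling between a subagent and a superagent. It assumes (hypothetically) that each can be summarized by a discrete variable and that their joint distribution is known; the variable names and example tables are made up for illustration and are not part of Luhmann's theory or the quoted output.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits, where joint[x][y] = P(X=x, Y=y)."""
    px = [sum(row) for row in joint]          # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (columns)
    mi = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0.0:  # zero-probability cells contribute nothing
                mi += pxy * math.log2(pxy / (px[x] * py[y]))
    return mi

# Hypothetical joint distributions over (subagent state, superagent state).
independent = [[0.25, 0.25], [0.25, 0.25]]  # no coupling
coupled = [[0.5, 0.0], [0.0, 0.5]]          # states always match

print(mutual_information(independent))
print(mutual_information(coupled))
```

In this toy setup, zero mutual information corresponds to a subagent whose state tells you nothing about the superagent, and higher values to tighter coupling. Whether such a scalar captures anything like agency is exactly the open question raised under "Challenges".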

[-]João Ribeiro Medeiros1y42

The scale problem is so universal and hard to tap into. Taking lessons from physics, I would caution against building a fully generalized framework where agents and subagents function under the same interactions: there are transitions between the micro and macro states where symmetry breaks completely. Complexity also points towards this same problem: emergent behaviour in cellular automata is hardly well predicted from the smaller parts which make up that behaviour.

[-]Chris Lakin1y40

I just made a twitter list with accounts interested in hierarchical agency (or what i call "multi-scale alignment"). Lmk who should be added

[-]Chris Lakin1y40

Random but you might like this graphic I made representing hierarchical agency from my post today on a very similar idea. What would you change about it?

[-]Chris Lakin1y40

I'm glad you wrote this! I've been wanting to tell others about ACS's research and finally have a good link

[-]rvnnt1y32

A related pattern-in-reality that I've had on my todo-list to investigate is something like "cooperation-enforcing structures". Things like

legal systems, police
immune systems (esp. in suppressing cancer)
social norms, reputation systems, etc.

I'd been approaching this from a perspective of "how defeating Moloch can happen in general" and "how might we steer Earth to be less Moloch-fucked"; not so much AI safety directly.

Do you think a good theory of hierarchical agency would subsume those kinds of patterns-in-reality? If yes: I wonder if their inclusion could be used as a criterion/heuristic for narrowing down the search for a good theory?

[-]Noosphere891y30

Most of the basis of cooperation-enforcing structures, I'd argue, rests on 2 general principles:

An iterated game, such that there is an equilibrium for cooperation, and
The ability to enforce a threat of violence if a player defects, ideally credibly; this often extends to a monopoly on violence.

Once you have those, cooperative equilibria become possible.

[-]Davidmanheim1y20

Norms can accomplish this as well - I wrote about this a couple weeks ago.

[-]Noosphere891y20

I basically agree that norms can accomplish this, conditional on the game always being iterated. Indeed, conditional on countries being far-sighted enough, almost any outcome is possible, thanks to the folk theorems.
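
A worked instance of the "far-sighted enough" condition, as a minimal sketch: in an infinitely repeated prisoner's dilemma with the standard payoffs T > R > P (temptation, mutual cooperation, mutual defection), a grim-trigger strategy sustains cooperation when the discount factor is at least (T - R) / (T - P). The payoff values below are hypothetical, chosen only for illustration, and the function names are not from the thread.

```python
def grim_trigger_threshold(T, R, P):
    """Smallest discount factor at which cooperating forever is at least as
    good as defecting once and being punished with mutual defection forever."""
    return (T - R) / (T - P)

def cooperation_is_stable(delta, T=5.0, R=3.0, P=1.0):
    """Compare discounted payoff streams directly (requires 0 <= delta < 1)."""
    cooperate_forever = R / (1 - delta)          # R + delta*R + delta^2*R + ...
    defect_once = T + delta * P / (1 - delta)    # T now, then P forever
    return cooperate_forever >= defect_once

print(grim_trigger_threshold(5.0, 3.0, 1.0))
print(cooperation_is_stable(0.6))
print(cooperation_is_stable(0.4))
```

This covers only one equilibrium; the folk theorems say that with sufficiently patient players many other outcomes can be sustained as well, which is the sense in which "almost any outcome is possible".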

[-]Oliver Sourbut6mo20

Nice! I'm very late to this. A few thoughts.

Focusing on 'full agents' might be misleading. Humans are fairly agenty as far as things go. Mobs and corporations and countries are a bit less agenty on the whole. So maybe you need some spectrum (or space) of agentness for things to be appropriately similarly-typed.

Game theory and MDPs and so on often treat the population as static. But we spawn (and dissolve) actors all the time. So a full theory here would need to build that in from the start. This has the nice property that it might be a foundation to describe a hierarchy too: mobs, corporations, and countries get spawned, dissolved, transformed, etc. all the time, after all. (Looser thought: could some agents-composed-of-parts be 'suspended' or 'transferred' to different substrate...?)

[-]Richard_Ngo9moΩ220

I didn't end up putting this in my coalitional agency post, but at one point I had a note discussing our terminological disagreement:

I don’t like the word hierarchical as much. A theory can be hierarchical without being scale-free—e.g. a theory which describes something in terms of three different layers doing three different things is hierarchical but not scale-free.

Whereas coalitions are typically divided into sub-coalitions (e.g. the "western civilization" coalition is divided into countries which are divided into provinces/states; political coalitions are divided into different factions and interest groups; etc). And so "coalitional" seems much closer to capturing this fractal/scale-free property.

This has various downsides, like difficulty in coordinating with other people, incomprehensibility to funders, etc.