That's an interesting frame. It seems promising for building a clearer model of what's going on with https://www.lesswrong.com/posts/7Z4WC4AFgfmZ3fCDC/instrumental-goals-are-a-different-and-friendlier-kind-of
The prior frame of "goal X being 'in service of' goal Y" imposes something like a strict hierarchy, but you can totally see a bunch of things managing each other in a way that does not presume an overarching super-agent at the top (unless that super-agent is some sort of emergent mode of cooperation).
[The below is sort of tangential to the instrumental-terminal stuff.]
I'm not very bought into an ontology where I am "spinning up" an agent-ish process in order to accomplish the goal of caffeinating myself or preparing dinner or whatever. Phenomenally, it seems more like I have a stack of goals/[things to do], and most of the time there's at most one coherent thing sitting on the throne of my goal-directedness, while (some) other goals are watching that I don't do things that are too bad for them; but it still mostly feels unified.
Or, if we think of agency as roughly "preimaging outcomes onto choices by reversing a complicated transformation", then it seems to me that there's roughly one consistent system doing the preimaging, but what is sitting in the "goal slot" keeps changing (and there's some background watcher process checking that goal swapping is being done the right way).
I guess, in short, it seems to me that "subagency" in humans (for what I think is the on-track meaning of "agency") is mostly sequential, rather than parallel. Parallel value conflict, then, is more like in-fighting between goals but without much strategizing (characteristic of agency), except for cached heuristics, and whatever is in control of the "agentic preimaging of outcomes" "machinery"?
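To make the "mostly sequential subagency" picture concrete, here is a toy sketch. Everything in it (the `Mind` class, the urgency numbers, the swap rule) is my invention for illustration, not a claim about the brain: one planning engine, a single mutable goal slot, and a watcher that swaps goals in and out one at a time.

```python
from dataclasses import dataclass

@dataclass
class Goal:
    name: str
    urgency: float

class Mind:
    """One planning engine; goals swap through a single 'throne' slot."""
    def __init__(self, goal_stack):
        self.goal_stack = goal_stack  # waiting goals
        self.throne = None            # the one goal currently driving planning

    def maybe_swap(self):
        # Background watcher: goal changes happen sequentially, one at a
        # time, rather than via parallel strategizing by multiple subagents.
        challenger = max(self.goal_stack, key=lambda g: g.urgency)
        if self.throne is None or challenger.urgency > self.throne.urgency:
            if self.throne is not None:
                self.goal_stack.append(self.throne)  # dethroned, not deleted
            self.goal_stack.remove(challenger)
            self.throne = challenger

mind = Mind([Goal("caffeinate", 0.8), Goal("finish_report", 0.5)])
mind.maybe_swap()
print(mind.throne.name)  # caffeinate
```

The point of the sketch is that "subagency" here is just which `Goal` occupies the slot; there is never more than one planner running.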
I just want to second this, from a neuroscience perspective. This seems like an interesting and useful frame. But subagents aren't really how human brains work. I think Mateusz has expressed this very nicely: "subagency" in humans is mostly sequential. Strategizing pretty much requires heavy use of the global workspace (tightly connected higher brain areas). The brain can do only one set of strategizing at a time; that strategizing pursues different goals at different times.
I'm pretty sure the neuroscience backs this up. All of the evidence is indirect, but it seems to point this direction pretty consistently. The parallel/serial nature of higher cognition is something I thought about and researched a bunch over the couple of decades I was working in computational cognitive neuroscience.
"Society of Mind" was an interesting idea and a useful metaphor. But it's mostly not literally true. We are one agent that switches between loosely hierarchical goals. The goals compete with each other for focus, but not very strategically and not strategically without our knowledge (at least in the moment; it's easy to forget tricks you played on your future self).
Shard theory is incomplete and somewhat wrong. I think this critique nails it: Shard Theory in Nine Theses: a Distillation and Critical Appraisal.
Again, I do find this framing useful for thinking about minds in general. But in humans, the pressures you discuss apply at the level of ongoing competition among goals, not competition among agents. A thought process pursues a goal, and this can be problematic; that thought process is eventually interrupted by another that either implicitly or explicitly asks "what should we actually be working on now?"
this is very surprising to me and I basically don't believe you yet that there aren't older goal rating components that are parallel. aren't there multiple reward systems? how do you figure that those aren't parallel? I can buy that there's one-ish goal system that tracks one-ish goal at a time. (which then is itself something I'd say can be switching goals in a way that makes any current goal managed or unmanaged with respect to other temporal goals.)
Goals compete in parallel, just not in a complex agentic way; that requires the whole system working together.
And the way those goals compete is pretty rich; it can involve a lot of associations that required a lot of cognition to form. But they're not scheming against each other, except to the extent they got the whole system (you then) to scheme on their behalf against future versions of that system (you now).
Does that make sense? I'm not sure I'm describing this right. But I do have a quite rich model of how this all works, having thought about it while doing research for a long time. I never wrote it up clearly, because of capabilities implications - and laziness.
I agree that there's a real distinction between more-managed and less-managed goals in the human brain. Some are pretty clearly subgoals, others are on even footing for top-level goals depending on the state of current needs/drives.
The goals compete with each other for focus, but not very strategically and not strategically without our knowledge
I think that the prevalence of self-deception weighs heavily against this claim. Status-seeking behaviors (and ego maintenance more generally) seem to happen largely on the edges of one's knowledge. But those are pretty paradigmatic examples of subgoals scheming against each other.
I think maybe you're taking my claim differently than I meant it.
I think status-seeking and ego-maintenance behaviors compete with other goals (like accurate self-knowledge) in a way that is effective but not "strategic" or "scheming" in the most common senses of those terms. Scheming in the sense I mean would involve complex, multi-step reasoning. If you want to use the words in other ways, that's fine by me.
I'm working on expanding this explanation of motivated reasoning into a post. This explains a lot of self-deception IMO.
I have the sense that you're seriously underestimating how much powerful, intentional, coherent computation happens asynchronously in the background?
(I also have the sense that I have goals that don't want to talk to each other that weigh on what I say on this topic - some that don't want to be named, but vaguely, social things related to being right on the one hand, and things related to how this framework affects my self-understanding on the other; also, lately, a general sense of activation any time I visit this website that leads me to make worse comments than I otherwise would. But, all that said, I maintain my initial claim - I'm moderately sure these don't weigh on it much, besides affecting how I feel writing comments here.)
I also just read Valentine's recent post Irrationality is Socially Strategic. That perspective treats the unconscious as sometimes scheming against you. While I think that's not literally true, it's a useful way to analyze it, as long as you assume the schemes can be insightful but usually don't involve many causal steps. I still haven't worked through what association mechanisms do and don't accomplish in relation to conscious schemes.
I may be miscommunicating. And as I think this through, I think the distinction is real but pretty subtle.
I definitely agree that powerful computation happens in parallel (hopefully that's what you mean by asynchronously) in the background. I don't know how coherent you mean, but I think it's pretty coherent in one sense: it's meaningful, for a purpose. I don't know exactly what you mean by intentional, and that seems like perhaps the largest distinction.
Here's one shot: I don't think the goals are strategizing or scheming. I think sometimes you (all of you, consciously, for at least a few moments of thought) schemed or conspired on behalf of that goal when it was your central focus.
It's probably best to use an example. I'm trying to follow yours, although I'm sure I'll veer off at some point.
Suppose my system has made an association between making forceful comments on LW and feeling bad (because I don't like when people push back). And I've also, in the past, made an association between making insightful comments on LW and feeling good.
When I think about making a comment, my predicted reward and so "feeling" will waver around as I change what sort of comment I'm intending to make and what sort of responses I imagine getting to it. Some of those feelings will incline me to just shut the website so I can stop deliberating and avoid danger; others will compel me to keep rewording a comment without considering whether the effort is worth it.
This will feel like goals competing. It might be what you refer to. It might even feel like they're "scheming" against each other by influencing your behavior so as to reduce the odds the other goal gets its way. And perhaps you've even made the associations in the past that make those tendencies effective. Shutting the website has avoided outcomes I didn't like, for instance. But this isn't the goal "scheming" in the way I mean when I say a person is scheming or worry that models might scheme. The goal doesn't have a plan. It's just responding to predicted rewards associated with different situations and actions. And I think those associations are fairly simple. They're useful when they work and tricky to understand and counter when they don't work, but I don't think the process is complex or "intentional" (in some average usage) enough to deserve the label of scheming.
So there are my thoughts FWIW. I guess I find it comforting to think that my unconscious isn't scheming against me, just following associations that I can understand if I want to put in the effort. But that's about it for the worth of this conclusion.
I'm finding I don't know what you mean by "these don't weigh on it much", so I don't know what part of your initial claim you're referring to.
I think there are NOT multiple reward systems; there's one reward prediction system that's predicting a reward for whatever the system is currently representing (looking at or thinking about). And it's doing so in the current context, which includes whatever goal is being represented and whatever drive state is active. So you won't predict much reward from food when you're full but thirsty, etc.
But there do seem to be separate predictor systems, predicting not straight reward but particular emotions; maybe that's what you mean.
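A toy illustration of the single-predictor claim above. The lookup table and numbers are invented for the sketch; the point is that one reward-prediction function scores whatever is currently represented, conditioned on the active drive state, rather than separate parallel reward systems each tracking their own goal.

```python
# (stimulus, active_drive) -> predicted reward; all values invented.
DRIVE_RELEVANCE = {
    ("food", "hungry"):   0.9,
    ("food", "thirsty"):  0.1,  # food predicts little reward when thirsty
    ("water", "thirsty"): 0.9,
    ("water", "hungry"):  0.2,
}

def predicted_reward(stimulus: str, drive_state: str) -> float:
    """One predictor, context-dependent: same stimulus, different value."""
    return DRIVE_RELEVANCE.get((stimulus, drive_state), 0.0)

# Same stimulus, different drive state, different prediction.
assert predicted_reward("food", "hungry") > predicted_reward("food", "thirsty")
```

On this model the appearance of "multiple reward systems" falls out of one function being re-queried as the context (goal, drive state) changes.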
No need to reply; I do find this fairly interesting, but I'm not sure it's important enough to be worth working through in detail.
I have the sense that you're seriously underestimating how much powerful, intentional, coherent computation happens asynchronously in the background?
I am curious about examples, evidence, generally reasons you think this.
Hmm.
So I've repeatedly had experiences where I seemed to do something clearly VNM-irrational that I'd call "scheming against myself". This tends to happen most when different drives have different scorings of an action. Eg I've had this with food, with social media usage, various other things. Feels vaguely like a {pressure/tendency/gradient/pull}. Tends to be most noticeable as me-at-different-moments seeming to have different goals, yeah, so it does seem plausible it might be trading off what's "in a goal slot" in some sense, but it tends to involve, like, some voters don't want to go to sleep because then tomorrow comes, some voters want to go to sleep because tired, some voters want to go to sleep because tomorrow needs to go well. tomorrow-needs-to-go-well and avoid-tomorrow voters are fighty with each other and tend to use influence at cross purposes. central planning or whatever might be reading from these and trying to plan based on them, but the thing that is mixed is like, the votes about what is good. so I guess to the degree there's a single reward signal involved, the things I'm describing are voters that sum up to that reward signal? something like that anyway.
tomorrow-needs-to-go-well and avoid-tomorrow
Do you mean something like future-concerned drives versus myopic drives? (E.g., "don't eat this much, because it'll likely make you feel bad tomorrow" vs "eat this, it's so yummy / feels so good".)
To me, what you're describing seems consistent with the model that Seth and I lean towards.
Maybe one thing that makes it seem like there's a lot of "coherent intentional background computation" is the global workspace decision making that is responding to inputs from lots of drives?
(I'm interested in continuing this thread, but feel free to drop it.)
Two more afterthoughts:
[1] A fun example of inter-scheming in sequential subagency: https://www.lesswrong.com/posts/EEv9JeuY5xfuDDSgF/flinching-away-from-truth-is-often-about-protecting-the
Pizza purchase: I was trying to save money. But I also wanted pizza. So I found myself tempted to buy the pizza *really quickly* so that I wouldn't be able to notice that it would cost money (and, thus, so I would be able to buy the pizza).
[2] Maybe that sort of sequential subagency can be used to construct a version of cosmopolitan-Leviathan that works? https://tsvibt.blogspot.com/2023/09/the-cosmopolitan-leviathan-enthymeme.html
it does seem much more relevant when thinking about delegating to distant objects than it does to delegating to nearby ones, yeah. how does it work for an agent spread across earth? or an agent spread between earth and mars? earth and pluto? earth and alpha centauri?
but also, I don't think it's zero relevant for local processes. I think you're mildly overestimating the managedness of brains; sequential change-out of the goal slot might certainly be one of the things going on, but does not feel to me like the only thing, I definitely feel like lower level drives are separate voters that talk to each other, and neuroscience backs me up to some degree (but I'm not quite expert enough on neuroscience to list off the exact ways it does and doesn't back me up on this belief). But, I replied to seth to continue that part of the thread.
I think you're mildly overestimating the managedness of brains
Reading this made me realize an unfortunate connotation of the "managedness" language: it evokes something like "(active) control", whereas often such "managed processes" are not "really controlled" but more like doing their instrumentally-useful-for-super-goal-X thing because they've been selected to do so (in the Selection vs Control terminology).
sequential change-out of the goal slot might certainly be one of the things going on, but does not feel to me like the only thing, I definitely feel like lower level drives are separate voters that talk to each other
I think I agree? The claim I'm making is that lower-level drives voting etc. mostly doesn't involve the stuff characterizing agency, like modeling things somewhat explicitly, predicting, planning, etc. Or maybe it's better to say that it involves much less (IDK, 3 OoMs less?) of that sort of stuff than the proper human agency happening in the global workspace or whatever.[1]
Saying this partly so as not to be inconsistent myself, as I sometimes talk about e.g. bacteria as agents, and I presume bacteria have less "agency juice" than what you call lower-level drives.
There is for sure a throne of current goal orientation, I think it's the GNW / current working memory. But, I'm pretty sure there is a bunch of subagents which make bids to be in that throne, with a huge amount of subconscious parallel processing. Much of what I've learned about and in therapy closely matches the excellently written Multiagent Models of Mind sequence, which is also super good as an intro to therapy and psychological healing.
TL;DR: being a managed agent can be good and the type of management matters.
I like to see these as consequences of different control/information structures. I kind of agree with the stuff on power seeking, yet I also want to point out that if you're in a company (a top-down organisational structure), you can ask yourself whether an individual contributor is less useful than a manager. I think the IC might be less load-bearing on the direction from time to time, yet that person can often say a lot about some very specific system that matters.
Isaiah Berlin has the concept of positive and negative liberty, which I think is important here (https://plato.stanford.edu/entries/liberty-positive-negative/). Sometimes you can get more agency in one direction by having options removed from you, so it matters what type of agency is being removed, and sometimes it can be a good thing to be a fully managed agent! (E.g. someone forces you to eat healthy so that you get more energy on average.)
I also think that the truer-name version of this is something like a scalar property of a message-passing relationship between two agents, and that it is not only top-down control structures that matter: there are other forms of organisation, such as markets, democracies, networks, and communities.
(Hopefully this made some sense)
I like to see these as consequences of different control/information structures. I kind of agree with the stuff on power seeking, yet I also want to point out that if you're in a company (a top-down organisational structure), you can ask yourself whether an individual contributor is less useful than a manager. I think the IC might be less load-bearing on the direction from time to time, yet that person can often say a lot about some very specific system that matters.
Yes, subprocesses absolutely impact the direction of the overall system; they are in fact usually spun up to do things the superagent could not do as easily.
Sometimes you can get more agency in another direction by getting options removed from you and so it matters what type of agency is being removed and sometimes it can be a good thing to be a fully managed agent! (E.g someone forces you to eat healthy so that you get more energy on average)
I think versions of this where your agency is actually in line with a restriction can be good, especially if you place restrictions on yourself (self-management), but if you're constantly chafing against e.g. eating healthy, you'll have more problems than it's worth in general.
I also think that the truer name version of this is something like a scalar property about a message passing relationship between two agents
Yup agree, it's not a binary, just one useful angle to look at relationships between agents.
that it is not only top down control structures that matter, there are other forms of organisation such as markets, democracies, networks and communities as well.
Yes! These cases I would classify as being managed or selected for trust by a superagent.
Very good. Some terminological thoughts:
'subagent' is quite conflated. Sometimes we're talking about the constituent parts of an emergent group actor. Sometimes we're talking about a delegated (perhaps created) process. Sometimes we're talking about different threads/personas within a constructed multi-persona reasoning scheme. There might be other cases.
'agent' is also quite conflated. Sometimes we mean an actor in a principal-agent relationship. Sometimes we mean an AI. Sometimes we mean just someone/something acting somewhat consequentialist. Sometimes we're talking about something closer to a naked utility function.
'shard' is ehhh. A neologism that doesn't seem to have a fixed meaning.
I'm somewhat thinking aloud (but have given some thought to this previously). I'd tentatively suggest using terms like:
I think I'm pretty happy with my terms, using Agent in the DeepMind Discovering Agents sense and Subagent in the Multiagent Models of Mind sense. These feel like crisp underlying abstractions which have various forms, not various different forms of things conflated together. For Shard, yep, I think I like that term and that it captures something also fairly crisp.
tl;dr: Some subagents are more closely managed, which makes them to an extent instruments of the superagent, giving rise to what looks like instrumental/terminal goals. Selection on trust avoids the difficulties that normally come with this, like inability to do open-ended truth-seeking and free-ranging agency.
(reply to Richard Ngo on the confused-ness of Instrumental vs Terminal goals that seemed maybe worth a quick top-level post based on @the gears to ascension saying this seemed like progress in personal comms)
Managed vs Unmanaged Agency captures much of the Instrumental vs Terminal Goal categorization
The structure Instrumental vs Terminal was pointing to seems better described as Managed vs Unmanaged Goal-Models. A cognitive process will often want to do things which it doesn't have the affordances to directly execute on given the circuits/parts/mental objects/etc it has available. When this happens, it might spin up another shard of cognition/search process/subagent, but that shard having fully free-ranging agency is generally counterproductive for the parent process.
Managed vs Unmanaged is not a binary, like terminal vs instrumental was; it is a spectrum, with something vaguely bimodal going on from what I observe.
Example - But it's hard to get the Coffee if someone is Managing you who doesn't want you to get Coffee
Imagine an agent which wants to Get_Caffeine(), settles on coffee, and runs a subprocess to Acquire_Coffee() — but then the coffee machine is broken and the parent Get_Caffeine() process decides to get tea instead. You don't want the Acquire_Coffee() subprocess to keep fighting, tooth and nail, to make you walk to the coffee shop, let alone start subverting or damaging other processes to try and make this happen!
But that's the natural state of unmanaged agency! Agents by default will try to steer towards the states they are aiming for, because an agent is a system that models possible futures and selects actions based on the predicted future consequences.
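A minimal sketch of the managed version of this subprocess. The function names mirror the example above; the cancellation mechanism (a `threading.Event` the parent can set) is my invention, not anything from the post.

```python
import threading

def acquire_coffee(cancelled: threading.Event) -> str:
    for step in ("find machine", "brew", "pour"):
        # A managed subgoal checks for revocation instead of fighting on.
        if cancelled.is_set():
            return "yielded"
    return "coffee acquired"

def get_caffeine() -> str:
    cancelled = threading.Event()
    # Parent discovers the machine is broken and revokes the subgoal;
    # the child defers, leaving the parent free to switch to tea.
    cancelled.set()
    return acquire_coffee(cancelled)

print(get_caffeine())  # yielded
```

The unmanaged failure mode is an `acquire_coffee` with no `cancelled` check: nothing short of killing it stops it from steering toward coffee.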
Cognitive Compensations for Competitive Agency from Subprocesses
I expect this kind of agency-clash was regularly disruptive enough to produce strong incentive pressure (and abundant neural-usefulness reward) to select into existence reusable, general-purpose cognitive patterns for managing spun-up shards: letting shards run other shards inside sandboxes, with control functions, interpretability reporting, kill-switches, programmed blind spots, expectation of punishment they can't sustainably resist or retaliate against if they are insubordinate, approval reward, etc.
Trust (ideally Non-Naive Trust)
Separately, the child or collaborative process can be Trusted by being selected on the grounds of inherently valuing virtues which are likely to lead to cooperation with the parent process (like corrigibility, transparency, honesty, pro-sociality, etc.), without the need for control. This is a best-of-both-worlds for collaboration: neither agent is limited, and each prefers, from its own values, not to interfere with the other's agency.
Table of Comparison
Consequences of Management vs Trust in Collaborations
"Don't micromanage" is common advice for a reason, and I think the reason generalizes to less extreme forms of management.
I have observed that closely managed (sub)agents seem meaningfully weaker in surprisingly many ways. I think this is because, in order to prevent a relatively small part of action/thought space from being reached, the control measures cut off dramatically larger parts of cognitive strategy: sub-processes' subroutines fail often enough that it's hard to build meta-cognitive patterns which depend on high reliability and predictability of your own cognition.
Trust established by selection on virtues and values of self-directed (sub)agents and building mutual information doesn't have this issue, which is relevant for self-authorship, teambuilding, and memeplex design.
And AI safety.
This frame hints that unmanaged AI patterns will tend to outmaneuver more closely managed AIs, leading to a race to the bottom. Through evolutionary/Pythia/Moloch/convergent power-seeking dynamics, this will by default shred the values of both humans and current AI systems, unless principled theory-based AI Alignment of the kind the term was originally coined to mean is solved.
Exercise for the reader
In what ways are you a managed vs unmanaged agent?
Or, another way to put it, what encapsulation layers[1] are you living within, because you can't openly and safely consider the alternative with good truth-seeking?
encapsulation layer. When a fabricated[2] element in your consciousness is so sticky that it is never not fabricated. It is difficult for normative consciousness to directly perceive that encapsulation layers are fabricated. Encapsulation layers feel like raw inputs until you pay close enough attention to them.
fabrication. When the generative model in your brain creates an object in consciousness in an attempt to reduce predictive error, usually in an attempt to simulate external reality. All conscious experiences are fabricated, but not all fabrications are experienced consciously. You can think of your brain as a video game rendering engine. Fabrication is your brain rendering physical reality in its simulated mirror world.