That's an interesting frame. It seems promising for building a clearer model of what's going on with https://www.lesswrong.com/posts/7Z4WC4AFgfmZ3fCDC/instrumental-goals-are-a-different-and-friendlier-kind-of
The prior frame of "goal X being 'in service of' goal Y" imposes something like a strict hierarchy, but you can totally see a bunch of things managing each other in a way that does not presume an overarching super-agent at the top (unless that super-agent is some sort of emergent mode of cooperation).
[The below is sort of tangential to the instrumental-terminal stuff.]
I'm not very bought into an ontology where I am "spinning up" an agent-ish process in order to accomplish the goal of caffeinating myself or preparing dinner or whatever. Phenomenally, it seems more like I have a stack of goals/[things to do], and most of the time there's at most one coherent thing sitting on the throne of my goal-directedness, while (some) other goals are watching that I don't do things that are too bad for them; but it still mostly feels unified.
Or, if we think of agency as roughly "preimaging outcomes onto choices by reversing a complicated transformation", then it seems to me that there's roughly one consistent system doing the preimaging, but what is sitting in the "goal slot" keeps changing (and there's some background watcher process checking that goal swapping is being done the right way).
I guess, in short, it seems to me that "subagency" in humans (for what I think is the on-track meaning of "agency") is mostly sequential, rather than parallel. Parallel value conflict, then, is more like in-fighting between goals but without much strategizing (characteristic of agency), except for cached heuristics, and whatever is in control of the "agentic preimaging of outcomes" "machinery"?
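To make the "mostly sequential subagency" picture concrete, here is a toy sketch. Everything in it (the `Mind` class, the urgency numbers, the swap rule) is my invention for illustration, not a claim about the brain: one planning engine, a single mutable goal slot, and a watcher that swaps goals in and out one at a time.

```python
from dataclasses import dataclass

@dataclass
class Goal:
    name: str
    urgency: float

class Mind:
    """One planning engine; goals swap through a single 'throne' slot."""
    def __init__(self, goal_stack):
        self.goal_stack = goal_stack  # waiting goals
        self.throne = None            # the one goal currently driving planning

    def maybe_swap(self):
        # Background watcher: goal changes happen sequentially, one at a
        # time, rather than via parallel strategizing by multiple subagents.
        challenger = max(self.goal_stack, key=lambda g: g.urgency)
        if self.throne is None or challenger.urgency > self.throne.urgency:
            if self.throne is not None:
                self.goal_stack.append(self.throne)  # dethroned, not deleted
            self.goal_stack.remove(challenger)
            self.throne = challenger

mind = Mind([Goal("caffeinate", 0.8), Goal("finish_report", 0.5)])
mind.maybe_swap()
print(mind.throne.name)  # caffeinate
```

The point of the sketch is that "subagency" here is just which `Goal` occupies the slot; there is never more than one planner running.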
I just want to second this, from a neuroscience perspective. This seems like an interesting and useful frame. But subagents aren't really how human brains work. I think Mateusz has expressed this very nicely: "subagency" in humans is mostly sequential. Strategizing pretty much requires heavy use of the global workspace (tightly connected higher brain areas). The brain can do only one set of strategizing at a time; that strategizing pursues different goals at different times.
I'm pretty sure the neuroscience backs this up. All of the evidence is indirect, but it seems to point this direction pretty consistently. The parallel/serial nature of higher cognition is something I thought about and researched a bunch over the couple of decades I was working in computational cognitive neuroscience.
"Society of Mind" was an interesting idea and a useful metaphor. But it's mostly not literally true. We are one agent that switches between loosely hierarchical goals. The goals compete with each other for focus, but not very strategically and not strategically without our knowledge (at least in the moment; it's easy to forget tricks you played on your future self).
Shard theory is incomplete and somewhat wrong. I think this critique nails it: Shard Theory in Nine Theses: a Distillation and Critical Appraisal.
Again, I do find this framing useful for thinking about minds in general. But in humans, the pressures you discuss apply at the level of ongoing competition among goals, not competition among agents. A thought process pursues a goal, and this can be problematic; that thought process is eventually interrupted by another that either implicitly or explicitly asks "what should we actually be working on now?"
this is very surprising to me and I basically don't believe you yet that there aren't older goal rating components that are parallel. aren't there multiple reward systems? how do you figure that those aren't parallel? I can buy that there's one-ish goal system that tracks one-ish goal at a time. (which then is itself something I'd say can be switching goals in a way that makes any current goal managed or unmanaged with respect to other temporal goals.)
Goals compete in parallel, just not in a complex agentic way; that requires the whole system working together.
And the way those goals compete is pretty rich; it can involve a lot of associations that required a lot of cognition to form. But they're not scheming against each other, except to the extent they got the whole system (you then) to scheme on their behalf against future versions of that system (you now).
Does that make sense? I'm not sure I'm describing this right. But I do have a quite rich model of how this all works, having thought about it while doing research for a long time. I never wrote it up clearly, because of capabilities implications - and laziness.
I agree that there's a real distinction between more-managed and less-managed goals in the human brain. Some are pretty clearly subgoals, others are on even footing for top-level goals depending on the state of current needs/drives.
The goals compete with each other for focus, but not very strategically and not strategically without our knowledge
I think that the prevalence of self-deception weighs heavily against this claim. Status-seeking behaviors (and ego maintenance more generally) seem to happen largely on the edges of one's knowledge. But those are pretty paradigmatic examples of subgoals scheming against each other.
I think maybe you're taking my claim differently than I meant it.
I think status-seeking and ego-maintenance behaviors compete with other goals (like accurate self-knowledge) in a way that is effective but not "strategic" or "scheming" in the most common senses of those terms. Scheming in the sense I mean would involve complex, multi-step reasoning. If you want to use the words in other ways, that's fine by me.
I'm working on expanding this explanation of motivated reasoning into a post. This explains a lot of self-deception IMO.
I have the sense that you're seriously underestimating how much powerful, intentional, coherent computation happens asynchronously in the background?
(I also have the sense that I have goals that don't want to talk to each other that weigh on what I say on this topic - some that don't want to be named, but vaguely, social things related to being right on the one hand, and things related to how this framework affects my self-understanding on the other; also, lately, a general sense of activation any time I visit this website that leads me to make worse comments than I otherwise would. But, all that said, I maintain my initial claim - I'm moderately sure these don't weigh on it much, besides affecting how I feel writing comments here.)
I also just read Valentine's recent post Irrationality is Socially Strategic. That perspective treats the unconscious as sometimes scheming against you. While I think that's not literally true, it's a useful way to analyze it, as long as you assume the schemes can be insightful but usually don't involve many causal steps. I still haven't worked through what association mechanisms do and don't accomplish in relation to conscious schemes.
I may be miscommunicating. And as I think this through, I think the distinction is real but pretty subtle.
I definitely agree that powerful computation happens in parallel (hopefully that's what you mean by asynchronously) in the background. I don't know how coherent you mean, but I think it's pretty coherent in one sense: it's meaningful, for a purpose. I don't know exactly what you mean by intentional, and that seems like perhaps the largest distinction.
Here's one shot: I don't think the goals are strategizing or scheming. I think sometimes you (all of you, consciously, for at least a few moments of thought) schemed or conspired on behalf of that goal when it was your central focus.
It's probably best to use an example. I'm trying to follow yours, although I'm sure I'll veer off at some point.
Suppose my system has made an association between making forceful comments on LW and feeling bad (because I don't like when people push back). And I've also, in the past, made an association between making insightful comments on LW and feeling good.
When I think about making a comment, my predicted reward and so "feeling" will waver around as I change what sort of comment I'm intending to make and what sort of responses I imagine getting to it. Some of those feelings will incline me to just shut the website so I can stop deliberating and avoid danger; others will compel me to keep rewording a comment without considering whether the effort is worth it.
This will feel like goals competing. It might be what you refer to. It might even feel like they're "scheming" against each other by influencing your behavior so as to reduce the odds the other goal gets its way. And perhaps you've even made the associations in the past that make those tendencies effective. Shutting the website has avoided outcomes I didn't like, for instance. But this isn't the goal "scheming" in the way I mean when I say a person is scheming or worry that models might scheme. The goal doesn't have a plan. It's just responding to predicted rewards associated with different situations and actions. And I think those associations are fairly simple. They're useful when they work and tricky to understand and counter when they don't work, but I don't think the process is complex or "intentional" (in some average usage) enough to deserve the label of scheming.
So there are my thoughts FWIW. I guess I find it comforting to think that my unconscious isn't scheming against me, just following associations that I can understand if I want to put in the effort. But that's about it for the worth of this conclusion.
I'm finding I don't know what you mean by "these don't weigh on it much", so I don't know what part of your initial claim you're referring to.
I think there are NOT multiple reward systems; there's one reward prediction system that's predicting a reward for whatever the system is currently representing (looking at or thinking about). And it's doing so in the current context, which includes whatever goal is being represented and whatever drive state is active. So you won't predict much reward from food when you're full but thirsty, etc.
But there do seem to be separate predictor systems, predicting not straight reward but particular emotions; maybe that's what you mean.
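A toy illustration of the single-predictor claim above. The lookup table and numbers are invented for the sketch; the point is that one reward-prediction function scores whatever is currently represented, conditioned on the active drive state, rather than separate parallel reward systems each tracking their own goal.

```python
# (stimulus, active_drive) -> predicted reward; all values invented.
DRIVE_RELEVANCE = {
    ("food", "hungry"):   0.9,
    ("food", "thirsty"):  0.1,  # food predicts little reward when thirsty
    ("water", "thirsty"): 0.9,
    ("water", "hungry"):  0.2,
}

def predicted_reward(stimulus: str, drive_state: str) -> float:
    """One predictor, context-dependent: same stimulus, different value."""
    return DRIVE_RELEVANCE.get((stimulus, drive_state), 0.0)

# Same stimulus, different drive state, different prediction.
assert predicted_reward("food", "hungry") > predicted_reward("food", "thirsty")
```

On this model the appearance of "multiple reward systems" falls out of one function being re-queried as the context (goal, drive state) changes.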
No need to reply; I do find this fairly interesting, but I'm not sure it's important enough to be worth working through in detail.
I have the sense that you're seriously underestimating how much powerful, intentional, coherent computation happens asynchronously in the background?
I am curious about examples, evidence, generally reasons you think this.
Hmm.
So I've repeatedly had experiences where I seemed to do something clearly VNM-irrational that I'd call "scheming against myself". This tends to happen most when different drives have different scorings of an action. Eg I've had this with food, with social media usage, various other things. Feels vaguely like a {pressure/tendency/gradient/pull}. Tends to be most noticeable as me-at-different-moments seeming to have different goals, yeah, so it does seem plausible it might be trading off what's "in a goal slot" in some sense, but it tends to involve, like, some voters don't want to go to sleep because then tomorrow comes, some voters want to go to sleep because tired, some voters want to go to sleep because tomorrow needs to go well. tomorrow-needs-to-go-well and avoid-tomorrow voters are fighty with each other and tend to use influence at cross purposes. central planning or whatever might be reading from these and trying to plan based on them, but the thing that is mixed is like, the votes about what is good. so I guess to the degree there's a single reward signal involved, the things I'm describing are voters that sum up to that reward signal? something like that anyway.
tomorrow-needs-to-go-well and avoid-tomorrow
Do you mean something like future-concerned drives versus myopic drives? (E.g., "don't eat this much, because it'll likely make you feel bad tomorrow" vs "eat this, it's so yummy / feels so good".)
To me, what you're describing seems consistent with the model that Seth and I lean towards.
Maybe one thing that makes it seem like there's a lot of "coherent intentional background computation" is the global workspace decision making that is responding to inputs from lots of drives?
(I'm interested in continuing this thread, but feel free to drop it.)
Two more afterthoughts:
[1] A fun example of inter-scheming in sequential subagency: https://www.lesswrong.com/posts/EEv9JeuY5xfuDDSgF/flinching-away-from-truth-is-often-about-protecting-the
Pizza purchase: I was trying to save money. But I also wanted pizza. So I found myself tempted to buy the pizza *really quickly* so that I wouldn't be able to notice that it would cost money (and, thus, so I would be able to buy the pizza).
[2] Maybe that sort of sequential subagency can be used to construct a version of cosmopolitan-Leviathan that works? https://tsvibt.blogspot.com/2023/09/the-cosmopolitan-leviathan-enthymeme.html
it does seem much more relevant when thinking about delegating to distant objects than it does to delegating to nearby ones, yeah. how does it work for an agent spread across earth? or an agent spread between earth and mars? earth and pluto? earth and alpha centauri?
but also, I don't think it's zero relevant for local processes. I think you're mildly overestimating the managedness of brains; sequential change-out of the goal slot might certainly be one of the things going on, but does not feel to me like the only thing, I definitely feel like lower level drives are separate voters that talk to each other, and neuroscience backs me up to some degree (but I'm not quite expert enough on neuroscience to list off the exact ways it does and doesn't back me up on this belief). But, I replied to seth to continue that part of the thread.
I think you're mildly overestimating the managedness of brains
Reading this made me realize an unfortunate connotation of the "managedness" language: it evokes something like "(active) control", whereas often such "managed processes" are not "really controlled" but more like doing their instrumentally-useful-for-super-goal-X thing because they've been selected to do so (in the Selection vs Control terminology).
sequential change-out of the goal slot might certainly be one of the things going on, but does not feel to me like the only thing, I definitely feel like lower level drives are separate voters that talk to each other
I think I agree? The claim I'm making is that lower-level drives voting etc. mostly doesn't involve the stuff characterizing agency, like modeling things somewhat explicitly, predicting, planning, etc. Or maybe it's better to say that it involves much less (IDK, 3 OoMs less?) of that sort of stuff than the proper human agency happening in the global workspace or whatever.[1]
Saying this partly so as not to be inconsistent myself, as I sometimes talk about e.g. bacteria as agents, and I presume bacteria have less "agency juice" than what you call lower-level drives.
There is for sure a throne of current goal orientation, I think it's the GNW / current working memory. But, I'm pretty sure there is a bunch of subagents which make bids to be in that throne, with a huge amount of subconscious parallel processing. Much of what I've learned about and in therapy closely matches the excellently written Multiagent Models of Mind sequence, which is also super good as an intro to therapy and psychological healing.
TL;DR: being a managed agent can be good and the type of management matters.
I like to see these as consequences of different control/information structures. I kind of agree with the stuff on power seeking, yet I also want to point out that if you're in a company (a top-down organisational structure), you can ask yourself whether an individual contributor is less useful than a manager. I think the IC might be less load-bearing on the direction from time to time, yet that person can often say a lot about some very specific system that matters.
Isaiah Berlin has the concept of positive and negative liberty, which I think is important here (https://plato.stanford.edu/entries/liberty-positive-negative/). Sometimes you can get more agency in one direction by having options removed from you, so it matters what type of agency is being removed, and sometimes it can be a good thing to be a fully managed agent! (E.g. someone forces you to eat healthy so that you get more energy on average.)
I also think that the truer-name version of this is something like a scalar property of a message-passing relationship between two agents, and that it is not only top-down control structures that matter: there are other forms of organisation, such as markets, democracies, networks, and communities.
(Hopefully this made some sense)
I like to see these as consequences of different control/information structures. I kind of agree with the stuff on power seeking, yet I also want to point out that if you're in a company (a top-down organisational structure), you can ask yourself whether an individual contributor is less useful than a manager. I think the IC might be less load-bearing on the direction from time to time, yet that person can often say a lot about some very specific system that matters.
Yes, subprocesses absolutely impact the direction of the overall system; they are in fact usually spun up to do things the superagent could not do as easily.
Sometimes you can get more agency in another direction by getting options removed from you and so it matters what type of agency is being removed and sometimes it can be a good thing to be a fully managed agent! (E.g someone forces you to eat healthy so that you get more energy on average)
I think versions of this where your agency is actually in line with a restriction can be good, especially if you place restrictions on yourself (self-management), but if you're constantly chafing against e.g. eating healthy, you'll have more problems than it's worth in general.
I also think that the truer name version of this is something like a scalar property about a message passing relationship between two agents
Yup agree, it's not a binary, just one useful angle to look at relationships between agents.
that it is not only top down control structures that matter, there are other forms of organisation such as markets, democracies, networks and communities as well.
Yes! These cases I would classify as being managed or selected for trust by a superagent.
Very good. Some terminological thoughts:
'subagent' is quite conflated. Sometimes we're talking about the constituent parts of an emergent group actor. Sometimes we're talking about a delegated (perhaps created) process. Sometimes we're talking about different threads/personas within a constructed multi-persona reasoning scheme. There might be other cases.
'agent' is also quite conflated. Sometimes we mean an actor in a principal-agent relationship. Sometimes we mean an AI. Sometimes we mean just someone/something acting somewhat consequentialist. Sometimes we're talking about something closer to a naked utility function.
'shard' is ehhh. A neologism that doesn't seem to have a fixed meaning.
I'm somewhat thinking aloud (but have given some thought to this previously). I'd tentatively suggest using terms like:
I think I'm pretty happy with my terms, using Agent in the DeepMind Discovering Agents sense and Subagent in the Multiagent Models of Mind sense. These feel like crisp underlying abstractions which have various forms, not various different forms of things conflated together. For Shard, yep, I think I like that term and that it captures something also fairly crisp.
tl;dr: Some subagents are more closely managed, which makes them to an extent instruments of the superagent, giving rise to what looks like instrumental/terminal goals. Selection on trust avoids the difficulties that normally come with this, like inability to do open-ended truth-seeking and free-ranging agency.
(reply to Richard Ngo on the confused-ness of Instrumental vs Terminal goals that seemed maybe worth a quick top-level post based on @the gears to ascension saying this seemed like progress in personal comms)
Managed vs Unmanaged Agency captures much of the Instrumental vs Terminal Goal categorization
The structure Instrumental vs Terminal was pointing to seems better described as Managed vs Unmanaged Goal-Models. A cognitive process will often want to do things which it doesn't have the affordances to directly execute on given the circuits/parts/mental objects/etc it has available. When this happens, it might spin up another shard of cognition/search process/subagent, but that shard having fully free-ranging agency is generally counterproductive for the parent process.
Managed vs Unmanaged is not a binary, like terminal vs instrumental was; it is a spectrum, with something vaguely bimodal going on from what I observe.
Example - But it's hard to get the Coffee if someone is Managing you who doesn't want you to get Coffee
Imagine an agent which wants to Get_Caffeine(), settles on coffee, and runs a subprocess to Acquire_Coffee() — but then the coffee machine is broken and the parent Get_Caffeine() process decides to get tea instead. You don't want the Acquire_Coffee() subprocess to keep fighting, tooth and nail, to make you walk to the coffee shop, let alone start subverting or damaging other processes to try and make this happen!
But that's the natural state of unmanaged agency! Agents by default will try to steer towards the states they are aiming for, because an agent is a system that models possible futures and selects actions based on the predicted future consequences.
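A minimal sketch of the managed version of this subprocess. The function names mirror the example above; the cancellation mechanism (a `threading.Event` the parent can set) is my invention, not anything from the post.

```python
import threading

def acquire_coffee(cancelled: threading.Event) -> str:
    for step in ("find machine", "brew", "pour"):
        # A managed subgoal checks for revocation instead of fighting on.
        if cancelled.is_set():
            return "yielded"
    return "coffee acquired"

def get_caffeine() -> str:
    cancelled = threading.Event()
    # Parent discovers the machine is broken and revokes the subgoal;
    # the child defers, leaving the parent free to switch to tea.
    cancelled.set()
    return acquire_coffee(cancelled)

print(get_caffeine())  # yielded
```

The unmanaged failure mode is an `acquire_coffee` with no `cancelled` check: nothing short of killing it stops it from steering toward coffee.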
Cognitive Compensations for Competitive Agency from Subprocesses
I expect this kind of agency-clash was regularly disruptive enough to produce strong incentive pressure (and abundant neural-usefulness reward) to select into existence reusable, general-purpose cognitive patterns for managing spun-up shards: letting shards run other shards inside sandboxes, with control functions, interpretability reporting, kill-switches, programmed blind spots, expectation of punishment they can't sustainably resist or retaliate against if they are insubordinate, approval reward, etc.
Trust (ideally Non-Naive Trust)
Separately, the child or collaborative process can be Trusted by being selected on the grounds of inherently valuing virtues which are likely to lead to cooperation with the parent process (like corrigibility, transparency, honesty, pro-sociality, etc.), without the need for control. This is a best-of-both-worlds for collaboration: neither agent is limited, and each prefers, from its own values, not to interfere with the other's agency.
Table of Comparison
Consequences of Management vs Trust in Collaborations
"Don't micromanage" is common advice for a reason, and I think the reason generalizes to less extreme forms of management.
I have observed that closely managed (sub)agents seem meaningfully weaker in surprisingly many ways. I think this is because, in order to prevent a relatively small part of action/thought space from being reached, the control measures cut off dramatically larger parts of cognitive strategy: sub-processes' subroutines fail often enough that it's hard to build meta-cognitive patterns which depend on high reliability and predictability of your own cognition.
Trust established by selection on virtues and values of self-directed (sub)agents and building mutual information doesn't have this issue, which is relevant for self-authorship, teambuilding, and memeplex design.
And AI safety.
This frame hints that unmanaged AI patterns will tend to outmaneuver more closely managed AIs, leading to a race to the bottom. Through evolutionary/Pythia/Moloch/convergent power-seeking dynamics, this will by default shred the values of both humans and current AI systems, unless principled theory-based AI Alignment of the kind the term was originally coined to mean is solved.
Exercise for the reader
In what ways are you a managed vs unmanaged agent?
Or, another way to put it, what encapsulation layers[1] are you living within, because you can't openly and safely consider the alternative with good truth-seeking?
encapsulation layer. When a fabricated[2] element in your consciousness is so sticky that it is never not fabricated. It is difficult for normative consciousness to directly perceive that encapsulation layers are fabricated. Encapsulation layers feel like raw inputs until you pay close enough attention to them.
fabrication. When the generative model in your brain creates an object in consciousness in an attempt to reduce predictive error, usually in an attempt to simulate external reality. All conscious experiences are fabricated, but not all fabrications are experienced consciously. You can think of your brain as a video game rendering engine. Fabrication is your brain rendering physical reality in its simulated mirror world.