In my previous posts, I have been building up a model of mind as a collection of subagents with different goals, and no straightforward hierarchy. This then raises the question of how that collection of subagents can exhibit coherent behavior: after all, many ways of aggregating the preferences of a number of agents fail to create consistent preference orderings.

We can roughly describe coherence as the property that, if you become aware of a better strategy for achieving your goals than the one you are currently executing, you will switch to that better strategy. If an agent is not coherent in this way, bad things are likely to happen to it.

Now, we all know that humans sometimes exhibit incoherent behavior. But on the whole, people still do okay: the median person in a developed country still manages to survive until their body starts giving up on them, and typically also manages to have and raise some number of initially-helpless children until they are old enough to take care of themselves.

For a subagent theory of mind, we would like to have some explanation of when exactly the subagents manage to be collectively coherent (that is, change their behavior to some better one), and what are the situations in which they fail to do so. The conclusion of this post will be:

We are capable of changing our behaviors on occasions when the mind-system as a whole puts sufficiently high probability on the new behavior being better, when the new behavior is not being blocked by a particular highly weighted subagent (such as an IFS-style protector) that puts high probability on it being bad, and when we have enough slack in our lives for any new behaviors to be evaluated in the first place. Akrasia is subagent disagreement about what to do.

Correcting your behavior as a default

There are many situations in which we exhibit incoherent behavior simply because we’re not aware of it. For instance, suppose that I do my daily chores in a particular order, when doing them in some other order would save time. If you point this out to me, I’m likely to just say “oh”, and then adopt the better system.

Similarly, several of the experiments which get people to exhibit incoherent behavior rely on showing different groups of people different formulations of the same question, and then demonstrating that the different framings elicit different answers. This doesn’t work quite as well if you show the different formulations to the same people, because many of them will then realize that giving differing answers would be inconsistent.

But there are also situations in which someone realizes that they are behaving in a nonsensical way, yet will continue behaving in that way. Since people usually can change suboptimal behaviors, we need an explanation for why they sometimes can’t.

Towers of protectors as a method for coherence

In my post about Internal Family Systems, I discussed a model of mind composed of several different kinds of subagents. One of them, the default planning subagent, is a module that simply tries to find the best thing to do and then execute it. Protector subagents, on the other hand, exist to prevent the system from getting into situations which were catastrophic before. If they think that the default planning subagent is doing something which seems dangerous, they will override it and do something else instead. (Previous versions of the IFS post called the default planning subagent a “reinforcement learning subagent”, but this was potentially misleading, since several other subagents were reinforcement learners too, so I’ve changed the name.)

Thus, your behavior can still be coherent even if you feel that you are failing to act coherently. You simply don’t realize that a protector is carrying out a routine intended to avoid dangerous outcomes - and this might actually be a very successful way of keeping you out of danger. Some subagents in your mind think that doing X would be a superior strategy, but the protector thinks that it would be a horrible idea. From the point of view of the system as a whole, then, doing X is not a better strategy, and not switching to it is actually the better choice.

On the other hand, it may also be the case that the protector’s behavior, while keeping you out of situations which the protector considers unacceptable, is causing other outcomes which are also unacceptable. The default planning subagent may realize this - but as already established, any protector can overrule it, so this doesn’t help.

Evolution’s answer here seems to be spaghetti towers. The default planning subagent might eventually figure out the better strategy, which avoids both the thing that the protector is trying to block and the new bad outcome. But it could be dangerous to wait that long, especially since the default planning agent doesn't have direct access to the protector's goals. So for the same reasons why a separate protector subagent was created to avoid the first catastrophe, the mind will create or recruit a protector to avoid the second catastrophe - the one that the first protector keeps causing.

With permission, I’ll borrow the illustrations from eukaryote’s spaghetti tower post.

Example: Eric grows up in an environment where he learns that disagreeing with other people is unsafe, and that he should always agree to do things that other people ask of him. So Eric develops a protector subagent running a pleasing, submissive behavior.

Unfortunately, while this tactic worked in Eric’s childhood home, once he becomes an adult he starts saying “yes” to too many things, without leaving any time for his own needs. But saying “no” to anything still feels unsafe, so he can’t just stop saying “yes”. Instead he develops a protector which tries to keep him out of situations where people would ask him to do anything. This way, he doesn’t need to say “no”, and also won’t get overwhelmed by all the things that he has promised to do. The two protectors together form a composite strategy.

While this helps, it still doesn’t entirely solve the issue. After all, there are plenty of reasons that might push Eric into situations where someone would ask something of him. He still ends up agreeing to do lots of things, to the point of neglecting his own needs. Eventually, his brain creates another protector subagent. This one causes exhaustion and depression, so that he now has a socially-acceptable reason for being unable to do all the things that he has promised to do. He continues saying “yes” to things, but also keeps apologizing for being unable to do things that he (honestly) intended to do as promised, and eventually people realize that you probably shouldn’t ask him to do anything that’s really important to get done.

And while this kind of process of stacking protector on top of protector is not perfect, for most people it mostly works out okay. Almost everyone ends up with their own unique set of minor neuroses and situations where they don’t quite behave rationally, but as they learn to understand themselves better, their default planning subagent gets better at working around those issues. This may also let the various protectors relax a bit, since the threats they guard against are now generally avoided without their intervention.

Gradually, as the negative consequences of different behaviors become apparent, behavior gets adjusted - either by the default planning subagent or by spawning more protectors - and remains coherent overall.

But sometimes, especially for people in highly stressful environments where almost any mistake may get them punished, or when they end up in an environment that their old tower of protectors is no longer well-suited for (distributional shift), things don’t go as well. In that situation, their minds may end up looking like this: a hopelessly tangled web, with almost no flexibility. Something happens in their environment, which sets off one protector, which sets off another, which sets off another - leaving no room for flexibility or rational planning, and forcing them to act in a way that is almost bound to make matters worse.

This kind of outcome is obviously bad. So besides building spaghetti towers, the second strategy which the mind has evolved to employ for keeping its behavior coherent while piling up protectors, is the ability to re-process memories of past painful events.

As I discussed in my original IFS post, the mind has methods for bringing up the original memories which caused a protector to emerge, in order to re-analyze them. If ending up in some situation is actually no longer catastrophic (for instance, you are no longer in your childhood home where you get punished simply for not wanting to do something), then the protectors which were focused on avoiding that outcome can relax and take a less extreme role.

To make this possible, there seems to be a built-in tension. Exiles (the IFS term for subagents containing memories of past trauma) “want” to be healed, and will do things like occasionally sending painful memories or feelings into consciousness so as to become the center of attention, especially if something about the current situation resembles the past trauma. This also acts as what my IFS post called a fear model - something that warns of situations which resemble the past trauma enough to be considered dangerous in their own right. At the same time, protectors “want” to keep the exiles hidden and inactive, doing anything they can to keep them so. Various schools of therapy - IFS among them - seek to tap into this existing tension so as to reveal the trauma, trace it back to its original source, and heal it.

Coherence and conditioned responses

Besides the presence of protectors, another reason why we might fail to change our behavior is strongly conditioned habits. Most human behavior involves automatic habits: behavioral routines which are triggered by some sort of cue in the environment, and which lead to or have once led to a reward. (Previous discussion; see also.)

The problem is that people might end up with habits that they wouldn’t want to have. For instance, I might develop a habit of checking social media on my phone when I’m bored, creating a loop of boredom (cue) -> looking at social media (behavior) -> seeing something interesting on social media (reward).

Reflecting on this behavior, I notice that back when I didn’t do it, my mind was more free to wander when I was bored, generating motivation and ideas. I think that my old behavior was more valuable than my new one. But even so, my new behavior still delivers enough momentary satisfaction to keep reinforcing the habit.

Subjectively, this feels like an increasing compulsion to check my phone, which I try to resist since I know that long-term it would be a better idea to not be checking my phone all the time. But as the compulsion keeps growing stronger and stronger, eventually I give up and look at the phone anyway.

The exact neuroscience of what is happening at such a moment remains only partially understood (Simpson & Balsam 2016). However, we know that whenever different subsystems in the brain produce conflicting motor commands, that conflict needs to be resolved, with only one at a time being granted access to the “final common motor path”. This is thought to happen in the basal ganglia, a part of the brain closely involved in action selection and connected to the global neuronal workspace.

One model (e.g. Redgrave 2007, McHaffie 2005) is that the basal ganglia receives inputs from many different brain systems; each of those systems can send different “bids” supporting or opposing a specific course of action to the basal ganglia. A bid submitted by one subsystem may, through looped connections going back from the basal ganglia, inhibit other subsystems, until one of the proposed actions becomes sufficiently dominant to be taken.

The above image from Redgrave 2007 gives a conceptual picture of the model, with two example subsystems shown. Suppose that you are eating at a restaurant in Jurassic Park when two velociraptors charge in through the window. Previously, your hunger system was submitting successful bids for the “let’s keep eating” action, which caused inhibitory impulses to be sent to the threat system. This inhibition prevented the threat system from making bids for silly things like jumping up from the table and running away in a panic. However, as your brain registers the new situation, the threat system becomes much more strongly activated, sending a strong bid for the “let’s run away” action. As the basal ganglia receives that bid, an inhibitory impulse is routed to the subsystem which was previously submitting bids for the “let’s keep eating” action. This makes the threat system’s bids even stronger relative to the (inhibited) eating system’s bids.

Soon the basal ganglia, which was previously inhibiting the threat subsystem’s access to the motor system while allowing the eating system access, withdraws that inhibition and starts inhibiting the eating system’s access instead. The result is that you jump up from your chair and begin to run away. Unfortunately, this is hopeless since the velociraptor is faster than you. A few moments later, the velociraptor’s basal ganglia gives the raptor’s “eating” subsystem access to the raptor’s motor system, letting it happily munch down its latest meal.
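The bidding dynamic of the restaurant scene can be sketched as a toy simulation. The subsystem names, bid strengths, and inhibition constant below are all illustrative assumptions on my part, not part of the Redgrave et al. model; the only point is that once one subsystem’s bid dominates, inhibition of the competing bids makes its dominance self-reinforcing.

```python
def select_action(bids, inhibition=0.5, rounds=20):
    """Toy action selection: the strongest bidder repeatedly inhibits
    all competing bids until its action is the clear winner.
    `bids` maps subsystem name -> initial bid strength."""
    strengths = dict(bids)
    for _ in range(rounds):
        winner = max(strengths, key=strengths.get)
        # Looped connections back from the basal ganglia suppress
        # every subsystem except the currently dominant one.
        for name in strengths:
            if name != winner:
                strengths[name] *= (1 - inhibition)
    return max(strengths, key=strengths.get)

# While the threat system is quiet, eating keeps winning; once the
# velociraptors arrive, its strongly activated bid takes over.
print(select_action({"keep_eating": 0.8, "run_away": 0.3}))  # keep_eating
print(select_action({"keep_eating": 0.8, "run_away": 2.5}))  # run_away
```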

But let’s leave the velociraptors behind and go back to our original example with the phone. Suppose that you have been trying to replace the habit of looking at your phone when bored with a new one: smiling, directing your attention to pleasant sensations in your body, and then letting your mind wander.

Until the new habit establishes itself, the two habits will compete for control. Frequently, the old habit will be stronger, and you will just automatically check your phone without even remembering that you were supposed to do something different. For this reason, behavioral change programs may first spend several weeks just practicing noticing the situations in which you engage in the old habit. When you do notice what you are about to do, then more goal-directed subsystems may send bids towards the “smile and look for nice sensations” action. If this happens and you pay attention to your experience, you may notice that long-term it actually feels more pleasant than looking at the phone, reinforcing the new habit until it becomes prevalent.

To put this in terms of the subagent model, we might drastically simplify things by saying that the neural pattern corresponding to the old habit is a subagent reacting to a specific sensation (boredom) in the consciousness workspace: its reaction is to generate an intention to look at the phone. At first, you might train the subagent responsible for monitoring the contents of your consciousness to output moments of introspective awareness highlighting when that intention appears. That introspective awareness helps alert a goal-directed subagent to trigger the new habit instead. Gradually, a neural circuit corresponding to the new habit gets trained up, which starts sending its own bids when it detects boredom. Over time, reinforcement learning in the basal ganglia gives that subagent’s bids more weight relative to the old habit’s, until it no longer needs the goal-directed subagent’s support in order to win.
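As a sketch of how that might play out quantitatively, here is a hedged toy model: two habit “subagents” with bid weights, a fixed amount of goal-directed support for the new one, and a simple reinforcement update toward each habit’s actual reward. All of the numbers (initial weights, rewards, learning rate) are made-up illustrative assumptions.

```python
def run_trials(n, lr=0.2, support=0.5, r_old=0.4, r_new=0.7):
    """Toy habit competition on the cue 'boredom'. Returns the final
    bid weights of the old (phone) and new (smile) habits."""
    w_old, w_new = 1.0, 0.1   # the old habit starts out much stronger
    for _ in range(n):
        # Introspective awareness lets a goal-directed subagent add
        # support to the new habit's bid on each trial.
        if w_new + support > w_old:
            w_new += lr * (r_new - w_new)   # new habit chosen, rewarded
        else:
            w_old += lr * (r_old - w_old)   # old habit chosen, rewarded
    return w_old, w_new
```

At first the old habit wins every trial, but since its actual reward is lower than its inflated weight, the weight decays; once goal-directed support can tip the balance, the new habit starts accumulating reinforcement of its own, until it wins even without support.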

This model also helps incorporate things like the role of a vivid emotional motivation, a sense of hope, or psyching yourself up when trying to change a habit. Doing things like imagining an outcome that you wish the habit to lead to may activate additional subsystems which care about those kinds of outcomes, causing them to submit additional bids in favor of the new habit. The extent to which you succeed depends on the extent to which your mind-system considers it plausible that the new habit leads to the desired outcome. For instance, if you imagine your exercise habit making you strong and healthy, then subagents which care about strength and health might activate to the extent that you believe this to be a likely outcome, sending bids in favor of the exercise action.

On this view, one way for the mind to maintain coherence and readjust its behaviors is its ability to re-evaluate old habits in light of which subsystems get activated when reflecting on the possible consequences of new habits. An old habit having been strongly reinforced reflects a great deal of accumulated evidence in favor of it being beneficial, but the behavior can still be overridden if enough influential subsystems weigh in with their evaluation that a new behavior would be more beneficial in expectation.

Some subsystems having concerns (e.g. immediate survival) which are ranked more highly than others (e.g. creative exploration) means that the decision-making process ends up carrying out an implicit expected utility calculation. The strengths of bids submitted by different subsystems do not just reflect the probability that those subsystems put on an action being the most beneficial. Different mechanisms also give the bids from different subsystems varying amounts of weight, depending on how important the concerns represented by a subsystem happen to be in that situation. This ends up doing something like weighting the probabilities by utility, with the utility assignments having been shaped by evolution and culture so as to maximize genetic fitness on average. Protectors, of course, are subsystems whose bids are weighted particularly strongly, since the system puts high utility on avoiding the kinds of outcomes they are trying to avoid.
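To illustrate, here is a small hedged sketch of such an implicit expected utility calculation; the subsystem names, probabilities, and weights are invented for the example.

```python
def winning_action(estimates, weights):
    """Each subsystem bids on actions with its probability estimate of
    the action being best; bids are scaled by that subsystem's
    situational weight, and the action with the most support wins.
    estimates: {subsystem: {action: probability}}
    weights:   {subsystem: importance of its concerns right now}"""
    totals = {}
    for subsystem, probs in estimates.items():
        for action, p in probs.items():
            totals[action] = totals.get(action, 0.0) + weights[subsystem] * p
    return max(totals, key=totals.get)

estimates = {
    "explorer":  {"try_new_thing": 0.9, "play_it_safe": 0.1},
    "protector": {"try_new_thing": 0.2, "play_it_safe": 0.8},
}
# With equal weights the explorer's confidence wins...
print(winning_action(estimates, {"explorer": 1.0, "protector": 1.0}))  # try_new_thing
# ...but a heavily weighted protector overrides it, even though its
# probability estimates are unchanged.
print(winning_action(estimates, {"explorer": 1.0, "protector": 5.0}))  # play_it_safe
```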

The original question which motivated this section was: why are we sometimes incapable of adopting a new habit or abandoning an old one, despite knowing that to be a good idea? And the answer is: because we don’t know that such a change would be a good idea. Rather, some subsystems think that it would be a good idea, but other subsystems remain unconvinced. Thus the system’s overall judgment is that the old behavior should be maintained.

Interlude: Minsky on mutually bidding subagents

I was trying to concentrate on a certain problem but was getting bored and sleepy. Then I imagined that one of my competitors, Professor Challenger, was about to solve the same problem. An angry wish to frustrate Challenger then kept me working on the problem for a while. The strange thing was, this problem was not of the sort that ever interested Challenger.
What makes us use such roundabout techniques to influence ourselves? Why be so indirect, inventing misrepresentations, fantasies, and outright lies? Why can't we simply tell ourselves to do the things we want to do? [...]
Apparently, what happened was that my agency for Work exploited Anger to stop Sleep. But why should Work use such a devious trick?
To see why we have to be so indirect, consider some alternatives. If Work could simply turn off Sleep, we'd quickly wear our bodies out. If Work could simply switch Anger on, we'd be fighting all the time. Directness is too dangerous. We'd die.
Extinction would be swift for a species that could simply switch off hunger or pain. Instead, there must be checks and balances. We'd never get through one full day if any agency could seize and hold control over all the rest. This must be why our agencies, in order to exploit each other's skills, have to discover such roundabout pathways. All direct connections must have been removed in the course of our evolution.
This must be one reason why we use fantasies: to provide the missing paths. You may not be able to make yourself angry simply by deciding to be angry, but you can still imagine objects or situations that make you angry. In the scenario about Professor Challenger, my agency Work exploited a particular memory to arouse my Anger's tendency to counter Sleep. This is typical of the tricks we use for self-control.
Most of our self-control methods proceed unconsciously, but we sometimes resort to conscious schemes in which we offer rewards to ourselves: "If I can get this project done, I'll have more time for other things." However, it is not such a simple thing to be able to bribe yourself. To do it successfully, you have to discover which mental incentives will actually work on yourself. This means that you - or rather, your agencies - have to learn something about one another's dispositions. In this respect the schemes we use to influence ourselves don't seem to differ much from those we use to exploit other people - and, similarly, they often fail. When we try to induce ourselves to work by offering ourselves rewards, we don't always keep our bargains; we then proceed to raise the price or even deceive ourselves, much as one person may try to conceal an unattractive bargain from another person.
Human self-control is no simple skill, but an ever-growing world of expertise that reaches into everything we do. Why is it that, in the end, so few of our self-incentive tricks work well? Because, as we have seen, directness is too dangerous. If self-control were easy to obtain, we'd end up accomplishing nothing at all.

-- Marvin Minsky, The Society of Mind

Akrasia is subagent disagreement

You might feel that the above discussion still doesn’t entirely resolve the original question. After all, sometimes we do manage to change even strongly conditioned habits pretty quickly. Why is change sometimes hard and sometimes easy?

Redgrave et al. (2010) discuss two modes of behavioral control: goal-directed versus habitual. Goal-directed control is a relatively slow mode of decision-making, where “action selection is determined primarily by the relative utility of predicted outcomes”, whereas habitual control involves more directly conditioned stimulus-response behavior. Which kind of subsystem is in control is complicated, and depends on a variety of factors (the following quote has been edited to remove footnotes to references; see the original for those):

Experimentally, several factors have been shown to determine whether the agent (animal or human) operates in goal-directed or habitual mode. The first is over-training: here, initial control is largely goal-directed, but with consistent and repeated training there is a gradual shift to stimulus–response, habitual control. Once habits are established, habitual responding tends to dominate, especially in stressful situations in which quick reactions are required. The second related factor is task predictability: in the example of driving, talking on a mobile phone is fine so long as everything proceeds predictably. However, if something unexpected occurs, such as someone stepping out into the road, there is an immediate switch from habitual to goal-directed control. Making this switch takes time and this is one of the reasons why several countries have banned the use of mobile phones while driving. The third factor is the type of reinforcement schedule: here, fixed-ratio schedules promote goal-directed control as the outcome is contingent on responding (for example, a food pellet is delivered after every n responses). By contrast, interval schedules (for example, schedules in which the first response following a specified period is rewarded) facilitate habitual responding because contingencies between action and outcome are variable. Finally, stress, often in the form of urgency, has a powerful influence over which mode of control is used. The fast, low computational requirements of stimulus–response processing ensure that habitual control predominates when circumstances demand rapid reactions (for example, pulling the wrong way in an emergency when driving on the opposite side of the road). Chronic stress also favours stimulus–response, habitual control. For example, rats exposed to chronic stress become, in terms of their behavioural responses, insensitive to changes in outcome value and resistant to changes in action–outcome contingency. [...]
Although these factors can be seen as promoting one form of instrumental control over the other, real-world tasks often have multiple components that must be performed simultaneously or in rapid sequences. Taking again the example of driving, a driver is required to continue steering while changing gear or braking. During the first few driving lessons, when steering is not yet under automatic stimulus–response control, things can go horribly awry when the new driver attempts to change gears. By contrast, an experienced (that is, ‘over-trained’) driver can steer, brake and change gear automatically, while holding a conversation, with only fleeting contributions from the goal-directed control system. This suggests that many skills can be deconstructed into sequenced combinations of both goal-directed and habitual control working in concert. [...]
Nevertheless, a fundamental problem remains: at any point in time, which mode should be allowed to control which component of a task? Daw et al. have used a computational approach to address this problem. Their analysis was based on the recognition that goal-directed responding is flexible but slow and carries comparatively high computational costs as opposed to the fast but inflexible habitual mode. They proposed a model in which the relative uncertainty of predictions made by each control system is tracked. In any situation, the control system with the most accurate predictions comes to direct behavioural output.

Note those last sentences: besides the subsystems making their own predictions, there might also be a meta-learning system keeping track of which other subsystems tend to make the most accurate predictions in each situation, giving extra weight to the bids of the subsystem which has tended to perform the best in that situation. We’ll come back to that in future posts.
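A minimal sketch of that arbitration idea, with made-up numbers and an exponential-average update that I am assuming purely for illustration: each controller’s recent squared prediction error is tracked, and whichever controller is currently most accurate gets control.

```python
class Arbiter:
    """Toy version of uncertainty-based arbitration between control
    systems (after the Daw et al. proposal quoted above)."""

    def __init__(self):
        # Running estimates of each controller's squared prediction error.
        self.uncertainty = {"habitual": 1.0, "goal_directed": 1.0}

    def observe(self, controller, prediction, outcome, lr=0.3):
        error = (prediction - outcome) ** 2
        u = self.uncertainty[controller]
        self.uncertainty[controller] = u + lr * (error - u)

    def in_charge(self):
        # The controller with the most accurate recent predictions wins.
        return min(self.uncertainty, key=self.uncertainty.get)

# In a stable, over-trained situation the habitual controller predicts
# outcomes well, so it ends up directing behavior.
arb = Arbiter()
for _ in range(10):
    arb.observe("habitual", prediction=1.0, outcome=1.0)
    arb.observe("goal_directed", prediction=1.0, outcome=0.6)
print(arb.in_charge())  # habitual
```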

This seems compatible with my experience, in that I feel it’s possible for me to change even entrenched habits relatively quickly - assuming that the new habit really is unambiguously better. In that case, while I might forget and lapse into the old habit a few times, there’s still a rapid feedback loop which quickly indicates that the goal-directed system is simply right about the new habit being better.

Or, the behavior in question might be sufficiently complex and I might be sufficiently inexperienced at it, that the goal-directed (default planning) subagent has always mostly remained in control of it. In that case change is again easy, since there is no strong habitual pattern to override.

In contrast, in cases where it’s hard to establish a new behavior, there tends to be some kind of genuine uncertainty:

  • The benefits of the old behavior have been validated in the form of direct experience (e.g. unhealthy food that tastes good, has in fact tasted good each time), whereas the benefits of the new behavior come from a less trusted information source which is harder to validate (e.g. I’ve read scientific studies about the long-term health risks of this food).
  • Immediate vs. long-term rewards: the more remote the rewards, the larger the risk that they will for some reason never materialize.
  • High vs. low variance: sometimes when I’m bored, looking at my phone produces genuinely better results than letting my thoughts wander. E.g. I might see an interesting article or discussion, which gives me novel ideas or insights that I would not otherwise have had. Looking at my phone usually produces worse results than not looking at it - but sometimes it produces much better ones than the alternative.
  • Situational variables affecting the value of the behaviors: looking at my phone can be a way to escape uncomfortable thoughts or sensations, for which purpose it’s often excellent. This then also tends to reinforce the behavior of looking at the phone when I’m in the same situation otherwise, but without uncomfortable sensations that I’d like to escape.

When there is significant uncertainty, the brain seems to fall back to those responses which have worked the best in the past - which seems like a reasonable approach, given that intelligence involves hitting tiny targets in a huge search space, so most novel responses are likely to be wrong.

As the above excerpt noted, the tendency to fall back to old habits is exacerbated during times of stress. The authors attribute it to the need to act quickly in stressful situations, which seems correct - but I would also emphasize the fact that negative emotions in general tend to be signs of something being wrong. E.g. Eldar et al. (2016) note that positive or negative moods tend to be related to whether things are going better or worse than expected, and suggest that mood is a computational representation of momentum, acting as a sort of global update to our reward expectations.

For instance, if an animal finds more fruit than it had been expecting, that may indicate that spring is coming. A shift to a good mood and being “irrationally optimistic” about finding fruit even in places where the animal hasn’t seen fruit in a while may actually serve as a rational pre-emptive update to its expectations. In a similar way, things going less well than expected may be a sign of some more general problem, calling for fewer exploratory behaviors and less risk-taking - that is, a fall back to behaviors which have a higher certainty of working out.
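As a rough sketch of the momentum idea (the parameters and exact update rule are my own illustrative assumptions, not taken from Eldar et al.): mood tracks recent reward prediction errors, and acts as a global tilt on reward expectations everywhere, not just at the site of the surprise.

```python
def experience(expectations, site, reward, mood, lr=0.3, mood_lr=0.2):
    """One learning step: a local update for the observed site, plus a
    mood-driven global adjustment applied to every site."""
    surprise = reward - expectations[site]           # prediction error
    mood = mood + mood_lr * (surprise - mood)        # mood as momentum
    updated = {s: e + lr * mood for s, e in expectations.items()}
    updated[site] += lr * surprise                   # ordinary local update
    return updated, mood

# Finding unexpectedly much fruit at one tree lifts mood, which
# optimistically bumps expectations even for trees not visited.
expectations, mood = experience({"tree_a": 0.2, "tree_b": 0.2},
                                "tree_a", reward=1.0, mood=0.0)
print(mood > 0, expectations["tree_b"] > 0.2)  # True True
```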

So to repeat the summary that I had in the beginning: we are capable of changing our behaviors on occasions when the mind-system as a whole puts sufficiently high probability on the new behavior being better, when the new behavior is not being blocked by a particular highly weighted subagent (such as an IFS protector whose bids get a lot of weight) that puts high probability on it being bad, and when we have enough slack in our lives for any new behaviors to be evaluated in the first place. Akrasia is subagent disagreement about what to do.


I have earlier stated that "to understand, study edge cases", and the model of subagents in a single brain would benefit from studying just such an edge case, namely Dissociative Identity Disorder, formerly known as Multiple Personality Disorder, where the subagents are plainly visible and their interaction and inter-communication is largely broken, making some of your suppositions and conjectures easy to study and test. There are many sites devoted to this largely misunderstood disorder, and its estimated prevalence in the general population is somewhere around 1-3%, so, odds are, you know someone with it personally without realizing it. One good introduction to the topic is the documentary Many Sides of Jane, which may give you some very basic understanding of how subagents might function (and dysfunction) in a mind. Akrasia, fights for control, mutual sabotage, and various subagent roles and behaviors are covered in the documentary in an accessible way, and could serve as much-needed feedback for your ideas.

Thanks! I've been intending to first work out a preliminary version of the intuitive model that I've got in my head in sufficient detail to know exactly what claims I'm even making (these posts), and then delve into various other sources once I've finished writing down my initial rough sketch. (As I've found that trying to read too broadly about a research question before I've got a mental skeleton to "hang the content on" just causes me to forget most of the stuff that would actually have been relevant.) I'll add your recommendations to the list of things to look at.

I haven't gotten around to watching that particular documentary, but I now briefly discuss DID (as well as quote what you said about subagents and trauma elsewhere) in subagents, trauma, and rationality.


I continue to appreciate Kaj's writeups of this paradigm. As I mentioned in a previous curation notice, the "integrating subagents" paradigm has organically gained some traction in the rationalsphere but hasn't been explicitly written up in a way that lets people build off or critique it in detail.

Particular things I liked

  • The use of the spaghetti tower diagrams to illustrate what may be going on.
  • The Minsky interlude was entertaining and provided a nice change of pace.
  • The crystallization of why it's hard to resolve particular kinds of subagent disagreements was useful.

One thing that I felt somewhat uncertain about was this passage:

So besides building spaghetti towers, the second strategy which the mind has evolved to employ for keeping its behavior coherent while piling up protectors, is the ability to re-process memories of past painful events.

Something about this felt like a bigger leap and/or stronger claim than I'd been expecting. Specifying "the second strategy which the mind has evolved" felt odd. Partly because it seems to implicitly claim there are exactly 2 strategies, or that they evolved in a specific order. Partly because "re-process memories of past painful events" reifies a particular interpretation that I'd want to examine more.

It seems like nonhuman animals need to deal with similar kinds of spaghetti code, but I'd be somewhat surprised if the way they experienced that made most sense to classify as "re-processing memories."

Partly because it seems to implicitly claim there are exactly 2 strategies, or that they evolved in a specific order.

Oh, I didn't mean to imply either of those. (those are the two strategies that I know of, but there could obviously be others as well)

It seems like nonhuman animals need to deal with similar kinds of spaghetti code, but I'd be somewhat surprised if the way they experienced that made most sense to classify as "re-processing memories."

How come?

I haven't thought that much about it, but "re-process memories" feels like... it sort of requires language, and orientation around narratives. Or maybe it's just that that's what it feels like from the inside when I do it; I have a hard time imagining other ways it could be.

When I think about, say, a rabbit re-processing memories, I'm not sure what the qualia of that would be like.

My current guess is that, for non-social reprocessing, I'd expect it to look more like tacking on additional layers of spaghetti code, or the simple fading away of unused spaghetti code.

Say that one time visiting an open field got you almost killed, so you avoided open fields. But eventually you found an open field where there weren't predators. And the warning flags that would get thrown when you saw a bird or dog (which would otherwise reinforce the "ahh! open field === predators === run!" loop) would turn out to be false alarms ("oh, that's not a dog, that's some non-predator animal"). So gradually those loops would fire less often until they stopped firing.

But that doesn't feel like "reprocessing", just continuous processing. Reprocessing feels like something that requires you to have an ontology, where you actually realize you were classifying something incorrectly and then actually believe the new reclassification, which I don't expect a rabbit to do. It's plausible that smarter birds or apes might, but it still feels off.
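The fading-out dynamic described here (a warning loop that fires less often as false alarms accumulate) looks a lot like simple extinction learning. A toy sketch using a Rescorla-Wagner-style update; this is my own illustration, not anything claimed in the thread:

```python
# Toy sketch of the "loop fires less often until it stops" dynamic:
# each false alarm (open field, no predator) weakens the learned association.
def update_fear(strength, predator_present, learning_rate=0.3):
    """Move the association strength toward the observed outcome."""
    outcome = 1.0 if predator_present else 0.0
    return strength + learning_rate * (outcome - strength)

fear = 1.0  # "open field means predators", after the near-death event
for _ in range(10):  # ten visits to the safe field, all false alarms
    fear = update_fear(fear, predator_present=False)
# fear has decayed toward zero: the loop gradually stops firing
```

Nothing here requires an ontology or a reinterpretation of the original memory; the association just weakens with disconfirming experience, which is the contrast being drawn with "reprocessing".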

I think I'd still expect most primitive social interactions to be similarly a matter of reinforcement learning. Maybe at some point a bully or alpha threatened you and you were scared of them. But then later, when they achieved dominance (or were driven out of it by a rival), they stopped bullying you, and the "threat! ahh! submit to them!" loop stopped firing as often or as hard, and eventually faded.

I'd predict it's (currently) a uniquely language-species thing to go "oh, I had made a mistake" and then reprocess memories in a way that changes your interpretation of them. (I'm not that confident in this, am mulling over what sorts of experiments would distinguish this)

Note that memory reconsolidation was originally discovered in rats, so there at least appears to be preliminary evidence that goes against this perspective. Although "memory" here refers to something different from what we normally think about, the process is basically the same.

There's also been some interesting speculation that what's actually going on in modalities like IFS and Focusing is the exact same process. The speculation comes from the fact that the requirements seem to be the same for both animal memory reconsolidation and therapies that have fast/instant changes such as coherence therapy, IFS, EMDR, etc. I've used some of these insights to create novel therapeutic modalities that seem to anecdotally have strong effects by applying the same requirements in their most distilled form.

Interesting. Your link seems to include a lot of papers that I'm not quite sure how to orient around. Do you have suggestions on where/how to direct my attention there?

I originally learned about the theory from the book I linked to, which is a good place to start but also clearly biased because they're trying to make the case that their therapy uses memory reconsolidation. Wikipedia seems to have a useful summary.

I haven't thought that much about it, but "re-process memories" feels like... it sort of requires language, and orientation around narratives.

Hmm. I'm not sure to what extent, if any, I'm using language when I'm re-processing memories? Except when I'm explicitly thinking about what I want to say to someone, or what I might want to write, I generally don't feel like I think in a language: I feel like I think in mental images and felt senses.

"Narratives", I think, are basically impressions of cause and effect or simple mental models, and any animals that could be described as "intelligent" in any reasonable sense do need to have those. "Memory re-processing", would then just be an update to the mental model that you interpreted the memory in terms of.

I feel like this excerpt from "Don't Shoot the Dog" could be an example of very short-term memory reprocessing:

I once videotaped a beautiful Arabian mare who was being clicker-trained to prick her ears on command, so as to look alert in the show ring. She clearly knew that a click meant a handful of grain. She clearly knew her actions made her trainer click. And she knew it had something to do with her ears. But what? Holding her head erect, she rotated her ears individually: one forward, one back; then the reverse; then she flopped both ears to the sides like a rabbit, something I didn't know a horse could do on purpose. Finally, both ears went forward at once. Click! Aha! She had it straight from then on. It was charming, but it was also sad: We don't usually ask horses to think or to be inventive, and they seem to like to do it.

This (and other similar anecdotes in the book) doesn't look to me like it's just simple reinforcement learning: rather, it looks to me more like the horse has a mental model of the trainer wanting something, and is then systematically exploring what that something might be, until it hits on the right alternative. And when it does, there's a rapid re-interpretation of the memory just a moment ago: from "in this situation, my trainer wants me to do something that I don't know what", to "in this situation, my trainer wants me to prick my ears".

Hmm, a story that might be relevant:

My mom once hit a dog with her car, and then brought it to a vet. She tried to find the original owner but couldn't, and eventually adopted him formally. He was very small, and had been living in the woods for weeks at least, and had lots of injuries.

For several months after being brought home, he would sit and stare blankly into the corner of the wall.

Eventually, my sister started spending hours at a time leaving food next to him while lying motionless. Eventually, he started eating the food. Eventually, he started letting her touch him (but not other humans). Nowadays, he appears to be generally psychologically healthy.

This seems a lot more like classic PTSD, and something like actual therapy. It still doesn't seem like it requires reprocessing of memories, although it might. I also don't expect this sort of situation happens that often in the wild.

The original question which motivated this section was: why are we sometimes incapable of adopting a new habit or abandoning an old one, despite knowing that to be a good idea? And the answer is: because we don’t know that such a change would be a good idea. Rather, some subsystems think that it would be a good idea, but other subsystems remain unconvinced. Thus the system’s overall judgment is that the old behavior should be maintained.

To me this is the key insight for working with subagent models. Just to add something about the phenomenology of it: I think many people struggle with this because the conflicts can feel like failures to update on evidence, which in turn feels like a personal failure, as a result of identifying with a particular subagent (see a recent article I posted on akrasia that makes this same claim and tries to convince the reader of it in terms of dual-process theory). So this is a case of easier said than done, but I think just having this frame is extremely helpful for making progress, because at least you have a way of thinking of yourself not as fighting against yourself, but as manipulating complex machinery that decides what you do.

As a bonus to my developmental psychology friends out there, I think this points to the key insight for making the Kegan 3 to 4 transition (and for my Buddhist friends out there, the insight that, once grokked, will produce stream entry), although your mileage may vary.

for making the Kegan 3 to 4 transition

(Did you mean to write 4 to 5?)

No, I meant 3 to 4. What I think of as the 4 to 5 key insight builds on this one: not only can you think of yourself as a manipulable complex system/machinery and work with that, it takes a step back and says that what you choose to make the system do can also be manipulated. That's of course a natural consequence of the first insight, but really believing it and knowing how to work with it takes time, and constitutes the transition to another level, because getting that insight requires the ability to intuitively work with an additional level of abstraction in your thinking.

Following on, 5 to 6 is about stepping back from what you choose to make the system do, and finding that you can treat as object (and manipulate) how you choose (preferences; the system that does the choosing). Then 6 to 7 is about getting back one more level and seeing that you can manipulate not just preferences but perceptions, since they control the inputs that produce preferences.

So, just to check, we are still talking about the Kegan stage 4 that, according to Kegan, 35% of the adult population has attained? Are you saying that getting to stage 4 is actually the same as attaining stream entry, or just that the work to get to stream entry involves similar insights?

So I do think stream entry is way more common than most people would think, because the thing that is stream entry is amazing and useful but also incredibly normal, and I think lots of folks are walking around having no idea they attained it (this relies, though, on a very parsimonious approach to what counts as stream entry). Whether my identifying it with Kegan 4 means the same thing as the study from which that number comes (which was itself, as I recall, not that great a study, and was led by Lahey) is questionable, since it depends on where you choose to draw the borders for each stage (the Subject-Object Interview manual provides one way of doing this, and is the method by which the number you mention was obtained).

My suspicion is that the number is much lower as I would count it, closer to 7% by a Fermi estimate based on my own observations and other evidence I know of, even though a lot of folks (this is where I would say the 35% number makes sense) are somewhere in what I would consider the 3.5 to 4 range, where they might be able to pass as 4 but have not yet had the important insight that would put them fully into stage 4.

So all those caveats aside, yes, I consider stream entry to be pointing at the same thing as Kegan 4.

Having studied and achieved stream entry (in the Vipassana tradition), I very much doubt many people have stumbled into it. Although for clarity, what % of the population are we talking about? Quick Fermi estimate: from what I've seen in the spiritual community, about 1/1000 have achieved stream entry spontaneously/easily. Out of my bubble, I'd say 1/100 is spiritually inclined. Then I'd add another factor of at least 1/100 to control for my bubble being the Bay Area.

The reason why I doubt it is that most people will tell you (and have written) that it has taken them (and people they know) many years and intense practice to get it.

I do think a lot of people have gotten A&P though.

To be clear, I am appropriating stream entry here the same way Ingram has, with much more inclusive (because much smaller and more specific) criteria than what is traditional. I agree with your point about A&P, and maybe I am typical-minding here, because I made it through to what matches with once-returner without formal practice (although I did engage in a lot of informal practices that dragged me along the same way).

Are Ingram's criteria particularly inclusive? He has talked a bunch about most people who think themselves being stream enterers not actually being that, e.g.:

The A&P is so commonly mistaken for things like Equanimity, higher jhanas (third and fourth, as well as formless realms), and Stream Entry, or even some higher path, even on its first occurrence, that I now have to actively check myself when responding to emails and forum posts so that I don't automatically assume that this is what has gone on, as it is probably 50:1 that someone claiming stream entry has actually just crossed the A&P. [...]

Overcalling attainments has become something of an endemic disease in those exposed to the maps. It annoys the heck out of dharma teachers who feel some responsibility to keep practitioners on the rails and in the realms of reality.

Right, if you've not had the later experiences (equanimity, fruition leading to attainment) you're likely to mistake others for them, especially if you have a very squishy model of enlightenment and especially especially if you are trying hard to attain the path. My comment was more a reference to the fact that Ingram seems to view stream entry as a very precise thing relative to how it is talked about in theravada, which is why it seems possible that some of the above disagreement on numbers might be due to a different sense of what qualifies as stream entry.

I have my own fairly precise way of describing it, which is that you develop the capacity to always reason at Commons' MHC level 13 (this is placed about halfway along the 4 to 5 transition in the normal Kegan model by Wilber, but I consider that to be an inflation of what's really core 4), i.e. you S1-reason that way; deliberative S2 reasoning at that level is going to happen first but doesn't count. At least as of right now I think that, but I could probably be convinced to wiggle the location a little bit, because I'm trying to project my internal model of it back out to other existing models that I can reference.

What do you mean by trying hard? Why is this less beneficial than not trying hard? How not to try hard?

I have been practicing meditation for 2.5 years and I think I did not even make it to A&P. Might that be the sign that I am doing something wrong? 

When people try too hard, they set up strong expectations about what will happen. This works at cross purposes to awakening to just what is because it is a way of strongly grasping for something other than what is. Awakening, whether that be stream entry or enlightenment, requires surrendering or giving oneself over.

Importantly, though, trying like I'm talking about here is distinct from effort, which you need. You have to show up, do the practice, and wholeheartedly work at whatever it is you're doing. But effort can be skillfully applied without trying or grasping for something.

Jump up one more level to Kegan 5 (<1% of the population) and it jibes much more closely with survey estimates of 0.5% of the population having some sort of permanent attainment (the survey does not use the Theravadan map).

I found a scientific author who wrote extensively about subpersonalities, and it looks like he has not been mentioned here yet. I will add the links I found:

David Lester, "A Subself Theory of Personality"

Also, the Encyclopedia of Personality and Individual Differences includes a section by Lester with findings about subpersonalities (p. 3691) - yes, it is page three thousand something.

Have you - or anyone, really - put much thought into the implications of these ideas to AI alignment?

If it's true that modeling humans at the level of constitutive subagents renders a more accurate description of human behavior, then any true solution to the alignment problem will need to respect this internal incoherence in humans.

This is potentially a very positive development, I think, because it suggests that a human can be modeled as a collection of relatively simple subagent utility functions, which interact and compete in complex but predictable ways. This sounds closer to a gears-level portrayal of what is happening inside a human, in contrast to descriptions of humans as having a single convoluted and impossible-to-pin-down utility function.
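To gesture at what such a gears-level model could look like, here is a minimal sketch. Everything in it (the subagents, their utilities, and the context-dependent weights) is hypothetical, invented purely for illustration; it is not a model anyone in this thread has proposed:

```python
# A "human" as a set of simple subagent utility functions whose
# context-weighted sum drives action choice.
ACTIONS = ["work", "eat", "rest"]

SUBAGENTS = {
    # each subagent maps action -> utility (all values made up)
    "hunger":   {"work": -1.0, "eat": 2.0,  "rest": 0.0},
    "ambition": {"work": 2.0,  "eat": -0.5, "rest": -1.0},
    "fatigue":  {"work": -2.0, "eat": 0.5,  "rest": 2.0},
}

def choose(weights):
    """Pick the action maximizing the weighted sum of subagent utilities."""
    def total(action):
        return sum(weights[name] * utils[action]
                   for name, utils in SUBAGENTS.items())
    return max(ACTIONS, key=total)

# Context shifts the weights, so behavior can look incoherent globally
# even though each subagent is simple and the aggregation rule is fixed.
print(choose({"hunger": 0.2, "ambition": 1.0, "fatigue": 0.2}))  # -> "work"
print(choose({"hunger": 0.3, "ambition": 0.5, "fatigue": 2.0}))  # -> "rest"
```

Each component utility function is trivial; the "complex but predictable" behavior comes entirely from how context reweights them, which is the contrast being drawn with a single convoluted utility function.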

I don't know if you're at all familiar with Mark Lippman's Folding material and his ontology for mental phenomenology. My attempt to summarize his framework of mental phenomena is as follows: there are belief-like objects (expectations, tacit or explicit, complex or simple), goal-like objects (desirable states or settings or contexts), affordances (context-activated representations of the current potential action space) and intention-like objects (plans coordinating immediate felt intentions, via affordances, toward goal-states). All cognition is "generated" by the actions and interactions of these fundamental units, which I infer must be something like neurologically fundamental. Fish and maybe even worms probably have something like beliefs, goals, affordances and intentions. Ours are just bigger, more layered, more nested and more interconnected.

The reason I bring this up is that Folding was a bit of a kick in the head to my view on subagents. Instead of seeing subagents as being fundamental, I now see subagents as expressions of latent goal-like and belief-like objects, and the brain is implementing some kind of passive program that pursues goals and avoids expectations of suffering, even if you're not aware you have these goals or these expectations. In other words, the sense of there being a subagent is your brain running a background program that activates and acts upon the implications of these more fundamental yet hidden goals/beliefs.

None of this is at all in contradiction to anything in your Sequence. It's more like a slightly different framing, where a "Protector Subagent" is reduced to an expression of a belief-like object via a self-protective background process. It all adds up to the same thing, pretty much, but it might be more gears-level. Or maybe not.

I definitely have some thoughts on the AI alignment implications, yes. Still working out exactly what they are. :-) A few fragmented thoughts, here's what I wrote in the initial post of the sequence:

In a recent post, Wei Dai mentioned that “the only apparent utility function we have seems to be defined over an ontology very different from the fundamental ontology of the universe”. I agree, and I think it’s worth emphasizing that the difference is not just “we tend to think in terms of classical physics but actually the universe runs on particle physics”. Unless they've been specifically trained to do so, people don’t usually think of their values in terms of classical physics, either. That’s something that’s learned on top of the default ontology.
The ontology that our values are defined over, I think, shatters into a thousand shards of disparate models held by different subagents with different priorities. It is mostly something like “predictions of receiving sensory data that has been previously classified as good or bad, the predictions formed on the basis of doing pattern matching to past streams of sensory data”. Things like e.g. intuitive physics simulators feed into these predictions, but I suspect that even intuitive physics is not the ontology over which our values are defined; clusters of sensory experiences are that ontology, with intuitive physics being a tool for predicting how to get those experiences. This is the same sense in which you might e.g. use your knowledge of social dynamics to figure out how to get into situations which have made you feel loved in the past, but your knowledge of social dynamics is not the same thing as the experience of being loved.

Also, here's what I recently wrote to someone during a discussion about population ethics:

I view the function of ethics/morality as two-fold:

1) My brain is composed of various subagents, each of which has different priorities or interests. One way of describing them would be to say that there are consequentialist, deontologist, virtue ethical, and egoist subagents, though that too seems potentially misleading. Subagents probably don't really care about ethical theories directly, rather they care about sensory inputs and experiences of emotional tone. In any case, they have differing interests and will often disagree about what to do. The _personal_ purpose of ethics is to come up with the kinds of principles that all subagents can broadly agree upon as serving all of their interests, to act as a guide for personal decision-making.

(There's an obvious connection from here to moral parliament views of ethics, but in those views the members of the parliament are often considered to be various ethical theories - and like I mentioned, I do not think that subagents really care about ethical theories directly. Also, the decision-making procedures within a human brain differ substantially from those of a parliament. E.g. some subagents will get more voting power at times when the person is afraid or sexually aroused, and there need to be commonly-agreed-upon principles which prevent temporarily-powerful agents from using their power to take actions which would then be immediately reversed when the balance of power shifted back.)

2) Besides disagreements between subagents within the same mind, there are also disagreements among people in a society. Here the purpose of ethics is again to act as providing common principles which people can agree to abide by; murder is wrong because the overwhelming majority of people agree that they would prefer to live in a society where nobody gets murdered.

The personal level of this view produces something tending towards moral particularism (though I have not carefully considered to what extent I endorse all of the claims attributed to particularists on that page), while the societal level of it tends towards the "doing moral philosophy is engineering social technologies" stance.

You mention that person-affecting views are intractable as a solution to generating betterness-rankings between worlds. But part of what I was trying to gesture at when I said that the whole approach may be flawed, is that generating betterness-rankings between worlds does not seem like a particularly useful goal to have.

On my view, ethics is something like an ongoing process of negotiation about what to do, as applied to particular problems: trying to decide which kind of world is better in general and in the abstract, seems to me like trying to decide whether a hammer or a saw is better in general. Neither is: it depends on what exactly is the problem that you are trying to decide on and its context. Different contexts and situations will elicit different views from different people/subagents, so the implicit judgment of what kind of a world is better than another may differ based on which contextual features of any given decision happen to activate which particular subagents/people.

Getting back to your suggested characterization of my position as "we ought to act as if something like a person-affecting view were true": I would say "yes, at least sometimes, when the details of the situation seem to warrant it, or at least that is the conclusion which my subagents have currently converged on". :slightly_smiling_face: I once wrote that:

> I've increasingly come to think that living one's life according to the judgments of any formal ethical system gets it backwards - any such system is just a crude attempt of formalizing our various intuitions and desires, and they're mostly useless in determining what we should actually do. To the extent that the things that I do resemble the recommendations of utilitarianism (say), it's because my natural desires happen to align with utilitarianism's recommended courses of action, and if I say that I lean towards utilitarianism, it just means that utilitarianism produces the least recommendations that would conflict with what I would want to do anyway.

Similarly, I can endorse the claim that "we should sometimes act as if the person-affecting view was true", and I can mention in conversation that I support a person-affecting view. When I do so, I'm treating it as a shorthand for something like "the judgments generated by my internal subagents sometimes produce similar judgments as the principle called 'person-affecting view' does, and I think that adopting it as a societal principle in some situations would cause good results (in terms of being something that would produce the kinds of behavioral criteria that both my and most people's subagents could consider to produce good outcomes)".

Also a bunch of other thoughts which partially contradict the above comments, and are too time-consuming to write in this margin. :)

Re: Folding, I started reading the document and found the beginning valuable, but didn't get around to reading it to the end. I'll need to read the rest, thanks for the recommendation. I definitely agree that this

Instead of seeing subagents as being fundamental, I now see subagents as expressions of latent goal-like and belief-like objects, and the brain is implementing some kind of passive program that pursues goals and avoids expectations of suffering, even if you're not aware you have these goals or these expectations. In other words, the sense of there being a subagent is your brain running a background program that activates and acts upon the implications of these more fundamental yet hidden goals/beliefs.

sounds very plausible. I think I was already hinting at something like that in this post, when I suggested that essentially the same subsystem (habit-based learning) could contain competing neural patterns corresponding to different habits, and treated those as subagents. Similarly, a lot of "subagents" could emerge from essentially the same kind of program acting on contradictory beliefs or goals... but I don't know how I would empirically test one possibility over the other (unless reading the Folding document gives me ideas), so I'll just leave that part of the model undefined.

I sort of started in this vicinity but then ended up somewhere else.

Note: Due to a bug, if you were subscribed to email notifications for curated posts, the curation email for this post came from Alignment Forum instead of LessWrong. If you're viewing this post on AF, to see the comments, view it on LessWrong instead. (This is a LessWrong post, not an AF post, but the two sites share a database and have one-directional auto-crossposting from AF to LW.)

I think that my akrasia manifests itself as the agents that vote for the right thing being too weak, because the mechanism of positive reinforcement is somehow broken. I mean that when I choose to do the thing that I want, I know that there won't be any pleasantness to experience. If the right things do not feel right, it is much harder to choose them. This is similar to alexithymia. I have talked with a psychiatrist about this, and he prescribed aripiprazole. I have been taking it for 4 weeks now, and I am starting to see the benefits.


Besides the subsystems making their own predictions, there might also be a meta-learning system keeping track of which other subsystems tend to make the most accurate predictions in each situation, giving extra weight to the bids of the subsystem which has tended to perform the best in that situation.

This is why I eat junk food, sans guilt. I don't want my central planning subagent to lose influence over unimportant details. Spend your weirdness points wisely.
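The quoted meta-learning mechanism (track which subsystems predict well in each situation, and weight their bids accordingly) could be sketched as something like a multiplicative-weights scheme. This is my own toy construction, not anything from the post; the subsystems and situations are invented for illustration:

```python
from collections import defaultdict

class MetaLearner:
    """Weights each subsystem's bids by its past predictive accuracy,
    tracked separately per situation."""
    def __init__(self, subsystems, lr=0.5):
        self.subsystems = subsystems  # name -> predict(situation) function
        self.lr = lr
        # weights[situation][name], initialized lazily to 1.0
        self.weights = defaultdict(lambda: defaultdict(lambda: 1.0))

    def bid_weights(self, situation):
        """Normalized weight each subsystem's bid gets in this situation."""
        w = {n: self.weights[situation][n] for n in self.subsystems}
        total = sum(w.values())
        return {n: v / total for n, v in w.items()}

    def observe(self, situation, outcome):
        """Once the outcome is known, shrink inaccurate subsystems' weights."""
        for name, predict in self.subsystems.items():
            error = abs(predict(situation) - outcome)
            # multiplicative update: bigger error -> smaller weight
            self.weights[situation][name] *= (1.0 - self.lr * min(error, 1.0))

# Hypothetical example: "visual" predicts well in daylight, poorly in the dark.
meta = MetaLearner({
    "visual":   lambda s: 1.0 if s == "day" else 0.0,
    "auditory": lambda s: 0.8,
})
for _ in range(3):
    meta.observe("dark", 1.0)  # in the dark, the true outcome is 1.0
# "auditory" now gets most of the bid weight in "dark" situations
```

The per-situation weight tables capture the idea that a subsystem can dominate bids in contexts where it has a good track record while being nearly ignored elsewhere.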