A typical paradigm by which people tend to think of themselves and others is as consequentialist agents: entities who can be usefully modeled as having beliefs and goals, and who then act according to their beliefs to achieve those goals.
An abstraction, in the computer science sense, is a simplification which tries to hide the underlying details of a thing, letting you think in terms of the simplification rather than the details. To the extent that the abstraction actually succeeds in hiding the details, this makes things a lot simpler. But sometimes the abstraction leaks: the simplification fails to predict some of the actual behavior that emerges from the details, and in that situation you need to know the underlying details and be able to think in terms of them.
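To make this concrete with a standard programming example: floating-point numbers are an abstraction over the real numbers. Usually you can think in terms of "real arithmetic" without worrying about the underlying binary representation, until the representation leaks through:

```python
# Floating-point numbers abstract over the real numbers. The abstraction
# usually holds, but the underlying base-2 representation sometimes leaks.
print(0.1 + 0.2 == 0.3)  # False: the abstraction leaks
print(0.1 + 0.2)         # 0.30000000000000004

# Handling the leak requires thinking at the lower level of detail,
# e.g. comparing with a tolerance instead of exact equality.
print(abs((0.1 + 0.2) - 0.3) < 1e-9)  # True
```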
Agent-ness being a leaky abstraction is not exactly a novel concept for Less Wrong; it has been touched upon several times, such as in Scott Alexander’s Blue-Minimizing Robot Sequence. At the same time, I do not think that it has been fully internalized yet, and many foundational posts on LW go wrong by being premised on the assumption that humans are agents. In fact, I would go as far as to claim that this is the biggest flaw of the original Sequences: they attempted to explain many failures of rationality as being due to cognitive biases, when in retrospect it looks like understanding cognitive biases doesn’t actually make you substantially more effective. But if you are implicitly modeling humans as goal-directed agents, then cognitive biases are the most natural place for irrationality to emerge from, so it made sense to focus on them the most.
Just knowing that an abstraction leaks isn’t enough to improve your thinking, however. To do better, you need to learn about the actual underlying details and use them to build a better model. In this sequence, I will elaborate on various tools for thinking about minds which look at humans in more granular detail than the classical agent model does. Hopefully, this will help us get past the old paradigm.
One particular family of models that I will be discussing is that of multi-agent theories of mind. Here the claim is not that we literally have multiple personalities. Rather, my approach will be similar in spirit to the one in Subagents Are Not A Metaphor:
Here are the parts composing my technical definition of an agent:
1. Values
This could be anything from literally a utility function to highly framing-dependent. Degenerate case: embedded in lookup table from world model to actions.
2. World Model
Degenerate case: stateless world model consisting of just sense inputs.
3. Search Process
Causal decision theory is a search process. “From a fixed list of actions, pick the most positively reinforced” is another. Degenerate case: lookup table from world model to actions.
Note: this says a thermostat is an agent. Not figuratively an agent. Literally technically an agent. Feature not bug.
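To make the quoted definition concrete, here is a toy sketch of my own (not from the quoted post) of a thermostat written out as exactly those three parts, with made-up numbers for the temperature dynamics:

```python
# Toy illustration: a thermostat decomposed into the three parts of the
# technical agent definition. The +/-1.0 degree dynamics are invented
# purely for the example.

def thermostat_step(sensed_temp, target=21.0):
    # 1. Values: prefer predicted states closer to the target temperature.
    def score(predicted_temp):
        return -abs(predicted_temp - target)

    # 2. World model (degenerate case: just the sense input, plus a
    #    one-step prediction of each action's effect).
    def predict(action):
        return {"heat_on": sensed_temp + 1.0,
                "heat_off": sensed_temp - 1.0}[action]

    # 3. Search process (degenerate case: pick the best action from a
    #    fixed list).
    actions = ["heat_on", "heat_off"]
    return max(actions, key=lambda a: score(predict(a)))

print(thermostat_step(18.0))  # a cold room -> "heat_on"
print(thermostat_step(25.0))  # a warm room -> "heat_off"
```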
This is a model that can be applied naturally to a wide range of entities, as seen from the fact that thermostats qualify. And the reason why we tend to automatically think of people - or thermostats - as agents is that our brains have evolved to naturally model things in terms of this kind of intentional stance; it’s a way of thought that comes natively to us.
Given that we want to learn to think about humans in a new way, we should look for ways to map the new way of thinking into a native mode of thought. One of my tactics will be to look for parts of the mind that look like they could literally be agents (as in the above technical definition of an agent), so that we can replace our intuitive one-agent model with intuitive multi-agent models without needing to make trade-offs between intuitiveness and truth. This will still be a leaky simplification, but hopefully it will be a more fine-grained leaky simplification, so that overall we’ll be more accurate.
My model of what our subagents look like draws upon a number of different sources, including neuroscience, psychotherapy, and meditation, so in the process of sketching out the model I will be covering a number of them in turn. To give you a rough idea of what I'm trying to do, here's a summary of some upcoming content.
Book summary: Consciousness and the Brain. One of the fundamental building blocks of much of consciousness research is that of Global Workspace Theory (GWT). This could be described as a component of a multiagent model, focusing on the way in which different agents exchange information with one another. One elaboration of GWT, which focuses on how it might be implemented in the brain, is the Global Neuronal Workspace (GNW) model in neuroscience. Consciousness and the Brain is a 2014 book that summarizes some of the research and basic ideas behind GNW, so summarizing the main content of that book looks like a good place to start our discussion and to get a neuroscientific grounding before we get more speculative.
Building up to an IFS model. One theoretical approach for modeling humans as being composed of interacting parts is that of Internal Family Systems. In my experience and that of several other people in the rationalist community, it’s very effective for this purpose. However, having its origins in therapy, its theoretical model may seem rather unscientific and woo-y. This personally put me off the theory for a long time, as I thought that it sounded fake, and gave me a strong sense of "my mind isn't split into parts like that".
In this post, I construct a mechanistic sketch of how a mind might work, drawing on the kinds of mechanisms that have already been demonstrated in contemporary machine learning, and then end up with a model that pretty closely resembles the IFS one.
Subagents, introspective awareness, and blending. In this post, I extend the model of mind that I've been building up in previous posts to explain some things about change blindness, not knowing whether you are conscious, forgetting most of your thoughts, and mistaking your thoughts and emotions as objective facts, while also connecting it with the theory in the meditation book The Mind Illuminated.
Subagents, akrasia, and coherence in humans. We can roughly describe coherence as the property that, if you become aware that there exists a more optimal strategy for achieving your goals than the one that you are currently executing, then you will switch to that better strategy. For a subagent theory of mind, we would like to have some explanation of when exactly the subagents manage to be collectively coherent (that is, change their behavior to a better one), and in which situations they fail to do so.
My conclusion is that we are capable of changing our behaviors on occasions when the mind-system as a whole puts sufficiently high probability on the new behavior being better, when the new behavior is not being blocked by a particular highly weighted subagent (such as an IFS-style protector) that puts high probability on it being bad, and when we have enough slack in our lives for any new behaviors to be evaluated in the first place. Akrasia is subagent disagreement about what to do.
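As a purely illustrative sketch (the weights, probabilities, and thresholds below are made up by me and not part of any established model), the two conditions above - sufficient system-wide probability and no highly weighted objector - can be written as a toy calculation:

```python
# Hypothetical toy model of the coherence conditions described above:
# a new behavior is adopted only if the weighted mind-system as a whole
# favors it AND no highly weighted subagent strongly objects.

def adopts_new_behavior(subagents, threshold=0.5, veto_weight=2.0, veto_p=0.9):
    # subagents: list of (weight, probability_that_new_behavior_is_better)
    total = sum(w for w, _ in subagents)
    aggregate = sum(w * p for w, p in subagents) / total

    # A sufficiently weighty subagent that is confident the change is bad
    # (e.g. an IFS-style protector) blocks the switch outright.
    blocked = any(w >= veto_weight and (1 - p) >= veto_p for w, p in subagents)
    return aggregate > threshold and not blocked

# Agreement: the system as a whole favors the change, no strong objector.
print(adopts_new_behavior([(1.0, 0.8), (1.0, 0.7)]))               # True
# Akrasia-like disagreement: a heavy protector is confident it's bad.
print(adopts_new_behavior([(1.0, 0.8), (1.0, 0.7), (2.5, 0.05)]))  # False
```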
Integrating disagreeing subagents. In the previous post, I suggested that akrasia involves subagent disagreement - or in other words, different parts of the brain having differing ideas on what the best course of action is. The existence of such conflicts raises the question: how does one resolve them?
In this post I discuss various techniques which could be interpreted as ways of resolving subagent disagreements, as well as some of the reasons why this doesn’t always happen.
Subagents, neural Turing machines, thought selection, and blindspots. In my summary of Consciousness and the Brain, I briefly mentioned that one of the functions of consciousness is to carry out artificial serial operations; or in other words, implement a production system (equivalent to a Turing machine) in the brain.
While I did not go into very much detail about this model in the post, I’ve used it in later articles. For instance, in Building up to an Internal Family Systems model, I used a toy model where different subagents cast votes to modify the contents of consciousness. One may conceptualize this as equivalent to the production system model, where different subagents implement different production rules which compete to modify the contents of consciousness.
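As an illustrative sketch of that toy model (the rule names, bids, and workspace contents are my own invention), subagents can be written as production rules that bid to rewrite a shared workspace, with the highest bid winning each cycle:

```python
# Illustrative sketch of the voting toy model: subagents act like
# production rules that bid to rewrite the contents of a global
# workspace; the highest bidder on each cycle wins.

def run_workspace(rules, workspace, steps=3):
    for _ in range(steps):
        # Each rule inspects the workspace and returns (bid, new_contents),
        # or None if its condition does not match the current contents.
        bids = [r(workspace) for r in rules]
        bids = [b for b in bids if b is not None]
        if not bids:
            break  # no rule fired; the workspace stays as it is
        _, workspace = max(bids, key=lambda b: b[0])  # winner rewrites it
    return workspace

# Two hypothetical subagents competing over the same trigger.
hunger = lambda ws: (0.6, "plan a meal") if ws == "hungry" else None
deadline = lambda ws: (0.9, "think about the report") if ws == "hungry" else None

print(run_workspace([hunger, deadline], "hungry"))  # deadline outbids hunger
```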
In this post, I flesh out the model a bit more and apply it to a few other examples, such as emotion suppression, internal conflict, and blind spots.
Subagents, trauma, and rationality. This post interprets the appearance of subagents as emerging from unintegrated memory networks, and argues that the presence of these is a matter of degree. There’s a continuous progression of fragmented (dissociated) memory networks giving rise to increasingly worse symptoms as the degree of fragmentation grows. The continuum goes from everyday procrastination and akrasia on the “normal” end, to disrupted and dysfunctional beliefs in the middle, and conditions like clinical PTSD, borderline personality disorder, and dissociative identity disorder on the severely traumatized end.
I also argue that emotional work and exploring one's past traumas in order to heal them are necessary for effective instrumental and epistemic rationality.
Against "System 1" and "System 2". The terms System 1 and System 2 were originally coined by the psychologist Keith Stanovich and then popularized by Daniel Kahneman in his book Thinking, Fast and Slow. Stanovich noted that a number of fields within psychology had been developing various kinds of theories distinguishing between fast/intuitive thinking on the one hand and slow/deliberative thinking on the other. Often these fields were not aware of each other. The S1/S2 model was offered as a general version of these specific theories, highlighting features of the two modes of thought that tended to appear in all of them.
Since then, academics have continued to discuss the models. Among other developments, Stanovich and other authors have discontinued the use of the System 1/System 2 terminology as misleading, choosing to instead talk about Type 1 and Type 2 processing. In this post, I will build on some of that discussion to argue that Type 2 processing is a particular way of chaining together the outputs of various subagents using working memory. Some of the processes involved in this chaining are themselves implemented by particular kinds of subagents.
Near-term posts (partially already written):
A non-mysterious explanation of the Three Marks of Existence. If being an agent is a leaky abstraction, then one way of characterizing insight meditation would be as a technique for finding and staring at the places where the abstraction does leak. Here, I offer a model of insight meditation as a way to witness some of the processes by which the experience of being an agent is constructed, helping dissolve the kinds of confusions that make us think we are agents in the first place.
One way of carving up the space of things that you’ll find by doing insight meditation is by what some Buddhist schools call the Three Marks of Existence: no-self, impermanence, and unsatisfactoriness. Here, I try to sketch out an explanation of the kinds of things that these marks are pointing to, how they underlie a more accurate model of human psychology than the folk intuition does, and how witnessing them might be expected to transform one’s expectations.
Farther out (sketched out but not as extensively planned/written yet):
The game theory of rationality and cooperation in a multiagent world. Multi-agent models have a natural connection to Elephant in the Brain -style dynamics: our brains doing things for purposes of which we are unaware. Furthermore, there can be strong incentives to continue systematic self-deception and not integrate conflicting beliefs. For instance, if a mind has subagents which think that specific beliefs are dangerous to hold or express, then those subagents will work to keep the subagents holding such beliefs from coming into conscious awareness.
“Dangerous beliefs” might be ones that touch upon political topics, but they might also be ones of a more personal nature. For instance, someone may have an identity as being “good at X”, and then want to rationalize away any contradictory evidence - including evidence suggesting that they were wrong on a topic related to X. Or it might be something even more subtle.
These are a few examples of how rationality work has to happen on two levels at once: to debug some beliefs (individual level), people need to be in a community where holding various kinds of beliefs is actually safe (social level). But in order for the community to be safe for holding those beliefs (social level), people within the community also need to work on themselves so as to deal with their own subagents that would cause them to attack people with the “wrong” beliefs (individual level). This kind of work also seems to be necessary for fixing “politics being the mind-killer” and collaborating on issues such as existential risk across sharp value differences; but the need to carry out the work on many levels at once makes it challenging, especially since the current environment incentivizes many (sub)agents to sabotage any attempt at this.
(This topic area is also related to that stuff Valentine has been saying about Omega.)
AI alignment and multiagent models: submind values and the default human ontology. In a recent post, Wei Dai mentioned that “the only apparent utility function we have seems to be defined over an ontology very different from the fundamental ontology of the universe”. I agree, and I think it’s worth emphasizing that the difference is not just “we tend to think in terms of classical physics but actually the universe runs on particle physics”. Unless they've been specifically trained to do so, people don’t usually think of their values in terms of classical physics, either. That’s something that’s learned on top of the default ontology.
The ontology that our values are defined over, I think, shatters into a thousand shards of disparate models held by different subagents with different priorities. It is mostly something like “predictions of receiving sensory data that has been previously classified as good or bad, the predictions formed on the basis of doing pattern matching to past streams of sensory data”. Things like e.g. intuitive physics simulators feed into these predictions, but I suspect that even intuitive physics is not the ontology over which our values are defined; clusters of sensory experiences are that ontology, with intuitive physics being a tool for predicting how to get those experiences. This is the same sense in which you might e.g. use your knowledge of social dynamics to figure out how to get into situations which have made you feel loved in the past, but your knowledge of social dynamics is not the same thing as the experience of being loved.
This sequence is part of research done for, and supported by, the Foundational Research Institute.