A twenty-year-old doesn't value all the same things their fifty-year-old self will value
I have heard this often, but is it really so? When I think about my past self, it seems to me that I am actually more coherent than society keeps telling me. But of course, maybe I am just lying to myself, rewriting my memory to believe that my past self shared my current values.
I am really curious whether my self from 20 or 30 years ago would be okay with my current values and behavior, perhaps after hearing about experience they didn't have yet.
How could we test this experimentally, without a time machine? By giving young people a values questionnaire, including a lot of hypotheticals, such as "if it turned out after repeated attempts that X does not work, would it be okay to give up on X?", then calling them 10 years later and comparing the answers?
This is so great; it satisfyingly ties together a bunch of piecemeal understandings in my head. Maybe it's not worth getting into, since it's more about understanding humans than the general case of shard-based agents, but... Human brains have a lot of weird bugs that can lead to accidental shard creation or shard shifts, plus other oddities: optical illusions, or certain drugs being more addictive than the subjective pleasure they deliver would predict, thanks to idiosyncrasies in how they activate the reward systems. Or consider the local plasticity of the cortex, which allows modules to learn and also lets local learning reallocate module territory along the borders between modules; this can sometimes produce information leaks between modules that end up accidentally reinforced, such as sensory leaks between skin areas that aren't physically co-located on the body but whose receptive fields in the brain are adjacent and compete for territory. That's an example of something I wouldn't attempt to reproduce if I were trying to make a shard-based / brain-like agent.
But the planner's actual terminal value of satisfying the shard economy's weighted preferences...
Suppose I have a heuristic that fires strongly when I'm eating cupcakes or thinking thoughts that will lead to eating cupcakes, and then a control algorithm that makes my muscles fire to make things happen that correspond to thoughts that a heuristic rates highly, and this control algorithm is hooked up to the cupcake heuristic.
Saying that the "actual terminal value" of the control algorithm is to satisfy whatever heuristic is in its input slot (and so changing the heuristic to one that fires for strawberries wouldn't be a big deal) is kinda wrong. Saying that the "actual terminal value" of the control algorithm is only cupcakes and nothing else is also kinda wrong. They're both kinda wrong because trying to declare one thing the "actual terminal value" is the wrong exercise to be engaging in in the first place!
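To make the ambiguity concrete, here's a toy Python sketch (my own illustration; the function names and the substring-matching "heuristics" are invented for the example). The controller's objective is whatever heuristic occupies its input slot, yet for as long as the cupcake heuristic is plugged in, it behaves exactly like a cupcake-seeker; neither description of its "actual terminal value" is privileged:

```python
def cupcake_heuristic(thought: str) -> float:
    # Fires strongly for cupcake-related thoughts.
    return 1.0 if "cupcake" in thought else 0.0

def strawberry_heuristic(thought: str) -> float:
    # A drop-in replacement that fires for strawberries instead.
    return 1.0 if "strawberry" in thought else 0.0

def controller(heuristic, candidate_thoughts):
    # The "control algorithm": act on whichever thought the
    # plugged-in heuristic rates highest.
    return max(candidate_thoughts, key=heuristic)

thoughts = ["walk to the bakery for a cupcake", "eat a strawberry", "do nothing"]
print(controller(cupcake_heuristic, thoughts))     # picks the cupcake plan
print(controller(strawberry_heuristic, thoughts))  # same controller, different "values"
```

The same `controller` code implements both "agents"; only the heuristic in its slot differs, which is why asking which one is the "actual" terminal value is ill-posed.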
This is related to my other warning about the word "actual": this idea that you're "actually" the control algorithm and not the cupcake heuristic. There are multiple ways to think about you that work better or worse in different contexts (Since I just finished editing a sequence about this, I will shamelessly link it). I am so large I don't just contain multitudes, I contain multitudes of ways of parceling myself up into multitudes.
They're both kinda wrong because trying to declare one thing the "actual terminal value" is the wrong exercise to be engaging in in the first place!
I disagree. I'm not talking about the intentional stance or such "external" descriptions. I'm claiming that if you took the explicit algorithmic implementation of the human mind and looked over it, you would find some kind of distinct "planner" part, and that part would be something like an idealized utility-maximizer with a pointer to the shard economy in place of its utility function.
It's not a frame that can be kinda wrong/awkward to use. It's a specific mechanistic prediction that's either flat-out right or flat-out wrong.
This is related to my other warning about the word "actual": this idea that you're "actually" the control algorithm and not the cupcake heuristic
Mm, I'm more willing to relax this assumption. It ties into my model of self-awareness — I suspect it might be the case that the planner is the thing that's being fed summaries of the brain's state, making it literally the thing that's having qualia. But I haven't fully worked out my model of that.
I suspect that much of the appeal of shard theory is working through detailed explanations of model-free RL with general value function approximation for people who mostly think of AI in terms of planning/search/consequentialism.
But if you already come from a model-free RL value approx perspective, shard theory seems more natural.
Moment to moment decisions are made based on value-function bids, with little to no direct connection to reward or terminal values. The 'shards' are just what learned value-function approximating subcircuits look like in gory detail.
The brain may have a prior towards planning subcircuitry, but even without a strong prior, planning submodules will eventually emerge naturally in a model-free RL learning machine of sufficient scale (there is no fundamental difference between model-free and model-based for universal learners). TD-like updates ensure that the value function extends over longer timescales as training progresses (and in general, humans seem to plan on timescales which scale with their lifespan, as you'd expect).
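A minimal tabular TD(0) sketch of that last point (my own illustration; the chain environment and the constants are arbitrary). States early in the chain acquire discounted value despite never receiving reward directly, which is the sense in which the value function "extends over longer timescales" with training:

```python
# Tabular TD(0) on a 5-state chain; reward 1.0 arrives only on
# reaching the final state.
n_states, gamma, alpha = 5, 0.9, 0.1
V = [0.0] * n_states  # learned value function, initialized to zero

for _ in range(500):  # many sweeps of experience
    for s in range(n_states - 1):
        s_next = s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

print([round(v, 2) for v in V])  # ≈ [0.73, 0.81, 0.9, 1.0, 0.0]
```

The earliest state ends up worth `gamma**3` of the terminal reward: credit has propagated backwards across the whole chain without any explicit model or plan.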
Humans can change their action patterns on a dime, inspired by philosophical arguments, convinced by logic, indoctrinated by political or religious rhetoric, or plainly because they're forced to.
I'd add that action patterns can change for reasons other than logical/deliberative ones. For example, adapting to a new culture means you might adopt and have new reactions to objects, gestures, etc that are considered symbolic in that culture.
I have two strong intuitions about human values that had seemed utterly irreconcilable to me until recently.
I believe I see a way to unify the two. The crucial insights have been supplied by the Shard Theory, and the final speculative picture is broadly supported by it. This result dissolved most of my high-level confusions about human values and goal-directed behavior, in addition to satisfying a lot of other desiderata.
1. The Shard Theory of Human Value: A Recap
Disclaimer: This summary does not represent the views of Team Shard, but only my subjective understanding. For the official summary, see this.
According to the Shard Theory, in the course of brain development, humans jointly learn two things: a world-model, and a set of contextual reaction patterns reinforced by hard-coded reward circuitry.
The latter are "shards". In their most primitive form, they're just observation→action activation patterns. You see a lollipop enter your field of vision, you grab it. You see a flashlight pointed at your face, you close your eyes.
As the world-model grows more advanced, shards can grow more sophisticated as well. Instead of only attaching to observations, they can attach to things in the world-model. If you're modelling the world as containing a lollipop the next room over, your lollipop-shard will bid for a plan to go grab it. If your far-future model says that becoming a salaried professional will give you enough income to buy a lot of lollipops, your lollipop-shard will bid for it.
A lot of other values and habits are implemented the same way. The desire to do nice things for people you like, the avoidance of life-threatening situations, the considerations that go into the choice of career — all of those are just shard-implemented reaction patterns, which react to things in your world-model and bid for particular responses to them. If you expect someone you like to be unhappy, a shard activates, bidding for an action-sequence that changes that prediction. If you expect to be in a life-threatening situation, a whole bunch of shards rebel against that vision. If you're considering career choices, you're choosing between different models of the future, and whichever wins the "popularity contest" among the shards is what ends up implemented.
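The "popularity contest" can be sketched as a weighted vote over candidate plans (a deliberately crude toy of my own, not Team Shard's formalism; the shards and bid numbers are made up):

```python
# Candidate models of the future, to be voted on by the shards.
plans = ["become a dentist", "open a candy shop", "become an accountant"]

# Each shard maps a plan to a bid (positive = for, negative = against).
shards = {
    "lollipop": lambda p: 2.0 if "candy" in p else 0.0,
    "income":   lambda p: 0.5 if "candy" in p else 1.0,
    "health":   lambda p: -1.0 if "candy" in p else 0.0,
}

def winning_plan(plans, shards):
    # The plan with the highest total bid wins the contest.
    return max(plans, key=lambda p: sum(bid(p) for bid in shards.values()))

print(winning_plan(plans, shards))  # → "open a candy shop" (total bid 1.5 vs 1.0)
```

Note that no plan is evaluated against a single explicit utility function; the "decision" is just the aggregate of context-triggered bids, which is the picture the post is building on.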
Shards can conflict. Some values are mutually contradictory; the preference for lollipops might conflict with preferences for health and being attractive and avoiding dentists, so plans a lollipop-shard bids for may be overruled by other shards. If the lollipop-shard is suppressed too many times, it'll atrophy and die out.
Shards have a self-preservation instinct. Some indirect — they see that certain changes to personality will decrease the amount of things they value in the future, and will bid against such value-drift plans (you don't want to self-modify to hate your loved ones, because that will make you do things that will make them unhappy). Some direct — these shards can identify themselves in the world-model, and directly bid against plans that eliminate them. (You might inherently like some aspects of your personality, and protest against changes to them — not because of outside-world outcomes, but because that's who you like to be. Conversely, imagine a non-reflectively-stable shard, like a crippling fear of spiders or drug addiction. You don't value valuing this, so you can implement plans that eliminate the corresponding shards via e.g. therapeutic interventions.)
Altogether, a mind like this would resemble humans pretty well. In particular, it crisply defines what "human flourishing" is: the state of the world which minimizes its constituent shards' resistance to it, a picture of the world that the maximum number of shards approve of. And in addition to satisfying our values on the object level, it'll also need to satisfy shards' preferences for self-perpetuation.
Hence our preference for diverse, dynamic futures in which we remain ourselves.
2. The Gaps in the Picture
But. That's clearly not a complete story of how humans work, is it?
The shard economy as presented in Part 1 is too rigid. According to it, a human's policy is a relatively shallow function of that human's constituent shards, and significant changes to it imply correspondingly significant changes in the shard economy. And such changes would be rare: ancient, deeply-established shards would have a lot of sway, and their turnover would be low.
But that's not what we often observe. Humans can change their action patterns on a dime, inspired by philosophical arguments, convinced by logic, indoctrinated by political or religious rhetoric, or plainly because they're forced to.
Suppose a human has a bunch of deeply ingrained values, like a) "donate to the local community" or b) "eat pork" or c) "don't kill".
None of these are knock-out rebuttals. Indeed, even in the last two examples, the new action patterns are not implacable. A sufficiently strong trigger/shard — like a deep trauma, or a very strong value like the love for a child — can break past the life-preservation act in (4) or the ideological takeover in (5).
But this doesn't fully gel with the basic shard-centred picture either. It implies circumstances in which a human's behavior is mainly explained and controlled by some isolated deliberative process, not their entire set of ingrained values. Some part of the human logically reasons out a new policy and then implements it; not as the result of stochastic shard negotiation, but in circumvention of it.
Another issue is the sheer generalizability of human behavior this implies. I can imagine responding to any event my world-model can model in any way I can model. I don't need a special observation→action shard for every case — my collection of shards is already somehow fully generalizable. And if I were trapped in a dystopia, I'd be able to spoof the existence of whatever shards my captors want me to have, regardless of my actual shard makeup.
So what's up with that?
3. An Attempt At Reconciliation
We clearly need to introduce some mechanism of planning/search. The exact implementation is a source of some disagreement:
Regardless of the specifics, however, what we get is an advanced, consequentialist plan-making mechanism whose goal is to come up with plans that satisfy the weighted sum of the shard economy's preferences.
This, I argue, is what we are: that planner mechanism, a fairly explicit mesa-optimizer algorithm running on our brains. And our terminal value is to satisfy our shards' preferences.
Which is... a pretty difficult proposition, actually. Because many of these shards do not actually codify preferences, and certainly not universal ones. Some of our goals might be defined over specific environments/segments of the world-model, in ways that are difficult to translate/generalize to other environments. Some others might not be "goals" at all, just if-then activation patterns. To do our job, we essentially have to compile our own values, routing around various type errors.
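Here's one way to picture those "type errors" (entirely my own toy formalization, not anything from the Shard Theory literature; the shard representations and the `compile_to_utility_term` helper are invented). The planner must coerce heterogeneous shard types into a common utility-term type, and for a bare if-then reflex there are several defensible coercions:

```python
# Two shard "types": a context-bound reflex, and a world-model predicate.
reflex_shard = {"kind": "reflex", "trigger": "flashlight", "action": "close eyes"}
goal_shard = {"kind": "goal",
              "predicate": lambda world: world.get("loved_ones_happy", False)}

def compile_to_utility_term(shard):
    if shard["kind"] == "goal":
        # Already close to the right type: reward worlds satisfying the predicate.
        return lambda world: 1.0 if shard["predicate"](world) else 0.0
    if shard["kind"] == "reflex":
        # A reflex has no native notion of a "preferred world", so the
        # planner must pick one of several defensible generalizations.
        # Here: reward worlds where the trigger never occurs. Another
        # planner might instead reward worlds where the action stays available.
        return lambda world: 1.0 if shard["trigger"] not in world.get("events", []) else 0.0
    raise TypeError(f"don't know how to compile shard kind {shard['kind']!r}")

u_reflex = compile_to_utility_term(reflex_shard)
print(u_reflex({"events": ["flashlight"]}))  # 0.0: this compilation penalizes the trigger
```

The arbitrary choice inside the reflex branch is the point: a different planner could just as defensibly compile the same reflex into a different utility term, which is one source of there being multiple "valid" compilations.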
To illustrate what I mean, a few examples:
Hence all of our problems with value reflection: there are often multiple "valid" ways to bootstrap any specific shard to value status.
Hence the various pitfalls we could fall into. This interpretation and generalization is carried out by a deliberative, logical process. And that process can be mistaken, can be fooled by logical or logical-sounding arguments. Hence our propensity to adopt flawed-but-neat ideologies, or to become mistaken about what we really want.
Hence our ability to self-modify. The planner can become convinced (either rightly or not) that certain shards need to be created or destroyed for the good of the whole shard economy, then implement plans that do so (build/destroy good/bad habits, remove values that contradict others). At the same time, we also have preferences for retaining our ability to self-modify — both because we're not sure our current model of our desires is accurate, and maybe because we have a shard-implemented preference for mutability.
Thus: We are approximations of idealized utility-maximizers over an inchoate mess of a thousand shards of desire.
4. Nice Things About This Framework
5. Closing Thoughts
This framework, in conjunction with my previous toy model, essentially dissolves my main confusions about goal-directedness, human values, and development thereof.
The question to tackle, now, seems to be goal translation/value compilation. How do we adapt values/goals defined over one environment for another? How do we bootstrap things that do not have the type "value" to the status of a value? What algorithms, in general, exist for doing this? How many possible "solutions" (final value distributions) do such procedures tend to have, and how can the space of solutions be constrained?
In a way, this is just a reformulation of the ontology-shift problem, but this framing seems to make it easier to reason about. And easier to investigate.
Acknowledgements
Thanks to TurnTrout, Charles Foster, and Quintin Pope for productive discussions and critique.
Or, if we've experienced an ontology break serious enough to invalidate all of our constituent shards, we'd still prefer futures with the potential for new shards to form, adapted to the new world-model.