A bunch of my response to shard theory is a generalization of how niceness is unnatural. In a similar fashion, the other “shards” that the shard theory folk want to learn are unnatural too.
That said, I'll spend a few extra words responding to the admirably-concrete diamond maximizer proposal that TurnTrout recently published, on the theory that briefly gesturing at my beliefs is better than saying nothing.
I’ll be focusing on the diamond maximizer plan, though this criticism can be generalized and applied more broadly to shard theory.
- The first “problem” with this plan is that you don't get an AGI this way. You get an unintelligent robot that steers towards diamonds. If you keep trying to have the training be about diamonds, it never particularly learns to think. When you compromise and start putting it in environments where it needs to be able to think to succeed, then your new reward-signals end up promoting all sorts of internal goals that aren't particularly about diamond, but are instead about understanding the world and/or making efficient use of internal memory and/or suchlike.
- Separately, insofar as you were able to get some sort of internalized diamond-ish goal, if you're not really careful then you end up getting lots of subgoals such as ones about glittering things, and stones cut in stylized ways, and proximity to diamond rather than presence of diamond, and so on and so forth.
- Furthermore, once you get it to be smart, all of those little correlates-of-training-objectives that it latched onto in order to have a gradient up to general intelligence, blow the whole plan sky-high once it starts to reflect.
What the AI's shards become under reflection is very sensitive to the ways it resolves internal conflicts. For instance, in humans, many of our values trigger only in a narrow range of situations (e.g., people care about people enough that they probably can't psychologically murder a hundred thousand people in a row, but they can still drop a nuke), and whether we resolve that as "I should care about people even if they're not right in front of me" or "I shouldn't care about people any more than I would if the scenario was abstracted" depends quite a bit on the ways that reflection resolves inconsistencies.
Or consider the conflict "I really enjoy dunking on the outgroup (but have some niggling sense of unease about this)" — we can't conclude from the fact that the enjoyment of dunking is loud, whereas the niggling doubt is quiet, that the dunking-on-the-outgroup value will be the one left standing after reflection.
As far as I can tell, the "reflection" section of TurnTrout’s essay says ~nothing that addresses this, and amounts to "the agent will become able to tell that it has shards". OK, sure, it has shards, but only some of them are diamond-related, and many others are cognition-related or suchlike. I don't see any argument that reflection will result in the AI settling at "maximize diamond" in-particular.
Finally, I'll note that the diamond maximization problem is not in fact the problem "build an AI that makes a little diamond", nor even "build an AI that probably makes a decent amount of diamond, while also spending lots of other resources on lots of other stuff" (although the latter is more progress than the former). The diamond maximization problem (as originally posed by MIRI folk) is a challenge of building an AI that definitely optimizes for a particular simple thing, on the theory that if we knew how to do that (in unrealistically simplified models, allowing for implausible amounts of (hyper)computation) then we would have learned something significant about how to point cognition at targets in general.
TurnTrout’s proposal seems to me to be basically "train it around diamonds, do some reward-shaping, and hope that at least some care-about-diamonds makes it across the gap". I doubt this works (because the optimum of the shattered correlates of the training objectives that it gets is likely to involve tiling the universe with something that isn't actually diamond, even if you're lucky enough that it got a diamond-shard at all, which is dubious), but even if it works a little, it doesn't seem to me to be teaching us any of the insights that would be possessed by someone who knew how to robustly aim an idealized unbounded (or even hypercomputing) cognitive system in theory.
I appreciate you writing your quick thoughts on this. I have a few primary reactions, and then I'll detail specific reactions.
Hm, doesn't it need to think in the curriculum I described in the OP?
For further detail, take an arbitrary task with a high skill ceiling and a legible end condition, give it some reward shaping and use self-play if appropriate, and put a diamond at the end and give the agent reward. I agree that even in successful stories, the agent also develops non-diamond shards.
Here's a consideration for why training might produce an AGI, which I realized after writing the story. Given relevant features, it's often trivial for even linear models to outperform experts (see Statistical Prediction Rules Out-Perform Expert Human Judgments). What I remember to be a common hypothesis: Human experts are often good at finding features to pay attention to (e.g. patient weight) but bad at setting regression coefficients to come to a decision.
Analogously, consider an SSL+IL initialization in which the AI has imitatively learned sophisticated subroutines for perception, prediction, and action, such that the AI can imitate human-level performance on the supervised training distribution (eg navigating mazes). Then PG-style RL finetuning might rearrange and reweight which subroutines to use when, efficiently finding a better subroutine arrangement for decision-making in a range of situations, and thereby doing better than the human expert demonstrators.
(Yes, this is sample inefficient, and I didn't particularly optimize the story for sample efficiency. I focused on telling any story at all which has the desired alignment outcome.)
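The statistical-prediction-rule point above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two synthetic features, the "true" coefficients, and the hypothetical expert who uses the right features but mis-set weights. It is not drawn from the cited work, and it says nothing about whether the analogy to RL finetuning holds.

```python
import random

# Hypothetical ground truth: outcome depends linearly on two features plus noise.
TRUE_B1, TRUE_B2 = 2.0, 0.5
# Hypothetical expert: attends to the right features but weights them equally.
EXPERT_WEIGHTS = (1.0, 1.0)


def make_rows(rng, n):
    """Generate n synthetic cases as (x1, x2, y) tuples."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = rng.gauss(0, 1)
        y = TRUE_B1 * x1 + TRUE_B2 * x2 + rng.gauss(0, 1)
        rows.append((x1, x2, y))
    return rows


def fit_two_feature_ols(rows):
    """Least-squares coefficients (no intercept) via the 2x2 normal equations."""
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    s1y = sum(x1 * y for x1, _, y in rows)
    s2y = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def mse(rows, b1, b2):
    """Mean squared error of the linear rule b1*x1 + b2*x2 on the given cases."""
    return sum((y - (b1 * x1 + b2 * x2)) ** 2 for x1, x2, y in rows) / len(rows)


rng = random.Random(0)
train = make_rows(rng, 1000)
test = make_rows(rng, 1000)

fitted = fit_two_feature_ols(train)
expert_mse = mse(test, *EXPERT_WEIGHTS)
fitted_mse = mse(test, *fitted)
print(f"fitted coefficients: {fitted[0]:.2f}, {fitted[1]:.2f}")
print(f"held-out MSE, expert weights: {expert_mse:.2f}; fitted weights: {fitted_mse:.2f}")
```

In this toy setup the fitted rule does better on held-out cases purely because its coefficients are set from data; the feature choice is identical for both.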
Why "rather than" instead of "in addition to"? Are you just stating your belief here, or did you mean to argue for it? Maybe you're saying "It's hard to get the diamond shard to form properly", which I agree with and it's a primary way I expect the story to go wrong. I think that relatively simple interventions will plausibly solve this problem, though, and so consider this more of a research question than a fatal flaw in the training story template.
If I read you properly, that's not the relevant section. The relevant sections are the next two: The agent prevents value drift and The values handshake. EG I said:
I think there's a very straightforward case here. In the relevant context, suppose the agent is primarily making decisions on the basis of whether they lead to more or fewer diamonds. The agent considers adopting a reflectively stable utility function which doesn't produce diamonds. The agent doesn't choose this plan because it doesn't lead to diamonds.
I agree that there are ways this can go wrong, some of which you highlight. But the a priori argument makes me expect that, all else equal and conditional on a strong diamond shard at time of values handshake, the agent will probably equilibrate to making lots of diamonds.
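As a toy rendering of the argument above (not a claim about how any real agent or training setup works), one can model shards as weighted bidders over candidate plans. The plan names, features, weights, and the "sum of bids" decision rule are all hypothetical choices made for this sketch.

```python
# Two hypothetical candidate plans the agent considers during reflection.
PLANS = [
    {"name": "keep diamond-producing values",
     "expected_diamonds": 100, "expected_resources": 50},
    {"name": "adopt non-diamond utility function",
     "expected_diamonds": 0, "expected_resources": 80},
]


def make_shard(feature, weight):
    """A shard modeled as a bidder that scores plans by one predicted feature."""
    def bid(plan):
        return weight * plan[feature]
    return bid


def choose_plan(plans, shards):
    """Assumed decision rule: pick the plan with the highest total bid."""
    return max(plans, key=lambda plan: sum(shard(plan) for shard in shards))


# Strong diamond shard relative to a resource-seeking shard.
strong_diamond = [make_shard("expected_diamonds", 1.0),
                  make_shard("expected_resources", 0.2)]
# Weak diamond shard: same plans, but the diamond bid is easily outweighed.
weak_diamond = [make_shard("expected_diamonds", 0.05),
                make_shard("expected_resources", 0.2)]

strong_choice = choose_plan(PLANS, strong_diamond)["name"]
weak_choice = choose_plan(PLANS, weak_diamond)["name"]
print("strong diamond shard ->", strong_choice)
print("weak diamond shard   ->", weak_choice)
```

The sketch only shows that the conclusion is conditional: with a strong diamond shard the non-diamond plan is bid down, and with a weak one it is not. It does not address which case training would actually produce.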
I did not claim to be solving the diamond maximization problem, but maybe you wanted to add your own take here? As I wrote in the original post, I think "maximize diamonds" is a seriously mistaken subproblem choice:
I think that "get an agent which reflectively equilibrates to optimizing a single commonly considered quantity like 'diamonds'" is probably extremely hard and anti-natural. I think MIRI should not have chosen this as a subproblem.
I also think that relaxing the problem by assuming hypercomputation encourages thinking about argmax search, which I think is a subtle but serious trap. For specific generalizable reasons which I'll soon post about, this design pattern seems basically impossible to align compared to shard agents.
Really? That seems wrong. Suppose that at the time of the values handshake, the agent has a strong diamond-shard. I understand you to predict that the agent adopts a reflective utility function which, when optimized, won't lead to actual diamond. Why? Why wouldn't the diamond-shard just bid this plan down, because it doesn't lead to actual diamond?
In addition to my "unbounded/hypercomputing is a red herring" response:
Someone who says "You can reliably solve computer vision tasks by doing deep learning" isn't telling you how to write superhumanly good features into the vision model, surpassing previous hand-designed expert attempts. They don't know how the SOTA deep vision models will work internally. And yet it's still good advice. It's still telling you something about how to train good vision models.
Similarly, if you're in a state of ignorance (lethality 19) about how to reliably point any cognitive system to any latent parts of reality, and someone proposes a plan which does plausibly (for specific reasons, not as a vague "it could work" hope) produce an AI which makes lots of real-world diamonds, then that seems like progress to me. (I'm fine agreeing to disagree here, I don't think it's productive to dispute how much credit I should get.)
I think it would make more sense to claim that niceness / other shards are "contingent" instead of "unnatural." If shard theory is correct, shards are literally natural in that they are found in nature as the predictable outcome of human value formation. Same for niceness.
You call shards "little correlates" and, previously, "ad-hoc internalized correlates." I don't know what you intend to communicate by this. The shards are, mechanistically speaking, contextually activated influences on the agent's decision-making. What information does "ad-hoc" or "little correlate" add to that picture? I'm currently guessing that it expresses your skepticism that shards can cohere into reflectively stable caring?
This is an interesting example. To me, the more relevant questions seem to be: How much evidence is "loudness" (e.g. if I really enjoy something which I do frequently, I sure am more likely to reflectively endorse it compared to if I didn't enjoy it, even though there are highly available counterexamples to this tendency), and how relevant is this for the diamond story?
EDIT: As I think I wrote in the OP, it's not enough for a shard to be strongly influencing decision-making in a given context. Especially for an anti-outgroup shard which is unendorsed (eg bids for outcomes which other reflectively aware shards bid against), this shard also seemingly has to be reflectively and broadly activated in order to be retained. So, yeah, if there's an anti-outgroup shard which gets "maneuvered around and removed" by other shards, sure, that can happen. My takeaway isn't "anything can get removed for hard-to-understand reasons", but rather "one particular way shards can get removed is that they directly conflict with other powerful shards."
I think a diamond-manufacturing subshard would resource-conflict (instrumental conflict, not terminal conflict) with eg a power-seeking subshard (manufacturing diamonds uses energy). Or even against a staple-manufacturing subshard (staples require materials and energy). But I expect the reflective utility function to reflect gains from intershard trade and specialization of different parts of the future resources towards the different decision-making influences (eg maybe one kind of comet is better specialized for making staples, and another kind for diamonds).
Or maybe not. Maybe it goes some other way. But this kind of conflict seems different from anticorrelated terminal value (eg anti-outgroup can impinge on nice-shards, altruism-shards, empathy...) across a shard power imbalance (nonreflective anti-outgroup vs reflective niceness shard).
And my point here isn't "I have now defused the general class of objection, checkmate!"... It's still a live and legit worry to me, but I don't view this phenomenon as incomprehensible, and I don't feel epistemically helpless here (not meaning to make claims about how you feel tbc).
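The "gains from intershard trade and specialization" idea mentioned above is the usual comparative-advantage arithmetic. A minimal sketch, with entirely made-up comet types, counts, and yields:

```python
# Hypothetical yields per comet: type_a is better for diamonds, type_b for staples.
YIELDS = {
    "type_a": {"diamonds": 10, "staples": 2},
    "type_b": {"diamonds": 2, "staples": 10},
}


def output(allocation):
    """Total goods produced, where allocation[comet_type][good] is a comet count."""
    totals = {"diamonds": 0, "staples": 0}
    for comet_type, uses in allocation.items():
        for good, count in uses.items():
            totals[good] += count * YIELDS[comet_type][good]
    return totals


# 100 comets of each type. Without trade, each shard takes half of every type.
even_split = {
    "type_a": {"diamonds": 50, "staples": 50},
    "type_b": {"diamonds": 50, "staples": 50},
}
# With trade, each comet type goes to the use it is better suited for.
specialized = {
    "type_a": {"diamonds": 100, "staples": 0},
    "type_b": {"diamonds": 0, "staples": 100},
}

even_totals = output(even_split)
specialized_totals = output(specialized)
print("even split: ", even_totals)
print("specialized:", specialized_totals)
```

Under these assumed numbers both the diamond-valuing and the staple-valuing influences end up with more of what they want after specializing, which is why a resource conflict of this kind looks different from a conflict between anticorrelated terminal values.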
(My take on the reflective stability part of this)
The reflective equilibrium of a shard theoretic agent isn’t a utility function weighted according to each of the shards, it’s a utility function that mostly cares about some extrapolation of the (one or very few) shard(s) that were most tied to the reflective cognition.
It feels like a ‘let’s do science’ or ‘powerseek’ shard would be a lot more privileged, because these shards will be tied to the internal planning structure that ends up doing reflection for the first time.
There’s a huge difference between “Whenever I see ice cream, I have the urge to eat it”, and “Eating ice cream is a fundamentally morally valuable atomic action”. The former roughly describes one of the shards that I have, and the latter is something that I don’t expect to see in my CEV. Similarly, I imagine that a bunch of the safety properties will look more like these urges because the shards will be relatively weak things that are bolted on to the main part of the cognition, not things that bid on the intelligent planning part. The non-reflectively endorsed shards will be seen as arbitrary code that is attached to the mind that the reflectively endorsed shards have to plan around (similar to how I see my “Whenever I see ice cream, I have the urge to eat it” shard).
In other words: there is convergent pressure for CEV-content integrity, but that does not mean that the current way of making decisions (e.g. shards) is close to the CEV optimum, and the shards will choose to self modify to become closer to their CEV.
I don't feel epistemically helpless here either, and would love a theory of which shards get preserved under reflection.
Also, in OP, you write:
I read a connotation here like "TurnTrout isn't proposing anything sufficiently new and impressive." To be clear, I don't think I'm proposing an awesome new alignment technique. I'm instead proposing that we don't need one.
Assuming shard theory is basically correct, this aspect of Nate's story can be resolved by viewing self-reflection as a context like any other. If you put the system in a training setup which causes it to self-reflect, and reward it when it comes to the 'more diamonds' conclusion, then this should cause it to reflectively want more diamonds.
The only question is, how much does training it to max diamonds in maze finding cause the 'max diamonds' shard to be activated while in the self-reflecting context?
Also, notably, it will definitely be doing a modicum of self-reflection during the normal course of training, as the shards which do self-reflection will steer the future towards locations which reinforce their weight.
Okay, so I think I'm understanding a little bit better now. What you're getting at is that self-generated true and useful philosophical insights become more and more likely to cause an AI to crash out of its domain of trained validity the smarter the AI gets, because philosophical insights are adversarial examples to many possible very smart beings, and therefore the order of philosophical insights can cause an insight to start propagating crash behavior through the rest of the network of nearby internal and external compute components, starting from an agentic subnetwork?
Ok, so perhaps TurnTrout would disagree with me here, but my plan for coming up with an AGI-that-makes-diamonds using shard theory would look more like this:
Create not one AI via this process, but a million. Each time, varying the parameters of the base instincts (the dynamic reward functions) which you have designed to try to get an AGI to care about diamond. Study the results in terms of how close each seems to get to being a 'true' diamond valuer. Then, extrapolate from these results, use your new-found knowledge to create a new batch of experiments. Examine and learn from these. Repeat this several times. Just for the sake of learning, try making it care about multiple things: diamonds, bananas, and chairs. Try using interpretability & editing tools to delete shards, or freeze some and train the others. The things you then end up learning about how to steer value systems of agents along the way turn out to be the true treasure all along. Then use this knowledge to actually try to build your diamond valuer.
The flaw I see in this plan is the question, "Can we successfully use these experiments to hill climb towards useful knowledge or would we just be fooling ourselves because even the seemingly 'better' agents would just be better liars?"
I think that then points at a dependency on reliable interpretability tools.
Worlds Where Iterative Design Fails
More generally, deceptive alignment is likely to bite, and TurnTrout seems to handwave it away. There are other problems, but this is why I'm unimpressed by his claims about shard theory.
It's possibly even worse than HCH, conditional on it being outer alignment at optimum.
I have the view that we need to build an archway of techniques to solve this problem. Each block in the arch is itself insufficient. You must have a scaffold in place while building the arch to keep the half-constructed edifice from falling. In my view that scaffold is the temporary patch of 'boxing'. The pieces of the arch which must be put together while the scaffold is in place: mechanistic interpretability, abstract interpretability, HCH, Shard theory experimentation leading to direct shard measurement and editing, replicating studying and learning from compassion circuits in the brain in the context of brain-like models, toy models of deceptive alignment, red teaming of model behavior under the influence of malign human actors, robustness / stability under antagonistic optimization pressure, the nature of the implicit priors of the machine learning techniques we use, etc.
I don't think any single technique can be guaranteed to get us there at this point. I think what is needed is more knowledge, more understanding. I think we need to get that through collecting empirical data. Lots of empirical data. And then thinking carefully about the data and coming up with hypotheses to explain it, and then testing those.
I don't think criticizing individual blocks of the arch for not already being the entire arch is particularly useful.
Yes, but TurnTrout seems to want to go from shard theory being useful to shard theory being the solution, which leaves me worried.
I disagree with John's post in a similar way to how Steven Byrnes disagrees in the comments. It's not the speed of takeoff that matters, it's our loss of control. If the takeoff happens very fast, but we have an automatic "turn it off if it gets too smart" system in place that successfully turns it off, and then we test it in a highly impaired mode (lowered intelligence/functionality, lowered speed) to learn about it... this is potentially a win not a loss.
As for John W's point 'getting what you measure', yes. That's the hard task interpretability must conquer. I think it is possible to hill climb towards getting better at this so long as you are in control and able to run many separate experiments.
Will humans stop having children as they get smarter and more powerful because they inadvertently gathered a bunch of utility function quirks like "curiosity"?
Will humans stop having children in the limit of intelligence and power, because we have all of these sub-shards like "make sure your children are safe", and "have lots of sex" instead of one big "spread your genes" one? Do they stop doing that when you introduce them to superstimulants via the internet or give them access to contraceptives that decouple sex from reproduction?
The reason human morality is contextual and self contradictory, and we have to resolve a bunch of internal conflicts at the limit of reflectivity, is because we weren't actually trained to care about other people, the subgoal if any was "maintain the trustworthiness indicators of the people we're most likely to be able to cooperate with". So your examples are very cheesy and not at all convincing.
Do humans decide to kill or sterilize their children at higher INT and WIS scores if you change some abstract metacognition parameters that affect how they resolve (deliberately engineered) inconsistencies?
Number of children in our world is negatively correlated with educational achievement and income, often in ways that look like serving other utility function quirks at the expense of children (as the ability to indulge those quirks with scarce effort improved with technology faster than the ability to indulge those more closely tied to children), e.g. consumption spending instead of children, sex with contraception, pets instead of babies. Climate/ecological or philosophical antinatalism is also more popular in the same regions and social circles. Philosophical support for abortion and medical procedures that increase happiness at the expense of sterilizing one's children also increases with education and in developed countries. Some humans misgeneralize their nurturing/anti-suffering impulses to favor universal sterilization or death of all living things including their own lineages and themselves.
Sub-replacement fertility is not 0 children, but it does trend to 0 descendants over multiple generations.
Many of these changes are partially mediated through breaking attachment to religions that conduce to fertility and have not been robust to modernity, or through new technological options for unbundling previously bundled features.
Human morality was optimized in a context of limited individual power, but that kind of concern can and does dominate societies because it contributes to collective action where CDT selfishness sits out, and drives attention to novel/indirect influence. Similarly an AI takeover can be dominated by whatever motivations contribute to collective action that drives the takeover in the first place, or generalizes to those novel situations best.
The party line of MIRI is not that a superintelligence, without extreme measures, would waste most of the universe's EV on frivolous nonsense. The party line is that there is a 99+% chance that an AI, even if trained specifically to care about humans, would not end up caring about humans at all, and instead turn the universe into uniform squiggles. That's the claim I find unsubstantiated by most concrete concerns they have, and which seems suspiciously disanalogous to the one natural example we have. 99% of people in first world countries are not forgoing pregnancy for educational attainment.
It'd of course still be extremely terrible, and maybe even more terrible, if what I think is going to happen happens! But it doesn't look like all matter becoming squiggles.
I wasn't arguing for "99+% chance that an AI, even if trained specifically to care about humans, would not end up caring about humans at all" just addressing the questions about humans in the limit of intelligence and power in the comment I replied to. It does seem to me that there is substantial chance that humans eventually do stop having human children in the limit of intelligence and power.
A uniform fertility below 2.1 means extinction, yes, but in no country is the fertility rate uniformly below 2.1. Instead, some humans decide they want lots of children despite the existence of contraception and educational opportunity, and others do not. It seems to me that a substantial proportion of humans would stop having children in the limit of intelligence and power. It also seems to me like a substantial number of humans continue (and would continue) to have such children as if they value it for its own sake.
This suggests that the problems Nate is highlighting, while real, are not sufficient to guarantee complete failure - even when the training process is not being designed with those problems in mind, and there are no attempts at iterated amplification whatsoever. This nuance is important because it affects how far we should think a naive SGD RL approach is from limited "1% success", and whether or not simple modifications are likely to greatly increase survival odds.
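The fertility arithmetic in this exchange (a uniform rate below 2.1 trends toward zero descendants, while a mixed population does not) can be sketched as follows. The group sizes and fertility rates are made up, and the model leans on a strong assumption that is labeled in the code: each group's descendants keep that group's fertility rate indefinitely.

```python
REPLACEMENT = 2.1  # replacement-level fertility figure used in the comment above


def project(groups, generations):
    """Project group sizes forward.

    groups maps name -> (initial_population, fertility_rate).
    ASSUMPTION: fertility preferences are perfectly inherited within a group,
    so each generation a group is scaled by fertility_rate / REPLACEMENT.
    """
    history = [{name: pop for name, (pop, _) in groups.items()}]
    for _ in range(generations):
        prev = history[-1]
        history.append({
            name: prev[name] * groups[name][1] / REPLACEMENT
            for name in groups
        })
    return history


# Hypothetical starting point: 90% low-fertility, 10% high-fertility.
groups = {"low": (900_000, 1.5), "high": (100_000, 3.0)}
history = project(groups, 10)

for generation in (0, 5, 10):
    snapshot = history[generation]
    print(generation, {name: round(pop) for name, pop in snapshot.items()})
```

Under these assumptions the low-fertility group shrinks to a few percent of its starting size within ten generations while the high-fertility group grows severalfold, so the aggregate outcome depends on the spread of values across the population and not only on the average. Whether such preferences are in fact stably transmitted is exactly the kind of question the sketch leaves open.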
Reflection isn't easy. Humans don't seem to get it right often or at all. It is not something that gets turned on at a certain optimization strength but that grows out of precursors. Optimization power can be directed via attention mechanisms to inner and outer processes and I guess it is possible to prevent or sufficiently inhibit reflection.