[This post is a slightly edited tangent from my dialogue with John Wentworth here. I think the point is sufficiently interesting and important that I wanted to make it a top-level post, rather than leave it buried in that dialogue on mostly another topic.]

The conventional story is that natural selection failed extremely badly at aligning humans. One fact about humans that casts doubt on this story is that natural selection got the concept of "social status" into us, and it seems to have done a shockingly good job of aligning (many) humans to that concept.

Evolution somehow gave humans some kind of inductive bias (or something) such that our brains are reliably able to learn what it is to be "high status", even though the concrete markers for status are as varied as human cultures. 

And further, it successfully hooked up the motivation and planning systems to that "status" concept. Modern humans not only take actions that play for status in their local social environment, they sometimes successfully navigate (multi-decade) career trajectories and life paths, completely foreign to the ancestral environment, in order to become prestigious by the standards of the local culture.

And this is one of the major drivers of human behavior! As Robin Hanson argues, a huge portion of our activity is motivated by status-seeking and status-affiliation.

This is really impressive to me. It seems like natural selection didn't do so hot at aligning humans to inclusive genetic fitness. But it did kind of shockingly well aligning humans to the goal of seeking, even maximizing, status, all things considered.[1]

This seems like good news about alignment. The common story that condoms prove that evolution basically failed at alignment—that as soon as we developed the technological capability to route around evolution's "goal" of maximizing the frequency of our alleles in the next generation, attaining only the proxy measure of sex, we did exactly that—doesn't seem to apply to our status drive.

It looks to me like "status" generalized really well across the distributional shift of technological civilization. Humans still recognize it and optimize for it, regardless of whether the status markers are money or technical acumen or h-index or military success.[2]

This makes me way less confident about the standard "evolution failed at alignment" story.

  1. ^

    I guess that we can infer from this that having an intuitive "status" concept was much more strongly instrumental for attaining high inclusive genetic fitness in the ancestral environment than having an intuitive concept of "inclusive genetic fitness" itself. A human-level status-seeking agent with a sex drive does better, by the standard of inclusive genetic fitness, than a human-level IGF maximizer.

    The other hypothesis, of course, is that the "status" concept was easier to encode in a human than the "inclusive genetic fitness" concept, for some reason.

  2. ^

    I'm interested in whether others think that this is an illusion: that it only looks like the status target generalized because I'm drawing the target around where the arrow landed. That is, what we think of as "social status" is exactly those parts of social status in the ancestral environment that did generalize across cultures.

Comments

Some possible examples of misgeneralization of status:

  1. arguing with people on Internet forums
  2. becoming really good at some obscure hobby
  3. playing the hero in a computer RPG (role-playing game)

Not sure how much I believe this myself, but Jacob Cannell has an interesting take that social status isn't a "base drive" either, but is basically a proxy for "empowerment": influence over future states of the world. If that's true, it's perhaps not so surprising that we're still well-aligned, since "empowerment" is in some sense always being selected for by reality.
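(For concreteness: "empowerment" has a standard formalization in the RL literature, which I believe is due to Klyubin, Polani & Nehaniv, as the channel capacity between an agent's actions and its future state. Roughly:

$$\mathcal{E}(s_t) \;=\; \max_{p(a_t^n)} \, I\!\left(A_t^n;\, S_{t+n} \mid s_t\right)$$

that is, the maximum, over distributions of $n$-step action sequences $A_t^n$, of the mutual information between the chosen actions and the state $S_{t+n}$ the agent ends up in. More distinct, reliably reachable futures means higher empowerment, which is why it's a natural candidate for a convergently useful proxy.)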

I want to briefly note my disagreement: I think the genome specifically builds what might be called an innate status drive into the brain (stronger in some people than others), in addition to within-lifetime learning. See my discussions here and here, plus this comment thread, and hopefully better discussion in future posts.

A great counterpoint! 

Yeah, I wrote some years ago about how status isn't a special feature that humans attribute to each other for contingent social-psychology reasons, but rather falls out very naturally as an instrumentally convergent resource.

Yeah, when I consider that, it does undercut the claim that evolution shaped us to optimize for status. It shaped us to want things, and also to find strategies to get them.

I disagree with “natural selection got the concept of "social status" into us” or that status-seeking behavior is tied to “having an intuitive "status" concept”.

For example, if Bob wants to be a movie star, then from the outside you and I can say that Bob is status-seeking, but it probably doesn’t feel like that to Bob; in fact Bob might not know what the word “status” means, and Bob might be totally oblivious to the existence of any connection between his desire to be a movie star and Alice’s desire to be a classical musician and Carol’s desire to eat at the cool kids table in middle school.

I think “status seeking” is a mish-mosh of a bunch of different things but I think an important one is very roughly “it’s intrinsically motivating to believe that other people like me”. (More discussion in §2.2.2 & §2.6.1 here and hopefully more in future posts.) I think it’s possible for the genome to build “it’s intrinsically motivating to believe that other people like me” into the brain whereas it would not be analogously possible for the genome to build “it’s intrinsically motivating to have a high inclusive genetic fitness” into the brain. There are many reasons that the latter is not realistic, not least of which is that inclusive genetic fitness is only observable in hindsight, after you’re dead.

> For example, if Bob wants to be a movie star, then from the outside you and I can say that Bob is status-seeking, but it probably doesn’t feel like that to Bob; in fact Bob might not know what the word “status” means, and Bob might be totally oblivious to the existence of any connection between his desire to be a movie star and Alice’s desire to be a classical musician and Carol’s desire to eat at the cool kids table in middle school.

That seems true to me? I don't mean that humans become aligned with their explicit verbal concept of status. I mean that (many) humans are aligned with the intuitive concept that they somehow learn over the course of development.
 

> I think it’s possible for the genome to build “it’s intrinsically motivating to believe that other people like me” into the brain whereas it would not be analogously possible for the genome to build “it’s intrinsically motivating to have a high inclusive genetic fitness” into the brain. There are many reasons that the latter is not realistic, not least of which is that inclusive genetic fitness is only observable in hindsight, after you’re dead.

Makes sense!

 

> I don't mean that humans become aligned with their explicit verbal concept of status. I mean that (many) humans are aligned with the intuitive concept that they somehow learn over the course of development.

How do you know that there is any intuitive concept there? For example, if Bob wants to sit at the cool kid’s table at lunch and Bob dreams of being a movie star at dinner, who’s to say that there is a single concept in Bob’s brain, verbalized or not, active during both those events and tying them together? Why can’t it simply be the case that Bob feels motivated to do one thing, and then later on Bob feels motivated to do the other thing?

Well, there's convergent structure in the observed behavior. There's a target that seems pretty robust to a bunch of different kinds of perturbations and initial conditions. 

It's possible that that's implanted by a kludge of a bunch of different narrow adaptations. That's the null hypothesis, even.

But the fact that (many) people will steer systematically toward opportunities for high prestige, even when what that looks like is extremely varied, seems to me like evidence for an implicit concept that's hooked up to some planning machinery, rather than (only) a collection of adaptations that tend to produce this kind of behavior?

I think you’re responding to something different than what I was saying.

Again, let’s say Bob wants to sit at the cool kid’s table at lunch, and Bob dreams of being a movie star at dinner. Bob feels motivated to do one thing, and then later on Bob feels motivated to do the other thing. Both are still clearly goal-directed behaviors: At lunchtime, Bob’s “planning machinery” is pointed towards “sitting at the cool kid’s table”, and at dinnertime, Bob’s “planning machinery” is pointed towards “being a movie star”. Neither of these things can be accomplished by unthinking habits and reactions, obviously.

I think there’s a deep-seated system in the brainstem (or hypothalamus). When Bob’s world-model (cortex) is imagining a future where he is sitting at the cool kid’s table, then this brainstem system flags that future as “desirable”. Then later on, when Bob’s world-model (cortex) is imagining a future where he is a movie star, then this brainstem system flags that future as “desirable”. But from the perspective of Bob’s world-model / cortex / conscious awareness (both verbalized and not), there does not have to be any concept that makes a connection between “sit at the cool kid’s table” and “be a movie star”. Right?

By analogy, if Caveman Oog feels motivated to eat meat sometimes, and to eat vegetables other times, then it might or might not be the case that Oog has a single concept akin to the English word “eating” that encompasses both eating-meat and eating-vegetables. Maybe in his culture, those are thought of as two totally different activities—the way we think of eating versus dancing. It’s not like there’s no overlap between eating and dancing—your heart is beating in both cases, it’s usually-but-not-always a group activity in both cases, it alleviates boredom in both cases—but there isn’t any concept in English unifying them. Likewise, if you asked Oog about eating-meat versus eating-vegetables, he would say “huh, never thought about that, but yeah sure, I guess they do have some things in common, like both involve putting stuff into one’s mouth and moving the jaw”. I’m not saying that this Oog thought experiment is likely, but it’s possible, right? And that illustrates the fact that coherently-and-systematically-planning-to-eat does not rely on having a concept of “eating”, whether verbalized or not.

> This seems like good news about alignment.

To me it sounds like alignment will do a good job of aligning AIs to money. Which might be ok in the short run, but bad in the longer run.

Yes, and yes.

It seems like evolution did not "try" to have humans aligned to status. It might have been a proxy for inclusive genetic fitness, but if so, I would not say that evolution "succeeded" at aligning humans. My guess is that it's not a great proxy for inclusive genetic fitness in the modern environment (my guess is it's weakly correlated with reproductive success, but clearly not as strongly as the relative importance that humans assign to it would indicate if it were a good proxy for inclusive genetic fitness).

Of course, my guess is that after the fact, for any system that has undergone some level of self-reflection and was put under selection that causes it to want coherent things, you will be able to identify some patterns in its goals. The difficult part in aligning AIs is being able to choose what those patterns are, not being able to cohere some patterns at the end of it. My guess is that with any AI system, if we were to survive and got to observe it as it made its way to coherence, we would be able to find some robust patterns in its goals (in the case of LLMs, my guess is something related to predicting text, but who knows), but that doesn't give me much solace about the AI treating me well, or sharing my goals.

A super relevant point. If we try to align our AIs with something, and they end up robustly aligned with some other proxy thing, we definitely didn't succeed. 

But, it's still impressive to me that evolution hooked up general planning capabilities to a (learned) abstract concept, at all. 

Like, there's this abstract concept, which varies a lot in its particulars from environment to environment, and which the brain has to learn to detect apart from those particulars. Somehow the genome is able to construct the brain such that the motivation circuitry can pick out that abstract concept, after it is learned (or as it is being learned), and use it as a major criterion of the planning and decision machinery. And the end result is that the organism as a whole ends up not that far from an [abstract concept]-maximizer.
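(As a purely illustrative toy sketch, not a claim about actual neural circuitry, and with every name below made up, the architecture I'm gesturing at is something like:

```python
# Toy sketch: a *learned* concept detector wired up as the scoring
# function of a general-purpose planner. Purely illustrative.
from typing import Callable, List

WorldState = dict  # stand-in for a rich world-model state


def learn_status_detector(experience: List[WorldState]) -> Callable[[WorldState], float]:
    """Learned within-lifetime: maps a (possibly novel) imagined world state
    to a scalar "status" score, whatever the local culture's markers are."""
    def detector(state: WorldState) -> float:
        # A real brain would learn this mapping; here it's a dummy lookup.
        return float(state.get("prestige_markers", 0))
    return detector


def plan(candidate_plans: List[List[WorldState]],
         score: Callable[[WorldState], float]) -> List[WorldState]:
    """General planning machinery: pick the plan whose imagined end state
    the motivation circuitry scores highest."""
    return max(candidate_plans, key=lambda p: score(p[-1]))


# The genome's trick, on this picture: hook the planner's evaluation
# up to the learned detector, whatever concept it ends up tracking.
status_score = learn_status_detector(experience=[])
best = plan(
    candidate_plans=[
        [{"prestige_markers": 2}],  # e.g., mastery of an obscure hobby
        [{"prestige_markers": 9}],  # e.g., a multi-decade career path
    ],
    score=status_score,
)
```

The point of the sketch is just the wiring: the scoring function is learned, not hardcoded, and yet the organism ends up approximately maximizing it.)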

This is a lot more than I might expect evolution to be able to pull off, if I thought that our motivations were a hodge-podge of adaptations that cohere (as much as they do) into godshatter.

My point is NOT that evolution killed it and alignment is therefore easy. My point is that evolution got a lot further than I would have guessed was possible.
 

Status is a way to have power. Aligning an agent to be power-maximizing is qualitatively different from what we want from AI which we want to align to care about our own ends.

If the agent had no power whatsoever to affect the world, then it wouldn’t matter if it cared or not.

So what we really want is for it to have a sufficient amount of power, but not more than some threshold that would prove too frightening.

Who gets to decide this threshold?

An AGI can kill you even if it's not beyond what you consider to be "too frightening".

The grading isn't on a scale. 

The threshold still has to be greater than zero power for its ‘care’ to matter one way or the other. And the risk that you mention needs to be accepted as part of the package, so to speak.

So who gets to decide where to place it above zero?

Seems like the main difference is that you're "counting up" with status and "counting down" with genetic fitness.

There's partial overlap between people's reproductive interests and their motivations, and you and others have emphasized places where there's a mismatch, but there are also (for example) plenty of people who plan their lives around having & raising kids. 

There's partial overlap between status and people's motivations, and this post emphasizes places where they match up, but there are also (for example) plenty of people who put tons of effort into leveling up their videogame characters, or affiliating-at-a-distance with Taylor Swift or LeBron James, with minimal real-world benefit to themselves.

And it's easier to count up lots of things as status-related if you're using a vague concept of status which can encompass all sorts of status-related behaviors, including (e.g.) both status-seeking and status-affiliation. "Inclusive genetic fitness" is a nice precise concept so it can be clear when individuals fail to aim for it even when acting on adaptations that are directly involved in reproduction & raising offspring.
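(For reference, and roughly stated: an individual's inclusive genetic fitness is standardly their own reproductive success plus the reproductive success of relatives, weighted by relatedness, and Hamilton's rule says a costly helping trait is favored when rB > C, with r the relatedness, B the benefit to the relative, and C the cost to the actor. That precision is what makes failures to aim at it easy to spot.)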

Why do you highlight status among a bazillion other things that generalized too, like romantic love, curiosity, or altruism?

…and eating, and breastfeeding…

Note that this doesn't undermine the post, because its thesis only gets stronger if we assume that more alignment attempts, like romantic love or altruism, generalized; that could well imply that control or alignment actually generalizes really easily, even when the intelligence of the aligner is way less than that of the alignee.

This suggests that scalable oversight is either a non-problem, or a problem only at ridiculous levels of disparity, and suggests that alignment does generalize quite far.

This, as well as my belief that current alignment designers have far more tools in their alignment toolkit than evolution had, makes me extremely optimistic that alignment is likely to be solved before dangerous AI.

Those are motivations but they don't (mostly) have the type signature of "goals" but rather the type signature of "drives".

I pursue interesting stuff because I'm curious. That doesn't require me to even have a concept of curiosity—it could in principle be steering me without my awareness. My planning process might use curiosity, but it isn't aligned with curiosity, in the sense that we don't (usually) make plans that maximize our curiosity. We just do what's interesting.

In contrast, social status is a concept that humans learn, and it does look like the planning process is aligned with the status concept, in that (some) humans habitually make plans that are relatively well described as status maximizing. 

Or, another way of saying it: our status motivations are not straightforward adaptation-execution. Status-seeking recruits our general intelligence in service of this concept, in much the way that we would want an AGI to be aligned with a concept like the Good or corrigibility.

Romantic love, again, people act on (including by using their general intelligence), but their planning process is not in general aligned with the maximization of romantic love. (Indeed, I'm editorializing about human nature here, but it looks to me like romantic love is mostly a strategy for getting other goals.)

Altruism: it's debatable whether most instances of maximizing altruistic impact are better described as status maximization. Regardless, this is an overriding strategic goal, one that recruits general intelligence, for only a very small fraction of humans.

I don't think that everybody has the built-in drive to seek "high social status", as defined by the culture they are born into or any specific aspect of it that can be made to seem attractive. I know people who just think it's an annoying waste of time. Or, like myself, who spent half my life chasing it, then found inner empowerment and came to see the proxy of high status as a waste of time, and quit chasing it.

Maybe related: I do think we all generally tend to seek "signalling", and in some cases spend great energy doing it. I admit I sometimes do, but it's not signalling high status; it's just signalling chill and contentedness. I have observed some kind of signalling in pretty much every adult I have witnessed, though it's hard to say for sure; it's more my assumption about their deepest motivation. The strength of the drive isn't always strong, or it's just very temporary for some people. There are likely much stronger drivers (e.g., avoiding obvious suffering). Signalling perhaps helps us attract others who align with us and form "tribes", so it can be worth the energy.

“[optimization process] did kind of shockingly well aligning humans to [a random goal that the optimization process wasn’t aiming for (and that’s not reproducible with a higher bandwidth optimization such as gradient descent over a neural network’s parameters)]”

Nope, if your optimization process is able to crystallize some goals into an agent, it’s not some surprising success, unless you picked these goals. If an agent starts to want paperclips in a coherent way and then every training step makes it even better at wanting and pursuing paperclips, your training process isn’t “surprisingly successful” at aligning the agent with making paperclips.

> This makes me way less confident about the standard "evolution failed at alignment" story.

If people become more optimistic because they see some goals in an agent, and say the optimization process was able to successfully optimize for those goals, but they don't have evidence that the optimization process tried to target the goals they observe, they're just clearly doing something wrong.

Evolutionary physiology is a thing! It is simply invalid to say “[a physiological property of humans that is the result of evolution] existing in humans now is a surprising success of evolution at aligning humans”.

Maybe our culture fits our status-seeking surprisingly well because our culture was designed around it.

We design institutions to channel and utilize our status-seeking instincts. We put people in status conscious groups like schools, platoons, or companies. There we have ceremonies and titles that draw our attention to status.

And this works! Ask yourself: is it more effective to educate a child individually or in a group of peers? The latter. Is it easier to lead a solitary soldier or a whole squad? The latter. Do people seek a promotion or a pay rise? Both, probably. The fact is that people are easier to guide when in large groups, and easier to motivate with status symbols.

From this perspective, our culture and inclination for seeking status have developed in tandem, making it challenging to determine which influences the other more. However, it appears that culture progresses more rapidly than genes, suggesting that culture conforms to our genes, rather than the reverse.

Another perspective: sometimes our status-seeking is nonfunctional and therefore nonaligned; we waste a lot of effort on it. People will compete for high-status professions like musician, streamer, or celebrity, and most will fail, which makes it seem like an unwise investment of time. This seems misaligned, as it's not adaptive.

It seems that a huge part of "human behaviour is explained by status seeking" is just post hoc proclaiming that whatever humans do is status seeking.

Suppose you want to predict whether a given man will go hang out with friends or work more on a project. How does the idea of status-seeking help? When we already know that the human chose friends, we say: yes, of course, he gets more status around his friend group by spending more time with them, improving their bonds, and having good friends is a marker of status in its own right. Likewise, when we know that the man chose work, we can say that this is behaviour that leads toward promotion and more money and influence inside the company, which is a marker of high status. But when we want to predict beforehand... I don't think it really helps.

The concept of status helps us predict that any given person is likely to do one of the relatively few things that are likely to increase their status, and not one of the many more things that are neutral or likely to decrease status, even if it can't by itself tell us exactly which status-raising thing they would do. Seems plenty useful to me.

How are you telling the difference between "evolution aligned humans to this thing that generalized really well across the distributional shift of technological civilization" vs. "evolution aligned humans to this thing, which then was distorted / replaced / cut down / added to by the distributional shift of technological civilization"?

Eye-balling it? I'm hoping commenters will help me distinguish between these cases, hence my second footnote.

Agree. This connects to why I think that the standard argument for evolutionary misalignment is wrong: it's meaningless to say that evolution has failed to align humans with inclusive fitness, because fitness is not any one constant thing. Rather, what evolution can do is to align humans with drives that in specific circumstances promote fitness. And if we look at how well the drives we've actually been given generalize, we find that they have largely continued to generalize quite well, implying that while there's likely to still be a left turn, it may very well be much milder than is commonly implied.


So humans are "aligned" if humans have any kind of values? That's not how alignment is usually used.

The only metric natural selection is “optimizing” for is inclusive genetic fitness. It did not “try” to align humans with social status, and in many cases people care about social status to the detriment of their inclusive genetic fitness. This is a failure of alignment, not a success.

True and important. I don't mean to imply otherwise. Evolution failed at its "alignment goal".

If (as I'm positing here) it successfully constructed humans to be aligned to some other concept that's not the alignment goal, and that concept, and that alignment, generalized well, that doesn't mean that evolution failed any less hard.

But it does seem notable if that's what happened! Because it's some evidence about alignment generalization.

LessWrong obsesses about status to an incredibly unhealthy degree.

edit: removed sarcastic meme format.

Is that intended to mean “lesswrong people are obsessed with their own and each other’s status”, or “lesswrong people are obsessed with the phenomenon of human status-seeking”? (or something else?)

The former, by way of a distorted view of the latter. I don't think status is a single variable, and I think when you split it up into its more natural components - friendship, caring, trust, etc. - it has a much more human ring to it than "status" and "value" do, which strike me as ruthless, sociopathic-businesspeople perspectives. It is true that status is a moderately predictive oversimplification, but I claim that that is because it is oversimplifying components that are correlated in the circumstances where status appears to work predictively. Command hierarchy is itself a bug to fix, anyhow. Differences in levels of friendship, caring, trust, respect, etc. should not cause people to form a deference tree; healthy social networks are far more peer-to-peer than ones that form around communities obsessed with the concept of "status".

I was asking because I published 14,000 words on the phenomenon of human status-seeking last week. :-P I agree that there have been many oversimplified accounts of how status works. I hope mine is not one of them. I agree that “status is not a single variable” and that “deference tree” accounts are misleading. (I think the popular lesswrong / Robin Hanson view is that status is two variables rather than one, but I think that’s still oversimplified.)

I don’t think the way that “lesswrong community members” actually relate to each other is “ruthless sociopathic businesspeople … command hierarchy … deference tree” kind of stuff. I mean, there’s more-than-zero of that, but not much, and I think less of it in lesswrong than in most groups that I’ve experienced—I’m thinking of places I’ve worked, college clubs, friend groups, etc. Hmm, oh here’s an exception, “the group of frequent Wikipedia physics article editors from 2005-2018” was noticeably better than lesswrong on that axis, I think. I imagine that different people have different experiences of the “lesswrong community” though. Maybe I have subconsciously learned to engage with some parts of the community more than others.