Comments

Twitter.

https://threadreaderapp.com/thread/1767710372306530562.html <= This link will take you to the thread, but NOT hosted on Twitter.

True and important. I don't mean to imply otherwise. Evolution failed at its "alignment goal". 

If (as I'm positing here) evolution successfully constructed humans to be aligned with some other concept, one that isn't the alignment goal, and that alignment generalized well, that doesn't mean that evolution failed any less hard.

But it does seem notable if that's what happened! Because it's some evidence about alignment generalization.

Well, there's convergent structure in the observed behavior. There's a target that seems pretty robust to a bunch of different kinds of perturbations and initial conditions. 

It's possible that that's implemented by a kludge of a bunch of different narrow adaptations. That's even the null hypothesis. 

But the fact that (many) people will steer systematically towards opportunities for high prestige, even when what that looks like is extremely varied, seems to me like evidence for an implicit concept that's hooked up to some planning machinery, rather than (only) a collection of adaptations that tend to produce this kind of behavior?

Eye-balling it? I'm hoping commenters will help me distinguish between these cases, hence my second footnote.

For example, if Bob wants to be a movie star, then from the outside you and I can say that Bob is status-seeking, but it probably doesn’t feel like that to Bob; in fact Bob might not know what the word “status” means, and Bob might be totally oblivious to the existence of any connection between his desire to be a movie star and Alice’s desire to be a classical musician and Carol’s desire to eat at the cool kids table in middle school.

That seems true to me? I don't mean that humans become aligned with their explicit verbal concept of status. I mean that (many) humans are aligned with the intuitive concept that they somehow learn over the course of development.
 

I think it’s possible for the genome to build “it’s intrinsically motivating to believe that other people like me” into the brain whereas it would not be analogously possible for the genome to build “it’s intrinsically motivating to have a high inclusive genetic fitness” into the brain. There are many reasons that the latter is not realistic, not least of which is that inclusive genetic fitness is only observable in hindsight, after you’re dead.

Makes sense!

 

A super relevant point. If we try to align our AIs with something, and they end up robustly aligned with some other proxy thing, we definitely didn't succeed. 

But, it's still impressive to me that evolution hooked up general planning capabilities to a (learned) abstract concept, at all. 

Like, there's this abstract concept, which varies a lot in its particulars from environment to environment, and which the brain has to learn to detect apart from those particulars. Somehow the genome is able to construct the brain such that the motivation circuitry can pick out that abstract concept, after it is learned (or as it is being learned), and use it as a major criterion of the planning and decision machinery. And the end result is that the organism as a whole ends up not that far from an [abstract concept]-maximizer.

This is a lot more than I might expect evolution to be able to pull off, if I thought that our motivations were a hodge-podge of adaptations that cohere (as much as they do) into godshatter.

My point is NOT that evolution killed it and alignment is easy. My point is that evolution got a lot further than I would have guessed was possible.
 

Those are motivations, but they (mostly) don't have the type signature of "goals"; they have the type signature of "drives".

I pursue interesting stuff because I'm curious. That doesn't require me to even have a concept of curiosity; it could in principle be steering me without my awareness. My planning process might use curiosity, but it isn't aligned with curiosity, in the sense that we don't (usually) make plans that maximize our curiosity. We just do what's interesting.

In contrast, social status is a concept that humans learn, and it does look like the planning process is aligned with the status concept, in that (some) humans habitually make plans that are relatively well described as status maximizing. 

Or, another way of saying it: our status motivations are not straightforward adaptation-execution. They recruit general intelligence in service of this concept, in much the way that we would want an AGI to be aligned with a concept like the Good or corrigibility.

Romantic love: again, people act on it (including using their general intelligence), but their planning process is not in general aligned with maximizing romantic love. (I'm editorializing about human nature here, but it looks to me like romantic love is mostly a strategy for getting other goals.)

Altruism: it's debatable whether most instances of maximizing altruistic impact are better described as status maximization. Regardless, this is an overriding strategic goal, recruiting general intelligence, for a very small fraction of humans.

A great counterpoint! 

Yeah, I wrote some years ago about how status isn't a special feature that humans attribute to each other for contingent social psychology reasons, but rather falls out very naturally as an instrumentally convergent resource.

Yeah, when I consider that, it does undercut the claim that evolution shaped us to optimize for status. It shaped us to want things, and also to find strategies to get them.

Can you give some examples of the prompts you're using here? In what ways are you imagining it helping with alignment?

Given that there is no legislation to enforce a slowdown, it is preferable that Anthropic-style AIs be state of the art rather than OpenAI-style, as long as the AI safety community uses Claude heavily during that time.

Can someone elaborate on why this would be?
