Previously, I argued that we should expect future ML systems to often exhibit "emergent" behavior, where they acquire new capabilities that were not explicitly designed or intended, simply as a result of scaling. This was a special case of a general phenomenon in the physical sciences called More Is Different.

I care about this because I think AI will have a huge impact on society, and I want to forecast what future systems will be like so that I can steer things to be better. To that end, I find More Is Different to be troubling and disorienting. I’m inclined to forecast the future by looking at existing trends and asking what will happen if they continue, but we should instead expect new qualitative behaviors to arise all the time that are not an extrapolation of previous trends.

Given this, how can we predict what future systems will look like? For this, I find it helpful to think in terms of "anchors"---reference classes that are broadly analogous to future ML systems, which we can then use to make predictions.

The most obvious reference class for future ML systems is current ML systems---I'll call this the current ML anchor. I think this is indeed a pretty good starting point, but we’ve already seen that it fails to account for emergent capabilities.

What other anchors can we use? One intuitive approach would be to look for things that humans are good at but that current ML systems are bad at. This would include:

  • Mastery of external tools (e.g. calculators, search engines, software, programming)
  • Very efficient learning (e.g. reading a textbook once to learn a new subject)
  • Long-term planning (e.g. being able to successfully achieve goals over months)

Models sufficiently far in the future will presumably have these sorts of capabilities. While this still leaves unknowns---for instance, we don't know how rapidly these capabilities will appear---it's still a useful complement to the current ML anchor. I'll call this the human anchor.

A problem with the human anchor is that it risks anthropomorphising ML by over-analogizing with human behavior. Anthropomorphic reasoning correctly gets a bad rap in ML, because it's very intuitively persuasive but has a track record that is mixed at best. This isn't a reason to abandon the human anchor, but it means we shouldn't be entirely satisfied with it.

This brings us to a third anchor, the optimization anchor, which I associate with the "Philosophy" or thought experiment approach that I've described previously. Here the idea is to think of ML systems as ideal optimizers and ask what a perfect optimizer would do in a given scenario. This is where Nick Bostrom's colorful description of a paperclip maximizer comes from, where an AI asked to make paperclips turns the entire planet into paperclip factories. To give some more prosaic examples:

  • The optimization anchor would correctly predict imitative deception (Lin et al., 2021), since a system optimized to produce high-probability outputs has no intrinsic reason to be truthful.
  • It would also observe that power-seeking is instrumentally useful for many different goals, and so predict that optimal policies (as well as sufficiently powerful neural networks) will tend to seek power (Turner et al., 2021).
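
The power-seeking prediction can be made concrete with a toy illustration. This is only a sketch, not the formal construction in Turner et al.: it assumes a hypothetical one-step choice between a dead end with a single outcome and a "hub" that keeps several outcomes reachable, with each outcome's reward drawn independently and uniformly at random. The function name and parameters are invented for this example.

```python
import random

def fraction_preferring_hub(n_samples=20000, hub_options=3, seed=0):
    """Estimate how often an optimal agent prefers the option-rich state.

    Each sample is one randomly drawn goal (reward function). The agent
    either takes the dead end (one outcome) or moves to the hub and then
    picks the best of `hub_options` outcomes.
    """
    rng = random.Random(seed)
    prefers_hub = 0
    for _ in range(n_samples):
        dead_end_reward = rng.random()
        hub_rewards = [rng.random() for _ in range(hub_options)]
        # The optimal policy goes wherever the best reachable reward is.
        if max(hub_rewards) > dead_end_reward:
            prefers_hub += 1
    return prefers_hub / n_samples

if __name__ == "__main__":
    print(fraction_preferring_hub())
```

Under these assumptions the best of the four outcomes is equally likely to be any of them, so the hub is optimal for about three quarters of randomly drawn goals, even though none of those goals mentions keeping options open.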

Ideas produced by the optimization anchor are often met with skepticism, because they often contradict the familiar current ML anchor, and they don't benefit from the intuitive appeal of the human anchor. But the differences from these other two anchors are precisely what make the optimization anchor valuable. If you (like me) feel that both the current ML and human anchors paint an incomplete picture, then you should want a third independent perspective.

The optimization anchor does have limitations. Since it abstracts ML into an ideal optimizer, it ignores most on-the-ground facts about neural networks. This can lead to underconstrained predictions, and to ignoring properties that I think will be necessary for successfully aligning ML systems with humans. I'll say more about this later, but some particularly important properties are that neural networks often generalize in "natural" ways, that we can introspect on network representations, and that training dynamics are smooth and continuous. Researchers focused on the optimization anchor don't entirely ignore these facts, but I think they tend to underemphasize them and are overly pessimistic as a result.

The Value of Thought Experiments

The optimization anchor points to the value of thought experiments more generally. While it poses the thought experiment of "What if AI were a perfect optimizer?", there are many other thought experiments that can provide insights that'd be hard to obtain from the ML or human anchors. In this sense thought experiments are not a single anchor but a generator for anchors, which seems pretty valuable.

One thought experiment that I particularly like is: What happens if most of an agent's learning occurs not during gradient descent, but through in-context learning[1]? This is likely to happen eventually, as ML agents are rolled out over longer time horizons (think artificial digital assistants) and as ML improves at in-context learning. Once this does happen, it seems possible that agents' behavior will be controlled less by the "extrinsic" shaping of gradient descent and more by whatever "intrinsic" drives they happen to have[2]. This also seems like a change that could happen suddenly, since gradient descent is slow while in-context learning is fast.

It would be great if we had a community of researchers making thought experiments with clearly stated assumptions, explaining in detail the consequences of those assumptions and ideally connecting them to present-day research.

Other Anchors

There are many other anchors that could be helpful for predicting future ML systems. Non-human animal behavior could provide a broader reference class than humans alone. Evolution and the economy are both examples of powerful, distributed optimization processes. I am most excited about better understanding complex systems, which include biological systems, brains, organizations, economies, and ecosystems and thus subsume most of the reference classes discussed so far. It seems to me that complex systems have received little attention relative to their germaneness to ML. Indeed, emergence is itself a concept from complex systems theory that is useful for understanding recent ML developments.

Limitations of Thought Experiments

I've focused so far on predicting problems that we need to address. But at some point we actually have to solve the problems. In this regard thought experiments are weaker, since while they often point to important big-picture issues, in my view they fare poorly at getting the details right, which is needed for engineering progress. For instance, early thought experiments considered a single AI system that was much more powerful than any other contemporary technology, while in reality there will likely be many ML systems with a continuous distribution of capabilities. More recent thought experiments impose discrete abstractions like "goals" and "objectives" that I don’t think will cleanly map onto real ML systems. Thus while thought experiments can point to general ideas for research, even mapping these ideas to the ontology of ML systems can be a difficult task.

As a result, while we can't blindly extrapolate empirical trends, we do need a concerted empirically-based effort to address future ML risks. I'll explain why I think this is possible in a later post, but first I'll take us through an example of "taking a thought experiment seriously", and what it implies about possible failure modes of ML systems.


  1. In-context learning refers to learning that occurs during a single "rollout" of a model. The most famous example is GPT-3's ability to learn new tasks after conditioning on a small number of examples.

  2. While this statement borders on anthropomorphizing, I think it is actually justified. For instance, depending on the training objective, many agents will likely have a "drive" towards information-gathering, among others.


20 comments

A problem with the human anchor is that it risks anthropomorphising ML by over-analogizing with human behavior. Anthropomorphic reasoning correctly gets a bad rap in ML,

The fear of anthropomorphising AI is one of the more ridiculous traditional mental blindspots in the LW/rationalist sphere.

The entire premise of AGI is that of an artificial system that 'thinks', putting it solidly in the reference class that contains humans and nearly nothing else, so it's literally the one case where anthropomorphising is actually justified, especially when you consider that brains are efficient!

The "AI will be anthropomorphic" viewpoint more correctly predicted the success of the brain-reverse engineering approaches (aka DL), that AGI will require lengthy education/training processes like humans, will have far more human-like cognitive limitations than you'd otherwise expect, etc. It absolutely trounced the competing "anti-anthropomorphic" viewpoint in predicting the nature of AGI.

The fear of anthropomorphising AI is one of the more ridiculous traditional mental blindspots in the LW/rationalist sphere.

You're really going to love Thursday's post :).

Jokes aside, I actually am not sure LW is that against anthropomorphising. It seems like a much stronger injunction among ML researchers than it is on this forum.

I personally am not very into using humans as a reference class because it is a reference class with a single data point, whereas e.g. "complex systems" has a much larger number of data points.

In addition, it seems like intuition about how humans behave is already pretty baked in to how we think about intelligent agents, so I'd guess by default we overweight it and have to consciously get ourselves to consider other anchors.

I would agree that it's better to do this by explicitly proposing additional anchors, rather than never talking about humans.

To an extent that's true. There are certainly some similarities in how human brains work and how deep learning works, if for no other reason than that DL uses a connectionist approach to AI, which has given narrow AIs something like an intuition, rather than the hard-coded rules of GOFAI. And yes, once we start developing goal-oriented artificial agents, humans will remain, for a long time, the best model we have for approaching an understanding of them.

However, remember how susceptible current DL models can be to adversarial examples, even when the adversarial examples have no perceptible difference to non-adversarial examples as far as humans can tell. That means that something is going on in DL systems that is qualitatively much different from how human brains process information. Something that makes them fragile in a way that is hard to anthropomorphize. Something alien.

And then there is the orthogonality thesis. Even though humans are the best example we currently have of general intelligence, there is no reason to think that the first AGIs will have goal/value structures any less alien to humans than would a superintelligent spider. Anthropomorphization of such systems carries the risk of assuming too much about how they think or what they want, where we miss critical points of misalignment.

However, remember how susceptible current DL models can be to adversarial examples, even when the adversarial examples have no perceptible difference to non-adversarial examples as far as humans can tell. That means that something is going on in DL systems that is qualitatively much different from how human brains process information. Something that makes them fragile in a way that is hard to anthropomorphize. Something alien.

That is highly debatable. There has been work on constructing adversarial examples for human brains, and some interesting demonstrations of considerable neural-level control even with our extremely limited ability to observe brains (i.e. far short of 'know every single parameter in the network exactly and are able to calculate exact network-wide gradients for it or a similar network'), and theoretical work arguing that adversarial examples are only due to the most obvious way that current DL models differ from human brains - being much, much, much smaller.

There has been work on constructing adversarial examples for human brains, and some interesting demonstrations of considerable neural-level control even with our extremely limited ability to observe brains

Do you have a source for this? I would be interested in looking into it. I could see this happening for isolated neurons, at least, but it would be curious if it could happen for whole circuits in vivo.

Does this go beyond just manipulating how our brains process optical illusions? I don't see how the brain would perceive the type of pixel-level adversarial perturbations most of us think of (e.g.: https://openai.com/blog/adversarial-example-research/) as anything other than noise, if it even reaches past the threshold of perception at all. The sorts of illusions humans fall prey to are qualitatively different, taking advantage of our perceptual assumptions like structural continuity or color shifts under changing lighting conditions or 3-dimensionality. We don't tend to go from making good guesses about what something is to being wildly, confidently incorrect when the texture changes microscopically.

My guess would be that you could get rid of a lot of adversarial susceptibility from DL systems by adding in the right kind of recurrent connectivity (as in predictive coding, where hypotheses about what the network is looking at help it to interpret low-level features), or even by finding a less extremizing nonlinearity than ReLU (e.g.: https://towardsdatascience.com/neural-networks-an-alternative-to-relu-2e75ddaef95c). Such changes might get us closer to how the brain does things.

Overparameterization, such as through making the network arbitrarily deep, might be able to get you around some of these limitations eventually (just like a fully connected NN can do the same thing as a CNN in principle), but I think we'll have to change how we design neural networks at a fundamental level in order to avoid these issues more effectively in the long term.

Look through https://www.gwern.net/docs/ai/adversarial/index. The theoretical work is the isoperimetry paper: https://arxiv.org/abs/2105.12806

I don't see how the brain would perceive the type of pixel-level adversarial perturbations most of us think of (e.g.: https://openai.com/blog/adversarial-example-research/) as anything other than noise, if it even reaches past the threshold of perception at all.

Here is a paper showing that humans can classify pixel-level adversarial examples that look like noise at better than chance levels, see Experiment 4 (and also #5-6): https://www.nature.com/articles/s41467-019-08931-6

Thanks for the links!

However, remember how susceptible current DL models can be to adversarial examples, even when the adversarial examples have no perceptible difference to non-adversarial examples as far as humans can tell. That means that something is going on in DL systems that is qualitatively much different from how human brains process information.

I agree with gwern's points about the difficulty of comparison, but also agree that even if we could compare more directly, there probably is some robustness gap. In DL systems this seems to be mostly due to overfitting noise. Early biological vision stages are strongly compressing and thus indirectly act as powerful noise filters, which happens to also greatly increase robustness, whereas DL vision systems usually aren't trained that way. But I expect that gap to narrow as DL models become more brain like in size, efficiency, regularization, and in training regime (more self-supervised, unsupervised, etc).

there is no reason to think that the first AGIs will have goal/value structures any less alien to humans than would a superintelligent spider

There will be great economic pressure to create AGIs aligned to human values - at least in corporate or customer facing forms. But some humans aren't particularly aligned to any other agent's values either, so it's not as if anthropomorphic somehow implies altruistic (and indeed countless human stories rely on non-altruistic, unaligned anthropomorphic things in the form of villains).

Finally, the orthogonality thesis applies to humans as well; key components of generic human motivation are instrumental self-motivational drives - power seeking, curiosity, etc - just exactly the same things we are worried unaligned AGI will have. This idea that humans have these highly specific values that are weirdly different than the values of practical generic learning agents is actually mostly false; substantial evidence from neuroscience now indicates that the brain is very much like a generic self-motivated learning system with some minimal required modifications for mating to work thrown in (which are actually optional regardless), and must be so for various bootstrapping efficiency reasons. The complexity of human values is largely emergent; and AGI will follow suit in that regard.

it's not as if anthropomorphic somehow implies altruistic

Yeah, my take is that there are humans with the kinds of goals and motivations that we'd be OK with an AGI having, and there are also humans with the kinds of goals and motivations that we absolutely don't want our AGIs to have. Therefore "AGIs with human-like motivation systems" is a potentially promising area to explore, but OTOH "We'll build an AGI with a human-like motivation system, and then call it a day" is not sufficient, we need to do extra work beyond just that.

There will be great economic pressure to create AGIs aligned to human values - at least in corporate or customer facing forms.

Well, yes, people will probably try, but that doesn't mean they'll succeed :-P

This idea that humans have these highly specific values that are weirdly different than the values of practical generic learning agents is actually mostly false; substantial evidence from neuroscience now indicates that the brain is very much like a generic self-motivated learning system with some minimal required modifications for mating to work thrown in (which are actually optional regardless), and must be so for various bootstrapping efficiency reasons. The complexity of human values is largely emergent; and AGI will follow suit in that regard.

I think this claim is mixing up evolution versus within-lifetime learning, or else maybe I just disagree with you.

To grossly oversimplify, my model is: "motivations are mostly built by evolution, intelligence is mostly built by within-lifetime learning".

So the evolution-of-humans model would be: there's an outer-loop selection process not only selecting the within-lifetime neural architectures and hyperparameters and learning rules etc. but also selecting the within-lifetime reward function, and then the within-lifetime learning happens in an inner loop, or in ML terminology, as a single super-long episode with online learning. And over the course of that episode the agent gradually learns all about itself and the world, and winds up pursuing goals. The within-lifetime reward function is not exactly the same as those goals, but is upstream of those goals and closely related. (I'm oversimplifying in lots of ways, but I think this is spiritually correct.)
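
The two-loop picture above can be sketched in a few lines. This is a deliberately crude illustration under invented assumptions (a three-action world, an epsilon-greedy value learner, truncation selection); `TRUE_FITNESS`, `lifetime` and `evolve` are hypothetical names, and nothing here is meant as a model of actual brains or actual evolution.

```python
import random

# Hypothetical toy world: three actions with different evolutionary payoffs.
# The agent never observes these numbers; only the outer loop does.
TRUE_FITNESS = [0.0, 1.0, 0.2]

def lifetime(reward_weights, steps=200, seed=0):
    """Inner loop: one long episode of online learning on the innate reward.

    Returns the fitness accumulated over the lifetime.
    """
    rng = random.Random(seed)
    n_actions = len(reward_weights)
    values = [0.0] * n_actions  # learned within the lifetime
    fitness = 0.0
    for _ in range(steps):
        if rng.random() < 0.1:  # occasional exploration
            action = rng.randrange(n_actions)
        else:  # otherwise act greedily on the learned values
            action = max(range(n_actions), key=lambda a: values[a])
        innate_reward = reward_weights[action]
        values[action] += 0.1 * (innate_reward - values[action])
        fitness += TRUE_FITNESS[action]
    return fitness

def evolve(generations=30, pop_size=20, seed=0):
    """Outer loop: blind search over innate reward functions."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in TRUE_FITNESS]
                  for _ in range(pop_size)]
    for gen in range(generations):
        # Everyone in a generation faces the same lifetime randomness.
        ranked = sorted(population, key=lambda w: lifetime(w, seed=gen),
                        reverse=True)
        parents = ranked[: pop_size // 4]
        # Children are mutated copies of the fittest reward functions.
        population = [[w + rng.gauss(0, 0.1) for w in rng.choice(parents)]
                      for _ in range(pop_size)]
    return max(population, key=lifetime)
```

In this sketch the agent only ever sees `reward_weights`, never `TRUE_FITNESS`, which mirrors the claim that the within-lifetime reward function is upstream of the agent's goals without being identical to what the outer loop selects for.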

Example: Suppose a human is trying to stay alive. There are two ways that this might have happened. The first story is that the outer-loop (evolutionary) selection has endowed humans with a within-lifetime reward function that (by and large) tends to make humans have drives related to staying alive, as an end in itself. ("I dunno, I just love being alive", they say.) The second story involves the inner loop, instrumental convergence, and means-end reasoning: Imagine a person saying "I'm old and tired and sick, but dammit I refuse to die until I finish writing my novel!"

I think when people say "human values", they're mainly talking about the genetically-endowed drives that come from the within-lifetime reward function, not the instrumental goals that pop up via means-end reasoning. So the question is: would the AGI's "innate" reward function for within-lifetime learning resemble a human's?

Well, IF we copy evolution, and select our AGI's within-lifetime reward function by a blind outer-loop search, I think we're likely to wind up with AGIs whose "innate" drives include things like curiosity and self-preservation and power-seeking, which are also recognizable as human innate drives. I think this is the point you were trying to make. (Sorry if I'm misunderstanding.)

But there are two key points.

First, I think this is a terrible way to make a within-lifetime reward function, and we don't have to do it that way, and we probably won't. The within-lifetime reward function is the single most important ingredient in AGI safety and alignment. The last thing I would do is pick it by a blind search. And I don't think we need to. I propose that we should write the within-lifetime reward function ourselves, by hand. After all, ML researchers hand-code reward functions all the time, and the agents do just fine. Even when the reward function is picked based on the whims of some ancient Atari game developer, the agents still usually do fine. Putting curiosity into the reward function helps, but we already know how to add curiosity into a reward function, we can copy the formula from arxiv, we don't need to do a blind-search-over-within-lifetime-reward-functions.
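
As a hedged illustration of what "adding curiosity into a reward function" can look like, here is one common form, a count-based novelty bonus that shrinks as a state is revisited. Which formula the comment has in mind is not stated, so treat the choice of bonus, the class name `CuriousReward`, and the coefficient `beta` as assumptions made for this sketch.

```python
import math
from collections import Counter

class CuriousReward:
    """Hand-written reward: task reward plus a count-based novelty bonus."""

    def __init__(self, beta=0.1):
        self.beta = beta          # weight of the curiosity term (assumed)
        self.visits = Counter()   # how often each state has been seen

    def __call__(self, state, extrinsic_reward):
        self.visits[state] += 1
        # Bonus decays as 1 / sqrt(visit count), so novel states pay more.
        bonus = self.beta / math.sqrt(self.visits[state])
        return extrinsic_reward + bonus

if __name__ == "__main__":
    reward = CuriousReward(beta=1.0)
    print(reward("room_a", 0.0), reward("room_a", 0.0), reward("room_b", 0.0))
```

The point of the sketch is only that the curiosity term is an explicit, inspectable line of code chosen by the designer rather than something discovered by an outer-loop search.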

Second, even if we did pick the AGI's within-lifetime reward function by an evolution-like outer-loop search, the AGI probably wouldn't wind up with human-like social instincts. That's important—I think human social instincts underlie pretty much all of human moral intuitions, and spill into practically every other part of human behavior. I suppose you could get some kind of AGI social instincts by doing your outer-loop search in a simulation where multiple AGIs can work together and have iterated prisoners' dilemmas etc. But social interactions per se are insufficient—human-like social instincts are different from wolf-like social instincts or bonobo-like social instincts etc. I think there are a lot of possible equilibria, and it depends on idiosyncrasies of their social environment and power dynamics (details). Also, even if you somehow nailed it, you would get AGIs with human-like social instinct towards the other AGIs in the simulation, and not necessarily towards humans.

To grossly oversimplify, my model is: "motivations are mostly built by evolution, intelligence is mostly built by within-lifetime learning".

Sure - if by 'motivations' we are referring to the innate optimization criteria of the learning machinery.

Suppose a human is trying to stay alive. There are two ways that this might have happened. The first story is that the outer-loop (evolutionary) selection has endowed humans with a within-lifetime reward function that (by and large) tends to make humans have drives related to staying alive, as an end in itself. ("I dunno, I just love being alive", they say.) The second story involves the inner loop, instrumental convergence, and means-end reasoning:

This is actually a bad example because the typical human goal of avoiding death is mostly a learned instrumental goal, not innate. Humans are not born with a concept of death, but instead are born with generic empowerment self-motivation which immediately recognizes death as a terrible outcome to avoid.

I think we're likely to wind up with AGIs whose "innate" drives include things like curiosity and self-preservation and power-seeking, which are also recognizable as human innate drives. I think this is the point you were trying to make.

Yes - my key points are (1) that self-motivated empowerment/curiosity are the primary, simplest human drives, and (2) that these are nearly unavoidable, almost requirements for AGI. They are completely unsafe by themselves without additional drives/constraints, but that's just as true for evolution.

Whether we "write the AI's reward function by hand" or not doesn't really matter much - what matters is what success requires. There's a growing body of work in DL and neuroscience (would take too long to summarize here, subject of a future post) indicating that human level learning ability in complex environments simply requires empowerment/curiosity (self-motivated learning). In a nutshell, it can be derived as the only viable solution to the sparse reward problem. The real world is unlike Atari, it doesn't provide a reward signal. But because of instrumental convergence, the immediate instrumental goals for almost all long term terminal goals converge on empowerment/curiosity.

Second, even if we did pick the AGI's within-lifetime reward function by an evolution-like outer-loop search, the AGI probably wouldn't wind up with human-like social instincts.

Humans aren't born with "human-like social instincts". Humans develop social skills through learning driven by empowerment/curiosity tempered by and/or combined with altruism/empathy.

Also, even if you somehow nailed it, you would get AGIs with human-like social instinct towards the other AGIs in the simulation, and not necessarily towards humans.

I'm genuinely curious how you think this would arise, in that it requires some strangely specific "sim human" vs "human" detector, additionally combined with confidence that there is only one level of sim.

Thanks!

Humans aren't born with "human-like social instincts". Humans develop social skills through learning driven by empowerment/curiosity tempered by and/or combined with altruism/empathy.

Let's talk about things like: sense-of-fairness, sense-of-justice, status-seeking, pride, defensiveness, guilt, revenge, schadenfreude, affection, generosity, in-group signaling, etc.

I claim that all those things stem from a suite of innate reactions hardcoded by the genome, and I call those things "social instincts".

People don't seek revenge because they figured out earlier in life that revenge would be instrumentally useful; they seek revenge because they feel a burning desire for revenge. Right? And when that burning desire is utterly destructive to all their other life goals, well in some cases they'll go ahead and do it anyway.

It seems to me that the revenge drive and all those other things I mentioned are an uncanny match for dispositions that lead to "thriving in small highly-social groups of hunter-gatherer humans". I don't think that's a coincidence, and I don't see any other explanation. Again, if someone has a vengeful personality, it's not a consequence of their understanding the game-theoretic justification for having a vengeful personality. So I think these things come from specific genetically-hardwired circuitry setting up specific drives. I'm confused about what you think is going on, if it's not that.

If we want an AI to succeed at inventing solar cells or whatever, and we don't care whether the AI can thrive in small highly-social groups of hunter-gatherer humans, then it seems to me that the AI does not require (or even benefit from) things like revenge drives and sense-of-fairness drives.

Another piece of evidence for that: if you compare neurotypical people vs sociopaths vs people with autism, there's huge variation in the relative and absolute strengths of the various social-instinct-related drives. But people in all three categories can be smart and competent and able to accomplish tricky goals.

The typical human goal of avoiding death is mostly a learned instrumental goal, not innate

Yeah, I guess you're right. Maybe I should have said "avoiding bodily injury" instead. We have an innate drive designed for the express purpose of making us want to avoid bodily injury (i.e. pain), but we can also wind up wanting to avoid bodily injury due to means-end reasoning ("That sounds fun, but I can't risk getting injured right now, the Big Game is coming up on Sunday!").

I'm genuinely curious how you think this would arise, in that it requires some strangely specific "sim human" vs "human" detector, additionally combined with confidence that there is only one level of sim.

I think humans have a bunch of partially-redundant innate mechanisms in the brainstem to flag other humans in the vicinity. There's good evidence for an innate face-detector in the brainstem. I strongly suspect that the brainstem also has a human-speech-sound detector (I think it helps speed up language learning). There are obviously also innate smell-of-a-person detectors, and touch-of-a-person detectors, etc.

It's a messy and under-determined process, apparently. Consider animals: Sometimes a human will relate to an animal in a way that is influenced by human social instincts. Other times, a human will relate to an animal in a way that is absolutely not influenced by human social instincts—e.g. the people who run a factory farm.

Anyway, I presume that, in a simulation environment with lots of AGIs cooperating together from a position of similar power and gains-from-trade, they would likewise develop social instincts of some sort (perhaps human-like, perhaps not). I assume those instincts would be implemented in a way that involves (among other things) messy redundant heuristics in the within-lifetime reward function for flagging situations where the AGI is interacting with another AGI. Then if we keep the same reward function and train an AGI with humans around, will it treat humans as in-group AGIs, or out-group AGIs, or things-to-which-social-instincts-do-not-apply, or what? I have no idea; I think it's impossible to say in advance.

Let's talk about things like: sense-of-fairness, sense-of-justice, status-seeking, pride, defensiveness, guilt, revenge, schadenfreude, affection, generosity, in-group signaling, etc.

I claim that all those things stem from a suite of innate reactions hardcoded by the genome, and I call those things "social instincts".

Status-seeking likely emerges from empowerment and social dynamics, guilt is likewise just emergent regret from altruism/empathy, affection/generosity are just manifestations of altruism/empathy. Fairness/justice/revenge/anger are all likely just manifestations of the same core emotion interacting with theory of mind (injustice triggers anger, and revenge is the consequentialist endpoint of anger). In other words, I'm not claiming there aren't any innate hardcoded emotional circuits - obviously there are - just that there are fewer truly innate ones than you posit, and instead most emerge from learning with a smaller, simpler set of innate primal drives/emotions.

People don't seek revenge because they figured out earlier in life that revenge would be instrumentally useful; they seek revenge because they feel a burning desire for revenge. Right?

Revenge is simply planning under the influence of anger/wrath. The anger/injustice emotional circuitry is innate, and so is planning, so humans don't need to learn to plan while predominantly angry, but they do need to learn to map those emergent mental behaviors to the word 'revenge'.

So I think these things come from specific genetically-hardwired circuitry setting up specific drives.

Sure we agree there, I just don't think there are as many or as complex innate sub components as you are positing.

If we want an AI to succeed at inventing solar cells or whatever, and we don't care whether the AI can thrive in small highly-social groups of hunter-gatherer humans, then it seems to me that the AI does not require (or even benefit from) things like revenge drives and sense-of-fairness drives.

Sure we don't need the justice/anger emotional subsystem, or the mating specific components, but we still want the equivalent of empathy/altruism.

I think humans have a bunch of partially-redundant innate mechanisms in the brainstem to flag other humans in the vicinity. There's good evidence for an innate face-detector in the brainstem. I strongly suspect that the brainstem also has a human-speech-sound detector

It's debatable how important (vs vestigial) many of these innate detectors are in humans, but they certainly don't seem to be very important/necessary for AGI. They were likely far more important for smaller brained and shorter lived mammalian ancestors.

Then if we keep the same reward function and train an AGI with humans around, will it treat humans as in-group AGIs, or out-group AGIs, or things-to-which-social-instincts-do-not-apply, or what?

If the AGI grows up in simulations that descend from modern game-tech with realistic humans, it would be pretty weird if that somehow didn't transfer to recognizing humans as sapients (especially given how humans have no problem recognizing agents in the shape of animals or inanimate objects as sapients). This is relevant because simulations will likely continue to be the dominant and most effective means of testing/evaluating/developing AI/AGI.

Thanks!

The main thing I had originally wanted to push back on was your earlier claim "This idea that humans have these highly specific values that are weirdly different than the values of practical generic learning agents is actually mostly false".

But later IIUC you said "The anger/injustice emotional circuitry is innate" and that a practical generic learning agent does not need that circuitry. (If so, I agree.)

If I'm understanding you correctly, you also think that altruism/empathy also involves purpose-built innate circuitry, and that we can make a practical generic learning agent without that altruism/empathy circuitry, and it would still be competent (e.g. able to invent a better solar cell), but the people who make AGIs will in fact choose to put that altruism/empathy circuitry in. (If so, I agree that people will want to put that circuitry in, but I'm concerned that they will not know how to put it in, and I'm also concerned that people will do dangerous experiments where they omit that circuitry just to see what happens etc.)

I find it hard to reconcile the claims "This idea that humans have these highly specific values that are weirdly different than the values of practical generic learning agents is actually mostly false" versus "It is perfectly possible to build a practical, powerful learning agent with neither anger/injustice emotional circuitry nor altruism/empathy emotional circuitry." Those seem contradictory, right? Am I misunderstanding something?

 

The other side of my mildly-anti-anthropomorphism argument: I think it's possible that we will make an AGI with things inside its within-lifetime reward function that give it one or more "innate drives" that are radically different from any of the innate drives in humans, e.g. an "innate drive" for making paperclips analogous to the human innate drive for not being in pain. My impression is that you think this won't happen, but I'm not sure if that's because you think it's impossible / nonsensical, or because you think that the people who make AGIs will successfully avoid putting in drives like that.

(My belief is that "people will make AGIs with innate drives that are different from any of the innate drives in humans" is both possible and likely to actually happen, unless we put great effort into developing best practices for safe AGI design, and future AGI designers actually follow those best practices.)

I'm not claiming there aren't any innate hardcoded emotional circuits - obviously there are - just that there are fewer truly innate ones than you posit, and instead most emerge from learning with a smaller simpler set of innate primal drives/emotions.

I don't have a strong opinion about how complex the "innate primal drives / emotions" that underlie human social instincts are. In particular, I'm open-minded to the possibility that there's one innate reaction circuit that underlies (what we think of as) schadenfreude and revenge and pride etc., or whatever.

Well, hmm, maybe "open-minded but leaning skeptical". For example, I think humans have an innate eye-contact detector in the brainstem that triggers some set of corresponding reactions. I think that's a thing with dedicated innate circuitry. I also think "disgust" is its own dedicated thing in the brainstem—actually, I heard there are two slightly-different innate disgust reactions, associated with slightly-different facial expressions—and disgust reactions wind up playing a role in social emotions too. Anyway, various things like that make me skeptical that there's a simple "grand unified theory of human social emotions".

Well, maybe it depends on how accurate we're talking about. Maybe we can list all the human innate reaction circuits, in descending order of importance for human social emotions, and maybe the top one or two or three things would be sufficient to reproduce all the most salient and important phenomena in human social instincts, and maybe the other 500 things further down the list are all kinda subtle details that don't add up to much. I'm very open-minded to that possibility.

debatable how important (vs vestigial) many of these innate detectors are in humans

My current belief (see the blog post draft #3 that I shared with you a couple weeks ago) is that the simplest within-lifetime reward function for a powerful AGI consists of (1) some kind of curiosity drive, (2) some kind of drive to pay attention to humans, including human language.

Your list half-overlaps with mine. IIUC, you have (1) some kind of curiosity drive, (2*) "empowerment" drive. Did I get that right?

Why do I think (2) is important? Because a curious agent can be curious about anything—it can construct better and better models of trees, or clouds, or the shape and distribution of pebbles, etc. Granted, human language is an endlessly-complex pattern that might evoke curiosity … but the agent could also run Rule 110 in its head forever and also find endlessly-complex patterns that might evoke curiosity. So I think (2) is necessary to point the curiosity drive in the right general direction. This is why I put a lot of emphasis on those innate face detectors, human-speech-sound detectors, etc.
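For readers who have not met it, Rule 110 is an elementary cellular automaton: each cell becomes 0 or 1 depending only on itself and its two neighbors, yet the resulting patterns never settle into anything simple. The sketch below is only an illustration of that point; the periodic boundary, the grid width, and the function names `step` and `run` are assumptions made for the example, not anything specified in this discussion.

```python
RULE = 110  # binary 01101110: bit k is the next state for neighborhood value k


def step(cells):
    """Apply one Rule 110 update to a list of 0/1 cells (periodic boundary)."""
    n = len(cells)
    nxt = []
    for i in range(n):
        # cells[i - 1] wraps around to the last cell when i == 0
        left, center, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        neighborhood = (left << 2) | (center << 1) | right
        nxt.append((RULE >> neighborhood) & 1)
    return nxt


def run(cells, steps):
    """Return the list of successive generations, starting with `cells`."""
    history = [cells]
    for _ in range(steps):
        cells = step(cells)
        history.append(cells)
    return history


if __name__ == "__main__":
    start = [0] * 31 + [1]  # a single live cell at the right edge
    for row in run(start, 16):
        print("".join("#" if c else "." for c in row))
```

A purely curious agent could in principle keep finding new structure in output like this indefinitely, which is the worry being raised about undirected curiosity.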

Why do I think (2*) is not important? (With the caveat that maybe I'm misunderstanding what you mean by empowerment.) Because we can get empowerment from curiosity, through means-end instrumental reasoning within a lifetime.

I also think that the dynamic in humans is "drive for status → drive for empowerment", rather than the other way around. (You can also get drive for empowerment from almost any other drive.) I think "drive for status" is a beautiful explanation of tons of things, not all of which are explainable via drive for empowerment, and that analogous status hierarchies / status drives exist in other animals too like the Arabian babbler (see Elephant in the Brain).

The main thing I had originally wanted to push back on was your earlier claim "This idea that humans have these highly specific values that are weirdly different than the values of practical generic learning agents is actually mostly false".

I take values to mean longer term goals like "save the world", or "become super successful/rich/powerful" or "do God's work" or whatever, not low level drives or emotions.

If I'm understanding you correctly, you also think that altruism/empathy also involves purpose-built innate circuitry, and that we can make a practical generic learning agent without that altruism/empathy circuitry, and it would still be competent (e.g. able to invent a better solar cell), but the people who make AGIs will in fact choose to put that altruism/empathy circuitry in. (If so, I agree that people will want to put that circuitry in, but I'm concerned that they will not know how to put it in, and I'm also concerned that people will do dangerous experiments where they omit that circuitry just to see what happens etc.)

Yeah. The 'altruism/empathy' circuit obviously has some innateness, but it is closely connected to and reliant on learned theory of mind. Sadly not all researchers seem to even care to put something like that in, although they should try. How that subsystem interacts with learned theory of mind is also complex, and is probably more inherently fragile than the generic unsupervised empowerment/curiosity learning system. It may be difficult to scale correctly, even for those who are bothering to try.

I find it hard to reconcile the claims "This idea that humans have these highly specific values that are weirdly different than the values of practical generic learning agents is actually mostly false" versus "It is perfectly possible to build a practical, powerful learning agent with neither anger/injustice emotional circuitry nor altruism/empathy emotional circuitry." Those seem contradictory, right? Am I misunderstanding something?

What I'm considering values here are almost exclusively learned (and typically social/cultural) concepts. Truly alien values would require creating a de novo alien cultural history (possible in sims but unlikely until later).

I think it's possible that we will make an AGI with things inside its within-lifetime reward function that give it one or more "innate drives" that are radically different from any of the innate drives in humans, e.g. an "innate drive" for making paperclips analogous to the human innate drive for not being in pain

I think this is unlikely, as these simply aren't useful for AGI in complex environments. Simple innate drives (score reward) barely work in Atari (and not even for all games). Moving to more complex environments requires some form of intrinsic motivation (empowerment/curiosity/etc), which is necessary, sufficient, and strictly dominant/superior.

Your list half-overlaps with mine. IIUC, you have (1) some kind of curiosity drive, (2*) "empowerment" drive. Did I get that right? Why do I think (2) is important?

I lump empowerment/curiosity together, as they are both candidates for intrinsic-motivation learning, and I'm currently unsure what is the best model for human learning (some data from Atari and Minecraft indicates info-gain is a better fit than empowerment, and 'input entropy' is somewhat better than info-gain[1], although this may be specific to their approximation of empowerment). Regardless, either is universal because of instrumental convergence, so empowerment can lead to curiosity and vice versa, but last I checked artificial curiosity still had some edge case issues.

I think "drive for status" is a beautiful explanation of tons of things, not all of which are explainable via drive for empowerment,

That seems pretty unlikely. Empowerment (or intrinsic-motivated learning) is fully universal/generic, simple, and is fully sufficient to explain drive for status, but the converse is not true. Empowerment explains play and early learning in children, how humans play novel games, why we very rapidly learn the value of money, drive for status, etc.


  1. Matusch, Brendon, Jimmy Ba, and Danijar Hafner. "Evaluating Agents without Rewards." arXiv preprint arXiv:2012.11538 (2020).

Yeah. The 'altruism/empathy' circuit obviously has some innateness, but it is closely connected to and reliant on learned theory of mind. Sadly not all researchers seem to even care to put something like that in, although they should try. How that subsystem interacts with learned theory of mind is also complex, and is probably more inherently fragile than the generic unsupervised empowerment/curiosity learning system. It may be difficult to scale correctly, even for those who are bothering to try.

Strong agree

I think this is unlikely, as these simply aren't useful for AGI in complex environments. Simple innate drives (score reward) barely work in Atari (and not even for all games). Moving to more complex environments requires some form of intrinsic motivation (empowerment/curiosity/etc), which is necessary, sufficient, and strictly dominant/superior.

I'm confused. Suppose AGI developer Alice wants to build an AGI that makes her as much money as possible. I would propose that maybe Alice would try a within-lifetime reward function which is a linear (or nonlinear) combination of (1) curiosity / intrinsic motivation, and (2) reward when Alice's bank account balance goes up. The resulting AGI would have both an "innate" curiosity drive and an "innate" "make Alice's-bank-account-balance-go-up" drive. The latter (unlike the former) is very unlike any of the innate drives in humans.

In other words, I'm open to the possibility that some kind of intrinsic motivation is sufficient to make a powerful agent, but the AGI designers don't just want any powerful agent, they want a powerful agent trying to do something in particular that the AGI designer has in mind. And one obvious way to do so is to put that something-in-particular into the reward function in addition to curiosity / whatever.
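To make the Alice example concrete, here is a minimal sketch of what such a combined within-lifetime reward might look like. Everything in it is hypothetical: the function name `combined_reward`, the weights, and the choice to reward only increases in the balance (taken from the phrase "reward when Alice's bank account balance goes up") are assumptions for illustration, and the curiosity bonus is treated as a number supplied by some other part of the system.

```python
def combined_reward(curiosity_bonus, balance_before, balance_after,
                    w_curiosity=1.0, w_balance=1.0):
    """Linear mix of an intrinsic-motivation term and a task-specific term.

    curiosity_bonus: scalar from some intrinsic-motivation module (assumed given)
    balance_before, balance_after: Alice's account balance around this time step
    w_curiosity, w_balance: hypothetical mixing weights chosen by the designer
    """
    # Reward only when the balance goes up; decreases contribute nothing.
    balance_gain = max(0.0, balance_after - balance_before)
    return w_curiosity * curiosity_bonus + w_balance * balance_gain


# The balance rose by 10, so both terms contribute.
print(combined_reward(0.5, 100.0, 110.0))
# The balance fell, so only the curiosity term remains.
print(combined_reward(0.5, 110.0, 100.0))
```

The point of the sketch is only that the second term is baked directly into the reward function rather than being learned as a means to an end, which is what would make it an "innate" drive in the sense used here.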

I take values to mean longer term goals like "save the world", or "become super successful/rich/powerful" or "do God's work" or whatever, not low level drives or emotions.

Oh sure, I normally call that "explicit goals". I guess maybe your point is that among the 7 billion humans you'll find such an incredibly diverse collection of explicit goals that it's hard to imagine an AGI with a goal far outside that span? If so, I guess that's true, to a point. But I suspect that "maximize paperclips in our future light-cone" would still be an example of something that (to my knowledge) no human in history has ever adopted as an explicit long-term goal. Whereas I think we could make an AGI with that goal.

I'm confused. Suppose AGI developer Alice wants to build an AGI that makes her as much money as possible. I would propose that maybe Alice would try a within-lifetime reward function which is a linear (or nonlinear) combination of (1) curiosity / intrinsic motivation, and (2) reward when Alice's bank account balance goes up.

Should have clarified, but when I said intrinsic motivation was necessary and sufficient, I meant only for creating powerful (but unaligned) AGI. Clearly intrinsic motivation by itself is undesirable - as it's not aligned - so any reasonable use of intrinsic motivation should always use that as an instrumental 'bootstrap' motivator, not the sole or final terminal utility.

You could of course use the specific combination of 1.) intrinsic motivation and 2.) account balance reward, but that also sounds pretty obviously disastrous: when the agent surpasses human capability its best route to maximizing 2 and 1 tends to involve taking control of the account, at which point the human becomes irrelevant at best.

Although I agree this agent would be unlike humans in terms of low level innate drives, most of the variance in human actions is explained purely by intrinsic motivation - which would also be true for this agent.

And one obvious way to do so is to put that something-in-particular into the reward function in addition to curiosity / whatever.

Yeah of course - the intrinsic motivation should never be the only/sole component.

But I suspect that "maximize paperclips in our future light-cone" would still be an example of something that (to my knowledge) no human in history has ever adopted as an explicit long-term goal. Whereas I think we could make an AGI with that goal.

So actually I think if you attempt to work out how to implement that (in a powerful AGI), it's probably as difficult as making approximately aligned AGI. The bank account example is somewhat easier (especially if it's a cryptocurrency account) as it has a direct external signal.

For paperclipping or intra-agent alignment, the key hard problem is actually the same: balancing intrinsic motivation and some learned model utility criteria under scaling. So I suspect most attempts therein either fail to create powerful AGI, or create powerful AGI that fails to paperclip (or align), and instead just falls into the extremely strong generic power-seeking attractor.

Creating any kind of AGI that is actually powerful is hard, and creating AGI that is both powerful and reliably optimizes long term for any world model concept X other than just power-seeking is especially hard, regardless of what X is.

Learning the world model concepts itself is not the hard part, as powerful AGI already necessarily gives you that. (And in the specific case of human alignment any powerful agent already must learn models of human utility functions as part of learning a powerful world model)

Thanks, this is great, I really feel like we're converging here. Here's where I think we stand.

Intrinsic motivation / curiosity:

We both agree that humans have an "intrinsic motivation" drive and that AGI will likewise have an "intrinsic motivation" drive, at least for the early part of training (perhaps it can "fade out" when the AGI is sufficiently smart and self-aware, such that instrumental convergence can substitute for intrinsic motivation?). I'm calling the intrinsic motivation "curiosity", and I'm punting on the details of how it works. You're calling it "curiosity / empowerment", and apparently have something very specific in mind.

I think that intrinsic motivation in both humans & AGIs needs to be supplemented by a "drive to pay attention to humans", which in humans is based on superficial things like an innate brainstem circuit that disproportionately fires when hearing human speech. Without that drive, I think the curiosity would be completely undirected, and you could wind up with an AGI that ignores the world and spends forever running Rule 110 in its head and finding its increasingly-complicated patterns, or studying the coloration of pebbles, etc. Whereas I think you disagree, and you think that "intrinsic motivation", properly implemented, will automatically point itself at the world and technology and humans etc., and not at patterns-in-rule-110.

We also disagree about "drive for having high social status / impressing my friends": You think it's purely a special case of "intrinsic motivation" and thus requires no further explanation, I think it comes at least in part from "social instincts", i.e. low-level drives that evolved in humans specifically because we are social animals.

I'm not immediately sure how to move forward in resolving either of those. I think you said you were going to have a post explaining more about how you think intrinsic motivation works, so maybe I'll just wait for that.

Other low-level drives:

I think we agree that humans have some "social" low-level drives like "altruism / empathy" and "justice/anger" (which I'd call a subset of "social instincts"). We might be disagreeing about how complicated social instincts are (e.g. "how many low-level drives"), with me saying they're probably pretty complicated and you saying they're simple. But it's also possible that we're not disagreeing at all, but rather answering different questions, i.e. "the main aspects of human social instincts" versus "human social instincts in exact detail including subtle mood-shifts based on how somebody smells" or whatever.

I think we agree that AGI can have some or all of those human social instincts, but only if the AGI designers put them in, which would require (1) more research to nail down exactly how they're implemented, (2) advocacy etc. to convince AGI designers to actually put in whatever social instincts we think they ought to put in.

I think we also agree that AGI can have low-level drives very different from any of the low-level drives in humans, like a low-level drive to get a high score in PacMan—not as a means to an end, but rather because the PacMan score is directly baked into the innate within-lifetime reward function. I think you're inclined to emphasize that most of these possible low-level drives would be terribly dangerous, and I'm inclined to emphasize that future AGI designers might put them in anyway.

Explicit goals:

I think we agree that humans, combining their modestly-heterogeneous innate drives (e.g. psychopaths, people with autism, etc.) with modestly-heterogeneous training data (a.k.a. life history), can wind up pursuing an insane variety of explicit goals, like the guy trying to set a world record for longest time spent bathing in ice-water, etc. etc. So the claim "the AGI may wind up pursuing goals radically unlike humans" is less clear-cut than it sounds. OTOH, "the AGI may wind up pursuing explicit goals unlike typical humans in my culture" is a weaker statement, and I think definitely true. I would even say the stronger thing—that it is in fact possible for a future AGI to wind up pursuing an explicit goal that none of the 100 billion humans in history have ever pursued, e.g. maximizing the quantity of solar cells in the future light-cone, particularly if the AGI is programmed to have a low-level innate drive that no human has ever had, and if AGI designers don't really know what they're doing.

Where does that leave anthropomorphism?

When I think of anthropomorphism I have a negative association because I'm thinking of things like my comment here, where somebody was claiming that AGI isn't dangerous because if an AGI just thought hard enough about it, it would conclude that acting honorably is inherently good and hurting people is inherently bad, because after all, that's just the way it is. From my perspective, this is problematic anthropomorphism because the process of moral reasoning involves (among other things) queries to low-level "social instincts" drives (especially related to altruism and justice), and whoever builds the AGI won't necessarily put in the same "social instincts" drives that humans have.

(I could have also pointed out that high-functioning sociopaths often have a very good understanding of honor etc. but do not find those things motivating at all. Maybe that's a general rule: if we see an "anthropomorphism" argument that really only applies to neurotypical people, and not to psychopaths and people with autism etc., then that's a giant red flag.)

Anyway, when you think of anthropomorphism, it seems that your mind immediately goes to "humans can sometimes be single-mindedly in pursuit of power, and AGIs also can sometimes be single-mindedly in pursuit of power", which happens to be a statement I agree with. So you wind up with a positive association.


Couple other things:

You could of course use the specific combination of 1.) intrinsic motivation and 2.) [bank] account balance reward, but that also sounds pretty obviously disastrous

Agree, but only if we define "obviously" as "obviously to me and you". I still think there's a good chance that somebody would try.

So actually I think if you attempt to work out how to implement that (in a powerful AGI), it's probably as difficult as making approximately aligned AGI.

Oh, sorry for bad communication, when I said "I think we could make an AGI with that goal [of maximizing paperclips]", I should have added "in principle". Obviously right now we can't make any AGI whatsoever, and additionally we don't know how to reliably make the AGI that is trying to do some particular thing that we had in mind. I doubt the problem of making a paperclip maximizer is fundamentally impossible, and I'd be pretty confident that we could eventually figure it out if we wanted to (which we don't), if only we could survive long enough to do arbitrarily much trial-and-error. :-P

Thanks for the organized reply, I'll try to keep the same format.

Intrinsic motivation / curiosity:

You are familiar with the serotonergic and dopaminergic pathways and associated learning systems - typically simplified to an unsupervised learning component and a reward learning component.

My main point is that picture is incomplete/incorrect, and the brain's main learning system involves some form of empowerment. Curiosity is typically formulated as improvement in prediction capability, so it's like a derivative of more standard unsupervised learning (and thus probably a component of that system). But that alone isn't so great at learning for the roughly half of the brain involved in action/motor/decision/planning. Some form of 'empowerment' criterion - specifically maximization of mutual information between actions and future world state (or observations, but the former is probably better) - is a more robust general learning signal for action learning, and seems immune to the problems that plague pure curiosity approaches, like the Rule 110 type issues you mention.
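As a rough illustration of "mutual information between actions and future world state", the sketch below computes I(A; S') in bits for a tiny discrete setting. The two toy situations (`open_field` and `boxed_in`), the action names, and the helper functions are all hypothetical. Note that empowerment proper is the maximum of this quantity over action distributions; the code only evaluates it under a uniform distribution, which is a lower bound in general and happens to attain the maximum in these two deterministic examples.

```python
import math
from collections import Counter


def mutual_information(action_probs, transition):
    """I(A; S') in bits for a discrete action -> next-state channel.

    action_probs: dict mapping action -> p(a)
    transition:   dict mapping action -> dict of next_state -> p(s' | a)
    """
    # Marginal over next states: p(s') = sum_a p(a) * p(s' | a)
    marginal = Counter()
    for a, pa in action_probs.items():
        for s, ps in transition[a].items():
            marginal[s] += pa * ps

    mi = 0.0
    for a, pa in action_probs.items():
        for s, ps in transition[a].items():
            if pa > 0 and ps > 0:
                mi += pa * ps * math.log2(ps / marginal[s])
    return mi


def uniform(actions):
    """Uniform distribution over the given actions."""
    actions = list(actions)
    return {a: 1.0 / len(actions) for a in actions}


# Hypothetical toy situations. In the open field each move reaches a distinct cell;
# when boxed in, every move leaves the agent where it is.
open_field = {"up": {"N": 1.0}, "down": {"S": 1.0},
              "left": {"W": 1.0}, "right": {"E": 1.0}}
boxed_in = {a: {"here": 1.0} for a in open_field}

mi_open = mutual_information(uniform(open_field), open_field)
mi_boxed = mutual_information(uniform(boxed_in), boxed_in)
print(mi_open, mi_boxed)
```

In the open field the agent's choice of action fully determines which of four states comes next, so the value is 2 bits; when boxed in, actions have no effect on the future and the value is 0. An agent trained to increase this quantity is pushed toward situations where its actions make a difference to the world, rather than toward patterns it can merely predict better.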

For example: dopamine release on winning a bet has nothing to do with innate drives, it's purely an empowerment type learning signal. This is actually just the normal learning system at work.

The brain is mostly explained by this core learning system (which perhaps has just two or three main components). The innate drives (hunger, thirst, comfort/pain, sex, etc) are completely insufficient as signals for training the brain. They are instead satisficing drives that quickly saturate. They are secondary learning signals, but moreover they also can directly control/influence behavior in key situations, like the emotional subsystems. (Naturally there are exceptions to typical saturation - for example, humans with a mutation that causes perpetual, unsatisfiable deep hunger, who thus think about food all day long.)

Empowerment that operates over learned world state also could support easy modulation - for example by up-weighting the importance of modeling humans/agents.

The altruism/empathic component isn't really like those innate drives (it's not really satisficing/saturating), and so instead is more core, part of the primary utility function and learning systems. (And it also probably involves its own neuromodulator component through oxytocin.)

I think that intrinsic motivation in both humans & AGIs needs to be supplemented by a "drive to pay attention to humans", which in humans is based on superficial things like an innate brainstem circuit that disproportionately fires when hearing human speech.

Human infants grow up around humans who spend a large amount of time talking near the child. It's actually a dominant component of the audio landscape human infants grow up in. Any reasonably competent UL system will learn a model of human speech just from this training data (and ML systems prove this). Any innate human-speech brainstem circuit is of secondary importance - perhaps it speeds up learning a bit (like the simple brainstem face detector that helps prime the cortex), but it simply cannot be necessary - as that would be incompatible with everything we know about the powerful universal learning capability of the brain.

Then once the brain has learned a recognition model of human speech, empowerment based learning is completely sufficient to learn speech production motor skills, simply by learning to maximize the mutual information between larynx motor actions and future predicted human speech audio world state. Again the brain may use some tricks to speed up learning, but the universal learning system is doing all the heavy lifting.

We also disagree about "drive for having high social status / impressing my friends": You think it's purely a special case of "intrinsic motivation" and thus requires no further explanation,

Once a child has learned a model of other humans - parents, friends, general models of other 'kids', etc. - the empowerment system naturally then tries to learn ways to control these agents. This is so difficult that it basically drives a huge chunk of subsequent learning for most people, and becomes social theory of mind and innate 'game theory'. Social status is simply a proxy measure for influence, so it's closely correlated with - or even just the same as - maximization of mutual info between actions and future agent beliefs (i.e. empowerment). If you think of what the word influence means, it's actually just a definition of a specific form of empowerment.

Other low-level drives:

The ancient innate satisficing drives are what I think of as the low-level drive category (hunger, thirst, pain, sex, etc).

And finally the core emotions (happiness, sadness, fear, anger) are a third category. They are ancient subsystems that are both behavioral triggers and learning modulators. Happiness/sadness are just manifestations of predicted utility, whereas fear and anger are innate high-stress behavior modes (flight and fight responses). Humans then inherit more complex triggers - such as the injustice/righteousness triggers for anger, and more complex derived emotions.

I would put altruism/empathy in its own category, although it's also obviously closely connected to the emotion of love. Implementation wise it results in mixing of the learned utility functions of external agents into the agent's own root utility function. It is essentially evolved alignment. There are good reasons for this to evolve - basically shared genes and disposable somas, and we'll want something similar in AGI. It's a social component in the sense that it needs to connect the learned models of external agents to the core utility function.

I think we agree that AGI can have some or all of those human social instincts, but only if the AGI designers put them in, which would require (1) more research to nail down exactly how they're implemented, (2) advocacy etc.

We want to align AGI, and the brain's empathic/altruistic system could show us a practical way to achieve that. I don't see much role for the other emotional circuitry or innate drives. So we mostly agree here except you seem more interested in various 'social instincts' beyond just empathy/altruism (alignment).

Where does that leave anthropomorphism?

I believe humans (and more specifically high-impact humans) are mostly explained by a universal/generic learning system optimizing for a few things: mainly some mix of empowerment, curiosity, and altruism/empathy. There are many other brain systems (innate drives, emotions, etc), but they aren't so relevant.

I also believe brains are efficient, and thus AGI will end up being brain-like - specifically it will also be mostly understandable as a universal neural learning system optimizing for some mix of empowerment, curiosity, and altruism/empathy or equivalents. There may be some other components, but they aren't as important.

Goals and values are complex learned concepts. Initial AGI will not reinvent all of human cultural history, and instead will just absorb human values - as they emerge from a universal learning system training on human world experience data, and AGI will have a similar universal learning system and similar experience training data. This doesn't imply AGI will have the exact same values as some typical mix of humans, only that its values will be mostly sampled from within the wide human-set.

From the original comment I was replying to (from Jon Garcia, not you):

There is no reason to think that the first AGIs will have goal/value structures any less alien to humans than would a superintelligent spider

There are deep reasons to believe AGI will be more anthropomorphic than not - mostly created in the image of humans. AGI will be much closer to a human mind than some hypothetical superintelligent spider.