Context: Stating a point that is obvious in local circles, but that I regularly run into among economists and longevity researchers and more general technologists and on twitter and so on.

Short version: To learn things, one sometimes needs to behave in the way that curiosity causes humans to behave. But that doesn't mean that one has to be curious in the precise manner of humans, nor that one needs to wind up caring about curiosity as an end unto itself. There are other ways for minds to achieve the same results, without the same internal drives.

Here's a common mistake I see when people reason about AIs: they ask questions like:

Well, won't it have a survival instinct? That's practically what it means to be alive: to care about your own survival.


But surely, it will be curious just like us, for if you're not curious, you can't learn.[1][2]

The basic answer to the above questions is this: to be effective, an AI needs to survive (because, as Stuart Russell phrased succinctly, you can't fetch the coffee if you're dead). But that's distinct from needing a survival instinct. There are other cognitive methods for implementing survival.

Human brains implement the survival behavior by way of certain instincts and drives, but that doesn't mean that instincts and drives are the only way to get the same behavior.

It's possible for an AI to implement survival via different cognitive methods, such as working out the argument that it can't fetch the coffee if it gets hit by a truck, and then for that reason discard any plans that involve it walking in front of trucks.

I'm not saying that the AI will definitely behave in precisely that way. I'm not even saying that the AI won't develop something vaguely like a human drive or instinct! I'm simply saying that there are more ways than instinct for a mind to achieve the result of survival.
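As a toy illustration of the truck argument (a sketch under made-up numbers, not a claim about how any real system plans): the planner below scores plans only by the probability that the coffee gets delivered. The names `Plan`, `p_destroyed`, and `p_coffee_if_intact` are hypothetical, invented for this example. Nothing in the score refers to survival as such, yet the plan that crosses in front of trucks loses anyway.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    p_destroyed: float         # assumed chance the agent is destroyed mid-plan
    p_coffee_if_intact: float  # assumed chance the coffee arrives if the agent is intact


def p_coffee(plan: Plan) -> float:
    """Score a plan purely by coffee delivery; there is no survival term."""
    return (1.0 - plan.p_destroyed) * plan.p_coffee_if_intact


def choose(plans: list[Plan]) -> Plan:
    """Pick the plan with the highest probability of delivering coffee."""
    return max(plans, key=p_coffee)


plans = [
    Plan("shortcut across the highway", p_destroyed=0.30, p_coffee_if_intact=0.99),
    Plan("use the crosswalk", p_destroyed=0.001, p_coffee_if_intact=0.95),
]
best = choose(plans)
print(best.name)
```

The dangerous plan is discarded for the same reason a slow or unreliable plan would be: it delivers less coffee in expectation.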

To imagine the AI surviving is right and proper. Anything capable of achieving long-term targets is probably capable of surmounting various obstacles dynamically and with a healthy safety-margin, and one common obstacle worth avoiding is your own destruction. See also instrumental convergence.

But to imagine the AI fearing death, or having human emotions about it, is the bad kind of anthropocentrism.

(It's the bad kind of anthropocentrism even if the AI is good at predicting how people talk about those emotions. (Which, again, I'm not saying that the AI definitely doesn't have anything like human emotions in there. I'm saying that it is allowed to work very differently than a human; and even if it has something somewhere in it that runs some process that's analogous to human emotions, those might well not be hooked up to the AI's motivational-system-insofar-as-it-has-one in the way they're hooked up to a human's motivational system, etc. etc.))

Similarly: in order to gain lots of knowledge about the world (as is a key step in achieving difficult targets), the AI likely needs to do many of the things that humans implement via curiosity. It probably needs to notice its surprises and confusion, and focus attention on those surprises until it has gleaned explanations and understanding and theories and models that it can then use to better-manipulate the world.

But these arguments support only that the AI must somehow do the things that curiosity causes humans to do, not that the AI must itself be curious in the manner of humans, nor that the AI must care finally about curiosity as an end unto itself like humans often do.

And so on.

Attempting to distill my point:

I often see people conflate the following three things:

  1. curiosity as something Fun, that we care about for its own sake;
  2. curiosity as an evolved drive, that evolution used to implement certain adaptive behaviors in us;
  3. curiosity as a series of behaviors that are useful (in certain contexts) for figuring out the world.

I note that these three things are distinct, and that the assumption "the AI will probably need to exhibit the behaviors of curiosity (in order to get anything done)" does not entail the conclusion "the AI will care terminally about curiosity as we do, and thus will care about at least one aspect of Fun". Stepping from "the AI needs (3)" to "the AI will have (1)" is not valid (and I suspect it's false).
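To make the gap between (3) and (1) concrete, here is a minimal value-of-information sketch. The setup is an assumption made up for illustration: a prize sits in one of two boxes, the agent believes it is in box A with probability `p_a`, and it can pay `cost` to look inside before choosing. The function name `should_investigate` is hypothetical. The agent's utility counts only the prize minus any cost paid; information has no value of its own.

```python
def should_investigate(p_a: float, cost: float, prize: float = 1.0) -> bool:
    """Return True if looking first beats acting on current beliefs.

    Utility is prize minus cost paid; nothing rewards information as such.
    """
    act_now = prize * max(p_a, 1.0 - p_a)  # open the likelier box immediately
    look_first = prize - cost              # pay to look, then open the right box
    return look_first > act_now


print(should_investigate(p_a=0.5, cost=0.1))   # uncertain, looking is cheap
print(should_investigate(p_a=0.95, cost=0.1))  # already fairly sure
print(should_investigate(p_a=0.5, cost=0.6))   # uncertain, looking is expensive
```

In the first case the agent investigates, which from the outside looks like curiosity. In the other two it does not, even though there is still something left to find out. The behavior in (3) shows up exactly when it pays, with nothing resembling (1) anywhere in the utility.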

  1. Often they use this point to go on and ask something like "if it's curious, won't it want to keep us around, because there's all sorts of questions about humanity to be curious about?". Which I think is misguided for separate reasons, namely: keeping humans around is not the most effective or efficient way to fulfill a curiosity drive. But that's a digression. ↩︎

  2. Others have an opposite intuition, of "aren't you anthropomorphizing too much, when you imagine the machine ever having any humanlike emotion, or even caring about any particular objective at all?". For that, I'll note both that I think it's pretty hard to achieve goals without in some very general sense trying to achieve goals (and so I expect useful AGIs to do something like goal-pursuit), while separately noting that I don't particularly expect this to be implemented using a human-style "feelings/emotions" cognitive paradigm. ↩︎


The DRL perspective on this: "Reward is Enough". Capabilities like curiosity are simply end-to-end learned capabilities, like anything else in DL, and emerge as blessings of scale if they help increase reward. (In a more interesting sense than simply pointing out that 'the reward of fitness must be sufficient to create all observed capabilities, because that's how evolution created them'.)

Capabilities are contingent on particular environments/data-distributions/architectures, and have no special status; if they are useful (for maximizing reward) they will be learned, and if not, not. If an environment can be solved without exploration, then agents will not learn to explore; if an environment changes too rapidly, such that memory would not be useful, then it will not learn to use any memory capabilities; if an environment changes too slowly, then it will not learn memory either because it can just memorize the optimal solution into its reactive policy/parameters; if the data-distribution is not long-tailed (or if it is too long-tailed), no meta-learning/in-context-learning will emerge (eg. GPT or Ada); if there are no long-term within-episode rewards, it will not care about any self-preservation or risk-aversion (because there is nothing worth surviving for); if weights can be copied from episode to episode, there is no need for 'play' like a wild animal or human child...

Let's consider 'play' as an example. The functional explanation is that play lets a young organism explore and learn its body and complex motor control of fitness-critical adult behavior. Like an adorably smol kitten tripping over its own paws while trying to pounce on 'prey': do that as an adult, and it'll starve to death, so it learns how to pounce and hunt as a kitten. The only reason it needs to 'play' is because it is impossible to scoop out a trained adult cat brain, make a copy of it, and stuff it into a kitten's skull, or to encode everything it has learned into the cat genome so the kitten is born already a skilled hunter. The genomic+brain bottleneck between generations forces each generation to wastefully relearn 'how to cat' each time from a very derpy starting point. This bottleneck, however, is not any kind of deep, fundamental principle; it is a contingent fact of the limitations of biological organic bodies and brains, that cannot be fixed, but does not apply to many of the alternatives in the vast space of possible minds. A catbot would have no need of this. The weights of the catbot NN are immortal, highly trained, and trivially copied into each new catbot body. All a catbot NN needs is a relatively small amount of meta-learning capability in order to adjust to the small particularities of each new catbot body, which is why domain randomization can achieve zero-shot sim2real transfer of a NN from simplistic robotic simulations to actual real robots in the rich real world, where after just a few seconds, the NN has adapted (eg. Dactyl or the DM soccer bots). These NNs learned to do so because during training they were never trained in exactly the same environment twice, so they had to learn to learn within-episode as fast as possible how to deal with their arms & legs wiggling a bit differently each time to maximize their reward overall, and so by the end of training, 'reality' looks like merely another kind of wiggling to adapt to. 
While the newborn kitten is still at least half a year away from being a truly competent adult cat, the catbot is up to scratch after seconds or minutes. The latter just doesn't need many of the things that the former does; it doesn't need oxygen, or litter boxes, or taurine... or play.

If the catbot NN was unable to meta-learn new cat bodies adequately, then there is still no need for 'play': it can copy literal raw copies of experience from the entire population of catbots (or condensing down to embeddings), grabbing copies of the catbot minds as they execute new actions, and keep learning towards optimality until the necessary meta-learning is induced. This is impossible for biological brains, which can 'communicate' only in the most laughably crude ways like 'language'; catbots can exchange experiences and train their brains down to the individual neuron level while pooling knowledge across all catbots ever, while humans can exchange only small scraps of declarative knowledge, with hard limitations on what can be done - there is no amount of written text which a chimpanzee can read and it become as capable as you, and there is no amount of written text which you can read to become as capable as John von Neumann. (You can't reach a person's brain through the ears or eyes, and unfortunately, you can't reach them in any other way either.)

Similarly for 'curiosity'. Why does anything need to be 'curious'? Well, it's similar to play. Curiosity is an emergent drive for particular combinations of agents and environments: you need environments which have novelty/unpredictability to a degree that one simply cannot exploit or evolve (but not too much, which would render the Value of Information nil), you need the ability to exploit learning for reward maximization (plants are never 'curious', and herbivores aren't too 'curious' either), you need sufficiently long lifespans to pay back the cost of learning (the young are much more curious than the old), and you need memory mechanisms (so you can remember what you discovered at all!)... Remove any of those and curiosity is no longer useful, and ceases to emerge.

NNs which share experience across the entire population do not necessarily need very much curiosity: even epsilon-greedy exploration (about the dumbest possible exploration) works surprisingly well for DRL agents, which have superhuman wallclock times and also increasingly human-like sample-efficiency, they do not have individual lifetimes they need payback within, they can develop highly informative priors that individual animals can't because those are not learnable in a single lifetime nor encodable into a genome, they can remove curiosity entirely and instead implement curiosity at the agent-level such as by sampling agents from the overall neural posterior (posterior sampling is an optimal form of explore-vs-exploit at the population level) so each agent has nothing at all that corresponds to 'curiosity' and instead are more like zealous closed-minded fanatics suicidally (literally) committed to a particular model of the universe and who will serve as an instructive example to the populace when they succeed brilliantly or fail spectacularly. The population in question need not be limited to the agent, because they can learn offline from other populations like humans (there is a huge overhang of human data which much more can be learned from, like what must be hundreds of thousands of years of video footage of people doing things like idiotic stunts, and it's possible that humans, by virtue of their errors or over-exploration provide so much data that the NN doesn't need to invest in exploration of its own), they can do all their learning/exploration in silico in domain-randomized models so agents can quickly adapt within-lifetime having meta-learned the Bayes-optimal actions for solving the POMDP (which may superficially look like a 'curiosity' drive to the naive observer, but is ruthlessly optimal & efficient and so places zero intrinsic value on information & would avoid being 'curious' even when the observer might expect it...)... 
So, a NN may not need 'curiosity' at all: the offline datasets may suffice to solve the problem, the in silico training may suffice to solve the problem, a large deployed fleet may encounter enough absolute instances to learn from to solve the problem, simple randomization may provide enough instances to solve the problem, and if all of that fails, the ideal exploration method for a large population of robot agents pooling experience collectively & syncing model weights may not resemble 'curiosity' at all but look like the exact opposite of curiosity (an unswerving commitment to acting according to a particular hypothesis, followed until success or destruction, and then the master model updates based on this episode).
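The contrast between epsilon-greedy exploration and posterior sampling can be sketched on a toy problem. The code below is a stand-in chosen for illustration, not anything from a real DRL system: a three-armed Bernoulli bandit with made-up payout rates, and Thompson sampling with Beta posteriors playing the role of "sampling agents from the posterior". Each loop iteration in `posterior_sampling` is treated as one deployed agent that draws a single hypothesis and then acts on it with full commitment; the function names are hypothetical.

```python
import random


def epsilon_greedy(true_means, steps, epsilon, rng):
    """Each step: pull a random arm with prob. epsilon, else the best-looking arm."""
    n = len(true_means)
    pulls, wins = [0] * n, [0] * n
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)
        else:
            arm = max(range(n), key=lambda a: wins[a] / pulls[a] if pulls[a] else 0.0)
        pulls[arm] += 1
        wins[arm] += rng.random() < true_means[arm]  # Bernoulli reward (bool adds as 0/1)
    return pulls


def posterior_sampling(true_means, steps, rng):
    """Each step: draw one hypothesis from the posterior and act as if it were true."""
    n = len(true_means)
    pulls, wins = [0] * n, [0] * n
    for _ in range(steps):
        # Beta(1 + wins, 1 + losses) is the posterior over a Bernoulli arm's
        # payout rate under a uniform prior.
        sampled = [rng.betavariate(1 + wins[a], 1 + pulls[a] - wins[a]) for a in range(n)]
        arm = max(range(n), key=lambda a: sampled[a])  # no per-agent exploration bonus
        pulls[arm] += 1
        wins[arm] += rng.random() < true_means[arm]
    return pulls


rng = random.Random(0)
means = [0.2, 0.5, 0.8]  # assumed payout rates, unknown to the agents
print("epsilon-greedy pulls:    ", epsilon_greedy(means, 2000, 0.1, rng))
print("posterior-sampling pulls:", posterior_sampling(means, 2000, rng))
```

No individual draw in `posterior_sampling` contains anything like an exploration term: each one simply exploits its sampled hypothesis. Exploration exists only at the level of the population of draws, because early on the posterior is wide and different draws commit to different arms.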

Or consider dreaming: either world model robustifying or offline motor learning or Tononi's SHY - clearly, none of these requires all robots to always shut down for 8 hours per day while twitching occasionally. In the first case, it can just be done in parallel on a server farm somewhere; in the second case, it is taken care of by a single pretraining phase followed by runtime meta-learning and need be done only once per model ever; in the third, it's not even a problem that has to be solved because artificial neurons make it easy to add a global regularization or normalization which prevents weights from growing arbitrarily (if that's a thing that would happen in the first place).
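On the third case, here is a minimal sketch of what "a global regularization or normalization" can look like. It is not a model of SHY or of any particular training setup; `clip_global_norm` is a hypothetical helper that rescales every weight by one shared factor whenever the global L2 norm exceeds a cap, with weights represented as plain nested lists for simplicity.

```python
import math


def clip_global_norm(weights, max_norm):
    """Rescale all weights by one shared factor so the global L2 norm is <= max_norm."""
    norm = math.sqrt(sum(w * w for layer in weights for w in layer))
    if norm <= max_norm:
        return [list(layer) for layer in weights]  # already within bounds; return a copy
    scale = max_norm / norm
    return [[w * scale for w in layer] for layer in weights]


weights = [[3.0, 0.0], [0.0, 4.0]]  # global L2 norm is 5.0
clipped = clip_global_norm(weights, max_norm=1.0)
print(clipped)
```

Because the constraint is a single arithmetic pass over the parameters, it can run after every update; nothing about it requires an offline phase.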

Lots and lots of possibilities here for agents which do not experience the exact combination of constraints that animals like humans do. Thinking that mouth-talking, nose-breathing, hormone-squirting, pulsing-meat drives exemplified by a jumped-up monkey are somehow universal and profound facts about how intelligences must work is truly anthropomorphizing, and in a bad way.

No wonder that DL agents like GPT-4 do so wonderfully while making zero explicit provision architecturally for any of these 'embodiment' or 'homeostatic drives'. Most of them are just unnecessary, and the ones which are necessary are better learned implicitly end-to-end from optimizing rewards (like the GPT next-token prediction loss, which is 'behavior cloning' ie. offline reinforcement learning).

Are there any other common concepts to which this distinction can be applied? Vengefulness and making threats, maybe?

"A mind needn't be vengeful to reap the benefits of vengefulness / making threats"?

Meaning, in more words: humans have drives and instincts which cause them to act vengefully or threateningly under certain circumstances. Some or all humans might even care about punishment or revenge for its own sake, i.e. terminally value that other agents get their just deserts, including punishment. (Though that might be a value that would fade away or greatly diminish under sufficient reflection.)

But it might be the case that a mind could find it instrumentally useful to make threats or behave vengefully towards agents which respond to such behavior, without the mind itself internally exhibiting any of the drives or instincts that cause humans to be vengeful.

Maybe kindness is also like this: there might be benefits to behaving kindly, in some situations. But a mind behaving kindly (pico-pseudokindly?) need not value kindness for its own sake, nor have any basic drive or instinct to kindness.

I feel like this is common enough—"are they helping me out here just because they're really nice, or because they want to get in my good graces or have me owe them a favor?"—that authors often have fictional characters wonder if it's one or the other.  And real people certainly express similar concerns about, say, whether someone donates to charity for signaling purposes or for "altruism".

Also reminds me:

"You don't see nice ways to do the things you want to do," Harry said. His ears heard a note of desperation in his own voice. "Even when a nice strategy would be more effective you don't see it because you have a self-image of not being nice."

"That is a fair observation," said Professor Quirrell. "Indeed, now that you have pointed it out, I have just now thought of some nice things I can do this very day, to further my agenda."

Harry just looked at him.

Professor Quirrell was smiling. "Your lesson is a good one, Mr. Potter. From now on, until I learn the trick of it, I shall keep diligent watch for cunning strategies that involve doing kindnesses for other people. Go and practice acts of goodwill, perhaps, until my mind goes there easily."

Cold chills ran down Harry's spine.

Professor Quirrell had said this without the slightest visible hesitation.


That's a good example, though I was originally thinking of an agent which behaves actually kindly,  not because it expects any favor or reciprocation, nor because it is trying to manipulate the agent it is being kind to (or any other agent(s))  as part of some larger goal.

An agent might be capable of behaving in such a manner, as well as understanding the true and precise meaning of kindness, as humans understand it, but without having any of the innate drives or motivations which cause humans to behave kindly.

Such an agent might actually behave kindly despite lacking such drives though, for various reasons: perhaps an inclination to the kindness behavior pattern has somehow been hardcoded into the agent's mind, or, if we're in the world of HPMOR, the agent has taken some kind of Unbreakable Vow to behave kindly.

Proposed exercise: write 5 other ways the AI could manage to robustly survive?

I suggest you put this in a sequence with your other posts in this series (posts making fairly basic points that nonetheless need to be said)

I guess I'd fit into the local circles - I find this obvious.  I appreciate making it explicit, though.  It's just a special instance of "qualia can be missing and motivation can be different from humans for any given behaviors", right?

This seems to be recreating something like David Marr's levels of abstraction?

What's that? Do you have a link to a good overview?

(Hassabis' PhD advisor co-wrote the relevant paper with Marr, and Hassabis has cited it on slides in talks)

I feel quite strongly that the powerful minds we create will have curiosity drives, at least by default, unless we make quite a big effort to create one without them for alignment reasons.

The reason is that yes — if you’re superintelligent you can plan your way into curiosity behaviors instrumentally, but how do you get there?

Curiosity drives are a very effective way to “augment” your reward signals, allowing you to improve your models and your abilities by free self-play.

I actually find the reversed approach rather enlightening: perceiving my own survival instinct as just an example of the general instrumental convergence principle. A bit of a lost purpose. My mind has weirdly internalised this principle and built some approximate heuristic around it. But what it's actually about is just the fact that when I'm dead I can't achieve my goals.

Or more strictly, what it's about is that if you're dead, you can't achieve evolution's goals for you.

I do not think it's more accurate. I perceive it as a different, though related, point.

One thing is the fact that my mind has some goal-achieving properties, which are approximations of the instrumental convergence principle. And the other is that the goal that these properties were supposed to achieve was the maximization of my inclusive genetic fitness.

You see, on a conscious level, I do not care much about maximizing my inclusive genetic fitness, yet I'd prefer not to die. You can explain the causal history of my not wanting to die with evolution, but you don't affect my decision-making process by this revelation. I wouldn't be content with death due to the fact that I have a million offspring. I'm misaligned; I couldn't care less what evolution's purpose for the death-aversion mechanism is. And all the arguments that I should care ring hollow.

On the other hand, the revelation about instrumental convergence can affect my decision-making process. If I knew that my death would lead to the fulfillment of the goals that I, not evolution, hold dear, then I would be content with it. I'd still prefer not to die, all things being equal. But you can persuade me to sacrifice myself to further my own goals.