In this post, I want to briefly propose a semi-novel direction for alignment research that I'm excited about. Though some of these ideas are not brand new—they purposefully bear resemblance to recent (highly promising) work in shard theory and Steve Byrnes’s approach to brain-based AGI safety—I think my emphases are sufficiently different so as to justify a more thorough explanation.

Why are humans 'worthy' of being in the loop?

I think the following three claims help motivate the general research direction I have in mind. 

1) Many of the most coherent AI safety strategies proposed to date (e.g., HCH, imitative and approval-based amplification, recursive reward modeling, and more) involve human decision-makers in some meaningful capacity. I claim, therefore, that these proposals implicitly presuppose that there are specific algorithmic properties of the human mind/brain that make us comfortable entrusting these ‘humans in the loop’ with the task of minimizing the likelihood of AI-induced bad outcomes. This idea is demonstrated especially clearly by ‘safety via debate,’ for instance: 

Diagram from An overview of 11 proposals for building safe advanced AI, with my annotation in black.

2) I think the special brain algorithms in question—e.g., the ones that make us comfortable entrusting a neurotypical human to decide who won in the set-up above—are more familiarly thought of as prosocial or moral cognition. A claim like this would predict that we would be uncomfortable entrusting humans who lacked the relevant prosocial instincts (e.g., psychopaths) to oversee a safety-via-debate-type set-up, which seems correct. I think the reason it is so natural to want to incorporate neurotypical human decision-makers into alignment proposals is that we are confident (enough) that such decisions will be made carefully—or at least more carefully than if there were no humans involved. In other words, individual humans in the loop are entrusted-by-default to serve as competent advocates for the interests of society at large (and are more than likely aware that they are serving this role), able to infer suspicious behavior, evaluate subtle short- and long-term predicted consequences, be humble about said evaluations, probably solicit second opinions, etc.—somehow!

3) Our understanding of human prosocial cognition is growing increasingly precise and predictive. In cognitive neuroscience writ large, computational modeling has become a dominant approach to understanding the algorithms that the human brain instantiates, with talented researchers like Joshua Tenenbaum, Robb Rutledge, and Anne Collins leading the charge. This work has taken off in recent years and has enabled cognitive scientists to define and test hypotheses that are unprecedented in their mathematical precision and predictive power. Here are two good introductory examples (1, 2) of this sort of work that pertain explicitly to human prosocial cognition. I expect my future writing—and that of others interested in this sort of approach—to feature far more examples of good alignment-relevant computational social neuroscience research.
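To give a concrete flavor of what 'computational' means here, below is a minimal sketch (in Python; the parameter names and values are my own illustrative assumptions, not taken from any particular paper) of the kind of model used in prosocial learning studies: a simple reinforcement learner with separate learning rates for outcomes earned for oneself versus for another person, so that an individual's 'prosocial learning rate' can be estimated from their choices and related to trait measures like empathy.

```python
import numpy as np

def simulate_prosocial_learner(rewards_self, rewards_other,
                               alpha_self=0.3, alpha_other=0.15,
                               beta=3.0, seed=0):
    """Toy reinforcement learner with separate learning rates for outcomes
    earned for oneself vs. for another person (all values illustrative).

    rewards_self, rewards_other: arrays of shape (n_trials, 2) giving the
    reward each of two options would deliver to self / other on each trial.
    Returns the sequence of choices (0 or 1).
    """
    rng = np.random.default_rng(seed)
    q = np.zeros(2)  # learned value of each of the two options
    choices = []
    for t in range(len(rewards_self)):
        # softmax choice between the two options
        p_choose_1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))
        c = int(rng.random() < p_choose_1)
        choices.append(c)
        # prediction errors for the self- and other-regarding parts of the
        # outcome, each weighted by its own learning rate
        pe_self = rewards_self[t, c] - q[c]
        pe_other = rewards_other[t, c] - q[c]
        q[c] += alpha_self * pe_self + alpha_other * pe_other
    return np.array(choices)

# Example: option 1 pays the other person more; a learner with a larger
# alpha_other comes to prefer it more strongly.
rng = np.random.default_rng(1)
r_self = rng.integers(0, 2, size=(200, 2)).astype(float)
r_other = np.tile(np.array([0.0, 1.0]), (200, 1))
print(simulate_prosocial_learner(r_self, r_other, alpha_other=0.4).mean())
```

Fitting models of roughly this shape to behavior is what lets researchers make quantitative, testable claims about prosocial cognition rather than purely verbal ones.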

With these ideas in mind, my proposal is to conduct technical research to better understand these prosocial brain algorithms with the ultimate goal of instantiating some refined version of them directly into a future AGI (which is likely doing something approximating model-based RL). Here is a pretty straightforward two-step plan for doing so:

  • Synthesize high-quality social neuroscience literature—with a particular focus on good computational modeling work—in order to more fully develop a rigorous account of the most important algorithms underlying human prosocial behavior. 
  • Develop specific corrigibility proposals for instantiating these algorithms in AI systems in a maximally realistic and competitive manner.

In other words, I'm basically proposing that we try to better understand the properties of human cognition that render humans 'worthy of being in the loop,' and proceed to apply this understanding by instantiating the relevant computations directly into our eventual AGI. If successful, such an approach might even obviate the need for having a human in the loop in the first place—'cutting out the middleman,' as it were. 
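To be a bit more concrete about what 'instantiating the relevant computations' could even mean, here is a deliberately toy sketch. Every function name, term, and weight below is a hypothetical placeholder of mine rather than a worked-out proposal; the point is only to show where prosocial machinery might sit relative to the agent's task objective.

```python
from dataclasses import dataclass

@dataclass
class ProsocialWeights:
    """Hypothetical knobs corresponding to distinct prosocial 'instincts'."""
    care: float = 1.0      # weight on predicted welfare of other agents
    honesty: float = 0.5   # penalty weight on predicted belief-distortion in others
    humility: float = 0.2  # bonus for deferring / soliciting oversight when uncertain

def shaped_reward(task_reward, predicted_other_welfare,
                  predicted_deception, deferred_to_oversight,
                  w: ProsocialWeights):
    """Combine task reward with prosocial terms.

    All 'predicted_*' quantities are assumed to come from the agent's own
    world model (this is the hard, unsolved part); the sketch only shows the
    prosocial machinery entering as structured terms in the agent's objective
    rather than as an external human check.
    """
    return (task_reward
            + w.care * predicted_other_welfare
            - w.honesty * predicted_deception
            + w.humility * float(deferred_to_oversight))

# Example: an action that scores well on the task but is predicted to harm
# others and mislead them ends up with low shaped reward.
r = shaped_reward(task_reward=1.0, predicted_other_welfare=-2.0,
                  predicted_deception=0.8, deferred_to_oversight=False,
                  w=ProsocialWeights())
print(r)  # 1.0 - 2.0 - 0.4 + 0.0 = -1.4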
 

Responses to some anticipated objections

Perhaps this plan sounds like a long shot to you! Here are some plausible reasons I think one might be skeptical of such an approach, followed by what I hope are plausible responses.

Plausible critique #1: Human value-based cognition/moral reasoning ain’t all that. 

Humans are jealous, status-focused, painfully short-sighted, perennially overconfident, and often ethically confused—why would we ever want to instantiate the cognitive machinery that gives rise to these features in a superintelligent AI? 

One very simple reply to this concern is that we expect to be able to pick and choose from the set of prosocial brain algorithms—e.g., we might want to emulate the computational machinery that gives rise to altruistic motivations, but we'll probably skip out on whatever computational machinery gives rise to envy. 

I honestly suspect that this reply might be too simple—that a non-trivial number of prosocial brain algorithms exhibit a double-edged quality. For instance, if you want an AGI that exhibits the capacity to truly love its neighbor, this may necessarily imply an obverse capacity to, say, hate the person who kills its neighbor. Moral cognition is often messy in this way. (I definitely recommend Paul Bloom’s book Against Empathy for more on this general point.) To the degree it is possible to maximally sample the good and minimally sample the bad from prosocial brain algorithms, I obviously think we should do so. I am just skeptical of how far such an approach would take us in avoiding the ‘human values are too grey-area to copy’ critique.  

Ultimately, I think that the following is a stronger reply to this worry: the fact that human brains are able to criticize their own cognitive architecture, as is evidenced by this very criticism, is strong evidence for the plausibility of leveraging prosocial brain algorithms for alignment. 

In other words, the fact that prosocial algorithms in brains recursively accommodate the fact that prosocial algorithms in brains are sometimes suboptimal (e.g., the capacity to think thoughts like “loving one's neighbor is a double-edged sword”) is, I claim, a highly desirable property of prosocial brain algorithms. This ability to critically inspect one’s own values may well be the most important prosocial algorithm to pin down! Prosocial brains don’t just do things like share resources and shun cheaters—they challenge their own preconceived assumptions, they are motivated to self-improve, and they often are aware of their own shortcomings. The EA community is perhaps one of the better examples of this particular sort of prosocial behavior on abundant display. (Notice also that these specific self-awareness-type properties seem to do a lot of explanatory work for understanding why we trust humans to, say, mediate a safety-via-debate set-up.) In essence, the ‘human moral reasoning ain’t all that’ critique ignores that human moral reasoning is itself responsible for generating this critique!


Plausible critique #2: Human values attached to superintelligence may look far more dangerous than human values attached to human-level intelligence. 

I think there are two important and related responses to this. The first is orthogonality—i.e., that an agent’s intelligence and an agent’s goals/values (roughly speaking) vary independently. If this is true, then it follows that for any set of goals/values, we generally should not expect that increasing the intelligence of the agent who holds these goals/values will categorically modulate what these goals/values look like when implemented (so long as the agent is originally competent enough to actually achieve its goals/values). The second, related response is empirical: increased intelligence within humans is not associated with reduced prosocial behavior (some studies even report a weak positive correlation between human intelligence and prosociality, especially in morally complex decision domains—which makes sense). This seems to demonstrate that whatever brain algorithms motivate prosociality are not significantly altered by increases in general intelligence. Young neurotypical children (and even chimpanzees!) instinctively help others accomplish their goals when they believe they are having trouble doing so alone—as do most people on the far right tail of the IQ distribution. This all suggests to me that ‘attaching’ the right implementations of the right prosocial algorithms to generally intelligent systems should not suddenly render unrecognizable the class of behavior that emerges as a result of these algorithms.


Plausible critique #3: AGI won't be doing anything that looks like RL, so this sort of approach is basically useless!

I think understanding human prosociality in computational terms would be useful for alignment even if what is discovered cannot be instantiated straightforwardly into the AGI. For example, understanding the computational underpinnings of the human motivation to communicate honestly may clue us in to the fact that some model is deceptively aligned.

More to the point, however, I think it is sufficiently likely that AGI will be doing something like RL such that proceeding under this assumption is a reasonable bet. Many of the qualities that seem necessary for AGI—planning, acting competently in a changing world, solving complex problems, pursuing goals, communicating coherently, etc.—also seem like they will have to leverage reinforcement learning in some form. (I'm definitely in favor of other researchers proceeding under different assumptions about the AGI's foundational learning algorithm(s), too, as I discuss at length here.) 

In summary, then, I think this critique doesn't quite work for two reasons—(1), even if we can't instantiate prosocial algorithms directly into the AGI, understanding them would still be useful for alignment, and (2), it is likely enough that AGI will be doing RL that 'let's-intelligently-interface-with-its-values'-style approaches are a worthwhile bet.

Plausible critique #4: We are simply too ignorant about the computational underpinnings of our values/moral decision-making to pursue this sort of research trajectory. The brain is just too damn complicated!

I won’t say too much about this critique because I’ve already shared some specific resources that cut against this sort of framing (see 3 above!). In general, it seems to me that many alignment researchers come from strong technical computer science backgrounds and may therefore be less familiar with the progress that has been made in recent decades in cognitive science. I think that perceptions like (a) the brain is a giant black box, (b) cognitive science is an unbearably ‘soft’ science, etc. are sorely outdated. The field has produced high-quality, falsifiable, and increasingly mathematical models of complex cognitive processes (i.e., models that make algorithmic sense of psychological and neurofunctional processes in one fell swoop), and leveraging these contributions when attempting to solve the alignment problem—the problem of getting powerful cognitive systems of our own making to behave themselves—seems essential in my view.


Conclusion

In general, I’d like to think of this research trajectory as an especially important subset of Steve Byrnes’s proposal to reverse-engineer human social instincts. I'm in strong agreement that Humans provide an untapped wealth of evidence about alignment, I'm very excited that there are yet others who are beginning to work concretely on this subproblem, and I'm hopeful that yet more people join in pursuing this sort of alignment research direction! 

I’d like to think that contained within the complex computational processes of the human brain exists a robust and stable solution to the problem of getting a generally intelligent agent to avoid existentially risky behavior, and I am excited to pursue this research agenda with this kind of brain-based treasure hunt in mind.

Dagon:

I think you're downplaying the importance of training data.  30-50 years of a human's experience in interacting with other humans, being judged and negotiating with humans, caring for and being cared for by humans is REALLY hard to reproduce or reduce to simpler utility functions.  

"human in the loop" to some extent translates to "we don't actually know why we trust (some) other humans, but there exist humans we trust, so let's delegate the hard part to them".

It's worth challenging a proponent of such a scheme EXACTLY how they'll select which humans to use for this.  Presumably they're not envisioning a lottery or a measured (on what dimensions?) median human with a little bit of training.  Maybe they're thinking a small committee of academic experts.  Maybe they're handwaving the idea that some sort of CEV exists and can be extracted and applied here.

Interesting! Definitely agree that if people's specific social histories are largely what qualify them to be 'in the loop,' this would be hard to replicate for the reasons you bring up. However, consider that, for example,

Young neurotypical children (and even chimpanzees!) instinctively help others accomplish their goals when they believe they are having trouble doing so alone...

which almost certainly has nothing to do with their social history. I think there's a solid argument to be made, then, that a lot of these social histories are essentially a lifelong finetuning of core prosocial algorithms that have in some sense been there all along.  And I am mainly excited about enumerating these. (Note also that figuring out these algorithms and running them in an RL training procedure might get us the relevant social histories training that you reference—but we'd need the core algorithms first.)

"human in the loop" to some extent translates to "we don't actually know why we trust (some) other humans, but there exist humans we trust, so let's delegate the hard part to them".

I totally agree with this statement taken by itself, and my central point is that we should actually attempt to figure out 'why we trust (some) other humans' rather than treating this as a kind of black box. However, if this statement is being put forward as an argument against doing so, then it seems circular to me.

I don't know of anyone advocating using children or chimpanzees as AI supervisors or trainers.  The gap from evolved/early-learning behaviors to the "hard part" of human alignment is pretty massive.

I don't have any better ideas than human-in-the-loop - I'm somewhat pessimistic about its effectiveness if AI significantly surpasses the humans in prediction/optimization power, but it's certainly worth including in the research agenda.

I don't know of anyone advocating using children or chimpanzees as AI supervisors or trainers.

I think you are talking past each other. The argument is not that children would be a good choice for AI trainers. The argument is that children (and chimpanzees) show pro-social behavior. You don't have to train chimps and children for 30 years until they figure out social behavior. 

If you want to replace competent humans as trainers, then yes; but having an AI that cares about humans would be a nice achievement too.

I think it's a relevant point.  Children and chimps show some kinds of behavior we classify as prosocial, fine.  But that's a motte which does NOT defend the bailey that "human-in-the-loop" is necessary because only evolution can generate these behaviors, OR that HITL is sufficient (or even useful) because we only need the kind of prosociality that seems close to universal.

Hi Cameron, nice to see you here : ) what are your thoughts on a critique like: human prosocial behavior/values only look the way they look and hold stable within-lifetimes, insofar as we evolved in + live in a world where there are loads of other agents with roughly equal power as ourselves? Do you disagree with that belief? 

Hi Joe—likewise! This relationship between prosociality and distribution of power in social groups is super interesting to me and not something I've given a lot of thought to yet. My understanding of this critique is that it would predict something like: in a world where there are huge power imbalances, typical prosocial behavior would look less stable/adaptive. This brings to mind for me things like 'generous tit for tat' solutions to prisoner's dilemma scenarios—i.e., where being prosocial/trusting is a bad idea when you're in situations where the social conditions are unforgiving to 'suckers.' I guess I'm not really sure what exactly you have in mind w.r.t. power specifically—maybe you could elaborate on (if I've got the 'prediction' right in the bit above) why one would think that typical prosocial behavior would look less stable/adaptive in a world with huge power imbalances?
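For what it's worth, the 'generous tit for tat' intuition is easy to make concrete in a toy simulation. Here is a minimal sketch (standard textbook prisoner's dilemma payoffs; the specific parameters are arbitrary choices of mine) comparing a mildly generous tit-for-tat player against an unconditional cooperator when opponents defect at various rates, roughly the 'unforgiving to suckers' regime:

```python
import random

# standard prisoner's dilemma payoffs for the row player: T > R > P > S
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def gtft_move(opp_last, generosity, rng):
    """Generous tit-for-tat: copy the opponent's last move, but forgive a
    defection with probability `generosity` (1.0 = unconditional cooperator)."""
    if opp_last == 'D' and rng.random() >= generosity:
        return 'D'
    return 'C'

def mean_payoff(opp_defect_rate, generosity, rounds=20_000, seed=0):
    """Average payoff of a GTFT player against an opponent who defects at a
    fixed (non-reactive) rate; a deliberately crude stand-in for an
    'unforgiving' social environment."""
    rng = random.Random(seed)
    opp_last, total = 'C', 0
    for _ in range(rounds):
        my_move = gtft_move(opp_last, generosity, rng)
        opp_move = 'D' if rng.random() < opp_defect_rate else 'C'
        total += PAYOFF[(my_move, opp_move)]
        opp_last = opp_move
    return total / rounds

for generosity in (0.1, 0.5, 1.0):
    scores = [round(mean_payoff(p, generosity), 2) for p in (0.0, 0.5, 1.0)]
    print(f"generosity={generosity}: payoff vs. defect rates 0/0.5/1.0 -> {scores}")
```

Unsurprisingly, the fully forgiving strategy does fine when everyone cooperates and gets steadily exploited as defection becomes common, which is one crude way of seeing why the stability of prosocial behavior plausibly depends on the surrounding incentives and distribution of power.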

The trite saying that power corrupts is maybe an indication that the social behavior of humans is not super stable under capability increase. Human social instincts alone are not enough. But a simulation might show the limits of this and/or allow us to engineer them to be stable. 

Not a super important point, but I want to let you know how the connotations landed for me.

Saying that only neurotypical people are worthy of societal trust reads as phrasing that excludes and devalues groups with a very wide shotgun. The word "neurotypical" by itself does not imply error or sickness (at least not in all circles). Saying something like "We are uncomfortable when skin-atypical people make societally impactful decisions" would in the same sense be yikes for more general sensibilities. Even if true, acknowledgement and condoning are different things. You will want to make sure that this approach does not become yet another "skull measuring" discipline. There is a danger that this stance will be structurally "neuroconvergent," in that even with silicon we insist upon particular cognitive rituals being used.

The technical point is that there are humans whom we can recognise as not proper to trust with societal impact and who are still human.

Definitely agree with the thrust of your comment, though I should note that I neither believe nor think I really imply anywhere that 'only neurotypical people are worth societal trust.' I only use the word in this post to gesture at the fact that the vast majority of (but not all) humans share a common set of prosocial instincts—and that these instincts are a product of stuff going on in their brains. In fact, my next post will almost certainly be about one such neuroatypical group: psychopaths!

I think the special brain algorithms in question—e.g., the ones that make us comfortable entrusting a neurotypical human to decide who won in the set-up above—are more familiarly thought of as prosocial or moral cognition. A claim like this would predict that we would be uncomfortable entrusting humans who lacked the relevant prosocial instincts (e.g., psychopaths) to oversee a safety-via-debate-type set-up, which seems correct.

 

This passage seems to treat psychopaths as lacking the prosocial circuitry, which is what causes the discomfort with entrusting them. In the parent comment, psychopaths seem to be members of the group that does instantiate the circuitry. I am a bit confused.

For example, some people could characterise Asperger cognition as "asocial cognition". Picking out one kind of cognition as "moral" easily slips into implying that anything that is not it is "immoral".

I see the parent comment making a claim analogous in logical structure to "I did not say we distrust women as politicians. I just said that we trust men as politicians."

I suppose the next post will explain it in greater detail.

I am not an expert, but it seems to me that moral blindness is the main thing psychopaths are famous for. This is not about excluding a random group of people, but about noticing that something is not a universal human experience and pointing out the specific group that lacks it. It's like: "if we want to improve the safety of a self-driving car, we could keep a human in the loop who will also observe the road and provide feedback to the AI -- but of course we should not choose a blind human for this role".

I think the reason for objecting to this portion is the specificity and massive over-simplification of the criterion for trust.  There are LOTS of reasons not to trust someone, and neurodiversity is not actually one of them.  Psychopathy probably is, but it's really hard to test for, and it's not exclusive.  We may want to exclude religious or cultural true-believers, but they're not called out here for some reason.

The question of what specific humans to entrust with such a thing is really hard, and important.  Reducing it to "neurotypical" is simply incorrect.

I agree with this. There's a lot going on and I think what we're looking for isn't (just) some kind of "prosocial algorithm" but rather a mind developed around integrity that probably comes with prosocial instincts and so on. In theory, we could imagine a self-aware psychopath who takes extreme care to avoid self-serving cognition and would always self-modify in prosocial ways. I'm not sure if such people exist, but I think it's important not to blanket-dismiss an entire class of people with a specific set of characteristics (even if the characteristics in question are "antisocial" [at least "system-1 antisocial"] by definition).

I broadly agree with Viliam's comment above. Regarding Dagon's comment (to which yours is a reply), I think that characterizing my position here as 'people who aren't neurotypical shouldn't be trusted' is basically strawmanning, as I explained in this comment. I explicitly don't think this is correct, nor do I think I imply it is anywhere in this post.  

As for your comment, I definitely agree that there is a distinction to be made between prosocial instincts and the learned behavior that these instincts give rise to over the lifespan, but I would think that the sort of 'integrity' that you point at here as well as the self-aware psychopath counterexample are both still drawing on particular classes of prosocial motivations that could be captured algorithmically. See my response to 'plausible critique #1,' where I also discuss self-awareness as an important criterion for prosociality.  

Yes, that section is overly simplistic but not in a way that couldn't be fixed or would invalidate the thrust or the main argument.

In other words, the fact that prosocial algorithms in brains recursively accommodate the fact that prosocial algorithms in brains are sometimes suboptimal (e.g., the capacity to think thoughts like “loving one's neighbor is a double-edged sword”) is, I claim, a highly desirable property of prosocial brain algorithms. This ability to critically inspect one’s own values may well be the most important prosocial algorithm to pin down!

I think this is probably a convergent result of shard ecosystems doing reflection and planning on themselves, and less a set of human-specific algorithms (although maybe you didn't mean to claim that). That is, I think that if you get the "base shards" ~right (i.e. they care about people in a variety of ways which I'm going to handwave for now, because I don't know them more precisely yet), then the base shards will tend to end up doing value reflection. Another way of stating this position (originally argued to me by Quintin) is: "moral philosophy is weakly convergent in the type of its process (of reflecting on imperfections), but not in its reflective equilibria (i.e. the actual values which get settled on)."

Agree with the broader point of:

In essence, the ‘human moral reasoning ain’t all that’ critique ignores that human moral reasoning is itself responsible for generating this critique!

Also, see Quintin's comment about how variance in human altruism is a good portent for alignment.

Thanks for the comment! I do think that, at present, the only working example we have of an agent able to explicitly self-inspect its own values is in the human case, even if getting the base shards 'right' in the prosocial sense would likely entail that they will already be doing self-reflection. Am I misunderstanding your point here?  

Cool post! I made a comment on basically the same topic in the comment section of the latest post in the Shard Theory sequence. I'm excited about the cognitive science angle you're introducing (I'm not familiar with it). That said, it seems to me that there's a lot more about trustworthiness for "being in the loop" than a general type of cognition. There are also "Elephant in the brain" issues, for instance. (See also this comment I made in this same comment section).

Also, I think it could be useful to study the conditions for prosocial phenotypes to emerge from cultural learning or upbringing or genetic priors. In AI development, we may have more control over training conditions than over specifics of the algorithm (at least past a certain level of description). 

Thanks Lukas! I just gave your linked comment a read and I broadly agree with what you've written both there and here, especially w.r.t. focusing on the necessary training/evolutionary conditions out of which we might expect to see generally intelligent prosocial agents (like most humans) emerge. This seems like a wonderful topic to explore further IMO. Any other sources you recommend for doing so?

Thanks! Not much has been done on the topic to my knowledge, so I can only recommend this very general post on the EA forum. I think it's the type of cause area where people have to carve out their own research approach to get something started. 


Somewhat tangential: This paper is interesting; some parts of it strike me as very "big if true", particularly the part where QCAE Online Simulation had a correlation of 0.44 with learning rate for prosocial actions. However, their sample size is ultra-low, so I wonder, do you know if it has been independently replicated with a proper sample size and/or preregistration?

(Specifically the thing that strikes me as big is that QCAE Online Simulation is a personality self-report test while the learning rate is an objective measure, and usually objective measures can't measure the traits that personality self-reports can.)

Agreed that the correlation between the modeling result and the self-report is impressive, with the caveat that the sample size is small enough not to take the specific r-value too seriously. In a quick search, I couldn't find a replication of the same task with a larger sample, but I did find a meta-analysis that includes this task which may be interesting to you! I'll let you know if I find something better as I continue to read through the literature :)
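(For a rough sense of how loosely a correlation of that size is pinned down at small n, here is a quick back-of-the-envelope sketch using the standard Fisher z-transform; n = 30 is just an assumed, illustrative sample size, not the study's actual one.)

```python
import math

def r_confidence_interval(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transform."""
    z = math.atanh(r)                  # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)        # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)

# Assumed n = 30 for illustration only
print(r_confidence_interval(0.44, n=30))  # roughly (0.09, 0.69)
```

An interval that wide is consistent with anything from a small to a large effect, which is part of why the replication question matters.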

whatever brain algorithms motivate prosociality are not significantly altered by increases in general intelligence

I tend to think that to an extent each human is kept in check by other humans so that being prosocial is game-theoretically optimal. The saying "power corrupts" suggests that individual humans are not intrinsically prosocial. Biases make people think they know better than others.

Human variability in intelligence is minuscule and is only weak evidence as to what happens when capabilities increase.

I don't hold this belief strongly, but I remain unconvinced.

I have updated to it being a mix. It is not only about being kept in check by others. There are benevolent rulers. Not all, and not reliably, but there seems to be potential. 

I have to be honest, I’m skeptical. If we study how human prosociality works, my expectation is that we learn enough to produce some toy models with some very simplistic pro-sociality, but this seems insufficient for generating an AI capable of navigating tough moral dilemmas, which are just situations sufficiently off-distribution. The reason why we want humans in the loop is not because they are vaguely pro-social but because of the ability of humans to handle novel situations.

———————————————-

Actually, I shouldn’t completely rule out the value of this research. I think that its failure will be an utterly boring result, but perhaps there is value in seeing how it fails concretely. That said, it’s unlikely to be worth the effort if it takes a lot of neuroscience research just to construct these examples of failure.

If I had to say where it fails, it fails to be robust to relative scale. One thing I notice in real-life history is that it really requires relatively equivalent power levels, and without that, it goes wrong fast (human treatment of non-pet animals is a good example).

But you could design training runs that include agents with very different amounts of compute and see how stable the pro-social behavior is. You could also try to determine how instinct parameters have to be tuned to keep the pro-social behavior stable.
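To sketch the kind of sweep you describe (with made-up placeholder parameters rather than anything principled): in a toy model where a more powerful agent can exploit a weaker one with impunity, the intrinsic 'care' weight it needs in order to still prefer cooperation grows with the power gap.

```python
def min_care_weight(power_gap, base_gain=1.0, harm_to_other=2.0):
    """Toy model: exploiting the weaker agent yields extra material payoff
    base_gain * power_gap for the stronger agent and costs the weaker agent
    harm_to_other. If the stronger agent's utility is its own payoff plus
    care_weight * (the other's payoff), it prefers cooperation only when
    care_weight * harm_to_other > base_gain * power_gap."""
    return (base_gain * power_gap) / harm_to_other

for gap in (1, 2, 5, 10, 100):
    print(f"power gap {gap:>3}: need care weight > {min_care_weight(gap):.1f}")
```

Even this trivial model makes the worry quantitative: instinct parameters that suffice at rough parity may be far too weak at large asymmetries, which is exactly the kind of thing simulated training runs could probe.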

I think the special brain algorithms in question—e.g., the ones that make us comfortable entrusting a neurotypical human to decide who won in the set-up above—are more familiarly thought of as prosocial or moral cognition. A claim like this would predict that we would be uncomfortable entrusting humans who lacked the relevant prosocial instincts (e.g., psychopaths) to oversee a safety-via-debate-type set-up, which seems correct.

This doesn't match my intuitions, but I haven't thought about safety-via-debate too much. 

  • If the debate questions are like "would this plan be good for humanity?", then no, I don't want psychopaths adjudicating. 
  • If the questions are isolated and factual, like "does this plan lead to a diamond being synthesized in this particular room?", then I don't think I mind (except, perhaps, that some part of me objects to including psychopaths in any endeavor). 
  • If the questions are factual but feed into moral decisions, and the psychopathic judge knows that -- like "does this plan lead to lots of people dying?", then the psychopath would again implicitly be judging based off of expected-outcomes (even if they're supposed to not consider anything outside the context of the debate; people are people). And I wouldn't want them involved.

Agreed that there are important subtleties here. In this post, I am really just using the safety-via-debate set-up as a sort of intuitive case for getting us thinking about why we generally seem to trust certain algorithms running in the human brain to adjudicate hard evaluative tasks related to AI safety. I don't mean to be making any especially specific claims about safety-via-debate as a strategy (in part for precisely the reasons you specify in this comment).