In this post, I want to briefly propose a semi-novel direction for alignment research that I'm excited about. Though some of these ideas are not brand new—they purposefully bear resemblance to recent (highly promising) work on shard theory and Steve Byrnes’s approach to brain-based AGI safety—I think my emphases are sufficiently different to justify a more thorough explanation.
Why are humans 'worthy' of being in the loop?
I think the following three claims help motivate the general research direction I have in mind.
1) Many of the most coherent AI safety strategies proposed to date (e.g., HCH, imitative and approval-based amplification, recursive reward modeling, and more) involve human decision-makers in some meaningful capacity. I claim, therefore, that these proposals implicitly presuppose that there are specific algorithmic properties of the human mind/brain that make us comfortable entrusting these ‘humans in the loop’ with the task of minimizing the likelihood of AI-induced bad outcomes. This idea is demonstrated especially clearly by ‘safety via debate,’ for instance, in which a human judge is entrusted to evaluate a debate between two AI systems and decide which one wins.
2) I think the special brain algorithms in question—e.g., the ones that make us comfortable entrusting a neurotypical human to decide who won in the set-up above—are more familiarly thought of as prosocial or moral cognition. A claim like this would predict that we would be uncomfortable entrusting humans who lacked the relevant prosocial instincts (e.g., psychopaths) to oversee a safety-via-debate-type set-up, which seems correct. I think the reason it is so natural to want to incorporate neurotypical human decision-makers into alignment proposals is that we are confident (enough) that their decisions will be made carefully—or at least more carefully than if no humans were involved. In other words, individual humans in the loop are entrusted by default to serve as competent advocates for the interests of society at large (a role they are more than likely aware of playing): to infer suspicious behavior, evaluate subtle short- and long-term predicted consequences, remain humble about those evaluations, solicit second opinions where appropriate, etc.—somehow!
3) Our understanding of human prosocial cognition is growing increasingly precise and predictive. In cognitive neuroscience writ large, computational modeling has become a dominant approach to understanding the algorithms that the human brain instantiates, with talented researchers like Joshua Tenenbaum, Robb Rutledge, and Anne Collins leading the charge. This work has taken off in recent years and has enabled cognitive scientists to define and test hypotheses that are unprecedented in their mathematical precision and predictive power. Here are two good introductory examples (1, 2) of this sort of work that pertain explicitly to human prosocial cognition. I expect my future writing—and that of others interested in this sort of approach—to feature far more examples of good alignment-relevant computational social neuroscience research.
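To make concrete what "mathematically precise models of prosocial cognition" look like, here is a minimal sketch of one canonical example from this literature: Fehr-Schmidt inequity aversion, which models other-regarding preferences as a utility function that penalizes both disadvantageous inequity ("envy") and advantageous inequity ("guilt"). The specific parameter values below are illustrative, not empirical estimates.

```python
def fehr_schmidt_utility(own_payoff, other_payoffs, alpha=0.8, beta=0.5):
    """Utility of a payoff allocation for one agent under inequity aversion.

    alpha weights disadvantageous inequity (others earning more than me);
    beta weights advantageous inequity (me earning more than others).
    """
    n = len(other_payoffs)
    envy = sum(max(x - own_payoff, 0) for x in other_payoffs) / n
    guilt = sum(max(own_payoff - x, 0) for x in other_payoffs) / n
    return own_payoff - alpha * envy - beta * guilt

# A sufficiently inequity-averse agent prefers an even split over a larger
# but lopsided payoff for itself:
fair = fehr_schmidt_utility(5, [5])     # 5 - 0 - 0       = 5.0
selfish = fehr_schmidt_utility(8, [1])  # 8 - 0 - 0.5 * 7 = 4.5
```

Models of exactly this form are fit to behavioral and neuroimaging data to quantify individual differences in prosocial motivation, which is the kind of algorithmic characterization this research direction aims to synthesize.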
With these ideas in mind, my proposal is to conduct technical research to better understand these prosocial brain algorithms with the ultimate goal of instantiating some refined version of them directly into a future AGI (which is likely doing something approximating model-based RL). Here is a pretty straightforward two-step plan for doing so:
- Synthesize high-quality social neuroscience literature—with a particular focus on good computational modeling work—in order to more fully develop a rigorous account of the most important algorithms underlying human prosocial behavior.
- Develop specific corrigibility proposals for instantiating these algorithms in AI systems in a maximally realistic and competitive manner.
In other words, I'm basically proposing that we try to better understand the properties of human cognition that render humans 'worthy of being in the loop,' and proceed to apply this understanding by instantiating the relevant computations directly into our eventual AGI. If successful, such an approach might even obviate the need for having a human in the loop in the first place—'cutting out the middleman,' as it were.
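To illustrate what step two might mean in the model-based RL setting the proposal assumes, here is a deliberately toy sketch of an agent that scores candidate plans partly by their predicted consequences for another agent. Every name here (`prosocial_reward`, `social_weight`, the toy world model) is a hypothetical illustration of the general idea, not a concrete architectural proposal.

```python
def prosocial_reward(own_gain, other_gain, social_weight=0.5):
    """Task reward blended with predicted benefit to another agent."""
    return own_gain + social_weight * other_gain

def plan_value(plan, world_model, social_weight=0.5):
    """Score a plan (a sequence of actions) under a learned world model
    that predicts the consequences of each action for self and other."""
    total = 0.0
    for action in plan:
        own_gain, other_gain = world_model(action)
        total += prosocial_reward(own_gain, other_gain, social_weight)
    return total

def best_plan(plans, world_model, social_weight=0.5):
    return max(plans, key=lambda p: plan_value(p, world_model, social_weight))

# Toy world model: 'share' yields less for the agent but more for the other.
outcomes = {"hoard": (3.0, 0.0), "share": (2.0, 3.0)}
world_model = outcomes.get
plans = [["hoard", "hoard"], ["share", "share"]]
# With social_weight=0.5, sharing scores 7.0 against hoarding's 6.0,
# so the other-regarding term flips which plan the agent selects.
```

The real research program would replace the hand-coded `prosocial_reward` term with machinery reverse-engineered from the relevant brain algorithms; the sketch only shows where such a term would sit in a model-based planner.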
Responses to some anticipated objections
Perhaps this plan sounds like a long shot to you! Here are some plausible reasons I think one might be skeptical of such an approach, followed by what I hope are plausible responses.
Plausible critique #1: Human value-based cognition/moral reasoning ain’t all that.
Humans are jealous, status-focused, painfully short-sighted, perennially overconfident, and often ethically confused—why would we ever want to instantiate the cognitive machinery that gives rise to these features in a superintelligent AI?
One very simple reply to this concern is that we expect to be able to pick and choose from the set of prosocial brain algorithms—e.g., we might want to emulate the computational machinery that gives rise to altruistic motivations while skipping whatever machinery gives rise to envy.
I honestly suspect that this reply might be too simple—that a non-trivial number of prosocial brain algorithms exhibit a double-edged quality. For instance, if you want an AGI that exhibits the capacity to truly love its neighbor, this may necessarily imply an obverse capacity to, say, hate the person who kills its neighbor. Moral cognition is often messy in this way. (I definitely recommend Paul Bloom’s book Against Empathy for more on this general point.) To the degree it is possible to maximally sample the good and minimally sample the bad from prosocial brain algorithms, I obviously think we should do so. I am just skeptical of how far such an approach would take us in avoiding the ‘human values are too grey-area to copy’ critique.
Ultimately, I think that the following is a stronger reply to this worry: the fact that human brains are able to criticize their own cognitive architecture, as is evidenced by this very criticism, is strong evidence for the plausibility of leveraging prosocial brain algorithms for alignment.
In other words, the fact that prosocial algorithms in brains recursively accommodate the fact that prosocial algorithms in brains are sometimes suboptimal (e.g., the capacity to think thoughts like “loving one's neighbor is a double-edged sword”) is, I claim, a highly desirable property of prosocial brain algorithms. This ability to critically inspect one’s own values may well be the most important prosocial algorithm to pin down! Prosocial brains don’t just do things like share resources and shun cheaters—they challenge their own preconceived assumptions, they are motivated to self-improve, and they often are aware of their own shortcomings. The EA community is perhaps one of the better examples of this particular sort of prosocial behavior on abundant display. (Notice also that these specific self-awareness-type properties seem to do a lot of explanatory work for understanding why we trust humans to, say, mediate a safety-via-debate set-up.) In essence, the ‘human moral reasoning ain’t all that’ critique ignores that human moral reasoning is itself responsible for generating this critique!
Plausible critique #2: Human values attached to superintelligence may look far more dangerous than human values attached to human-level intelligence.
I think there are two important and related responses to this. The first is orthogonality—i.e., the thesis that an agent’s intelligence and an agent’s goals/values (roughly speaking) vary independently. If this is true, then for any set of goals/values, we generally should not expect that increasing the intelligence of the agent who holds them will categorically change what those goals/values look like when implemented (so long as the agent is originally competent enough to actually achieve its goals/values). The second is empirical: increased intelligence within humans has no meaningful negative correlation with prosocial behavior (some studies even report a weak positive correlation between human intelligence and prosociality, especially in morally complex decision domains—which makes sense). This seems to demonstrate that whatever brain algorithms motivate prosociality are not significantly altered by increases in general intelligence. Young neurotypical children (and even chimpanzees!) instinctively help others accomplish their goals when they believe those others are having trouble doing so alone—as do most people on the far right tail of the IQ distribution. All of this suggests to me that ‘attaching’ the right implementations of the right prosocial algorithms to generally intelligent systems should not suddenly render unrecognizable the class of behavior that emerges from these algorithms.
Plausible critique #3: AGI won't be doing anything that looks like RL, so this sort of approach is basically useless!
I think understanding human prosociality in computational terms would be useful for alignment even if what is discovered cannot be instantiated straightforwardly into the AGI. For example, understanding the computational underpinnings of the human motivation to communicate honestly may clue us into the fact that some model is deceptively aligned.
More to the point, however, I think it is sufficiently likely that AGI will be doing something like RL such that proceeding under this assumption is a reasonable bet. Many of the qualities that seem necessary for AGI—planning, acting competently in a changing world, solving complex problems, pursuing goals, communicating coherently, etc.—also seem like they will have to leverage reinforcement learning in some form. (I'm definitely in favor of other researchers proceeding under different assumptions about the AGI's foundational learning algorithm(s), too, as I discuss at length here.)
In summary, then, I think this critique doesn't quite work for two reasons—(1), even if we can't instantiate prosocial algorithms directly into the AGI, understanding them would still be useful for alignment, and (2), it is likely enough that AGI will be doing RL that 'let's-intelligently-interface-with-its-values'-style approaches are a worthwhile bet.
Plausible critique #4: We are simply too ignorant about the computational underpinnings of our values/moral decision-making to pursue this sort of research trajectory. The brain is just too damn complicated!
I won’t say too much about this critique because I’ve already shared some specific resources that cut against this sort of framing (see 3 above!). It seems to me that many alignment researchers come from strong technical computer science backgrounds and may therefore be less familiar with the progress that has been made in recent decades in cognitive science. In general, I think that perceptions like (a) the brain is a giant black box, or (b) cognitive science is an unbearably ‘soft’ science, are sorely outdated. The field has produced high-quality, falsifiable, and increasingly mathematical models of complex cognitive processes (i.e., models that make algorithmic sense of psychological and neurofunctional processes in one fell swoop), and leveraging these contributions when attempting to solve the alignment problem—the problem of getting powerful cognitive systems of our own making to behave themselves—seems essential in my view.
In general, I’d like to think of this research trajectory as an especially important subset of Steve Byrnes’s proposal to reverse-engineer human social instincts. I'm in strong agreement that Humans provide an untapped wealth of evidence about alignment, I'm very excited that others are beginning to work concretely on this subproblem, and I'm hopeful that still more people will join in pursuing this sort of alignment research direction!
I’d like to think that contained within the complex computational processes of the human brain exists a robust and stable solution to the problem of getting a generally intelligent agent to avoid existentially risky behavior, and I am excited to pursue this research agenda with this kind of brain-based treasure hunt in mind.