Recently I spent some time thinking about ways in which studying the human side of human-machine systems could help us build aligned AIs. I discussed these ideas informally, and people seemed interested and wanted to know more. Thus, I decided to write a list of research directions for studying humans that could help solve the alignment problem.
The list is non-exhaustive. Also, the intention behind it is not to argue that these research directions are more important than any other but rather to suggest directions to someone with a related background or personal fit in studying humans. There is also a lot of valuable work in AI Strategy that involves studying humans, which I am not familiar with. I wrote this list mostly with Technical AI Safety in mind.
Before diving into my suggestions for studying humans with AI Safety in mind, I want to mention some less well-known research fields that study the interactions between human and AI systems in different ways, since I reference some of these below. Leaving aside the usual suspects of psychology, cognitive science and neuroscience, other interesting research areas I came across are:
Cybernetics: a “transdisciplinary” approach defined by Norbert Wiener in 1948 as "the scientific study of control and communication in the animal and the machine". It is currently mostly used as a historical reference and foundational reading. However, there is growing work on integrating cybernetics concepts into current research.
Human-Computer Interaction (HCI) is an established field dating back to the 70s. It “studies the design and use of computer technology, focused on the interfaces between people and computers”. Human-AI Interaction is a recently established sub-field of HCI concerned with studying specifically the interactions between humans and “AI-infused systems”.
Computational Social Science: “Using computers to model, simulate, and analyze social phenomena. It focuses on investigating social and behavioural relationships and interactions through social simulation, modelling, network analysis, and media analysis”.
Collective Intelligence: defined as “the enhanced capacity that is created when people work together, often with the help of technology, to mobilise a wider range of information, ideas, and insights”.
Artificial Social Intelligence, which some define as “the domain aimed at endowing artificial agents with social intelligence, the ability to deal appropriately with users’ attitudes, intentions, feelings, personality and expectations”.
Many concrete proposals for AI Alignment solutions, such as AI Safety via Debate, Recursive Reward Modelling or Iterated Distillation and Amplification, involve human supervision. However, as Geoffrey Irving and Amanda Askell argued, we do not know what problems may emerge when these systems interact with real people in realistic situations. Irving and Askell suggested a specific list of questions to work on. The list is primarily aimed at the Debate technique, but knowledge gained about how humans perform with one approach is likely to partially generalize to other approaches (I also recommend reading the LessWrong comments on their paper).
Potentially useful fields: Cognitive science, Human-AI Interaction.
Factored cognition and evaluation refer to mechanisms for addressing open-ended cognitive tasks by breaking them down (or factoring them) into many small and mostly independent tasks. Note that the possibly recursive nature of this definition makes it hard to reason about the behaviour of these mechanisms in the limit. Paul Christiano already made the case for better understanding factored cognition and evaluation when describing what Ought is doing and why it matters. Factored cognition and evaluation are major components of numerous concrete proposals to solve outer alignment, including Paul’s own. It therefore seems important to understand the extent to which factored cognition and evaluation work well for solving meaningful problems. Rohin Shah and Buck Shlegeris mentioned that they would love to see more research in this direction for similar reasons, and also because it seems plausible to Buck that “this is the kind of thing where a bunch of enthusiastic people could make progress on their own”.
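To make the recursive structure more concrete, here is a minimal toy sketch in Python. It is not how Ought or any alignment proposal implements factored cognition. It only assumes a deliberately trivial task (summing a list of numbers) and a hypothetical constraint, `max_items`, on how many numbers a single "worker" may look at. The names `Result` and `solve_factored` are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    answer: int
    calls: int  # number of worker invocations (leaf answers plus combine steps)
    depth: int  # depth of the decomposition tree


def solve_factored(task: List[int], max_items: int = 2) -> Result:
    """Toy factored task: sum a list when no worker may see more than max_items numbers."""
    if max_items < 2:
        raise ValueError("max_items must be at least 2 so that two sub-answers can be combined")

    # Base case: the task is small enough for a single worker to answer directly.
    if len(task) <= max_items:
        return Result(answer=sum(task), calls=1, depth=0)

    # Factoring step: split into two mostly independent subtasks.
    mid = len(task) // 2
    left = solve_factored(task[:mid], max_items)
    right = solve_factored(task[mid:], max_items)

    # Combining step: itself a small task (adding two numbers) done by one worker.
    return Result(
        answer=left.answer + right.answer,
        calls=left.calls + right.calls + 1,
        depth=max(left.depth, right.depth) + 1,
    )


if __name__ == "__main__":
    result = solve_factored(list(range(1, 9)), max_items=2)
    print(result)
```

Even in this toy case, the number of worker calls and the depth of the tree grow with the size of the task, which hints at why the behaviour of such mechanisms in the limit is hard to reason about. Real open-ended tasks are much harder to factor cleanly than a sum, and whether they can be factored at all is exactly the open question.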
Potentially useful fields: Cognitive science, Collective Intelligence
Jan Leike et al. asked whether feedback-based models (such as Recursive Reward Modelling or Iterated Distillation and Amplification) can attain sufficient accuracy with an amount of data that we can produce or label within a realistic budget. Explicitly expressing approval for a given set of agent behaviours is time-consuming and often an experimental bottleneck. Among themselves, humans tend to use more sample-efficient feedback methods, such as non-verbal communication. The most immediate way of addressing this question is to work on understanding preferences and values from natural language, a problem that is being tackled but remains unsolved. Going further, can we train agents from head nods and other micro-expressions of approval? There are already examples of such work coming out of Social Signal Processing. We can extend this idea as far as training agents using brain waves, which would take us to Brain-Computer Interfaces, although this direction seems further away in time. Additionally, it makes sense to study this because systems could develop the ability to read such signals on their own, and we would want to be familiar with it if they do.
Potentially useful fields: Artificial Social Intelligence, Neuroscience
Interpretability seems to be a key component of numerous concrete solutions to inner alignment problems. However, it also seems that improving our understanding of transparency and interpretability is an open problem. This probably requires both formal contributions towards robust definitions of interpretability and a better understanding of the human cognitive processes involved in understanding, explaining and interpreting things. I would not be happy if we ended up with some interpretability tools that we trust for socially idiosyncratic reasons but that are not de facto safe. I would be curious to see work that tries to decouple these ideas and helps us get out of the trap of interpretability as an ill-defined concept.
Potentially useful fields: Human-AI Interaction, Computational Social Science.
When talking about value alignment, I have heard a few times an argument that goes like this: “while I can see that the algorithm is learning from my preferences, how can I know that it has learnt my preferences?” This is a hard problem, since latent preferences seem to be somewhat unknowable in full. While we certainly need work on ensuring generalisation across distributions and avoiding unacceptable outcomes, it would also be useful to better understand what would make people think that their preferences have been learnt. This could also help with concerns like gaming preferences or deceitfully soliciting approval.
Potentially useful fields: Psychology, Cognitive Science, Human-AI Interaction
This is something that I am less familiar with, but let me put it out there for debate anyway. Since we want to build systems that are aligned and compatible with human values, would it not be helpful to better understand how humans form values in their brains? I do not think that we should copy how humans form values, as there could be better ways to do it, but knowing how we do it could be helpful, to say the least. There is ongoing work in neuroscience to answer such questions.
Potentially useful fields: Neuroscience
Some think that if powerful AI systems could understand us better, such as by doing more advanced sentiment recognition, there would be a significant risk that they may deceive and manipulate us better. Conversely, others argue that if powerful AI systems cannot understand certain human concepts well, such as emotions, it may be easier for misaligned behaviour to emerge. While an AI with deceptive intentions would be problematic for many reasons other than its ability to understand us, it seems interesting to better understand the risks, benefits, and trade-offs of enabling AI systems to understand us better. It might be that these are no different from those of any other capability, or it might be that there are some interesting specificities. Some have also argued that access to human modelling could be more likely to produce mesa-optimizers, learnt algorithms that have their own objectives. This argument hinges on the idea that since humans often act as optimizers, reasoning about humans would lead these algorithms to learn about optimization. A more in-depth evaluation of what reasoning about humans would involve could provide more evidence about the weight of this argument.
Potentially useful fields: Cognitive Science, AI Safety Strategy.
Ivan Vendrov and Jeremy Nixon made a compelling case that working on aligning existing recommender systems can not only lead to significant social benefits but also have positive flow-through effects on the broader problem of AGI alignment. Recommender systems are likely built on the largest datasets of real-world human decisions currently in existence. Therefore, working on aligning them will require significantly more advanced models of human preferences and values, such as metrics of extrapolated volition. It could also provide a large-scale real-world testing ground for techniques of human-machine communication such as interpretability and corrigibility.
Potentially useful fields: Human-AI Interaction, Product Design
The list is non-exhaustive and I am very curious to hear additional ideas and suggestions. Additionally, I am excited about any criticism or comments on the proposed ideas.
Finally, if you are interested in this topic, there are a couple of interesting further readings that overlap with what I am writing here, specifically:
Thanks to Stephen Casper, Max Chiswick, Mark Xu, Jiajia Hu, Joe Collman, Linda Linsefors, Alexander Fries, Andries Rosseau and Amanda Ngo, who shared or discussed some of these ideas with me.
On the neuroscience side, I've been trying to dive into "if we build AGI using similar algorithms as the human brain, how can we make it safe and beneficial?" Further reading. That's more studying algorithms than "studying humans", probably.
I guess these are all more on the strategy side, but...
Out of possible futures in which we've invented cheap superintelligent AGIs, do a survey of which one most people on earth would actually want to live in. How does it interact with different personalities and different value systems? Further reading.
If everyone on Earth had a superintelligent AGI helper, what would they do with it, and what would be the societal consequences? What if each person can buy an AGI whose capabilities are proportional to the amount of money they spend on its hardware?
How can we avoid the failure mode (assuming it is in fact a failure mode) where we solve the technical problem of making AGIs that are docile and subservient, but then there's a political movement of people identifying with those AGIs and lobbying to make them more selfish and independent, presumably citing the analogy of slavery? What sort of AGI, and AGI-human interaction framework, would make this more or less likely to happen? Further reading.
Thanks for writing this. I think there's a lot of useful work that can be done in this direction, and a current major bottleneck is describing that work in a way that makes the people with the relevant skills aware of it and of why it is valuable.