cross-posted from niplav.github.io
I discuss arguments for and against the usefulness of brain-computer interfaces in relation to AI alignment, and conclude that the path to AI going well using brain-computer interfaces hasn't been explained in sufficient detail.
As a response to Elon Musk declaring that NeuraLink's purpose is to aid AI alignment, Muehlhauser 2021 cites Bostrom 2014 ch. 2 for reasons why brain-computer interfaces seem unlikely to be helpful with AI alignment. However, the chapter referenced concerns itself with building superintelligent AI using brain-computer interfaces, and not specifically with whether such systems would be aligned or especially alignable.
Arguments against the usefulness of brain-computer interfaces in AI alignment have been raised, but mostly in short form on Twitter (for example here). This text attempts to collect arguments for and against brain-computer interfaces from an AI alignment perspective.
I am neither a neuroscientist nor an AI alignment researcher (although I have read some blogposts about the latter), and I know very little about brain-computer interfaces (from now on abbreviated as “BCIs”). I have done a cursory internet search for a resource laying out the case for the utility of BCIs in AI alignment, but haven't been able to find anything that satisfies my standards. I have also asked on the LessWrong open thread and in the AI alignment channel on the Eleuther AI discord server, and have not received any answers that provide such a resource (although I was told some useful arguments about the topic).
I have tried to make the best case for and against BCIs, stating some tree of arguments that I think many AI alignment researchers tacitly believe, mostly taking as a starting point the Bostrom/Yudkowsky story of AI risk (although it might be generalizable to a Christiano-like story; I don't know enough about CAIS or ARCHES to make a judgment about the applicability of the arguments).
Arguments For the Utility of Brain-Computer Interfaces in AI Alignment
Improving Human Cognition
Just as writing or computers have improved the quality and speed of human cognition, BCIs could do the same, on a similar (or larger) scale. These improvements could arise from several advantages of BCIs over traditional perception:
- Quick lookup of facts (e.g. querying Wikipedia while in a conversation)
- Augmented long-term memory (with more reliable and resilient memory freeing up capacity for thought)
- Augmented working memory (i.e. holding 11±2 instead of 7±2 items in mind at the same time) (thanks to janus#0150 on the Eleuther AI discord server for this point)
- Exchange of mental models between humans (instead of explaining a complicated model, one would be able to simply “send” the model to another person, saving a lot of time explaining)
- Outsourcing simple cognitive tasks to external computers
- Adding additional emulated cortical columns to human brains
It would be useful to try to estimate whether BCIs could make as much of a difference to human cognition as language or writing or the internet, and to perhaps even quantify the advantage in intelligence and speed given by BCIs.
Understanding the Human Brain
Neuroscience seems to be blocked by not having good access to human brains while they are alive, and would benefit from shorter feedback loops and better data. A better understanding of the human brain might be quite useful in e.g. finding the location of human values in the brain (even though it seems like there might not be one such location; see Hayden & Niv 2021). Similarly, a better understanding of the human brain might aid in better understanding and interpreting neural networks.
Path Towards Whole-Brain Emulation or Human Imitation
Whole-brain emulation (henceforth WBE) (with the emulations being faster or cheaper to run than physical humans) would likely be useful for AI alignment if used differentially for alignment over capabilities research: human WBEs would to a large extent share human values, and could subjectively slow down timelines while searching for AI alignment solutions. Fast progress in BCIs could make WBEs more likely before an AI point of no return by improving the understanding of the human brain.
A similar but weaker argument would apply to AI systems that imitate human behavior.
“Merging” AI Systems With Humans
A notion often brought forward in the context of BCIs and AI alignment is the one of “merging” humans and AI systems.
Unfortunately, a clearer explanation of how exactly this would work or help with making AI go well is usually not provided (at least I haven't managed to find any clear explanation). There are different possible ways of conceiving of humans “merging” with AI systems: using human values/cognition/policies as partial input to the AI system.
Input of Values
The most straightforward method of merging AI systems and humans could be to use humans outfitted with BCIs as part of the reward function of an AI system. In this case, a human would be presented with a set of outcomes by an AI system, and would then signal how desirable each outcome would be from the human's perspective. The AI would then search for the ways that are most likely to reach the states the human rated highest.
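The rating loop described above can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the outcome names, the numeric ratings (standing in for whatever signal a BCI might deliver), the plan names and the probabilities are all hypothetical, and a real system would not have access to such clean numbers.

```python
from typing import Dict


def best_plan(ratings: Dict[str, float], plans: Dict[str, Dict[str, float]]) -> str:
    """Return the plan most likely to reach the outcome the human rated highest.

    ratings: hypothetical desirability signal per outcome (stand-in for BCI input).
    plans:   hypothetical mapping from plan name to outcome probabilities.
    """
    # The outcome the human signalled as most desirable.
    target = max(ratings, key=lambda outcome: ratings[outcome])
    # The plan with the largest probability of reaching that outcome.
    return max(plans, key=lambda name: plans[name].get(target, 0.0))


# Hypothetical ratings supplied by the human.
ratings = {"outcome_x": 0.9, "outcome_y": 0.5, "outcome_z": 0.0}

# Hypothetical plans with their probabilities of leading to each outcome.
plans = {
    "plan_a": {"outcome_x": 0.25, "outcome_y": 0.75},
    "plan_b": {"outcome_x": 0.5, "outcome_y": 0.25, "outcome_z": 0.25},
}

chosen = best_plan(ratings, plans)
```

Note that this literal reading of the scheme only looks at the single top-rated outcome; a variant that maximizes the expected rating over all outcomes would be a different design choice with different behavior.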
If one were able to find parts of the human brain that hold the human utility function, one could use these directly as parts of the AI systems. However, it seems unlikely that the human brain has a clear notion of terminal values distinct from instrumental values and policies in a form that could be used by an AI system.
Easier Approval-Directed AI Systems
Additionally, a human connected to an AI system via a BCI would have an easier time evaluating the cognition of approval-directed agents, since they might be able to follow the cognition of the AI system in real-time, and spot undesirable thought processes (such as attempts at cognitive steganography).
Input of Cognition
Related to the aspect of augmenting humans using BCIs by outsourcing parts of cognition to computers, the inverse is also possible: identifying modules of AI systems that are most likely to be misaligned to humans or produce such misalignment, and replacing them with human cognition.
For example the part of the AI system that formulates long-term plans could be most likely to be engaged in formulating misaligned plans, and the AI system could be made more myopic by replacing the long-term planning modules with BCI-augmented humans, while short-term planning would be left to AI systems.
Alternatively, if humanity decides it wants to prevent AI systems from forming human models, modeling humans & societies could be outsourced to actual humans, whose human models would be used by the AI systems.
Input of Policies
As a matter of completeness, one might hypothesize about an AI agent that is coupled with a human, where the human can overwrite the policy of the agent (or, alternatively, the agent samples policies from some part of the human brain directly). In this case, however, when not augmented with other methods of “merging” humans and AI systems, the agent has a strong instrumental pressure to remove the ability of the human to change its policy on a whim.
Aid to Interpretability Work
By increasing the speed of interaction and augmenting human intelligence, BCIs might aid the quest of improving the interpretability of AI systems.
Side-note: A Spectrum from Humans to Human Imitations
There seems to be a spectrum from biological humans to human imitations, roughly along the axes of integration with digital systems/speed: Biological humans – humans with BCIs – whole-brain emulations – human imitations. This spectrum also partially tracks how aligned these human-like systems can be expected to act: a human imitation off-distribution seems much less trustworthy than a whole-brain emulation of a human acting off distribution.
Arguments Against the Utility of Brain-Computer Interfaces in AI Alignment
And so we boldly go—into the whirling knives.
–Nick Bostrom, “Superintelligence: Paths, Dangers, Strategies” p. 143, 2014
Direct Neural Takeover Made Easy
A common observation about AI alignment is that initially AI systems would be confined to computers, hopefully only with indirect contact to the outside world (i.e. no access to robots, nanotechnology or factories). While there are some responses to these arguments (see e.g. Yudkowsky 2016a, Yudkowsky 2016b, Bostrom 2014 pp. 117-122), the proposal of connecting humans to potentially unaligned AI systems lends these counterarguments more weight.
Given direct access to the nervous system of a human, an AI system would be more likely to be able to hijack the human and use them to instantiate more instances of itself in the world (especially on computers with more computing power or access to manufacturing capabilities). Even if the access to the human brain is severely restricted to a few bits and very specific brain regions (thereby making the connection less useful in the first place), the human brain is not modular, and as far as I understand not designed to withstand adversarial interaction on the neural level (as opposed to attacks through speech or text, which humans are arguably more optimized against through constant interaction with other humans who tried to manipulate them in the ancestral environment).
Even if work on BCIs is net-positive in expectation for making AI go well, it might be the case that other approaches are more promising, and that focusing on BCIs might leave those approaches underdeveloped.
For example, one can posit neural network interpretability as the GiveDirectly of AI alignment: reasonably tractable, likely helpful in a large class of scenarios, with basically unlimited scaling and only slowly diminishing returns. And just as any new EA cause area must pass the first test of being more promising than GiveDirectly, so every alignment approach could be viewed as a competitor to interpretability work. Arguably, work on BCIs does not cross that threshold.
“Merging” is Just Faster Interaction
Most proposals of “merging” AI systems and humans using BCIs are proposals of speeding up the interaction between humans and computers (and possibly increasing the amount of information that humans can process): A human typing at a keyboard can likely perform all operations on the computer that a human connected to the computer via a BCI can, such as giving feedback in a CIRL game, interpreting a neural network, analysing the policy of a reinforcement learner etc. As such, BCIs offer no qualitatively new strategies for aligning AI systems.
While this is not negative (after all, quantity (of interaction) can have a quality of its own), if we do not have a type of interaction that makes AI systems aligned in the first place, faster interaction will not make our AI systems much safer. BCIs seem to offer an advantage by a constant factor: If BCIs give humans a 2x advantage when supervising AI systems (by making humans 2x faster/smarter), then if an AI system becomes 2x bigger/faster/more intelligent, the advantage is nullified. Even though the feasibility of rapid capability gains is a matter of debate, an advantage by only a constant factor does not seem very reassuring.
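The constant-factor argument can be spelled out as a small calculation. This is only a toy model under strong assumptions: it treats "capability" as a single number, and the function names `oversight_ratio` and `doublings_bought` are hypothetical names introduced here for illustration.

```python
import math


def oversight_ratio(human: float, bci_factor: float, ai: float) -> float:
    """Toy measure of how well a (possibly BCI-augmented) human keeps up with an AI.

    Assumes capability can be summarized as one scalar, which is a simplification.
    """
    return human * bci_factor / ai


def doublings_bought(bci_factor: float) -> float:
    """Number of AI capability doublings after which a constant BCI advantage is used up."""
    return math.log2(bci_factor)


# Baseline: an unaugmented human supervising an AI of equal capability.
baseline = oversight_ratio(human=1.0, bci_factor=1.0, ai=1.0)

# A 2x BCI advantage doubles the ratio for the same AI system...
with_bci = oversight_ratio(human=1.0, bci_factor=2.0, ai=1.0)

# ...but the ratio is back at baseline once the AI system is 2x as capable.
after_ai_growth = oversight_ratio(human=1.0, bci_factor=2.0, ai=2.0)
```

In this model a 2x advantage buys exactly one doubling of AI capability (`doublings_bought(2.0)`), and an 8x advantage buys three; how much calendar time a doubling corresponds to is outside the model.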
Additionally, supervision of AI systems through fast interaction should come in addition to a genuine solution to the AI alignment problem: Ideally niceness is the first line of defense and the AI would tolerate our safety measures, but most arguments for BCIs being useful already assume that the AI system is not aligned.
Problems Arise with Superhuman Systems
When combining humans with BCIs and superhuman AI systems, several issues might arise that were no problem with infrahuman systems.
When infrahuman AI systems are “merged” with humans in a way that is nontrivially different from the humans using the AI system, the performance bottleneck is likely going to be the AI part of the tandem. However, once the AI system passes the human capability threshold in most domains necessary for the task at hand, the bottleneck is going to be the humans in the system. While such a tandem is likely to be somewhat more capable than the humans alone (partially because the augmentation by BCI makes the human more intelligent), such systems might not be competitive against AI-only systems that don't have a human part, and could be outcompeted by AI-only approaches.
These bottlenecks might arise due to different speeds of cognition and increasingly alien abstractions by the AI systems that need to be translated into human concepts.
“Merging” AI Systems with Humans is Underspecified
To my knowledge, there is no publicly written up explanation of what it would mean for humans to “merge” with AI systems. I explore some of the possibilities in this section, but these mostly boil down to faster interaction.
It seems worrying that a complete company has been built on a vision that has no clearly articulated path to success.
BCIs Speed Up Capabilities Research as Well
If humanity builds BCIs, it is not certain that the AI alignment community is going to be especially privileged over the AI capabilities community with regard to access to these devices. Unless BCIs increase human wisdom as well as intelligence, widespread BCIs that only enhance human intelligence would be net-zero in expectation.
On the other hand, if an alignment-interested company like NeuraLink acquires a strong lead in BCI technology and provides it exclusively to alignment-oriented organisations, it appears possible that BCIs will be a pivotal tool for helping to secure the development of AI.
Before collecting these arguments and thinking about the topic, I was quite skeptical that BCIs would be useful in helping align AI systems: I believed that while researching BCIs would be in expectation net-positive, there are similarly tractable approaches to AI alignment with a much higher expected value (for example work on interpretability).
I still basically hold that belief, but have shifted my expected value of researching BCIs for AI alignment upwards somewhat (if pressed, I would give an answer of a factor of 1.5, but I haven't thought about that number very much). The central argument that prevents me from taking BCIs as an approach to AI alignment seriously is the argument that BCIs per se offer only a constant interaction speedup between AI systems and humans, but no clear qualitative change in the way humans interact with AI systems, and create no differential speedup between alignment and capabilities work.
The fact that there is no writeup of a possible path to AI going well that is focused on BCIs worries me, given that a whole company has been founded based on that vision. An explanation of a path to success would be helpful in furthering the discussion and perhaps moving work to promising approaches to AI alignment (be it towards or away from focusing on BCIs).