The AI safety proposal was published by Yoshua Bengio on May 7, 2023.

Below I quote the salient sections of the proposal and comment on them in turn.

Main thesis: safe AI scientists

[…] I would like to share my thoughts regarding the hotly debated question of long-term risks associated with AI systems which do not yet exist, where one imagines the possibility of AI systems behaving in a way that is dangerously misaligned with human rights or even loss of control of AI systems that could become threats to humanity. A key argument is that as soon as AI systems are given goals – to satisfy our needs – they may create subgoals that are not well-aligned with what we really want and could even become dangerous for humans.

The bottom line of the thesis presented here is that there may be a path to build immensely useful AI systems that completely avoid the issue of AI alignment, which I call AI scientists because they are modeled after ideal scientists and do not act autonomously in the real world, only focusing on theory building and question answering. The argument is that if the AI system can provide us benefits without having to autonomously act in the world, we do not need to solve the AI alignment problem. This would suggest a policy banning powerful autonomous AI systems that can act in the world (“executives” rather than “scientists”) unless proven safe. However, such a solution would still leave open the political problem of coordinating people, organizations and countries to stick to such guidelines for safe and useful AI. The good news is that current efforts to introduce AI regulation (such as the proposed bills in Canada and the EU, but see action in the US as well) are steps in the right direction.

Bengio’s “AI scientists” are similar in spirit to tool AI, Oracle AI, and the “simulators” agenda. All these ideas share a weakness: tool AI may still cause a lot of harm in the hands of “misaligned” people, or people with poor ethics. I’m not sure which AI design would be on balance better[1]: tool AI or agent AI, but my intuition, stemming perhaps from the “skin in the game” principle, tells me that agent AI would be on balance better (if some other aspects of the overall system design are also done right, which I touch upon below, in the last section).

Note that in the Gaia network architecture (Kaufmann & the Digital Gaia Team, 2023), the intelligence nodes (called “Ents”) are agents as well as Bayesian learners (which is a synonym for “scientists”, as in Bengio’s proposal). But Ents are supposed to be situated within a system of incentives and action affordances that prevents these agents (if they are misaligned) from amassing a lot of power, unless they are able to create a completely isolated supply chain, as in Yudkowsky’s self-replicating nanobots argument (which many people don’t think is realistic).

Training an AI Scientist with Large Neural Nets for Bayesian Inference

I quote this section in full as I think it would be interesting for readers. I don’t have any comments or disagreements to add to it.

I would like here to outline a different approach to building safe AI systems that would completely avoid the issue of setting goals and the concern of AI systems acting in the world (which could be in an unanticipated and nefarious way). The model for this solution is the idealized scientist, focused on building an understanding of what is observed (also known as data, in machine learning) and of theories that explain those observations. Keep in mind that for almost any set of observations, there will remain some uncertainty about the theories that explain them, which is why an ideal scientist can entertain many possible theories that are compatible with the data.

The mathematically rational way to handle that uncertainty is called Bayesian inference. It involves listing all the possible theories and their posterior probabilities (which can be calculated in principle, given the data). It also mandates how (in principle) to answer any question in a probabilistic way (called the Bayesian posterior predictive) by averaging the probabilistic answer to any question from all these theories, each weighted by the theory’s posterior probability. This automatically puts more weight on the simpler theories that explain the data well (known as Occam’s razor).

Although this rational decision-making principle has been known for a long time, the advent of large neural networks that can be trained on a huge number of examples actually opens the door to obtaining very good approximations of these Bayesian calculations. See[2][3][4][5] for recent examples going in that direction.

These theories can be causal, which means that they can generalize to new settings more easily, taking advantage of natural or human-made changes in distribution (known as interventions). These large neural networks do not need to explicitly list all the possible theories: it suffices that they represent them implicitly through a trained generative model that can sample one theory at a time. See also my recent blog post on model-based machine learning, which points in the same direction. Such neural networks can be trained to approximate both a Bayesian posterior distribution over theories as well as trained to approximate answers to questions (also known as probabilistic inference or the Bayesian posterior predictive). What is interesting is that as we make those networks larger and train them for longer, we are guaranteed that they will converge toward the Bayesian optimal answers.

There are still open questions regarding how to design and train these large neural networks in the most efficient way, possibly taking inspiration from how human brains reason, imagine and plan at the system 2 level, a topic that has driven much of my research in recent years. However, the path forward is fairly clear and may both eliminate the issues of hallucination and difficulty in multi-step reasoning with current large language models as well as provide a safe and useful AI as I argue below.
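To make the Bayesian posterior predictive described in the quoted section concrete, here is a minimal sketch in Python. It is a toy illustration only, not Bengio's method: the "theories" are three hypothetical coin biases, the prior is assumed uniform, and the posterior is computed by exact enumeration, which is feasible only because the set of theories is tiny. The function names and the data are my own assumptions. The neural-network approach in the quote is meant to approximate this same calculation when the theories cannot be listed explicitly.

```python
from math import prod


def posterior_over_theories(thetas, prior, flips):
    """Return P(theory | data) for each candidate coin bias in `thetas`.

    `flips` is a sequence of 1 (heads) and 0 (tails).
    """
    # Likelihood of the observed flips under each theory.
    likelihoods = [prod(t if f else 1 - t for f in flips) for t in thetas]
    # Bayes' rule: posterior is proportional to prior times likelihood.
    unnormalised = [p * lik for p, lik in zip(prior, likelihoods)]
    evidence = sum(unnormalised)
    return [u / evidence for u in unnormalised]


def posterior_predictive_heads(thetas, posterior):
    """Average each theory's answer to "is the next flip heads?",
    weighted by that theory's posterior probability."""
    return sum(w * t for w, t in zip(posterior, thetas))


# Hypothetical setup: three theories about the coin and a uniform prior.
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
flips = [1, 1, 0, 1]  # observed data: three heads, one tail

posterior = posterior_over_theories(thetas, prior, flips)
p_heads = posterior_predictive_heads(thetas, posterior)
print(posterior, p_heads)
```

The answer to the question is a probability rather than a commitment to any single theory: the theory that explains the data best gets the most weight, but the others still contribute.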

AI scientists and humans working together

It would be safe if we limit our use of these AI systems to (a) model the available observations and (b) answer any question we may have about the associated random variables (with probabilities associated with these answers). One should notice that such systems can be trained with no reference to goals nor a need for these systems to actually act in the world. The algorithms for training such AI systems focus purely on truth in a probabilistic sense. They are not trying to please us or act in a way that needs to be aligned with our needs. Their output can be seen as the output of ideal scientists, i.e., explanatory theories and answers to questions that these theories help elucidate, augmenting our own understanding of the universe. The responsibility of asking relevant questions and acting accordingly would remain in the hands of humans. These questions may include asking for suggested experiments to speed-up scientific discovery, but it would remain in the hands of humans to decide on how to act (hopefully in a moral and legal way) with that information, and the AI system itself would not have knowledge seeking as an explicit goal.

These systems could not wash our dishes or build our gadgets themselves but they could still be immensely useful to humanity: they could help us figure out how diseases work and which therapies can treat them; they could help us better understand how climate changes and identify materials that could efficiently capture carbon dioxide from the atmosphere; they might even help us better understand how humans learn and how education could be improved and democratized. A key factor behind human progress in recent centuries has been the knowledge built up through the scientific process and the problem-solving engineering methodologies derived from that knowledge or stimulating its discovery. The proposed AI scientist path could provide us with major advances in science and engineering, while leaving the doing and the goals and the moral responsibilities to humans.

I think this vision of human-AI collaboration, which I can summarise as “AI does science, humans do ethical deliberation”, has a big practical problem which I already touched upon above: (some) humans, including scientists, have very bad ethics. Both “The Wisdom Gap” and Edward Wilson’s famous quote, “The real problem of humanity is the following: we have Paleolithic emotions, medieval institutions and godlike technology”, point to the fact that humans’ individual and collective ethics (as well as our innate biological capacities to act upon these ethics, cf. Bertrand Russell’s “two kinds of morality”) are becoming increasingly inadequate to the power of our technology and to civilisational complexity. Bengio’s proposal, however, seems to help advance technology much further without radically addressing the ethics question (apart from “helping to improve the education”, but this is a very slow route).

I think that all paths to safe superhuman AI (whether a “scientist” or an agent) must include turning ethics into science and then delegating it to AI as well. The only remaining role of humans would be to provide “phenomenological” evidence (i.e., telling if they perceive something as “good” or “bad”, or feel that way on the neurological level, which AI may figure out bypassing unreliable verbal report) from which AIs will infer concrete ethical models[6].

But then, if AIs are simultaneously superhuman at science and have a superhuman conscience (ethics), it feels that they could basically be made agents, too, and the results would probably be better than if agency is still in the hands of people even though both epistemology and moral reasoning are delegated to AI.

In short: I think that Bengio’s proposal is insufficient for ensuring with reasonable confidence that the AI transition of the civilisation goes well: even though the “human-AI alignment problem” is defined out of existence, the human-to-human alignment (ethics and wisdom) problems remain unsolved. On the other hand, if we do solve all these problems (which is hard), then formally taking agency away from AI seems unwarranted.

Bengio’s “AI scientists” are also similar to OpenAI’s “alignment MVP” and Conjecture’s CoEms, with the difference that neither the “alignment MVP” nor CoEms are supposed to be a “relatively permanent” solution. Rather, they are meant to be used mostly (or exclusively) to develop cognitive science, epistemology, ethics, and alignment science, after which the “AI scientist” would be scrapped and replaced with something that it itself helped to design.

The (not only) political challenge

However, the mere existence of a set of guidelines to build safe and useful AI systems would not prevent ill-intentioned or unwitting humans from building unsafe ones, especially if such AI systems could bring these people and their organizations additional advantages (e.g. on the battlefield, or to gain market share). That challenge seems primarily political and legal and would require a robust regulatory framework that is instantiated nationally and internationally. We have experience of international agreements in areas like nuclear power or human cloning that can serve as examples, although we may face new challenges due to the nature of digital technologies.

It would probably require a level of coordination beyond what we are used to in current international politics and I wonder if our current world order is well suited for that. What is reassuring is that the need for protecting ourselves from the shorter-term risks of AI should bring a governance framework that is a good first step towards protecting us from the long-term risks of loss of control of AI. Increasing the general awareness of AI risks, forcing more transparency and documentation, requiring organizations to do their best to assess and avoid potential risks before deploying AI systems, introducing independent watchdogs to monitor new AI developments, etc would all contribute not just to mitigating short-term risks but also helping with longer-term ones.

I would add to this that, apart from the legal and political aspects, there are also economic[7] and infrastructural aspects of building up the civilisational immune system against misaligned AI developments and rogue actors.

Some economic and infrastructural restructuring ideas are presented in the Gaia network architecture paper which I already referred to earlier. I pointed out some more infrastructural and economic inadequacies here:

  • There are no systems of trust and authenticity verification at the root of internet communication (see
  • The storage of information is centralised enormously (primarily in the data centres of BigCos such as Google, Meta, etc.)
  • Money has no trace, so one may earn money in arbitrary malicious or unlawful ways (i.e., gain instrumental power) and then use it to acquire resources from respectable places, e.g., paying for ML training compute at AWS or Azure and purchasing data from data providers. Formal regulations such as compute governance and data governance and human-based KYC procedures can only go so far and could probably be social-engineered by a superhuman imposter or persuader AI.

See also “Information security considerations for AI and the long term future” (Ladish & Heim, 2022).

  1. ^

    In technical terms, “better” here could be something like occupying a better Pareto front on the usefulness/efficiency vs. safety/robustness tradeoff chart; however, using the proper logic of risk taking may lead to a different formulation.

  2. ^

    Tristan Deleu, António Góis, Chris Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, Yoshua Bengio, “Bayesian Structure Learning with Generative Flow Networks”, UAI’2022, arXiv:2202.13903, February 2022.

  3. ^

    Nan Rosemary Ke, Silvia Chiappa, Jane Wang, Anirudh Goyal, Jorg Bornschein, Melanie Rey, Theophane Weber, Matthew Botvinick, Michael Mozer, Danilo Jimenez Rezende, “Learning to Induce Causal Structure”, ICLR 2023, arXiv:2204.04875, April 2022.

  4. ^

    Noah Hollmann, Samuel Müller, Katharina Eggensperger, Frank Hutter, “TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second”, ICLR 2023, arXiv:2207.01848, July 2022.

  5. ^

    Edward Hu, Nikolay Malkin, Moksh Jain, Katie Everett, Alexandros Graikos, Yoshua Bengio, “GFlowNet-EM for learning compositional latent variable models”, arXiv:2302.06576.

  6. ^

    Later, AI might acquire such a powerful morphological intelligence that it could be able to model entire human brains itself and thus could in principle take away this “phenomenological” role from humans, too. See “Morphological intelligence, superhuman empathy, and ethical arbitration” on this topic.

  7. ^

    The economic aspect may be merged with the political one, since political economy is often viewed as an indivisible system and a single area of study.

2 comments

I would make a clear distinction between the risk of AGI going rogue and the risk of AGI being used by people with poor ethics. In general, the problem of preventing a machine from accidentally doing harm due to a malfunction is very different from preventing people from using it maliciously. If the idea of the AI scientist can solve the first problem, then it is worth promoting.

Preventing bad actors from using AI is difficult in general because they could use an open-source version or develop one on their own; state actors especially could do that. Thus, IMHO, the best way to prevent, for example, North Korea from deciding to use AI against the US is for the US to have a superior AI of its own.