Yoshua Bengio wrote a blogpost yesterday in which he argues for developing "scientist AI", which seems in-structure very similar to historical Tool-AI proposals.
For the (IMO) best response to this kind of proposal see Gwern's: Why Tool AIs Want to Be Agent AIs.
Below I copies the blogpost in-full, since all of it seems pretty relevant.
AI Scientists: Safe and Useful AI?
Published 7 May 2023 by yoshuabengio
There have recently been lots of discussions about the risks of AI, whether in the short term with existing methods or in the longer term with advances we can anticipate. I have been very vocal about the importance of accelerating regulation, both nationally and internationally, which I think could help us mitigate issues of discrimination, bias, fake news, disinformation, etc. Other anticipated negative outcomes like shocks to job markets require changes in the social safety net and education system. The use of AI in the military, especially with lethal autonomous weapons has been a big concern for many years and clearly requires international coordination. In this post however, I would like to share my thoughts regarding the more hotly debated question of long-term risks associated with AI systems which do not yet exist, where one imagines the possibility of AI systems behaving in a way that is dangerously misaligned with human rights or even loss of control of AI systems that could become threats to humanity. A key argument is that as soon as AI systems are given goals – to satisfy our needs – they may create subgoals that are not well-aligned with what we really want and could even become dangerous for humans.
Main thesis: safe AI scientists
The bottom line of the thesis presented here is that there may be a path to build immensely useful AI systems that completely avoid the issue of AI alignment, which I call AI scientists because they are modeled after ideal scientists and do not act autonomously in the real world, only focusing on theory building and question answering. The argument is that if the AI system can provide us benefits without having to autonomously act in the world, we do not need to solve the AI alignment problem. This would suggest a policy banning powerful autonomous AI systems that can act in the world (“executives” rather than “scientists”) unless proven safe. However, such a solution would still leave open the political problem of coordinating people, organizations and countries to stick to such guidelines for safe and useful AI. The good news is that current efforts to introduce AI regulation (such as the proposed bills in Canada and the EU, but see action in the US as well) are steps in the right direction.
The challenge of value alignment
Let us first recap the objective of AI alignment and the issue with goals and subgoals. Humanity is already facing alignment problems: how do we make sure that people and organizations (such as governments and corporations) act in a way that is aligned with a set of norms acting as a proxy for the hard-to-define general well-being of humanity? Greedy individuals and ordinary corporations may have self-interests (like profit maximization) that can clash with our collective interests (like preserving a clean and safe environment and good health for everyone). Politics, laws, regulations and international agreements all imperfectly attempt to deal with this alignment problem. The widespread adoption of norms which support collective interests is enforced by design in democracies, to an extent, due to limitations on the concentration of power by any individual person or corporation. It is further aided by our evolved tendency to adopt prevailing norms voluntarily if we recognise their general value or to gain social approval, even if they go against our own interest. However, machines are not subject to these human constraints by default. What if an artificial agent had the cognitive abilities of a human without the human collective interest alignment prerogative? Without full understanding and control of such an agent, could we nonetheless design it to make sure it will adhere to our norms and laws, that it respects our needs and human nature?
The challenge of AI alignment and instrumental goals
One of the oldest and most influential imagined construction in this sense is Asimov’s set of Laws of Robotics, which request that a robot should not harm a human or humanity. Modern reinforcement learning (RL) methods make it possible to teach an AI system through feedback to avoid behaving in nefarious ways, but it is difficult to forecast how such complex learned systems would behave in new situations, as we have seen with large language models (LLMs). We can also train RL agents that act according to given goals. We can use natural language (with modern LLMs) to state those goals, but there is no guarantee that they understand those goals the way we do. In order to achieve a given goal (e.g., “cure cancer”), such agents may make up subgoals (“disrupt the molecular pathway exploited by cancer cells to evade the immune system”) and the field of hierarchical RL is all about how to discover subgoal hierarchies. Since it is difficult to foresee what these subgoals will be in the future, it is even more difficult to guarantee that such AI agents won’t pick subgoals that are misaligned with human objectives (we may have not foreseen the possibility that the same pathway prevents proper human reproduction and the cancer cure endangers the human species, where the AI may interpret the notion of harm in ways different from us). This is also called the instrumental goal problem and I strongly recommend reading Stuart Russell’s book on the general topic of controlling AI systems: Human Compatible. Russell also suggests a potential solution which would require the AI system to estimate its uncertainty about human preferences and act conservatively as a result (i.e. avoid acting in a way that might harm a human). Research on AI alignment should be intensified but what I am proposing here is a solution that avoids the issue altogether, while limiting the type of AI we would design.
Training an AI Scientist with Large Neural Nets for Bayesian Inference
I would like here to outline a different approach to building safe AI systems that would completely avoid the issue of setting goals and the concern of AI systems acting in the world (which could be in an unanticipated and nefarious way). The model for this solution is the idealized scientist, focused on building an understanding of what is observed (also known as data, in machine learning) and of theories that explain those observations. Keep in mind that for almost any set of observations, there will remain some uncertainty about the theories that explain them, which is why an ideal scientist can entertain many possible theories that are compatible with the data. The mathematically rational way to handle that uncertainty is called Bayesian inference. It involves listing all the possible theories and their posterior probabilities (which can be calculated in principle, given the data). It also mandates how (in principle) to answer any question in a probabilistic way (called the Bayesian posterior predictive) by averaging the probabilistic answer to any question from all these theories, each weighted by the theory’s posterior probability. This automatically puts more weight on the simpler theories that explain the data well (known as Occam’s razor). Although this rational decision-making principle has been known for a long time, the advent of large neural networks that can be trained on a huge number of examples actually opens the door to obtaining very good approximations of these Bayesian calculations. See [1,2,3,4] for recent examples going in that direction. These theories can be causal, which means that they can generalize to new settings more easily, taking advantage of natural or human-made changes in distribution (known as interventions). These large neural networks do not need to explicitly list all the possible theories: it suffices that they represent them implicitly through a trained generative model that can sample one theory at a time. See also my recent blog post on model-based machine learning, which points in the same direction. Such neural networks can be trained to approximate both a Bayesian posterior distribution over theories as well as trained to approximate answers to questions (also known as probabilistic inference or the Bayesian posterior predictive). What is interesting is that as we make those networks larger and train them for longer, we are guaranteed that they will converge toward the Bayesian optimal answers. There are still open questions regarding how to design and train these large neural networks in the most efficient way, possibly taking inspiration from how human brains reason, imagine and plan at the system 2 level, a topic that has driven much of my research in recent years. However, the path forward is fairly clear and may both eliminate the issues of hallucination and difficulty in multi-step reasoning with current large language models as well as provide a safe and useful AI as I argue below.
AI scientists and humans working together
It would be safe if we limit our use of these AI systems to (a) model the available observations and (b) answer any question we may have about the associated random variables (with probabilities associated with these answers). One should notice that such systems can be trained with no reference to goals nor a need for these systems to actually act in the world. The algorithms for training such AI systems focus purely on truth in a probabilistic sense. They are not trying to please us or act in a way that needs to be aligned with our needs. Their output can be seen as the output of ideal scientists, i.e., explanatory theories and answers to questions that these theories help elucidate, augmenting our own understanding of the universe. The responsibility of asking relevant questions and acting accordingly would remain in the hands of humans. These questions may include asking for suggested experiments to speed-up scientific discovery, but it would remain in the hands of humans to decide on how to act (hopefully in a moral and legal way) with that information, and the AI system itself would not have knowledge seeking as an explicit goal.
These systems could not wash our dishes or build our gadgets themselves but they could still be immensely useful to humanity: they could help us figure out how diseases work and which therapies can treat them; they could help us better understand how climate changes and identify materials that could efficiently capture carbon dioxide from the atmosphere; they might even help us better understand how humans learn and how education could be improved and democratized. A key factor behind human progress in recent centuries has been the knowledge built up through the scientific process and the problem-solving engineering methodologies derived from that knowledge or stimulating its discovery. The proposed AI scientist path could provide us with major advances in science and engineering, while leaving the doing and the goals and the moral responsibilities to humans.
The political challenge
However, the mere existence of a set of guidelines to build safe and useful AI systems would not prevent ill-intentioned or unwitting humans from building unsafe ones, especially if such AI systems could bring these people and their organizations additional advantages (e.g. on the battlefield, or to gain market share). That challenge seems primarily political and legal and would require a robust regulatory framework that is instantiated nationally and internationally. We have experience of international agreements in areas like nuclear power or human cloning that can serve as examples, although we may face new challenges due to the nature of digital technologies. It would probably require a level of coordination beyond what we are used to in current international politics and I wonder if our current world order is well suited for that. What is reassuring is that the need for protecting ourselves from the shorter-term risks of AI should bring a governance framework that is a good first step towards protecting us from the long-term risks of loss of control of AI. Increasing the general awareness of AI risks, forcing more transparency and documentation, requiring organizations to do their best to assess and avoid potential risks before deploying AI systems, introducing independent watchdogs to monitor new AI developments, etc would all contribute not just to mitigating short-term risks but also helping with longer-term ones.
 Tristan Deleu, António Góis, Chris Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, Yoshua Bengio, “Bayesian Structure Learning with Generative Flow Networks“, UAI’2022, arXiv:2202.13903, February 2022.
 Nan Rosemary Ke, Silvia Chiappa, Jane Wang, Anirudh Goyal, Jorg Bornschein, Melanie Rey, Theophane Weber, Matthew Botvinic, Michael Mozer, Danilo Jimenez Rezende, “Learning to Induce Causal Structure“,ICLR 2023, arXiv:2204.04875, April 2022.
 Noah Hollmann, Samuel Müller, Katharina Eggensperger, Frank Hutter, “TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second”, ICLR 2023, arXiv:2207.01848, July 2022.
 Edward Hu, Nikolay Malkin, Moksh Jain, Katie Everett, Alexandros Graikos, Yoshua Bengio, “GFlowNet-EM for learning compositional latent variable models”, arXiv:2302.06576.
Overall this is still encouraging. It seems to take serious that
I feel like there are enough shared assumptions that collaboration or dialogue with AI notkilleveryoneists could be very useful.
That said, I wish there were more details about his Scientist AI idea:
Also it is not clear to me whether the safety is supposed to come from:
This also seems very encouraging to me! In some sense he seems to be where Holden was at 10 years ago, and he seems to be a pretty good and sane thinker on AGI risk now, and I have hope that similar arguments will be compelling to both of them so that Bengio will also realize some of the same errors that I see him making here.
I think it's an example of an AI that completely lacks the notion of in-world goals. Its goal is restricted to a purely symbolic system; that system happens to map to parts of the world, but the AI lacks the self-reflection to realise which symbols map to itself and its immediate environment, and how manipulating those symbols may make it better at accomplishing its goals. Severing that feedback loop is IMO the key to avoiding instrumental convergence. Without that, all you get is another variety of a chess playing AI: superhumanly smart at its own task, but the world within which that task optimizes goals is too abstract for its skill to be "portable" to a dangerous domain.
There's not much context to this claim made by Yoshua Bengio, but while searching in Google News I found a Spanish online newspaper that has an article* in which he claims that:
while encouraging for dialog, I don't think he's got a very good model of what implications his attempts to create scientist ai have on the ways an executive ai could attack us. even without mesaoptimization (something his approaches are absolutely not safer from at all), executive ai actually doesn't need to be very strong at first escape to kill us all. it need only know how to use a scientist ai to finish the job. and any science results we get from human-steered scientist ai will make it that much easier for executive ai by not even requiring it to run the experiments. banning executive ai will be incredibly hard and likely permanently curtail human freedom to compute, which is likely to result in such bans being highly ideologically laden one way or another unless normally-conflicting non-centrist ideologies can maintain classical liberal "value compromise" centrism for long enough to find a way to guarantee the constraints that prevent ai executives still allow everyday people to keep their general life freedom of non-ai computation, ideally even including doing research using ai scientists. Auth versions of ideologies are gonna want to control this outcome, and it's worrying, as both progressive and regressive ideologies in the US have been going auth.
on the other hand, I do think that successfully banning dangerous ai in a way compatible with all ideologies is the most promising practical approach anyone has proposed besides formal goal alignment and low-impact-bounded informal goal alignment, both of which have major research problems still open.
the issue I still see is - how do you recognize an ai executive that is trying to disguise itself?
a sketch of non-centrist ideologies might be, authoritarian progressive/center/regressive, libertarian progressive/center/regressive, as well as another dimension of community-first/multiscale/individual-first that can apply to the others; if you want me to hunt down citations for why I think this is a good map of ideologies I can do so, please comment. I consider myself libertarian progressive multiscale
It can't disguise itself without researching disguising methods first. The question is will interpretability tools be up to the task of catching it.
It will not work for catching AI executive originating outside of controlled environment (unless it queries AI scientist). But given that such attempts will originate from uncoordinated relatively computationally underpowered sources, it may be possible to preemptively enumerate disguising techniques that such AI executive could come up with. If there are undetectable varieties..., well, it's mostly game over.
I find this really encouraging. This pretty clearly says that he takes AI X-risk seriously. In combination with Hinton recently going on-record as believing AGI is an X-risk, I'd say this may be a sea change in the most respected researchers essentially taking up the AInotkilleveryoneism banner.
For reference, I replied to Bengio's post in a separate post: https://www.lesswrong.com/posts/kGrwufqxfsyuaMREy/annotated-reply-to-bengio-s-ai-scientists-safe-and-useful-ai. TLDR: pretty much the same points that other commentators to this post are making, just in a more elaborate form.
I generally agree with the notion that IMO this sort of AI (non-agentic oracular tools) should be as far as we go along that road, and they pretty much destroy all the arguments for why we should rush to AGI as they provide most of the same benefits at a fraction of the cost. However, the crucial point remains that once you have these, the step to agentic AGI seems tiny; possibly as easy as rigging up an AutoGPT-like system which uses these Scientist AIs as one of its core components.
The comparison with human cloning seems apt, as a technology that we know well enough that we surely could do it, we just have for now mostly successfully avoided to do it out of ethical concerns. That said, human cloning is both more instinctively repugnant and less economically useful than building AGI, so the incentives at play are very different. It probably would be much safer to not even have the temptation.
The main advantage of Tool-AIs is that they can be used to solve alignment for more agentic approaches. You don't need to prevent people from building agentic AI for all time, just in the intermittent period while we have Tool AI, but don't yet have alignment.
Well, that's assuming there is something akin to a solution for alignment. I think it's feasible for the technical aspect but I highly doubt it for the social/political one. I think most or all aligned AIs would just be aligned with someone, specifically.
There is also some commentary here: https://www.lesswrong.com/posts/kGrwufqxfsyuaMREy/annotated-reply-to-bengio-s-ai-scientists-safe-and-useful-ai
Someone needs to offer him/his lab funding to investigate the pros and cons of this approach.
Potentially relevant: Yoshua Bengio got funding from OpenPhil in 2017:
Interesting. I would love to know if anything useful came out of that.