Without causal assumptions or taking actions, it is simply not possible to deduce the correct causal model.
I think this is true, but you're eliding how accessible causal assumptions can be. The granddaddy of all causal assumptions is that causes generally precede effects. Another useful one for an AI is "humans are usually right about the causal language they use."
Another way to say that second one is semantic: "the language the AI uses to describe causes should usually match the language the humans use to describe similar causes." Causation doesn't have to be ontologically basic; we just have to be on the same page about it.
In that case it isn't a scientist AI but rather a knowledge indexing AI. It can pull together causal concepts that humans have written about, but without RL-style interaction with the environment, it's not a scientist.
Causes preceding effects is generally true but it's not going to help you solve much.
Thanks for the post! I wrote up something along similar lines but ended up declining to cross-post it. Basically: I largely agree with you, I have yet more quibbles, and I'm a little more pessimistic about your upsides.
I may have put extra focus on the upsides in an attempt to be more even-handed. The things that I say about the upsides are true and I do believe them but they aren't really the meat of the paper.
I'm realizing that sharing writing and thoughts is largely appreciated. The EA forum has explicit events to try getting people to publish their drafts. In that spirit I decided to post this here, and I'm glad that some people are finding it valuable.
Epistemic Status: I wrote this for an application and then realized it might be of interest to others or spark a conversation. Yoshua Bengio and LawZero are important players in AI Safety, so I think we should have a conversation about their ideas.
I have two substantial concerns with Yoshua Bengio’s Scientist AI. One is that it fails to think through the consequences of success, and will fall into the same kind of alignment failures as agentic AI. A second is that Bengio’s method for making a scientist AI would fall short for both practical and theoretical reasons.
Even leaving aside some of his philosophically difficult claims, such as wanting to make an AI that can't model itself but can still predict the world, the scientist AI as described would be unsafe. If someone asked, "How can cancer be cured?", it would output some sequence of steps, which could involve making an agent AI to solve cancer. This is the kind of failure I've seen discussed in Yudkowsky's blog posts from over ten years ago on why tool AI is not a solution to alignment.
Practically speaking, intelligently seeking out information in an agentic way tends to be the best way to do science. Stopping your scientist AI from doing that weakens it. By handicapping their scientist AI, Bengio’s team will always be behind labs that allow their AI (scientist or not) to explore.
Even worse, there are strong theoretical reasons to expect understanding the world to be impossible without taking actions. This comes from the field of causal inference, pioneered by figures such as Judea Pearl. “Correlation does not imply causation” is a common mantra for a reason. A correlation between X and Y is evidence that X causes Y, that Y causes X (reverse causality), that something else causes both (common cause), or that the sample was selected in a way that induces the correlation (selection bias), but it does not say which. Without causal assumptions or taking actions, it is simply not possible to deduce the correct causal model. Reinforcement learning, the paradigm that Bengio correctly identifies as the thing that puts us most at risk, is also the training paradigm that allows AIs to learn new causal models. This tension exists throughout the paper. Bengio often writes about wanting to train the model for causal modeling, but he also says that his plan is based on conditional probabilities, which are inherently associative. No matter how much data he gives the model during training, this fact remains.
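A toy simulation can make the identifiability point concrete. The sketch below is an illustration only, not something taken from Bengio's paper: the two model labels, the correlation strength `RHO`, and the intervention value `2.0` are all assumptions chosen for the example. It builds two linear-Gaussian worlds, one where X causes Y and one where Y causes X, tuned so that their observational data look the same, and then compares what happens when X is set by an action.

```python
import math
import random

RHO = 0.6  # assumed correlation strength, chosen only for illustration
MODELS = ("x_causes_y", "y_causes_x")  # hypothetical labels for the two worlds


def sample(model, n, rng, do_x=None):
    """Draw n (x, y) pairs from a toy structural model.

    If do_x is given, X is set by intervention, which overrides whatever
    mechanism normally produces X in that model.
    """
    noise_scale = math.sqrt(1 - RHO ** 2)  # keeps both variables at unit variance
    pairs = []
    for _ in range(n):
        if model == "x_causes_y":
            x = rng.gauss(0, 1) if do_x is None else do_x
            y = RHO * x + noise_scale * rng.gauss(0, 1)
        elif model == "y_causes_x":
            y = rng.gauss(0, 1)
            natural_x = RHO * y + noise_scale * rng.gauss(0, 1)
            x = natural_x if do_x is None else do_x
        else:
            raise ValueError(f"unknown model: {model}")
        pairs.append((x, y))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / math.sqrt(vx * vy)


def mean_y(pairs):
    return sum(y for _, y in pairs) / len(pairs)


rng = random.Random(0)
N = 20000

# Passive observation: both worlds yield about the same correlation.
obs_corr = {m: pearson(sample(m, N, rng)) for m in MODELS}

# Action: force X to 2.0 and look at the average Y in each world.
do_mean = {m: mean_y(sample(m, N, rng, do_x=2.0)) for m in MODELS}

for m in MODELS:
    print(m, "observed corr:", round(obs_corr[m], 2),
          "mean Y under do(X=2):", round(do_mean[m], 2))
```

In both worlds the observed correlation comes out close to `RHO`, so no amount of passive data separates them. Under the intervention, the average Y moves to roughly `2.0 * RHO` when X causes Y and stays near zero when Y causes X. The difference only shows up once something acts on X, or once an assumption rules one of the two models out.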
There are other possibly insurmountable obstacles to the plan as well. Constructing a formal language to map to reality is extremely difficult if not impossible, and researchers have been trying to do that for decades. And it would be hard to train such a model without human data, but if you train on human data, you're back at pre-training an AI which can play a potentially malicious character.
Despite my criticisms, there are good aspects of the paper. Their short-term plan is something I would like to see: if I’m reading them correctly, it is fine-tuning an LLM to be good at hypothesizing what might go wrong with a user’s request. That is practically useful and can also make the system safer to use. Characterizing risky agentic AI systems by their “affordances, goal-directedness, and intelligence” is a good idea, even if enough intelligence can lead to the other two trivially. I will work “anytime preparedness” into my plans, so that I can contribute in both short and long timelines. I have already done something similar by doing both fieldbuilding (long-term payoff) and research (short-term payoff if the research is impactful), and I appreciate the handle.
I have a lot of respect for Yoshua Bengio, but his scientist AI plan is not doable for reasons that I'm sure other AI safety people have tried telling him before. Still, I'm glad that he is working on AI safety, and I expect his group to contribute something valuable, even if the plan as written can't work.