In his AI Safety “Success Stories” post, Wei Dai writes:

[This] comparison table makes Research Assistant seem a particularly attractive scenario to aim for, as a stepping stone to a more definitive success story. Is this conclusion actually justified?

I share Wei Dai's intuition that the Research Assistant path is neglected, and I want to better understand the safety problems involved in this path.

Specifically, I'm envisioning AI research assistants, built without any kind of reinforcement learning, that help AI alignment researchers identify, understand, and solve AI alignment problems. Some concrete examples:

Possible with yesterday's technology: Document clustering that automatically organizes every blog post about AI alignment. Recommendation systems that find AI alignment posts similar to the one you're reading & identify connections between the thinking of various authors.

May be possible with current or near future technology: An AI chatbot, trained on every blog post about AI alignment, which makes the case for AI alignment to skeptics or attempts to shoot down FAI proposals. Text summarization software that compresses a long discussion between two forum users in a way that both feel is accurate and fair. A NLP system that automatically organizes AI safety writings into a problem/solution table as I described in this post.

May be possible with future breakthroughs in unsupervised learning, generative modeling, natural language understanding, etc.: An AI system that generates novel FAI proposals, or writes code for an FAI directly, and tries to break its own designs. An AI system that augments the problem/solution table from this post with new rows and columns generated based on original reasoning.

What safety problems are involved in creating research assistants of this sort? I'm especially interested in safety problems which haven't yet received much attention, and safety problems with advanced assistants based on future breakthroughs.

New Answer
New Comment
13 comments, sorted by Click to highlight new comments since: Today at 5:01 PM

It seems like the main problem is making sure nobody's getting systematically misled. To help humans make the right updates, the AI has to communicate not only accurate results, but well-calibrated uncertainties. It also has to interact with humans in a way that doesn't send the wrong signals (more a problem to do with humans than to do with AI).

This is very much on the near-term side of the near/long term AI safety work dichotomy. We don't need the AI to understand deception as a category, and why it's bad, so that it can make plans that don't involve deceiving us. We just need its training / search process (which we expect to more or less understand) to suppress incentives for deception to an acceptable range, on a limited domain of everyday problems.

(I'm probably a bigger believer in the significance of this dichotomy than most. I think looking at an AI's behavior and then tinkering with the training procedure to eliminate undesired behavior in the training domain is a perfectly good approach to handing near-term misalignment like overconfident advisor-chatbots, but eventually we want to switch over to a more scalable approach that will use few of the same tools.)

I agree well-calibrated uncertainties are quite valuable, but I'm not convinced they are essential for this sort of application. For example, if my assistant tells me a story about how my proposed FAI could fail, if my assistant is overconfident in its pessimism, then the worst case is that I spend a lot of time thinking about the failure mode without seeing how it could happen (not that bad). If my assistant is underconfident, and tells me a failure mode is 5% likely when it's really 95% likely, it still feels like my assistant is being overall helpful if the failure case is one I wasn't previously aware of. To put it another way, if my assistant isn't calibrated, it feels like I should just be able to ignore its probability estimates and get good use out if it.

but eventually we want to switch over to a more scalable approach that will use few of the same tools.

I actually think the advisor approach might be scaleable, if advisor_1 has been hand-verified, and advisor_1 verifies advisor_2, who verifies advisor_3, etc.

May be possible with future breakthroughs in unsupervised learning, generative modeling, natural language understanding, etc.: An AI system that generates novel FAI proposals, or writes code for an FAI directly, and tries to break its own designs.

It seems worth pointing out that due to the inner alignment problem, we shouldn't assume that naively training, say, unsupervised learning models with human-level capabilities (e.g. for the purpose of generating novel FAI proposals) will be safe — conditioned on it being possible capabilities-wise.

Are you referring to the possibility of unintended optimization, or is there something more?

If "unintended optimization" referrers only to the inner alignment problem, then there's also the malign prior problem.

Are you referring to the possibility of unintended optimization

Yes (for a very broad interpretation of 'optimization'). I mentioned some potential failure modes in this comment.

Do you have any thoughts on how specifically those failure modes might come about?

Those specific failure modes seem to me like potential convergent instrumental goals of arbitrarily capable systems that "want to affect the world" and are in an air-gapped computer.

I'm not sure whether you're asking about my thoughts on:

  1. how can '(un)supervised learning at arbitrarily large scale' produce such systems; or

  2. conditioned on such systems existing, why might they have convergent instrumental goals that look like those failure modes.

Those specific failure modes seem to me like potential convergent instrumental goals of arbitrarily capable systems that "want to affect the world" and are in an air-gapped computer.

My understanding is convergent instrumental goals are goals which are useful to agents which want to achieve a broad variety of utility functions over different states of matter. I'm not sure how the concept applies in other cases. Like, if we aren't using RL, and there is no unintended optimization, why specifically would there be pressure to achieve convergent instrumental goals? (I'm not trying to be rhetorical or antagonistic--I really want to hear if you can think of something.)

I'm interested in #1. It seems like the most promising route is to prevent unintended optimization from arising in the first place, instead of trying to outwit a system that's potentially smarter than we are.

Sorry for the delayed response!

My understanding is convergent instrumental goals are goals which are useful to agents which want to achieve a broad variety of utility functions over different states of matter. I'm not sure how the concept applies in other cases.

I'm confused about the "I'm not sure how the concept applies in other cases" part. It seems to me that 'arbitrarily capable systems that "want to affect the world" and are in an air-gapped computer' are a special case of 'agents which want to achieve a broad variety of utility functions over different states of matter'.

Like, if we aren't using RL, and there is no unintended optimization, why specifically would there be pressure to achieve convergent instrumental goals?

I'm not sure what's the interpretation of 'unintended optimization', but I think that a sufficiently broad interpretation would cover the failure modes I'm talking about here.

I'm interested in #1. It seems like the most promising route is to prevent unintended optimization from arising in the first place, instead of trying to outwit a system that's potentially smarter than we are.

I agree. So the following is a pending question that I haven't addressed here yet: Would '(un)supervised learning at arbitrarily large scale' produce arbitrarily capable systems that "want to affect the world"?

I won't address this here, but I think this is a very important question that deserves a thorough examination (I plan to reply here with another comment if I'll end up writing something about it). For now I'll note that my best guess is that most AI safety researchers think that it's at least plausible (>10%) that the answer to that question is "yes".

I believe that researchers tend to model Oracles as agents that have a utility function that is defined over world states/histories (which would make less sense if they are confident that we can use supervised learning to train an arbitrarily powerful Oracle that does not 'want to affect the world'). Here's some supporting evidence for this:

  • Stuart Armstrong and Xavier O'Rourke wrote in their Safe Uses of AI Oracles paper:

    we model the Oracle as a reward-maximising agent facing an MDP, who has a goal of escaping (meaning the Oracle gets the maximum possible reward for escaping its containment, and a strictly lower reward in other situations).

  • Stuart Russell wrote in his book Human Compatible (2019):

    if the objective of the Oracle AI system is to provide accurate answers to questions in a reasonable amount of time, it will have an incentive to break out of its cage to acquire more computational resources and to control the questioners so that they ask only simple questions.

I'm confused about the "I'm not sure how the concept applies in other cases" part. It seems to me that 'arbitrarily capable systems that "want to affect the world" and are in an air-gapped computer' are a special case of 'agents which want to achieve a broad variety of utility functions over different states of matter'.

Well, the reason I mentioned the "utility function over different states of matter" thing is because if your utility function isn't specified over states of matter, but is instead specified over your actions (e.g. behave in a way that's as corrigible as possible), you don't necessarily get instrumental convergence.

I'm not sure what's the interpretation of 'unintended optimization', but I think that a sufficiently broad interpretation would cover the failure modes I'm talking about here.

"Unintended optimization. First, the possibility of mesa-optimization means that an advanced ML system could end up implementing a powerful optimization procedure even if its programmers never intended it to do so." - Source. "Daemon" is an older term.

I believe that researchers tend to model Oracles as agents that have a utility function that is defined over world states/histories (which would make less sense if they are confident that we can use supervised learning to train an arbitrarily powerful Oracle that does not 'want to affect the world').

My impression is that early thinking about Oracles wasn't really informed by how (un)supervised systems actually work, and the intellectual momentum from that early thinking has carried to the present, even though there's no real reason to believe these early "Oracle" models are an accurate description of current or future (un)supervised learning systems.

Well, the reason I mentioned the "utility function over different states of matter" thing is because if your utility function isn't specified over states of matter, but is instead specified over your actions (e.g. behave in a way that's as corrigible as possible), you don't necessarily get instrumental convergence.

I suspect that the concept of utility functions that are specified over your actions is fuzzy in a problematic way. Does it refer to utility functions that are defined over the physical representation of the computer (e.g. the configuration of atoms in certain RAM memory cells that their value represents the selected action)? If so, we're talking about systems that 'want to affect (some part of) the world', and thus we should expect such systems to have convergent instrumental goals with respect to our world (e.g. taking control over as much resources in our world as possible).

My impression is that early thinking about Oracles wasn't really informed by how (un)supervised systems actually work, and the intellectual momentum from that early thinking has carried to the present, even though there's no real reason to believe these early "Oracle" models are an accurate description of current or future (un)supervised learning systems.

It seems possible that something like this has happened. Though as far as I know, we don't currently know how to model contemporary supervise learning at an arbitrarily large scale in complicated domains.

How do you model the behavior of the model on examples outside the training set? If your answer contains the phrase "training distribution" then how do you define the training distribution? What makes the training distribution you have in mind special relative to all the other training distributions that could have produced the particular training set that you trained your model on?

Therefore, I'm sympathetic to the following perspective, from Armstrong and O'Rourke (2018) (the last sentence was also quoted in the grandparent):

we will deliberately assume the worst about the potential power of the Oracle, treating it as being arbitrarily super-intelligent. This assumption is appropriate because, while there is much uncertainty about what kinds of AI will be developed in future, solving safety problems in the most difficult case can give us an assurance of safety in the easy cases too. Thus, we model the Oracle as a reward-maximising agent facing an MDP, who has a goal of escaping (meaning the Oracle gets the maximum possible reward for escaping its containment, and a strictly lower reward in other situations).

I suspect that the concept of utility functions that are specified over your actions is fuzzy in a problematic way. Does it refer to utility functions that are defined over the physical representation of the computer (e.g. the configuration of atoms in certain RAM memory cells that their value represents the selected action)? If so, we're talking about systems that 'want to affect (some part of) the world', and thus we should expect such systems to have convergent instrumental goals with respect to our world (e.g. taking control over as much resources in our world as possible).

No, it's not a utility function defined over the physical representation of the computer!

The Markov decision process formalism used in reinforcement learning already has the action taken by the agent as one of the inputs which determines the agent's reward. You would have to do a lot of extra work to make it so when the agent simulates the act of modifying its internal circuitry, the Markov decision process delivers a different set of rewards after that point in the simulation. Pretty sure this point has been made multiple times, you can see my explanation here. Another way you could think about it is that goal-content integrity is a convergent instrumental goal, so that's why the agent is not keen to destroy the content of its goals by modifying its internal circuits. You wouldn't take a pill that made you in to a psychopath even if you thought it'd be really easy for you to maximize your utility function as a psychopath.

It's fine to make pessimistic assumptions but in some cases they may be wildly unrealistic. If your Oracle has the goal of escaping instead of the goal of answering questions accurately (or similar), it's not an "Oracle".

Anyway, what I'm interested in is concrete ways things could go wrong, not pessimistic bounds. Pessimistic bounds are a matter of opinion. I'm trying to gather facts. BTW, note that the paper you cite doesn't even claim their assumptions are realistic, just that solving safety problems in this worst case will also address less pessimistic cases. (Personally I'm a bit skeptical--I think you ideally want to understand the problem before proposing solutions. This recent post of mine provides an illustration.)