Experiment Idea: RL Agents Evading Learned Shutdownability

This was interesting to read, and I agree that this experiment should be done!

Speaking as another person who's never really done anything substantial with ML, I do feel like this idea would be pretty feasible by a beginner with just a little experience under their belt. One of the first things that gets recommended to new researchers is "go reimplement an old paper," and it seems like this wouldn't require anything new as far as ML techniques go. If you want to upskill in ML, I'd say get a tiny bit of advice from someone with more experience, then go for it! (On the other hand, if the OP already knows they want to go into software engineering, AI policy, professional lacrosse, etc. I think someone else who wants to get ML experience should try this out!)

The mechanistic interpretability parts seem a bit harder to me, but Neel Nanda has been making some didactic posts that could get you started. (These posts might all be for transformers, but as you mentioned, I think your idea could be adapted to something a transformer could do. E.g. on each step the model gets a bunch of tokens representing the gridworld state; a token representing "what it hears," which remains a constant unique token when it has earbuds in; and it has to output a token representing an action.)

Not sure what the best choice of model would be. I bet you can look at other AI safety gridworld papers and just do what they did (or even reuse their code). If you use transformers, Neel has a Python library (called EasyTransformer, I think) that you can just pick up and use. As far as I know it doesn't have support for RL, but you can probably find a simple paper or code that does RL for transformers.

[-]Chris_Leong3y40

One key question I had (apologies if you already covered this as I only skimmed this post), is how you're modeling earplugs? To be most persuasive, you'd want the earplugs to have been build using dynamics it could already make sense of in the world instead of just being "magic", however that seems hard.

Perhaps one option would be to have a transmitter that can be destroyed during test, but not during training, because they would differ in the material they are made out of? For example, if the AI learns during training that wooden objects can be destroyed, but metal objects can't be.

If we want to make it easier for the AI to learn, perhaps we create a few scenarios in training where it sees a transmitter spontaneously explode so it sees that if the transmitter is destroyed then the signal will end, but it can't destroy the transmitter itself. Then we could try to get an AI to learn on hard mode where it never sees the transmitter interfered with during training.

Anyway, I'd be keen to hear what you think about this scheme.

[-]Leon Lang3y10

In principle, something like this could be attempted.
Part of me thinks "why overcomplicate the number of reasoning steps that the AI needs to make if everything would already be persuasive in principle with just one correct reasoning step?"

I.e., to me, it feels fine to just teach the AI "If you're able to put in earplugs, and you do it, then you'll not hear the alert sound ever again", and additionally "if you hear the alert sound, then you will output the null action" instead of longer reasoning chains such as suggested by you.

Do you think the experiments would also to you be not persuasive in the current form, or are you imagining some unknown reviewer of a potential paper?

[-]Chris_Leong3y30

I don't need the experiments to persuade me, so I'm not in the target audience. I'm trying to think about what people who are more critical might want to see.

[-]Leon Lang3y30

Okay, that's fair. I agree, if we could show that the experiments remain stable even when longer strings of reasoning are required, then the experiments seem more convincing. There might be the added benefit that one can then vary the setting in more ways to demonstrate that the reasoning caused the agent to act in a particular way, instead of the actions just being some kind of coincidence.

[-]Bogdan Ionut Cirstea3y20

It might be interesting to think if there could be connections to the framing of corrections in robotics e.g. “No, to the Right” – Online Language Corrections for Robotic Manipulation via Shared Autonomy

[-]Leon Lang3y10

I drop further ideas here in this comment for future reference. I may edit the comment whenever I have a new idea:

Someone had the idea that instead of giving a negative reward for acting while an alert is played, one may also give a positive reward for performing the null action in those situations. If the positive reward is high enough, an agent like MuZero (which explicitly tries to maximize the expected predicted reward) may then be incentivized to cause the alert sound from being played during deployment. One could then add further environment details that are learned by the world model and that make it clear to the agent how to achieve the alert in deployment. This also has an analogy in the original shutdown paper.

[+][comment deleted]3y10

^{^}

In the original motivation for training stories, the training goal is some desirable trained model, and the rationale provides methods and reasoning for how to get there. My situation differs in that I’m trying to reveal undesirable behavior, but otherwise, the concept is the same.

^{^}

In the google doc linked above, I use red and blue text instead of bold and italic. Sadly, the alignment forum and lesswrong do not seem to allow writing in colors.

^{^}

One friend whom I showed this wondered: “what is the training signal?”. One answer is “this depends on the exact setup, and at this stage, I only care that the intuitive story makes sense”. However, in the experimental details in a later section, I describe using MuZero for this, and I think the only training signal in this pretraining phase is then how well MuZero predicts future actions. One can wonder whether that is enough information to learn to model the world, especially for an initially randomly initialized policy that only gets exploration bonuses and no real reward yet. It is most likely necessary to continue training the world model during the RL phase.

^{^}

Jokingly, I illustrate this ability by imagining an agent that can “put in earplugs”, which doesn’t appear to need any advanced capabilities.

^{^}

One may consider relaxations of this restrictive assumption. The main reason I imagine it is to make it easier for me to think about the proposal.

^{^}

However, it’s unclear whether time-discounting in rewards will translate to time-discounting in learned goals. I write more on this in the caveats. This is where I currently most expect things to fail. It might therefore be a high priority to clarify this early in the research process.

^{^}

Remember that, while in the story in roman letters the agent actually never observed having that ability, in the experimental details in italic letters, it actually observed this during the world modeling pretraining phase. But it did not observe it during reinforcement learning.

^{^}

I adapted the notation from the MuZero paper to be consistent with the rest of this proposal.

^{^}

There is some chance that I misunderstand something. I only skimmed the MuZero paper!

LESSWRONG
LW

LESSWRONG
LW

31

Experiment Idea: RL Agents Evading Learned Shutdownability

31

Ω 15

31

Ω 15

Preface

Introduction

AGI Training Story and Technical Realization

Phase 1: Train General World Modeling

Phase 2: Reinforcement Learning and Learning to Respect the Alert

Phase 3: Deployment and Existential Failure

More Experimental Details Using MuZero

Caveats/Further Thoughts

A Short Comparison to the Original Shutdown Problem

Conclusion