michaelcohen

Formal Solution to the Inner Alignment Problem

A policy outputs a distribution over pairs (d, a), where d ∈ {0, 1} and a is an action, and equations 3 and 4 define what this distribution is for the imitator. If it outputs (0, a), that means d = 0 and the action taken is a, and if it outputs (1, a), that means d = 1 and the action taken is a. When I say

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,

that's just describing the difference between equations 3 and 4. Look at equation 4 to see that when d = 1, the distribution over the action is equal to that of the demonstrator. So we describe the behavior that follows as "deferring to the demonstrator". If we look at the distribution over the action when d = 0, it's something else, so we say the imitator is "picking its own action".

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator

or picking one of the other actions???

The 0 means the imitator is picking the action, and the a means it's not picking another action that's not a.
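A toy sketch of how these (d, a) pairs compose into one distribution, assuming only that deferring (d = 1) reproduces the demonstrator's action distribution while d = 0 follows the imitator's own; all names here are illustrative, not from the paper:

```python
def imitator_joint_dist(defer_prob, own_dist, demo_dist):
    """Distribution over (d, a) pairs.

    With probability defer_prob, d = 1 and the action follows the
    demonstrator's distribution (the "deferring" case); otherwise d = 0
    and the action follows the imitator's own distribution.
    """
    joint = {}
    for a, p in demo_dist.items():
        joint[(1, a)] = defer_prob * p
    for a, p in own_dist.items():
        joint[(0, a)] = (1 - defer_prob) * p
    return joint

# Hypothetical two-action example.
dist = imitator_joint_dist(
    defer_prob=0.25,
    own_dist={"left": 0.9, "right": 0.1},
    demo_dist={"left": 0.4, "right": 0.6},
)
print(dist[(1, "right")])   # 0.25 * 0.6 = 0.15
```

So (0, a) and (0, a′) are distinct outcomes of the same distribution: outputting (0, a) is picking a in particular, not some other action.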

Formal Solution to the Inner Alignment Problem

What's the distinction between training and deployment when the model can always query for more data?

Formal Solution to the Inner Alignment Problem

It may be helpful to point to specific sections of such a long paper.

(Also, I agree that a neural network trained with that reward could produce a deceptive model that makes a well-timed error.)

Formal Solution to the Inner Alignment Problem

it certainly should have to consider different models of what the next transition will be like.

Yeah I was agreeing with that.

even though the Q learner is trained using its own point estimate of the next state, it isn't, at inference time, given access to that point estimate.

Right, but one thing the Q-network, in its forward pass, is trying to *reproduce* is the point estimate of the Q-value of the next state (since it doesn't have access to it). What it *isn't* trying to reproduce, because it isn't trained that way, is multiple models of what the Q-value might be at a given possible next state.
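For concreteness, this is the standard one-step TD target a DQN regresses toward; the target bootstraps a single point estimate of the next state's value, so nothing in training rewards the forward pass for entertaining several competing models of that value (a generic sketch, not code from any particular paper):

```python
import numpy as np

def q_values(params, state):
    # Stand-in for a Q-network's forward pass: one Q(s, a) per action.
    return params @ state

def td_target(params, reward, next_state, gamma=0.99):
    # The regression target treats max_a Q(s', a) -- the network's own
    # cached point estimate of the next state's value -- as if it were
    # the truth. No distribution over possible Q-values appears anywhere.
    return reward + gamma * np.max(q_values(params, next_state))

rng = np.random.default_rng(0)
params = rng.normal(size=(3, 4))        # 3 actions, 4-dimensional states
s_next = rng.normal(size=4)
print(td_target(params, reward=1.0, next_state=s_next))
```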

Formal Solution to the Inner Alignment Problem

I interpreted this bit as talking about RL

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

But taking us back out of RL, in a wide neural network with selective attention that enables many qualitatively different forward passes, gradient descent seems to be training the way different models get proposed (i.e. the way attention is allocated), since this happens in a single forward pass, and what we're left with is a modeling routine that is heuristically considering (and later comparing) very different models. And this should include any model that a human would consider.

I think that is the main thread of our argument, but now I'm curious if I was totally off the mark about Q-learning and policy gradient.

but overall I think these sorts of differences are pretty minor and shouldn't affect whether these approaches can reach general intelligence or not.

I had thought that maybe since a Q-learner is trained as if the cached point estimate of the Q-value of the next state is the Truth, it won't, in a single forward pass, consider different models about what the actual Q-value of the next state is. At most, it will consider different models about what the very next transition will be.

a) Does that seem right? and b) Aren't there some policy gradient methods that don't face this problem?
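On (b): Monte Carlo policy gradient methods such as REINFORCE are a natural candidate, since they weight log-probability gradients by full sampled returns rather than by a bootstrapped point estimate of the next state's value. A minimal sketch of the return-to-go computation, with made-up episode data:

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """Discounted return G_t for each step of one sampled episode.

    REINFORCE scales grad log pi(a_t | s_t) by G_t, a full sampled
    return -- no cached point estimate of a next state's value ever
    enters the update, unlike the one-step TD target of Q-learning.
    """
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

print(returns_to_go([0.0, 0.0, 1.0], gamma=0.5))
# Returns-to-go: [0.25, 0.5, 1.0]
```

(Actor-critic variants reintroduce a bootstrapped value estimate, so they would inherit the same issue.)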

Formal Solution to the Inner Alignment Problem

So, if I understand the paper right, if the demonstrator's policy was still in the set of top policies at time t, the agent would not take the action to kill the demonstrator, but would ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous-turn policies out of the top set.

This is exactly right.

But I am also reading in the abstract of the paper that

[...] queries to the demonstrator rapidly diminish in frequency

The reason this is consistent is that queries will become infrequent, but they will still be well-timed. The choice of whether to query *depends* on what the treacherous models are doing. So if the treacherous models wait for a long time, then we will have a long time where no queries are needed, and as soon as they decide to be treacherous, we thwart that plan with a query to the demonstrator. So it does not imply that

over time, it is likely that the demonstrator's policy might disappear from the top set
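One way to see why the timing works out: in a toy version of the scheme, the imitator queries exactly when the retained top policies disagree, so a treacherous policy triggers a query at the very moment it deviates (illustrative only; the paper's actual algorithm is more involved):

```python
def step(top_policies, state, demonstrator):
    """Act without querying only when every retained policy agrees."""
    proposed = {policy(state) for policy in top_policies}
    if len(proposed) == 1:
        return proposed.pop(), False      # unanimous: no query needed
    # Disagreement: defer to the demonstrator, and (in the real scheme)
    # demote the policies the demonstrator's action contradicts.
    return demonstrator(state), True

benign = lambda s: "safe"
treacherous = lambda s: "safe" if s < 100 else "defect"

# While the treacherous policy mimics the benign one, no queries occur;
# its first deviation is met with a query instead of a defection.
for s in (0, 50, 100):
    print(s, *step([benign, treacherous], s, demonstrator=benign))
```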

Does this thought experiment look reasonable or have I overlooked something?

Yep, perfectly reasonable!

What about the probability that the demonstrator's policy is still in the set of top policies at time t?

If we set the threshold for the top set small enough, we can make it arbitrarily likely that the demonstrator's policy never leaves the set of top policies.

Formal Solution to the Inner Alignment Problem

That's possible. But it seems like way less of a convergent instrumental goal for agents living in simulated world-models. Both options--our world optimized by us and our world optimized by a random deceptive model--probably contain very little of value as judged by agents in another random deceptive model.

So yeah, I would say some models would think like this, but I would expect the total weight on models that do to be much lower.

Formal Solution to the Inner Alignment Problem

This is very nice and short!

And to state what you left implicit:

If the vote for 0 wins by a margin larger than the malign models' total weight, then in the setting with no malign hypotheses (which you assume to be safe), 0 is definitely the output, since the malign models can only shift the outcome by their total weight, so we assume it is safe to output 0. And likewise with outputting 1.
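A toy version of that argument, with the weighted vote made explicit (illustrative only; the actual construction is more involved):

```python
def certain_output(benign_mass_on_1, malign_total_mass):
    """The vote for 1 has weight benign_mass_on_1 plus up to
    malign_total_mass. The output is certain only when the malign mass
    cannot move the vote across the 1/2 threshold."""
    if benign_mass_on_1 + malign_total_mass < 0.5:
        return 0   # even if every malign model votes 1, 0 still wins
    if benign_mass_on_1 > 0.5:
        return 1   # even if every malign model votes 0, 1 still wins
    return None    # malign models could swing it: unsafe to commit

print(certain_output(0.1, 0.05))   # benign margin beats malign mass -> 0
```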

I'm pretty sure removing those is mostly just a technical complication

One general worry I have about assuming that the deterministic case extends easily to the stochastic case is that a sequence of probabilities that tends to 0 can still have an infinite sum, which is not true when probabilities must be 0 or 1, and this sometimes causes trouble. I'm not sure this would raise any issues here--just registering a slightly differing intuition.
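A one-line illustration of the worry: the probabilities 1/n tend to 0 yet their partial sums grow without bound, whereas a 0/1-valued sequence tending to 0 is eventually 0 and so has a finite sum:

```python
import math

# p_n = 1/n tends to 0, yet the partial sums diverge (harmonic series).
partial = sum(1.0 / n for n in range(1, 10**6 + 1))
print(partial)   # about ln(10^6) + 0.577, i.e. roughly 14.39
```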

Formal Solution to the Inner Alignment Problem

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

Okay, I think we've switched from talking about Q-learning to talking about policy gradient. (Or we were talking about the latter the whole time, and I didn't notice.) The question that I think is relevant is: how are possible world-models being hypothesized and analyzed? That's something I expect to be done with messy heuristics that sometimes have discontinuities in their sequence of outputs. Which means I think that no reasonable DQN will be generally intelligent (except maybe an enormously wide, attention-based one, such that finding models is more about selective attention at any given step than it is about gradient descent over the whole history).

A policy gradient network, on the other hand, could maybe (after having its parameters updated through gradient descent) become a network that, in a single forward pass, considers diverse world-models (generated with a messy non-local heuristic), analyzes their plausibility, and then acts. At the end of the day, what we have is an agent modeling the world, and we can expect it to consider any model that a human could come up with. (This paragraph also applies to the DQN with a gradient-descent-trained method for selectively attending to different parts of a wide network, since that could amount to effectively considering different models.)

I don't think this is a problem. There will be plenty of them, but when they're wrong they'll get removed from the posterior.