Thanks, that’s very clear. I’m a convert to the edge-based definition.

I'm trying to understand how to map between the definition of Markov blanket used in this post (a partition of the variables in two such that the variables in one set are independent of the variables in the other given the variables on edges which cross the partition) and the one I'm used to (a Markov blanket of a set of variables is another set of variables such that the first set is independent of everything else given the second). I'd be grateful if anyone can tell me whether I've understood it correctly.

There are three obstacles to my understanding: (i) I'm not sure what 'variables on edges' means, and John also uses the phrase 'given the values of the edges' which confuses me, (ii) the usual definition is with respect to some set of variables, but the one in this post isn't, (iii) when I convert between the definitions, the place I have to draw the line on the graph changes, which makes me suspicious.

Here's me attempting to overcome the obstacles:

(i) I'm assuming 'variables on the edges' means the parents of the edges, not the children or both. I'm assuming 'values of edges' means the values of the parents.
(ii) I think we can reconcile this by saying that if M is a Markov blanket of a set of variables V in the usual sense, then a line which cuts through an outgoing edge of each variable in M is a Markov blanket in the sense of this post. Conversely, if some Markov blanket in the sense of this post partitions our graph into A and B, then the set M of parents of edges crossing the partition forms a Markov blanket of both A\M and B\M in the usual sense.
(iii) I think I have to suck it up and accept that the lines look different. In this picture, the nodes in the blue region (except A) form a Markov blanket for A in the usual sense. The red line is a Markov blanket in the sense of this post.

Does this seem right?
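To make the conversion in (ii) concrete, here is a minimal sketch on a toy DAG. The graph, the node names and the helper functions are all hypothetical, and the independence check is plain d-separation via the moralised ancestral graph, which is my choice of test and not something taken from the post.

```python
from itertools import combinations

def crossing_parents(edges, side_a):
    """Parents (tails) of directed edges that cross the partition."""
    return {u for (u, v) in edges if (u in side_a) != (v in side_a)}

def ancestors(edges, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    result = set(nodes)
    frontier = list(nodes)
    while frontier:
        child = frontier.pop()
        for (u, v) in edges:
            if v == child and u not in result:
                result.add(u)
                frontier.append(u)
    return result

def d_separated(edges, xs, ys, zs):
    """True if xs and ys are d-separated given zs (moral-graph criterion)."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(edges, xs | ys | zs)
    sub = [(u, v) for (u, v) in edges if u in keep and v in keep]
    # Moralise: drop edge directions and connect co-parents of each node.
    undirected = {frozenset(e) for e in sub}
    for node in keep:
        parents = [u for (u, v) in sub if v == node]
        for p, q in combinations(parents, 2):
            undirected.add(frozenset((p, q)))
    # Delete the conditioning set, then look for any path from xs to ys.
    seen = xs - zs
    frontier = list(seen)
    while frontier:
        node = frontier.pop()
        if node in ys:
            return False
        for e in undirected:
            if node in e:
                (other,) = e - {node}
                if other not in zs and other not in seen:
                    seen.add(other)
                    frontier.append(other)
    return True

# Hypothetical toy DAG: a -> b -> c, a -> d -> c, c -> e.
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e")]
all_nodes = {"a", "b", "c", "d", "e"}
side_a = {"a", "b", "d"}          # the "line" separates side_a from the rest
side_b = all_nodes - side_a

m = crossing_parents(edges, side_a)             # blanket in the usual sense
separated = d_separated(edges, side_a - m, side_b - m, m)
```

On this graph the edges b -> c and d -> c cross the line, so M = {b, d}, and the check confirms that A\M = {a} is d-separated from B\M = {c, e} given M. One example obviously doesn't prove the general claim; it only shows what the two definitions pick out on the same graph.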

Do my values bind to objects in reality, like dogs, or do they bind to my mental representations of those objects at the current timestep?

You might say: You value the dog's happiness over your mental representation of it, since if I gave you a button which made the dog sad, but made you believe the dog was happy, and another button which made the dog happy, but made you believe the dog was sad, you'd press the second button.

I say in response: You've shown that I value my current timestep estimation of the dog's future happiness over my current timestep estimation of my future estimation of the dog's happiness. 

I think we can say that whenever I make any decision, I'm optimising my mental representation of the world after the decision has been made but before it has come into effect.

Maybe this is the same as saying my values bind to objects in reality, or maybe it's different. I'm not sure.

Right. So if selection acts on policies, each policy should aim to maximise reward in any episode in order to maximise its frequency in the population. But if selection acts on particular aspects of policies, a policy should try to get reward for doing things it values, and not for things it doesn't, in order to reinforce those values. In particular this can mean getting less reward overall.

Does this suggest a class of hare-brained alignment schemes where you train with a combination of inter-policy and intra-policy updates to take advantage of the difference?

For example you could clearly label which episodes are to be used for which and observe whether a policy consistently gets more reward in the former case than the latter. If it does, conclude it's sophisticated enough to reason about its training setup.

Or you could not label which is which, and randomly switch between the two, forcing your agents to split the difference and thus be about half as successful at locking in their values.

I think the terminological confusion is with you: what you're talking about is more like what is called in some RL algorithms a value function.

Does a chess-playing RL agent make whichever move maximises reward? Not unless it has converged to the optimal policy, which in practice it hasn't. The reward signal of +1 for a win, 0 for a draw and -1 for a loss is, in a sense, hard-coded into the agent, but not in the sense that it's the metric the agent uses to select actions. Instead the chess-playing agent uses its value function, which is an estimate of the reward the agent will get in the future, but is not the same thing.

The iodinosaurs example perhaps obscures the point since the iodinos seem inner aligned: they probably do terminally value (the feeling of) getting iodine and they are unlikely to instead optimise a proxy. In this case the value function which is used to select actions is very similar to the reward function, but in general it needn't be, for example in the case where the agent has previously been rewarded for getting raspberries and now has the choice between a raspberry and a blueberry. Even if it knows the blueberry will get it higher reward, it might not care: it values raspberries, and it selects its actions based on what it values.
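A toy sketch of the raspberry/blueberry case, to show the gap between "what the reward function currently pays" and "what the agent uses to pick actions". Everything here is an assumption for illustration: a two-action tabular agent, a simple running-average update, and made-up reward numbers. It doesn't model the agent knowing about the new reward; it only shows that a greedy choice over learned values can differ from the reward-maximising choice.

```python
def update(values, action, reward, lr=0.5):
    """Move the value estimate for `action` towards the observed reward."""
    values[action] += lr * (reward - values[action])

# Learned value estimates: this is what the agent selects actions with.
values = {"raspberry": 0.0, "blueberry": 0.0}

# Past training: only raspberries were ever tried and rewarded.
for _ in range(10):
    update(values, "raspberry", 1.0)

# Hypothetical current reward function: blueberries now pay more.
current_reward = {"raspberry": 1.0, "blueberry": 2.0}

chosen = max(values, key=values.get)                       # greedy on values
reward_maximising = max(current_reward, key=current_reward.get)
```

Here `chosen` is the raspberry while `reward_maximising` is the blueberry: the agent acts on its value estimates, which were shaped by past reward, not on the reward signal itself.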

This comment seems to predict that an agent that likes getting raspberries and judges that they will be highly rewarded for getting blueberries will deliberately avoid blueberries to prevent value drift.

Risk from Learned Optimization seems to predict that an agent that likes getting raspberries and judges that they will be highly rewarded for getting blueberries will deliberately get blueberries to prevent value drift.

What's going on here? Are these predictions in opposition to each other, or do they apply to different situations?

It seems to me that in the first case we're imagining (the agent predicting) that getting blueberries will reinforce thoughts like 'I should get blueberries', whereas in the second case we're imagining it will reinforce thoughts like 'I should get blueberries in service of my ultimate goal of getting raspberries'.  When should we expect one over the other?

Now if we explain the situation to the inside human, they may not be quite so callous. Instead they might reason “If I don’t take the small box, there is a good chance that a ‘real’ human on the outside will then get $10,000. That looks like a good deal, so I’m happy to walk away with nothing.”

Put differently, when we see an empty box we might not conclude that predictor didn’t fill the box. Instead, we might consider the possibility that we are living inside the predictor’s imagination, being presented with a hypothetical that need not have any relationship to what’s going on out there in the real world.


When trying to make the altruistically best decision given that I'm being simulated, shouldn't I also consider the possibility that the predictor is simulating me in order to decide how to fill the boxes in some kind of transparent Anti-Newcomb problem, where the $10,000 is there if and only if it predicts you would take the $1,000 in transparent Newcomb? In that case I'd do the best thing by the real version of me by two-boxing.

This sounds a bit silly but I guess I'm making the point that 'choose your action altruistically factoring in the possibility that you're in a simulation' requires not just a prior on whether you're in a simulation, but also a prior on the causal link between the simulation and the real world. 
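Here is a rough sketch of how the two priors interact. The model is my own simplification: `p_sim` is the prior that I'm the simulation, `p_newcomb_link` is the prior that (if simulated) my choice feeds a standard transparent Newcomb rather than the Anti-Newcomb variant, money taken inside a simulation is worth nothing, and the probabilities plugged in are made up.

```python
SMALL = 1_000    # contents of the small box
BIG = 10_000     # what the real human might find in the big box

def expected_value(action, p_sim, p_newcomb_link):
    """Expected dollars reaching a real person, for 'take' or 'leave'.

    Assumed model: if I'm real, I see an empty big box and only the small
    box matters. If I'm simulated, my choice fills the real big box when I
    leave the small box (Newcomb link) or when I take it (Anti-Newcomb link).
    """
    if action == "leave":
        return p_sim * p_newcomb_link * BIG
    if action == "take":
        return (1 - p_sim) * SMALL + p_sim * (1 - p_newcomb_link) * BIG
    raise ValueError(f"unknown action: {action}")

def best_action(p_sim, p_newcomb_link):
    """Pick whichever action has the higher expected value under the model."""
    return max(("take", "leave"),
               key=lambda a: expected_value(a, p_sim, p_newcomb_link))

# Same prior on being simulated, different priors on the causal link.
sure_of_link = best_action(p_sim=0.5, p_newcomb_link=1.0)
unsure_of_link = best_action(p_sim=0.5, p_newcomb_link=0.5)
```

With the link taken for granted, leaving the small box wins; with an even split between Newcomb and Anti-Newcomb, taking it wins, even though the prior on being simulated hasn't changed.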

If I'm being simulated in a situation which purportedly involves a simulation of me in that exact situation, should I assume that the purpose of my being simulated is to play the role of the simulation in this situation? Is that always anthropically more likely than that I'm being simulated for a different reason?

I track my confidence in a given step of a hypothesised chain of mathematical reasoning via a heuristic along the lines of “number of failed attempts at coming up with a counterexample”.

Grey’s 2014 video Humans Need Not Apply, about humans being skilled-out of the economy, was the first introduction for me and probably lots of other people to the idea that AI might cause problems. I’m sure he’d be up for making a video about alignment.

Here's my version of the definition used by Schelling in The Strategy of Conflict: A threat is when I commit myself to an action, conditional on an action of yours, such that if I end up having to take that action I would have reason to regret having committed myself to it.

So if I credibly commit myself to the assertion, 'If you don't give me your phone, I'll throw you off this ship,' then that's a threat. I'm hoping that the situation will end with you giving me your phone. If it ends with me throwing you overboard, the penalties I'll incur will be sufficient to make me regret having made the commitment.

But when these rational pirates say, 'If we don't like your proposal, we'll throw you overboard,' then that's not a threat; they're just elucidating their preferences. Schelling uses 'warning' for this sort of statement.
