DanielFilan

Wiki Contributions

Comments

The core claim of this post is that if you train a network in some environment, the agent will not generalize optimally with respect to the reward function you trained it on, but will instead be optimal with respect to some other reward function in a way that is compatible with training-reward-optimality, and that this means that it is likely to avoid shutdown in new environments. The idea that this happens because reward functions are "internally represented" isn't necessary for those results. You're right that the post uses the phrase "internal representation" once at the start, and some very weak form of "representation" is presumably necessary for the policy to be optimal for a reward function (at least in the sense that you can derive a bunch of facts about a reward function from the optimal policy for that reward function), but that doesn't mean that they're central to the post.

The spark of strategicness, if such a thing is possible, recruits the surrounding mental elements. Those surrounding mental elements, by hypothesis, make goals achievable. That means the wildfire can recruit these surrounding elements toward the wildfire's ultimate ends... Also by hypothesis, the surrounding mental elements don't themselves push strongly for goals. Seemingly, that implies that they do not resist the wildfire, since resisting would constitute a strong push.

I think this is probably a fallacy of composition (maybe in the reverse direction than how people usually use that term)? Like, the hypothesis is that the mind as a whole makes goals achievable and doesn't push towards goals, but I don't think this implies that any given subset of the mind does that.

Like, the only reason we're calling it a "Fourier basis" is that we're looking at a few different speeds of rotation, in order to scramble the second-place answers that almost get you a cos of 1 at the end, while preserving the actual answer.

I agree a rotation matrix story would fit better, but I do think it's a fair analogy: the numbers stored are just coses and sines, aka the x and y coordinates of the hour hand.

My submission: when we teach modular arithmetic to people, we do it using the metaphor of clock arithmetic. Well, if you ignore the multiple frequencies and argmax weirdness, clock arithmetic is exactly what this network is doing! Find the coordinates of rotating the hour hand (on a 113-hour clock) x hours, then y hours, use trig identities to work out what it would be if you rotated x+y hours, then count how many steps back you have to rotate to get to 0 to tell where you ended up. In fairness, the final step is a little bit different than the usual imagined rule of "look at the hour mark where the hand ends up", but not so different that clock arithmetic counts as a bad prediction IMO.

An attempt at rephrasing a shard theory critique of utility function reasoning, while restricting myself to things I basically agree with:

Yes, there are representation theorems that say coherent behaviour is optimizing some utility function. And yes, for the sake of discussion let's say this extends to reward functions in the setting of sequential decision-making (even tho I don't remember seeing a theorem for that). But: just because there's a mapping, doesn't mean that we can pull back a uniform measure on utility/reward functions to get a reasonable measure on agents - those theorems don't tell us that we should expect a uniform distribution on utility/reward functions, or even a nice distribution! They would if agents were born with utility functions in their heads represented as tables or something, where you could swap entries in different rows, but that's not what the theorems say!

Not having read other responses, my attempt to answer in my own words: what goes wrong is that there are tons of possible cognitive influences that could be reinforced by rewards for making people smile. E.g. "make things of XYZ type think things are going OK", "try to promote physical configurations like such-and-such", "trying to stimulate the reinforcer I observe in my environment". Most of these decision-influences, when extrapolated to coherent behaviour where those decision-influences drive the course of the behaviour, lead to resource-gathering and not respecting what the informed preferences of humans would be. Then this causes doom because you can better achieve most goals/preferences you could have by having more power and disempowering the humans.

I think you're making a mistake: policies can be reward-optimal even if there's not an obvious box labelled "reward" that they're optimal with respect to the outputs of. Similarly, the formalism of "reward" can be useful even if this box doesn't exist, or even if the policy isn't behaving the way you would expect if you identified that box with the reward function. To be fair, the post sort of makes this mistake by talking about "internal representations", but I think everything goes thru if you strike out that talk.

The main thing I want to talk about

I can talk about utility functions instead (which would be equivalent to value functions in this case)

I disagree that these are equivalent, and expect the policy and value function to come apart in practice. Indeed, that was observed in the original goal misgeneralization paper (3.3, actor-critic inconsistency).

I think you're the one who's imposing a type error here. For "value functions" to be useful in modelling a policy, it doesn't have to be the case that the policy is acting optimally with respect to a suggestively-labeled critic - it just has to be the case that the agent is acting consistently with some value function. Analogously, momentum is conserved in classical mechanics, even if objects have labels on them that inaccurately say "my momentum is 23 kg m/s".

Anyways, we can talk about utility functions, but then we're going to lose claim to predictiveness, no? Why should we assume that the network will internally represent a scalar function over observations, consistent with a historical training signal's scalar values (and let's not get into nonstationary reward), such that the network will maximize discounted sum return of this internally represented function? That seems highly improbable to me, and I don't think reality will be "basically that" either.

The utility function formalism doesn't require agents to "internally represent a scalar function over observations". You'll notice that this isn't one of the conclusions of the VNM theorem.

Another thing I don't get

My point is rather that these results are not predictive because the assumption won't hold. The assumptions are already known to not be good approximations of trained policies, in at least some prototypical RL situations.

What part of the post you link rules this out? As far as I can tell, the thing you're saying is that a few factors influence the decisions of the maze-solving agent, which isn't incompatible with the agent acting optimally with respect to some reward function such that it produces training-reward-optimal behaviour on the training set.

Unrelatedly: it's weird to me that the wordings of "I'll reply later" and "Not planning to respond" don't mirror each other.

Load More