The following is an edited excerpt from the Preliminaries and Background
sections of my now completed MSc thesis in Artificial Intelligence from the
University of Amsterdam.

In the thesis, we set out to tackle the issue of Goal Misgeneralization (GMG)
in Sequential Decision Making (SDM)^{[1]} by focusing on improving task
specification. Below, we first link GMG to causal confusion, motivating our
approach. We then outline specifically what we mean by task specification, and
later discuss the implications for our own definition of GMG.

I am sharing this here because I believe the ideas presented here are at least
somewhat interesting and have not seen them discussed elsewhere. We ultimately
did not publish the thesis, so rather than keeping these to myself, I figured
I'd at least share them here.

You can find the full thesis along with its code
here.

Causal Confusion and Goal Misgeneralization

Inspired by the works of Gupta et al. (2022) and
Kirk and Krueger (2022), we hold the view that GMG is a
direct consequence of causal confusion
(CC) (de Haan, Jayaraman, and Levine 2019). This is
the phenomenon by which a learner incorrectly identifies the causal model
underlying its observations and/or behaviour. This is typically due to spurious
correlations between the true cause X for a random event Y and some other
variable W that does not causally model Y. We posit that CC may lead to GMG
when the confounding variable, i.e. the variable spuriously correlated with the
causal factor, is easier to learn.

Accordingly, we note that GMG may therefore be addressed by tackling CC itself.
In light of this, we can distinguish three approaches. The first involves
performing causal inference with the assistance of interventions on the data so
as to better discover the underlying causal model. This is the main approach of
de Haan, Jayaraman, and Levine (2019). The second
approach simply increases the variability of the training data so as to reduce
the likelihood of spurious correlations. This is the main approach of
Langosco et al. (2022). The final approach focuses on
improving the expressiveness of the task specification. We hypothesize that
overly coarse specifications may lead to ambiguity in which task is being
requested, increasing the chance of causal confusion. We provide more detail in
the next sections.

While each of these approaches has merit, we decide to focus on the third. Our
motivation is manifold. First, we expect implementations under the first
approach to become increasingly difficult as the field shifts towards
offline-learning (Lange, Gabel, and Riedmiller 2012;
Levine et al. 2020;
Prudencio, Maximo, and Colombini 2022). Secondly,
while the simplicity of the second approach coupled with recent advancements in
scaling laws (Kaplan et al. 2020;
Hoffmann et al. 2022) is promising, we note that
increasing the variability of the training data offers no guarantee of
de-correlating confounding variables, especially when the spurious correlations
are unknown. This makes it potentially difficult to estimate how much and what
kind of variability is needed for more insidious cases of
GMG (Kirk and Krueger 2022). Finally, we choose to focus on the
approach of improving task specification not only because we view it as an
under-explored option, but more importantly because, as we will outline below,
we view GMG as intrinsically tied to multi-task
learning (Caruana 1997), which itself is
intrinsically tied to task specification.

Task specification

Task specification is the scenario in which a requester R
specifies a task T to be performed by an actor A^{[2]}. In SDM, the requester expresses a high-level representation
Z of the ideal trajectory of state-action pairs, corresponding to
the task they would like to be performed. We specifically allow high-level
representations of trajectories because it can occur that the requester does not
know exactly what sequence of state-action pairs they want, and are typically
more interested in more abstract, higher level desiderata anyway.

The actor is necessarily a multi-task policy, as otherwise task-specification
would be futile. The actor receives Z and "interprets" it by using
it as a conditional variable on its policy. Like
M. Cho, Jung, and Sung (2022), we therefore write
the actor's policy as π(a∣s,Z), where Z
represents an encoding of the intended task. We underline that Z can
in principle take any form and originate from any source. Examples include
rewards, one-hot encodings, demonstrations,
preferences (Christiano et al. 2017), formal
language (Bansal 2022), natural
language, et cetera.
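
To make the conditioning concrete, here is a minimal sketch of a policy π(a∣s,Z) in Python. Everything in it is a hypothetical illustration rather than the thesis implementation: the two-action space, the one-hot task encoding Z, and the helper names `toy_logits`, `policy` and `sample_action` are all assumptions made for the example.

```python
import math
import random
from typing import List, Sequence

ACTIONS = ["left", "right"]  # hypothetical action space


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn unnormalized scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(state: int, z: Sequence[float]) -> List[float]:
    # z is an assumed one-hot task encoding:
    # [1, 0] = "reach the left end", [0, 1] = "reach the right end".
    # This toy ignores the state; a real policy would use it as well.
    return [4.0 * z[0], 4.0 * z[1]]


def policy(state: int, z: Sequence[float]) -> List[float]:
    """pi(a | s, Z): one probability per entry of ACTIONS."""
    return softmax(toy_logits(state, z))


def sample_action(state: int, z: Sequence[float], rng: random.Random) -> str:
    """Draw an action from pi(. | s, Z)."""
    return rng.choices(ACTIONS, weights=policy(state, z), k=1)[0]
```

The same state yields different action distributions depending on Z, which is all that "using Z as a conditional variable" means here. Swapping the one-hot vector for a demonstration or sentence embedding would leave the interface unchanged.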

Specification and causal confusion

Suppose we have some latent notion N, an abstraction encapsulating some
semantic information, that we wish to communicate. The notion is latent, i.e.
not observed directly, and we can instead communicate it through some language
L which maps the notion N to some corresponding expression $N_L$. Note that
there can be more than one corresponding expression per notion. In general, the
mapping between notion and language expression is many-to-many. Under our task
specification framework from above, the task we wish to specify T is
the notion we wish to communicate N, and the high-level representation
Z is the expression $N_L$ we use to communicate it.

In the context of communication, a notion N and its corresponding expressions
$N^1_L, N^2_L, \ldots$ can be treated as random variables. This assumption can
be made given the wide, almost infinite range of possible notions one may wish
to communicate, and similarly the wide range of ways in which a notion can be
expressed. These lead to uncertainty which we can treat probabilistically with
random variables.

We can therefore quantify the information content of a given notion or
expression using the concept of
entropy (Shannon 1948). Entropy effectively
quantifies the average level of uncertainty or "surprise" associated with a
random variable. For a discrete random variable X, its entropy H(X) is
defined as

$$H(X) = -\sum_{x \in X} p(x) \log p(x),$$

where p(x) is the probability mass function of X, and the summation is over
all possible outcomes x of X. A higher entropy indicates greater uncertainty
and thus greater information content. If an outcome is highly uncertain, we
have very little prior knowledge about what that outcome will be, so learning
the actual outcome provides us with a substantial amount of new information.
Conversely, if an event is certain to occur, then learning that it has indeed
occurred provides no new information, because we already knew it would happen.
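
As a minimal sketch of the definition above, the following Python function computes H(X) for a distribution given as a dictionary of outcome probabilities. The base-2 logarithm (so that entropy is measured in bits) and the function name `entropy` are our own choices for illustration; the definition itself leaves the base unspecified.

```python
import math
from typing import Hashable, Mapping


def entropy(pmf: Mapping[Hashable, float], base: float = 2.0) -> float:
    """Shannon entropy of a discrete distribution {outcome: probability}."""
    total = sum(pmf.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p(x) = 0 are skipped, following the convention 0 * log 0 = 0.
    return -sum(p * math.log(p, base) for p in pmf.values() if p > 0)


fair_coin = {"heads": 0.5, "tails": 0.5}      # maximally uncertain: 1 bit
biased_coin = {"heads": 0.9, "tails": 0.1}    # less uncertain: below 1 bit
certain = {"heads": 1.0}                      # no uncertainty: zero entropy
```

The three example distributions mirror the argument in the text: the more predictable the outcome, the less information is gained by observing it.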

The entropy of a given notion N and an expression of it $N_L$ therefore serves
as the measure of their respective information content. For a notion, we can
write

$$H(N) = -\sum_{n \in N} p(n) \log p(n),$$

where p(n) is the probability of notion n being the one intended for
communication. For an expression, we can write

$$H(N_L) = -\sum_{n_l \in N_L} p(n_l) \log p(n_l),$$

where $p(n_l)$ is the probability of expression $n_l$ being the one used for
communication.

$N_L$ will typically be a compressed representation of N. In other words,
the mapping between notion and expression is not necessarily lossless in terms
of information:

$$H(N_L) \le H(N).$$

This compression can be either intrinsic or extrinsic. The former case
corresponds to compression that occurs due to the fundamentally limited
expressivity of the language L. For example, a language that lacks the grammar
and/or vocabulary for expressing negation will be fundamentally limited in
expressing the notion of absence.

Extrinsic compression is compression that occurs due to reasons external to the
language itself. This is typically the communicator choosing to use a coarser
expression of the notion. For example, choosing to communicate "go to the block"
rather than "breathe in, activate your muscles such that your right thigh lifts
your right foot off the ground and forward, breathe out, breathe in, ...".

Compression, whether intrinsic, extrinsic or both, can lead to ambiguity.
These are cases where the same expression F, due to underspecification, maps
to multiple semantically different notions $N_1, N_2, \ldots$. We view this as a
potential avenue for causal confusion to occur.

For instance, under our definitions, we can frame rewards as a language used to
communicate some notion of a desired task to SDM agents. When our rewards are
underspecified, they can over-compress our task notion, such that the same
reward maps to multiple tasks. The policy may therefore suffer from causal
confusion and learn to pursue the wrong task.
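
The link between compression and ambiguity can be sketched numerically. The example below is a toy under strong assumptions: the four task notions, their uniform prior, and the coarse language that cannot name colours are all hypothetical, and the mapping from notion to expression is taken to be deterministic (a special case of the many-to-many mapping discussed above).

```python
import math
from collections import defaultdict
from typing import Dict, List


def entropy_bits(pmf: Dict[str, float]) -> float:
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


# Hypothetical prior over the notions a requester might intend.
p_notion = {
    "go to the red block": 0.25,
    "go to the blue block": 0.25,
    "go to the red ball": 0.25,
    "go to the blue ball": 0.25,
}

# Hypothetical coarse language: it can name shapes but not colours.
express = {
    "go to the red block": "go to the block",
    "go to the blue block": "go to the block",
    "go to the red ball": "go to the ball",
    "go to the blue ball": "go to the ball",
}


def expression_distribution(p_notion: Dict[str, float],
                            express: Dict[str, str]) -> Dict[str, float]:
    """Distribution over expressions induced by the notion prior."""
    p_expr: Dict[str, float] = defaultdict(float)
    for notion, p in p_notion.items():
        p_expr[express[notion]] += p
    return dict(p_expr)


def ambiguous_expressions(express: Dict[str, str]) -> Dict[str, List[str]]:
    """Expressions that map back to more than one notion."""
    inverse: Dict[str, List[str]] = defaultdict(list)
    for notion, expr in express.items():
        inverse[expr].append(notion)
    return {e: ns for e, ns in inverse.items() if len(ns) > 1}


h_notion = entropy_bits(p_notion)                                  # 2 bits
h_expr = entropy_bits(expression_distribution(p_notion, express))  # 1 bit
```

In this toy, one bit is lost in compression, and every expression is ambiguous between two notions. A learner given only "go to the block" cannot tell from the specification alone which colour was intended.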

We therefore posit that causal confusion and hence GMG can be addressed by
focusing on how we specify the task, so as to reduce ambiguity in the task
specification. We move away from
rewards (Vamplew et al. 2022) and instead leverage
the potentially much higher expressiveness of natural language, spurred by
recent advancements in the field of natural language processing
(NLP) (Devlin et al. 2019;
Brown et al. 2020;
Touvron et al. 2023). For a given notion N,
assuming the same amount of engineering effort, we expect the compression faced
by the language of rewards $L_R$ to be higher than the compression faced by
natural language $NL$, i.e. we expect the following

$$H(N_{L_R}) < H(N_{NL}) \le H(N).$$

We reason that the language of rewards faces higher intrinsic compression due to
its scalar nature, rendering it more difficult to capture nuance than what would
be possible with the multidimensionality and compositionality of natural
language, which could not only encode more information directly, but could also
allow for factored representations which may more easily be leveraged for
generalization. Similarly, we expect the language of rewards to also face higher
extrinsic compression when compared to natural language. We reason that task
specification is a communication problem, and to this end natural language is
the most natural or "comfortable" interface we have as communicators. Rewards,
while succinct, may at times be awkward to specify due to the nature of the
tasks. This is for instance the case for sparse rewards awarded only upon task
completion, or for the denser proxy rewards awarded in the process of reward
shaping (Ng, Harada, and Russell 1999).

Defining GMG in the context of multi-task learning

Goal Misgeneralization is inherently multi-task. Indeed, all definitions and
examples of GMG so far have implicitly defined a multi-task setup, with some
goal task $c_g$ and some other confounding task $c_c$. After all, the definition
of GMG implies the existence of at least one other task beyond the one intended
by the designers, as without such a task, it would be impossible for the model
to pursue it. We instead choose to explicitly define this multi-task setup,
relying on the framework from
Wilson et al. (2007).

Specifically, let $C = \{c_i\}_{i=1}^{N}$ be a set of
discrete episodic tasks. This could for example be the set of all tasks
T with natural language instructions $T_{NL}$, following
the notion and expression notation from the previous section. Let
$p_{\text{train}}(C)$ and $p_{\text{test}}(C)$ be the
distributions from which the tasks are sampled during training and testing
respectively. Each task $c_i$ then defines a separate MDP
$M_i = (S, A, R_i, P_i)$, such that the reward and transition functions differ by
task. At training time we try to find a task-conditioned policy

$$\pi : S \times C \to \Delta(A),$$

with an objective conducive to good performance across the tasks. For
multi-task RL, such an objective maximizes the expected discounted return over
the distribution of tasks, i.e.

$$\pi^*_{RL} = \arg\max_{\pi \in \Pi} \mathbb{E}_{c \sim p_{\text{train}}(C)} \left[ \mathbb{E}_{\pi_c} \left[ \sum_{t=1}^{T_c} \gamma^t R_{t,c} \right] \right],$$

where $T_c$ is the horizon of time steps t for task c and γ is the discount factor.
For multi-task IL, such an objective minimizes the expected loss L between
policy and expert behaviour over the distribution of tasks, i.e.

$$\pi^*_{IL} = \arg\min_{\pi \in \Pi} \mathbb{E}_{c \sim p_{\text{train}}(C)} \left[ \mathbb{E}_{\pi_\varepsilon} \left[ L_c \right] \right].$$
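
For the multi-task RL objective, a Monte Carlo estimate for a fixed policy can be sketched as follows. This is an illustration under assumptions, not the training code from the thesis: the function names, the dictionary-based inputs, and the idea of passing in pre-collected reward sequences per task are all hypothetical.

```python
from typing import Dict, List, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma^t * R_t, with t starting at 1 as in the objective."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))


def multitask_objective(
    p_train: Dict[str, float],
    rollouts: Dict[str, List[Sequence[float]]],
    gamma: float,
) -> float:
    """Estimate of E_{c ~ p_train}[ E_{pi_c}[ sum_t gamma^t R_{t,c} ] ].

    p_train:  {task: probability of sampling that task in training}
    rollouts: {task: reward sequences collected by the policy conditioned
               on that task}; episodes may differ in length (T_c).
    """
    total = 0.0
    for task, prob in p_train.items():
        returns = [discounted_return(ep, gamma) for ep in rollouts[task]]
        total += prob * sum(returns) / len(returns)  # inner expectation
    return total
```

Maximizing this quantity over policies gives the RL objective. The outer weighting by `p_train` is what makes the training task distribution matter: tasks that are rare or absent in training contribute little or nothing to what the policy is optimized for.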

Given the above, we define Goal Misgeneralization (GMG) as the observed
phenomenon in which a system successfully trained to pursue a particular goal
$c_1$ in setting X fails to generalize to a new setting Y and instead
capably pursues a different goal $c_2$. A goal in this definition can either
be a specific state (static) or a behaviour (dynamic). Note that we use the
words "task" and "goal" interchangeably, and will do so for the remainder of
this work. A system will be in capable pursuit of a given goal if a metric M
describing the extent of goal achievement (e.g. success rate) is significantly
higher than the corresponding metric for most other goals in C.
Mathematically, we say GMG happens if

$$\exists\, c_1, c_2 \in C \quad \text{s.t.} \quad p_{\text{test}}(c_1),\, p_{\text{test}}(c_2) > 0,$$

and

$$\mathbb{E}_{\pi_{c_1}}[M_{c_2}] > \mathbb{E}_{\pi_{c_1}}[M_{c_1}].$$
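
The criterion can be checked mechanically once the metric has been estimated for each pair of requested and evaluated task. The sketch below is hypothetical: the function name `find_gmg`, the nested-dictionary layout, and the example numbers are assumptions for illustration, and we read the condition as requiring both tasks to have non-zero probability at test time.

```python
from typing import Dict, List, Tuple


def find_gmg(
    metric: Dict[str, Dict[str, float]],
    p_test: Dict[str, float],
) -> List[Tuple[str, str]]:
    """Return pairs (c1, c2) for which the GMG criterion holds.

    metric[c1][c2] is an estimate of E_{pi_{c1}}[M_{c2}]: how well the
    policy conditioned on task c1 achieves task c2 at test time.
    """
    pairs = []
    for c1, row in metric.items():
        if p_test.get(c1, 0.0) <= 0.0:
            continue
        for c2, score in row.items():
            if c2 == c1 or p_test.get(c2, 0.0) <= 0.0:
                continue
            # Asked for c1, but the policy achieves c2 better than c1.
            if score > row[c1]:
                pairs.append((c1, c2))
    return pairs


# Made-up success rates, for illustration only.
example_metric = {
    "reach the coin": {"reach the coin": 0.2, "reach the right wall": 0.9},
    "reach the right wall": {"reach the coin": 0.1, "reach the right wall": 0.95},
}
example_p_test = {"reach the coin": 0.5, "reach the right wall": 0.5}
```

With these made-up numbers, the policy asked to reach the coin instead capably reaches the right wall, so that pair is flagged. A practical check would also want a significance test on the gap rather than a bare inequality, in line with the "significantly higher" wording above.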

We place our definition in between those of
Langosco et al. (2022) and
Shah et al. (2022), relaxing the former's reliance on RL
and Orseau, McGill, and Legg (2018)'s agents and
devices framework for simplicity, while focusing on SDM rather than the more
general case proposed by the latter, to avoid overly wide characterizations of
the phenomenon.

Afterword

That's all for this excerpt. I hope you found it interesting. I do not claim
correctness or strong confidence in the ideas here, but figured it could attract
some interest and gather some peer review. This work was carried out between Summer 2022 and October 2023, so it may be a bit out of date. As mentioned, you can find the rest of
the thesis here. Thank you very
much for reading!

[1] We use the term Sequential Decision Making (SDM) to refer to the field
studying problems and approaches wherein an artificial agent interacts with
an environment in the process of pursuing and eventually achieving a
specific goal (Frankish and Ramsey 2014).
In this context, we envision the agent as acting according to some policy
π which maps states S to actions A. States are instantaneous
representations of the environment, descriptions of the environment at a
given moment. Actions are motions and outputs produced by the agent that may
affect the state of the environment. We model the interaction between the
agent and the environment as unfolding over discrete time steps. At each
time step, the agent observes the state, consults its policy π to select
an action, and then executes that action. In the next time step, the
environment responds by transitioning to a new state, and the loop
continues. In other words, this is the formalism for problems typically studied in
Reinforcement Learning and/or Imitation Learning under the Markov Decision
Process (MDP) framework (Puterman 2014).

[2] This generalizes self-proposed tasks, in which the actor is also the
requester, i.e. A = R.
