Aspiration-based Q-Learning

Jobst Heitzig

I have a critique which is sort of a question because I'm not sure about it. I like the idea of optimizing for 'expected likelihood of harvesting X apples' instead of 'harvest at least X apples'. But I worry that this is 'kicking the can down the road' a bit. Doesn't this mean that a very powerful optimizing agent would try unreasonably hard to maximize the expected likelihood of harvesting X apples? Would such an intense single-minded attempt by a powerful agent be expected to have negative side-effects?

Clément Dumas (2y)

Hi Nathan, I'm not sure if I understand your critique correctly. The algorithm we describe does not try to "maximize the expected likelihood of harvesting X apples". It tries to find a policy that, given its current knowledge of the world, will achieve an expected return of X apples. That is, it does not care about the probability of getting exactly X apples, but rather the average number of apples it will get over many trials. Does that make sense?

Nathan Helm-Burger (2y)

Thanks, yes, that helpfully makes it more clear. To check if my understanding has improved, is this a better summary?

The agent is tasked with designing a second agent (aka policy), such that the second agent will achieve an expected return of X across many trials.

The second agent is a non-learning (aka frozen) agent. It could potentially be expressed as a frozen neural net, a decision tree, or code. Because it is static, it could be analyzed by humans or other programs before being used.

If so, then this sounds good to me. And is rather reminiscent of this other framing of such ideas: https://www.lesswrong.com/posts/sCJDstZrpCB8dQveA/using-uninterpretable-llms-to-generate-interpretable-ai-code

Jobst Heitzig (2y)

Hi Nathan,

I'm not sure. I guess it depends on what your definition of "agent" is. In my personal definition, following Yann LeCun's recent whitepaper, the "agent" is a system with a number of different modules, one of them being a world model (in our case, an MDP that it can use to simulate consequences of possible policies), one of them being a policy (in our case, an ANN that takes states as inputs and gives action logits as outputs), and one of them being a learning algorithm (in our case, a variant of Q-learning that uses the world model to learn a policy that achieves a certain goal). The goal that the learning algorithm aims to find a suitable policy for is an aspiration-based goal: make the expected return equal some given value (or fall into some given interval). As a consequence, when this agent behaves like this very often in various environments with various goals, we can expect it to meet its goals on average (under mild conditions on the sequence of environments and goals, such as sufficient probabilistic independence of stochastic parts of the environment and bounded returns, so that the law of large numbers applies).

Now regarding your suggestion that the learned policy (what you call the frozen net I think) could be checked by humans before being used: that is a good idea for environments and policies that are not too complex for humans to understand. In more complex cases, one might want to involve another AI that tries to prove the proposed policy is unsafe for reasons not taken into account in selecting it in the first place, and one can think of many variations in the spirit of "debate" or "constitutional AI" etc.

Nathan Helm-Burger (2y)

Thanks, that makes sense!

Read "aleph", the first letter of the Hebrew alphabet.

In $s_{1}$ it will get $20 λ$ in expectation and will choose $a_{0}$ in $s_{i}$ with a probability of $λ$. Therefore the expected $G$ will be $20 λ^{2}$.

Unless we are willing to numerically determine the relationship between $λ$ and $E_{λ} G$ and find $λ_{ℵ}$ such that $E_{λ_{ℵ}} G = ℵ$.
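
As a rough illustration of what "numerically determine" could look like, here is a minimal bisection sketch. It assumes that $E_{λ} G$ is continuous and non-decreasing in $λ$ on $[0, 1]$, which the text does not guarantee in general. The names `find_lambda` and `expected_return` are hypothetical, and the toy relationship $E_{λ} G = 20 λ^{2}$ is borrowed from the previous footnote purely as a stand-in; in practice the expectation would have to be estimated, e.g. from rollouts.

```python
from typing import Callable


def find_lambda(expected_return: Callable[[float], float], aleph: float,
                lo: float = 0.0, hi: float = 1.0, tol: float = 1e-9) -> float:
    """Bisection search for lambda such that expected_return(lambda) == aleph.

    Assumption (not from the post): expected_return is continuous and
    non-decreasing on [lo, hi].
    """
    if not expected_return(lo) <= aleph <= expected_return(hi):
        raise ValueError("aleph is outside the achievable range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_return(mid) < aleph:
            lo = mid  # the target lies in the upper half
        else:
            hi = mid  # the target lies in the lower half
    return 0.5 * (lo + hi)


def toy_expected_return(lam: float) -> float:
    """Toy stand-in for E_lambda G, using 20 * lambda^2 from the footnote above."""
    return 20.0 * lam ** 2


lam_aleph = find_lambda(toy_expected_return, aleph=5.0)
print(lam_aleph)  # close to 0.5, since 20 * 0.5**2 == 5
```

Each bisection step needs one evaluation of the expected return, so with noisy rollout estimates the tolerance would have to be much looser than in this closed-form toy case.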

E.g., draw actions in a more human-like way, with something similar to quantilizers.

$ℵ_{t + 1} \leftarrow (ℵ_{t} - r_{t + 1}) / γ$
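
The update above can be sketched in a few lines of Python. The function names and the reward sequence are made up for illustration. The example starts with an aspiration equal to the discounted return of a fixed reward sequence and shows that propagating it along that sequence leaves a remaining aspiration of zero.

```python
from typing import List, Sequence


def propagate_aspiration(aspiration: float, reward: float, gamma: float) -> float:
    """One step of aleph_{t+1} <- (aleph_t - r_{t+1}) / gamma."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    return (aspiration - reward) / gamma


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """G = r_1 + gamma * r_2 + gamma^2 * r_3 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards r_1, r_2, r_3
gamma = 0.5

aleph = discounted_return(rewards, gamma)  # 1 + 0.5 * (0 + 0.5 * 2) = 1.5
trace: List[float] = [aleph]
for r in rewards:
    aleph = propagate_aspiration(aleph, r, gamma)
    trace.append(aleph)

print(trace)  # [1.5, 1.0, 2.0, 0.0]
```

If the realized rewards had been higher or lower than the initial aspiration implied, the final entry would be negative or positive instead of zero.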

| Agent | Maximizer | $ℵ$-satisficer | $ℵ$-aspiring |
|---|---|---|---|
| Goal | Harvest as many apples as possible | Harvest at least $ℵ$ apples | In expectation, harvest $ℵ$ apples |

| | Q-learning | Hard update | Aspiration Rescaling |
|---|---|---|---|
| Objective | Learn $Q^{*}$ | Learn $Q$ | Learn $Q, \underline{Q}, \overline{Q}$ |
| Policy | $\arg\max_{a} Q (s_{t}, a)$ | Select $a \sim π$ s.t. $E\, Q (s_{t}, a) = ℵ_{t}$ | Select $a \sim π$ s.t. $E\, Q (s_{t}, a) = ℵ_{t}$ |
| Success condition | $\arg\max_{a} Q (s_{t}, a) = \arg\max_{a} Q^{*} (s_{t}, a)$ | Exact $Q$ or can recover from overshooting | Exact $Q$ |
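
The policy constraint in the table, $E\, Q (s_{t}, a) = ℵ_{t}$ for $a \sim π$, does not pin down a unique distribution. One simple way to satisfy it, shown here only as an illustrative sketch and not necessarily the scheme used in the post, is to mix the two actions whose Q-values most closely bracket the aspiration. The function name `aspiring_policy` and the Q-values are hypothetical, and clipping an unreachable aspiration to the feasible range is an added assumption.

```python
import random
from typing import List, Sequence


def aspiring_policy(q_values: Sequence[float], aleph: float) -> List[float]:
    """Return action probabilities pi with sum_a pi[a] * Q[a] == aleph.

    Mixes the closest action below and the closest action above the aspiration.
    Assumption: an aspiration outside [min Q, max Q] is clipped to that range.
    """
    target = min(max(aleph, min(q_values)), max(q_values))
    below = [i for i, q in enumerate(q_values) if q <= target]
    above = [i for i, q in enumerate(q_values) if q >= target]
    a_lo = max(below, key=lambda i: q_values[i])  # largest Q not above target
    a_hi = min(above, key=lambda i: q_values[i])  # smallest Q not below target

    probs = [0.0] * len(q_values)
    if q_values[a_hi] == q_values[a_lo]:
        probs[a_lo] = 1.0  # some action hits the target exactly
        return probs
    p_hi = (target - q_values[a_lo]) / (q_values[a_hi] - q_values[a_lo])
    probs[a_hi] = p_hi
    probs[a_lo] = 1.0 - p_hi
    return probs


q = [0.0, 4.0, 10.0]  # hypothetical Q(s_t, a) for three actions
pi = aspiring_policy(q, aleph=7.0)
expected_q = sum(p * v for p, v in zip(pi, q))
action = random.choices(range(len(q)), weights=pi)[0]

print(pi, expected_q)  # [0.0, 0.5, 0.5] 7.0
```

Many other distributions meet the same constraint, for example mixing the minimizing and maximizing actions, so the choice among them is a separate design decision.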

Introduction

Satisficing and aspiration

Local Relative Aspiration

Aspiration Propagation

Aspiration Rescaling

Generalization of Aspiration Rescaling

Experiments

LRA-DQN

AR-DQN

LRAR-DQN

Conclusion