Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary. This teaser post sketches our current ideas for dealing with more complex environments. It will ultimately be replaced by one or more longer posts describing these in more detail. Reach out if you would like to collaborate on these issues.

Multi-dimensional aspirations

For real-world tasks that are specified in terms of more than a single evaluation metric, e.g., how many apples to buy and at most how much money to spend, we can generalize Algorithm 2 from aspiration intervals to convex aspiration sets as follows:

  • Assume there are $d$ many evaluation metrics $f_1, \dots, f_d$, combined into a vector-valued evaluation metric $f = (f_1, \dots, f_d)$.
  • Preparation: Pick $d'$ many linear combinations $g_1, \dots, g_{d'}$ in the space spanned by these metrics so that their convex hull is full-dimensional and contains the origin, and consider the $d'$ many policies $\pi_1, \dots, \pi_{d'}$, each of which maximizes the expected value of the corresponding function $g_i$. Let $V^i(s)$ and $Q^i(s,a)$ be the expected values of $f$ when using $\pi_i$ in state $s$ or after using action $a$ in state $s$, respectively (see Fig. 1). Let the admissibility simplices $\mathcal{V}(s)$ and $\mathcal{Q}(s,a)$ be the simplices spanned by the vertices $V^i(s)$ and $Q^i(s,a)$, respectively (red and violet triangles in Fig. 1). They replace the feasibility intervals used in Algorithm 2.
  • Policy: Given a convex state-aspiration set $\mathcal{E}(s)$ (central green polyhedron in Fig. 1), compute its midpoint (centre of mass) $m$ and consider the $d'$ segments $\ell_i$ from $m$ to the corners $V^i(s)$ of $\mathcal{V}(s)$ (dashed black lines in Fig. 1). For each of these segments $\ell_i$, let $A_i$ be the (nonempty!) set of actions $a$ for which $\ell_i$ intersects $\mathcal{Q}(s,a)$. For each $a \in A_i$, compute the action-aspiration $\mathcal{E}(s,a)$ by shifting a copy $\mathcal{E}'$ of $\mathcal{E}(s)$ along $\ell_i$ towards $V^i(s)$ until the intersection of $\mathcal{E}'$ and $\ell_i$ is contained in the intersection of $\mathcal{Q}(s,a)$ and $\ell_i$ (half-transparent green polyhedra in Fig. 1), and then intersecting $\mathcal{E}'$ with $\mathcal{Q}(s,a)$ to give $\mathcal{E}(s,a)$ (yellow polyhedra in Fig. 1). Then pick one candidate action $a_i$ from each $A_i$ and randomize between these $d'$ actions in proportions so that the corresponding convex combination of the sets $\mathcal{E}(s,a_i)$ is included in $\mathcal{E}(s)$. Note that this is always possible because $m$ is in the convex hull of the sets $\mathcal{E}(s,a_i)$ and the shapes of the sets $\mathcal{E}(s,a_i)$ "fit" into $\mathcal{E}(s)$ by construction.
  • Aspiration propagation: After observing the successor state $s'$, the action-aspiration $\mathcal{E}(s,a)$ is rescaled linearly from $\mathcal{Q}(s,a)$ to $\mathcal{V}(s')$ to give the next state-aspiration $\mathcal{E}(s')$, see Fig. 2.
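To illustrate the preparation step, here is a minimal sketch (assuming numpy, and assuming the simple choice of $d' = d+1$ combination directions; the construction and function names are mine, not from the post) of picking direction vectors whose convex hull is full-dimensional and contains the origin, and verifying that property via barycentric weights:

```python
import numpy as np

def simplex_directions(d):
    """Return d+1 direction vectors in R^d: the d standard basis vectors
    plus a negated, rescaled all-ones vector. Their convex hull is
    full-dimensional and contains the origin in its interior."""
    return np.vstack([np.eye(d), -np.ones((1, d)) / np.sqrt(d)])

def barycentric_weights(vertices, point):
    """Solve for weights w with sum(w) = 1 and w @ vertices = point."""
    A = np.vstack([vertices.T, np.ones(len(vertices))])
    b = np.append(point, 1.0)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

dirs = simplex_directions(3)
w = barycentric_weights(dirs, np.zeros(3))
print(np.round(w, 6))  # all weights strictly positive: origin is inside the hull
```

Strictly positive weights summing to one certify that the origin lies in the interior of the hull, which is the condition the preparation step asks for.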

(We also consider other variants of this general idea.)
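The randomization step of the policy can be sketched numerically: the mixing proportions are exactly the barycentric coordinates of the state-aspiration midpoint with respect to representative points of the candidate action-aspiration sets. A toy 2-D example (all numbers made up for illustration; numpy assumed):

```python
import numpy as np

# Hypothetical 2-D example (d = 2, three candidate actions): midpoints of
# three action-aspiration sets, and the midpoint m of the state-aspiration
# set that their convex combination should reproduce.
action_midpoints = np.array([[4.0, 1.0], [1.0, 5.0], [7.0, 6.0]])
m = np.array([4.0, 4.0])

# Solve for probabilities p with sum(p) = 1 and p @ action_midpoints = m
# (barycentric coordinates of m w.r.t. the triangle of action midpoints).
A = np.vstack([action_midpoints.T, np.ones(3)])
b = np.append(m, 1.0)
p = np.linalg.solve(A, b)

assert np.all(p >= 0)  # m lies inside the triangle, so this is a valid lottery
print(np.round(p, 4))  # -> [0.3333 0.3333 0.3333]
```

Nonnegativity of the solution is guaranteed here because the midpoint lies in the convex hull of the candidate points, mirroring the "always possible" remark in the policy step.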

Fig. 1: Admissibility simplices, and construction of action-aspirations by shifting towards corners and intersecting with action admissibility simplices (see text for details).
Fig. 2: An action admissibility simplex $\mathcal{Q}(s,a)$ is the convex combination of the successor states' admissibility simplices $\mathcal{V}(s')$, mixed in proportion to the respective transition probabilities $P(s'|s,a)$. An action-aspiration $\mathcal{E}(s,a)$ can be rescaled to a successor state-aspiration $\mathcal{E}(s')$ by first mapping the corners of the action admissibility sets onto each other (dashed lines) and extending this map linearly.
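The two operations in Fig. 2 can be sketched in code (a toy 2-D example with invented corners and transition probabilities; numpy assumed): mixing successor simplices corner-by-corner, and rescaling a point by re-expressing its barycentric coordinates in the target simplex.

```python
import numpy as np

# Hypothetical 2-D example with three corners per simplex and two successors.
V1 = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])   # corners of V(s'_1)
V2 = np.array([[2.0, 2.0], [6.0, 2.0], [2.0, 6.0]])   # corners of V(s'_2)
P = np.array([0.75, 0.25])                            # transition probabilities

# Corner i of Q(s,a) is the P-weighted mixture of corner i of each V(s').
Q = P[0] * V1 + P[1] * V2

def rescale(point, src_corners, dst_corners):
    """Map corners of the source simplex onto corners of the destination
    simplex and extend linearly: compute barycentric coordinates in the
    source, then rebuild the point from the destination corners."""
    A = np.vstack([src_corners.T, np.ones(len(src_corners))])
    w = np.linalg.solve(A, np.append(point, 1.0))
    return w @ dst_corners

e_action = Q.mean(axis=0)          # e.g. take the centroid of Q(s,a)
e_state = rescale(e_action, Q, V1)
print(np.round(e_state, 4))        # the centroid maps to the centroid of V(s'_1)
```

Because the map is affine, barycentric coordinates are preserved, so centroids map to centroids and a whole aspiration set can be rescaled by rescaling its corners.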

Hierarchical decision making

A common way of planning complex tasks is to decompose them into a hierarchy of two or more levels of subtasks. Similarly to existing approaches from hierarchical reinforcement learning, we imagine that an AI system can make such hierarchical decisions as depicted in the following diagram (shown for only two hierarchical levels, but obviously generalizable to more levels):

Fig. 3: Hierarchical world model in the case of two hierarchical levels of decision making.
Comments

"Pick $d'$ many linearly independent linear combinations $g_1, \dots, g_{d'}$"
Aren't there at most $d$ linearly independent linear combinations of $d$ metrics?

Maybe you meant pairwise linearly independent (by looking at the graph)?

You are of course perfectly right. What I meant was: so that their convex hull is full-dimensional and contains the origin. I fixed it. Thanks for spotting this!