Jobst Heitzig

Senior Researcher / Lead, FutureLab on Game Theory and Networks of Interacting Agents @ Potsdam Institute for Climate Impact Research. 

I'm a mathematician working on collective decision making, game theory, formal ethics, international coalition formation, and a lot of stuff related to climate change. Here's my professional profile.


Aspiration-based, non-maximizing AI agent designs [Aspiration-based designs]

Wiki Contributions


You are of course perfectly right. What I meant was: so that their convex hull is full-dimensional and contains the origin. I fixed it. Thanks for spotting this!

Exactly! Thanks for providing this concise summary in your words. 

In the next post we generalize the target from a single point to an interval to get even more freedom that we can use for increasing safety further. 

In our current ongoing work, we generalize that further to the case of multiple evaluation metrics, in order to get closer to plausible real-world goals, see our teaser post

Alex Turner's post you referenced first convinces me that his arguments about "orbit-level power-seeking" apply to maximizers and quantilizers/satisficers. Let me reiterate that we are not suggesting quantilizers/satisficers are a good idea, but that I firmly believe explicit safety criteria rather than plain randomization should be used to select plans.

He also claims in that post that the "orbit-level power-seeking" issue affects all schemes that are based on expected utility: "There is no clever EU-based scheme which doesn't have orbit-level power-seeking incentives." I don't see a formal proof of that claim though, maybe I missed it. The rationale he gives below that claim seems to boil down to a counting argument again, which suggests to me some tacit assumption that the agent still chooses uniformly at random from some set of policies. As this is not what we suggest, I don't see how it applies to our algorithms.

Re power-seeking in general: I believe one important class of safety criteria one should use to select from the many possible plans that can fulfill an aspiration-type goal is criteria that aim to quantify the amount of power/resources/capabilities/control potential the agent has at each time step. There are some promising metrics for this already (including "empowerment", reachability, and Alex Turner's AUP). We are currently investigating some versions of such measures, including ones we believe might be novel. A key challenge in doing so is again tractability. Counting the reachable states for example might be intractable, but approximating that number by a recursively computable metric based on Wasserstein distance and Gaussian approximations to latent state distributions seems tractable and might turn out to be good enough.

Thank you for the warm encouragement. 

We tried to be careful not to claim that merely making the decision algorithm aspiration-based is already sufficient to solve the AI safety problem, but maybe we need to add an even more explicit disclaimer in that direction. We explore this approach as a potentially necessary ingredient for safety, not as a complete plan for safety.

In particular, I perfectly agree that conflicting goals are also a severe problem for safety that needs to be addressed (while I don't believe there is a unique problem for safety that deserves being called "the" problem). In my thinking, the goals of an AGI system are always the direct or indirect consequences of the task it is given by some human that is authorized to give the system a task. If that is the case, the problem of conflicting goals is ultimately an issue of conflicting goals between humans. In your paperclip example, the system should reject the task of producing a trillion paperclips because that likely interferes with the foreseeable goals of other humans. I firmly believe we need to find a design feature that makes sure that the system rejects tasks that are conflicting with other human goals in this way. For the most powerful systems, we might have to do something like what davidad suggests in his Open Agency Architecture, where plans devised by the AGI need to be approved by some form of human jury. I believe such a system would reject almost any maximization-type goals and would only accept almost exclusively aspiration-type goals, and this is the reason why I want to find out how such a goal could then be fulfilled in a rather safe way.

Re quantilization/satisficing: I think that apart from the potentially conflicting goals issue, there are at least two more issues with plain satisficing/quantilization (understood as picking a policy uniformly at random from those that promise at least X return in expectation or among the top X% percent of the feasibility interval): (1) It might be computationally intractable in complex environments that require many steps, unless one finds a way to do that sequentially (i.e., from time step to time step). (2) The unsafe ways to fulfill the goal might not be scarce enough to have sufficiently small probability when choosing policies uniformly at random. The latter is the reason why I currently believe that the freedom to solve a given aspiration-type goal in all kinds of different ways should be used to select a policy that does so in a rather safe way, as judged on the basis of some generic safety criteria. This is why we also investigate in this project how generic safety criteria (such as those discussed for impact regularization in the maximization framework) should be integrated (see post #3 in the sequence). 

"Hence the information what I will do cannot have been available to the predictor." If the latter statement is correct, then how can could have "often correctly predicted the choices of other people, many of whom are similar to you, in the particular situation"?

There's many possible explanations for this data. Let's say I start my analysis with the model that the predictor is guessing, and my model attaches some prior probability for them guessing right in a single case. I might also have a prior about the likelihood of being lied about the predictor's success rate, etc. Now I make the observation that I am being told the predictor was right every single time in a row. Based on this incoming data, I can easily update my beliefs about what happened in the previous prediction excercises: I will conclude that (with some credence) the predictor was guessed right in each individual case or that (also with some credence) I am being lied to about their prediction success. This is all very simple Bayesian updating, no problem at all. As long as my prior beliefs assign nonzero credence to the possibility that the predictor guesses right (and I see not reason why that shouldn't be a possibility), I don't need to assign any posterior credence to the (physically impossible) assumption that they could actually foretell the actions.  

Take a possible world in which the predictor is perfect (meaning: they were able to make a prediction, and there was no possible extension of that world's trajectory in which what I will actually do deviates from what they have predicted). In that world, by definition, I no longer have a choice. By definition I will do what the predictor has predicted. Whatever has caused what I will do lies in the past of the prediction, hence in the past of the current time point. There is no point in asking myself now what I should do as I have no longer causal influence on what I will do. I can simply relax and watch myself doing what I have been caused to do some time before. I can of course ask myself what might have caused my action and try to predict myself from that what I will do. If I come to believe that it was myself who decided at some earlier point in time what I will do, then I can ask myself what I should have decided at that earlier point in time. If I believe that at that earlier point in time I already knew that the predictor would act in the way it did, and if I believe that I have made the decision rationally, then I should conclude that I have decided to one-box.

The original version of Newcomb's paradox in Nozick 1969 is not about a perfect predictor however. It begins with (1) "Suppose a being in whose power to predict your choices you have enormous confidence.... You know that this being has often correctly predicted your choices in the past (and has never, so far as you know, made an incorrect prediction about your choices), and furthermore you know that this being has often correctly predicted the choices of other people, many of whom are similar to you, in the particular situation to be described below". So the information you are given is explicitly only about things from the past (how could it be otherwise). It goes on to say (2) "You have a choice between two actions". Information (2) implies that what I will do has not been decided yet and I still have causal influence on what I will do. Hence the information what I will do cannot have been available to the predictor. This implies that the predictor cannot have made a perfect prediction about my behaviour. Indeed nothing in (1) implies that they have, the information given is not about my future action at all. After I will have made my decision, it might turn out, of course, that it happens to coincides with what the predictor has predicted. But that is irrelevant for my choice as it would only imply that the predictor will have been lucky this time. What should I make of information (1)? If I am confident that I still have a choice, that question is of no significance for the decision problem at hand and I should two-box. If I am confident that I don't have a choice but have decided already, the reasoning of the previous paragraph applies and I should hope to observe that I will one-box.

What if I am unsure whether or not I still have a choice? I might have the impression that I can try to move my muscles this way or that way, without being perfectly confident that they will obey. What action should I then decide to try? I should decide to try two-boxing. Why? Because that decision is the dominant strategy: if it turns out that indeed I can decide my action now, then we're in a world where the predictor was not perfect but merely lucky and in that world two-boxing is dominant; if it instead turns out that I was not able to override my earlier decision at this point, then we're in a world where what I try now makes no difference. In either case, trying to two-box is undominated by any other strategy.

Can you please explain the "zero-probability possible world"?

Hi Nathan,

I'm not sure. I guess it depends on what your definition of "agent" is. In my personal definition, following Yann LeCun's recent whitepaper, the "agent" is a system with a number of different modules, one of it being a world model (in our case, an MDP that it can use to simulate consequences of possible policies), one of it being a policy (in our case, an ANN that takes states as inputs and gives action logits as outputs), and one module being a learning algorithm (in our case, a variant of Q-learning that uses the world model to learn a policy that achieves a certain goal). The goal that the learning algorithm aims to find a suitable policy for is an aspiration-based goal: make the expected return equal some given value (or fall into some given interval). As a consequence, when this agent behaves like this very often in various environments with various goals, we can expect it to meet its goals on average (under mild conditions on the sequence of environments and goals, such as sufficient probabilistic independence of stochastic parts of the environment and bounded returns, so that the law of large number applies).

Now regarding your suggestion that the learned policy (what you call the frozen net I think) could be checked by humans before being used: that is a good idea for environments and policies that are not too complex for humans to understand. In more complex cases, one might want to involve another AI that tries to prove the proposed policy is unsafe for reasons not taken into account in selecting it in the first place, and one can think of many variations in the spirit of "debate" or "constitutional AI" etc.

Excellent! I have three questions

  1. How would we get to a certain upper bound on ?

  2. As collisions with the boundary happen exactly when one action's probability hits zero, it seems the resulting policies are quite large-support, hence quite probabilistic, which might be a problem in itself, making the agent unpredictable. What is your thinking about this?

  3. Related to 2., it seems that while your algorithm ensures that expected true return cannot decrease, it might still lead to quite low true returns in individual runs. So do you agree that this type of algorithm is rather a safety ingredient amongst other ingredients, rather that meant to be a sufficient solution to satety?

I'm sorry but I fail to see the analogy to momentum or adam, in neither of which the vector or distance from the current point to the initial point plays any role as far as I can see. It is also different from regularizations that modify the objective function, say to penalize moving away from the initial point, which would change the location of all minima. The method I propose preserves all minima and just tries to move towards the one closest to the initial point. I have discussed it with some mathematical optimization experts and they think it's new.

Load More