Jobst Heitzig

Senior Researcher / Lead, FutureLab on Game Theory and Networks of Interacting Agents @ Potsdam Institute for Climate Impact Research. 

I'm a mathematician working on collective decision making, game theory, formal ethics, international coalition formation, and a lot of stuff related to climate change. Here's my professional profile.

Comments

"Hence the information what I will do cannot have been available to the predictor." If the latter statement is correct, then how could the predictor have "often correctly predicted the choices of other people, many of whom are similar to you, in the particular situation"?

There are many possible explanations for this data. Let's say I start my analysis with the model that the predictor is guessing, and my model attaches some prior probability to them guessing right in a single case. I might also have a prior about the likelihood of being lied to about the predictor's success rate, etc. Now I make the observation that I am being told the predictor was right every single time in a row. Based on this incoming data, I can easily update my beliefs about what happened in the previous prediction exercises: I will conclude that (with some credence) the predictor guessed right in each individual case or that (also with some credence) I am being lied to about their prediction success. This is all very simple Bayesian updating, no problem at all. As long as my prior beliefs assign nonzero credence to the possibility that the predictor guesses right (and I see no reason why that shouldn't be a possibility), I don't need to assign any posterior credence to the (physically impossible) assumption that they could actually foretell the actions.
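A minimal sketch of this kind of update, with made-up numbers. The hypothesis names, the prior values, the per-case guessing probability `p_guess`, and the assumption that a liar would report a perfect record with certainty are all illustrative choices, not taken from the original problem. The point it shows is only that a hypothesis with zero prior credence ("foretells") keeps zero posterior credence, while the remaining credence is split between "lucky guesser" and "lied to".

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of mutually exclusive hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: weight / total for h, weight in joint.items()}


n = 10         # number of past predictions reported as correct (assumed)
p_guess = 0.5  # chance that a pure guesser is right in a single case (assumed)

# Prior credences (illustrative). Actual foresight gets prior credence zero.
priors = {"lucky_guesser": 0.99, "lied_to": 0.01, "foretells": 0.0}

# Probability of being told "right n times in a row" under each hypothesis.
likelihoods = {
    "lucky_guesser": p_guess ** n,  # n independent lucky guesses
    "lied_to": 1.0,                 # assumed: a liar always reports a perfect record
    "foretells": 1.0,               # a true foreteller would also have a perfect record
}

post = posterior(priors, likelihoods)
for hypothesis, credence in post.items():
    print(f"{hypothesis}: {credence:.4f}")
```

With these particular numbers most of the posterior credence moves to "lied to"; other priors would shift the split, but "foretells" stays at zero either way.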

Take a possible world in which the predictor is perfect (meaning: they were able to make a prediction, and there was no possible extension of that world's trajectory in which what I will actually do deviates from what they have predicted). In that world, by definition, I no longer have a choice. By definition I will do what the predictor has predicted. Whatever has caused what I will do lies in the past of the prediction, hence in the past of the current time point. There is no point in asking myself now what I should do, as I no longer have any causal influence on what I will do. I can simply relax and watch myself doing what I have been caused to do some time before. I can of course ask myself what might have caused my action and try to predict from that what I will do. If I come to believe that it was myself who decided at some earlier point in time what I will do, then I can ask myself what I should have decided at that earlier point in time. If I believe that at that earlier point in time I already knew that the predictor would act in the way it did, and if I believe that I have made the decision rationally, then I should conclude that I have decided to one-box.

The original version of Newcomb's paradox in Nozick 1969 is not about a perfect predictor, however. It begins with (1) "Suppose a being in whose power to predict your choices you have enormous confidence.... You know that this being has often correctly predicted your choices in the past (and has never, so far as you know, made an incorrect prediction about your choices), and furthermore you know that this being has often correctly predicted the choices of other people, many of whom are similar to you, in the particular situation to be described below". So the information you are given is explicitly only about things from the past (how could it be otherwise?). It goes on to say (2) "You have a choice between two actions". Information (2) implies that what I will do has not been decided yet and I still have causal influence on what I will do. Hence the information what I will do cannot have been available to the predictor. This implies that the predictor cannot have made a perfect prediction about my behaviour. Indeed nothing in (1) implies that they have; the information given is not about my future action at all. After I have made my decision, it might turn out, of course, that it happens to coincide with what the predictor has predicted. But that is irrelevant for my choice, as it would only imply that the predictor was lucky this time. What should I make of information (1)? If I am confident that I still have a choice, that question is of no significance for the decision problem at hand and I should two-box. If I am confident that I don't have a choice but have decided already, the reasoning of the previous paragraph applies and I should hope to observe that I will one-box.

What if I am unsure whether or not I still have a choice? I might have the impression that I can try to move my muscles this way or that way, without being perfectly confident that they will obey. What action should I then decide to try? I should decide to try two-boxing. Why? Because that decision is the dominant strategy: if it turns out that indeed I can decide my action now, then we're in a world where the predictor was not perfect but merely lucky and in that world two-boxing is dominant; if it instead turns out that I was not able to override my earlier decision at this point, then we're in a world where what I try now makes no difference. In either case, trying to two-box is undominated by any other strategy.
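The dominance argument can be checked mechanically. The following is a rough sketch under stated assumptions: the payoffs are the usual amounts from Newcomb's problem (1,000,000 in the opaque box if it is full, 1,000 in the transparent box), which the comment itself does not state, and the box content is treated as a fixed state of the world that my current attempt cannot influence, as in the argument above. The names `payoff`, `worlds`, and `weakly_dominates` are made up for illustration.

```python
from itertools import product

M, K = 1_000_000, 1_000  # usual Newcomb amounts (assumed, not stated above)


def payoff(attempt, world):
    """Payoff of attempting an action in a given world.

    world = (can_choose, box_full, earlier_decision). If I cannot override
    my earlier decision, the attempt is ignored and the earlier decision acts.
    """
    can_choose, box_full, earlier_decision = world
    action = attempt if can_choose else earlier_decision
    return (M if box_full else 0) + (K if action == "two-box" else 0)


# Every combination of: can I still choose, is the box full, what I decided earlier.
worlds = list(product((True, False), (True, False), ("one-box", "two-box")))


def weakly_dominates(a, b):
    """True if attempt a is never worse than b and strictly better somewhere."""
    never_worse = all(payoff(a, w) >= payoff(b, w) for w in worlds)
    sometimes_better = any(payoff(a, w) > payoff(b, w) for w in worlds)
    return never_worse and sometimes_better


print(weakly_dominates("two-box", "one-box"))
print(weakly_dominates("one-box", "two-box"))
```

In the worlds where I cannot override my earlier decision, both attempts give the same payoff, so the strict advantage of trying to two-box comes only from the worlds where I can still choose.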

Can you please explain the "zero-probability possible world"?

Hi Nathan,

I'm not sure. I guess it depends on what your definition of "agent" is. In my personal definition, following Yann LeCun's recent whitepaper, the "agent" is a system with a number of different modules, one of them being a world model (in our case, an MDP that it can use to simulate consequences of possible policies), one of them being a policy (in our case, an ANN that takes states as inputs and gives action logits as outputs), and one module being a learning algorithm (in our case, a variant of Q-learning that uses the world model to learn a policy that achieves a certain goal). The goal that the learning algorithm aims to find a suitable policy for is an aspiration-based goal: make the expected return equal some given value (or fall into some given interval). As a consequence, when this agent behaves like this very often in various environments with various goals, we can expect it to meet its goals on average (under mild conditions on the sequence of environments and goals, such as sufficient probabilistic independence of stochastic parts of the environment and bounded returns, so that the law of large numbers applies).

Now regarding your suggestion that the learned policy (what you call the frozen net I think) could be checked by humans before being used: that is a good idea for environments and policies that are not too complex for humans to understand. In more complex cases, one might want to involve another AI that tries to prove the proposed policy is unsafe for reasons not taken into account in selecting it in the first place, and one can think of many variations in the spirit of "debate" or "constitutional AI" etc.

Excellent! I have three questions:

  1. How would we get to a certain upper bound on ?

  2. As collisions with the boundary happen exactly when one action's probability hits zero, it seems the resulting policies are quite large-support, hence quite probabilistic, which might be a problem in itself, making the agent unpredictable. What is your thinking about this?

  3. Related to 2., it seems that while your algorithm ensures that expected true return cannot decrease, it might still lead to quite low true returns in individual runs. So do you agree that this type of algorithm is rather a safety ingredient amongst other ingredients, rather than meant to be a sufficient solution to safety?

I'm sorry, but I fail to see the analogy to momentum or Adam, in neither of which the vector or distance from the current point to the initial point plays any role, as far as I can see. It is also different from regularizations that modify the objective function, say to penalize moving away from the initial point, which would change the location of all minima. The method I propose preserves all minima and just tries to move towards the one closest to the initial point. I have discussed it with some mathematical optimization experts and they think it's new.

I like the clarity of this post very much! Still, we should be aware that all this hinges on what exactly we mean by "the model".

If "the model" only refers to one or more functions, like a policy function pi(s) and/or a state-value function V(s) and/or a state-action value function Q(s,a) etc., but does not refer to the training algorithm, then all you write is fine. This is how RL theory uses the word "model".

But some people here also use the term "the model" in a broader sense, potentially including the learning algorithm that adjusts said functions, and in that case "the model" does see the reward signal. A better and more common term for the combination of model and learning algorithm is "agent", but some people seem to be a little sloppy in distinguishing "model" and "agent". One can of course also imagine architectures in which the distinction is less clear, e.g., when the whole "AI system" consists of even more components such as several "agents", each of which uses different "models". Some actor-critic systems can for example be interpreted as systems consisting of two agents (an actor and a critic). And one can also imagine hierarchical systems in which a parameterized learning algorithm used in the low-level component is adjusted by a (hyper-)policy function on a higher level that is learned by a 2nd-level learning algorithm, which might as well be hyperparameterized by an even higher-level learned policy, and so on, up towards one final "base" learning algorithm that was hard-coded by the designer.

So, in the context of AGI or ASI, I believe the concept of an "AI system" is the most useful term in this ontology, as we cannot be sure what the architecture of an ASI will be, how many "agents" and "policies" on how many "hierarchical levels" it will contain, what their division of labor will be, and how many "models" they will use and adjust in response to observations in the environment.

In summary, as the outermost-level learning algorithm in such an "AI system" will generally see some form of "reward signal", I believe that most statements that are imprecisely phrased in terms of a "model" getting "rewarded" can be fixed by simply replacing the term "model" by "AI system".

Jobst Heitzig · 10mo

replacing the SGD with something that takes the shortest and not the steepest path

Maybe we can design a local search strategy similar to gradient descent which does try to stay close to the initial point x0? E.g., if at x, go a small step into a direction that has the minimal scalar product with x - x0 among those that have at most an angle of alpha with the current gradient, where alpha>0 is a hyperparameter. One might call this "stochastic cone descent" if it does not yet have a name.

roughly speaking, we gradient-descend our way to whatever point on the perfect-prediction surface is closest to our initial values.

I believe this is not correct as long as "gradient-descend" means some standard version of gradient descent because those are all local, can go highly nonlinear paths, and do not memorize the initial value to try staying close to it.

But maybe we can design a local search strategy similar to gradient descent which does try to stay close to the initial point x0? E.g., if at x, go a small step into a direction that has the minimal scalar product with x - x0 among those that have at most an angle of alpha with the current gradient, where alpha>0 is a hyperparameter. One might call this "stochastic cone descent" if it does not yet have a name.
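A rough sketch of how such a step could look, under several assumptions that go beyond the description above: "the current gradient" is read as the descent direction (the negative gradient), the full gradient is used instead of a stochastic estimate, alpha is taken to lie strictly between 0 and pi/2, and the step length is scaled by the gradient norm. The function names `cone_direction` and `cone_descent` are hypothetical. Within the cone, the direction with minimal scalar product with x - x0 is the one leaning as far as allowed towards x0.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cone_direction(x, x0, grad, alpha):
    """Unit direction within angle alpha of -grad that has the minimal
    scalar product with x - x0 (i.e. points back towards x0 as much as allowed)."""
    g_norm = norm(grad)
    if g_norm == 0.0:
        return [0.0] * len(x)                      # stationary point: do not move
    g = [-gi / g_norm for gi in grad]              # unit steepest-descent direction
    t = [x0i - xi for xi, x0i in zip(x, x0)]       # vector from x back to x0
    t_par = dot(t, g)
    u = [ti - t_par * gi for ti, gi in zip(t, g)]  # part of t orthogonal to g
    u_norm = norm(u)
    if u_norm < 1e-12:
        # No usable orthogonal component (e.g. at the start, where x == x0):
        # fall back to plain steepest descent. This is a simplification.
        return g
    t_norm = norm(t)
    if t_par >= t_norm * math.cos(alpha):
        return [ti / t_norm for ti in t]           # x0 - x already lies inside the cone
    # Otherwise pick the cone-boundary direction tilted towards x0.
    return [math.cos(alpha) * gi + math.sin(alpha) * ui / u_norm
            for gi, ui in zip(g, u)]


def cone_descent(grad_f, x0, alpha, lr=0.1, steps=200):
    """Repeat small steps along cone_direction, starting from x0."""
    x = list(x0)
    for _ in range(steps):
        grad = grad_f(x)
        d = cone_direction(x, x0, grad, alpha)
        step = lr * norm(grad)
        x = [xi + step * di for xi, di in zip(x, d)]
    return x


# Toy objective f(x, y) = (x + y - 2)^2, whose minima form the line x + y = 2.
def grad_f(p):
    r = 2.0 * (p[0] + p[1] - 2.0)
    return [r, r]


print(cone_descent(grad_f, [3.0, 0.0], alpha=math.pi / 4))
```

Since every step stays within an angle alpha < pi/2 of the descent direction, it remains a descent step for small enough step sizes, and the objective itself is never modified, so all minima stay where they are.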

Jobst Heitzig · 10mo

Does the one-shot AI necessarily aim to maximize some function (like the probability of saving the world, or the expected "savedness" of the world or whatever), or can we also imagine a satisficing version of the one-shot AI which "just tries to save the world" with a decent probability, and doesn't aim to do any more, i.e., does not try to maximize that probability or the quality of that saved world etc.?

I'm asking this because

  • I suspect that we otherwise might still make a mistake in specifying the optimization target and incentivize the one-shot AI to do something that "optimally" saves the world in some way we did not foresee and don't like.
  • I am trying to figure out whether your plan would be hindered by switching from an optimization paradigm to a satisficing paradigm right now in order to buy time for your plan to be put into practice :-)