Jobst Heitzig

Senior Researcher / Lead, FutureLab on Game Theory and Networks of Interacting Agents @ Potsdam Institute for Climate Impact Research.

I'm a mathematician working on collective decision making, game theory, formal ethics, international coalition formation, and a lot of stuff related to climate change. Here's my professional profile.

I like the clarity of this post very much! Still, we should be aware that all this hinges on what exactly we mean by "the model".

If "the model" only refers to one or more functions, like a policy function pi(s) and/or a state-value function V(s) and/or a state-action -value function Q(s,a) etc., but does not refer to the training algorithm, then all you write is fine. This is how RL theory uses the word "model".

But some people here also use the term "the model" in a broader sense, potentially including the learning algorithm that adjusts said functions, and in that case "the model" *does* see the reward signal. A better and more common term for the combination of model and learning algorithm is "agent", but some people seem to be a little sloppy in distinguishing "model" and "agent". One can of course also imagine architectures in which the distinction is less clear, e.g., when the whole "AI system" consists of even more components such as several "agents", each of which using different "models". Some actor-critic systems can for example be interpreted as systems consisting of two agents (an actor and a critic). And one can also imagine hierarchical systems in which a parameterized learning algorithm used in the low level component is adjusted by a (hyper-)policy function on a higher level that is learned by a 2nd-level learning algorithm, which might as well be hyperparameterized by an even higher-level learned policy, and so on, up towards one final "base" learning algorithm that was hard-coded by the designer.

So, in the context of AGI or ASI, I believe the concept of an "AI system" is the most useful term in this ontology, as we cannot be sure what the architecture of an ASI will be, how many "agents" and "policies" on how many "hierarchical levels" it will contain, what their division of labor will be, and how many "models" they will use and adjust in response to observations in the environment.

In summary, as the outermost-level learning algorithm in such an "AI system" *will* generally see some form of "reward signal", I believe that most statements that are imprecisely phrased in terms of a "model" getting "rewarded" can be fixed by simply replacing the term "model" by "AI system".

replacing the SGD with something that takes the shortest and not the steepest path

Maybe we can design a local search strategy similar to gradient descent which *does* try to stay close to the initial point *x*0? E.g., if at *x,* go a small step into a direction that has the minimal scalar product with *x *– *x*0 among those that have at most an angle of *alpha* with the current gradient, where *alpha*>0 is a hyperparameter. One might call this "stochastic cone descent" if it does not yet have a name.

roughly speaking, we gradient-descend our way to whatever point on the perfect-prediction surface is closest to our initial values.

I believe this is not correct as long as "gradient-descend" means some standard version of gradient descent because those are all local, can go highly nonlinear paths, and do not memorize the initial value to try staying close to it.

But maybe we can design a local search strategy similar to gradient descent which *does* try to stay close to the initial point *x*0? E.g., if at *x,* go a small step into a direction that has the minimal scalar product with *x *– *x*0 among those that have at most an angle of *alpha* with the current gradient, where *alpha*>0 is a hyperparameter. One might call this "stochastic cone descent" if it does not yet have a name.

Does the one-shot AI *necessarily* aim to *maximize* some function (like the probability of saving the world, or the expected "savedness" of the world or whatever), or can we also imagine a *satisficing* version of the one-shot AI which "just tries to save the world" with a decent probability, and doesn't aim to do any more, i.e., does *not* try to maximize that probability or the quality of that saved world etc.?

I'm asking this because

- I suspect that we otherwise might still make a mistake in specifying the optimization target and incentivize the one-shot AI to do something that "optimally" saves the world in some way we did not foresee and don't like.
- I try to figure out whether your plan would be hindered by switching from an optimization paradigm to a satisficing paradigm right now in order to buy time for your plan to be put into practice :-)

Definition 4: Expectation w.r.t. a Set of Sa-Measures

This definition is obviously motivated by the plan to later apply some version of maximin rule, so that only the inf matters.

I suggest that we also study versions what employ other decision-under-ambiguity rules such as Hurwicz' rule or Savage's minimax regret rule.

From my reading of quantilizers, they might still choose "near-optimal" actions, just only with a small probability. Whereas a system based on decision transformers (possibly combined with a LLM) could be designed that we could then simply tell to "make me a tea of this quantity and quality within this time and with this probability" and it would attempt to do just that, without trying to make more or better tea or faster or with higher probability.

even when the agents are unable to explicitly bargain or guarantee their fulfilment of their end by external precommitments

I believe there is a misconception here. The *actual* game you describe is the game between the *programmers*, and the fact that they know in advance that the others' programs *will indeed* be run with the code that their own program has access to *does* make each program submission a *binding commitment to behave in a certain way*.

Game Theory knows since long that if binding commitments are possible, most dilemmas can be solved easily. In other words, I believe this is very nice but is quite far from being the "huge success" you claim it is.

Put differently: The whole thing depends crucially on the fact that X can be *certain* that Y will use the strategy (=code) X thinks it will use. But how on Earth would a real agent ever be able to know such a thing about another agent?

I just stumbled upon this and noticed that a real-world mechanism for international climate policy cooperation that I recently suggested in this paper can be interpreted as a special case of your (**G**,**X**,**Y**) framework.

Assume a fixed game **G** where

- each player's action space is the nonnegative reals,
(**U***x*,*y*) is weakly decreasing in*x*and weakly increasing in*y*.(**V***x*,*y*) is weakly decreasing in*y*and weakly increasing in*x*.

(Many public goods games, such as the Prisoners' Dilemma, have such a structure)

Let's call an object a *Conditional Commitment Function (CCF)* iff it is a bounded, continuous, and weakly increasing function from the nonnegative reals into the nonnegative reals. (Intended interpretation of a CCF *C*: If opponent agrees to do *y*, I agree to do any *x* that has *x <= C*(*y*))

Now consider programs of the following kind:

```
C = <some CCF>
if code(opponent) equals code(myself) except that C is replaced by some CCF D:
output the largest x >= 0 for which there is a y <= D(x) with x <= C(y)
else:
output 0
```

Let's denote this program **Z**(*C*) , where *C* is the CCF occurring in line 1 of the program. Finally, let's consider the meta-game where two programmers **A** and **B**, knowing **G**, each simultaneously choose a *C* and submit the program **Z**(*C*), the two programs are executed once to determine actions (*x*,*y*), **A** gets **U**(*x*,*y*) and **B** gets **V**(*x*,*y*).

(In the real world, the "programmers" could be the two parliaments of two countries that pass two binding laws (the "programs"), and the actions could be domestic levels of greenhouse gas emissions reductions.)

In our paper, we prove that the outcomes that will result from the *strong* Nash equilibria of this meta-game are exactly the Pareto-optimal outcomes (*x*,*y*) that both programmers prefer to the outcome (0,0).

(In an *N* (instead of 2) player context, the outcomes of strong Nash equilibria are exactly the ones from a certain version of the underlying base game's *core*, a subset of the Pareto frontier that might however be empty)*.*

I'd be interested in learning whether you think this is an interesting application context to explore the theories you discuss.

Dear Robert, I just found out about your work and absolutely love it.

Has the following idea been explored yet?

- The AI system is made of two agents, a
*strategic agent S*and a*controller agent C*. *S*'s reward function approximates the actual objective function of the system as defined by the designer.*S*can only*propose*actions to*C*, only knows about the environment and the actual actions taken what*C*tells it, and only has as many compute resources as*C*gives it.*C*'s reward function encodes*hard constraints*such as the three laws of robotics or some other formal ethical constraint system, in the form of a binary reward (1 for compliance, 0 for non-compliance).*C*has access to the actual observations and has the power to either actually take the action proposed by*S*or not.- In addition,
*C*is free to tell*S*anything regarding whether it actually took the proposed action and what the observations are, and can curtail*S*'s compute resources to avoid being outsmarted by*S*. - If indifferent in light of its reward function,
*C*will take the proposed action, will be honest about observations, and will not curtail resources (but will not get a positive reward from this because that could be exploited by*S*).

I'm sorry but I fail to see the analogy to momentum or adam, in neither of which the vector or distance from the current point to the

initialpoint plays any role as far as I can see. It is also different from regularizations that modify the objective function, say to penalize moving away from the initial point, which would change the location of all minima. The method I propose preserves all minima and just tries to move towards the one closest to the initial point. I have discussed it with some mathematical optimization experts and they think it's new.