Vanessa Kosoy

AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda. Based in Israel.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org

Comments

If you want to add a term that rewards equality, it makes more sense to use e.g. variance rather than entropy, because (i) entropy is insensitive to the magnitude of the differences between utilities, and (ii) entropy is only meaningful under some approximation, and it's not clear which approximation to take: the precise distribution is discrete, and usually no two people will have exactly the same utility, so the entropy will always be the logarithm of the population size.
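A small illustration of points (i) and (ii), under my own formalization (taking "entropy" to mean the Shannon entropy of the empirical distribution of utility values; the numbers are made up):

```python
import math
from collections import Counter

def empirical_entropy(utilities):
    """Shannon entropy of the empirical distribution of utility values."""
    counts = Counter(utilities)
    n = len(utilities)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def variance(utilities):
    m = sum(utilities) / len(utilities)
    return sum((u - m) ** 2 for u in utilities) / len(utilities)

near_equal = [1.0, 1.001, 1.002]
very_unequal = [1.0, 100.0, 10000.0]

# All values are distinct in both profiles, so the empirical distribution
# is uniform in both cases and the entropy is log(3) regardless of the
# magnitude of the differences.
print(empirical_entropy(near_equal), empirical_entropy(very_unequal))
# Variance, in contrast, separates the profiles by many orders of magnitude.
print(variance(near_equal), variance(very_unequal))
```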

A relevant classical result in complexity theory is the Karp-Lipton theorem: if NP ⊆ P/poly then the polynomial hierarchy collapses. Which is to say, NP ⊆ P/poly is pretty unlikely. This means that you can't solve circuit satisfiability in time polynomial in the size n of the circuit, even if you had unlimited time to precompute an advice string for the given n.

When I try to move my mouse over the smiley, both the selection and the smiley disappear before I can click it.

I have the same bug in Firefox 113.0.2 on Windows 11. But, it seems to depend on what I select: for some selections it works, for some selections it doesn't.

My disagreement with this post is that I am a human-centric carbon[1] chauvinist. You write:

I'm saying something more like: we humans have selfish desires (like for vanilla ice cream), and we also have broad inclusive desires (like for everyone to have ice cream that they enjoy, and for alien minds to feel alien satisfaction at the fulfilment of their alien desires too). And it's important to get the AI on board with those values.

Why would my "selfish" desires be any less[2] important than my "broad inclusive" desires? Assuming even that it makes sense to separate the two, which is unclear. I don't see any decision-theoretic justification for this. (This is not to say that AI should like ice cream, but that it should provide me with ice cream, or some replacement that I would consider better.)

I think if the AI kills everyone and replaces us with vaguely-human-like minds that we would consider "sentient", that will go on to colonize the universe and have lots of something-recognizable-as "fun and love and beauty and wonder", it would certainly be better than bleak-desolation-squiggles, but it would also certainly be a lot worse than preserving everyone alive today and giving us our own utopic lives.

  1. ^

    I probably don't care much about actual carbon. If I was replaced with a perfect computer simulation of me, it would probably be fine. But I'm not sure about even that much.

  2. ^

    "Less" relatively to the weights they already have in my utility function.

it really motivates bargaining, as there are usually Pareto improvements that are obvious, and near-Pareto improvements beyond even that.

I couldn't really parse this. What does it mean to "motivate bargaining" and why is it good?

If you're worried about the experience of the most unlucky/powerless member, this ensures you won't degrade it with your negotiation.

In practice, it's pretty hard for a person to survive on their own, so usually not existing is at least as good as the minimax (or at least it's not that much worse). It can actually be way, way better than the minimax, since the minimax implies every other person doing their collective best to make things as bad as possible for this person.

I'm trying to compare your proposal to https://en.wikipedia.org/wiki/Shapley_value. On the surface, it seems similar

There is a huge difference: Shapley value assumes utility is transferable, and I don't.
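For contrast, a minimal sketch of the Shapley value, which is only well-defined because the coalition worth v is a transferable quantity that gets split among the players (the two-player game below is a made-up example, not from the discussion above):

```python
from itertools import permutations

def shapley(players, v):
    """Shapley value: each player's marginal contribution to the coalition
    of their predecessors, averaged over all orderings of the players."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            phi[p] += v(coalition | {p}) - v(coalition)
            coalition = coalition | {p}
    return {p: total / len(orders) for p, total in phi.items()}

# Made-up game: a player is worth 1 alone, and the pair is worth 3 together.
v = lambda coalition: {0: 0.0, 1: 1.0, 2: 3.0}[len(coalition)]
phi = shapley(["Alice", "Bob"], v)
print(phi)  # {'Alice': 1.5, 'Bob': 1.5}
```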

I do worry a bit that in both Shapley's system and yours, it is acceptable to disappear people: the calculation where they don't exist seems problematic when applied to actual people. It has the nice property of ignoring "outliers" (really, negative-value lives), but that's only a nice property in theory; it would be horrific if actually applied.

By "outliers" I don't mean negative-value lives, I mean people who want everyone else to die and/or to suffer. 

It is not especially acceptable in my system to disappear people: it is an outcome that is considered, but it only happens if enough people have a sufficiently strong preference for it. I do agree it might be better to come up with a system that somehow discounts "nosy" preferences, i.e. doesn't put much weight on what Alice thinks Bob's life should look like when it contradicts what Bob wants.

It is possible to get rid of the need to consider worlds in which some players don't exist, by treating the construction as optimization for a subset S of the players. This can be meaningful in the context of a single entity (e.g. the AI) optimizing for the preferences of S, or in the context of game theory, where we interpret it as having all players coordinate in a manner that optimizes for the utilities of S (in the latter context, it makes sense to first discard any outcome that assigns a below-minimax payoff to any player[1]). The disadvantage is, this admits BATNAs in which some people get worse-than-death payoffs (because of the adversarial preferences of other people). On the other hand, it is still "threat resistant" in the sense that the mechanism itself doesn't generate any incentive to harm people.

It would be interesting to compare this with Diffractor's ROSE point.

  1. ^

Regarded as a candidate definition for a fully-general abstract game-theoretic superrational optimum, this still seems lacking, because the minimax seems too weak a baseline in a game of more than two players. Maybe there is a version based on some notion of "coalition minimax".

Here's an idea about how to formally specify society-wide optimization, given that we know the utility function of each individual. In particular, it might be useful for multi-user AI alignment.

A standard tool for this kind of problem is Nash bargaining. The main problem with it is that it's unclear how to choose the BATNA (disagreement point). Here's why some simple proposals don't work:

  • One natural BATNA for any game is assigning each player their maximin payoff. However, for a group of humans this means something horrible: Alice's maximin is a situation in which everyone except Alice is doing their best to create the worst possible world for Alice. This seems like an unhealthy and unnatural starting point.
  • Another natural BATNA is the world in which no humans exist at all. The problem with this is: suppose there is one psychopath who for some reason prefers everyone not to exist. Then, there are no Pareto improvements over the BATNA, and therefore this empty world is already the "optimum". The same problem applies to most choices of BATNA.

Here is my proposal. We define the socially optimal outcome by recursion over the number of people n. For n = 1, we obviously just optimize the utility function of the lone person. For a set of people S of cardinality n, consider any given i ∈ S. The BATNA payoff of i is defined to be the minimum over all j ∈ S of the payoff of i in the socially optimal outcome of S \ {j} (we consider worlds in which j doesn't exist). If there are multiple optimal outcomes, we minimize over them. Typically, the minimum is achieved for j = i, but we can't just set j = i in the definition: we need the minimization in order to make sure that the BATNA is always admissible[1]. We then do Nash bargaining with respect to this BATNA.
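A minimal sketch of this recursion, under simplifying assumptions that are not part of the proposal itself: I encode an outcome as the set of people who exist, restrict to deterministic outcomes (so the final `max` is only a crude stand-in for proper Nash bargaining over lotteries), and break ties arbitrarily:

```python
import math

def social_optimum(people, utility, all_outcomes):
    """Recursive social optimum with leave-one-out BATNAs (deterministic toy)."""
    people = frozenset(people)
    # only consider worlds in which everyone outside `people` doesn't exist
    outcomes = [o for o in all_outcomes if o <= people]
    if len(people) == 1:
        (i,) = people
        return max(outcomes, key=lambda o: utility(i, o))
    # BATNA of i: minimum over all j (including j = i) of i's payoff
    # in the socially optimal outcome of people \ {j}
    batna = {
        i: min(utility(i, social_optimum(people - {j}, utility, all_outcomes))
               for j in people)
        for i in people
    }
    # Nash bargaining step: maximize the product of gains over the BATNA
    # (an admissible BATNA guarantees some outcome has nonnegative gains)
    def nash_product(o):
        gains = [utility(i, o) - batna[i] for i in people]
        return math.prod(gains) if all(g >= 0 for g in gains) else -1.0
    return max(outcomes, key=nash_product)

# Toy example: two selfish people, payoff 1 for existing and 0 otherwise.
people = frozenset({"Alice", "Bob"})
outcomes = [frozenset(s) for s in [(), ("Alice",), ("Bob",), ("Alice", "Bob")]]
selfish = lambda i, o: 1.0 if i in o else 0.0
opt = social_optimum(people, selfish, outcomes)
print(opt)  # the outcome in which both Alice and Bob exist
```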

Good properties of this proposal:

  • The outcome is Pareto efficient. It is also "fair" in the sense that the specification is rather natural and symmetric.
  • The only especially strong assumption needed to make sense of the definition, is the ability to consider worlds in which some people don't exist[2]. For example, we don't need anything like transferable utility or money. [EDIT: See child comment for a discussion of removing this assumption.]
  • AFAICT threats don't affect the outcome, since there's no reference to minimax or Nash equilibria.
  • Most importantly, it is resistant to outliers:
    • For example, consider a world with a set S of selfish people and one psychopath, who we denote p. The outcome space is 2^(S ∪ {p}): each person either exists or not. A selfish person gets payoff 1 for existing and payoff 0 for not existing. The psychopath's payoff is minus the number of people who exist. Let n be the cardinality of S. Then, we can check that the socially optimal outcome gives each selfish person a payoff of n/(n+1) (i.e. they exist with this probability).
    • In the above example, if we replace the selfish people with altruists (whose utility function is the number of altruists that exist) the outcome is even better: the expected number of existing altruists is n − O(1/n).
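The selfish-people example can be checked numerically. The recursion gives BATNA 0 to each selfish person (removing themselves leaves them with payoff 0) and BATNA −n to the psychopath (removing the psychopath lets all n selfish people exist); by symmetry, the bargaining then reduces to choosing a single existence probability q:

```python
def nash_product(q, n):
    # gains over the BATNA: q - 0 for each of the n selfish people,
    # and (-n * q) - (-n) = n * (1 - q) for the psychopath
    return (q ** n) * n * (1 - q)

results = {}
for n in [1, 2, 5, 20]:
    results[n] = max((k / 10000 for k in range(10001)),
                     key=lambda q: nash_product(q, n))
    print(n, results[n])  # the optimum sits at q = n / (n + 1)
```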
  1. ^

    "Admissible" in the sense that, there exists a payoff vector which is a Pareto improvement over the BATNA and is actually physically realizable.

  2. ^

    We also need to assume that we can actually assign utility functions to people, but I don't consider it a "strong assumption" in this context.

Jobst Heitzig asked me whether infra-Bayesianism has something to say about the absent-minded driver (AMD) problem. Good question! Here is what I wrote in response:

Philosophically, I believe that it is only meaningful to talk about a decision problem when there is also some mechanism for learning the rules of the decision problem. In ordinary Newcombian problems, you can achieve this by e.g. making the problem iterated. In AMD, iteration doesn't really help, because the driver doesn't remember anything that happened before. We can consider a version of iterated AMD where the driver has a probability ε to remember every intersection, but they always remember whether they arrived at the right destination. Then, it is equivalent to the following Newcombian problem:

  • With probability 1 − ε, counterfactual A happens, in which Omega decides about both intersections via simulating the driver in counterfactuals B and C.
  • With probability ε/2, counterfactual B happens, in which the driver decides about the first intersection, and Omega decides about the second intersection via simulating the driver in counterfactual C.
  • With probability ε/2, counterfactual C happens, in which the driver decides about the second intersection, and Omega decides about the first intersection via simulating the driver in counterfactual B.

For this, an IB agent indeed learns the updateless optimal policy (although the learning rate carries an O(1/ε) penalty).
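For reference, the updateless-optimal policy in the standard one-shot AMD (with the usual Piccione-Rubinstein payoffs: exiting at the first intersection gives 0, exiting at the second gives 4, never exiting gives 1) can be found numerically:

```python
def expected_payoff(p):
    """Expected payoff when the driver CONTINUEs with probability p
    at every intersection (he can't tell the two apart)."""
    return (1 - p) * 0 + p * (1 - p) * 4 + p * p * 1

best_p = max((k / 10000 for k in range(10001)), key=expected_payoff)
print(best_p, expected_payoff(best_p))  # p = 2/3 gives expected payoff 4/3
```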

Your alleged counterexample is wrong, because the f you constructed is not computable. First, "computable" means that there is a program P which receives x and an accuracy parameter ε as inputs s.t. for all x and ε, it halts and outputs a number within ε of f(x).

Second, even your weaker definition fails here. Let x := (0, 0, 0, …). Then, there is no program that computes f(x) within accuracy 1/2, because for every n, f(0ⁿ111…) = 1 while f(x) = 0. Therefore, determining the value of f within 1/2 requires looking at infinitely many elements of the sequence. Any program that outputs 0 on x has to halt after reading some finite number n of symbols, in which case it would output 0 on 0ⁿ111… as well.
