Chantiel

Chantiel's Shortform

I've made a few posts that seemed to contain potentially valuable ideas related to AI safety. However, I got almost no feedback on them, so I was hoping some people could look at them and tell me what they think. They still seem valid to me, and if they are, they could potentially be very valuable contributions. And if they aren't valid, then I think knowing the reason for this could potentially help me a lot in my future efforts towards contributing to AI safety.

The posts are:

  1. My critique of a published impact measure.
  2. Manual alignment
  3. Alignment via reverse engineering

A system of infinite ethics

FWIW, this conclusion is not clear to me. To return to one of my original points: I don't think you can dodge this objection by arguing from potentially idiosyncratic preferences, even perfectly reasonable ones; rather, you need it to be the case that no rational agent could have different preferences. Either that, or you need to be willing to override otherwise rational individual preferences when making interpersonal tradeoffs.

Yes, that's correct. It's possible that there are some agents with consistent preferences that really would wish to get extraordinarily uncomfortable to avoid the torture. My point was just that this doesn't seem like it would be a common thing for agents to want.

Still, it is conceivable that there are at least a few agents out there that would consistently want to opt for the 0.5 chance of being extremely uncomfortable, and I do suppose it would be best to respect their wishes. This is a problem that I hadn't previously fully appreciated, so I would like to thank you for bringing it up.

Luckily, I think I've finally figured out a way to adapt my ethical system to deal with this. That is, the adaptation will allow agents to choose the extreme-discomfort-from-dust-specks option if that is what they wish, and have my ethical system respect their preferences. To do this, allow the satisfaction measure to include infinitesimals. Then, to respect the preferences of such agents, you just need to pick the right satisfaction measure.

Consider the agent for which each 50 years of torture causes a linear decrease in their utility function. For simplicity, imagine torture and discomfort are the only things the agent cares about; they have no other preferences. Also assume that the agent dislikes torture more than it dislikes discomfort, but only by a finite amount. Since the agent's utility function/satisfaction measure is linear, I suppose being tortured for an eternity would be infinitely worse for the agent than being tortured for a finite amount of time. So, assign satisfaction 0 to the scenario in which the agent is tortured for eternity. And if the agent is instead tortured for $n$ years, let the agent's satisfaction be $1 - n\epsilon$, where $\epsilon$ is whatever infinitesimal number you want. If my understanding of infinitesimals is correct, I think this will do what we want it to do in terms of having agents using my ethical system respect the agent's preferences.

Specifically, since being tortured forever would be infinitely worse than being tortured for a finite amount of time, any finite amount of torture would be accepted to decrease the chance of infinite torture. And this is what maximizing this satisfaction measure does: for any lottery, changing the chance of infinite torture has a finite effect on expected satisfaction, whereas changing the chance of finite torture only has an infinitesimal effect, so avoiding infinite torture would be prioritized.

Further, among lotteries involving finite amounts of torture, it seems the ethical system using this satisfaction measure continues to do what it's supposed to do. For example, consider the choice between the previous two options:

  1. A 0.5 chance of being tortured for 3^^^^3 years and a 0.5 chance of being fine.
  2. A 0.5 chance of 3^^^^3 - 9999999 years of torture and 0.5 chance of being extraordinarily uncomfortable.

If I'm using my infinitesimal math right, the expected satisfaction of taking option 1 would be $1 - \frac{1}{2}N\epsilon$, and the expected satisfaction of taking option 2 would be $1 - \frac{1}{2}(N - 9999999)\epsilon - \frac{1}{2}\delta$, where $N = 3\uparrow\uparrow\uparrow\uparrow 3$ and $\delta$ is some infinitesimal representing the satisfaction cost of being extraordinarily uncomfortable. Thus, to maximize this agent's satisfaction measure, my moral system would indeed let the agent give infinite priority to avoiding infinite torture: the ethical system would itself consider the agent getting infinite torture to be infinitely worse than getting only finite torture, and would treat finite amounts of torture as decreasing satisfaction in a linear manner. And, since the satisfaction measure is still technically bounded, it would still avoid the problem with utility monsters.

(In case it was unclear, "↑" is Knuth's up-arrow notation, just like "^".)
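To make the comparison above concrete, here is a minimal sketch (my own illustration, not part of the original argument) of the arithmetic I'm assuming: represent a satisfaction value $a + b\epsilon$ by the pair $(a, b)$ and compare lexicographically, since $\epsilon$ is smaller than every positive real. The specific numbers (a small stand-in for 3^^^^3, a one-$\epsilon$ cost for extreme discomfort) are hypothetical.

```python
# Sketch of hyperreal-style satisfaction values of the form a + b*eps,
# represented as the pair (a, b) and compared lexicographically.
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyper:
    std: float   # standard (real) part
    inf: float   # coefficient on the infinitesimal eps

    def __add__(self, other):
        return Hyper(self.std + other.std, self.inf + other.inf)

    def scale(self, c):
        return Hyper(c * self.std, c * self.inf)

    def __lt__(self, other):
        # eps is smaller than every positive real, so the standard part dominates
        return (self.std, self.inf) < (other.std, other.inf)

def expected(lottery):
    """Expected satisfaction of a list of (probability, Hyper) outcomes."""
    total = Hyper(0.0, 0.0)
    for p, s in lottery:
        total = total + s.scale(p)
    return total

N = 1e9  # stand-in for 3^^^^3 years of torture (the real number is far too large to store)
fine       = Hyper(1.0, 0.0)                    # satisfaction 1
discomfort = Hyper(1.0, -1.0)                   # 1 - delta, with delta one "unit" of eps (hypothetical)
torture    = lambda years: Hyper(1.0, -years)   # 1 - years*eps
eternal    = Hyper(0.0, 0.0)                    # satisfaction 0 for eternal torture

option1 = [(0.5, torture(N)), (0.5, fine)]
option2 = [(0.5, torture(N - 9999999)), (0.5, discomfort)]
option3 = [(0.5, eternal), (0.5, fine)]         # any chance of eternal torture

# Finite-torture differences are infinitesimal, so they never outweigh a change
# in the chance of eternal torture:
assert expected(option3) < expected(option1)
assert expected(option3) < expected(option2)
print(expected(option1), expected(option2))
```

Under this representation, any change to the probability of eternal torture moves the standard part, so it dominates any finite-torture or discomfort differences, which only move the infinitesimal coefficient.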

Chantiel's Shortform

If the impact measure was poorly implemented, then I think such an impact-reducing AI could indeed result in the world turning out that way. However, note that the technique in the paper is intended to, for a very wide range of variables, make the world in which the AI wasn't turned on as similar as possible to the world in which it was turned on. So, you can potentially avoid the AI-controlled-drone scenario by including the variable "number of AI-controlled drones in the world", or something correlated with it, as these variables could have quite different values between a possible world in which the AI was turned on and a possible world in which it wasn't.

Coming up with a set of variables wide enough to include that might seem a little difficult, but I'm not sure it would be. One option is to, for every definable function of the world, include the value of the function as one of the variables the AI considers and tries to avoid interfering with.

Chantiel's Shortform

I have some concerns about an impact measure proposed here. I'm interested in working on impact measures, and these seem like very serious concerns to me, so it would be helpful to see what others think about them. I asked Stuart, one of the authors, about these concerns, but he said he was too busy to work on dealing with them.

First, I'll give a basic description of the impact measure. Have your AI be turned on by some sort of stochastic process that may or may not result in the AI being turned on. For example, consider sending a photon through a semi-silvered mirror. Let X be the statement, "The AI was turned on". For a wide range of variables V, try to make the probability distribution $P(V \mid X)$ as similar as possible to the probability distribution $P(V \mid \neg X)$.
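As a rough sketch of how one might operationalize this comparison (my own toy formalization, not the paper's actual proposal; the variable names and the choice of total variation distance are assumptions):

```python
# Toy reduced-impact penalty: how different does each variable V look
# conditional on X ("the AI was turned on") versus not-X?
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def impact_penalty(dists_given_x, dists_given_not_x, weights=None):
    """Weighted sum of per-variable distances between P(V | X) and P(V | not-X).

    dists_given_x / dists_given_not_x: dict mapping variable name -> distribution
    (a list of probabilities over that variable's possible values).
    """
    weights = weights or {v: 1.0 for v in dists_given_x}
    return sum(weights[v] * total_variation(dists_given_x[v], dists_given_not_x[v])
               for v in dists_given_x)

# Hypothetical toy example: including "number of AI-controlled drones" as one of
# the variables makes a drone-filled outcome register as high impact.
p_given_x     = {"drone_count": [0.1, 0.1, 0.8], "room_temperature": [0.5, 0.5]}
p_given_not_x = {"drone_count": [0.9, 0.09, 0.01], "room_temperature": [0.5, 0.5]}
print(impact_penalty(p_given_x, p_given_not_x))  # large contribution from drone_count
```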

Here are my concerns:

It's not clear to me that we can find any verifiably stochastic process. Some interpretations of quantum physics say that there are latent variables that determine the outcomes of seemingly stochastic processes. In this case, just knowing whether the AI was or was not turned on may thus have implications about latent variables that possibly have large effects on the world.

Further, I can't find a way to define X and not-X that works, even if you find a truly stochastic process. You could try to specify X as "the photon makes it through the semi-silvered mirror" and specify not-X as the negation of this. But conditioning on this and nothing else is insufficient. To see why, note that X and not-X don't specify anything about the mechanics of how the photon and mirror interact. For example, X, as defined above, doesn't rule out universes in which there is no quantum physics as we know it, and the mirror and photon interact in some completely different way that makes the photon more likely to pass through than it would in our universe. Thus, conditioning on X would make it more likely the AI is in that universe, which could have large effects on the values of the variables.

You could try to get around this by also including in X and not-X mechanics describing how the photon and semi-silvered mirror interact. The issue is, we don't actually know exactly how the photon and semi-silvered mirror interact. We have models of quantum physics that have extremely high predictive accuracy in practice, but we still don't have a theory of everything. So we could include in X and not-X our current understanding of quantum physics, but that understanding might be somewhat wrong.

This could potentially be dangerous. Suppose the AI has invented some technology that, according to the actual, correct theory of everything, would destroy the world. But according to our slightly wrong model of quantum physics, it would do nothing in particular. Then the AI, having X and not-X condition on our current model of quantum physics, would see no problem with creating the technology.

Further still, I still have the concern about what to do if the AI in fact doesn't turn on. You could decide to just turn on some AI identical to it, and keep turning on AIs until one of them turns on. But then the world would be pretty much identical whether X or not-X turns out to be true, because the other AI would do the same thing it would have.

You could instead turn on some AI that uses the same reduced-impact technique, but is different from the current AI in some way that significantly affects the world. However, if you do this, then I think we might arrive in a very bad equilibrium in the AIs' behavior. Specifically, suppose all AIs decide to destroy the world if they're turned on. Then that means that if one of the AIs isn't turned on, some other AI will be turned on, which would also lead to the world being destroyed. If all the destroyed worlds look pretty much the same, then an AI deciding to destroy the world would have low impact according to your definition.

So suppose instead that if the AI doesn't turn on, we rely on some other AI safety technique. Suppose this AI safety technique successfully makes safe AI. In this case, we had no need to create the reduced-impact AI you described, because the other AI design would have worked fine. Suppose instead that the alternative AI safety technique doesn't work, and the result of using it is that our world gets destroyed. In that case, conditioning on not-X implies the world gets destroyed. So if an AI tries to make the world conditioned on X similar to the world conditioned on not-X, then it would again act to destroy the world.

Luke responded to one of my points, saying, "Yes. [what to do if the AI doesn't get turned on] needs to be handled with care. A time-delay might be sufficient for this, though." My response was:

My concern with time-delays is that the AI might make some sort of time-bomb-esque device. For example, suppose the AI knows that if it wasn't turned on, another AI like it would have been turned on three months later. Then both AIs could decide to make a "time bomb" set to go off four months after the first AI was scheduled to potentially turn on. And when it goes off, the "time bomb" might destroy the world, convert it into a mass of AI-controlled nanotechnology, or do some other bad thing. This way, neither AI would actually change the world relative to if it was never turned on.

A system of infinite ethics

I think this framing muddies the intuition pump by introducing sadistic preferences, rather than focusing just on unboundedness below. I don't think it's necessary to do this: unboundedness below means there's a sense in which everyone is a potential "negative utility monster" if you torture them long enough. I think the core issue here is whether there's some point at which we just stop caring, or whether that's morally repugnant.

Fair enough. So I'll provide a non-sadistic scenario. Consider again the scenario I previously described in which you have a 0.5 chance of being tortured for 3^^^^3 years, but also have the repeated opportunity to cause yourself minor discomfort in the case of not being tortured and as a result get your possible torture sentence reduced by 50 years.

If you have a utility function that is unbounded below, in which each 50 years of torture causes a linear decrease in satisfaction or utility, then to maximize expected utility or life satisfaction, it seems you would need to opt for living in extreme discomfort in the non-torture scenario to decrease your possible torture time by an astronomically small proportion, provided the expectations are defined.
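To spell out the arithmetic behind this claim (my notation; $u_{50}$ and $c$ are labels I'm introducing): each opportunity trades a 0.5 chance of avoiding 50 years of torture against a 0.5 chance of one more increment of discomfort, so the change in expected utility from taking it is

$$\Delta \mathbb{E}[U] = \tfrac{1}{2}\,u_{50} - \tfrac{1}{2}\,c = \tfrac{1}{2}\,(u_{50} - c) > 0 \quad \text{whenever } u_{50} > c,$$

where $u_{50}$ is the fixed utility loss from 50 extra years of torture and $c$ is the utility loss from one increment of discomfort. Because the utility function is linear and unbounded below, this marginal gain never shrinks, so an expected-utility maximizer takes every one of the repeated opportunities.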

To me, at least, it seems clear that you should not take the opportunities to reduce your torture sentence. After all, if you repeatedly decide to take them, you will end up with a 0.5 chance of being highly uncomfortable and a 0.5 chance of being tortured for 3^^^^3 years. This seems like a really bad lottery, and worse than the one that gives you a 0.5 chance of having an okay life.

Sorry, sloppy wording on my part. The question should have been "does this actually prevent us having a consistent preference ordering over gambles over universes" (even if we are not able to represent those preferences as maximising the expectation of a real-valued social welfare function)? We know (from lexicographic preferences) that "no-real-valued-utility-function-we-are-maximising-expectations-of" does not immediately imply "no-consistent-preference-ordering" (if we're willing to accept orderings that violate continuity). So pointing to undefined expectations doesn't seem to immediately rule out consistent choice.

Oh, I see. And yes, you can have consistent preference orderings that can't be represented as a utility function. And such techniques have been proposed before in infinite ethics. For example, one of Bostrom's proposals for dealing with infinite ethics is the extended decision rule. Essentially, it says to first look at the set of actions you could take that would maximize P(infinite good) - P(infinite bad). If there is only one such action, take it. Otherwise, take whichever action among these has the highest expected moral value given a finite universe.

As far as I know, you can't represent the above as a utility function, despite it being consistent.
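For concreteness, here is a rough sketch of the rule as I understand it (my formalization, not Bostrom's exact statement; the action names and numbers are hypothetical):

```python
# Extended decision rule sketch: first maximize P(infinite good) - P(infinite bad);
# break ties by expected moral value given a finite universe.
def extended_decision_rule(actions):
    """actions: list of dicts with keys 'p_inf_good', 'p_inf_bad',
    and 'finite_ev' (expected moral value given a finite universe)."""
    best_margin = max(a['p_inf_good'] - a['p_inf_bad'] for a in actions)
    candidates = [a for a in actions if a['p_inf_good'] - a['p_inf_bad'] == best_margin]
    return max(candidates, key=lambda a: a['finite_ev'])

# Hypothetical example of the fanaticism worry: a tiny edge in the infinite-stakes
# term outweighs any finite benefit whatsoever.
actions = [
    {'name': 'help everyone now',         'p_inf_good': 0.10,         'p_inf_bad': 0.0, 'finite_ev': 10**100},
    {'name': 'keep computing one second', 'p_inf_good': 0.10 + 1e-12, 'p_inf_bad': 0.0, 'finite_ev': 0},
]
print(extended_decision_rule(actions)['name'])  # -> 'keep computing one second'
```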

However, the big problem with the above decision rule is that it suffers from the fanaticism problem: you would be willing to bear any finite cost, even 3^^^3 years of torture, for even an unfathomably small chance of increasing the probability of infinite good or decreasing the probability of infinite bad. And this can get to pretty ridiculous levels. For example, suppose you are sure you could easily design a world that, if implemented, would make every creature happy and greatly increase the moral value of the world in a finite universe. However, coming up with such a design would take one second of computation on your supercomputer, which means one less second to keep thinking about astronomically improbable situations in which you could cause infinite good. Spending that second would thus mean some minuscule chance of missing out on infinite good or causing infinite bad. Thus, you decide not to help anyone, because you won't spare the one second of computer time.

More generally, I think the basic property of non-real-valued consistent preference orderings is that they value some things "infinitely more" than others. The issue is, if you really value some property infinitely more than some other property of lesser importance, it won't be worth your time to even consider pursuing the property of lesser importance, because it's always possible you could have used the extra computation to slightly increase your chances of getting the property of greater importance.

A system of infinite ethics

In addition to my previous response, I want to note that the issues with unbounded satisfaction measures are not unique to my infinite ethical system. Rather, they are common potential problems for a wide variety of aggregate consequentialist theories.

For example, suppose you're a classical utilitarian with an unbounded utility measure per person. And suppose you know that the universe is finite and will consist of a single inhabitant whose utility follows a Cauchy distribution. Then your expected utilities are undefined, despite the universe being knowably finite.
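For anyone who wants to see this numerically, here is a quick illustration (my own, relying only on the standard fact that the Cauchy distribution has no defined mean):

```python
# The Cauchy distribution has no expectation, so running sample means never settle.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_cauchy(1_000_000)
running_mean = np.cumsum(samples) / np.arange(1, samples.size + 1)
# The running mean keeps jumping no matter how many samples are averaged:
print(running_mean[[999, 99_999, 999_999]])
```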

Similarly, imagine you again used classical utilitarianism, but this time you have a finite universe with one utility monster and 3^^^3 regular people. Then, if your expected utilities are defined, you would need to give the utility monster what it wants, at the expense of everyone else.

So, I don't think your concern about keeping utility functions bounded is unwarranted; I'm just noting that they are part of a broader issue with aggregate consequentialism, not just with my ethical system.

A system of infinite ethics

Thanks. I've toyed with similar ideas previously myself. The advantage, if this sort of thing works, is that it conveniently avoids a major issue with preference-based measures: that they're not unique and therefore incomparable across individuals. However, this method seems fragile in relying on a finite number of scenarios: doesn't it break if it's possible to imagine something worse than whatever the currently worst scenario is? (E.g. just keep adding 50 more years of torture.) While this might be a reasonable approximation in some circumstances, it doesn't seem like a fully coherent solution to me.

As I said, you can allow for infinitely many scenarios if you want; you just need to make it so that the supremum of their values is 1 and the infimum is 0. That is, imagine there's an infinite sequence of scenarios you can come up with, each of which is worse than the last. Then just require that the infimum of the satisfactions of those scenarios is 0. That way, as you consider worse and worse scenarios, the satisfaction continues to decrease, but never goes below 0.
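As one concrete way to do this (my example, not from the original post): if scenario $k$ is, say, the baseline scenario plus $k$ additional 50-year blocks of torture, one could assign

$$\mathrm{sat}(\text{scenario}_k) = 2^{-k}, \qquad k = 0, 1, 2, \dots$$

Each scenario is assigned strictly lower satisfaction than the last, the supremum over all scenarios can still be 1, and the infimum of the sequence is 0 even though no scenario ever reaches it.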

IMO, the problem highlighted by the utility monster objection is fundamentally a prioritiarian one. A transformation that guarantees boundedness above seems capable of resolving this, without requiring boundedness below (and thus avoiding the problematic consequences that boundedness below introduces).

One issue with only having boundedness above is that the expected value of life satisfaction for an arbitrary agent would probably often be undefined or $-\infty$. For example, consider an agent with a probability distribution like a Cauchy distribution, except that it assigns probability 0 to anything above the maximum level of satisfaction, and is then renormalized so the probabilities sum to 1. If I'm doing my calculus right, the resulting probability distribution's expected value doesn't converge. You could interpret this either as the expected utility being undefined or as it being $-\infty$, since the Riemann sum approaches $-\infty$ as the width of the columns approaches zero.
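To spell out the calculation I have in mind (my derivation, assuming satisfaction is bounded above by some $M$ and has a Cauchy-like density below $M$ with normalizing constant $c$):

$$\mathbb{E}[\text{satisfaction}] = \int_{-\infty}^{M} x \cdot \frac{c}{1 + x^2}\, dx = -\infty,$$

since the integrand behaves like $c/x$ as $x \to -\infty$, and $\int_{-\infty}^{-1} \frac{dx}{x}$ diverges. Truncating and renormalizing only the upper tail leaves the lower tail heavy enough that the expectation fails to converge to a finite value.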

That said, even if the expectations are defined, it doesn't seem to me that keeping the satisfaction measure bounded above but not below would solve the problem of utility monsters. To see why, imagine a new utility monster as follows. The utility monster feels an incredibly strong need to have everyone on Earth be tortured. For the next hundred years, its satisfaction will decrease by 3^^^3 for every second there's someone on Earth not being tortured. Thus, assuming the expectations converge, the moral thing to do, according to maximizing average, total, or expected-value-conditioning-on-being-in-this-universe life satisfaction, is to torture everyone. This is a problem in both finite and infinite cases.

A final random thought/question: I get that we can't expected utility maximise unless we can take finite expectations, but does this actually prevent us having a consistent preference ordering over universes, or is it potentially just a representation issue?

If I understand what you're asking correctly, you can indeed have consistent preferences over universes, even if you don't have a bounded utility function. The issue is that, in order to act, you need more than just a consistent preference ordering over possible universes. In reality, you only get to choose between probability distributions over possible worlds, not specific possible worlds. And with an unbounded utility function, this will tend to result in undefined expected utilities over possible actions, and thus won't inform you about which action to take, which is the whole point of utility theory and ethics.

Now, under some probability distributions, you can have well-defined expected values even with an unbounded utility function. But, as I said, this is not robust, and I think that in practice the expected values of an unbounded utility function would be undefined.

A system of infinite ethics

For the record, according to my intuitions, average consequentialism seems perfectly fine to me in a finite universe.

That said, if you don't like using average consequentialism in a finite case, I don't personally see what's wrong with just having a somewhat different ethical system for finite cases. I know it seems ad-hoc, but I think there really is an important distinction between finite and infinite scenarios. Specifically, people have the moral intuition that larger numbers of satisfied lives are more valuable than smaller numbers of them, which average utilitarianism conflicts with. But in an infinite universe, you can't change the total amount of satisfaction or dissatisfaction.

But, if you want, you could combine both the finite ethical system and the infinite ethical system so that a single principle is used for moral deliberation. This might make it feel less ad hoc. For example, you could have a moral value function of the form f(total amount of satisfaction and dissatisfaction in the universe) * (expected value of life satisfaction for an arbitrary agent in this universe), where f is some bounded function that is maximized in the limit as its argument goes to $\infty$ and approaches this maximum very slowly.
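One concrete (and purely hypothetical) choice of such an f, just to make the shape of the proposal explicit:

$$V = f(T)\cdot \mathbb{E}[\text{life satisfaction of an arbitrary agent}], \qquad f(T) = \frac{1}{2} + \frac{1}{\pi}\arctan(T),$$

where $T$ stands for the total amount of satisfaction and dissatisfaction in the universe (exactly how to aggregate these is left open). Here $f$ is bounded in $(0, 1)$, increases with $T$, and approaches its supremum only very slowly, so in a finite universe larger totals still count for more, while in an infinite universe $f$ sits at its supremum and the criterion reduces to the expected-satisfaction term alone.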

For those who don't want this, they are free to use my total-utilitarian infinite ethical system. I think it just ends up as regular total utilitarianism in a finite world, or close to it.

A system of infinite ethics

In P(old probability of being in first group) * 1 = (P(old probability of being in first group) + $\epsilon$) * u the epsilon is smaller than any real number and there is no real small enough that it could characterise the difference between 1 and u.

Could you explain why you think so? I had already explained why $\epsilon$ would be real, so I'm wondering if you had an issue with my reasoning. To quote my past self:

Remember that if you decide to take a certain action, that implies that other agents who are sufficiently similar to you and in sufficiently similar circumstances also take that action. Thus, you can acausally have a non-infinitesimal impact on the satisfaction of agents in situations of the form, "An agent in a world with someone just like Slider who is also in very similar circumstances to Slider's." The above scenario is of finite complexity and isn't ruled out by evidence. Thus, the probability of an agent ending up in such a situation, conditioning only on being some agent in this universe, is nonzero [and non-infinitesimal].

If you have some odds or expectations that deal with groups and you have other considerations that deal with a finite amount of individuals, you either have the finite people not impact the probabilities at all, or the probabilities will stay infinitesimally close (for which I see a~b being used as I am reading up on infinities), which will conflict with the desiderata...

Just to remind you, my ethical system basically never needs to worry about finite impacts. My ethical system doesn't worry about causal impacts, except to the extent that they inform you about the total acausal impact of your actions on the moral value of the universe. All the things you do have infinite acausal impact, and these are all my system needs to consider. To use my ethical system, you don't even need a notion of causal impact at all.
