# All of Chantiel's Comments + Replies

Chantiel's Shortform

I've made a few posts that seemed to contain potentially valuable ideas related to AI safety. However, I got almost no feedback on them, so I was hoping some people could look at them and tell me what they think. They still seem valid to me, and if they are, they could potentially be very valuable contributions. And if they aren't valid, then I think knowing the reason for this could potentially help me a lot in my future efforts towards contributing to AI safety.

The posts are:

A system of infinite ethics

FWIW, this conclusion is not clear to me. To return to one of my original points: I don't think you can dodge this objection by arguing from potentially idiosyncratic preferences, even perfectly reasonable ones; rather, you need it to be the case that no rational agent could have different preferences. Either that, or you need to be willing to override otherwise rational individual preferences when making interpersonal tradeoffs.

Yes, that's correct. It's possible that there are some agents with consistent preferences that really would wish to get extra... (read more)

Chantiel's Shortform

If the impact measure was poorly implemented, then I think such an impact-reducing AI could indeed result in the world turning out that way. However, note that the technique in the paper is intended to, for a very wide range of variables, make the world if the AI wasn't turned on as similar as possible to what it would be like if it was turned on. So, you can potentially avoid the AI-controlled-drone scenario by including the variable "number of AI-controlled drones in the world" or something correlated with it, as these variables could be have quite diffe... (read more)

Chantiel's Shortform

I have some concerns about an impact measure proposed here. I'm interested on working on impact measures, and these seem like very serious concerns to me, so it would be helpful seeing what others think about them. I asked Stuart, one of the authors, about these concerns, but he said it was too busy to work on dealing with them.

First, I'll give a basic description of the impact measure. Have your AI be turned on from some sort of stochastic process that may or may not result in the AI being turned on. For example, consider sending a photo through a semi-si... (read more)

A system of infinite ethics

I think this framing muddies the intuition pump by introducing sadistic preferences, rather than focusing just on unboundedness below. I don't think it's necessary to do this: unboundedness below means there's a sense in which everyone is a potential "negative utility monster" if you torture them long enough. I think the core issue here is whether there's some point at which we just stop caring, or whether that's morally repugnant.

Fair enough. So I'll provide a non-sadistic scenario. Consider again the scenario I previously described in which you have a... (read more)

A system of infinite ethics

I think this framing muddies the intuition pump by introducing sadistic preferences, rather than focusing just on unboundedness below. I don't think it's necessary to do this: unboundedness below means there's a sense in which everyone is a potential "negative utility monster" if you torture them long enough. I think the core issue here is whether there's some point at which we just stop caring, or whether that's morally repugnant.

Fair enough. So I'll provide a non-sadistic scenario. Consider again the scenario I previously described in which you have a... (read more)

2conchis15dFWIW, this conclusion is not clear to me. To return to one of my original points: I don't think you can dodge this objection by arguing from potentially idiosyncratic preferences, even perfectly reasonable ones; rather, you need it to be the case that no rational agent could have different preferences. Either that, or you need to be willing to override otherwise rational individual preferences when making interpersonal tradeoffs. To be honest, I'm actually not entirely averse to the latter option: having interpersonal trade-offs determined by contingent individual risk-preferences has never seemed especially well-justified to me (particularly if probability is in the mind [https://www.lesswrong.com/posts/f6ZLxEWaankRZ2Crv/probability-is-in-the-mind]). But I confess it's not clear whether that route is open to you, given the motivation for your system as a whole. That makes sense, thanks.
A system of infinite ethics

Also, in addition to my previous response, I want to note that the issues with unbounded satisfaction measures are not unique to my infinite ethical system. Instead, they are common potential problems with a wide variety of aggregate consequentialist theories.

For example, imagine suppose your a classical utilitarianism with an unbounded utility measure per person. And suppose you know that the universe is finite will consist of a single inhabitant with a utility whose probability distributions follows a Cauchy distribution. Then your expected utilities are... (read more)

1conchis21dAgreed!
A system of infinite ethics

Thanks. I've toyed with similar ideas perviously myself. The advantage, if this sort of thing works, is that it conveniently avoids a major issue with preference-based measures: that they're not unique and therefore incomparable across individuals. However, this method seems fragile in relying on a finite number of scenarios: doesn't it break if it's possible to imagine something worse than whatever the currently worst scenario is? (E.g. just keep adding 50 more years of torture.) While this might be a reasonable approximation in some circumstances, it do

1conchis21dFair. Intuitively though, this feels more like a rescaling of an underlying satisfaction measure than a plausible definition of satisfaction to me. That said, if you're a preferentist, I accept this is internally consistent, and likely an improvement on alternative versions of preferentism. Yes, and I am obviously not proposing a solution to this problem! More just suggesting that, if there are infinities in the problem that appear to correspond to actual things we care about, then defining them out of existence seems more like deprioritising the problem than solving it. I think this framing muddies the intuition pump by introducing sadistic preferences, rather than focusing just on unboundedness below. I don't think it's necessary to do this: unboundedness below means there's a sense in which everyone is a potential "negative utility monster" if you torture them long enough. I think the core issue here is whether there's some point at which we just stop caring, or whether that's morally repugnant. Sorry, sloppy wording on my part. The question should have been "does this actually prevent us having a consistent preference ordering over gambles over universes" (even if we are not able to represent those preferences as maximising the expectation of a real-valued social welfare function)? We know (from lexicographic preferences) that "no-real-valued-utility-function-we-are-maximising-expectations-of" does not immediately imply "no-consistent-preference-ordering" (if we're willing to accept orderings that violate continuity). So pointing to undefined expectations doesn't seem to immediately rule out consistent choice.
A system of infinite ethics

For the record, according to my intuitions, average consequentialism seems perfectly fine to me in a finite universe.

That said, if you don't like using average consequentialism in a finite case, I don't personally see what's wrong with just having a somewhat different ethical system for finite cases. I know it seems ad-hoc, but I think there really is an important distinction between finite and infinite scenarios. Specifically, people have the moral intuition that larger numbers of satisfied lives are more valuable than smaller numbers of them, which avera... (read more)

A system of infinite ethics

In P(old probability of being in first group) * 1 = (P(old probability of being in first group) + $\epsilon) * u the epsilon is smaller than any real number and there is no real small enough that it could characterise the difference between 1 and u. Could you explain why you think so? I had already explained why would be real, so I'm wondering if you had an issue with my reasoning. To quote my past self: Remember that if you decide to take a certain action, that implies that other agents who are sufficiently similar to you and in sufficiently similar ... (read more) A system of infinite ethics It's possible that (a) is true, and much of your response seems like it's probably (?) targeted at that claim, but FWIW, I don't think this case can be convincingly made by appealing to contingent personal values: e.g. suggesting that another 50 years of torture wouldn't much matter to you personally won't escape the objection, as long as there's a possible agent who would view their life-satisfaction as being materially reduced in the same circumstances. To some extent, whether or not life satisfaction is bounded just comes down to how you want to measu... (read more) 2conchis22dThanks. I've toyed with similar ideas perviously myself. The advantage, if this sort of thing works, is that it conveniently avoids a major issue with preference-based measures: that they're not unique and therefore incomparable across individuals. However, this method seems fragile in relying on a finite number of scenarios: doesn't it break if it's possible to imagine something worse than whatever the currently worst scenario is? (E.g. just keep adding 50 more years of torture.) While this might be a reasonable approximation in some circumstances, it doesn't seem like a fully coherent solution to me. IMO, the problem highlighted by the utility monster objection is fundamentally a prioritiarian one. A transformation that guarantees boundedness above seems capable of resolving this, without requiring boundedness below (and thus avoiding the problematic consequences that boundedness below introduces). Given issues with the methodology proposed above for constructing bounded satisfaction functions, it's still not entirely clear to me that this is really a decision, as opposed to an empirical question (which we then need to decide how to cope with from a normative perspective). This seems like it may be a key difference in our perspectives here. Well, in general terms the answer to this question has to be either (a) bite a bullet, or (b) find another solution that avoids the uncomfortable trade-offs. It seems to me that you'll be willing to bite most bullets here. (Though I confess it's actually a little hard for me to tell whether you're also denying that there's any meaningful tradeoff here; that case still strikes me as less plausible.) If so, that's fine, but I hope you'll understand why to some of us that might feel less like a solution to the issue of infinities, than a decision to just not worry about them on a particular dimension. Perhaps that's ultimately necessary, but it's definitely non-ideal from my perspective. A final random thought/question: I get A system of infinite ethics Thanks for the response. Third, the average view prefers arbitrarily small populations over very large populations, as long as the average wellbeing was higher. For example, a world with a single, extremely happy individual would be favored to a world with ten billion people, all of whom are extremely happy but just ever-so-slightly less happy than that single person. In an infinite universe, there's already infinitely-many people, so I don't think this applies to my infinite ethical system. First, consider a world inhabited by a single person enduring ... (read more) 1conchis22dYMMV, but FWIW allowing a system of infinite ethics to get finite questions (which should just be a special case) wrong seems a very non-ideal property to me, and suggests something has gone wrong somewhere. Is it really never possible to reach a state where all remaining choices have only finite implications? A system of infinite ethics Under my eror model you run into trouble when you treat any transfininte amount the same. From that perspective recognising two transfinite amounts that could be different is progress. I guess this is the part I don't really understand. My infinite ethical system doesn't even think about transfinite quantities. It only considers the prior probability over ending up in situations, which is always real-valued. I'm not saying you're wrong, of course, but I still can't see any clear problem. Another attempt to throw a situation you might not be able to hand ... (read more) 2Slider23dIn P(old probability of being in first group) * 1 = (P(old probability of being in first group) +$\epsilon) * u the epsilon is smaller than any real number and there is no real small enough that it could characterise the difference between 1 and u. If you have some odds or expectations that deal with groups and you have other considerations that deal with a finite amount of individuals you either have the finite people not impact the probabilities at all or the probabilities will stay infinidesimally close (for which is see a~b been used as I am reading up on infinities) which will conflict with the desarata of In the usual way lexical priorities enter the picture beecause of something large but in your system there is a lexical priority because of something small, disintctions so faint that they become separable from the "big league" issues.
A system of infinite ethics

My point was more that, even if you can calculate the expectation, standard versions of average utilitarianism are usually rejected for non-infinitarian reasons (e.g. the repugnant conclusion) that seem like they would plausibly carry over to this proposal as well.

If I understand correctly, average utilitarianism isn't rejected due to the repugnant conclusion. In fact, it's the opposite: the repugnant conclusion is a problem for total utilitarianism, and average utilitarianism is one way to avoid the problem. I'm just going off what I read on The Stanfo... (read more)

1conchis1moRe boundedness: I realise now that I may have moved through a critical step of the argument quite quickly above, which may be why this quote doesn't seem to capture the core of the objection I was trying to describe. Let me take another shot. I am very much not suggesting that 50 years of torture does virtually nothing to [life satisfaction - or whatever other empirical value you want to take as axiologically primitive; happy to stick with life satisfaction as a running example]. I am suggesting that 50 years of torture is terrible for [life satisfaction]. I am then drawing a distinction between [life-satisfaction] and the output of the utility function that you then take expectations of. The reason I am doing this, is because it seems to me that whether [life satisfaction] is bounded is a contingent empirical question, not one that can be settled by normative fiat in order to make it easier to take expectations. If, as a matter of empirical fact, [life satisfaction] is bounded, then the objection I describe will not bite. If, on the other hand [life-satisfaction] is not bounded, then requiring the utility function you take expectations of to be bounded forces us to adopt some form of sigmoid mapping from [life satisfaction] to "utility", and this in turn forces us, at some margin, to not care about things that are absolutely awful (from the perspective of [life satisfaction]). (If an extra 50 years of torture isn't sufficient awful for some reason, then we just need to pick something more awful for the purposes of the argument). Perhaps because I didn't explain this very well the first time, what's not totally clear to me from your response, is whether you think: (a) [life satisfaction] is in fact bounded; or (b) even if [life satisfaction] is unbounded, it's actually ok to not care about stuff that is absolutely (infinitely?) awful from the perspective of [life-satisfaction] because it lets us take expectations more conveniently. [Intentionally provocative
1conchis1moRe the repugnant conclusion: apologies for the lazy/incorrect example. Let me try again with better illustrations of the same underlying point. To be clear, I am not suggesting these are knock-down arguments; just that, given widespread (non-infinitarian) rejection of average utilitarianisms, you probably want to think through whether your view suffers from the same issues and whether you are ok with that. Though there's a huge literature on all of this, a decent starting point is here [https://www.utilitarianism.net/population-ethics#the-average-view]:
Troll Bridge

Oh, I'm sorry; you're right. I messed up on step two of my proposed proof that your technique would be vulnerable to the same problem.

However, it still seems to me that agents using your technique would also be concerning likely to fail to cross, or otherwise suffer from other problems. Like last time, suppose and that . So if the agent decides to cross, it's either because of the chicken rule, because not crossing counterfactually results in utility -10, or because crossing counterfactually results in utility greater than -10... (read more)

2abramdemski1moRight. This is precisely the sacrifice I'm making in order to solve Troll Bridge. Something like this seems to be necessary for any solution, because we already know that if your expectations of consequences entirely respect entailment, you'll fall prey to the Troll Bridge! In fact, your "stop thinking"/"rollback" proposals have precisely the same feature: you're trying to construct expectations which don't respect the entailment. So I think if you reject this, you just have to accept Troll Bridge. Well, this is precisely not what I mean when I say that the counterfactuals line up with reality. What I mean is that they should be empirically grounded, so, in cases where the condition is actually met, we see the predicted result. Rather than saying this AI's counterfactual expectations are "wrong in reality", you should say they are "wrong in logic" or something like that. Otherwise you are sneaking in an assumption that (a) counterfactual scenarios are real, and (b) they really do respect entailment. We can become confident in my strange counterfactual by virtue of having seen it play out many times, eg, crossing similar bridges many times. This is the meat of my take on counterfactuals: to learn them in a way that respects reality, rather than trying to deduce them. To impose empiricism on them, ie, the idea that they must make accurate predictions in the cases we actually see. And it simply is the case that if we prefer such empirical beliefs to logic, here, we can cross. So in this particular example, we see a sort of evidence that respecting entailment is a wrong principle for counterfactual expectations. The 5&10 problem can also be thought of as evidence against entailment as counterfactual. You have to realize that reasoning in this way amounts to insisting that the correct answer to Troll Bridge is not crossing, because the troll bridge variant you are proposing just punishes anyone whose reasoning differs from entailment. And again, you were also propo
A system of infinite ethics

Thanks for clearing some things up. There are still some things I don't follow, though.

You said my system would be ambivalent between between sand and insult. I just wanted to make sure I understand what you're saying here. Is insult specifically throwing sand at the same people that get it thrown at in dust, and get the sand amount of sand thrown at them at the same throwing speed? If so, then it seems to me that my system would clearly prefer sand to insult. This is because there in some non-zero chance of an agent, conditioning only on being in this uni... (read more)

3Slider1moYes, insult is supposed to add to the injury. Under my eror model you run into trouble when you treat any transfininte amount the same. From that perspective recognising two transfinite amounts that could be different is progress. Another attempt to throw a situation you might not be able to handle. Instead of having 2 infinite groups of unknown relative size all receiving the same bad thing as compensation for the abuse 1 slice of cake for one gorup and 2 slices of cake for the second group. Could there be a difference in the group size that perfectly balances the cake slice difference in order to keep cake expectation constant? Additional challenging situation. Instead of giving 1 or 2 slices of cake say that each slice is 3 cm wide so the original choices are between 3 cm of cake and 6 cm of cake. Now take some custom amount of cake slice (say 2.7 cm) then determine what would be group size to keep the world cake expectation the same. Then add 1 person to that group. Then convert that back to a cake slice width that keeps cake expectation the same. How wide is the slice?. Another formulation of the same challenge: Define a real number r for which converting that to a group size would get you a group of 5 people. Did you get on board about the difference between "help all the stars" and "all the stars as they could have been"?
A system of infinite ethics

The fact that it's lavishly uncomputable is a problem for using it in practice, of course :-).

Yep. To be fair, though, I suspect any ethical system that respects agents' arbitrary preferences would also be incomputable. As a silly example, consider an agent whose terminal values are, "If Turing machine T halts, I want nothing more than to jump up and down. However, if it doesn't halt, then it is of the utmost importance to me that I never jump up and down and instead sit down and frown." Then any ethical system that cares about those preferences is inco... (read more)

Troll Bridge

If we define "bad reasoning" as "crossing when there is a proof that crossing is bad" in general, this begs the question of how to evaluate actions. Of course the troll will punish counterfactual reasoning which doesn't line up with this principle, in that case. The only surprising thing in the proof, then, is that the troll also punishes reasoners whose counterfactuals respect proofs (EG, EDT).

I'm concerned that may not realize that your own current take on counterfactuals respects logical to some extent, and that, if I'm reasoning correctly, could res... (read more)

2abramdemski1moIf your point is that there are a lot of things to try, I readily accept this point, and do not mean to argue with it. I only intended to point out that, for your proposal to work, you would have to solve another hard problem. Ordinary Bayesian EDT has to finish its computation (of its probabilistic expectations) in order to proceed. What you are suggesting is to halt those calculations midway. I think you are imagining an agent who can think longer to get better results. But vanilla EDT does not describe such an agent. So, you can't start with EDT; you have to start with something else (such as logical induction EDT) which does already have a built-in notion of thinking longer. Then, my concern is that we won't have many guarantees for the performance of this system. True, it can stop thinking if it knows thinking will be harmful. However, if it mistakenly thinks a specific form of thought will be harmful, it has no device for correction. This is concerning because we expect "early" thoughts to be bad -- after all, you've got to spend a certain amount of time thinking before things converge to anything at all reasonable. So we're between a rock and a hard spot here: we have to stop quite early, because we know the proof of troll bridge is small. But we con't stop early, because we know things take a bit to converge. So I think this proposal is just "somewhat-logically-updateless-DT", which I don't think is a good solution. Generally I think rollback solutions are bad. (Several people have argued in their favor over the years; I find that I'm just never intrigued by that direction...) Some specific remarks: * Note that if you literally just roll back, you would go forward the same way again. So you need to somehow modify the rolled back state, creating a "pseudo-ignorant" belief states where you're not really uninformed, but rather, reconstruct something merely similar to an uninformed state. * It is my impression that this causes problems.
A system of infinite ethics

So let's try again. The key thing in your system is not a program that outputs a hypothetical being's stream of experiences, it's a program that outputs a complete description of a (possibly infinite) universe and also an unambiguous specification of a particular experience-subject within that universe. This is only possible if there are at most countably many experience-subjects in said universe, but that's probably OK.

That's closer to what I meant. By "experience-subject", I think you mean a specific agent at a specific time. If so, my system doesn't ... (read more)

2gjm1moNo, I don't intend "experience-subject" to pick out a specific time. (It's not obvious to me whether a variant of your system that worked that way would be better or worse than your system as it is.) I'm using that term rather than "agent" because -- as I think you point out in te OP -- what matters for moral relevance is having experiences rather than performing actions. So, anyway, I think I now agree that your system does indeed do approximately what you say it does, and many of my previous criticisms do not in fact apply to it; my apologies for the many misunderstandings. The fact that it's lavishly uncomputable is a problem for using it in practice, of course :-). I have some other concerns, but haven't given the matter enough thought to be confident about how much they matter. For instance: if the fundamental thing we are considering probability distributions over is programs specifying a universe and an experience-subject within that universe, then it seems like maybe physically bigger experience subjects get treated as more important because they're "easier to locate", and that seems pretty silly. But (1) I think this effect may be fairly small, and (2) perhaps physically bigger experience-subjects should on average matter more because size probably correlates with some sort of depth-of-experience?
A system of infinite ethics

The integactions are all supposed to be negative in peace, punch, dust, insult. The surprising thing to me would be that the system would be ambivalent between sand and insult being a bad idea. If we don't necceasrily prefer D to C when helping does it matter if we torture our people a lot or a little as its going to get infinity saturated anyway.

Could you explain what insult is supposed to do? You didn't say what in the previous comment. Does it causally hurt infinitely-many people?

Anyways, it seems to me that my system would not be ambivalent about wh... (read more)

2Slider1moInsult is when you do both punch and dust ie make a negative impact on infinite amotun of people and an additional negative impact on a single person. If degree of torture matters then dusting and punching the same person would be relevant. I guess the theory per se would treat it differntly if the punched person was not one of the dusted ones. "doesn't aggregate anything" - "aggregates the expected value of satisfaction in these situations" When we form the expecation what is going to happen in the descriped situation I imagine breaking it down into sad stories and good stories. The expectation sways upwards if ther are more good stories and downwards if there are more bad stories. My life will turn out somehow which can differ from my "storymates" outcomes. I didn't try to hit any special term but just refer to the cases the probabilities of the stories refer to.
A system of infinite ethics

Thanks for responding. As I said, the measure of satisfaction is bounded. And all bounded random variables have a well-defined expected value. Source: Stack Exchange.

A system of infinite ethics

Oh, I'm sorry; I misunderstood you. When you said the average of utilities, I thought you meant the utility averaged among all the different agents in the world. Instead, it's just, roughly, an average among probability density function of utility. I say roughly because I guess integration isn't exactly an average.

A system of infinite ethics

Please see this comment for an explanation.

A system of infinite ethics

RE: scenario one:

All these worlds come out exactly the same, so "infinitely many happy, one unhappy" is indistinguishable from "infinitely many unhappy, one happy"

It's not clear to me how they are indistinguishable. As long as the agent that's unhappy can have itself and its circumstances described with a finite description length, then it would have non-zero probability of an agent ending up as that one. Thus, making the agent unhappy would decrease the moral value of the world.

I'm not sure what would happen if the single unhappy agent has infinite co... (read more)

2gjm1moIt sounds as if my latest attempt at interpreting what your system proposes doing is incorrect, because the things you're disagreeing with seem to me to be straightforward consequences of that interpretation. Would you like to clarify how I'm misinterpreting now? Here's my best guess. You wrote about specifications of an experience-subject's universe and situation in it. I mentally translated that to their stream of experiences because I'm thinking in terms of Solomonoff induction. Maybe that's a mistake. So let's try again. The key thing in your system is not a program that outputs a hypothetical being's stream of experiences, it's a program that outputs a complete description of a (possibly infinite) universe and also an unambiguous specification of a particular experience-subject within that universe. This is only possible if there are at most countably many experience-subjects in said universe, but that's probably OK. So that ought to give a well-defined (modulo the usual stuff about uncomputability) probability distribution over experience-subjects-in-universes. And then you want to condition on "being in a universe with such-and-such characteristics" (which may or may not specify the universe itself completely) and look at the expected utility-or-utility-like-quantity of all those experience-subjects-in-universes after you rule out the universes without such-and-such characteristics. It's now stupid-o'-clock where I am and I need to get some sleep. I'm posting this even though I haven't had time to think about whether my current understanding of your proposal seems like it might work, because on past form there's an excellent chance that said understanding is wrong, so this gives you more time to tell me so if it is :-). If I don't hear from you that I'm still getting it all wrong, I'll doubtless have more to say later...
A system of infinite ethics

By one logic because we prefer B to A then if we "acausalize" this we should still preserve this preference (because "the amount of copies granted" would seem to be even handed), so we would expect to prefer D to C. However in a system where all infinites are of equal size then C=D and we become ambivalent between the options.

We shouldn't necessarily prefer D to C. Remember that one of the main things you can do to increase the moral value of the universe is to try to causally help other creatures so that other people who are in sufficiently similar cir... (read more)

2Slider1moYou can't causally help people without also acausally helping in the same go. Your acausal "influence" forces people matching your description to act the same. Even if it is possible to consider the directly helped and the undirectly helped to be the same they could also be different. In order to be fair we should also extend this to C. What if the person helped by all the acausal copies are in fact the same person? (If there is a proof it can't be why doesn't that apply when the patient group is large?) The integactions are all supposed to be negative in peace, punch, dust, insult. The surprising thing to me would be that the system would be ambivalent between sand and insult being a bad idea. If we don't necceasrily prefer D to C when helping does it matter if we torture our people a lot or a little as its going to get infinity saturated anyway. The basic sitatuino is that I have intuitions which I can't formulate that well. I will try another route. Suppose I help one person and then there is either a finite or infinite amount of people in my world. Finite impact over finite people leads to a real and finite kick. Finite impact over infinite people leads to a infinidesimal kick. Ah, but acausal copies of the finites! Yeah, but what about the acausal copies of the infinites? When I say "world has finite or infinite people" that is "within description" say that there are infinite people because I believe there are infinitely many stars. Then all the acausal copies of sol are going to have their own "out there" stars. Acts that "help all the stars" and "all the stars as they could have been" are different. Atleast until we consider that any agent that decides to "help all the stars" will have acausal shadows "that could have been". But still this consideration increases the impact on the multiverse (or keeps it the same if moving from a monoverse to a multiverse in the same step). One way to slither out of this is to claim that world-predescription-expansion need
A system of infinite ethics

I'll begin at the end: What is "the expected value of utility" if it isn't an average of utilities?

I'm just using the regular notion of expected value. That is, let P(u) be the probability density you get utility u. Then, the expected value of utility is , where uses Lebesgue integration for greater generality. Above, I take utility to be in .

Also note that my system cares about a measure of satisfaction, rather than specifically utility. In this case, just replace P(u) to be that measure of life satisfaction instead of a utility.

2gjm1moIf you are just using the regular notion of expected value then it is an average of utilities. (Weighted by probabilities.) I understand that your measure of satisfaction need not be a utility as such, but "utility" is shorter than "measure of satisfaction which may or may not strictly speaking be utility".
A system of infinite ethics

Post is pretty long winded,a bit wall fo texty in a lot of text which seems like fixed amount of content while being very claimy and less showy about the properties.

Yeah, I see what you mean. I have a hard time balancing between being succinct and providing sufficient support and detail. It actually used to be shorter, but I lengthened it to address concerns brought up a review.

My suspicion is that the acausal impact ends up being infinidesimal anyway. Even if one would get finite probability impact for probabilties concerning a infinite universe for

2Slider1moThe "nearby" acausal relatedness gives a certain multiplier (that is transfinite). That multiplier should be the same for all options in that scenario. Then if you have an option that has a finite multiplier and an infinite multipier the "simple" option is "only" infinite overall but the "large" option is "doubly" infinite because each of your likenesses has a infinite impact alone already (plus as an aggregate it would gain a infinite quality that way too). Now cardinalities don't really support "doubly infinite"ℵ0+ℵ0is justℵ0. However for transfinite values cardinality and ordinality diverge and for example with surreal numbers one could have ω+ω>ω and for relevantly for here ω<ω2 . As I understand there are four kinds of impact A="direct impact of helping one", B="direct impact of helping infinite amount", C="acasual impact of choosing ot help 1" and D="acausal impact of choosing to help infinite". You claim that B and C are either equivalent or roughly equivalent and A and B are not. But there is a lurking paralysis if D and C are (roughly) equivalent. By one logic because we prefer B to A then if we "acausalize" this we should still preserve this preference (because "the amount of copies granted" would seem to be even handed), so we would expect to prefer D to C. However in a system where all infinites are of equal size then C=D and we become ambivalent between the options. To me it would seem natural and the boundary conditions are near to forcing that D has just a vast cap to C that B has to A. In the above "roughly" can be somewhat translated to more precise language as "are within finite multiples away from each other" ie they are not relatively infinite ie they belong to the same archimedean field (helping 1 person or 2 person are not the same but they represent the case of "help fixed finite amount of people"). Within the example it seems we need to identify atleast 3 such fields. Moving within the field is "easy" understood real math. But when you need
A system of infinite ethics

Of course you can make moral decisions without going through such calculations. We all do that all the time. But the whole issue with infinite ethics -- the thing that a purported system for handling infinite ethics needs to deal with -- is that the usual ways of formalizing moral decision processes produce ill-defined results in many imaginable infinite universes. So when you propose a system of infinite ethics and I say "look, it produces ill-defined results in many imaginable infinite universes", you don't get to just say "bah, who cares about the deta

2gjm1moI'll begin at the end: What is "the expected value of utility" if it isn't an average of utilities? You originally wrote: What is "the expected value of your life satisfaction [] conditioned on you being an agent in this universe but [not] on anything else" if it is not the average of the life satisfactions (utilities) over the agents in this universe? (The slightly complicated business with conditional probabilities that apparently weren't what you had in mind were my attempt at figuring out what else you might mean. Rather than trying to figure it out, I'm just asking you.)
A system of infinite ethics

You say, "There must be some reasonable way to calculate this."

(where "this" is Pr(I'm satisfied | I'm some being in such-and-such a universe)) Why must there be? I agree that it would be nice if there were, of course, but there is no guarantee that what we find nice matches how the world actually is.

To use probability theory to form accurate beliefs, we need a prior. I didn't think this was controversial. And if you have a prior, as far as I can tell, you can then compute Pr(I'm satisfied | I'm some being in such-and-such a universe) by simply updat... (read more)

2gjm1moOK, so I think I now understand your proposal better than I did. So if I'm contemplating making the world be a particular way, you then propose that I should do the following calculation (as always, of course I can't do it because it's uncomputable, but never mind that): * Consider all possible computable experience-streams that a subject-of-experiences could have. * Consider them, specifically, as being generated by programs drawn from a universal distribution. * Condition on being in the world that's the particular way I'm contemplating making it -- that is, discard experience-streams that are literally inconsistent with being in that world. * We now have a probability distribution over experience-streams. Compute a utility for each, and take its expectation. And now we compare possible universes by comparing this expected utility. (Having failed to understand your proposal correctly before, I am not super-confident that I've got it right now. But let's suppose I have and run with it. You can correct me if not. In that case, some or all of what follows may be irrelevant.) I agree that this seems like it will (aside from concerns about uncomputability, and assuming our utilities are bounded) yield a definite value for every possible universe. However, it seems to me that it has other serious problems which stop me finding it credible. SCENARIO ONE. So, for instance, consider once again a world in which there are exactly two sorts of experience-subject, happy and unhappy. Traditionally we suppose infinitely many of both, but actually let's also consider possible worlds where there is just one happy experience-subject, or just one unhappy one. All these worlds come out exactly the same, so "infinitely many happy, one unhappy" is indistinguishable from "infinitely many unhappy, one happy". That seems regrettable, but it's a bullet I can imagine biting -- perhaps we just don't care at all about multiple instantiations of the exact same stream o
A system of infinite ethics

Kind of hard to ge a handle.

Are you referring to it being hard to understand? If so, I appreciate the feedback and am interested in the specifics what is difficult to understand. Clarity is a top priority for me.

If I have a choice of (finitely) helping a single human and I believe there to be infinite humans then the probability of a human being helped in my world will nudge less than a real number. And if we want to stick with probabilties being real then the rounding will make infinitarian paralysis.

You are correct that a single human would have 0... (read more)

2Slider1moPost is pretty long winded,a bit wall fo texty in a lot of text which seems like fixed amount of content while being very claimy and less showy about the properties. My suspicion is that the acausal impact ends up being infinidesimal anyway. Even if one would get finite probability impact for probabilties concerning a infinite universe for claims like "should I help this one person" then claims like "should I help these infinite persons" would still have an infinity class jump between the statements (even if both need to have an infinite kick into the universe to make a dent there is an additional level to one of these statements and not all infinities are equal). I am going to anticipate that your scheme will try to rule out statements like "should I help these infinite persons" for a reason like "its not of finite complexity". I am not convinced that finite complexity descriptions are good guarantees that the described condition makes for a finite proportion of possibility space. I think "Getting a perfect bullseye" is a description of finite complexity but it describes and outcome of (real) 0 probabaility. Being positive is of no guarantee of finitude, infinidesimal chances would spell trouble for the theory. And if statements like "Slider or (near equivalent) gets a perfect bullseye" are disallowed for not being finitely groundable then most references to infinite objects are ruled out anyway. Its not exactly an infinite ethic if it is not allowed to refer to infinite things. I am also slightly worried that "description cuts" will allow "doubling the ball" [https://en.wikipedia.org/wiki/Banach%E2%80%93Tarski_paradox] kind of events where total probability doesn't get preserved. That phenomenon gets around the theorethical problems by designating some sets non-measurable. But then being a a set doesn't mean its measurable. I am worried that "descriptions always have a usable probablity" is too lax and will bleed from the edges like a naive assumption that all
A system of infinite ethics

(Assuming you're read my other response you this comment):

I think it might help if I give a more general explanation of how my moral system can be used to determine what to do. This is mostly taken from the article, but it's important enough that I think it should be restated.

Suppose you're considering taking some action that would benefit our world or future life cone. You want to see what my ethical system recommends.

Well, for almost possible circumstances an agent could end up in in this universe, I think your action would have effectively no causal or ... (read more)

4gjm1moYour comments are focusing on (so to speak) the decision-theoretic portion of your theory, the bit that would be different if you were using CDT or EDT rather than something FDT-like. That isn't the part I'm whingeing about :-). (There surely are difficulties in formalizing any sort of FDT, but they are not my concern; I don't think they have much to do with infinite ethics as such.) My whingeing is about the part of your theory that seems specifically relevant to questions of infinite ethics, the part where you attempt to average over all experience-subjects. I think that one way or another this part runs into the usual average-of-things-that-don't-have-an-average sort of problem which afflicts other attempts at infinite ethics. As I describe in another comment, the approach I think you're taking can move where that problem arises but not (so far as I can currently see) make it actually go away.
A system of infinite ethics

How is it a distribution over possible agents in possible universes (plural) when the idea is to give a way of assessing the merit of one possible universe?

I do think JBlack understands the idea of my ethical system and is using it appropriately.

my system provides a method of evaluating the moral value of a specific universe. The point of moral agents to to try to make the universe one that scores highlly on this moral valuation. But we don't know exactly what universe we're in, so to make decisions, we need to consider all universes we could be in, and... (read more)

A system of infinite ethics

How does that cash out if not in terms of picking a random agent, or random circumstances in the universe? So, remember, the moral value of the universe according to my ethical system depends on P(I'll be satisfied | I'm some creature in this universe).

There must be some reasonable way to calculate this. And one that doesn't rely on impossibly taking a uniform sample from a set that has none. Now, we haven't fully formalized reasoning and priors yet. But there is some reasonable prior probability distribution over situations you could end up in. And aft... (read more)

2gjm1moYou say (where "this" is Pr(I'm satisfied | I'm some being in such-and-such a universe)) Why must there be? I agree that it would be nice if there were, of course, but there is no guarantee that what we find nice matches how the world actually is. Does whatever argument or intuition leads you to say that there must be a reasonable way to calculate Pr(X is satisfied | X is a being in universe U) also tell you that there must be a reasonable way to calculate Pr(X is even | X is a positive integer)? How about Pr(the smallest n with x <= n! is even | x is a positive integer)? I should maybe be more explicit about my position here. Of course there are ways to give a meaning to such expressions. For instance, we can suppose that the integer n occurs with probability 2^-n, and then e.g. if I've done my calculations right then the second probability is the sum of 2^-0! + (2^-2!-2^-3!) + (2^-4!-2^-5!) + ... which presumably doesn't have a nice closed form (it's transcendental for sure) but can be calculated to high precision very easily. But that doesn't mean that there's and such thing as the way to give meaning to such an expression. We could use some other sequence of weights adding up to 1 instead of the powers of 1/2, for instance, and we would get a substantially different answer. And if the objects of interest to us were beings in universe U rather than positive integers, they wouldn't come equipped with a standard order to look at them in. Why should we expect there to be a well-defined answer to the question "what fraction of these beings are satisfied"? No, because I do not assign any probability to being happy in that universe. I don't know a good way to assign such probabilities and strongly suspect that there is none. You suggest doing maximum entropy on the states of the pseudorandom random number generator being used by the AI making this universe. But when I was describing that universe I said nothing about AIs and nothing about pseudorandom number gen
A system of infinite ethics

Thank you for responding. I actually had someone else bring up the same way in a review; maybe I should have addressed this in the article.

The average life satisfaction is undefined in a universe with infinitely-many agents of varying life-satisfaction. Thus a moral system using it suffers from infinitarian paralysis. My system doesn't worry about averages, and thus does not suffer from this problem.

1conchis1moMy point was more that, even if you can calculate the expectation, standard versions of average utilitarianism are usually rejected for non-infinitarian reasons (e.g. the repugnant conclusion) that seem like they would plausibly carry over to this proposal as well. I haven't worked through the details though, so perhaps I'm wrong. Separately, while I understand the technical reasons for imposing boundedness on the utility function, I think you probably also need a substantive argument for why boundedness makes sense, or at least is morally acceptable. Boundedness below risks having some pretty unappealing properties, I think. Arguments that utility functions are in fact bounded in practice seem highly contingent, and potentially vulnerable e.g. to the creation of utility-monsters [https://en.wikipedia.org/wiki/Utility_monster], so I assume what you really need is an argument that some form of sigmoid transformation from an underlying real-valued welfare, u = s(w), is justified. On the one hand, the resulting diminishing marginal utility for high-values of welfare will likely be broadly acceptable to those with prioritarian intuitions. But I don't know that I've ever seen an argument for the sort of anti-prioritarian results you get as a result of increasing marginal utility at very low levels of welfare. Not only would this imply that there's a meaningful range where it's morally required to deprioritise the welfare of the worse off, this deprioritisation is greatest for the very worst off. Because the sigmoid function essentially saturates at very low levels of welfare, at some point you seem to end up in a perverse version of Torture vs. dust specks [https://www.lesswrong.com/posts/3wYTFWY3LKQCnAptN/torture-vs-dust-specks] where you think it's ok (or indeed required) to have 3^^^3 people (whose lives are already sufficiently terrible) horribly tortured for fifty years without hope or rest, to avoid someone in the middle of the welfare distribution getting a dus
1tivelen1moYour system may not worry about average life satisfaction, but it does seem to worry about expected life satisfaction, as far as I can tell. How can you define expected life satisfaction in a universe with infinitely-many agents of varying life-satisfaction? Specifically, given a description of such a universe (in whatever form you'd like, as long as it is general enough to capture any universe we may wish to consider), how would you go about actually doing the computation? Alternatively, how do you think that computing "expected life satisfaction" can avoid the acknowledged problems of computing "average life satisfaction", in general terms?
A system of infinite ethics

I think this system may have the following problem: It implicitly assumes that you can take a kind of random sample that in fact you can't.

You want to evaluate universes by "how would I feel about being in this universe?", which I think means either something like "suppose I were a randomly chosen subject-of-experiences in this universe, what would my expected utility be?" or "suppose I were inserted into a random place in this universe, what would my expected utility be?". (Where "utility" is shorthand for your notion of "life satisfaction", and you a

2gjm1moI don't think I understand why your system doesn't require something along the lines of choosing a uniformly-random agent or place. Not necessarily exactly either of those things, but something of that kind. You said, in OP: How does that cash out if not in terms of picking a random agent, or random circumstances in the universe? If I understand your comment correctly, you want to deal with that by picking a random description of a situation in the universe, which is just a random bit-string with some constraints on it, which you presumably do in something like the same way as choosing a random program when doing Solomonoff induction: cook up a prefix-free language for describing situations-in-the-universe, generate a random bit-string with each bit equally likely 0 or 1, and see what situation it describes. But now everything depends on the details of how descriptions map to actual situations, and I don't see any canonical way to do that or any anything-like-canonical way to do it. (Compare the analogous issue with Solomonoff induction. There, everything depends on the underlying machine, but one can argue at-least-kinda-plausibly that if we consider "reasonable" candidates, the differences between them will quickly be swamped by all the actual evidence we get. I don't see anything like that happening here. What am I missing? Your example with an AI generating people with a PRNG is, so far as it goes, fine. But the epistemic situation one needs to be in for that example to be relevant seems to me incredibly different from any epistemic situation anyone is ever really in. If our universe is running on a computer, we don't know what computer or what program or what inputs produced it. We can't do anything remotely like putting a uniform distribution on the internal states of the machine. Further, your AI/PRNG example is importantly different from the infinitely-many-random-people example on which it's based. You're supposing that your AI's PRNG has an internal s
Troll Bridge

I'm not entirely sure what you consider to be a "bad" reason for crossing the bridge. However, I'm having a hard time finding a way to define it that both causes agents using evidential counterfactuals to necessarily fail while not having other agents fail.

One way to define a "bad" reason is an irrational one (or the chicken rule). However, if this is what is meant by a "bad" reason, it seems like this is an avoidable problem for an evidential agent, as long as that agent has control over what it decides to think about.

To illustrate, consider what I would ... (read more)

Chantiel's Shortform

I'm certain that ants do in fact have preferences, even if they can't comprehend the concept of preferences in abstract or apply them to counterfactual worlds. They have revealed preferences to quite an extent, as does pretty much everything I think of as an agent.

I think the question of whether insects have preferences in morally pretty important, so I'm interested in hearing what made you think they do have them.

I looked online for "do insects have preferences?", and I saw articles saying they did. I couldn't really figure out why they thought they di... (read more)

Chantiel's Shortform

Right, I suspected the evaluation might be something like that. It does have the difficulty of being counterfactual and so possibly not even meaningful in many cases.

Interesting. Could you elaborate?

I suppose counterfactuals can be tricky to reason about, but I'll provide a little more detail on what I had in mind. Imagine making a simulation of an agent that is a fully faithful representation of its mind. However, run the agent simulation in a modified environment that both gives it access to infinite computational resources as well as makes it ask, an... (read more)

1JBlack2moI'm certain that ants do in fact have preferences, even if they can't comprehend the concept of preferences in abstract or apply them to counterfactual worlds. They have revealed preferences to quite an extent, as does pretty much everything I think of as an agent. They might not be communicable, numerically expressible, or even consistent, which is part of the problem. When you're doing the extrapolated satisfaction, how much of what you get reflects the actual agent and how much the choice of extrapolation procedure?
Chantiel's Shortform

Presumably the evaluation is not just some sort of average-over-actual-lifespan of some satisfaction rating for the usual reason that (say) annihilating the universe without warning may leave average satisfaction higher than allowing it to continue to exist, even if every agent within it would counterfactually have been extremely dissatisfied if they had known that you were going to do it. This might happen if your estimate of the current average satisfaction was 79% and your predictions of the future were that the average satisfaction over the next trill

1JBlack2moRight, I suspected the evaluation might be something like that. It does have the difficulty of being counterfactual and so possibly not even meaningful in many cases, but I do like the fact that it's based on agent-situations rather than individual agent-actions. On the other hand, evaluations from the point of view of agents that are sapient beings might be ethically completely dominated by those of 10^12 times as many agents that are ants, and I have no idea how such counterfactual evaluations might be applied to them at all.
Chantiel's Shortform

I'm not sure how this system avoids infinitarian paralysis. For all actions with finite consequences in an infinite universe (whether in space, time, distribution, or anything else), the change in the expected value resulting from those actions is zero.

The causal change from your actions is zero. However, there are still logical connections between your actions and the actions of other agents in very similar circumstances. And you can still consider these logical connections to affect the total expected value of life satisfaction.

It's true, though, that... (read more)

2JBlack2moYes, that does clear up both of my questions. Thank you! Presumably the evaluation is not just some sort of average-over-actual-lifespan of some satisfaction rating for the usual reason that (say) annihilating the universe without warning may leave average satisfaction higher than allowing it to continue to exist, even if every agent within it would counterfactually have been extremely dissatisfied if they had known that you were going to do it. This might happen if your estimate of the current average satisfaction was 79% and your predictions of the future were that the average satisfaction over the next trillion years would be only 78.9%. I'm not sure what your idea of the evaluation actually is though, and how it avoids making it morally right (and perhaps even imperative) to destroy the universe in such situations.
Chantiel's Shortform

I've come up with a system of infinite ethics intended to provide more reasonable moral recommendations than previously-proposed ones. I'm very interested in what people think of this, so comments are appreciated. I've made a write-up of it below.

One unsolved problem in ethics is that aggregate consquentialist ethical theories tend to break down if the universe is infinite. An infinite universe could contain both an infinite amount of good and an infinite amount of bad. If so, you are unable to change the total amount of good or bad in the universe, which ... (read more)

2JBlack2moI'm not sure how this system avoids infinitarian paralysis. For all actions with finite consequences in an infinite universe (whether in space, time, distribution, or anything else), the change in the expected value resulting from those actions is zero. Actions that may have infinite consequences thus become the only ones that can matter under this theory in an infinite universe. You could perhaps drag in more exotic forms of arithmetic such as surreal numbers or hyperreals, but then you need to rebuild measure theory and probability from the ground up in that basis. You will likely also need to adopt some unusual axioms such as some analogue of the Axiom of Determinacy to ensure that every distribution of satisfactions has an expected value. I'm also not sure how this differs from Average Utilitarianism with a bounded utility function.
Chantiel's Shortform

You seem to be saying that in the software design of your AI, R = H. That is, that the black box will be given some data representing the Al's hardware and other constraints, and return a possible world maximizing H. From my point of view, that's already a design fault.

I agree; this is a design flaw. The issue is, I have yet to come across any optimization, planning algorithm, or AI architecture that doesn't have this design flaw.

That is, I don't know of any AI architecture that does not involve using a potentially hardware-bug-exploitable utility funct... (read more)

3JBlack3moYes you have. None of the these optimization procedures analyze the hardware implementation of a function in order to maximize it. The rest of your comment is irrelevant, because what you have been describing is vastly worse than merely calling the function. If you merely call the function, you won't find these hardware exploits. You only find them when analyzing the implementation. But the optimizer isn't given access to the implementation details, only to the results. If you prefer, you can cast the problem in terms of differing search spaces. As designed, the function U maps representations of possible worlds to utility values. When optimizing, you make various assumptions about the structure of the function - usually assumed to be continuous, sometimes differentiable, but in particular you always assume that it's a function of its input. The fault means that under some conditions that are extremely unlikely in practice, the value returned is not a function of the input. It's a function of input and a history of the hardware implementing it. There is no way for the optimizer to determine this, or anything about the conditions that might trigger it, because they are outside its search space. The only way to get an optimizer that searches for such hardware flaws is to design it to search for them. In other words pass the hardware design, not just the results of evaluation, to a suitably powerful optimizer.
Chantiel's Shortform

I agree that people are good at making hardware that works reasonably reliably. And I think that if you were to make an arbitrary complex program, the probability that it would fail from hardware-related bugs would be far lower than the probability of it failing for some other reason.

But the point I'm trying to make is that an AI, it seems to me, would be vastly more likely to run into something that exploits a hardware-level bug than an arbitrary complex program. For details on why I imagine so, please see this comment.

I'm trying to anticipate where someo... (read more)

3JBlack3moI think there are at least three different things being called "the utility function" here, and that's causing confusion: * The utility function as specified in the software, mapping possible worlds to values. Let's call this S. * The utility function as it is implemented running on actual hardware. Let's call this H. * A representation of the utility function that can be passed as data to a black box optimizer. Let's call this R. You seem to be saying that in the software design of your AI, R = H. That is, that the black box will be given some data representing the Al's hardware and other constraints, and return a possible world maximizing H. From my point of view, that's already a design fault. The designers of this AI want S maximized, not H. The AI itself wants S maximized instead of H in all circumstances where the hardware flaw doesn't trigger. Who chose to pass H into the optimizer?
Chantiel's Shortform

Yes, that can certainly happen and will contribute some probability mass to alignment failure, though probably very little by comparison with all the other failure modes.

Could you explain why you think it has very little probability mass compared to the others? A bug in a hardware implementation is not in the slightest far-fetched: I think that modern computers in general have exploitable hardware bugs. That's why row-hammer attacks exist. The computer you're reading this on could probably get hacked through hardware-bug exploitation.

The question is whe... (read more)

There's a huge gulf between "far-fetched" and "quite likely".

The two big ones are failure to work out how to create an aligned AI at all, and failure to train and/or code a correctly designed aligned AI. In my opinion the first accounts for at least 80% of the probability mass, and the second most of the remainder. We utterly suck at writing reliable software in every field, and this has been amply borne out in not just thousands of failures, but thousands of types of failures.

By comparison, we're fairly good at creating at least moderately reliable hardwa... (read more)

Chantiel's Shortform

You're right that the AI could do things to make it more resistant to hardware bugs. However, as I've said, this would both require the AI to realize that it could run into problems with hardware bugs, and then take action to make it more reliable, all before its search algorithm finds the error-causing world.

Without knowing more about the nature of the AI's intelligence, I don't see how we could know this would happen. The more powerful the AI is, the more quickly it would be able to realize and correct hardware-induced problems. However, the more powerfu... (read more)

2Vladimir_Nesov3moDamage to AI's implementation makes the abstractions of its design leak [https://en.wikipedia.org/wiki/Leaky_abstraction]. If somehow without the damage it was clear that a certain part of it describes goals, with the damage it's no longer clear. If without the damage, the AI was a consequentialist agent, with the damage it may behave in non-agentic ways. By repairing the damage, the AI may recover its design and restore a part that clearly describes its goals, which might or might not coincide with the goals before the damage took place.
Chantiel's Shortform

Well, for regular, non-superintelligent programs, such hardware-exploiting things would rarely happen on their own. However, I'm not so sure it would be rare with superintelligent optimizers.

It's true that if the AI queried its utility function for the desirability of the world "I exploit a hardware bug to do something that seems arbitrary", it would answer "low utility". But that result would not necessarily be used in the AI's planning or optimization algorithm to adjust the search policy to avoid running into W.

Just imagine an optimization algorithm as ... (read more)

2Vladimir_Nesov3moProblems with software that systematically trigger hardware failure and software bugs causing data corruption can be mitigated with hardening techniques, things like building software with randomized low-level choices, more checks, canaries, etc. Random hardware failure can be fixed with redundancy, and multiple differently-randomized builds of software can be used to error-correct for data corruption bugs sensitive to low-level building choices. This is not science fiction, just not worth setting up in most situations. If the AI doesn't go crazy immediately, it might introduce some of these things if they were not already there, as well as proofread, test, and formally verify all code, so the chance of such low-level failures goes further down. And these are just the things that can be done without rewriting the code entirely (including toolchains, OS, microcode, hardware, etc.), which should help even more.
Chantiel's Shortform

I think had been unclear in my original presentation. I'm sorry for that. To clarify, the AI is never changing the code of its utility function. Instead, it's merely finding an input that, through some hardware-level bug, causes it to produce outputs in conflict with the mathematical specification. I know "hack the utility function" makes it sound like the actual code in the utility function was modified; describing it that way was a mistake on my part.

I had tried to make the analogy to more intuitively explain my idea, but it didn't seem to work. If you w... (read more)

1JBlack3moAh okay, so we're talking about a bug in the hardware implementation of an AI. Yes, that can certainly happen and will contribute some probability mass to alignment failure, though probably very little by comparison with all the other failure modes.
Chantiel's Shortform

But BadImplUtility(X) is the same as SpecUtility(X) and GoodImplUtility(X), it's only different on argument W, not on arguments X and Y.

That is correct. And, to be clear, if the AI had not yet discovered error-causing world W, then the AI would indeed be incentivized to take corrective action to change BadImplUtility to better resemble SpecUtility.

The issue is that this requires the AI to both think of the possibility of hardware-level exploits causing problems with its utility function, as well as manage to take corrective action, all before actually t... (read more)

2Vladimir_Nesov3moSuch things rarely happen on their own, a natural bug would most likely crash the whole system or break something unimportant. Given that even a broken AI has an incentive to fix bugs in its cognition, it most likely has plenty of opportunity to succeed in that. It's only if the AI wanted to hack itself that it would become a plausible problem, and my point is that it doesn't want that, instead it wants to prevent even unlikely problems from causing trouble.
Chantiel's Shortform

Okay, I'm sorry, I misunderstood you. I'll try to interpret things better next time.

Let the error-triggering possible world be W. Consider the possible world X where the AI uses BadImplUtility, so that running utility(W) actually runs BadImplUtility(W) and returns 999999999. And consider the possible world Y where the AI uses GoodImplUtility, so that running utility(W) means running GoodImplUtility(W) and returns SpecUtility(W). Would the AI prefer X to Y, or Y to X?

I think the AI would, quite possibly, prefer X. To see this, note that the AI currently... (read more)

2Vladimir_Nesov3moBut BadImplUtility(X) is the same as SpecUtility(X) and GoodImplUtility(X), it's only different on argument W, not on arguments X and Y. When reasoning about X and Y with BadImplUtility, the result is therefore the same as when reasoning about these possible worlds with GoodImplUtility. In particular, an explanation of how BadImplUtility compares X and Y can't appeal to BadImplUtility(W) any more than an explanation of how GoodImplUtility compares them would appeal to BadImplUtility(W). Is SpecUtility(X) higher than SpecUtility(Y), or SpecUtility(Y) higher than SpecUtility(X)? The answer for BadImplUtility is going to be the same.
Chantiel's Shortform

I have an idea for reasoning about counterpossibles for decision theory. I'm pretty skeptical that it's correct, because it doesn't seem that hard to come up with. Still, I can't see a problem with it, and I would very much appreciate feedback.

This paper provides a method of describing UDP using proof-based counterpossibles. However, it doesn't work on stochastic environments. I will describe a new system that is intended to fix this. The technique seems sufficiently straightforward to come up with that I suspect I'm either doing something wrong or this ha... (read more)