## LessWrong

kh

kaarelh AT gmail DOT com

# Wiki Contributions

I took the main point of the post to be that there are fairly general conditions (on the utility function and on the bets you are offered) in which you should place each bet like your utility is linear, and fairly general conditions in which you should place each bet like your utility is logarithmic. In particular, the conditions are much weaker than your utility actually being linear, or than your utility actually being logarithmic, respectively, and I think this is a cool point. I don't see the post as saying anything beyond what's implied by this about Kelly betting vs max-linear-EV betting in general.
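As a rough illustration of the two betting rules being contrasted, here is a minimal sketch of my own (not taken from the post). It assumes a single binary bet with win probability `p` and net odds `b`; all function names are hypothetical.

```python
import math

def expected_log_growth(f, p, b):
    """Expected log of the wealth multiplier when staking fraction f at net odds b."""
    return p * math.log(1 + f * b) + (1 - p) * math.log(1 - f)

def kelly_fraction(p, b):
    """Stake that maximizes expected log wealth (0 if the bet has no edge)."""
    return max(0.0, p - (1 - p) / b)

def linear_ev_fraction(p, b):
    """Stake that maximizes expected wealth: all-in whenever the bet has positive EV."""
    return 1.0 if p * b - (1 - p) > 0 else 0.0

# Assumed example bet: 60% chance to win at even odds.
p, b = 0.6, 1.0
f_kelly = kelly_fraction(p, b)        # log-utility bettor stakes a fraction of wealth
f_linear = linear_ev_fraction(p, b)   # linear-utility bettor stakes everything
```

For this assumed bet the log-utility bettor stakes about a fifth of their wealth while the linear-utility bettor goes all-in, which is the gap between "bet like your utility is logarithmic" and "bet like your utility is linear".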

(By the way, I'm pretty sure the position I outline is compatible with changing usual forecasting procedures in the presence of observer selection effects, in cases where secondary evidence which does not kill us is available. E.g. one can probably still justify [looking at the base rate of near misses to understand the probability of nuclear war instead of relying solely on the observed rate of nuclear war itself].)

I'm inside-view fairly confident that Bob should be putting a probability of 0.01% on surviving conditional on many worlds being true, but it seems possible I'm missing some crucial considerations having to do with observer selection stuff in general, so I'll phrase the rest of this as more of a question.

What's wrong with saying that Bob should put a probability of 0.01% on surviving conditional on many-worlds being true – doesn't this just follow from the usual way that a many-worlder would put probabilities on things, or at least the simplest way of doing so (i.e. not post-normalizing only across the worlds in which you survive)? I'm pretty sure that the usual picture of Bayesianism as having a big (weighted) set of possible worlds in your head and, upon encountering evidence, discarding the ones which you found out you were not in, also motivates putting a probability of 0.01% on surviving conditional on many-worlds. (I'm assuming that for a many-worlder, weights on worlds are given by squared amplitudes or whatever.)

This contradicts a version of the conservation of expected evidence in which you only average over outcomes in which you survive (even in cases where you don't survive in all outcomes), but that version seems wrong anyway, with Leslie's firing squad seeming like an obvious counterexample to me (see https://plato.stanford.edu/entries/fine-tuning/#AnthObje).
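To spell out the update I have in mind, here is a sketch in symbols. The prior $\pi$ on many-worlds (MW) and the survival probability $s$ under the alternative hypothesis are placeholders I am introducing for illustration; only the 0.01% figure comes from the discussion above. $S$ denotes the event that Bob survives.

```latex
% Ordinary Bayesian update on surviving, with P(S | MW) = 10^{-4}:
P(\mathrm{MW} \mid S)
  = \frac{\pi \cdot 10^{-4}}{\pi \cdot 10^{-4} + (1 - \pi)\, s}

% Conservation of expected evidence, averaging over ALL outcomes,
% including the ones in which Bob does not survive:
\pi = P(S)\, P(\mathrm{MW} \mid S) + P(\neg S)\, P(\mathrm{MW} \mid \neg S)
```

In particular, if $s$ also equals $10^{-4}$, the first line gives $P(\mathrm{MW} \mid S) = \pi$, so surviving would not move Bob's credence in many-worlds at all.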

A big chunk of my uncertainty about whether at least 95% of the future’s potential value is realized comes from uncertainty about "the order of magnitude at which utility is bounded". That is, if unbounded total utilitarianism is roughly true, I think there is a <1% chance in any of these scenarios that >95% of the future's potential value would be realized. If decreasing marginal returns in the [amount of hedonium -> utility] conversion kick in fast enough for 10^20 slightly conscious humans on heroin for a million years to yield 95% of max utility, then I'd probably give >10% to strong utopia even conditional on building the default superintelligent AI. Both options seem significantly probable to me, causing my odds to vary much less between the scenarios.

This is assuming that "the future’s potential value" is referring to something like the (expected) utility that would be attained by the action sequence recommended by an oracle giving humanity optimal advice according to our CEV. If that's a misinterpretation or a bad framing more generally, I'd enjoy thinking again about the better question. I would guess that my disagreement with the probabilities is greatly reduced on the level of the underlying empirical outcome distribution.

Great post, thanks for writing this! In the version of "Alignment might be easier than we expect" in my head, I also have the following:

• Value might not be that fragile. We might "get sufficiently many bits in the value specification right" sort of by default to have an imperfect but still really valuable future.
• For instance, maybe IRL would just learn something close enough to pCEV-utility from human behavior, and then training an agent with that as the reward would make it close enough to a human-value-maximizer. We'd get some misalignment on both steps (e.g. because there are systematic ways in which the human is wrong in the training data, and because of inner misalignment), but maybe this is little enough to be fine, despite fragility of value and despite Goodhart.
• Even if deceptive alignment were the default, it might be that the AI gets sufficiently close to correct values before "becoming intelligent enough" to start deceiving us in training, such that even if it is thereafter only deceptively aligned, it will still execute a future that's fine when in deployment.
• It doesn't seem completely wild that we could get an agent to robustly understand the concept of a paperclip by default. Is it completely wild that we could get an agent to robustly understand the concept of goodness by default?
• Is it so wild that we could by default end up with an AGI that at least does something like putting 10^30 rats on heroin? I have some significant probability on this being a fine outcome.
• There's some distance $d$ from the correct value specification such that stuff is fine if we get AGI with values closer than $d$. Do we have good reasons to think that $d$ is far out of the range that default approaches would give us?

(But here are some reasons not to expect this.)

I still disagree / am confused. If it's indeed the case that , then why would we expect ? (Also, in the second-to-last sentence of your comment, it looks like you say the former is an equality.) Furthermore, if the latter equality is true, wouldn't it imply that the utility we get from [chocolate ice cream and vanilla ice cream] is the sum of the utility from chocolate ice cream and the utility from vanilla ice cream? Isn't  supposed to be equal to the utility of ?

My current best attempt to understand/steelman this is to accept , to reject , and to try to think of the embedding as something slightly strange. I don't see a reason to think utility would be linear in current semantic embeddings of natural language or of a programming language, nor do I see an appealing other approach to construct such an embedding. Maybe we could figure out a correct embedding if we had access to lots of data about the agent's preferences (possibly in addition to some semantic/physical data), but it feels like that might defeat the idea of this embedding in the context of this post as constituting a step that does not yet depend on preference data. Or alternatively, if we are fine with using preference data on this step, maybe we could find a cool embedding, but in that case, it seems very likely that it would also just give us a one-step solution to the entire problem of computing a set of rational preferences for the agent.

A separate attempt to steelman this would be to assume that we have access to a semantic embedding pretrained on preference data from a bunch of other agents, and then to tune the utilities of the basis to best fit the preferences of the agent we are currently dealing with. That seems like a cool idea, although I'm not sure if it has strayed too far from the spirit of the original problem.

The link in this sentence is broken for me: "Second, it was proven recently that utilitarianism is the “correct” moral philosophy." Unless this is intentional, I'm curious to know where it was meant to point.

I don't know of a category-theoretic treatment of Heidegger, but here's one of Hegel: https://ncatlab.org/nlab/show/Science+of+Logic. I think it's mostly due to Urs Schreiber, but I'm not sure – in any case, we can be certain it was written by an Absolute madlad :)

> Why should I care about similarities to pCEV when valuing people?

It seems to me that this matters in case your metaethical view is that one should do pCEV, or more generally if you think matching pCEV is evidence of moral correctness. If you don't hold such metaethical views, then I might agree that (at least in the instrumentally rational sense, at least conditional on not holding any metametalevel views that contradict these) you shouldn't care.

> Why is the first example explaining why someone could support taking money from people you value less to give to other people, while not supporting doing so with your own money? It's obviously true under utilitarianism

I'm not sure if it answers the question, but I think it's a cool consideration. I think most people are close to acting weighted-utilitarianly, but few realize how strong the difference between public and private charity is according to weighted-utilitarianism.

> It's weird to bring up having kids vs. abortion and then not take a position on the latter. (Of course, people will be pissed at you for taking a position too.)

My position is "subsidize having children, that's all the regulation around abortion that's needed". So in particular, abortion should be legal at any time. (I intended what I wrote in the post to communicate this, but maybe I didn't do a good job.)

> democracy plans for right now

I'm not sure I understand in what sense you mean this? Voters are voting according to preferences that partially involve caring about future selves. If what you have in mind is something like people being less attentive about costs policies cause 10 years into the future and this leads to discounting these more than the discount from caring alone, then I guess I could see that being possible. But that could also happen for people's individual decisions, I think? I guess one might argue that people are more aware about long-term costs of personal decisions than of policies, but this is not clear to me, especially with more analysis going into policy decisions.

> As to your framing, the difference between you-now and you-future is mathematically bigger than the difference between others-now and others-future if you use a ratio for the number of links to get to them.
> Suppose people change half as much in a year as your sibling is different from you, and you care about similarity for what value you place on someone. Thus, two years equals one link.
> After 4 years, you are now two links away from yourself-now and your sibling is 3 from you now. They are 50% more different than future you (assuming no convergence). After eight years, you are 4 links away, while they are only 5, which makes them 25% more different to you than you are.
> Alternately, they have changed by 67% more, and you have changed by 100% of how distant they were from you at 4 years.
> It thus seems like they have changed far less than you have, and are more similar to who they were, thus why should you treat them as having the same rate?

That's a cool observation! I guess this won't work if we discount geometrically in the number of links. I'm not sure which is more justified.
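To check this, here is a small sketch using the numbers from the quoted example (two years per link, sibling one link away from you now). The function names and the geometric discount factor `gamma` are my own assumptions for illustration.

```python
def link_distances(years, years_per_link=2.0, sibling_offset=1.0):
    """Links from you-now to your future self and to your future sibling."""
    self_links = years / years_per_link
    return self_links, self_links + sibling_offset

def distance_ratio(years):
    """How many times further (in links) the future sibling is than the future self."""
    self_links, sibling_links = link_distances(years)
    return sibling_links / self_links

def geometric_weight_ratio(years, gamma=0.5):
    """Sibling weight over self weight when weight = gamma ** links."""
    self_links, sibling_links = link_distances(years)
    return gamma ** sibling_links / gamma ** self_links

ratio_4 = distance_ratio(4)   # 3 links vs 2 links
ratio_8 = distance_ratio(8)   # 5 links vs 4 links
```

With a linear comparison of link counts the sibling-to-self ratio shrinks over time (1.5 at 4 years, 1.25 at 8 years), which is the quoted effect. With geometric discounting the weight ratio is `gamma ** sibling_offset` at every horizon, so the effect disappears.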

There is lots of interesting stuff in your last comment which I still haven't responded to. I might come back to this in the future if I have something interesting to say. Thanks again for your thoughts!

I proposed a method for detecting cheating in chess; cross-posting it here in the hopes of maybe getting better feedback than on reddit: https://www.reddit.com/r/chess/comments/xrs31z/a_proposal_for_an_experiment_well_data_analysis/

> In 'The inequivalence of society-level and individual charity' they list the scenarios as 1, 1, and 2 instead of A, B, C, as they later use. Later, the text refers incorrectly to preferring C to A with different necessary weights when the second reference is to prefer C to B.

I agree and I published an edit fixing this just now

> The claim that money becomes utility as a log of the amount of money isn't true, but is probably close enough for this kind of use. You should add a note to that effect. (The effects of money are discrete at the very least).

I mostly agree, but I think footnote 17 covers this?

> The claim that the derivative of the log of y = 1/y is also incorrect. In general, log means either log base 10, or something specific to the area of study. If written generally, you must specify the base. (For instance, in Computer Science it is base-2, but I would have to explain that if I was doing external math with that.) The derivative of the natural log is 1/n, but that isn't true of any other log. You should fix that statement by specifying you are using ln instead of log (or just prepending the word natural).

I think the standard in academic mathematics is that $\log = \ln$ (see https://en.wikipedia.org/wiki/Natural_logarithm#Notational_conventions), and I guess I would sort of like to spread that standard :). I think it's exceedingly rare for someone to mean base 10 in this context, but I could be wrong. I agree that base 2 is also reasonable though. In any case, the base only changes utility by scaling by a constant, so everything in that subsection after the derivative should be true independently of the base. Nevertheless, I'm adding a footnote specifying this.
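For concreteness, the scaling claim follows from the change-of-base formula, written here for an arbitrary base $b > 1$ (the symbol $b$ is just my notation):

```latex
\log_b y = \frac{\ln y}{\ln b},
\qquad
\frac{d}{dy} \log_b y = \frac{1}{y \ln b}
```

So switching the base multiplies both the utility and its derivative by the positive constant $1/\ln b$, which does not change which option has the higher expected utility.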

> Just plain wrong in my opinion, for instance, claiming that a weight can't be negative assumes away the existence of hate, but people do hate either themselves or others on occasion in non-instrumental ways, wanting them to suffer, which renders this claim invalid (unless they hate literally everyone).

I'm having a really hard time imagining thinking this about someone else (I can imagine hate in the sense of like... not wanting to spend time together with someone and/or assigning a close-to-zero weight), but I'm not sure – I mean, I agree there definitely are people who think they non-instrumentally want the people who killed their family or whatever to suffer, but I think that's a mistake? That said, I think I agree that for the purposes of modeling people, we might want to let weights be negative sometimes.

> I also don't see how being perfectly altruistic necessitates valuing everyone else exactly the same as you. I could still value others different amounts without being any less altruistic, especially if the difference is between a lower value for me and the others higher. Relatedly, it is possible to not care about yourself at all, but this math can't handle that.

I think it's partly that I just wanted to have some shorthand for "assign equal weight to everyone", but I also think it matches the commonsense notion of being perfectly altruistic. One argument for this is that 1) one should always assign at least as high a weight to oneself as to anyone else (also see footnote 12 here) and 2) if one assigns a lower weight to someone else, then one is not perfectly altruistic in interactions with that person. Given both, the unique option is to assign equal weight to everyone.