All of cubefox's Comments + Replies

I guess larger rockets are safer because more money is invested in testing them, since an explosion gets more expensive the larger the rocket is. But there seems to be no analogous argument which explains why smarter human brains are safer. It doesn't seem they are tested better. If the strong orthogonality thesis is true for artificial intelligence, there should be a positive explanation for why it is apparently not true for human intelligence.

Yes, most people seem to reject the stronger version, they think a superintelligent AI is unlikely to kill all humans. Given the context of the original question here, this seems to be understandable: In humans, higher IQ is correlated with lower antisocial and criminal behavior and lower violence – things which we typically judge to be immoral. I agree there are good philosophical reasons supporting the strong orthogonality thesis for artificial intelligence, but I think we have so far not sufficiently engaged with the literature from criminology and IQ r... (read more)

2Daniel Kokotajlo1mo
It doesn't seem worth engaging with to me. Yes, there's a correlation between IQ and antisocial and criminal behavior. If anyone seriously thinks we should just extrapolate that correlation all the way up to machine superintelligence (and from antisocial-and-criminal-behavior to human-values-more-generally) & then call it a day, they should really put that idea down in writing and defend it, and in the course of doing so they'll probably notice the various holes in it. Analogy: There's a correlation between how big rockets are and how safe rockets are. The bigger ones like Saturn 5 tend to blow up less than the smaller rockets made by scrappy startups, and really small rockets used in warfare blow up all the time. So should we then just slap together a suuuuper big rocket, a hundred times bigger than Saturn 5, and trust that it'll be safe? Hell no, that's a bad idea not worth engaging with. IMO the suggestion criminology-IQ research should make us optimistic about machine superintelligence is similarly bad for similar reasons.

That's the main problem with the orthogonality thesis: it is so vague. The thesis that there isn't a perfect correlation is extremely weak and uninteresting.

6Daniel Kokotajlo1mo
Nevertheless, some people still needed to hear it! I have personally talked with probably a dozen people who were like "But if it's so smart, won't it understand what we meant / what its purpose is?" or "But if it's so smart, won't it realize that killing humans is wrong, and that instead it should cooperate and share the wealth?"

Okay, I understand. The problem with fundamental microstates is that they only really make sense if they are possible worlds, and possible worlds bring their own problems.

One is: we can gesture at them, but we can't grasp them. They are too big; they each describe a whole world. We can grasp the proposition that snow is white, but not the equivalent disjunction of all the possible worlds where snow is white. So we can't use them for anything psychological like subjective Bayesianism. But maybe that's not your goal anyway.

A more general problem is that ther... (read more)

Could you clarify this part?

On the other hand, if we do extend entropy to arbitrary propositions A, it probably does make sense to define it as the conditional expectation S(A) = E[-log p | A], as you did.

Then "average entropy"/"entropy" of a macrostate p_A is S(True) under the distribution p_A, and "entropy"/"surprisal" of a microstate B (in the macrostate p_A) is S(B) under the distribution p_A.

By a slight coincidence, S(True) = S(A) under p_A, but S(True) is the thing that generalizes to give entropy of an arbitrary distribution.

I think I don't understand your notation here.

2Adam Scherlis1mo
I think I was a little confused about your comment and leapt to one possible definition of S() which doesn't satisfy all the desiderata you had. Also, I don't like my definition anymore, anyway.

Disclaimer: This is probably not a good enough definition to be worth spending much time worrying about.

First things first: I think this is indeed how we should think of "microstates". (I don't want to use the word "macrostate" at all, at this point.)

I was thinking of something like: given a probability distribution p and a proposition A, define

"S(A) under p" = $\frac{\sum_{x \in A} p(x) (-\log p(x))}{\sum_{x \in A} p(x)}$

where the sums are over all microstates x in A. Note that the denominator is equal to p(A).

I also wrote this as S(A) = expectation of (-log p(x)) conditional on A, or $S(A) = E[(-\log p) \mid A]$, but I think "log p" was not clearly "log p(x) for a microstate x" in my previous comment.

I also defined a notation p_A to represent the probability distribution that assigns probability 1/|A| to each x in A and 0 to each x not in A.

I used T to mean a tautology (in this context: the full set of microstates).

Then I pointed out a couple consequences:

  • Typically, when people talk about the "entropy of a macrostate A", they mean something equal to $\log |A|$. Conceptually, this is based on the calculation $\sum_{x \in A} \frac{1}{|A|} \left(-\log \frac{1}{|A|}\right)$, which is the same as either "S(A) under p_A" (in my goofy notation) or "S(T) under p_A", but I was claiming that you should think of it as the latter.
  • The (Shannon/Gibbs) entropy of p, for a distribution p, is equal to "S(T) under p" in this notation.
  • Finally, for a microstate x in any distribution p, we get that "S({x}) under p" is equal to -log p(x).

All of this satisfied my goals of including the most prominent concepts in Alex's post:

  • log |A| for a macrostate A
  • Shannon/Gibbs entropy of a distribution p
  • -log p(x) for a microstate x

And a couple other goals:

  • Generalizing the Shannon/Gibbs entropy, which is $S(p) = E_{x \sim p}[-\log p(x)]$, in a
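
For readers who want to check the three consequences listed above numerically, here is a minimal Python sketch. The four microstates, their probabilities, and the helper names `conditional_entropy` and `uniform_on` are illustrative assumptions, not part of the original comment; only the formula for "S(A) under p" is taken from it.

```python
import math

def conditional_entropy(p, A):
    """'S(A) under p': expectation of -log p(x), conditional on x being in A."""
    p_of_A = sum(p[x] for x in A)
    # Microstates with probability 0 contribute nothing to the numerator.
    weighted = sum(p[x] * -math.log(p[x]) for x in A if p[x] > 0)
    return weighted / p_of_A

def uniform_on(A, microstates):
    """p_A: probability 1/|A| on each x in A, 0 elsewhere."""
    return {x: (1 / len(A) if x in A else 0.0) for x in microstates}

# Hypothetical toy distribution over four microstates.
microstates = ["a", "b", "c", "d"]
T = set(microstates)  # the tautology: all microstates
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
A = {"a", "b", "c"}

shannon = -sum(q * math.log(q) for q in p.values())
s_T_under_p = conditional_entropy(p, T)       # should match the Shannon/Gibbs entropy
s_x_under_p = conditional_entropy(p, {"c"})   # should match -log p("c")

p_A = uniform_on(A, microstates)
s_T_under_pA = conditional_entropy(p_A, T)    # should match log |A|
s_A_under_pA = conditional_entropy(p_A, A)    # coincides with S(T) under p_A
```

Under p_A the two quantities S(T) and S(A) agree only because p_A puts no mass outside A; for a general p they would differ.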

This is a very thought-provoking post. As far as I understand, it is an attempt to find a unified theory of entropy.

I admit I am still somewhat confused about this topic. This is partly because of my insufficient grasp of the material in this post, but, I think, also partly because you haven't yet gone quite far enough with your unification.

One point is the thinking in terms of "states". A macrostate is said to be a set of microstates. As far as I understand, the important thing here is that all microstates are presumed to be mutually exclusive, such tha... (read more)

1Adam Scherlis1mo
I think macrostates are really a restricted kind of probability distribution, not a kind of proposition. But they're the kind of distribution p_A that's uniform over a particular disjunction A of microstates (and zero elsewhere), and I think people often equivocate between the disjunction A and the distribution p_A. [EDIT: "macrostate" is a confusing term, my goal here is really to distinguish between A and p_A, whatever you want to call them] In general, though, I think macrostates aren't fundamental and you should just think about distributions if you're comfortable with them. I think microstates should indeed be considered completely-specified possible worlds, from this perspective. "Average entropy" / "entropy of a macrostate" in OP ("entropy" in standard usage) is a function from probability distributions to reals. Shannon came up with an elegant set of axioms for this function, which I don't remember offhand, but which uniquely pins down the expectation of -log(p(microstate)) as the entropy function (up to a constant factor). "Entropy of a microstate" in OP ("surprisal" in information theory, no standard name otherwise AFAIK) is a function from probability distributions to random variables, which is equal to -log(p(microstate)). So I guess I'm not sure propositions play that much of a role in the usual definition of entropy. On the other hand, if we do extend entropy to arbitrary propositions A, it probably does make sense to define it as the conditional expectation S(A) = E[-log p | A], as you did. Then "average entropy"/"entropy" of a macrostate p_A is S(True) under the distribution p_A, and "entropy"/"surprisal" of a microstate B (in the macrostate p_A) is S(B) under the distribution p_A. By a slight coincidence, S(True) = S(A) under p_A, but S(True) is the thing that generalizes to give entropy of an arbitrary distribution. I've never seen an exploration of what happens if you apply this S() to anything other than individual microstates or True,

We may already be doing that in the case of cartoon faces with their exaggerated features. Cartoon faces don't look eldritch to us, but why would they?

1Rudi C1mo
They are still smooth and have low-frequency patterns, which seems to be the main difference from adversarial examples currently produced from DL models.

I'm confused, which GAN faces look like "horrible monstrosities"!?

3Brangus Brangus1mo
I assumed he meant the thing that most activates the face detector, but from skimming some of what people said above, seems like maybe we don't know what that is.

I'm not sure this is what you mean, but yes: in the case of acts, only the utility of an action matters for our choice, not its expected utility. We don't care about the probabilities of possible actions, or assign probabilities to them, when we choose among them; we just pick the action with the highest utility.

But only some propositions describe acts. I can't choose (make true/certain) that the sun shines tomorrow, so the probability of the sun shining tomorrow matters, not just its utility. Now if the utility of the sun shining tomorrow is ... (read more)

1Viktor Rehnberg2mo
So we have that but at the same time And I can see how starting from this you would get that $U(\top) = 0$. However, I think one of the remaining confusions is how you would go in the other direction. How can you go from the premise that we shift utilities to be 0 for tautologies to say that we value something to a large part from how unlikely it is?

And then we also have the desirability axiom

$U(A \lor B) = \frac{P(A) U(A) + P(B) U(B)}{P(A) + P(B)}$

for all A and B such that $P(A \land B) = 0$, together with Bayesian probability theory.

What I was talking about in my previous comment goes against the desirability axiom in the sense that I meant that for X = "Sun with probability p and rain with probability (1 − p)" in the more general case there could be subjects that prefer certain outcomes proportionally more (or less) than usual such that $U(X) \neq p\,U(\text{Sun}) + (1 - p)\,U(\text{Rain})$ for some probabilities p. As the equality derives directly from the desirability axiom, it was wrong of me to generalise that far.

But, to get back to the confusion at hand, we need to unpack the tautology axiom a bit. If we say that a proposition ⊤ is a tautology if and only if $P(\top) = 1$ [1], then we can see that any proposition that is no news to us has zero utils as well.

And I think it might be well to keep in mind that learning that e.g. sun tomorrow is more probable than we once thought does not necessarily make us prefer sun tomorrow less, but the amount of utils for sun tomorrow has decreased (in an absolute sense). This comes in nicely with the money analogy because you wouldn't buy something that you expect with certainty anyway [2], but this doesn't mean that you prefer it any less compared to some other worse outcome that you expected some time earlier. It is just that we've updated from our observations such that the utility function now reflects our current beliefs. If you prefer A to B then this is a fact regardless of the probabilities of those outcomes. When the probabilities c

Sorry - forgot about your comment.

  1. Tasks that animals usually face? (Find food, a safe place to sleep, survive, reproduce ...)

  2. This is an intriguing question. My first intuition: Probably not, because ...

    1. It seems evolution would have figured it out by now. After all, evolution optimizes heavily for generality. Any easily fixable blind spot would be a low hanging fruit for natural selection (e.g. by being exploited in inter-species competition).
    2. The level of generality of most animals seems very similar, and seems to have stayed similar for a very lon
... (read more)

I think I will write a somewhat longer post as a full introduction to Jeffrey-style utility theory. But I'm still not quite sure about some things. For example, Bradley suggests that we can also interpret the utility of some proposition as the maximum amount of money we would pay (to God, say) to make it true. But I'm not sure whether that money would rather track expected utility (probability times utility) or not. Generally, the interpretation of expected utility versus the interpretation of utility is not yet quite clear to me. Have to think a bit more about it...

1Viktor Rehnberg2mo
Isn't that just a question of whether you assume expected utility or not? In the general case it is only utility, not expected utility, that matters.

I see only one clean solution to this problem: Let anyone post at the AI Alignment Forum, no longer automatically crosspost to Less Wrong, and (somehow) nudge people who post AI content to Less Wrong to instead post it to the AI Alignment Forum. There should be three separate platforms:

  • Less Wrong, for rationality
  • AI Alignment Forum, for AI
  • EA Forum, for effective altruism

Currently, only effective altruism has its own platform, while Less Wrong and the AI Alignment Forum are insufficiently separated.

This way people interested in rationality don't hav... (read more)

Thanks for the Bradley reference. He does indeed work in Jeffrey's framework. On conditional utility ("conditional desirability", in Jeffrey terminology) Bradley references another paper from 1999 where he goes into a bit more detail on the motivation:

To arrive at our candidate expression for conditional desirabilities in terms of unconditional ones, we reason as follows. Getting the news that XY is true is just the same as getting both the news that X is true and the news that Y is true. But DesXY is not necessarily equal to DesX + DesY because of the w

... (read more)
2Viktor Rehnberg2mo
Sure, I've found it to be an interesting framework to think in so I suppose someone else might too. You're the one who's done the heavy lifting so far so I'll let you have an executive role. If you want me to write up a first draft I can probably do it end of next week. I'm a bit busy for at least the next few days.

I'm confused, the EA Forum post you linked seems to roughly agree with the nuclear winter doomers like Xia et al, not with the more optimistic datasecretlox thread you linked earlier. Quote from the EA Forum post:

By my estimation, a nuclear exchange between the US and Russia would lead to a famine that would kill 5.5 billion people in expectation (90% confidence interval: 2.7 billion to 7.5 billion people).

Mostly I was talking about the soot production. I think it is less doomy than normal for nuclear winter doomers, which is why I said directionally. But yeah, it is a lot more doomy than Bean is. I should have noted that in my comment.

The linked article is more than a week old, though; it appears not to talk about exactly the same thing. Moreover, presumably every military in the world has to use microchips, so a China-specific ban seems to require further justification.

Given that this is an act of economic warfare, how does the US government justify it? For liberal western democracies, "it is in our selfish interest to do so" is typically not seen as a sufficient reason. Usually there is some justification of why it is the morally right thing to do. I'm concerned this part is missing here.

The justification is that there's a risk that the Chinese military will use it. It seems to be part of general export controls for military hardware.

Yeah. The German term has less of a negative connotation I guess. "Surprisal" goes in a similar direction.

1the gears to ascenscion2mo
hmm, perhaps compare programmers' "building projects to scratch an itch", or fiction writers' "plot bunnies".

These ideas here remind me of Niklas Luhmann's theory of communication. Usually communication theories assume there to be a sender and a receiver, and the sender tries to transmit information to the receiver. Then that's called communication. But in Luhmann's theory there is no assumption of anything being transmitted. Instead, as far as I understand, he only assumes that people involved in communication react to "irritations" created by other people, and those reactions consist in creating further "irritations". A feedback loop.

So a blog post may be seen ... (read more)

1Henrik Karlsson2mo
I'm not deeply familiar with Luhmann's work, though that was interesting. It does remind me somewhat of Bakhtin (and Buber) on dialogue.
1the gears to ascenscion2mo
"irritation" in the sense of "state-of-the-world which I can improve by contributing and which my feelings desire to be improved", yeah? not necessarily that the desire to contribute has to have negative valence. if so, I wish Luhmann had chosen a different word, but I can see how my comment fits into the framework.

Yeah, you are right. I used the fact that . This makes use of the fact that and are both mutually exclusive and exhaustive, i.e. and . For , where and are mutually exclusive but not exhaustive, is not equivalent to . Since can be true without either of or being true.

It should however work if , since then . So for to hold, would have to be a "partition" of , exhaustively enumerating all the incompatible ways it can be true.

... (read more)
1Viktor Rehnberg2mo
I agree with $A \iff ((A \land s_1) \lor (A \land s_2))$ as a sufficient criterion to only sum over $\{s_1, s_2\}$; the other steps I'll have to think about before I get them.

--------------------------------------------------------------------------------

I found this newer paper and, having skimmed it, it seemed to have similar premises, but they defined $U(A \mid B) = U(A \land B) - U(B) + U(\top) = U(A \land B) - U(B)$ (instead of deriving it).

Not only do humans not directly care about increasing IGF, the vast majority hardly even cares about the proxy of maximizing the number of their direct offspring. That's something natural selection could have optimized for, but mostly didn't. Most couples in first world countries could have more than five children, yet they have fewer than 1.5 on average, far below replacement. The fact that this happens in pretty much all developed countries, despite politicians' efforts to counteract this trend, shows how weak the preference for offspring really is.

It ... (read more)

But we have my result above, i.e.

This proof could also be extended to longer disjunctions between mutually exclusive propositions apart from and . Hence, for a set of mutually exclusive propositions ,

which does not rely on the assumption of being equal to . After all, I only used the desirability axiom for the derivation, not the assumption . So we get a "nice" expression anyway as long as our disjunction is mutually exclusive. Right? (Maybe I misunderstood your point.)

Regarding , I am now no lo... (read more)

1Viktor Rehnberg2mo
Didn't you use that $B \lor \neg B = \top$? I can see how to extend the derivation for more steps $s_1 \lor s_2 \lor \ldots \lor s_n$, but only if $\{s_i\}_{i=1}^{n} = \Omega$. The sums $\sum_{s \in S} P(s \mid a) U(a \land s)$ and $\sum_{\omega \in \Omega} P(\omega \mid a) U(a \land \omega)$ for arbitrary U are equal if and only if $P(\Omega \setminus S \mid a) = 0$. The other alternative I see is if (and I'm unsure about this) we assume that $U(z \land a) = U(z)$ and $P(z \mid a) = P(z)$ for $z \in \Omega \setminus S$.

--------------------------------------------------------------------------------

What I would think that U(A|B) would mean is U(A) after we've updated probabilities and utilities from the fact that B is certain. I think that would be the first one but I'm not sure. I can't tell which one that would be.

Oh yes, of course! (I probably thought this was supposed to be valid for our as well, which is assumed to be mutually exclusive, but, unlike , not exhaustive.)

1Viktor Rehnberg2mo
General S (even if mutually exclusive) is tricky; I'm not sure the expression is as nice then.

I don't understand what you mean in the beginning here, how is the same as ?

1Viktor Rehnberg2mo
$U(\top) = \sum_{\omega \in \Omega} P(\omega) U(\omega) = 0$; that was one of the premises, no? You expect utility 0 from your prior.

Regarding the time stamp: Yeah, this is the right way to think about it, at least in the case of subjective utility theory, where utilities represent desires and probabilities represent beliefs, and it is also the right way to think about it for Bayesianism (subjective probability theory). and only represent the subjective state of an agent at a particular point in time. They don't say anything about how they should be changed over time. They only say that at any point in time, these functions (the agents) should satisfy the axioms.

Rules for change over time woul... (read more)

1Viktor Rehnberg2mo
Some first reflections on the results before I go into examining all the steps.

Hmm, yes, my expression seems wrong when I look at it a second time. I think I still confused the timesteps and should have written

$U(a) = \sum_{\omega \in \Omega} \left( P(\omega \mid a) U(\omega \land a) - P(\omega) U(\omega) \right)$

The extra negation comes from a reflex from when not using Jeffrey's decision theory. With Jeffrey's decision theory it reduces to your expression, as the negated terms sum to $U(\top) = 0$. But still, I probably should learn not to guess at theorems and properly do all steps in the future. I suppose that is a point in favor of Jeffrey's decision theory, that the expressions usually are cleaner.

As for your derivation, you used that $P(A \mid B) + P(A \mid \neg B) = P(A)$ in the derivation, but that is not the case for general S. This is a note to self to check whether this still holds for $S \subsetneq \Omega$.

--------------------------------------------------------------------------------

Edit: My writing is confused here; disregard it. My conclusion is still

Interesting! I have a few remarks, but my reply will have to wait a few days as I have to finish something.

The way I think about it: The utility maximizer looks for the available action with the highest utility and only then decides to do that action. A decision is the event of setting the probability of the action to 1, and, because of that, its utility to 0. It's not that an agent decides for an action (sets it to probability 1) because it has utility 0. That would be backwards.

There seems to be some temporal dimension involved, some "updating" of utilities. Similar to how assuming the principle of conditionalization formalizes classical Bayes... (read more)

2Viktor Rehnberg2mo
Ah, those timestep subscripts are just what I was missing. I hadn't realised how much I needed that grounding until I noticed how good it felt when I saw them.

So to summarise (below all sets have mutually exclusive members). In Jeffrey-ish notation we have the axiom

$U(S) = \frac{1}{P(S)} \sum_{s \in S} P(s) U(s)$

and normally you would want to indicate what distribution you have over S in the left-hand side. However, we always renormalize U such that the distribution is our current prior. We can indicate this by labeling the utilities with the timestep (the agent should probably be included as well, but let's skip this for now):

$U_t(S) = \frac{1}{P(S)} \sum_{s \in S} P(s) U_t(s)$

That way we don't have to worry about U being shifted during the sum in the right-hand side or something. (I mean notationally that would just be absurd, but if I would sit down and estimate the consequences of possible actions I wouldn't be able to not let this shift my expectation for what action I should take before I was done.)

We can also bring up the utility of an action a to be

$U_t(a) = \sum_{\omega \in \Omega} \left( P(\omega \mid a) - P(\omega) \right) U_t(\omega \land a)$

Furthermore, for most actions it is quite clear that we can drop the subscript t, as we know that we are considering the same timestep consistently for the same calculation:

$U(A \lor B) = \frac{P(A) U(A) + P(B) U(B)}{P(A) + P(B)}$, if $P(A \land B) = 0$

Now I'm fine with this because I will have those subscript t's in the back of my mind.

--------------------------------------------------------------------------------

I still haven't commented on $U(A \lor B)$ in general or $U(A \mid B)$. My intuition is that they should be able to be described from U(A), U(B) and $U(A \land B)$, but it isn't immediately obvious to me how to do that while keeping $U(\top) = 0$.

I tried considering a toy case where $A = s_1 \lor s_2$ and $B = s_2 \lor s_3$ ($S = \{s_1, s_2, s_3\}$) and then

$U(A \lor B) = U(s_1 \lor s_2 \lor s_3) = \frac{1}{P(S)} \sum_{s \in S} P(s) U(s)$

but I couldn't see how it would be possible without assuming some things about how U(A), U(B) and $U(A \land B)$ relate to each other, which I can't in general.
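
The axiom for mutually exclusive members summarised above can be checked on a toy model. The following Python sketch is only an illustration under stated assumptions: propositions are modelled as sets of three hypothetical possible worlds `w1`, `w2`, `w3`, the prior probabilities and raw utilities are made up, and `prob` and `U` are illustrative helper names. Utilities are shifted so that the tautology has utility 0.

```python
from fractions import Fraction as F

# Hypothetical possible worlds with made-up prior probabilities and raw utilities.
P = {"w1": F(1, 2), "w2": F(1, 3), "w3": F(1, 6)}
raw = {"w1": F(4), "w2": F(-2), "w3": F(10)}

# Shift utilities so that the prior expectation is zero, i.e. U(T) = 0.
mean = sum(P[w] * raw[w] for w in P)
u = {w: raw[w] - mean for w in P}

def prob(S):
    """Probability of a proposition, modelled as a set of worlds."""
    return sum(P[w] for w in S)

def U(S):
    """U(S) = (1 / P(S)) * sum over s in S of P(s) * U(s)."""
    return sum(P[w] * u[w] for w in S) / prob(S)

A = {"w1"}
B = {"w2"}  # A and B are mutually exclusive, so P(A and B) = 0

lhs = U(A | B)  # utility of the disjunction (set union)
rhs = (prob(A) * U(A) + prob(B) * U(B)) / (prob(A) + prob(B))
top = U(set(P))  # utility of the tautology
```

Exact fractions are used so that the two sides of the axiom can be compared with `==` rather than with a floating-point tolerance.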

I'm not perfectly sure what the connection with Bayesian updates is here. In general it is provable from the desirability axiom that This is because any (e.g. ) is logically equivalent to for any (e.g. ), which also leads to the "law of total probability". Then we have a disjunction which we can use with the desirability axiom. The denominator cancels out and gives us in the numerator instead of , which is very convenient because we presumably don't know the prior probab... (read more)

1Viktor Rehnberg2mo
Well, deciding to do action a would also make it utility 0 (edit: or close enough considering remaining uncertainties) even before it is done. At least if you're committed to the action and then you could just as well consider the decision to be the same as the action. It would mean that a "perfect" utility maximizer always does the action with utility 0 (edit: but the decision can have positive utility(?)). Which isn't a problem in any way except that it is alien to how I usually think about utility. Put in another way. While I'm thinking about which possible action I should take the utilities fluctuate until I've decided for an action and then that action has utility 0. I can see the appeal of just considering changes to the status quo, but the part where everything jumps around makes it an extra thing for me to keep track of.

Well, the "expected value" of something is just the value multiplied by its probability. It follows that, if the thing in question has probability 1, its value is equal to the expected value. Since is a tautology, it is clear that .

Yes, this fact is independent of , but this shouldn't be surprising I think. After all, we are talking about the utility of a tautology here, not about the utility of itself! In general, is usually not 1 ( and are only presumed to be mutually exclusive, not necessarily exhaus... (read more)

1Viktor Rehnberg2mo
Ok, so this is a lot to take in, but I'll give you my first takes as a start. My only disagreement prior to your previous comment seems to be in the legibility of the desirability axiom for U(A∨B), which I think should contain some reference to the actual probabilities of A and B. Now, I gather that this disagreement probably originates from the fact that I defined U({})=0 while in your framework U(⊤)=0. Something that appears problematic to me is if we consider the tautology (in Jeffrey notation) U(Doom∨¬Doom) = P(Doom)U(Doom) + P(¬Doom)U(¬Doom) = 0. This would mean that reducing the risk of Doom has 0 net utility. In particular, certain Doom and certain ¬Doom are equally preferable (=0). Which I don't think either of us agrees with. Perhaps I've missed something.

Ah, thanks. I still find this strange, since in your case and are events, which can be assigned specific probabilities and utilities, while is apparently a random variable. A random variable is, as far as I understand, basically a set of mutually exclusive and exhaustive events. E.g. = The weather tomorrow = {good, neutral, bad}. Each of those events can be assigned a probability (and they must sum to 1, since they are mutually exclusive and exhaustive) and a utility. So it seems it doesn't make sense to assign itself a utility (or a probability)... (read more)

1Viktor Rehnberg2mo
What I found confusing with P(A∨¬A)U(A∨¬A) was that to me this reads as U(A∨¬A), which should always(?) depend on P(A), but with this notation it is hidden to me. (Here I picked ¬A as the mutually exclusive event B, but I don't think it should remove much from the point.) That is also why I want some way of expressing that in the notation. I could imagine writing it as $U_X(\Omega)$; that is the cleanest way I can come up with to satisfy both of us. Then with expected utility, $U_X(\Omega) = E_X[U(X)]$. When we accept the expected utility hypothesis, we can always write it as an expectation/sum of its parts, P(A)U(A) + P(¬A)U(¬A), and then there is no confusion either.

I'm probably missing something here, but how is a defined expression? I thought takes as inputs events or outcomes or something like that, not a real number like something which could be multiplied with ? It seems you treat not as an event but as some kind of number? (I get of course, since returns a real number.)

The thing I would have associated with "expected utility hypothesis": If and are mutually exclusive, then

1Viktor Rehnberg2mo
Hmm, I usually don't think too deeply about the theory, so I had to refresh some things to answer this. First off, the expected utility hypothesis is apparently implied by the VNM axioms. So that is not something that needs to be added on. To be honest, I usually only think of a coherent preference ordering and expected utilities as two separate things and hadn't realized that VNM combines them. About notation: with U(A) I mean the utility of getting A with certainty, and with pA I mean the utility of getting A with probability p. If you don't have the expected utility hypothesis, I don't think you can separate an event from its probability. I tried to look around for the usual notation but didn't find anything great. Wikipedia used something like $U(X) = E[U(X)] = \sum_{\omega \in \Omega} P_X(\omega) U(\omega)$, where X is a random variable over the set of states Ω. Then I'd say that the expected utility hypothesis is the step $U(X) = E[U(X)]$.

Could you explain the "expected utility hypothesis"? Where does this formula come from? Very intriguing!

2Viktor Rehnberg2mo
The expected utility hypothesis is that U(pA) = pU(A). To make it more concrete, suppose that outcome A is worth 10 utils to you. Then getting A with probability 1/2 is worth 5 utils. This is not necessarily true; there could be an entity that prefers outcomes comparatively more if they are probable/improbable. The name comes from the fact that if you assume it to be true you can simply take expectations of utils and be fine. I find it very agreeable.
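
The 10-utils example can be written out as a small Python sketch. The first function is the hypothesis U(pA) = pU(A) as stated above; the second is a purely hypothetical agent, invented here for contrast, that weights probabilities as p**gamma and so values probable outcomes comparatively more. Neither the function names nor the `gamma` parameter come from the comment.

```python
def lottery_utility_eu(p, u_outcome):
    """Expected utility hypothesis: U(pA) = p * U(A)."""
    return p * u_outcome

def lottery_utility_weighted(p, u_outcome, gamma=2.0):
    """Hypothetical agent that distorts probabilities as p**gamma.

    With gamma > 1 it discounts improbable outcomes more than linearly,
    so the expected utility hypothesis fails for it.
    """
    return (p ** gamma) * u_outcome

u_A = 10.0  # outcome A is worth 10 utils
eu_value = lottery_utility_eu(0.5, u_A)              # 0.5 * 10
weighted_value = lottery_utility_weighted(0.5, u_A)  # 0.25 * 10
```

For the first agent, taking expectations of utils is all that is needed; for the second, the probability cannot be separated from the value of the outcome in that simple way.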

In Jeffrey's desirability formula you write . But isn't this value always 1 for any i? Which would mean the term can be eliminated since multiplying with 1 makes no difference? Assume p = "the die comes up even". So the partition of p is (the die comes up...) {2,4,6}. And for all i. E.g. P(even|2)=1.

I guess you (Jeffrey) rather meant ?
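
The die example can be checked directly. This is a minimal sketch assuming a fair six-sided die (the fairness is an assumption added here); `cond` is an illustrative helper, and the reverse conditional is computed only for contrast.

```python
from fractions import Fraction

# Assumption: a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}
even = {2, 4, 6}  # p = "the die comes up even", partitioned into its faces

def cond(event, given):
    """Conditional probability P(event | given) for sets of faces."""
    joint = sum(die[f] for f in event & given)
    return joint / sum(die[f] for f in given)

# P(even | i) for each cell i of the partition: always 1, as noted above.
p_even_given_face = {i: cond(even, {i}) for i in even}

# The reverse conditional P(i | even), for contrast: not constant at 1.
p_face_given_even = {i: cond({i}, even) for i in even}
```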

Similar recommendation to blog post writers: Try to include only relatively important links, since littering your post with links will increase effective reading time for many readers, which will cause fewer people to read the (whole) post.

This is similar to post length: There is an urge to talk about everything somewhat relevant to the topic, respond to all possible objections and the like. But longer posts will, on average, be read by fewer people. There is a trade-off between being concise and being thorough.

To add to this: Expressing belief in the Christian god will still be relatively harmless. It would cost you some professional status because people would think you are not very smart. But expressing other beliefs outside the Overton window may make people think you are actively evil or at least very immoral. As a historical example, expressing disbelief in God was once such a case. For such (supposedly) immoral beliefs you may lose a lot more status, and not just status. You might get cancelled or excluded from your social circles, lose job opportunities e... (read more)

2Shoshannah Tekofsky3mo
Yes, agreed. The technique is only aimed at the "soft" edge of this, where people might in reality even disagree if something is still in or outside the Overton Window. I do think a gradient-type model of controversiality is a more realistic model of how people are socially penalized than a binary model. The exercise is not aimed at sharing views that would lead to heavy social penalties indeed, and I don't think anyone would benefit from running it that way. It's a very relevant distinction you are raising.

I agree that value in the sense of goodness is not relevant for alignment. What is relevant is what the AI is motivated to do, not what it believes to be good. I'm just saying that your usage of "value" would be called "desire" by philosophers.

Often it seems that using the term "values" suffers a bit from this ambiguity. If someone says an agent A "values" an outcome O, do they mean A believes that O is good, i.e. that A believes O has a high value, a high degree of goodness? Or do they mean that A wants O to obtain, i.e. that A desires O? That seems often ambiguo... (read more)

Ethical truths are probably different from empirical truths. An advanced AI may learn empirical truths on its own from enough data, but it seems unlikely that it will automatically converge on the ethical truth. Instead, it seems that any degree of intelligence can be combined with any kind of goal. (Orthogonality Thesis)

I think the main point of the orthogonality thesis is less about an advanced AI not being able to figure out the true ethics, and more about the AI not being motivated to be ethical in this way even if it figures out the correct theory. If there i... (read more)

Good point, I see what you mean. I think we could have 2 distinct concepts of "ethics" and 2 corresponding orthogonality theses:
1. Concept "ethics1" requires ethics to be motivational. Some set of rules can only be the true ethics if, necessarily, everyone who knows them is motivated to follow them. (I think moral internalists probably use this concept?)
2. Concept "ethics2" doesn't require some set of rules to be motivational to be the correct ethics.
The orthogonality thesis for 1 is what I mentioned: Since there are (probably) no rules that necessarily motivate everyone who knows them, the AI would not find the true ethical theory. The orthogonality thesis for 2 is what you mention: Even if the AI finds it, it would not necessarily be motivated by it.

A bit tangential: Regarding the terminology, what you here call "values" would be called "desires" by philosophers. Perhaps also by psychologists. Desires measure how strongly an agent wants an outcome to obtain. Philosophers would mostly regard "value" as a measure of how good something is, either intrinsically or for something else. There appears to be no overly strong connection between values in this sense and desires, since you may believe that something is good without being motivated to make it happen, or the other way round.

5Quintin Pope3mo
If you say (or even believe) that X is good, but never act to promote X, I’d say you don’t actually value X in a way that’s relevant for alignment. An AI which says / believes human flourishing is good, without actually being motivated to make said flourishing happen, would be an alignment failure. I also think that an agent’s answers to the more abstract values questions like “how good something is” are strongly influenced by how the simpler / more concrete values form earlier in the learning process. Our intent here was to address the simple case, with future posts discussing how abstract values might derive from simpler ones.

This is a great post, thank you! A few comments:

  • Cartesian dualism is a kind of substance dualism, which has many problems. But there is also property dualism, which is most famously defended by David Chalmers. Given that he is a (perhaps even "the") top philosopher of mind, property dualism is probably not so easy to dismiss. Other philosophers have similar views, like Galen Strawson. He says consciousness is likely a fundamental property, just like, perhaps, mass. This means any physical theory of everything must contain irreducible terms for all fundam
... (read more)

This is my favorite so far, since you really propose a task which is not impressive, while I would be quite impressed with most of the other suggestions. A sign of this is that my probability for GPT-4 not doing this is relatively close to 50%.

In fact there is a tradeoff here between certainty and surprise: The more confident you are GPT-4 won't solve a task, the more surprised you will be if it does. This follows from conservation of expected evidence.
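The tradeoff can be illustrated with a small Python sketch. All numbers are assumptions chosen for illustration: H stands for some hypothesis about GPT-4's capabilities, E for "GPT-4 solves the task", and the likelihoods are made up. Conservation of expected evidence says the probability-weighted average of the two possible posteriors equals the prior, so an outcome you consider unlikely has to move your belief by more than the outcome you expect.

```python
from fractions import Fraction


def update(prior_h, p_e_given_h, p_e_given_not_h):
    """Return (P(E), P(H|E), P(H|not E)) using Bayes' rule."""
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
    post_if_e = p_e_given_h * prior_h / p_e
    post_if_not_e = (1 - p_e_given_h) * prior_h / (1 - p_e)
    return p_e, post_if_e, post_if_not_e


# Assumed numbers: prior P(H) = 1/2, P(E|H) = 1/5, P(E|not H) = 1/20.
prior_h = Fraction(1, 2)
p_e, post_if_e, post_if_not_e = update(prior_h, Fraction(1, 5), Fraction(1, 20))

# Conservation of expected evidence: the expected posterior equals the prior.
expected_posterior = p_e * post_if_e + (1 - p_e) * post_if_not_e

# The unlikely outcome (E) shifts belief far more than the likely one (not E).
shift_if_solved = post_if_e - prior_h
shift_if_not_solved = post_if_not_e - prior_h
```

Here P(E) is 1/8, so solving the task moves the belief in H by 3/10, while failing moves it by only 3/70 in the other direction; the two shifts cancel exactly once weighted by their probabilities.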

Your view here sounds a bit like preference presentism:

For instance, if I choose to bring one more person into the world, by having a child (which, incidentally, we just did!), that decision is primarily about what kind of life I want to have, and what commitments I am willing to make, rather than about whether I think the world, in the abstract, is better or not with one more person in it.


Apart from comparativists, we have presentists who draw a distinction between presently existing people and non-existing people (Narveson 1973; Heyd 1988);

... (read more)

Thanks, that clarifies it. I'm not sure whether it would be the right way to compare the similarity of two utility functions, since it only considers ordinal information without taking into account how strongly the agents value an outcome / world state. But this is at least one way to do it.

A comment from hacker news on this piece:

The reason that language models require large amounts of data is because they lack grounding. When humans write a sentence about.. let's say "fire", we can relate that word to visual, auditory and kinesthetic experiences built from a coherent world model. Without this world model the LM needs a lot of examples, essentially it has to remember all the different contexts in which the word "fire" appears and figure out when it's appropriate to use this word in a sentence [...]

In other words, language models need so ... (read more)

Rank correlation coefficients are an interesting point. The way I have so far interpreted "orthogonality" in the orthogonality thesis is simply as modal ("possibility") independence: For a system with any given quantity of intelligence, any goal is possible, and for a system with any given goal, any quantity of intelligence is possible.

The alternative approach is to measure orthogonality in terms of "rank correlation" when we assume we have some ordering on goals, such as by how aligned they are with the goals of humanity.

As far as I understand, a rank corr... (read more)

1Stephen Bennett (Previously GWS)4mo
Your initial point was that "goals" aren't a quantifiable thing, and so it doesn't make sense to talk about "orthogonality", which I agree with. I was just saying that while goals aren't quantifiable, there are ways of quantifying alignment. The stuff about world states and Kendall's tau was a way to describe how you could assign a number to "alignment". When I say world states, I mean some possible way the world is. For instance, it's pretty easy to imagine two similar world states: the one that we currently live in, and one that's the same except that I'm sitting cross-legged on my chair right now instead of having my knee propped against my desk. That's obviously a trivial difference and so gets nearly exactly the same rank as the world we actually live in. Another world state might be one in which everything is the same except that a cosmic ray has created a prion in my brain (which gets ranked much lower than the actual world). Ranking all possible future world states is one way of expressing an agent's goals, and computing the similarity of these rankings between agents is one way of measuring alignment. For instance, if someone wants me to die, they might rank the Stephen-has-a-prion world quite highly, whereas I rank it quite low, and this will contribute to us having a low correlation between rank orderings over possible world states, and so by this metric we are unaligned from one another.
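The ranking comparison described above can be sketched in Python. This is a minimal, tie-free implementation of Kendall's tau written out by hand; the three world states and both agents' rankings are hypothetical values chosen to mirror the example (rank 1 is the most preferred state).

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings given as {state: rank} dicts.

    Assumes both agents rank the same states and that there are no ties.
    Returns +1 for identical orderings and -1 for exactly reversed ones.
    """
    states = list(rank_a)
    if set(states) != set(rank_b):
        raise ValueError("both agents must rank the same world states")
    if len(states) < 2:
        raise ValueError("need at least two world states to compare")
    concordant = discordant = 0
    for s, t in combinations(states, 2):
        product = (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t])
        if product > 0:
            concordant += 1   # both agents order this pair the same way
        elif product < 0:
            discordant += 1   # the agents order this pair oppositely
    n_pairs = len(states) * (len(states) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings over three world states (1 = most preferred).
stephen = {"actual": 1, "cross_legged": 2, "prion": 3}
adversary = {"prion": 1, "actual": 2, "cross_legged": 3}

tau_self = kendall_tau(stephen, stephen)
tau_adversary = kendall_tau(stephen, adversary)
```

On these made-up rankings the two agents agree on one pair and disagree on two, giving a tau of -1/3. As noted elsewhere in the thread, this uses only ordinal information and ignores how strongly each agent values a state.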

Specifically for "orthogonal": I think here simply the word "independent" should be used instead. Most people don't even know what orthogonality means. "Independent" is of course a vague term (logically independent? modally independent? causally independent? counterfactually independent? probabilistically independent? etc), but so is "orthogonal" in its metaphorical sense.

Moreover, orthogonality between two concepts can really only make geometric sense if you compare two qualitative concepts expressed by mass nouns, like in "intelligence" and "age". Otherw... (read more)

4Stephen Bennett (Previously GWS)4mo
Overall I think you're right, and walking[1] through this example for myself was a good example of ways in which geometric metaphors can be imprecise[1] (although I'm not sure they're exactly misleading[1]). I'll end up[1] having to stretch[1] the metaphor to the point that I was significantly modifying what other people were saying to have it actually make sense. Regarding the "orthogonality" in "orthogonality thesis": the description given on the LW tag and indeed Bostrom's paper is orthogonality between intelligence and goals as you said. However, in practice I frequently see "goals" replaced with something like "alignment", which (to the extent that you can rank order aligned agents), is something quantifiable. This seems appropriate since you can take something like Kendall's tau of the rank orderings of world states of two agents, and that correlation is the degree to which one agent is aligned with another.

[1] This is a spatial metaphor. I went back through after writing the post to see how often they showed up. Wowza.

Back then there was already significant demand for "human computers": I think it is plausible that one of Babbage's steam-powered machines could have been faster than quite a few people.

Well, as far as I can tell, even the electromechanical computers of the 1930s were not significantly faster than humans using mechanical calculators. That's why I don't think it would have worked in Babbage's day. More details in the middle of this essay.