Alex_Altair

Comments

I do remember a bunch of content around that, yeah. And I would agree that terminal goals are arbitrary in the sense that they could be anything. But, for any given agent/organism/"thing that wants stuff", there will be a fact-of-the-matter of what terminal goals got instantiated inside that thing.

There are also a few separate but related and possibly confusing facts:

  • The process of evolution will tend to produce organisms that have certain kinds of terminal goals instantiated inside them.
  • Empirically, humans happen to have a huge overlap in their terminal goals (including the terminal goal that other beings have their terminal goals satisfied).
  • If there are a bunch of roughly equally-capable agents around, then it maximizes your own utility (= terminal goals) to do a lot of game-theoretic cooperation with them.

Hm, I'm not sure about Mere Goodness, I read the sequences soon after they were finished, so I don't much remember which concepts were where. There is a sequence post titled Terminal Values and Instrumental Values, though it mostly seems to be emphasizing that both things exist and are different, saving the rest of the content for other posts.

Morality. To me it seems like rationality can tell you how to achieve your goals but not what (terminal) goals to pick. Arguments that try to tell you what terminal goals to pick have just never made sense to me. Maybe there's something I'm missing though.

Okay, I'll bite on this one.

The very thing that distinguishes terminal goals is that you don't "pick" them, you start out with them. They are the thing that gives the concept of "should" a meaning.

A key thing the orthogonality thesis says is that it's perfectly possible to have any set of terminal goals, and that there's no such thing as a "rational" set of terminal goals to have.

If you have terminal goals, then you may still need to spend a lot of time introspecting to figure out what they are. If you don't have terminal goals, then the concept of "should", and morality in general, cannot be made meaningful for you. People often consider themselves to be "somewhere in between", where they're not a perfect encoding of some unchangeable terminal values, but there is still a strong sense in which they want stuff for its own sake. I would consider nailing down exactly how these in-between states work to be part of agent foundations.

I'd strongly encourage you to split this post up into a sequence! I think it improves readability (and strongly increases engagement).

I just remembered that we can tag users now; I'll try tagging @evhub to get his opinion.

I found the beginning of this post very confusing because you don't seem to be at all acknowledging that the Speed Prior is this specific idea created in 2000 long before AI alignment was a field. (It doesn't seem like you even reference this paper in the post?) Early in the post, right under the heading "What is the speed prior and why do we care about it?" you say,

The speed prior is a potential technique for combating formation of deceptive alignment.

This is a true statement about the Speed Prior, but it's not what it is, and it's emphatically not why it was conceived; instead this is a statement of why we (the alignment community) care about it.
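For readers who haven't seen the original paper: Schmidhuber's Speed Prior is often glossed (setting aside the phase-based FAST construction he actually uses) as discounting each program's contribution by its runtime. A rough sketch of that gloss, not the paper's exact definition:

```latex
% Solomonoff prior: each program p with U(p) = x contributes by length alone
M(x) \;=\; \sum_{p \,:\, U(p)=x} 2^{-\ell(p)}

% Speed Prior (rough gloss): contributions are additionally discounted
% by computation time t(p), i.e. an extra factor of 1/t(p)
S(x) \;\approx\; \sum_{p \,:\, U(p)=x} 2^{-\ell(p)} \cdot \frac{1}{t(p)}
```

Either way, it's a specific, dated formal object, distinct from the general "penalize computation" idea.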

My guess about what happened here would be something like:

  1. Paul and others talked a bunch about the Solomonoff prior and its implications for alignment, occasionally mentioning the Speed Prior as a close cousin to the Solomonoff prior.
  2. Over time, most of why people were talking about the Speed Prior came from the fact that it penalizes computation time (an idea that is generally useful for alignment), not from its formal specification.
  3. Evan picked up on this generalized usage.
  4. Evan mentored you and transferred the phrase "speed prior" as referring to that general concept.

I think this is a great idea for the alignment community to be developing, but we should do so under a term that doesn't already refer to something specific outside our field. (I think most of my objection would be ameliorated if you consistently use "a speed prior" and "speed priors".) I'm not too much of a stickler for freezing the usage of terms, but I was genuinely confused by this usage, and I suspect that other alignment researchers would be too.

Okay so this post is great, but just want to note my confusion, why is it currently the 10th highest karma post of all time?? (And that's inflation-adjusted!)

Oh, yeah, that's totally fair. I agree that a lot of those writings are really valuable, and I've been especially pleased with how much Nate has been writing recently. I think there are a few factors that contributed to our disagreement here:

  • I meant to refer to my beliefs about MIRI at the time that Death With Dignity was published, which means most of what you linked wasn't published yet. So by "last few years" I meant something like 2017-2021, which does look sparse.
  • I was actually thinking about something more like "direct" alignment work. 2013-2016 was a period where MIRI was outputting much more research, hosting workshops, et cetera.
  • MIRI is small enough that I often tend to think in terms of what the individual people are doing, rather than attributing it to the org, so I think of the 2021 MIRI conversations as "Eliezer yells at people" rather than "MIRI releases detailed communications about AI risk".

Anyway, my overall reason for saying that was to argue that it's reasonable for people to have been updating in the "MIRI giving up" direction long before Death With Dignity.
