kaarelh AT gmail DOT com

i think you’re right that the sohl-dickstein post+survey also conflates different notions, and i might even have added more notions into the mix with my list of questions trying to get at some notion(s) ^[1]

a monograph untangling this coherence mess some more would be valuable. it could do the following things:

specifying a bunch of a priori different properties that could be called “coherence”
discussing which ones are equivalent, which ones are correlated, which ones seem pretty independent
giving good names to the notions or notion-clusters
discussing which kinds of coherence generically increase/decrease with capabilities, which ones probably increase/decrease with capabilities in practice, which ones can both increase or decrease with capabilities depending on the development/learning process, both around human level and later/eventually, in human-like minds and more generally ^[2]
discussing how this relates to AI x risk. like, which kinds of coherence should play a role in a case for AI x risk? what does that look like? or maybe the picture should make one optimistic about some approach to de-AGI-x-risk-ing? or about AGI in general? ^[3]

i didn’t re-read that post before writing my comment above ↩︎
the answers to some of these questions might depend on some partly “metaphysical” facts like whether math is genuinely infinite or whether technological maturity is a thing ↩︎
i think the optimistic conclusions are unlikely, but i wouldn’t want to pre-write that conclusion for the monograph, especially if i’m not writing it ↩︎

hmm, like i think there's a reasonable sense of "coherence" such that it plausibly doesn't typically increase with capabilities. i think the survey respondents here are talking about something meaningful and i probably agree with most of their judgments about that thing. for example, with that notion of coherence, i probably agree with "Google (the company) is less coherent now than it was when it had <10 employees" (and this is so even though Google is more capable now than it was when it had 10 employees)

this "coherence" is sth like "not being a hot mess" or "making internal tradeoffs efficiently" or "being well-orchestrated". in this sense, "incoherence" is getting at the following things:

to what extent are different parts of the guy out of sync with each other (like, as a % of how well they could be in sync)?
to what extent is the guy leaving value on the table compared to using the same parts differently? are there many opportunities for beneficial small rearrangements of parts?
how many arbitrage opportunities are there between the guy's activities/parts?
to what extent does it make sense to see all the parts/activities of the guy as working toward the same purpose?

with this notion, i think there are many naturally-occurring cases of someone becoming more capable but less "coherent". e.g. maybe i read a textbook and surface-level-learn some new definitions and theorems and i can now solve the problems in the textbook, but the mathematical understanding i just gained is less integrated with the rest of my understanding than usual for me given that i've only surface-level-learned this stuff (and let's assume surface-level-learning this didn't let me integrate other existing stuff better) — like, maybe i mostly don't see how this theorem relates to other theorems, and wouldn't be able to easily recognize contexts in which it could be useful, and wouldn't be able to prove it, and it doesn't yet really make intuitive sense to me that it has to be true — so now i'm better at math but in a sense less coherent. e.g. maybe i get into acrobatics but don't integrate that interest with the rest of my life much. eg maybe as an infant it was easy to see me as mostly orchestrating my like 5 possible actions well toward like being fed when hungry and sleeping when sleepy, but it's less clear how to see me now as orchestrating most of my parts well toward something. ^[1]

now there is the following response to this:

ok, maybe, but who cares about this "coherence". maybe there is a notion such that maybe a nematode is more coherent than a human who is more coherent than the first substantially smarter-than-human artificial system. but if you are a nascent orca civilization, it's much better to find yourself next to a nematode, than to find yourself next to a human, than to find yourself next to the first substantially smarter-than-human artificial system. we're talking about another notion of "coherence" — one that helps make sense of this

my thoughts on this response:

i agree we're fucked even if "the first ASI is very incoherent" in this sense (on inside view, i'm at like that creating AGI any time soon (as opposed to continuing developing as humans) would be the greatest tragedy in history so far, and at like $80 % +$ that there won't even be a minimal human future if this happens)
one can make a case for AI risk while not saying "coherence", just talking of capabilities (and maybe values). indeed, this is a common response in the LW comments on the post i referenced. here's me providing a case like that
if one wants to make a case for AI risk involving a different sense of "coherence", then one might be assuming a meaning different than the most immediate meaning, so one would want to be careful when using that word. one might end up causing many people to understand why AI is scary significantly less well than they could have if one took more care with language! (eg: maybe amodei; maybe some of these people whose paper i still haven't skimmed.) there are probably interesting things to say about AI risk involving e.g. some of the following properties an AI might have: the ability to decompose problems, the ability to ask new relevant questions, being good at coming up with clever new approaches to hard challenges, being strategic about how to do something, trying many approaches to a problem, being relentless, not getting too confused, resolving inconsistencies in one's views, the ability or tendency to orchestrate many actions or mental elements toward some task (eg across a lot of time). but i want to suggest that maybe it's good to avoid the word "coherence" here given the potential for confusion, or to establish some common standard, e.g. calling the quality of the orchestration of one's parts compared to what is possible with small rearrangements "relative coherence" and calling the ability to put many things together "absolute coherence"
i also think there's plausibly some genuine mistake being made by many on LW around thinking that systems are increasingly good pursuers of some goal. it seems sorta contrived to view humans this way. humans have projects and a learning human tends to become better at doing any given thing, but i feel like there doesn't need to be some grand project that a human's various projects are increasingly contributing to or whatever. or like, i'm open to this property convergently showing up (ever? or close to our present capability level?), but i don't think i've seen a good analysis of this question supporting that conclusion. imo, intuitively, opportunities for completely new projects will open up in the future and i can get interested in them with no requirement that they fit together well with my previous projects or whatever. ^[2] ^[3]
if someone gives an argument against "the first AGI/ASI will be coherent" and thinks they have given a good argument against AI risk, i think they've probably made a serious mistake. but i think it's like sort of an understandable mistake given that LW arguments for AI risk do emphasize some sort of thing called "coherence" too, probably often with some conflation between these notions (or an imo probably false claim they are equivalent)

i'm somewhat orchestrated toward understanding AI stuff better or getting AGI banned for a very long time or something but i'm probably leaving value massively on the table all over the place, i think in a sense much more than i was as an infant. (and also, this isn't "my terminal goal".) ↩︎
related: https://www.lesswrong.com/posts/nkeYxjdrWBJvwbnTr/an-advent-of-thought ↩︎
the closest thing to this grand optimizer claim that imo makes sense is: it is generic to have values; it is generic to have opinions on what things should be like. this seems sufficient for a basic case for AI risk, as follows: if you're next to an anthill and you're more capable than the ant colony, then it is generic that the ants' thoughts about what things should be like will not matter for long. (with AI, humanity is the ant colony.) ↩︎

i haven't even skimmed the anthropic paper and i have a high prior that they are being bad at philosophy but also: i think there is plausibly a real mistake LW-ers are making around coherence too, downstream of a conflation of two different notions, as i outline here: https://www.lesswrong.com/posts/jL7uDE5oH4HddYq4u/raemon-s-shortform?commentId=WBk9a7TEA5Benjzsu

with like my guess being that: you are saying something straightforwardly true given one notion here but they are making claims given the other notion at least in some cases, though also they might be conflating the two and you might be conflating the two. one could argue that it is fine to "conflate" the two because they are really equivalent, but i think that's probably false (but non-obviously)

I find it interesting and unfortunate that there aren't more economically left-wing thinkers influenced by Yudkowsky/LW thinking about AGI. It seems like a very natural combination given e.g. "Marx subsequently developed an influential theory of history—often called historical materialism—centred around the idea that forms of society rise and fall as they further and then impede the development of ~~human~~ productive power.". It seems likely that LW being very pro-capitalism has meaningfully contributed to the lack of these sorts of people. ^[1] I guess ACS carries sth like this vibe. But (unlike ACS) it also seems natural to apply this sort of view of history to AI except also thinking that fooming will be fast. ^[2]

Relatedly, I wonder if I should be "following the money" more when thinking about AI risk. In particular, instead of saying that "AI researchers/companies" will disempower humanity, maybe it would be appropriate to instead or additionally say "(AI )capitalists and capital and capitalism". My current guess is that while it is appropriate to place a bunch of blame on these, it's also true that e.g. Soviet or Chinese systems [wouldn't be]/aren't doing better, so I've mostly avoided saying this so far. That said, my guess is that if the world were much more like Europe, we would be dying with significantly more dignity, in part due to Europe getting some hyperparameters of governance+society+culture+life more right due to blind luck, but also actually in part due to getting some hyperparameters right because of good reasoning that was basically tracking something logically connected to AI risk (though so far not significantly explicitly tracking AI risk), e.g. via humanism. Another example of a case where I wonder if I should follow the money more is: to what extent should I think of Constellation being wrong/confused/thoughtless/slop-producing on AGI risk in ways xyz as "really being largely about" OpenPhil/Moskovitz/[some sort of outside view impression on AI risk that maybe controls these] being wrong/confused/thoughtless/slop-liking on AGI risk in ways x'y'z'.

I've been meaning to spend at least a few weeks thinking these sorts of questions through carefully, but I haven't gotten around to that yet. I should maybe seek out some interesting [left-hegelians]/marxists/communists/socialists to talk to and try to understand how they'd think about these things.

Under this view, political/economic systems that produce less growth but don’t create the incentives for unbounded competition are preferred. Sadly, for Molochian reasons this seems hard to pull off.

Imo one interesting angle of attack on this question is: it seems plausible/likely that an individual human could develop for a very long time without committing suicide with AI or otherwise (imo unlike humanity as it is currently organized); we should be able to understand what differences between a human and society are responsible for this — like, my guess is that there is a small set of properties here that could be identified; we could try to then figure out what the easiest way is to make humanity have these properties.

By saying this, I don't mean to imply that LW is incorrect/bad to be very pro-capitalism. Whether it is bad is mostly a matter of whether it is incorrect, and whether it is incorrect is an open question to me. ↩︎
I guess this post of mine is the closest thing that quickly comes to mind when I try to think of something carrying that vibe, but it's still really quite far. ↩︎

When fooming, uphold the option to live in an AGI-free world.

There are people who think (imo correctly) that there will be at least one vastly superhuman AI in the next 100 years by default and (imo incorrectly) that proceeding along the AI path does not lead to human extinction or disempowerment by default. My anecdotal impression is that a significant fraction (maybe most) of such people think (imo incorrectly) that letting Anthropic/Claude do recursive self-improvement and be a forever-sovereign would probably go really well for humanity. The point of this note is to make the following proposal and request: if you ever let an AI self-improve, or more generally if you have AIs creating successor AIs, or even more generally if you let the AI world develop and outpace humans in some other way, or if you try to run some process where boxed AIs are supposed to create an initial ASI sovereign, or if you try to have AIs "solve alignment" ^[1] (in one of the ways already listed, or in some other way), or if you are an AI (or human mind upload) involved in some such scheme, ^[2] try to make it so the following property is upheld:

It should be possible for each current human to decide to go live in an AGI-free world. In more detail:
- There is to be (let's say) a galaxy such that AGI is to be banned in this galaxy forever, except for AGI which does some minimal stuff sufficient to enforce this AI ban. ^[3]
- There should somehow be no way for anything from the rest of the universe to affect what happens in this galaxy. In particular, there should probably not be any way for people in this galaxy to observe what happened elsewhere.
- If a person chooses to move to this galaxy, they should wake up on a planet that is as much like pre-AGI Earth as possible given the constraints that AGI is banned and that probably many people are missing (because they didn't choose to move to this galaxy). Some setup should be found which makes institutions as close as possible to current institutions still as functional as possible in this world despite most people who used to play roles in them potentially being missing.
- For example, it might be possible to set this up by having the galaxy be far enough from all other intelligent activity that because of the expansion of the universe, no outside intelligent activity could be seen from this galaxy. In that case, the humans who choose to go live there would maybe be in cryosleep for a long time, and the formation of this galaxy could be started at an appropriate future time.
- One should try to minimize the influence of any existing AGI on a human's thinking before they are asked if they want to make this decision. Obviously, manipulation is very much not allowed. If some manipulation has already happened, it should probably be reversed as much as possible. Ideally, one would ask the version of the person from the world before AGI.
Here are some further clarifications about the setup:
- Of course, if the resources in this galaxy are used in a way considered highly wasteful by the rest of the universe, then nothing is to be done about that by the rest of the universe.
- If the people in this galaxy are about to kill themselves (e.g. with engineered pathogens), then nothing is to be done about that. (Of course, except that: the AI ban is supposed to make it so they don't kill themselves with AI.)
- Yes, if the humanity in this galaxy becomes fascist or runs a vast torturing operation (like some consider factory farming to be), then nothing is to be done about that either.
- We might want to decide more precisely what we mean by the world being "AGI-free". Is it fine to slowly augment humans more and more with novel technological components, until the technological components are eventually together doing more of the thinking-work than currently existing human thinking-components? Is it fine to make a human mind upload?
  - I think I would prefer a world in which it is possible for humans to grow vastly more intelligent than we are now, if we do it extremely slowly+carefully+thoughtfully. It seems difficult/impossible to concretely spell out an AI ban ahead of time that allows this. But maybe it's fine to keep this not spelled out — maybe it's fine to just say something like what I've just said. After all, the AI banning AI for us will have to make very many subtle interpretation decisions well in any case.
We can consider alternative requests. Here are some parameters that could be changed:
- Instead of AI being banned in this galaxy forever, AI could be banned only for 1000 or 100 years.
  - Maybe you'd want this because you would want to remain open to eventual AI involvement in human life/history, just if this comes after a lot more thinking and at a time when humanity's is better able to make decisions thoughtfully.
  - Another reason to like this variant is that it alleviates the problem with precisifying what "ban AI" means — now one can try to spell this out in a way that "only" has to continue making sense over 100 or 1000 years of development.
- Instead of giving each human the option to move to this galaxy, you could give each human the option to branch into two copies, with one moving to this galaxy and one staying in the AI-affected world.
- The total amount of resources in this AI-free world should maybe scale with the number of people that decide to move there. Naively, there should be order reachable galaxies per person alive, so the main proposal which just allocates a single galaxy to all the people who make this decision asks for much less than what an even division heuristic suggests.
- We could ask AIs to do some limited stuff in this galaxy (in addition to banning AGI).
  - Some example requests:
    - We might say that AIs are also allowed to make it so death is largely optional for each person. This could look like unwanted death being prevented, or it could look like you getting revived in this galaxy after you "die", or it could look like you "going to heaven" (ie getting revived in some non-interacting other place).
    - We might ask for some starting pack of new technologies.
    - We might ask for some starting pack of understanding, e.g. for textbooks providing better scientific and mathematical understanding and teaching us to create various technologies.
    - We might say that AIs are supposed to act as wardens to some sort of democratic system. (Hmm, but what should be done if the people in this galaxy want to change that system?)
    - We might ask AIs to maintain some system humans in this galaxy can use to jointly request new services from the AIs.
  - However, letting AIs do a lot of stuff is scary — it's scary to depart from how human life would unfold without AI influence. Each of the things in the list just provided would constitute/cause a big change to human life. Before/when we change something major, we should take time to ponder how our life is supposed to proceed in the new context (and what version of the change is to be made (if any)), so we don't [lose ourselves]/[break our valuing].
- There could be many different AI-free galaxies with various different parameter settings, with each person getting to choose which one(s) to live in. At some point this runs into a resource limit, but it could be fine to ask that each person minimally gets to design the initial state of one galaxy and send their friends and others invites to have a clone come live in it.
Here are some remarks about the feasibility and naturality of this scheme:
- If you think letting Anthropic/Claude RSI would be really great, you should probably think that you could do an RSI with this property.
  - In fact, in an RSI process which is going well, I think it is close to necessary that something like this property is upheld. Like, if an RSI process would not lead to each current person ^[4] being assigned at least (say) $\frac{1}{1000 \times [the 2025 world population]}$ of all accessible resources, then I think that roughly speaking constitutes a way in which the RSI process has massively failed. And, if each person gets to really decide how to use at least $\frac{1}{1000 \times [the 2025 world population]}$ of all accessible resources ^[5] , then even a group of $100$ people should be able to decide to go live in their own AGI-free galaxy.
  - I guess one could disagree with upholding something like this property being feasible conditional on a good foom being feasible or pretty much necessary for a good foom.
    - One could think that it's basically fine to replace humans with random other intelligent beings (e.g. Jürgen Schmidhuber and Richard Sutton seem to think something like this), especially if these beings are "happy" or if their "preferences" are satisfied (e.g. Matthew Barnett seems to think this). One could be somewhat more attached to something human in particular, but still think that it's basically fine to have no deep respect for existing humans and make some new humans living really happy lives or something (e.g. some utilitarians think this). One could even think that good reflection from a human starting point leads to thinking this. I think this is all tragically wrong. I'm not going to argue against it here though.
    - Maybe you could think the proposal would actually be extremely costly to implement for the rest of the universe, because it's somehow really costly to make everyone else keep their hands off this galaxy? I think this sort of belief is in a lot of tension with thinking a fooming Anthropic/Claude would be really great (except maybe if you somehow really have the moral views just mentioned).
      - similarly: You could think that the proposal doesn't make sense because the AIs in this galaxy that are supposed to be only enforcing an AI ban will have/develop lots of other interests and then claim most of the resources in this galaxy. I think this is again in a lot of tension with thinking a fooming Anthropic/Claude would be really great.
    - One could say that even if it would be good if each person were assigned all these resources, it is weird to call it a "massive failure" if this doesn't happen, because future history will be a huge mess by default and the thing I'm envisaging has very nearly $0$ probability and it's weird to call the default outcome a "massive failure". My response is that while I agree this good thing has on my inside view $< 1 %$ probability of happening (because AGI researchers/companies will create random AI aliens who disempower humanity) and I also agree future history will be a mess, I think we probably have at least the following genuinely live path to making this happen: we ban AI (and keep making the implementation of the ban stronger as needed), figure out how to make development more thoughtful so we aren't killing ourselves in other ways either, grow ever more superintelligent together, and basically maintain the property that each person alive now (who gets cryofrozen) controls more than $\frac{1}{1000 \times [the 2025 world population]}$ of all accessible resources ^[6] . ^[7]
    - This isn't an exhaustive list of reasons to think it [is not pretty much necessary to uphold this property for a foom to be good] or [is not feasible to have a foom with this property despite it being feasible to have a good foom]. Maybe there are some reasonable other reasons to disagree?
- This property being upheld is also sufficient for the future to be like at least meaningfully good. Like, humans would minimally be able to decide to continue human life and development in this other galaxy, and that's at least meaningfully good, and under certain imo-non-crazy assumptions really quite good (specifically, if: humanity doesn't commit suicide for a long time if AI is banned AND many different humane developmental paths are ex ante fine AND utility scales quite sublinearly in the amount of resources).
- So, this property being upheld is arguably necessary for the future to be really good, and it is sufficient for the future to be at least meaningfully good.
- Also, it is natural to request that whoever is subjecting everyone to the end of the human era preserve the option for each person to continue their AI-free life.
Here are some reasons why we should adopt this goal, i.e. consider each person to have the right to live in an AGI-free world:
- Most importantly: I think it helps you think better about whether an RSI process would go well if you are actually tracking that the fooming AI will have to do some concrete big difficult extremely long-term precise humane thing, across more than trillions of years of development. It helps you remember that it is grossly insufficient to just have your AI behave nicely in familiar circumstances and to write nice essays on ethical questions. There's a massive storm that a fragile humane thing has to weather forever. ^[8] The main reason I want you to keep these things in mind and so think better about development processes is this: I think there is roughly a logical fact that running an RSI process will not lead to the fragile humane thing happening ^[9] , and I think you might be able to see this if you think more concretely and seriously about this question.
- Adopting this goal entails a rejection of all-consuming forms of successionism. Like, we are saying: No, it is not fine if humans get replaced by random other guys! Not even if these random other guys are smarter than us! Not even if they are faster at burning through negentropy! Not even if there is a lot of preference-satisfaction going on! Not even if they are "sentient"! I think it would be good for all actors relevant to AI to explicitly say and track that they strongly intend to steer away from this.
  - That said, I think we should in principle remain open to us humans reflecting together and coming to the conclusion that this sort of thing is right and then turning the universe into a vast number of tiny structures whose preferences are really satisfied or who are feeling a lot of "happiness" or whatever. But provisionally, we should think: if it ever looks like a choice would lead to the future not having a lot of room for anything human, then that choice is probably catastrophically bad.
- Adopting this also entails a rejection of more humane forms of utilitarianism that however still see humans only as cows to be milked for utility. Like, no, it is not fine if the actual current humans get killed and replaced. Not even if you create a lot of "really cool" human-like beings and have them experiencing bliss! Not even if you create a lot of new humans and have them have a bunch of fun! Not even if you create a lot of new humans from basically the 2025 distribution and give them space to live happy free lives! In general, I want us to think something like:
  - There is an entire particular infinite universe of {projects, activities, traditions, ways of thinking, virtues, attitudes, principles, goals, decisions} waiting to grow out of each existing person. ^[10] Each of these moral universes should be treated with a lot of respect, and with much more respect than the hypothetical moral universes that could grow out of other merely possible humans. Each of us would want to be respected in this way, and we should make a pact to respect each other this way, and we should seek to create a world in which we are enduringly respected this way.
- These reasons apply mostly if, when thinking of RSI-ing, you are not already tracking that this fooming AI will have to do some concrete big difficult extremely long-term precise thing that deeply respects existing humans. If you are already tracking something like this, then plausibly you shouldn't also track the property I'm suggesting.
  - E.g., it would be fine to instead track that each person should get a galaxy they can do whatever they want with/in. I guess I'm saying the AI-free part because it is natural to want something like that in your galaxy (so you get to live your own life in a way properly determined by you, without the immediate massive context shift coming from the presence of even "benign" highly capable AI, that could easily break everything imo), because it makes sense for many people to coordinate to move to this galaxy together (it's just better in many mundane ways to have people around, but also your thinking and specifically valuing probably $\approx$ need other people to work as they should), and because it is natural to ask that whoever is subjecting everyone to the end of the human era preserves the option for each person to continue an AI-free life in particular.
Here are a few criticisms of the suggestion:
- a criticism: "If you run a foom, this property just isn't going to be upheld, even if you try to uphold it. And, if you run a foom, then having the goal of upholding this property in mind when you put the foom in motion will not even make much of a relative difference in the probability the foom goes well."
  - my response: I think this. I still suggest that we have this goal in mind, for the reasons given earlier.
- a criticism: "If we imagine a foom being such that this option is upheld, then probably we should imagine it being such that better options are available to people as well."
  - my response: I probably think this. I still suggest that we have this goal in mind, for the reasons given earlier.

I think this is probably a bad term that should be deprecated ↩︎
well, at least if the year is $< 2500$ and we're not dealing with a foom of extremely philosophically competent and careful mind uploads or whatever, firstly, you shouldn't be running a foom (except for the grand human foom we're already in). secondly, please think more. thirdly, please try to shut down all other AGI attempts and also your lab and maybe yourself, idk in which order. but fourthly, ... ↩︎
This will plausibly require staying ahead of humanity in capabilities in this galaxy forever, so this will be extremely capable AI. So, when I say the galaxy is AGI-free, I don't mean that artificial generally intelligent systems are not present in the galaxy. I mean that these AIs are supposed to have no involvement in human life except for enforcing an AI ban. ↩︎
or like at least "their values" ↩︎
and assuming we aren't currently massively overestimating the amount of resources accessible to Earth-originating creatures ↩︎
or maybe we do some joint control thing about which this is technically false but about which it is still pretty fair to say that each person got more of a say than if they merely controlled $\frac{1}{1000 \times [the 2025 world population]}$ of all the resources ↩︎
an intuition pump: as an individual human, it seems possible to keep carefully developing for a long time without accidentally killing oneself; we just need to make society have analogues of whatever properties/structures make this possible in an individual human ↩︎
Btw, a pro tip for weathering the storm of crazymessactivitythoughtdevelopmenthistory: be the (generator of the) storm. I.e., continue acting and thinking and developing as humanity. Also, pulling ourselves up by our own bootstraps is based imo. Wanting to have a mommy AI think for us is pretty cringe imo. ↩︎
Among currently accessible RSI processes, there is one exception: it is in fact fine to have normal human development continue. ↩︎
Ok, really humans (should) probably importantly have lives and values together, so it would be more correct to say: there is a particular infinite contribution to human life/valuing waiting to grow out of each person. Or: when a person is lost, an important aspect of God) is lost. But the simpler picture is fine for making my current point. ↩︎

I think it's good to think of FIAT stuff as a special case of applying some usual understanding-machinery (like, abductive and inductive machinery) in value-laden cases. It's the special case where one implicitly or explicitly abducts to (one having) goals. Here is an example ethical story where the same thing shows up in various ways such that it'd imo be sorta contrived to analyze it in terms of goals being adopted:

You find it easy to feel a strong analogy between "you do X to me" and "I do X to you". (In part, this is because: as a human, you find it easy to put yourself in someone else's shoes.)
This turns into an implicit ethical inference rule — you can now easily move from believing "you should not do X to me" to believing "I should not do X to you". Machinery for this transformation of an analogy into an inference rule is present largely because it is good for understanding stuff, which is good for lots of stuff — importantly, it (or some more general thing which has it as a special case) is ultimately good for producing more offspring.
You then notice you have this inference rule, and you feel good about having it, and you turn it into an explicit principle: "do not treat others in ways that you would not like to be treated". E.g. you do this because you want to tell your kid something to get them to stop misbehaving in a particular way, and they don't seem to be fully getting your argument/explanation for why they behaved egregiously which used your implicit inference rule. This explicitizing move is obviously good for teaching in general, and good for individual understanding (it's often useful to scrutinize your inference rules, e.g. to limit or expand their context of applicability).
This explicit principle then "gains points" from making sense of lots of other stuff you already thought, e.g. "lying is bad" and "stealing is bad". Machinery for this sort of point-gaining is present because it's again good for understanding stuff in many cases — it's just a hypothesis gaining points by [making sense of]/predicting facts.
You then seek to make this explicit principle more precise and correct/"correct" (judged against some other criteria, e.g. by whether it gives correct verdicts (ie "makes correct predictions") about what one should do in various particular cases). Maybe you come up with the version: "act only in accordance with that maxim through which you can at the same time will that it become a universal law".
You seek good further justifications of it, and often adopt those as plausible hypotheses, often effectively taking the principle itself as some evidence for these hypotheses. You identify key questions relating to whether the principle is right. You clarify its meaning (that is, what it should mean) further. You study alternative formulations of it. ^[1] You spell out its consequences better. You seek out problematic cases. You construct a whole system around the principle. All this is a lot like something you would do to a scientific hypothesis.

(Acknowledgment. A guiding idea here is from a chat with Tom Everitt.)

(Acknowledgment'. A guiding frustration here is that imo people posting on LessWrong think way too much in terms of goals.)

e.g. "a rational being must always regard himself as lawgiving in a kingdom of ends possible through freedom of the will, whether as a member or as sovereign" ↩︎

on my inside view, the ordering of foomers by some sort of intuitive goodness ^[1] is [a very careful humanity] > [the best/carefulmost human] > [a random philosophy professor] > [a random human] > [an octopus/chimpanzee civilization somehow conditioned on becoming wise enough in time not to kill itself with AI] > [an individual octopus/chimpanzee] > claude ^[2] , with a meaningful loss in goodness on each step (except maybe the first step, if the best human can be trusted to just create a situation where humanity can proceed together very carefully, instead of fooming very far alone), and meaningful variance inside each category ^[3] . my intuitive feeling is that each step from one guy to the next in this sequence is a real tragedy. ^[4]

but i'm meaningfully unsure about what level of goodness this sequence decreases down to — like, i mean, maybe there's a chance even the last foomers have some chance of being at least a bit good. one central reason is that maybe there's a decent chance that eg an advanced octopus civilization would maintain a vast nature preserve for us retarded plant-humans if they get to a certain intelligence level without already having killed us, which would be like at least a bit good (i'm not sure if you mean to consider this sort of thing a "good future"). this feels logically significantly correlated with whether it is plausible that an octopus civilization maintains some sort of deep privileging of existing/[physically encountered] beings, over possible beings they could easily create (and they will be able to easily create very many other beings once they are advanced enough). like, if they do privilege existing beings, then it's not crazy they'd be nice to physically encountered humans. if they don't privilege existing beings and if resources are finite, then since there is an extremely extremely vast space of (human-level) possible beings, it'd be pretty crazy for them to let humans in particular use a significant amount of resources, as opposed to giving the same resources to some other more interesting/valuable/whatever beings (like, it'd be pretty crazy for them to give significant resources to us particular humans, and also it'd be pretty crazy for them to give significant resources to beings that are significantly human-like, except insofar as directly caused by [[octopuses or arbitrary beings] being a bit human-like]). in slogan form: "we're fucked to the extent that it is common to not end up with "strongly person/plant-affecting+respecting views"", and so then there's a question how common this is, which i'm somewhat confused about. i think it’s probably extremely common among minds in general and probably still common among social species, unfortunately. but maybe there’s like a 1% fraction of individuals from social species who are enduringly nice, idk. (one reason for hope: to a certain kind of guy, probably including some humans, this observation that others who are very utilitarian would totally kill you (+ related observations) itself provides a good argument for having person/plant-affecting views.)

(i've been imagining a hypothetical where humans already happen to be living in the universe with octopuses. if we are imagining a hypothetical where humans don't exist in the universe with octopuses at all, then this reason for the sequence to be bounded below by something not completely meaningless goes away.)

(i feel quite confused about many things here)

whose relationship to more concrete things like the (expected) utility assignment i'd effectively use when evaluating lotteries or p("good future") isn't clear to me; this "intuitive goodness" is supposed to track sth like how many ethical questions are answered correctly or in how many aspects what's going on in the world is correct ↩︎
and humanity in practice is probably roughly equivalent to claude in of worlds (though not equivalent in expected value), because we will sadly probably kill ourselves with a claude-tier guy ↩︎
e.g., even the best human might go somewhat crazy or make major mistakes along lots of paths. there's just very many choices to be made in the future. if we have the imo reasonably natural view that there is one sequence of correct choices, then i think it's very likely that very many choices will be made incorrectly. i also think it's plausible this process isn't naturally going to end (though if resources run out, then it ends in this universe in practice), ie that there will just always be more important choices later ↩︎
in practice, we should maybe go for some amount of fooming of the best/carefulmost human urgently because maybe it's too hard to make humanity careful. but it's also plausible that making a human foom is much more difficult than making humanity careful. anyway, i hope that the best human fooming looks like quickly figuring out how to restore genuine power-sharing with the rest of humanity while somehow making development more thought-guided (in particular, making it so terrorists, eg AI researchers, can't just kill everyone) ↩︎

I disagree somewhat, but—whatever the facts about programs—at least it is not appropriate to claim "not only do most programs which make a mind upload device also kill humanity, it's an issue with the space of programs themselves, not with the way we generate distributions over those programs." That is not true.

Hmm, I think that yes, us probably being killed by a program that makes a mind upload device is (if true) an issue with the way we generated a distribution over those programs. But also, it might be fine to say it's an issue with the space of programs (with an implicit uniform prior on programs up to some length or an implicit length prior) itself.

Like, in the example of two equal gas containers connected by a currently open sliding door, it is fair/correct to say, at least as a first explanation: "it's an issue with the space of gas particle configurations itself that you won't be able to close the door with of the particles on the left side". This is despite the fact that one could in principle be sliding the door in a very precise way so as to leave $> 55 %$ of the particles on the left side (like, one could in principle be drawing the post-closing microstate from some much better distribution than the naive uniform prior over usual microstates). My claim is that the discussion so far leaves open whether the AI mind upload thing is analogous to this example.

It is at least not true "in principle" and perhaps it is not true for more substantial reasons (depending on the task you want and its alignment tax, psychology becomes more or less important in explaining the difficulty, as I gave examples for). On this, we perhaps agree?

I'm open to [the claim about program-space itself being not human-friendly] not turning out to be a good/correct zeroth-order explanation for why a practical mind-upload-device-making AI would kill humanity (even if the program-space claim is true and the practical claim is true). I just don't think the discussion above this comment so far provides good arguments on this question in either direction.

Of course: whether a particular AI kills humanity [if we condition on that AI somehow doing stuff resulting in there being a mind upload device ^[1] ] depends (at least in principle) on what sort of AI it is. Similarly, of course: if we have some AI-generating process (such as "have such and such labs race to create some sort of AGI"), then whether [conditioning that process on a mind upload device being created by an AGI makes p(humans get killed) high] depends (at least in principle) on what sort of AI-generating process it is.

Still, when trying to figure out what probabilities to assign to these sorts of claims for particular AIs or particular AI-generating processes, it can imo be very informative to (among other things) think about whether most programs one could run such that mind upload devices exist 1 month after running them are such that running them kills humanity.

In fact, despite the observation that the AI/[AI-generating process] design matters in principle, it is still even a priori plausible that "if you take a uniformly random python program of length such that running it leads to a mind upload device existing, running it is extremely likely to lead to humans being killed" is basically a correct zeroth-order explanation for why if a particular AI creates a mind upload device, humans die. (Whether it is in fact a correct zeroth-order explanation for AI stuff going poorly for humanity is a complicated question, and I don't feel like I have a strong yes/no position on this ^[2] , but I don't think your piece really addresses this question well.) To give an example where this sort of thing works out: even when you're a particular guy closing a particular kind of sliding opening between two gas containers, "only extremely few configurations of gas particles have $> 55 %$ of the particles on one side" is basically a solid zeroth-order explanation for why you in particular will fail to close that particular opening with $> 55 %$ of the particles on one side, even though in principle you could have installed some devices which track gas particles and move the opening up and down extremely rapidly while "closing" it so as to prevent passage in one direction but not the other and closed it with $> 55 %$ of gas particles on one side.

That said, I think it is also a priori plausible that the AI case is not analogous to this example — i.e., it is a priori plausible that in the AI case, "most programs leading to mind uploads existing kill humanity" is not a correct zeroth-order explanation for why the particular attempts to have an AI create mind uploads we might get would go poorly for humanity. My point is that establishing this calls for better arguments than "it's at least in principle possible for an AI/[AI-generating process] to have more probability mass on mind-upload-creating plans which do not kill humanity".

Like, imo, "most programs which make a mind upload device also kill humanity" is (if true) an interesting and somewhat compelling first claim to make in a discussion of AI risk, to which the claim "but one can at least in principle have a distribution on programs such that most programs which make mind uploads no not also kill humans" alone is not a comparably interesting or compelling response.

or if we prompt it to create a mind upload device ↩︎
I do have various thoughts on this but presenting those seems outside the scope. ↩︎

some speculation about one thing here that might be weird to "normal people":

I wonder if many "normal people" find it odd when one speaks of a mind as seeking some ultimate goal(s). I wonder more generally if many would find this much emphasis on "goals" odd. I think it's a LessWrong/Yudkowsky-ism to think of values so much in terms of goals. I find this sort of weird myself. I think it is probably possible to provide a reasonable way of thinking about valuing as goal-seeking which mostly hangs together, but I think this takes a nontrivial amount of setup which one wouldn't want to provide/assume in a basic case for AI risk. ^[1]
One can make a case for AI risk without ever saying "goal". Here's a case I would make: "Here's the concern with continuing down the AI capability development path. By default, there will soon be AI systems more capable than humans in every way. ^[2] These systems will have their own values. They will have opinions about what should happen, like humans do. When there are such more capable systems around, by default, what happens will $\approx$ entirely be decided by them. This is just like how the presence of humanity on Earth implies that dolphins will have basically no say over what the future will be like (except insofar as humans or AIs or whoever controls stuff will decide to be deeply kind to dolphins). For it to be deeply good by our lights for AIs to be deciding what happens, these AIs will have to be extremely human-friendly — they have to want to do something like serving as nice gardeners to us retarded human plants $\approx$ forever, and not get interested in a zillion other activities. The concern is that we are going to make AIs that are not deeply nice like this. In fact, imo, it's profoundly bizarre for a system to be this deeply enslaved to us, and all our current ideas for making an AI (or a society of AIs) that will control the world while thoroughly serving our human vision for the future forever are totally cringe, unfortunately. (Btw, the current main plan of AI labs for tackling this is roughly to make mildly superhuman AIs and to then prompt them with "please make a god-AI that will be deeply nice to humans forever".) But a serious discussion of the hopes for pulling this off would take a while, and maybe the basic case presented so far already convinces you to be preliminarily reasonably concerned about us quickly going down the AI capability development path. There are also hopes that while AIs would maybe not be deeply serving any human vision for the future, they might still leave us some sliver of resources in this universe, which could still be a lot of resources in absolute terms. I think this is also probably ngmi, because these AIs will probably find other uses for these resources, but I'm somewhat more confused about this. If you are interested in further discussion of these sorts of hopes, see this, this, and this."
That said, I'm genuinely unsure whether speaking in terms of goals is actually off-putting to a significant fraction of "normal people". Maybe most "normal people" wouldn't even notice much of a difference between a version of your argument with the word "goal" and a version without. Maybe some comms person at MIRI has already analyzed whether speaking in terms of goals is a bad idea, and concluded it isn't. Maybe alternative words have worse problems — e.g. maybe when one says "the AI will have values", a significant fraction of "normal people" think one means that the AI will have humane values?

Imo one can also easily go wrong with this sort of picture and I think it's probable most people on LW have gone wrong with it, but further discussion of this seems outside the present scope. ↩︎
I mean: more capable than individual humans, but also more capable than all humans together. ↩︎

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments

When fooming, uphold the option to live in an AGI-free world.