on my inside view, the ordering of foomers by some sort of intuitive goodness [1] is [a very careful humanity] > [the best/carefulmost human] > [a random philosophy professor] > [a random human] > [an octopus/chimpanzee civilization somehow conditioned on becoming wise enough in time not to kill itself with AI] > [an individual octopus/chimpanzee] > claude [2] , with a meaningful loss in goodness on each step (except maybe the first step, if the best human can be trusted to just create a situation where humanity can proceed together very carefully, instead of fooming very far alone), and meaningful variance inside each category [3] . my intuitive feeling is that each step from one guy to the next in this sequence is a real tragedy. [4]
but i'm meaningfully unsure about what level of goodness this sequence decreases down to — like, i mean, maybe even the last foomers have some chance of being at least a bit good. one central reason is that maybe there's a decent chance that eg an advanced octopus civilization would maintain a vast nature preserve for us retarded plant-humans if they get to a certain intelligence level without already having killed us, which would be at least a bit good (i'm not sure if you mean to consider this sort of thing a "good future"). this feels logically significantly correlated with whether it is plausible that an octopus civilization maintains some sort of deep privileging of existing/[physically encountered] beings, over possible beings they could easily create (and they will be able to easily create very many other beings once they are advanced enough). like, if they do privilege existing beings, then it's not crazy they'd be nice to physically encountered humans. if they don't privilege existing beings and if resources are finite, then since there is an extremely extremely vast space of (human-level) possible beings, it'd be pretty crazy for them to let humans in particular use a significant amount of resources, as opposed to giving the same resources to some other more interesting/valuable/whatever beings (like, it'd be pretty crazy for them to give significant resources to us particular humans, and also it'd be pretty crazy for them to give significant resources to beings that are significantly human-like, except insofar as directly caused by [[octopuses or arbitrary beings] being a bit human-like]). in slogan form: "we're fucked to the extent that it is common to not end up with 'strongly person/plant-affecting+respecting views'", and so then there's a question of how common this is, which i'm somewhat confused about. i think it's probably extremely common among minds in general and probably still common among social species, unfortunately. but maybe there's like a 1% fraction of individuals from social species who are enduringly nice, idk. (one reason for hope: to a certain kind of guy, probably including some humans, this observation that others who are very utilitarian would totally kill you (+ related observations) itself provides a good argument for having person/plant-affecting views.)
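(purely as a toy illustration of the counting point, with made-up numbers that aren't from anything above: if an advanced civilization with no existing-being premium splits resources roughly impartially over the space of human-level minds it could instantiate, the share landing on actual humans is on the order of)

```latex
% toy counting, purely illustrative: share of resources reaching actual humans
% under an impartial split over a space of ~2^K possible human-level minds
\frac{N_{\text{actual humans}}}{2^{K}} \;\approx\; \frac{10^{10}}{2^{K}} \;\approx\; 0
\quad \text{for any sizable } K
```

(so without some privileging of existing/encountered beings, "give the humans a nature preserve" would have to come from something other than impartial allocation over possible beings.)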
(i've been imagining a hypothetical where humans already happen to be living in the universe with octopuses. if we are imagining a hypothetical where humans don't exist in the universe with octopuses at all, then this reason for the sequence to be bounded below by something not completely meaningless goes away.)
(i feel quite confused about many things here)
whose relationship to more concrete things like the (expected) utility assignment i'd effectively use when evaluating lotteries or p("good future") isn't clear to me; this "intuitive goodness" is supposed to track something like how many ethical questions are answered correctly or in how many aspects what's going on in the world is correct ↩︎
and humanity in practice is probably roughly equivalent to claude in a large fraction of worlds (though not equivalent in expected value), because we will sadly probably kill ourselves with a claude-tier guy ↩︎
e.g., even the best human might go somewhat crazy or make major mistakes along lots of paths. there's just very many choices to be made in the future. if we have the imo reasonably natural view that there is one sequence of correct choices, then i think it's very likely that very many choices will be made incorrectly. i also think it's plausible this process isn't naturally going to end (though if resources run out, then it ends in this universe in practice), ie that there will just always be more important choices later ↩︎
in practice, we should maybe go for some amount of fooming of the best/carefulmost human urgently because maybe it's too hard to make humanity careful. but it's also plausible that making a human foom is much more difficult than making humanity careful. anyway, i hope that the best human fooming looks like quickly figuring out how to restore genuine power-sharing with the rest of humanity while somehow making development more thought-guided (in particular, making it so terrorists, eg AI researchers, can't just kill everyone) ↩︎
I disagree somewhat, but—whatever the facts about programs—at least it is not appropriate to claim "not only do most programs which make a mind upload device also kill humanity, it's an issue with the space of programs themselves, not with the way we generate distributions over those programs." That is not true.
Hmm, I think that yes, us probably being killed by a program that makes a mind upload device is (if true) an issue with the way we generated a distribution over those programs. But also, it might be fine to say it's an issue with the space of programs (with an implicit uniform prior on programs up to some length or an implicit length prior) itself.
Like, in the example of two equal gas containers connected by a currently open sliding door, it is fair/correct to say, at least as a first explanation: "it's an issue with the space of gas particle configurations itself that you won't be able to close the door with all of the particles on the left side". This is despite the fact that one could in principle be sliding the door in a very precise way so as to leave all of the particles on the left side (like, one could in principle be drawing the post-closing microstate from some much better distribution than the naive uniform prior over usual microstates). My claim is that the discussion so far leaves open whether the AI mind upload thing is analogous to this example.
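(A minimal worked version of that first explanation, under an idealization of mine that isn't stated above: each of the N particles is independently equally likely to end up on either side at the moment of closing.)

```latex
% Under the naive uniform prior over microstates (each of N particles
% independently on either side with probability 1/2):
P(\text{all } N \text{ particles on the left}) = 2^{-N}
  \approx 10^{-3 \times 10^{22}} \quad \text{for } N \sim 10^{23}.
% The "very precise sliding" option corresponds to drawing the post-closing
% state from a very different, non-uniform distribution instead.
```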
It is at least not true "in principle" and perhaps it is not true for more substantial reasons (depending on the task you want and its alignment tax, psychology becomes more or less important in explaining the difficulty, as I gave examples for). On this, we perhaps agree?
I'm open to [the claim about program-space itself being not human-friendly] not turning out to be a good/correct zeroth-order explanation for why a practical mind-upload-device-making AI would kill humanity (even if the program-space claim is true and the practical claim is true). I just don't think the discussion above this comment so far provides good arguments on this question in either direction.
Of course: whether a particular AI kills humanity [if we condition on that AI somehow doing stuff resulting in there being a mind upload device [1] ] depends (at least in principle) on what sort of AI it is. Similarly, of course: if we have some AI-generating process (such as "have such and such labs race to create some sort of AGI"), then whether [conditioning that process on a mind upload device being created by an AGI makes p(humans get killed) high] depends (at least in principle) on what sort of AI-generating process it is.
Still, when trying to figure out what probabilities to assign to these sorts of claims for particular AIs or particular AI-generating processes, it can imo be very informative to (among other things) think about whether most programs one could run such that mind upload devices exist 1 month after running them are such that running them kills humanity.
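(Here is a toy sketch of my own to make the "it depends on the distribution over programs" point concrete; nothing in it is from the discussion above. The "programs" are short random sequences of arithmetic ops, "a mind upload device exists" is stood in for by the final value landing in a target range, and "humans get killed" by the running value ever going negative. The only point is that the conditional probability of the side effect, given the outcome, is a fact about whichever distribution over programs you sample from, and the naive uniform choice pins it down in one particular way.)

```python
import random

# Toy stand-ins, all invented for this sketch: a "program" is a short
# sequence of ops applied to an integer state that starts at 0.
OPS = {
    "inc": lambda x: x + 1,   # +1
    "dbl": lambda x: x * 2,   # *2
    "sub3": lambda x: x - 3,  # -3 (the "risky" op in this toy)
}

def run(program, start=0):
    """Run a toy program; return its final value and whether the state
    ever went negative along the way (the stand-in "bad side effect")."""
    x, went_negative = start, False
    for op in program:
        x = OPS[op](x)
        went_negative = went_negative or x < 0
    return x, went_negative

def conditional_bad_rate(op_weights, n_samples=100_000, length=8, target=range(10, 20)):
    """Sample programs from the given distribution over ops; among those whose
    final value lands in `target` (the stand-in "mind upload device exists"),
    estimate the fraction that also had the bad side effect."""
    names = list(OPS)
    weights = [op_weights[name] for name in names]
    hits = bad = 0
    for _ in range(n_samples):
        program = random.choices(names, weights=weights, k=length)
        final, went_negative = run(program)
        if final in target:
            hits += 1
            bad += went_negative
    return (bad / hits if hits else float("nan")), hits

# Naive "uniform prior over programs" vs a distribution tuned away from the risky op.
print(conditional_bad_rate({"inc": 1, "dbl": 1, "sub3": 1}))
print(conditional_bad_rate({"inc": 5, "dbl": 3, "sub3": 1}))
```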
In fact, despite the observation that the AI/[AI-generating process] design matters in principle, it is still even a priori plausible that "if you take a uniformly random python program of length up to some bound such that running it leads to a mind upload device existing, running it is extremely likely to lead to humans being killed" is basically a correct zeroth-order explanation for why, if a particular AI creates a mind upload device, humans die. (Whether it is in fact a correct zeroth-order explanation for AI stuff going poorly for humanity is a complicated question, and I don't feel like I have a strong yes/no position on this [2] , but I don't think your piece really addresses this question well.) To give an example where this sort of thing works out: even when you're a particular guy closing a particular kind of sliding opening between two gas containers, "only extremely few configurations of gas particles have all of the particles on one side" is basically a solid zeroth-order explanation for why you in particular will fail to close that particular opening with all of the particles on one side, even though in principle you could have installed some devices which track gas particles and move the opening up and down extremely rapidly while "closing" it so as to prevent passage in one direction but not the other, and closed it with all of the gas particles on one side.
That said, I think it is also a priori plausible that the AI case is not analogous to this example — i.e., it is a priori plausible that in the AI case, "most programs leading to mind uploads existing kill humanity" is not a correct zeroth-order explanation for why the particular attempts to have an AI create mind uploads we might get would go poorly for humanity. My point is that establishing this calls for better arguments than "it's at least in principle possible for an AI/[AI-generating process] to have more probability mass on mind-upload-creating plans which do not kill humanity".
Like, imo, "most programs which make a mind upload device also kill humanity" is (if true) an interesting and somewhat compelling first claim to make in a discussion of AI risk, to which the claim "but one can at least in principle have a distribution on programs such that most programs which make mind uploads no not also kill humans" alone is not a comparably interesting or compelling response.
some speculation about one thing here that might be weird to "normal people":
Centrally, a lie is a statement that contradicts reality, and
my initial reaction to this was: "what? a lie doesn't have to contradict reality, right? eg if i thought that 2+2=5, then if i told you that 2+2=4, i'd be lying to you, right?"
but then i looked at the google definition of a lie and was surprised to see it agreed with this sentence of your post. but i sort of still don't believe this is really the canonical meaning. chatgpt seems to agree with me lol: https://chatgpt.com/share/696eed66-ab40-800f-9157-0e7d04f5362a
(of course we can choose to use the word either way. i'm mostly saying this because i think it's plausible your reaction will just be "oops". if you stand by this meaning, then probably one should discuss which notion better fits the ways in which we already want to use the term, but i'm not actually that interested in having that discussion)
the AI safety community sees such a strong break with the rest of the ML community
i don't want to make any broader point in the present discussion with this but: the AI safety community is not inside the ML community (and imo shouldn't be)
to clarify a bit: my claim was that there are 10k individuals in history who have contributed at least at the same order of magnitude (as wentworth) to our understanding of concepts — like, in terms of pushing human understanding further compared to the state of understanding before their work. we can be interested in understanding what this number is for this reason: it can help us understand whether it's plausible that this line of inquiry is just about to find some sort of definitive theory of concepts. (i expect you will still have a meaningfully lower number. i could be convinced it's more like 1000 but i think it's very unlikely to be like 100.) i think wentworth is obviously much higher than this bar, eg if you rank people on publicly displayed alignment understanding, he's very likely in the top 10
If I try to imagine a world in which AIs somehow look like this around AGI (like, around when the "tasks" these AIs could do start including solving millennium prize problems), I strongly feel like I should then imagine something like humans prompting an AI (or a society of AIs) with like "ok now please continue on your path to becoming a god and make things super-duper-good (in the human sense) forever" (this could be phrased more like "please run our companies/states/etc. while being really good" or "please make an initial friendly ASI sovereign" or "please solve alignment" or whatever), with everything significant being done by AIs forever after. And I think it's very unlikely this leads to a future remotely as good as it could be — it'll lead to something profoundly inhuman instead.
Basically, it seems to me like you're making the mistake of Aristotelians that Francis Bacon points out in the Baconian Method (or Novum Organum generally):
the intellect mustn't be allowed •to jump—to fly—from particulars a long way up to axioms that are of almost the highest generality... Our only hope for good results in the sciences is for us to proceed thus: using a valid ladder, we move up gradually—not in leaps and bounds—from particulars to lower axioms, then to middle axioms, then up and up...

Aka, you look at a few examples, and directly try to find a general theory of abstraction. I think this makes your theory overly simplistic and probably basically useless.
Like, when I read Natural Latents: The Concepts, I already had a feeling that the post was trying to explain too much at once - lumping together, as natural latents, things that seem very importantly different, and in some cases natural latents seemed like a dubious fit. I started to form an intuitive distinction in my mind between objects (like a particular rigid body) and concepts (like clusters in thingspace, like "tree" as opposed to a particular tree), although I couldn't explain it well at the time. Later I studied a bit of formal language semantics, and the distinction there is just total 101 basics.
I studied language a bit and tried to carve up in a bit more detail what types of abstractions there are, which I wrote up here. But really I think that's still too abstract and too top-down, and one probably needs to study particular words in a lot of detail, then similar words, etc.
Not that this kind of study of language is necessarily the best way to proceed with alignment - I didn't continue it after my 5-month language-and-orcas exploration. But I do think concretely studying observations and abstracting slowly is important.
+1 to this. to me this looks like understanding some extremely toy cases a bit better and thinking you're just about to find some sort of definitive theory of concepts. there's just SO MUCH different stuff going on with concepts! wentworth+lorell's work is interesting, but so much more has been understood about concepts even in other existing literature than in wentworth+lorell's work (i'd probably say there are at least 10000 contributions to our understanding of concepts in at least the same tier), with imo most of the work remaining! there are SO MANY questions! there's a lot of different structure in eg a human mind that is important for our concepts working! minds are really big, and not just in content but also in structure (including the structure that makes concepting tick in humans)! and minds are growing/developing, and not just in content but also in structure (including the structure that makes concepting tick in humans)! "what's the formula for good concepts?" should sound to us like "what's the formula for useful technologies?" or "what's the formula for a strong economy?". there are very many ideas that go into having a strong economy, and there are probably very many ideas that go into having a powerful conceptive system. this has mostly just been a statement of my vibe/position on this matter, with few arguments, but i discuss this more here.
on another note: "retarget the search to human values" sounds nonsensical to me. by default (at least without fundamental philosophical progress on the nature of valuing, but imo probably even given this, at least before serious self-re-programming), values are implemented in a messy(-looking) way across a mind, and changing a mind's values to some precise new thing is probably in the same difficulty tier as re-writing a new mind with the right values from scratch, and not doable with any small edit
concretely, what would it look like to retarget the search in a human so that (if you give them tools to become more capable and reasonable advice on how to become more capable "safely"/"value-preservingly") they end up proving the riemann hypothesis, then printing their proof on all the planets in this galaxy, and then destroying all intelligent life in the galaxy (and committing suicide)? this is definitely a simpler thing than object-level human values, and it's plausibly more natural than human values even in a world in which there is already humanity that you can try to use as a pointer to human values. it seems extremely cursed to make this edit in a human. some thoughts on a few approaches that come to mind:
maybe the position is "humans aren't retargetable searchers (in their total structure, in the way needed for this plan), but the first AGI will probably be one". it seems very likely to me that values will in fact be diffusely and messily implemented in that AGI as well. for example, there won't even remotely be a nice cleavage between values and understanding
a response: the issue is that i've chosen an extremely unnatural task. a counterresponse: it's also extremely unnatural to have one's valuing route through an alien species, which is what the proposal wants to do to the AI ↩︎
that said, i think it's also reasonably natural to be the sort of guy who would actively try to undo any supposed value changes after the fact, and it's reasonably natural to be the sort of guy whose long-term future is more governed by stuff these edits don't touch. in these cases, these edits would not affect the far future, at least not in the straightforward way ↩︎
these are all given their correct meaning/function only in the context of their very particular mind, in basically all its parts. so i could also say: their mind just kicks in again in general. ↩︎
I think it's good to think of FIAT stuff as a special case of applying some usual understanding-machinery (like, abductive and inductive machinery) in value-laden cases. It's the special case where one implicitly or explicitly abducts to (one having) goals. Here is an example ethical story where the same thing shows up in various ways such that it'd imo be sorta contrived to analyze it in terms of goals being adopted:
(Acknowledgment. A guiding idea here is from a chat with Tom Everitt.)
(Acknowledgment'. A guiding frustration here is that imo people posting on LessWrong think way too much in terms of goals.)
e.g. "a rational being must always regard himself as lawgiving in a kingdom of ends possible through freedom of the will, whether as a member or as sovereign" ↩︎