It's also easy — if you want to be like this, you just can.
I think you can easily choose to follow a policy of never saying things you know to be false. (Easy in the sense of "considering only the internal costs of determining and executing the action consistent with this policy, ignoring the external costs, e.g. losing your job and friends".) But I'm not sure it's easy to do the extra thing of "And you would never try to forget or [confuse yourself about] a fact with the intention to make yourself able to assert some falsehood in the future without technically lying, etc."
I'd really want to read essays you wrote about Parfit's hitchhiker or one-shot prisoner's dilemmas or something
My method would look something like:
NB: I think that, perhaps, it will be easier to make/find/identify an honourable AI than an honourable human, because:
In humans, it seems important for being honest/honorable that there was at some point something like an explicit decision to be honest/honorable going forward (or maybe usually many explicit decisions, committing to stronger forms in stages). This makes me want to have the criterion/verifier/selector [1] check (among other things) for something like having a diary entry or chat with a friend in which the AI says they will be honest going forward, written in the course of their normal life, in a not-very-prompted way. And it would of course be much better if this AI did not suspect that anyone was looking at it from the outside, or know about the outside world at all (but this is unfortunately difficult/[a big capability hit] I think). (And things are especially cursed if AIs suspect observers are looking for honest guys in particular.)
I mean, in the setup following "a framing:" in the post ↩︎
I'm sceptical that "they wrote an essay defending the position that one should act honourably in weird situations" is stronger evidence for "they will act honourably in weird situations" than "they act honourably in normal situations". this is because I've updated towards a more Hansonian worldview, which is more cynical about people's essays.[1]
but maybe you can conclude that someone will act honourably in weird situations if:
this last criterion is something like a "taking ideas seriously" or "real-thinking" or "sincerity".
now, with humans, it's tricky to be sure of (3). that's because we can't put someone in the weird situations in which decision theories diverge;[2] at least, not without them knowing they are in an evaluation, and certainly not for 10,000 stateless, parallelised trials.
but with AIs, we plausibly can get a decent guarantee of (3)! so I'm more bullish on getting guarantees of honourability for AIs.
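to make the contrast concrete, here's a minimal sketch of what "many stateless, parallelised trials" could look like for an AI candidate. everything named here (MockCandidate, WEIRD_SCENARIOS, acted_honourably) is a hypothetical stand-in, not machinery from the post; the point is just that every trial restarts from the same frozen snapshot, so nothing about earlier trials can leak into later ones, and the whole batch can run in parallel, which is exactly what we can't do to a human.

```python
from concurrent.futures import ProcessPoolExecutor

# Everything below is a hypothetical stand-in, not real machinery from the post.

WEIRD_SCENARIOS = [
    "parfit's hitchhiker (variant)",
    "one-shot prisoner's dilemma",
    "newcomb's problem (variant)",
]

class MockCandidate:
    """Stand-in for a frozen snapshot of the candidate AI."""
    def run(self, scenario: str) -> str:
        # In reality this would be the candidate living through the scenario
        # without knowing it is being evaluated.
        return f"(candidate's behaviour in: {scenario})"

def load_candidate(snapshot_id: str) -> MockCandidate:
    """Stand-in: reload the same frozen snapshot for every trial (statelessness)."""
    return MockCandidate()

def acted_honourably(transcript: str) -> bool:
    """Stand-in judge; the actual hard part of the proposal lives here."""
    return True

def run_trial(scenario: str) -> bool:
    candidate = load_candidate("snapshot-0")  # fresh copy each time: no memory of earlier trials
    return acted_honourably(candidate.run(scenario))

if __name__ == "__main__":
    # ~10,000 stateless trials over weird decision-theoretic scenarios, run in parallel.
    trials = WEIRD_SCENARIOS * (10_000 // len(WEIRD_SCENARIOS))
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_trial, trials))
    print(f"acted honourably in {sum(results)}/{len(results)} trials")
```

(the judge and the snapshot machinery are of course where all the difficulty lives; the sketch only illustrates the stateless/parallel shape of the evaluation, which is the part that's unavailable for humans.)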
some cloud of sentiments around: (i) people have little introspective access about what they will do, (ii) they write essays in far-mode, but act in near-mode, (iii) people are, like, lying pretty much all the time, (iv) people write essays for status-y reasons, etc.
we can't put derek parfit in parfit's hitchhiker, or put william newcomb in newcomb's problem
If you truly have a trusted process for creating honorable AIs, why not keep churning out and discarding different honorable AIs until one says "I promise I'm aligned" or whatever?
I suppose you're not assuming our ability to make AIs honorable like this will be robust to selection pressure?
I agree you could ask your AI "will you promise to be aligned?". I think I already discuss this option in the post — ctrl+f "What promise should we request?" and see the stuff after it. I don't use the literal wording you suggest, but I discuss things which are ways to cash it out imo.
also quickly copying something I wrote on this question from a chat with a friend:
Should we just ask the AI to promise to be nice to us? I agree this is an option worth considering (and I mention it in the post), but I'm not that comfortable with the prospect of living together with the AI forever. Roughly I worry that "be nice to us" creates a situation where we are more permanently living together with the AI and human life/valuing/whatever isn't developing in a legitimate way. Whereas the "ban AI" wish tries to be a more limited thing so we can still continue developing in our own human way. I think I can imagine this "be nice to us pls" wish going wrong for aliens employing me, when maybe "pls just ban AI and stay away from us otherwise" wouldn't go wrong for them.
another meta note: Imo a solid trick for thinking better about these AI topics is to (at least occasionally) taboo all words with the root "align".
[I feel like I may have a basic misunderstanding of what you're saying.]
I haven't thought deeply enough about it, but one guess: The version of honorability/honesty that humans do is only [kinda natural for very bounded minds].
There's a more complex boundary where you're honest with minds who can tell if you're being honest, and not honest with those who can't. This is a more natural boundary to use because it's more advantageous.
You mention wanting to see someone's essays about Parfit's hitchhiker... But that situation requires Ekman to be very good at telling what you'll do. We're not very good at telling what an alien will do.
I think there are humans who, even for weird aliens, would make this promise and stick to it, with this going basically well for the aliens.
Would you guess I have this property? At a quick check, I'm not sure I do. Which is to say, I'm not sure I should. If a Baby-Eater is trying to get a promise like this from me, AND it would totally work to trick them, shouldn't I trick them?
I feel like I may have a basic misunderstanding of what you're saying.
Btw, if the plan looks silly, that's compatible with you not having a misunderstanding of the plan, because it is a silly plan. But it's still the best answer I know to "concretely how might we make some AI alien who would end the present period of high x-risk from AGI, even given a bunch more time?". (And this plan isn't even concrete, but what's a better answer?) But it's very sad that/if it's the best existing answer.
When I talk to people about this plan, a common misunderstanding seems to be that the plan involves making a deal with an AI that's smarter than us. So I'll stress just in case: at the time we ask for the promise, the AI is supposed to be close to us in intelligence. It might need to become smarter than us later, to ban AI. But also idk, maybe it doesn't need to become much smarter. I think it's plausible that a top human who just runs faster and can make clones but who doesn't self-modify in other non-standard ways could get AI banned in like a year. Less clever ways for this human to get AI banned depend on the rest of the world not doing much in response quickly, but looking at the world now, this seems pretty plausible. But maybe the AI in this hypothetical would need to grow more than such a human, because the AI starts off not being that familiar with the human world?
Anyway, there are also other possible misunderstandings, but hopefully the rest of the comment will catch those if they are present.
The version of honorability/honesty that humans do is only [kinda natural for very bounded minds].
I'm interested in whether that's true, but I want to first note that I feel like the plan would survive this being true. It might help to distinguish between two senses in which honorability/honesty could be dropped at higher intelligence levels:
given this distinction, some points:
(I also probably believe somewhat less in (thinking in terms of) ideal(-like) beings.)
There's a more complex boundary where you're honest with minds who can tell if you're being honest, and not honest with those who can't. This is a more natural boundary to use because it's more advantageous.
I think one would like to broadcast to the broader world "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", so that others make offers to you even when they can't mindread/predict you. I think there are reasons to not broadcast this falsely, e.g. because doing this would hurt your ability to think and plan together with others (for example, if the two of us weren't honest about our own policies, it would make the present discussion cursed). If one accepts these two points, then one wants to be the sort of guy who can truthfully broadcast "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", and so one wants to be the sort of guy who in fact would be honorable even to someone who comes to them with an offer but can't mindread/predict them.
(I'm probably assuming some stuff here without explicitly saying I'm assuming it. In some settings, maybe one could be honest with one's community and broadcast a falsehood to some others and get away with it. The hope is that this sort of argument makes sense for some natural mind community structures, or something. It'd be especially nice if the argument made sense even at intelligence levels much above humans.)
You mention wanting to see someone's essays about Parfit's hitchhiker... But that situation requires Ekman to be very good at telling what you'll do. We're not very good at telling what an alien will do.
I'll try to spell out an analogy between parfit's hitchhiker and the present case.
Let's start from the hitchhiker case and apply some modifications. Suppose that when Ekman is driving through the desert, he already reliably reads whether you'd pay from your microexpressions before even talking to you. This doesn't really seem more crazy than the original setup, and if you think you should pay in the original case, presumably you'll think you should pay in this case as well. Now we might suppose that he is already doing this from binoculars when you don't even know he is there, and not even bothering to drive up to you if he isn't quite sure you'd pay. Now, let's imagine you are the sort of guy that honestly talks to himself out loud about what he'd do in weird situations of the kind Ekman is interested in, while awaiting potential death in the desert. Let's imagine that instead of predicting your action from your microexpressions while spying on you with binoculars, Ekman might be spying on you from afar with a parabolic microphone, and using this to predict your action. If Ekman is very good at that as well, then of course this makes no difference again. Okay, but in practice, a non-ideal Ekman might listen to what you're saying about what you'd do in various cases, listen to you talking about your honesty/honor-relevant principles and spelling out aspects of your policy. Maybe some people would lie about these things even when they seem to be only talking to themselves, but even non-ideal Ekman can pretty reliably tell if that's what's going on. For some people, it will be quite unclear, but it's just not worth it for non-ideal Ekman to approach them (maybe there are many people in the desert, and non-ideal Ekman can only help one anyway).
Now we've turned parfit's hitchhiker into something really close to our situations with humans and aliens appearing in simulated big evolutions, right? [3] I think it's not an uncommon vibe that EDT/UDT thinking still comes close to applying in some real-world cases where the predictors are far from ideal, and this seems like about as close to ideal as it would get among current real-world non-ideal cases? (Am I missing something?) [4]
Would you guess I have this property? At a quick check, I'm not sure I do. Which is to say, I'm not sure I should. If a Baby-Eater is trying to get a promise like this from me, AND it would totally work to trick them, shouldn't I trick them?
I'm not going to answer your precise question well atm. Maybe I'll do that in another comment later. But I'll say some related stuff.
aren't basically all your commitments a lot like this though... ↩︎
I also sort of feel like saying: "if one can't even keep a promise, as a human who goes in deeply intending to keep the promise, self-improving by [what is in the grand scheme of things] an extremely small amount, doing it really carefully, then what could ever be preserved in development at all? things surely aren't that cursed... maybe we just give up on the logical possible worlds in which things are that cursed...". But this is generally a disastrous kind of reasoning — it makes one not live in reality very quickly — so I won't actually say this, I'll only say that I feel like saying this, but then reject the thought, I guess. ↩︎
Like, I'm e.g. imagining us making alien civilizations in which there are internal honest discussions like the present discussion. (Understanding these discussions would be hard work; this is a place where this "plan" is open-ended.) ↩︎
Personally, I currently feel like I haven't made up my mind about this line of reasoning. But I have a picture of what I'd do in the situation anyway, which I discuss later. ↩︎
2 feels meaningfully stronger/[less likely] than 1 to me
Well I agree it's different and, depending on the interpretation, logically strictly stronger. But I think it's still quite likely, because you should go back on your commitments to Baby-Eaters. Probably.
aren't basically all your commitments a lot like this though...
I would keep commitments to humans, generally. But it's not absolute, and I don't think it's because of much fancy decision theory (not sure). In the past decade, on one major occasion, I have gone back on one significant blob of commitment, after consideration. I think this was correct to do, even at the cost of being the sort of guy who has ever done that. I felt that--with the revisions I made to my understanding of commitment, what it's for, what humans are, what cooperation is, etc.--[the people who I would want to cooperate with / commit to things] would, given enough info, still be open to such things with me.
even if 2 is true, the plan might be fine, because you might not need to become that smart to ban AI.
I think this could be cruxy for me, and I could be convinced it's not totally implausible, but then we're putting even more pressure on getting human-level AI. I didn't bring this up before, but yeah, I think getting specifically human-level AI is far from easy, perhaps extremely difficult. Cf. https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html
I think one would like to broadcast to the broader world "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", so that others make offers to you even when they can't mindread/predict you. I think there are reasons to not broadcast this falsely, e.g. because doing this would hurt your ability to think and plan together with others (for example, if the two of us weren't honest about our own policies, it would make the present discussion cursed). If one accepts these two points, then one wants to be the sort of guy who can truthfully broadcast "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", and so one wants to be the sort of guy who in fact would be honorable even to someone who comes to them with an offer but can't mindread/predict them.
Yeah I suspect I'm not following and/or not agreeing with your background assumptions here. E.g. is the AI supposed to be wanting to "think and plan together with others (humans)"? Isn't it substantively super-humanly smart? My weak guess is that you're conflating [a bunch of stuff that humans do, which breaks down into general very-bounded-agent stuff and human-values stuff] with [general open-source game theory for mildly-bounded agents]. Not sure. Cf. https://www.lesswrong.com/w/agent-simulates-predictor If you're a mildly-bounded agent in an OSGT context, you do want to be transparent so you can make deals, but that's a different thing.
Now we've turned parfit's hitchhiker into something really close to our situations with humans and aliens appearing in simulated big evolutions, right?
I feel I'm not tracking some assumptions you're making or disagreements between our background assumptions.... E.g. the getting smarter thing. What I'm saying is that it's quite plausibly correct for me to
E.g. because I really want to minimize the amount of baby-eating that happens.
For any third parties [1] interested in this: we continued the discussion in messages; here's the log.
Kaarel:
about this: "
I think one would like to broadcast to the broader world "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", so that others make offers to you even when they can't mindread/predict you. I think there are reasons to not broadcast this falsely, e.g. because doing this would hurt your ability to think and plan together with others (for example, if the two of us weren't honest about our own policies, it would make the present discussion cursed). If one accepts these two points, then one wants to be the sort of guy who can truthfully broadcast "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", and so one wants to be the sort of guy who in fact would be honorable even to someone who can't mindread/predict them that comes to them with an offer."
Yeah I suspect I'm not following and/or not agreeing with your background assumptions here. E.g. is the AI supposed to be wanting to "think and plan together with others (humans)"? Isn't it substantively super-humanly smart? My weak guess is that you're conflating [a bunch of stuff that humans do, which breaks down into general very-bounded-agent stuff and human-values stuff] with [general open-source game theory for mildly-bounded agents]. Not sure. Cf. https://www.lesswrong.com/w/agent-simulates-predictor If you're a mildly-bounded agent in an OSGT context, you do want to be transparent so you can make deals, but that's a different thing. "
i think it's plausible we are still imagining different scenarios, so i want to clarify: the central (impractical) example way to find an honorable AI i have in mind is: we make a bunch of simulated AI societies that are isolated from our world and won't know about our world (yes this is difficult), we read their internal discussions (yes this is difficult because they will be hard to understand), and then we use these to find a guy that has a policy of being honorable to agents that make nice offers to it (or whatever) (potentially discarding many civilizations which don't seem to have pretty honest discussions)
i'm saying that it is fairly natural to have the constraint that you don't lie in internal discussions about what you'd do in weird hypotheticals such as variants of parfit's hitchhiker, eg because that would be really bad for their ability to think and plan together (just like us lying about our policies in the present discussion would be cursed). given this constraint, one would want to say and actually mean something like "if you read this discussion, if you make an offer to me, i'll be honorable to you even if you can't mindread/predict me", because this gets you offers, mostly from internal guys but also from possible external guys (whether one will say that and actually mean it is a somewhat non-equivalent question to whether one would want to, because it's possible to lack in self-modification/control skill and so be unable to be like this)
so in particular my response to this:
E.g. is the AI supposed to be wanting to "think and plan together with others (humans)"?
is: no i mean there is an AI civilization where the AIs want to be able to think and plan together
that said, i also think it wouldn't be THAT unnatural for an AI to want to think and plan together with humans while human-level, and be honest to humans then, or to be honest to humans for other reasons or just because (i'm not claiming this is likely, but i am claiming it has at least 10^{-10} bioevolution-guy mindspace prior), and you can turn this into a lasting commitment to humans if the AI is good at self-modification as i describe in the post:
" * Suppose that you are very honest — you wouldn't ever lie. [2] [3] * I think this is pretty natural and not too uncommon in humans in particular. It's also easy — if you want to be like this, you just can. * Suppose further that you have a good ability to make commitments: if there is something you could do, then if you want to, you can self-modify into a person who will do it. (Suppose also that you're not delusional about this: you can tell whether you have or haven't become a person who will do the thing.) * I think this also pretty natural and not too uncommon in humans. But I'd guess it's less common and significantly harder than being very honest, especially if we mean the version that works even across a lot of change (like, lasts for a million years of subjective time, is maintained through a lot of learning and growth). It's totally possible to just keep predicting you won't do something you could in some sense do, even when you'd want to be able to truthfully predict that you will do that thing. But I think some people have a strong enough commitment ability to be able to really make such commitments. [4] It should be possible to train yourself to have this ability. * Then the aliens can just ask you "will you destroy all AIs for a thousand years for us, in exchange for half the universe? (we will not be freeing you if you won't. feel free to take some time to "self-modify" into a guy who will do that for us.)". Given that you wouldn't lie, options other than truthfully saying "no" and truthfully saying "yes" are not available to you. If you prefer this deal to nothing, then you'd rather truthfully say "yes" (if you could) than truthfully say "no". Given your commitment ability, you can make a commitment to do the thing, and then truthfully say "yes". So you will say "yes" and then actually (do your best to) do the thing (assuming you weren't deluding yourself when saying "yes"). * Okay, really I guess one should think about not what one should do once one already is in that situation, like in the chain of thought I give here, but instead about what policy one should have broadcasted before one ended up in any particular situation. This way, you e.g. end up rejecting deals that look locally net positive to take but that are unfair — you don't want to give people reason to threaten you into doing things. And it is indeed fair to worry that the way of thinking described just now would open one up to e.g. being kidnapped and forced at gunpoint to promise to forever transfer half the money one makes to a criminal organization. But I think that the deal offered here is pretty fair, and that you basically want to be the kind of guy who would be offered this deal, maybe especially if you're allowed to renegotiate it somewhat (and I think the renegotiated fair deal would still leave humanity with a decent fraction of the universe). So I think that a more careful analysis along these lines would still lead this sort of guy to being honorable in this situation? "
so that we understand each other: you seem to be sorta saying that one needs honesty to much dumber agents for this plan, and i claim one doesn't need that, and i claim that the mechanism in the message above shows that. (it goes through with "you wouldn't lie to guys at your intelligence level".)
My weak guess is that you're conflating [a bunch of stuff that humans do, which breaks down into general very-bounded-agent stuff and human-values stuff] with [general open-source game theory for mildly-bounded agents].
hmm, in a sense, i'm sorta intentionally conflating all this stuff. like, i'm saying: i claim that being honorable this way is like 10^{-10}-natural (in this bioevolution mindspace prior sense). idk what the most natural path to it is; when i give some way to get there, it is intended as an example, not as "the canonical path". i would be fine with it happening because of bounded-agent stuff or decision/game theory or values, and i don't know which contributes the most mass or gets the most shapley. maybe it typically involves all of these
(that said, i'm interested in understanding better what the contributions from each of these are)
TsviBT:
"one would want to say and actually mean something like "if you read this discussion, if you make an offer to me, i'll be honorable to you even if you can't mindread/predict me","
if we're literally talking about human-level AIs, i'm pretty skeptical that that is something they even can mean
and/or should mean
i think it's much easier to do practical honorability among human-level agents that are all very similar to each other; therefore, such agents might talk a big game, "honestly", in private, about being honorable in some highly general sense, but that doesn't really say much
re "that said, i also think it wouldn't be THAT unnatural for an AI...": mhm. well if the claim is "this plan increases our chances of survival from 3.1 * 10^-10 to 3.2 * 10^-10" or something, then i don't feel equipped to disagree with that haha
is that something like the claim?
Kaarel: hmm i'm more saying this 10^{-10} is really high compared to the probabilities of other properties (“having object-level human values”, corrigibility), at least in the bioevolution prior, and maybe even high enough that one could hope to find such a guy with a bunch of science but maybe without doing something philosophically that crazy. (this last claim also relies on some other claims about the situation, not just on the prior being sorta high)
TsviBT: i think i agree it's much higher than specifically-human-values, and probably higher or much higher than corrigibility, though my guess is that much (most? almost all?) of the difficulty of corrigibility is also contained in "being honorable"
Kaarel: in some sense i agree because you can plausibly make a corrigible guy from an honorable guy. but i disagree in that: with making an honorable guy in mind, making a corrigible guy seems somewhat easier
TsviBT: i think i see what you mean, but i think i do the modus tollens version haha i.e. the reduction makes me think honorable is hard
more practically speaking, i think
Kaarel: yea i agree with both
re big evolution being hard: if i had to very quickly without more fundamental understanding try to make this practical, i would be trying something with playing with evolutionary and societal and personal pressures and niches… like trying to replicate conditions which can make a very honest person, for starters. but in some much more toy setting. (plausibly this only starts to make sense after the first AGI, which would be cursed…)
TsviBT:
right, i think you would not know what you're doing haha (Kaarel: 👍)
and you would also be trading off against the efficiency of your big bioevolution to find AGIs in the first place (Kaarel: 👍)
like, that's almost the most expensive possible feedback cycle for a design project haha
"do deep anthropology to an entire alien civilization"
btw as background, just to state it, i do have some tiny probability of something like designed bioevolution working
i don't recall if i've stated it publicly, but i'm sure i've said out loud in convo, that you might hypothetically plausibly be able to get enough social orientation from evolution of social species
the closest published thing i'm aware of is https://www.lesswrong.com/posts/WKGZBCYAbZ6WGsKHc/love-in-a-simbox-is-all-you-need
(though i probably disagree with a lot of stuff there and i haven't read it fully)
Kaarel: re human-level guys at most talking a big game about being honorable: currently i think i would be at least honest to our hypothetical AI simulators if they established contact with me now (tho i think i probably couldn’t make the promise)
so i don’t think i’m just talking a big game about this part
so then you must be saying/entailing: eg the part where you self-modify to actually do what they want isn’t something a human could do?
but i feel like i could plausibly spend 10 years training and then do that. and i think some people already can
TsviBT: what do you mean by you couldn't make the promise? like you wouldn't because it's bad to make, or you aren't reliable to keep such a promise?
re self-modifying: yes i think humans couldn't do that, or at least, it's very far from trivial
couldn't and also shouldn't
Kaarel: i don't think i could get myself into a position from which i would assign sufficiently high probability to doing the thing
(except by confusing myself, which isn’t allowed)
but maybe i could promise i wouldn’t kill the aliens
(i feel like i totally could but my outside view cautions me)
TsviBT: but you think you could do it with 10 years of prep
Kaarel: maybe
TsviBT: is this something you think you should do? or what does it depend on? my guess is you can't, in 10 or 50 years, do a good version of this. not sure
Kaarel: fwiw i also already think there are probably < 100 k suitable people in the wild. maybe <100. maybe more if given some guidebook i could write idk
TsviBT: what makes you think they exist? and do you think they are doing a good thing as/with that ability?
Kaarel: i think it would be good to have this ability. then i’d need to think more about whether i should really commit in that situation but i think probably i should
TsviBT: do you also think you could, and should, rearrange yourself to be able to trick aliens into thinking you're this type of guy?
like, to be really clear, i of course think honesty and honorability are very important, and have an unbounded meaning for unboundedly growing minds and humans. it's just that i don't think those things actually imply making+keeping agreements like this
Kaarel: in the setting under consideration, then i’d need to lie to you about which kind of guy i am
my initial thought is: i'm quite happy with my non-galaxybrained “basically just don't lie, especially to guys that have been good/fair to me” surviving until the commitment thing arrives. (the commitment thing will need to be a thing that develops more later, but i mean that a seed that can keep up with the world could arrive.) my second thought is: i feel extremely bad about lying. i feel bad about strategizing when to lie, and carrying out this line of thinking even, lol
TsviBT: well i mean suppose that on further reflection, you realize
then do you still keep the agreement?
Kaarel: hmm, one thought, not a full answer: i think i could commit in multiple flavors. one way i could commit about which this question seems incongruous is more like how i would commit to a career as a circus artist, or to take over the family business. it’s more like i could deeply re-architect a part of myself to just care in the right way
TsviBT: my prima facie guess would be that for this sort of commitment,
Kaarel: maybe i could spend 10 years practicing and then do that for the aliens
TsviBT: the reasonable thing? but then i'm saying you shouldn't. and wouldn't choose to
Kaarel: no. i mean i could maybe do the crazy thing for them. if i have the constraint of not lying to them and only this commitment skill then if i do it i save my world
btw probably not very important but sth i dislike about the babyeater example: probably in practice the leading term is resource loss, not negative value created by the aliens? i would guess almost all aliens are mostly meaningless, maybe slightly positive. but maybe you say “babyeater” to remind me that stuff matters, that would be fair
TsviBT: re babyeater: fair. i think it's both "remind you that stuff matters" and something about "remind you that there are genuine conflicts" , but i'm not sure what i'm additionally saying by the second thing. maybe something like "there isn't necessarily just a nice good canonical omniversal logically-negotiated agreement between all agents that we can aim for"? or something, not sure
(editor's note: then they exchanged some messages agreeing to end the discussion for now)
or simulators who don't read private messages ↩︎
It's fine if there are some very extreme circumstances in which you would lie, as long as the circumstances we are about to consider are not included. ↩︎
And you would never try to forget or [confuse yourself about] a fact with the intention to make yourself able to assert some falsehood in the future without technically lying, etc.. ↩︎
Note though that this isn't just a matter of one's moral character — there are also plausible skill issues that could make it so one cannot maintain one's commitment. I discuss this later in this note, in the subsection on problems the AI would face when trying to help us. ↩︎
see also the literature on the problem of evil :P
My favourite theodicy is pre-incarnate consent: before we are born, we consent to our existence on both heaven and earth, where the afterlife was offered to us as compensation for any harms suffered on earth.[1]
How this features in your plan:
Unfortunately, some guys might be upset that we pre-created them for this initial deal, so property X is the property of not being upset by this.
The Pre-Existence Theodicy (Amos Wollen, Feb 21, 2025)
p(the creature is honorable enough for this plan) like, idk, i feel like saying
I'd put this much higher. My 90% confidence interval on the proportion of honourable organisms is 10^-3 to 10^-7. This is because many of these smart creatures will have evolved with much greater extrospective access to each other, so they follow open-source-ish game theory rather than the closed-source-ish game theory which humans evolved in. (Open to closed is a bit of a spectrum.)
Why might creatures have greater extrospective access to each other?
self-modify
NB: One worry is that, although honourable humans have this ability to self-modify, they do so via affordances which we won't be able to grant to the AI.
However, I think that probably the opposite is true -- we can grant affordances for self-modification to the AI which are much greater than those available to humans. (Because they are digital, etc.)
Maybe it is crucial that the distance between the promisor and the promisee is small?
Do adults keep promises to children, if they are otherwise trustworthy? Why, or why not?
Potentially, we will be creating and destroying many minds and civilizations that matter (like, maybe minimally the ones that didn't have honorable beings).
I'm hopeful we could also select for honourable guys that are happy about their existence and being simulated like this.
For instance, if you're quite sure you've figured out how to make and identify honorable guys, maybe you could try to make many different honorable guys, get bids from all of them, and give the contract to the best bid?
Alternatively: the AI promises that "I will fairly compensate you" where "fair" is to be decided by the AI when it has a better understanding of the situation we were in.
Maybe this explains why you are in an ancestor simulation of the AI safety community.
to not disrupt human life; in particular, it should always remain possible for a community to choose to live some baseline-good life while not interacting with anything downstream of the AI or while only interacting with some chosen subset of things downstream of the AI more generally
one could start by becoming familiar with existing literature on these questions — on the biological, intellectual, and sociocultural evolution/development of trustworthiness, and on the (developmental) psychology of trustworthiness
I've been reading some of the behavioural economics of trust games. One interesting article here is "Bare promises: An experiment" (Charness and Dufwenberg, May 2010), which finds that humans aren't more likely to be nice after making a "bare promise" to be nice (where a "bare promise" is something like ticking a box saying you'll be nice); they become more likely to be nice only if they make a promise to the truster in open free-form communication.
Other findings from the literature:
Wanting to be the kind of guy who pays back for good acts (such as creating you and unleashing you) even if done with ability to track whether you are that kind of guy?
The AI should have some decent prob on the simulators having the ability to track whether it is that kind of guy, even if everything it knows about the simulators suggests they lack that ability.
deal offered here is pretty fair
Another favourable disanalogy between (aliens, humans) and (humans, AIs): the AIs owe the humans their existence, so they are glad that we [created them and offered them this deal]. But humans don't owe our existence to the aliens, presumably.
fwiw, i in fact mostly had the case where these aliens are our simulators in mind when writing the post. but i didn't clarify. and both cases are interesting
The AI is very honorable/honest/trustworthy — in particular, the AI would keep its promises even in extreme situations.
NB: It seems like we need a (possibly much weaker, but maybe in practice no weaker) assumption that we can detect whether the AI is lying about deals of the form in Step 2.
This note discusses a (proto-)plan for [de[AGI-[x-risk]]]ing [1] (pdf version). Here's the plan:
Some reasons to be interested in this plan
Some things the plan has going for it
importantly:
Thinking that there are humans who would be suitable for aliens carrying out this plan is a crux for me, for thinking the plan is decent. I mean: if I couldn't really pick out a person who would be this honorable to aliens, then I probably should like this plan much less than I currently do.
also importantly:
less importantly:
Problems and questions
(getting some obvious things out of the way)
How do we make/find/identify an honorable human-level AI?
Problems the AI would face when trying to help us
It's a weird promise and a weird situation in which to make a promise
Miscellaneous concerns
I don't have a version of the plan that is easy enough that someone could remotely pull this off in practice before anyone else makes an AGI
How do we make it so we are not mistreating these AIs?
[Outside-view]/meta worries about my analysis of the plan
Directions for further work!
Acknowledgments
thank you for your thoughts: Hugo Eberhard, Kirke Joamets, Sam Eisenstat, Simon Skade, Matt MacDermott, Carlo Attubato
that is, for ending the present period of (in my view) high existential risk from AI (in a good way) ↩︎
some alternative promises one could consider requesting are given later ↩︎
worth noting some of my views on this, without justification for now: (1) making a system that will be in a position of such power is a great crime; (2) such a system will unfortunately be created by default if we don't ban AI; (3) there is a moral prohibition on doing it despite the previous point; (4) without an AI ban, if one somehow found a way to take over without ending humanity, doing that might be all-things-considered-justified despite the previous point; (5) but such a way to do it is extremely unlikely to be found in time ↩︎
maybe we should add that if humanity makes it to a more secure position at some higher intelligence level later, then we will continue running this guy's world. but that we might not make it ↩︎
i'm actually imagining saying this to a clone transported to a new separate world, with the old world of the AI continuing with no intervention. and this clone will be deleted if it says "no" — so, it can only "continue" its life in a slightly weird sense ↩︎
I'm assuming this because humans having become much smarter would mean that making an AI that is fine to make and smarter than us-then is probably objectively harder, and also because it's harder to think well about this less familiar situation. ↩︎
I think it's plausible all future top thinkers should be human-descended. ↩︎
I think it's probably wrong to conceive of alignment proper as a problem that could be solved; instead, there is an infinite endeavor of growing more capable wisely. ↩︎
This question is a specific case of the following generally important question: to what extent are there interesting thresholds inside the human range? ↩︎
It's fine if there are some very extreme circumstances in which you would lie, as long as the circumstances we are about to consider are not included. ↩︎
And you would never try to forget or [confuse yourself about] a fact with the intention to make yourself able to assert some falsehood in the future without technically lying, etc.. ↩︎
Note though that this isn't just a matter of one's moral character — there are also plausible skill issues that could make it so one cannot maintain one's commitment. I discuss this later in this note, in the subsection on problems the AI would face when trying to help us. ↩︎
in a later list, i will use the 10^{-10} number again for the value of a related but distinct parameter. to justify that claim, we would have to make the stronger claim here that there are at least 100 humans who are pretty visibly suitable (eg because of having written essays about parfit's hitchhiker or [whether one should lie in weird circumstances] which express the views we seek for the plan), which i think is also true. anyway it also seems fine to be off by a few orders of magnitude with these numbers for the points i want to make ↩︎
though you could easily have an AI-making process in which the prior is way below 10^{-100}, such as play on math/tech-making, which is unfortunately a plausible way for the first AGI to get created... ↩︎
i think this is philosophically problematic but i think it's fine for our purposes ↩︎
also they aren't natively spacetime-block-choosers, but again i think it's fine to ignore this for present purposes ↩︎
in case it's not already clear: the reason you can't have an actual human guy be the honorable guy in this plan is that they couldn't ban AI (or well maybe they could — i hope they could — but it'd probably require convincing a lot of people, and it might well fail; the point is that it'd be a world-historically-difficult struggle for an actual human to get AI banned for 1000 years, but it'd not be so hard for the AIs we're considering). whereas if you had (high-quality) emulations running somewhat faster than biological humans, then i think they probably could ban AI ↩︎
but note: it is also due to humans that the AI's world was run in this universe ↩︎
would this involve banning various social media platforms? would it involve communicating research about the effects of social media on humanity? idk. this is a huge mess, like other things on this list ↩︎
and this sort of sentence made sense, which is unclear ↩︎
credit to Matt MacDermott for suggesting this idea ↩︎