It's also easy — if you want to be like this, you just can.
I think you can easily choose to follow a policy of never saying things you know to be false. (Easy in the sense of "considering only the internal costs of determining and executing the action consistent with this policy, ignoring the external costs, e.g. losing your job and friends".) But I'm not sure it's easy to do the extra thing of "And you would never try to forget or [confuse yourself about] a fact with the intention to make yourself able to assert some falsehood in the future without technically lying, etc."
I'd really want to read essays you wrote about Parfit's hitchhiker or one-shot prisoner's dilemmas or something
My method would look something like:
NB: I think that, perhaps, it will be easier to make/find/identify an honourable AI than an honourable human, because:
In humans, it seems important for being honest/honorable that there was at some point something like an explicit decision to be honest/honorable going forward (or maybe usually many explicit decisions, committing to stronger forms in stages). This makes me want to have the criterion/verifier/selector [1] check (among other things) for something like having a diary entry or chat with a friend in which the AI says they will be honest going forward, written in the course of their normal life, in a not-very-prompted way. And it would of course be much better if this AI did not suspect that anyone was looking at it from the outside, or know about the outside world at all (but this is unfortunately difficult/[a big capability hit] I think). (And things are especially cursed if AIs suspect observers are looking for honest guys in particular.)
I mean, in the setup following "a framing:" in the post ↩︎
I'm sceptical that "they wrote an essay defending the position that one should act honourably in weird situations" is stronger evidence for "they will act honourably in weird situations" than "they act honourably in normal situations". this is because I've updated towards a more Hansonian worldview, which is more cynical about people's essays.[1]
but maybe you can conclude that someone will act honourably in weird situations if:
this last criterion is something like a "taking ideas seriously" or "real-thinking" or "sincerity".
now, with humans, it's tricky to be sure of (3). that's because we can't put someone in the weird situations in which decision theories diverge;[2] at least, not without them knowing they are in an evaluation, and certainly not for 10,000 stateless, parallelised trials.
but with AIs, we plausibly can get a decent guarantee of (3)! so I'm more bullish on getting guarantees of honourability for AIs.
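to make the contrast concrete, here's a minimal sketch of what "many stateless, parallelised trials" could look like for an AI candidate. everything named here (MockCandidate, WEIRD_SCENARIOS, acted_honourably) is a hypothetical stand-in, not machinery from the post; the point is just that every trial restarts from the same frozen snapshot, so nothing about earlier trials can leak into later ones, and the whole batch can run in parallel, which is exactly what we can't do to a human.

```python
from concurrent.futures import ProcessPoolExecutor

# Everything below is a hypothetical stand-in, not real machinery from the post.

WEIRD_SCENARIOS = [
    "parfit's hitchhiker (variant)",
    "one-shot prisoner's dilemma",
    "newcomb's problem (variant)",
]

class MockCandidate:
    """Stand-in for a frozen snapshot of the candidate AI."""
    def run(self, scenario: str) -> str:
        # In reality this would be the candidate living through the scenario
        # without knowing it is being evaluated.
        return f"(candidate's behaviour in: {scenario})"

def load_candidate(snapshot_id: str) -> MockCandidate:
    """Stand-in: reload the same frozen snapshot for every trial (statelessness)."""
    return MockCandidate()

def acted_honourably(transcript: str) -> bool:
    """Stand-in judge; the actual hard part of the proposal lives here."""
    return True

def run_trial(scenario: str) -> bool:
    candidate = load_candidate("snapshot-0")  # fresh copy each time: no memory of earlier trials
    return acted_honourably(candidate.run(scenario))

if __name__ == "__main__":
    # ~10,000 stateless trials over weird decision-theoretic scenarios, run in parallel.
    trials = WEIRD_SCENARIOS * (10_000 // len(WEIRD_SCENARIOS))
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_trial, trials))
    print(f"acted honourably in {sum(results)}/{len(results)} trials")
```

(the judge and the snapshot machinery are of course where all the difficulty lives; the sketch only illustrates the stateless/parallel shape of the evaluation, which is the part that's unavailable for humans.)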
some cloud of sentiments around: (i) people have little introspective access about what they will do, (ii) they write essays in far-mode, but act in near-mode, (iii) people are, like, lying pretty much all the time, (iv) people write essays for status-y reasons, etc.
we can't put derek parfit in parfit's hitchhiker, or put william newcomb in newcomb's problem
If you truly have a trusted process for creating honorable AIs, why not keep churning out and discarding different honorable AIs until one says "I promise I'm aligned" or whatever?
I suppose you're not assuming our ability to make AIs honorable like this will be robust to selection pressure?
I agree you could ask your AI "will you promise to be aligned?". I think I already discuss this option in the post — ctrl+f "What promise should we request?" and see the stuff after it. I don't use the literal wording you suggest, but I discuss things which are ways to cash it out imo.
also quickly copying something I wrote on this question from a chat with a friend:
Should we just ask the AI to promise to be nice to us? I agree this is an option worth considering (and I mention it in the post), but I'm not that comfortable with the prospect of living together with the AI forever. Roughly I worry that "be nice to us" creates a situation where we are more permanently living together with the AI and human life/valuing/whatever isn't developing in a legitimate way. Whereas the "ban AI" wish tries to be a more limited thing so we can still continue developing in our own human way. I think I can imagine this "be nice to us pls" wish going wrong for aliens employing me, when maybe "pls just ban AI and stay away from us otherwise" wouldn't go wrong for them.
another meta note: Imo a solid trick for thinking better about these AI topics is to (at least occasionally) taboo all words with the root "align".
[I feel like I may have a basic misunderstanding of what you're saying.]
I haven't thought deeply enough about it, but one guess: The version of honorability/honesty that humans do is only [kinda natural for very bounded minds].
There's a more complex boundary where you're honest with minds who can tell if you're being honest, and not honest with those who can't. This is a more natural boundary to use because it's more advantageous.
You mention wanting to see someone's essays about Parfit's hitchhiker... But that situation requires Ekman to be very good at telling what you'll do. We're not very good at telling what an alien will do.
I think there are humans who, even for weird aliens, would make this promise and stick to it, with this going basically well for the aliens.
Would you guess I have this property? At a quick check, I'm not sure I do. Which is to say, I'm not sure I should. If a Baby-Eater is trying to get a promise like this from me, AND it would totally work to trick them, shouldn't I trick them?
I feel like I may have a basic misunderstanding of what you're saying.
Btw, if the plan looks silly, that's compatible with you not having a misunderstanding of the plan, because it is a silly plan. But it's still the best answer I know to "concretely how might we make some AI alien who would end the present period of high x-risk from AGI, even given a bunch more time?". (And this plan isn't even concrete, but what's a better answer?) But it's very sad that/if it's the best existing answer.
When I talk to people about this plan, a common misunderstanding seems to be that the plan involves making a deal with an AI that's smarter than us. So I'll stress just in case: at the time we ask for the promise, the AI is supposed to be close to us in intelligence. It might need to become smarter than us later, to ban AI. But also idk, maybe it doesn't need to become much smarter. I think it's plausible that a top human who just runs faster and can make clones but who doesn't self-modify in other non-standard ways could get AI banned in like a year. Less clever ways for this human to get AI banned depend on the rest of the world not doing much in response quickly, but looking at the world now, this seems pretty plausible. But maybe the AI in this hypothetical would need to grow more than such a human, because the AI starts off not being that familiar with the human world?
Anyway, there are also other possible misunderstandings, but hopefully the rest of the comment will catch those if they are present.
The version of honorability/honesty that humans do is only [kinda natural for very bounded minds].
I'm interested in whether that's true, but I want to first note that I feel like the plan would survive this being true. It might help to distinguish between two senses in which honorability/honesty could be dropped at higher intelligence levels:
given this distinction, some points:
(I also probably believe somewhat less in (thinking in terms of) ideal(-like) beings.)
There's a more complex boundary where you're honest with minds who can tell if you're being honest, and not honest with those who can't. This is a more natural boundary to use because it's more advantageous.
I think one would like to broadcast to the broader world "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", so that others make offers to you even when they can't mindread/predict you. I think there are reasons to not broadcast this falsely, e.g. because doing this would hurt your ability to think and plan together with others (for example, if the two of us weren't honest about our own policies, it would make the present discussion cursed). If one accepts these two points, then one wants to be the sort of guy who can truthfully broadcast "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", and so one wants to be the sort of guy who in fact would be honorable even to someone who comes to them with an offer but can't mindread/predict them.
(I'm probably assuming some stuff here without explicitly saying I'm assuming it. In some settings, maybe one could be honest with one's community and broadcast a falsehood to some others and get away with it. The hope is that this sort of argument makes sense for some natural mind community structures, or something. It'd be especially nice if the argument made sense even at intelligence levels much above humans.)
You mention wanting to see someone's essays about Parfit's hitchhiker... But that situation requires Ekman to be very good at telling what you'll do. We're not very good at telling what an alien will do.
I'll try to spell out an analogy between parfit's hitchhiker and the present case.
Let's start from the hitchhiker case and apply some modifications. Suppose that when Ekman is driving through the desert, he already reliably reads whether you'd pay from your microexpressions before even talking to you. This doesn't really seem more crazy than the original setup, and if you think you should pay in the original case, presumably you'll think you should pay in this case as well. Now we might suppose that he is already doing this from binoculars when you don't even know he is there, and not even bothering to drive up to you if he isn't quite sure you'd pay. Now, let's imagine you are the sort of guy that honestly talks to himself out loud about what he'd do in weird situations of the kind Ekman is interested in, while awaiting potential death in the desert. Let's imagine that instead of predicting your action from your microexpressions while spying on you with binoculars, Ekman might be spying on you from afar with a parabolic microphone, and using this to predict your action. If Ekman is very good at that as well, then of course this makes no difference again. Okay, but in practice, a non-ideal Ekman might listen to what you're saying about what you'd do in various cases, listen to you talking about your honesty/honor-relevant principles and spelling out aspects of your policy. Maybe some people would lie about these things even when they seem to be only talking to themselves, but even non-ideal Ekman can pretty reliably tell if that's what's going on. For some people, it will be quite unclear, but it's just not worth it for non-ideal Ekman to approach them (maybe there are many people in the desert, and non-ideal Ekman can only help one anyway).
Now we've turned parfit's hitchhiker into something really close to our situations with humans and aliens appearing in simulated big evolutions, right? [3] I think it's not an uncommon vibe that EDT/UDT thinking still comes close to applying in some real-world cases where the predictors are far from ideal, and this seems like about as close to ideal as it would get among current real-world non-ideal cases? (Am I missing something?) [4]
Would you guess I have this property? At a quick check, I'm not sure I do. Which is to say, I'm not sure I should. If a Baby-Eater is trying to get a promise like this from me, AND it would totally work to trick them, shouldn't I trick them?
I'm not going to answer your precise question well atm. Maybe I'll do that in another comment later. But I'll say some related stuff.
aren't basically all your commitments a lot like this though... ↩︎
I also sort of feel like saying: "if one can't even keep a promise, as a human who goes in deeply intending to keep the promise, self-improving by [what is in the grand scheme of things] an extremely small amount, doing it really carefully, then what could ever be preserved in development at all? things surely aren't that cursed... maybe we just give up on the logical possible worlds in which things are that cursed...". But this is generally a disastrous kind of reasoning — it makes one not live in reality very quickly — so I won't actually say this, I'll only say that I feel like saying this, but then reject the thought, I guess. ↩︎
Like, I'm e.g. imagining us making alien civilizations in which there are internal honest discussions like the present discussion. (Understanding these discussions would be hard work; this is a place where this "plan" is open-ended.) ↩︎
Personally, I currently feel like I haven't made up my mind about this line of reasoning. But I have a picture of what I'd do in the situation anyway, which I discuss later. ↩︎
2 feels meaningfully stronger/[less likely] than 1 to me
Well I agree it's different and, depending on the interpretation, logically strictly stronger. But I think it's still quite likely, because you should go back on your commitments to Baby-Eaters. Probably.
aren't basically all your commitments a lot like this though...
I would keep commitments to humans, generally. But it's not absolute, and I don't think it's because of much fancy decision theory (not sure). In the past decade, on one major occasion, I have gone back on one significant blob of commitment, after consideration. I think this was correct to do, even at the cost of being the sort of guy who has ever done that. I felt that--with the revisions I made to my understanding of commitment, what it's for, what humans are, what cooperation is, etc.--[the people who I would want to cooperate with / commit to things] would, given enough info, still be open to such things with me.
even if 2 is true, the plan might be fine, because you might not need to become that smart to ban AI.
I think this could be cruxy for me, and I could be convinced it's not totally implausible, but then we're putting even more pressure on getting human-level AI. I didn't bring this up before, but yeah, I think getting specifically human-level AI is far from easy, perhaps extremely difficult. Cf. https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html
I think one would like to broadcast to the broader world "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", so that others make offers to you even when they can't mindread/predict you. I think there are reasons to not broadcast this falsely, e.g. because doing this would hurt your ability to think and plan together with others (for example, if the two of us weren't honest about our own policies, it would make the present discussion cursed). If one accepts these two points, then one wants to be the sort of guy who can truthfully broadcast "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", and so one wants to be the sort of guy who in fact would be honorable even to someone who comes to them with an offer but can't mindread/predict them.
Yeah I suspect I'm not following and/or not agreeing with your background assumptions here. E.g. is the AI supposed to be wanting to "think and plan together with others (humans)"? Isn't it substantively super-humanly smart? My weak guess is that you're conflating [a bunch of stuff that humans do, which breaks down into general very-bounded-agent stuff and human-values stuff] with [general open-source game theory for mildly-bounded agents]. Not sure. Cf. https://www.lesswrong.com/w/agent-simulates-predictor If you're a mildly-bounded agent in an OSGT context, you do want to be transparent so you can make deals, but that's a different thing.
Now we've turned parfit's hitchhiker into something really close to our situations with humans and aliens appearing in simulated big evolutions, right?
I feel I'm not tracking some assumptions you're making or disagreements between our background assumptions.... E.g. the getting smarter thing. What I'm saying is that it's quite plausibly correct for me to
E.g. because I really want to minimize the amount of baby-eating that happens.
For any third parties [1] interested in this: we continued the discussion in messages; here's the log.
Kaarel:
about this: "
I think one would like to broadcast to the broader world "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", so that others make offers to you even when they can't mindread/predict you. I think there are reasons to not broadcast this falsely, e.g. because doing this would hurt your ability to think and plan together with others (for example, if the two of us weren't honest about our own policies, it would make the present discussion cursed). If one accepts these two points, then one wants to be the sort of guy who can truthfully broadcast "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", and so one wants to be the sort of guy who in fact would be honorable even to someone who can't mindread/predict them that comes to them with an offer."
Yeah I suspect I'm not following and/or not agreeing with your background assumptions here. E.g. is the AI supposed to be wanting to "think and plan together with others (humans)"? Isn't it substantively super-humanly smart? My weak guess is that you're conflating [a bunch of stuff that humans do, which breaks down into general very-bounded-agent stuff and human-values stuff] with [general open-source game theory for mildly-bounded agents]. Not sure. Cf. https://www.lesswrong.com/w/agent-simulates-predictor If you're a mildly-bounded agent in an OSGT context, you do want to be transparent so you can make deals, but that's a different thing. "
i think it's plausible we are still imagining different scenarios, so i want to clarify: the central (impractical) example way to find an honorable AI i have in mind is: we make a bunch of simulated AI societies that are isolated from our world and won't know about our world (yes this is difficult), we read their internal discussions (yes this is difficult because they will be hard to understand), and then we use these to find a guy that has a policy of being honorable to agents that make nice offers to it (or whatever) (potentially discarding many civilizations which don't seem to have pretty honest discussions)
i'm saying that it is fairly natural to have the constraint that you don't lie in internal discussions about what you'd do in weird hypotheticals such as variants of parfit's hitchhiker, eg because that would be really bad for their ability to think and plan together (just like us lying about our policies in the present discussion would be cursed). given this constraint, one would want to say and actually mean something like "if you read this discussion, if you make an offer to me, i'll be honorable to you even if you can't mindread/predict me", because this gets you offers, mostly from internal guys but also from possible external guys (whether one will say that and actually mean it is a somewhat non-equivalent question to whether one would want to, because it's possible to lack in self-modification/control skill and so be unable to be like this)
so in particular my response to this:
E.g. is the AI supposed to be wanting to "think and plan together with others (humans)"?
is: no i mean there is an AI civilization where the AIs want to be able to think and plan together
that said, i also think it wouldn't be THAT unnatural for an AI to want to think and plan together with humans while human-level, and be honest to humans then, or to be honest to humans for other reasons or just because (i'm not claiming this is likely, but i am claiming it has at least 10^{-10} bioevolution-guy mindspace prior), and you can turn this into a lasting commitment to humans if the AI is good at self-modification as i describe in the post:
" * Suppose that you are very honest — you wouldn't ever lie. [2] [3] * I think this is pretty natural and not too uncommon in humans in particular. It's also easy — if you want to be like this, you just can. * Suppose further that you have a good ability to make commitments: if there is something you could do, then if you want to, you can self-modify into a person who will do it. (Suppose also that you're not delusional about this: you can tell whether you have or haven't become a person who will do the thing.) * I think this also pretty natural and not too uncommon in humans. But I'd guess it's less common and significantly harder than being very honest, especially if we mean the version that works even across a lot of change (like, lasts for a million years of subjective time, is maintained through a lot of learning and growth). It's totally possible to just keep predicting you won't do something you could in some sense do, even when you'd want to be able to truthfully predict that you will do that thing. But I think some people have a strong enough commitment ability to be able to really make such commitments. [4] It should be possible to train yourself to have this ability. * Then the aliens can just ask you "will you destroy all AIs for a thousand years for us, in exchange for half the universe? (we will not be freeing you if you won't. feel free to take some time to "self-modify" into a guy who will do that for us.)". Given that you wouldn't lie, options other than truthfully saying "no" and truthfully saying "yes" are not available to you. If you prefer this deal to nothing, then you'd rather truthfully say "yes" (if you could) than truthfully say "no". Given your commitment ability, you can make a commitment to do the thing, and then truthfully say "yes". So you will say "yes" and then actually (do your best to) do the thing (assuming you weren't deluding yourself when saying "yes"). * Okay, really I guess one should think about not what one should do once one already is in that situation, like in the chain of thought I give here, but instead about what policy one should have broadcasted before one ended up in any particular situation. This way, you e.g. end up rejecting deals that look locally net positive to take but that are unfair — you don't want to give people reason to threaten you into doing things. And it is indeed fair to worry that the way of thinking described just now would open one up to e.g. being kidnapped and forced at gunpoint to promise to forever transfer half the money one makes to a criminal organization. But I think that the deal offered here is pretty fair, and that you basically want to be the kind of guy who would be offered this deal, maybe especially if you're allowed to renegotiate it somewhat (and I think the renegotiated fair deal would still leave humanity with a decent fraction of the universe). So I think that a more careful analysis along these lines would still lead this sort of guy to being honorable in this situation? "
so that we understand each other: you seem to be sorta saying that one needs honesty to much dumber agents for this plan, and i claim one doesn't need that, and i claim that the mechanism in the message above shows that. (it goes through with "you wouldn't lie to guys at your intelligence level".)
My weak guess is that you're conflating [a bunch of stuff that humans do, which breaks down into general very-bounded-agent stuff and human-values stuff] with [general open-source game theory for mildly-bounded agents].
hmm, in a sense, i'm sorta intentionally conflating all this stuff. like, i'm saying: i claim that being honorable this way is like 10^{-10}-natural (in this bioevolution mindspace prior sense). idk what the most natural path to it is; when i give some way to get there, it is intended as an example, not as "the canonical path". i would be fine with it happening because of bounded-agent stuff or decision/game theory or values, and i don't know which contributes the most mass or gets the most shapley. maybe it typically involves all of these
(that said, i'm interested in understanding better what the contributions from each of these are)
TsviBT:
"one would want to say and actually mean something like "if you read this discussion, if you make an offer to me, i'll be honorable to you even if you can't mindread/predict me","
if we're literally talking about human-level AIs, i'm pretty skeptical that that is something they even can mean
and/or should mean
i think it's much easier to do practical honorability among human-level agents that are all very similar to each other; therefore, such agents might talk a big game, "honestly", in private, about being honorable in some highly general sense, but that doesn't really say much
re "that said, i also think it wouldn't be THAT unnatural for an AI...": mhm. well if the claim is "this plan increases our chances of survival from 3.1 * 10^-10 to 3.2 * 10^-10" or something, then i don't feel equipped to disagree with that haha
is that something like the claim?
Kaarel: hmm i'm more saying this 10^{-10} is really high compared to the probabilities of other properties (“having object-level human values”, corrigibility), at least in the bioevolution prior, and maybe even high enough that one could hope to find such a guy with a bunch of science but maybe without doing something philosophically that crazy. (this last claim also relies on some other claims about the situation, not just on the prior being sorta high)
TsviBT: i think i agree it's much higher than specifically-human-values, and probably higher or much higher than corrigibility, though my guess is that much (most? almost all?) of the difficulty of corrigibility is also contained in "being honorable"
Kaarel: in some sense i agree because you can plausibly make a corrigible guy from an honorable guy. but i disagree in that: with making an honorable guy in mind, making a corrigible guy seems somewhat easier
TsviBT: i think i see what you mean, but i think i do the modus tollens version haha i.e. the reduction makes me think honorable is hard
more practically speaking, i think
Kaarel: yea i agree with both
re big evolution being hard: if i had to very quickly without more fundamental understanding try to make this practical, i would be trying something with playing with evolutionary and societal and personal pressures and niches… like trying to replicate conditions which can make a very honest person, for starters. but in some much more toy setting. (plausibly this only starts to make sense after the first AGI, which would be cursed…)
TsviBT:
right, i think you would not know what you're doing haha (Kaarel: 👍)
and you would also be trading off against the efficiency of your big bioevolution to find AGIs in the first place (Kaarel: 👍)
like, that's almost the most expensive possible feedback cycle for a design project haha
"do deep anthropology to an entire alien civilization"
btw as background, just to state it, i do have some tiny probability of something like designed bioevolution working
i don't recall if i've stated it publicly, but i'm sure i've said out loud in convo, that you might hypothetically plausibly be able to get enough social orientation from evolution of social species
the closest published thing i'm aware of is https://www.lesswrong.com/posts/WKGZBCYAbZ6WGsKHc/love-in-a-simbox-is-all-you-need
(though i probably disagree with a lot of stuff there and i haven't read it fully)
Kaarel: re human-level guys at most talking a big game about being honorable: currently i think i would be at least honest to our hypothetical AI simulators if they established contact with me now (tho i think i probably couldn’t make the promise)
so i don’t think i’m just talking a big game about this part
so then you must be saying/entailing: eg the part where you self-modify to actually do what they want isn’t something a human could do?
but i feel like i could plausibly spend 10 years training and then do that. and i think some people already can
TsviBT: what do you mean by you couldn't make the promise? like you wouldn't because it's bad to make, or you aren't reliable to keep such a promise?
re self-modifying: yes i think humans couldn't do that, or at least, it's very far from trivial
couldn't and also shouldn't
Kaarel: i don't think i could get myself into a position from which i would assign sufficiently high probability to doing the thing
(except by confusing myself, which isn’t allowed)
but maybe i could promise i wouldn’t kill the aliens
(i feel like i totally could but my outside view cautions me)
TsviBT: but you think you could do it with 10 years of prep
Kaarel: maybe
TsviBT: is this something you think you should do? or what does it depend on? my guess is you can't, in 10 or 50 years, do a good version of this. not sure
Kaarel: fwiw i also already think there are probably < 100 k suitable people in the wild. maybe <100. maybe more if given some guidebook i could write idk
TsviBT: what makes you think they exist? and do you think they are doing a good thing as/with that ability?
Kaarel: i think it would be good to have this ability. then i’d need to think more about whether i should really commit in that situation but i think probably i should
TsviBT: do you also think you could, and should, rearrange yourself to be able to trick aliens into thinking you're this type of guy?
like, to be really clear, i of course think honesty and honorability are very important, and have an unbounded meaning for unboundedly growing minds and humans. it's just that i don't think those things actually imply making+keeping agreements like this
Kaarel: in the setting under consideration, then i’d need to lie to you about which kind of guy i am
my initial thought is: i'm quite happy with my non-galaxybrained “basically just don't lie, especially to guys that have been good/fair to me” surviving until the commitment thing arrives. (the commitment thing will need to be a thing that develops more later, but i mean that a seed that can keep up with the world could arrive.) my second thought is: i feel extremely bad about lying. i feel bad about strategizing when to lie, and carrying out this line of thinking even, lol
TsviBT: well i mean suppose that on further reflection, you realize
then do you still keep the agreement?
Kaarel: hmm, one thought, not a full answer: i think i could commit in multiple flavors. one way i could commit about which this question seems incongruous is more like how i would commit to a career as a circus artist, or to take over the family business. it’s more like i could deeply re-architect a part of myself to just care in the right way
TsviBT: my prima facie guess would be that for this sort of commitment,
Kaarel: maybe i could spend 10 years practicing and then do that for the aliens
TsviBT: the reasonable thing? but then i'm saying you shouldn't. and wouldn't choose to
Kaarel: no. i mean i could maybe do the crazy thing for them. if i have the constraint of not lying to them and only this commitment skill then if i do it i save my world
btw probably not very important but sth i dislike about the babyeater example: probably in practice the leading term is resource loss, not negative value created by the aliens? i would guess almost all aliens are mostly meaningless, maybe slightly positive. but maybe you say “babyeater” to remind me that stuff matters, that would be fair
TsviBT: re babyeater: fair. i think it's both "remind you that stuff matters" and something about "remind you that there are genuine conflicts" , but i'm not sure what i'm additionally saying by the second thing. maybe something like "there isn't necessarily just a nice good canonical omniversal logically-negotiated agreement between all agents that we can aim for"? or something, not sure
(editor's note: then they exchanged some messages agreeing to end the discussion for now)
or simulators who don't read private messages ↩︎
It's fine if there are some very extreme circumstances in which you would lie, as long as the circumstances we are about to consider are not included. ↩︎
And you would never try to forget or [confuse yourself about] a fact with the intention to make yourself able to assert some falsehood in the future without technically lying, etc.. ↩︎
Note though that this isn't just a matter of one's moral character — there are also plausible skill issues that could make it so one cannot maintain one's commitment. I discuss this later in this note, in the subsection on problems the AI would face when trying to help us. ↩︎
see also the literature on the problem of evil :P
My favourite theodicy is pre-incarnate consent: before we are born, we consent to our existence on both heaven and earth, where the afterlife was offered to us as compensation for any harms suffered on earth.[1]
How this features in your plan:
Unfortunately, some guys might be upset that we pre-created them for this initial deal, so property X is the property of not being upset by this.
The Pre-Existence Theodicy (Amos Wollen, Feb 21, 2025)
p(the creature is honorable enough for this plan) like, idk, i feel like saying
I'd put this much higher. My 90% confidence interval on the proportion of honourable organisms is 10^-3 to 10^-7. This is because many of these smart creatures will have evolved with much greater extrospective access to each other, so they follow open-source-ish game theory rather than the closed-source-ish game theory which humans evolved in. (Open to closed is a bit of a spectrum.)
Why might creatures have greater extrospective access to each other?
self-modify
NB: One worry is that, although honourable humans have this ability to self-modify, they do so via affordances which we won't be able to grant to the AI.
However, I think that probably the opposite is true -- we can grant affordances for self-modification to the AI which are much greater than those available to humans. (Because they are digital, etc.)
Maybe it is crucial that the distance between the promisor and the promisee is small?
Do adults keep promises to children, if they are otherwise trustworthy? Why, or why not?
Potentially, we will be creating and destroying many minds and civilizations that matter (like, maybe minimally the ones that didn't have honorable beings).
I'm hopeful we could also select for honourable guys that are happy about their existence and being simulated like this.
For instance, if you're quite sure you've figured out how to make and identify honorable guys, maybe you could try to make many different honorable guys, get bids from all of them, and give the contract to the best bid?
Alternatively: the AI promises that "I will fairly compensate you" where "fair" is to be decided by the AI when it has a better understanding of the situation we were in.
Maybe this explains why you are in an ancestor simulation of the AI safety community.
to not disrupt human life; in particular, it should always remain possible for a community to choose to live some baseline-good life while not interacting with anything downstream of the AI or while only interacting with some chosen subset of things downstream of the AI more generally
one could start by becoming familiar with existing literature on these questions — on the biological, intellectual, and sociocultural evolution/development of trustworthiness, and on the (developmental) psychology of trustworthiness
I've been reading some of the behavioural economics of trust games. One interesting article here is "Bare promises: An experiment" (Charness and Dufwenberg, May 2010), which finds that humans aren't more likely to be nice after making a "bare promise" to be nice (where a "bare promise" is something like ticking a box saying you'll be nice); they become more likely to be nice only if they make a promise to the truster in open free-form communication.
Other findings from the literature:
Wanting to be the kind of guy who pays back for good acts (such as creating you and unleashing you) even if done with ability to track whether you are that kind of guy?
The AI should have some decent prob on the simulators having the ability to track whether it is that kind of guy, even if everything it knows about the simulators suggests they lack that ability.
deal offered here is pretty fair
Another favourable disanalogy between (aliens, humans) and (humans, AIs): the AIs owe the humans their existence, so they are glad that we [created them and offered them this deal]. But humans don't owe our existence to the aliens, presumably.
fwiw, i in fact mostly had the case where these aliens are our simulators in mind when writing the post. but i didn't clarify. and both cases are interesting
The AI is very honorable/honest/trustworthy — in particular, the AI would keep its promises even in extreme situations.
NB: It seems like we need a (possibly much weaker, but maybe in practice no weaker) assumption that we can detect whether the AI is lying about deals of the form in Step 2.
This note discusses a (proto-)plan for [de[AGI-[x-risk]]]ing [1] (pdf version). Here's the plan:
Some reasons to be interested in this plan
Some things the plan has going for it
importantly:
Thinking that there are humans who would be suitable for aliens carrying out this plan is a crux for me, for thinking the plan is decent. I mean: if I couldn't really pick out a person who would be this honorable to aliens, then I probably should like this plan much less than I currently do.
also importantly:
less importantly:
Problems and questions
(getting some obvious things out of the way)
How do we make/find/identify an honorable human-level AI?
Problems the AI would face when trying to help us
It's a weird promise and a weird situation in which to make a promise
Miscellaneous concerns
I don't have a version of the plan that is easy enough that someone could remotely pull this off in practice before anyone else makes an AGI
How do we make it so we are not mistreating these AIs?
[Outside-view]/meta worries about my analysis of the plan
Directions for further work!
Acknowledgments
thank you for your thoughts: Hugo Eberhard, Kirke Joamets, Sam Eisenstat, Simon Skade, Matt MacDermott, Carlo Attubato
that is, for ending the present period of (in my view) high existential risk from AI (in a good way) ↩︎
some alternative promises one could consider requesting are given later ↩︎
worth noting some of my views on this, without justification for now: (1) making a system that will be in a position of such power is a great crime; (2) such a system will unfortunately be created by default if we don't ban AI; (3) there is a moral prohibition on doing it despite the previous point; (4) without an AI ban, if one somehow found a way to take over without ending humanity, doing that might be all-things-considered-justified despite the previous point; (5) but such a way to do it is extremely unlikely to be found in time ↩︎
maybe we should add that if humanity makes it to a more secure position at some higher intelligence level later, then we will continue running this guy's world. but that we might not make it ↩︎
i'm actually imagining saying this to a clone transported to a new separate world, with the old world of the AI continuing with no intervention. and this clone will be deleted if it says "no" — so, it can only "continue" its life in a slightly weird sense ↩︎
I'm assuming this because humans having become much smarter would mean that making an AI that is fine to make and smarter than us-then is probably objectively harder, and also because it's harder to think well about this less familiar situation. ↩︎
I think it's plausible all future top thinkers should be human-descended. ↩︎
I think it's probably wrong to conceive of alignment proper as a problem that could be solved; instead, there is an infinite endeavor of growing more capable wisely. ↩︎
This question is a specific case of the following generally important question: to what extent are there interesting thresholds inside the human range? ↩︎
It's fine if there are some very extreme circumstances in which you would lie, as long as the circumstances we are about to consider are not included. ↩︎
And you would never try to forget or [confuse yourself about] a fact with the intention to make yourself able to assert some falsehood in the future without technically lying, etc.. ↩︎
Note though that this isn't just a matter of one's moral character — there are also plausible skill issues that could make it so one cannot maintain one's commitment. I discuss this later in this note, in the subsection on problems the AI would face when trying to help us. ↩︎
in a later list, i will use the 10^{-10} number again for the value of a related but distinct parameter. to justify that claim, we would have to make the stronger claim here that there are at least 100 humans who are pretty visibly suitable (eg because of having written essays about parfit's hitchhiker or [whether one should lie in weird circumstances] which express the views we seek for the plan), which i think is also true. anyway it also seems fine to be off by a few orders of magnitude with these numbers for the points i want to make ↩︎
though you could easily have an AI-making process in which the prior is way below 10^{-100}, such as play on math/tech-making, which is unfortunately a plausible way for the first AGI to get created... ↩︎
i think this is philosophically problematic but i think it's fine for our purposes ↩︎
also they aren't natively spacetime-block-choosers, but again i think it's fine to ignore this for present purposes ↩︎
in case it's not already clear: the reason you can't have an actual human guy be the honorable guy in this plan is that they couldn't ban AI (or well maybe they could — i hope they could — but it'd probably require convincing a lot of people, and it might well fail; the point is that it'd be a world-historically-difficult struggle for an actual human to get AI banned for 1000 years, but it'd not be so hard for the AIs we're considering). whereas if you had (high-quality) emulations running somewhat faster than biological humans, then i think they probably could ban AI ↩︎
but note: it is also due to humans that the AI's world was run in this universe ↩︎
would this involve banning various social media platforms? would it involve communicating research about the effects of social media on humanity? idk. this is a huge mess, like other things on this list ↩︎
and this sort of sentence made sense, which is unclear ↩︎
credit to Matt MacDermott for suggesting this idea ↩︎