If you truly have a trusted process for creating honorable AIs, why not keep churning out and discarding different honorable AIs until one says "I promise I'm aligned" or whatever?
I suppose you're not assuming our ability to make AIs honorable like this will be robust to selection pressure?
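To spell out that worry with a toy Bayes calculation (the numbers $p$ and $q$ here are made up purely for illustration, not taken from the post): suppose the process yields an honorable AI with probability $p$, an honorable AI is willing to say "I promise I'm aligned" with probability $q$, and a non-honorable AI says it whenever asked. Conditioning on getting a "yes" then gives

$$P(\text{honorable} \mid \text{says yes}) = \frac{pq}{pq + (1-p)}.$$

With, say, $p = 0.9$ and $q = 0.3$, this is $0.27/0.37 \approx 0.73$: the guarantee you started with ($0.9$) doesn't survive selecting on the answer, and each further round of churning out and discarding selects for "says yes" rather than for honorableness.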
I agree you could ask your AI "will you promise to be aligned?". I think I already discuss this option in the post — ctrl+f "What promise should we request?" and see the stuff after it. I don't use the literal wording you suggest, but I discuss things which are ways to cash it out imo.
also quickly copying something I wrote on this question from a chat with a friend:
Should we just ask the AI to promise to be nice to us? I agree this is an option worth considering (and I mention it in the post), but I'm not that comfortable with the prospect of living together with the AI forever. Roughly, I worry that "be nice to us" creates a situation where we are living together with the AI more permanently, and human life/valuing/whatever isn't developing in a legitimate way. Whereas the "ban AI" wish tries to be a more limited thing, so we can still continue developing in our own human way. I think I can imagine this "be nice to us pls" wish going wrong for aliens employing me, in cases where "pls just ban AI and stay away from us otherwise" wouldn't go wrong for them.
another meta note: Imo a solid trick for thinking better about these AI topics is to (at least occasionally) taboo all words with the root "align".
deal offered here is pretty fair
Another favourable disanalogy between (aliens, humans) and (humans, AIs): the AIs owe the humans their existence, so they are glad that we [created them and offered them this deal]. But humans don't owe our existence to the aliens, presumably.
self-modify
NB: One worry is that, although honourable humans have this ability to self-modify, they do so via affordances which we won't be able to grant to the AI.
However, I think that probably the opposite is true -- we can grant the AI affordances for self-modification which are much greater than those available to humans. (Because they are digital, etc.)
It's also easy — if you want to be like this, you just can.
I think you can easily choose to follow a policy of never saying things you know to be false. (Easy in the sense of "considering only the internal costs of determining and executing the action consistent with this policy, ignoring the external costs, e.g. losing your job and friends".) But I'm not sure it's easy to do the extra thing of "And you would never try to forget or [confuse yourself about] a fact with the intention to make yourself able to assert some falsehood in the future without technically lying, etc."
I'd really want to read essays you wrote about Parfit's hitchhiker or one-shot prisoner's dilemmas or something
My method would look something like:
NB: I think that, perhaps, it will be easier to make/find/identify an honourable AI than an honourable human, because:
The AI is very honorable/honest/trustworthy — in particular, the AI would keep its promises even in extreme situations.
NB: It seems like we need a (possibly much weaker, but maybe in practice no weaker) assumption that we can detect whether the AI is lying about deals of the form in Step 2.
This note discusses a (proto-)plan for [de[AGI-[x-risk]]]ing [1] (pdf version). Here's the plan:
importantly:
Thinking that there are humans who would be suitable for aliens carrying out this plan is a crux for me, for thinking the plan is decent. I mean: if I couldn't really pick out a person who would be this honorable to aliens, then I probably should like this plan much less than I currently do.
also importantly:
less importantly:
thank you for your thoughts: Hugo Eberhard, Kirke Joamets, Sam Eisenstat, Simon Skade, Matt MacDermott
that is, for ending the present period of (in my view) high existential risk from AI (in a good way) ↩︎
some alternative promises one could consider requesting are given later ↩︎
worth noting some of my views on this, without justification for now: (1) making a system that will be in a position of such power is a great crime; (2) such a system will unfortunately be created by default if we don't ban AI; (3) there is a moral prohibition on doing it despite the previous point; (4) without an AI ban, if one somehow found a way to take over without ending humanity, doing that might be all-things-considered-justified despite the previous point; (5) but such a way to do it is extremely unlikely to be found in time ↩︎
maybe we should add that if humanity makes it to a more secure position at some higher intelligence level later, then we will continue running this guy's world. but that we might not make it ↩︎
i'm actually imagining saying this to a clone transported to a new separate world, with the old world of the AI continuing with no intervention. and this clone will be deleted if it says "no" — so, it can only "continue" its life in a slightly weird sense ↩︎
I'm assuming this because, if humans had become much smarter, making an AI that is fine to make and smarter than us-then would probably be objectively harder; and also because it's harder to think well about that less familiar situation. ↩︎
I think it's plausible all future top thinkers should be human-descended. ↩︎
I think it's probably wrong to conceive of alignment proper as a problem that could be solved; instead, there is an infinite endeavor of growing more capable wisely. ↩︎
This question is a specific case of the following generally important question: to what extent are there interesting thresholds inside the human range? ↩︎
It's fine if there are some very extreme circumstances in which you would lie, as long as the circumstances we are about to consider are not included. ↩︎
And you would never try to forget or [confuse yourself about] a fact with the intention to make yourself able to assert some falsehood in the future without technically lying, etc. ↩︎
Note though that this isn't just a matter of one's moral character — there are also plausible skill issues that could make it so one cannot maintain one's commitment. I discuss this later in this note, in the subsection on problems the AI would face when trying to help us. ↩︎
in a later list, i will use the number again for the value of a related but distinct parameter. to justify that claim, we would have to make the stronger claim here that there are at least humans who are pretty visibly suitable (eg because of having written essays about parfit's hitchhiker or [whether one should lie in weird circumstances] which express the views we seek for the plan), which i think is also true. anyway it also seems fine to be off by a few orders of magnitude with these numbers for the points i want to make ↩︎
though you could easily have an AI-making process in which the prior is way below , such as play on math/tech-making, which is unfortunately a plausible way for the first AGI to get created... ↩︎
i think this is philosophically problematic but i think it's fine for our purposes ↩︎
also they aren't natively spacetime-block-choosers, but again i think it's fine to ignore this for present purposes ↩︎
in case it's not already clear: the reason you can't have an actual human guy be the honorable guy in this plan is that they couldn't ban AI (or well maybe they could — i hope they could — but it'd probably require convincing a lot of people, and it might well fail; the point is that it'd be a world-historically-difficult struggle for an actual human to get AI banned for 1000 years, but it'd not be so hard for the AIs we're considering). whereas if you had (high-quality) emulations running somewhat faster than biological humans, then i think they probably could ban AI ↩︎
but note: it is also due to humans that the AI's world was run in this universe ↩︎
would this involve banning various social media platforms? would it involve communicating research about the effects of social media on humanity? idk. this is a huge mess, like other things on this list ↩︎
and this sort of sentence made sense, which is unclear ↩︎
credit to Matt MacDermott for suggesting this idea ↩︎