My research focus is Alignment Target Analysis (ATA). I noticed that the most recently published version of CEV (Parliamentarian CEV, or PCEV) gives a large amount of extra influence to people that intrinsically value hurting other individuals. For Yudkowsky's description of the issue you can search the CEV arbital page for ADDED 2023.
The fact that no one noticed this issue for over a decade shows that ATA is difficult. If PCEV had been successfully implemented, the outcome would have been massively worse than extinction. I think that this illustrates that scenarios where someone successfully hits a bad alignment target pose a serious risk. I also think that it illustrates that ATA can reduce these risks (noticing the issue reduced the probability of PCEV getting successfully implemented). The reason that more ATA is needed is that PCEV is not the only bad alignment target that might end up getting implemented. ATA is however very neglected. There does not exist a single research project dedicated to ATA. In other words: the reason that I am doing ATA is that it is a tractable and neglected way of reducing risks.
I am currently looking for collaborators. I am also looking for a grant or a position that would allow me to focus entirely on ATA for an extended period of time. Please don't hesitate to get in touch if you are curious and would like to have a chat, or if you have any feedback, comments, or questions. You can for example PM me here, or PM me on the EA Forum, or email me at thomascederborgsemail@gmail.com (that really is my email address. It's a Gavagai / Word and Object joke from my grad student days)
My background is physics as an undergrad and then AI research. Links to some papers: P1 P2 P3 P4 P5 P6 P7 P8. (no connection to any form of deep learning)
There are no Pareto improvements relative to the new Pareto Baseline that you propose. Bob would indeed classify a scenario with an AI that takes no action as a Dark Future. However, consider Bob2, who takes another perfectly coherent position on how to classify an AI that never acts. If something literally never takes any action, then Bob2 simply does not classify it as a person. Bob2 therefore does not consider a scenario with an AI that literally never does anything to be a Dark Future (other than this difference, Bob2 agrees with Bob about morality). This is also a perfectly reasonable ontology. A single person like Bob2 is enough to make the set of Pareto Improvements relative to your proposed Pareto Baseline empty.
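To make the structural point explicit, here is a toy sketch in Python (with made-up names and preference values, not anything from the actual proposal): a single individual who ranks every available action strictly below the baseline is enough to empty the set.

```python
class Individual:
    def __init__(self, name, value_fn):
        self.name = name
        self.value = value_fn  # maps a scenario to how this individual ranks it

def pareto_improvements(actions, individuals, baseline):
    """Actions that no individual ranks strictly below the baseline,
    and that at least one individual ranks strictly above it."""
    return [
        a for a in actions
        if all(ind.value(a) >= ind.value(baseline) for ind in individuals)
        and any(ind.value(a) > ind.value(baseline) for ind in individuals)
    ]

BASELINE = "an AI that never takes any action"

# Toy values: Bob2 classifies any scenario where an acting, non-punishing AI
# determines the fate of the world as a Dark Future, and ranks it below the baseline.
bob2  = Individual("Bob2",  lambda a: -1 if a != BASELINE else 0)
alice = Individual("Alice", lambda a:  1 if a != BASELINE else 0)

candidates = ["non-punishing AI, plan A", "non-punishing AI, plan B"]
print(pareto_improvements(candidates, [bob2, alice], BASELINE))  # -> []
```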
(As a tangent, I just want to explicitly note here that this discussion is about Pareto Baselines. Not Negotiation Baselines. The negotiation baseline in all scenarios discussed in this exchange is still Yudkowsky's proposed Random Dictator negotiation baseline. The Pareto Baseline is relevant to the set of actions under consideration in the Random Dictator negotiation baseline. But it is a distinct concept. I just wanted to make this explicit for the sake of any reader that is only skimming this exchange)
The real thing that you are dealing with is a set of many trillions of hard constraints, defined in billions of ontologies (including a large number of non-standard ontologies, some presumably a lot more strange than the ontologies of Bob and Bob2). The concept of a Pareto Improvement was really not designed to operate in a context like this. It seems to me that it has never been properly explored in a context like this. I doubt that anyone has ever really thought deeply about how this concept would actually behave in the AI context. Few concepts have actually been properly explored in the AI context (this is related to the fact that the Random Dictator negotiation baseline actually works perfectly fine in the context that it was originally designed for: a single individual trying to deal with Moral Uncertainty. Something similar is also true for the Condorcet Criterion. The intuition failures that seem to happen when people move concepts from CEVI style mappings to CEVH style mappings are also related. Etc, etc, etc. There simply does not seem to exist a workable alternative to actually exploring a concept in whatever AI context one wants to use it in. Simply importing concepts from other contexts just does not seem to be a reliable way of doing things. This state of affairs is extremely inconvenient).
Let's consider the economist Erik, who claims that Erik's Policy Modification (EPM) is a Pareto Improvement over current policy. Consider someone pointing out to Erik that some people want heretics to burn in hell, and that EPM would be bad for such people, since it would make life better for heretics in expectation. If Erik does decide to respond, he would presumably say something along the lines of: it is not the job of economic policy to satisfy people like this. He probably never explicitly decided to ignore such people. But his entire field is based on the assumption that such people do not need to be taken into consideration when outlining economic policy. When having a political argument about economic policy, such people are in fact not really an obstacle (if they do participate, they will presumably oppose EPM with arguments that do not mention hellfire). The implicit assumption that such positions can be ignored thus holds in the context of debating economic policy. But this assumption breaks when we move the concept to the AI context (where every single type of fanatic is informed, extrapolated, and actually given a very real, and absolute, veto over every single thing that is seen as important enough).
Let's look a bit at another Pareto Baseline that might make it easier to see the problem from a different angle (this thought experiment is also relevant to some straightforward ways in which one might further modify your proposed Pareto Baseline in response to Bob2). Consider the Unpleasant Pareto Baseline (UPB). In UPB the AI implements some approximation of everyone burning in hell (specifically: the AI makes everyone experience the sensation of being on fire for as long as it can). It turns out that it only takes two people to render the set of Pareto Improvements relative to UPB empty: Gregg and Jeff from my response to Davidad's comment. Both want to hurt heretics, but they disagree about who is a heretic. Due to incompatibilities in their respective religions, every conceivable mind is seen as a heretic by at least one of them. Improving the situation of a heretic is Not Allowed. Improving the situation of any conceivable person, in any conceivable way, is thus making things worse from the perspective of at least one of them.
Gregg and Jeff do have to be a lot more extreme than Bob or Bob2. They might for example be non-neurotypical (perhaps sharing a condition that has not yet been discovered), and raised in deeply religious environments whose respective rules they have adopted in an extremely rigid way. So they are certainly rare. But there only needs to be two people like this for the set of Pareto Improvements relative to UPB to be empty. (Presumably no one would ever consider building an AI with UPB as a Pareto Baseline. This thought experiment is not meant to illustrate any form of AI risk. It's just a way of illustrating a point about attempting to simultaneously satisfy trillions of hard constraints, defined in billions of ontologies)
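The same point can be made with a toy illustration (again with made-up sets, just to show the structure): if the union of Gregg's and Jeff's heretic sets covers every conceivable mind, then any action that improves anyone's situation relative to UPB is opposed by at least one of them.

```python
# Stand-ins for "every conceivable mind" and for the two heretic classifications.
all_minds = {"Steve", "Dave", "Gregg", "Jeff"}
heretics_according_to_gregg = {"Steve", "Jeff"}
heretics_according_to_jeff  = {"Dave", "Gregg"}

# The incompatibility assumption: every mind is a heretic to at least one of them.
assert heretics_according_to_gregg | heretics_according_to_jeff == all_minds

def opposed_by_gregg_or_jeff(people_helped):
    """An action that improves the situation of these people (relative to UPB)
    is opposed by Gregg, by Jeff, or by both."""
    return bool(people_helped & heretics_according_to_gregg) or \
           bool(people_helped & heretics_according_to_jeff)

# Every non-empty set of beneficiaries triggers at least one hard veto, and an
# action that improves no one's situation is not a Pareto Improvement by definition.
print(all(opposed_by_gregg_or_jeff({m}) for m in all_minds))  # -> True
```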
(I really appreciate you engaging on this in such a thorough and well thought out manner. I don't see this line of reasoning leading to anything along the lines of a workable patch or a usable Pareto Baseline. But I'm very happy to keep pulling on these threads, to see if one of them leads to some interesting insight. So by all means: please keep pulling on whatever loose ends you can see)
Given that you agreed with most of what I said in my reply, it seems like you should also agree that it is important to analyse these types of alignment targets. But in your original comment you said that you do not think that it is important to analyse these types of alignment targets.
Let's write Multi Person Sovereign AI Proposal (MPSAIP) for an alignment target proposal to build an AI Sovereign that gets its goal from the global population (in other words: the type of alignment target proposals that I was analysing in the post). I followed your links and can only find one argument against the urgency of analysing MPSAIPs now: that an Instruction Following AI (IFAI) would make this unnecessary. I can see why one might expect that an IFAI would help to some degree when analysing MPSAIPs. But I don't see how the idea of an IFAI could possibly remove the urgent need to analyse MPSAIPs now.
In your post on distinguishing value alignment from intent alignment, you define value alignment as being about all of humanity's long term, implicit deep values. It thus seems like you are not talking about anything along the lines of building an AI that will do whatever some specific person wants that AI to do. Please correct me if I'm wrong, but your position thus seems to be built on the assumption that an IFAI can safely be relied upon to solve the problem of how to describe all of humanity's long term, implicit deep values.
A brief summary of why I think that this is false: you simply cannot delegate the task of picking a goal to an AI (no matter how clever this AI is). You can define the goal indirectly and have the AI work out the details. But the task is simply not possible to delegate. For the same reason, you simply cannot delegate the task of picking an MPSAIP to an AI (no matter how clever this AI is). You can define things indirectly and have the AI work out the details. But doing that is equivalent to fully solving the field of MPSAIP analysis. It would for example necessarily involve defining some procedure for dealing with disagreements amongst individuals that disagree on how to deal with disagreements (because individuals will not agree on which MPSAIP to pick). PCEV is one such procedure. It sounds reasonable but would lead to an outcome far worse than extinction. VarAI is another procedure that sounds reasonable but that is in fact deeply problematic. As shown in the post, this is not easy (partly because intuitions about well known concepts tend to break when transferred to the AI context). In other words: you can't count on an IFAI to notice a bad MPSAIP, for the same reason that you can't count on Clippy to figure out that it has the wrong goal.
I can see why one might think that an IFAI would be somewhat useful. But I don't see how one can be confident that it would be very useful (let alone be equivalent to a solution). If one does not hold this position, then the existence of an IFAI does not remove the need to analyse MPSAIPs now. (The idea that an IFAI might be counted on to buy sufficient time to analyse MPSAIPs is covered below, in the section where I answer your question about an AI pause).
The idea that an IFAI would be extremely useful for Alignment Target Analysis seems to be very common. But there is never any actual reason given for why this might be true. In other words: while I have heard similar ideas many times, I have never been able to get any actual argument in favour of the position that an IFAI would be very useful for analysing MPSAIPs (from you or from anyone else). It is either implicit in some argument, or just flatly asserted. There seem to be two versions of this idea. One version is the delegation plan. In other words: the plan where one builds an IFAI that does know how to describe all of humanity's long term, implicit deep values. The other version is the assistant plan. In other words: the plan where one builds an IFAI that does not know how to describe all of humanity's long term, implicit deep values (and then uses that IFAI as an assistant while analysing MPSAIPs). I will cover them separately below.
Starting with the delegation plan: I don't know how this plan could possibly remove the need for analysing MPSAIPs now. I don't know why anyone would believe this (similarly to how I don't know why anyone would believe that Clippy can be counted on to figure out that it has the wrong goal). It is clearly a common position. But as far as I am aware, there exists no positive argument for this position. Without any actual argument in favour of this position, it is a bit tricky to argue against it. But I will do my best.
A preliminary point is that the task of picking one specific mapping, that maps from billions of humans to an entity of the type that can be said to want things, is not a technical task with a findable solution (see the post for much more on this). In yet other words: if one were to actually describe in detail the argument that one can delegate the task of analysing MPSAIPs to an IFAI, then one would run into a logical problem (if one tried to actually spell out the details step by step, one would be unable to do so). The problem one would run into would be the same problem that one would run into if one were to try to argue that Clippy will figure out that it has the wrong goal (if one tried to actually spell out the details step by step, one would be unable to do so). Neither finding the correct goal nor analysing MPSAIPs is a technical task with a findable solution. Thus, neither task can be delegated to an AI, no matter how clever it is.
Let's say that we have an IFAI that is able to give an answer, when you ask it how to describe all of humanity's long term, implicit deep values. This is equivalent to the IFAI having already picked a specific MPSAIP.
I see only two ways of arriving at such an IFAI. One is that something has gone wrong, and the IFAI has settled on an answer by following some process that the designers did not intend it to follow. This is a catastrophic implementation failure. In other words: unless the plan was for the IFAI to choose an MPSAIP using some unknown procedure, the project has not gone according to plan. In this case I see no particular reason to think that the outcome would be any better than the horrors implied by PCEV.
The only other option that I see is that the designers have already fully solved the problem of how to define all of humanity's long term, implicit deep values (presumably indirectly, by defining a process that leads to such a definition). In other words: if one plans to build an IFAI like this, then one has to fully solve the entire field of analysing MPSAIPs, before one builds the IFAI. In yet other words: if this is the plan, then this plan is an argument in favour of the urgent need to analyse MPSAIPs.
Turning to the assistant plan: to conclude that analysing MPSAIPs now is not urgent, one must assume that this type of IFAI assistant is guaranteed to have a very dramatic positive effect (a somewhat useful IFAI assistant would not remove the urgent need for analysing MPSAIPs now). It seems to be common to simply assume that an IFAI assistant will basically render prior work on analysing MPSAIPs redundant (the terminology differs. And it is often only implicit in some argument or plan. But the assumption is common). I have however never seen any detailed plan for how this would actually be done. (The situation is similar to how the delegation plan is never actually spelled out). I think that as soon as one were to lay out the details of how this would work, one would realise that one has a plan that is built on top of an incorrect assumption (similar to the type of incorrect implicit assumption that one would find, if one were to spell out the details of why exactly Clippy can be counted on to realise that it has the wrong goal).
It is difficult to argue against this position directly, since I don't know how this IFAI is supposed to be used (let alone why this would be guaranteed to have a large positive effect). But I will try to at least point to some difficulties that one would run into.
Let's say that Allan is asking the IFAI questions, as a step in the process of analysing MPSAIPs. Every question Allan asks of an IFAI like this would pose a very dramatic risk. Allan is leaning heavily on a set of definitions, for example definitions of concepts like Explanation and Understanding. Even if those definitions have held up while the IFAI was used to do other things (such as shutting down competing AI projects), those definitions could easily break when discussing MPSAIPs. Since the IFAI does not know what a bad MPSAIP is, the IFAI has no way of noticing that it is steering Allan towards a catastrophically bad MPSAIP. Regardless of how clever the IFAI is, there is simply no chance of it noticing this. Just as there is no chance of Clippy discovering that it has the wrong goal.
In other words: if a definition of Explanation breaks during a discussion with an IFAI, and Allan ends up convinced that he must implement PCEV, then we will end up with the horrors implied by PCEV. (If you think that the IFAI will recognise the outcome implied by PCEV as a bad outcome, then you are imagining the type of IFAI that was treated in the previous subsection (and such an IFAI can only be built after the field of analysing MPSAIPs has been fully solved)). This was previously discussed here and here (using different terminology).
(To be clear: this subsection is not arguing against the plan of building an IFAI of this type. And it is not arguing against the idea that this type of IFAI might be somewhat useful. It is not even arguing against the idea that it might be possible to use an IFAI like this in a way that dramatically increases the ability to analyse MPSAIPs. It is simply arguing against the idea that one can be sure that an IFAI like this will in fact be used in a way that will dramatically increase the ability to analyse MPSAIPs. This is enough to show that the IFAI idea does not remove the urgent need to analyse MPSAIPs now).
The probability of a politically enforced pause is not important for any argument that I am trying to make. Not much changes if we replace a politically enforced pause with an IFAI. Some group of humans will still decide what type of Sovereign AI will eventually be built. If they successfully implement a bad Sovereign AI proposal, then the outcome could be massively worse than extinction. So it makes sense to reduce the probability of that. One tractable way of reducing this probability is by analysing MPSAIPs.
In other words: if you achieve a pause by doing something other than building an AI Sovereign (for example by implementing a politically enforced pause, or by using an IFAI), then the decision of what AI Sovereign to eventually build will remain in human hands. So then you will still need progress on analysing MPSAIPs to avoid bad Sovereign AI proposals. There is no way of knowing how long it will take to achieve the needed level of such progress. And there is no way of knowing how much time a pause will actually result in. So even if we knew exactly what method will be used to shut down competing projects, and exactly who will make decisions regarding Sovereign AI, there would still be no way of knowing that there will be sufficient time to analyse MPSAIPs. Therefore, such analysis should start now. (And as illustrated by my post, such progress is tractable).
One point that should be made here is that you can end up with a multipolar world even if there is a single IFAI that flawlessly shuts down all unauthorised AI projects. If a single IFAI is under the control of some set of existing political power structures, then this would be a multipolar world. Regardless of who is in control (for example the UN Security Council (UNSC), the UN general assembly, or some other formalisation of global power structures), it is still possible for some ordinary political movement to gain power over the IFAI, by ordinary political means. Elected governments can be voted out. Governments along the lines of the USSR can evidently also be brought down by ordinary forms of political movements. So there is in general nothing strange about someone being in control of an IFAI, but finding themselves in a situation where they must either act quickly and decisively, or risk permanently losing control to people with different values. This means that shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure.
Let's consider the scenario where a UNSC resolution is needed to ask the IFAI a question, or to ask the IFAI to do something (such as shutting down competing AI projects, or launching an AI Sovereign). There is currently an agreement on what AI Sovereign to build. But there is also an agreement that it would be good to first analyse this proposal a bit more, to make sure there is no hidden problem with it. In this case, losing control of any of the five countries with a veto would remove the ability to launch an AI Sovereign (if control is lost to a novel and growing political movement, then control could be lost permanently. Losing control of one permanent UNSC member could mean that a deadlock persists until the new movement eventually controls all five). So the people currently in control would basically have to either act quickly or risk permanently losing power to people with different values. If they decide to aim at their preferred MPSAIP, then it would be very nice if the field of analysing MPSAIPs had progressed to the point where it is possible to notice that this MPSAIP implies an outcome worse than extinction (for example along the lines of the outcome implied by PCEV, but presumably due to a harder-to-notice problem).
I used the UNSC as an example in the preceding paragraph, because it seems to me like the only legal way of taking the actions that would be necessary to robustly shut down all competing AI projects (being the only legal option, and thus a sort of default option, makes it difficult to rule out this scenario). But the same type of Internal Time Pressure might also arise in other arrangements. This comment outlines a scenario where a global electorate is in charge (which seems like another reasonable candidate for how to define what it means to do the default thing). This post outlines a scenario where a group of augmented humans are in charge (in that scenario buying time is achieved by uploading, not by shutting down competing AI projects. This seems like something that someone might do if they don't feel comfortable with using force, but simultaneously don't feel ready to take the decision to give up control to some specific political process).
The reason that I keep going on about the need for Alignment Target Analysis (ATA) is that there currently seem to be exactly zero people in the world devoted to doing ATA full time. Making enough ATA progress to reduce the probability of bad outcomes is also tractable (trying to solve ATA would be a completely different thing. But there still exists a lot of low hanging fruit in terms of ATA progress that reduces the probability of bad outcomes). It thus seems entirely possible to me that we will end up with a PCEV style catastrophe that could have been easily prevented. Reducing the probability of that seems like a reasonable thing to do. But it is not being done.
At our current level of ATA progress it is clearly not possible to reliably tell a good alignment target from an alignment target that implies an outcome massively worse than extinction. I simply don't see how one can think that it is safe to stay at this level of progress. Intuitively this seems like a dangerous situation. The fact that there exists no research project dedicated to improving this situation seems like a mistake (as illustrated by my post, reducing the probability of bad outcomes is a tractable research project). It seems like many people do have some reason for thinking that the current state of affairs is acceptable. As far as I can tell however, these reasons are not made public. This is why I think that it makes sense to spend time on trying to figure out what you believe to be true, and why you believe it to be true (and this is also why I appreciate you engaging on this).
In other words: arguing that ATA should be a small percentage of AI safety work would be one type of argument. Arguing that the current situation is reasonable would be a fundamentally different type of argument. It is clearly the case that plenty of people are convinced that it is reasonable to stay at the current level of ATA progress (in other words: many people are acting in a way that I can only explain if I assume that they feel very confident that it is safe to stay at our current level of ATA progress). I think that they are wrong about this. But since no argument in favour of this position is ever outlined in detail, there is no real way of addressing this directly.
I'm fine with continuing this discussion here. But it probably makes sense to at least note that it would have fitted better under this post (which makes the case for analysing these types of alignment targets. And actually discusses the specific topic of why various types of Non-Sovereign-AIs would not replace doing this now). As a tangent, the end of that post actually explicitly asked people to outline their reasons for thinking that ATA now is not needed. Your response here seems to be an example of this. So I very much appreciate your engagement on this. In other words: I don't think you are the only one that has ideas along these lines. I think that there are plenty of people with similar ways of looking at things. And I really wish that those people would clearly outline their reasons for thinking that the current situation is reasonable. Because I think that those reasons will fall apart if they are outlined in any actual detail. So I really appreciate that you are engaging on this. And I really wish that more people would do the same.
I'm sorry if the list below looks like nitpicking. But I really do think that these distinctions are important.
Bob holds 1 as a value. Not as a belief.
Bob does not hold 2 as a belief or as a value. Bob thinks that someone as powerful as the AI has an obligation to punish someone like Dave. But that is not the same as 2.
Bob does not hold 3 as a belief or as a value. Bob thinks that for someone as powerful as the AI, the specific moral outrage in question renders the AI unethical. But that is not the same as 3.
Bob does hold 4 as a value. But it is worth noting that 4 does not describe anything load-bearing. The thought experiment would still work even if Bob did not think that the act of creating an unethical agent that determines the fate of the world is morally forbidden. The load-bearing part is that Bob really does not want the fate of the world to be determined by an unethical AI (and thus prefers the scenario where this does not happen).
Bob does not hold 5 as a belief or as a value. Bob prefers a scenario without an AI to a scenario where the fate of the world was determined by an unethical AI. But that is not the same as 5. The description I gave of Bob does not in any way conflict with Bob thinking that most morally forbidden acts can be compensated for by expressing sincere regret at some later point in time. The description of Bob would even be consistent with Bob thinking that almost all morally forbidden acts can be compensated for by writing a big enough check. He just thinks that the specific moral outrage in question directly means that the AI committing it is unethical. In other words: other actions are simply not taken into consideration when going from this specific moral outrage to the classification of the AI as unethical. (He also thinks that a scenario where the fate of the world is determined by an unethical AI is really bad. This opinion is also not taking any other aspects of the scenario into account. Perhaps this is what you were getting at with point 5).
I insist on these distinctions because the moral framework that I was trying to describe is importantly different from what is described by these points. The general type of moral sentiment that I was trying to describe is actually a very common, and also a very simple, type of moral sentiment. In other words: Bob's morality is (i): far more common, (ii): far simpler, and (iii): far more stable, compared to the morality described by these points. Bob's general type of moral sentiment can be described as: a specific moral outrage renders the person committing it unethical in a direct way. Not in a secondary way (meaning that there is for example no summing of any kind going on. There is no sense in which the moral outrage in question is in any way compared to any other set of actions. There is no sense in which any other action plays any part whatsoever when Bob classifies the AI as unethical).
In yet other words: the link from this specific moral outrage to classification as unethical is direct. The AI doing nice things later is thus simply not related in any way to this classification. Plenty of classifications work like this. Allan will remain a murderer, no matter what he does after committing a murder. John will remain a military veteran, no matter what he does after his military service. Jeff will remain an Olympic gold winner, no matter what he does after winning that medal. Just as for Allan, John, and Jeff, the classification used to determine that the AI is unethical is simply not taking other actions into account.
The classification is also not the result of any real chain of reasoning. There is no sense in which Bob first concludes that the moral outrage in question should be classified as morally forbidden, followed by Bob then deciding to adhere to a rule which states that all morally forbidden things should lead to the unethical classification (and Bob has no such rule).
This general type of moral sentiment is not universal. But it is quite common. Lots of people can think of at least one specific moral outrage that leads directly to them viewing a person committing it as unethical (at least when committed deliberately by a grownup that is informed, sober, mentally stable, etc). In other words: lots of people would be able to identify at least one specific moral outrage (perhaps out of a very large set of other moral outrages). And say that this specific moral outrage directly implies that the person is unethical. Different people obviously do not agree on which subset of all moral outrages should be treated like this (even people that agree on what should count as a moral outrage can feel differently about this). But the general sentiment where some specific moral outrage simply means that the person committing it is unethical is common.
The main reason that I insist on the distinction is that this type of sentiment would be far more stable under reflection. There are no moving parts. There are no conditionals or calculations. Just a single, viscerally felt, implication. Attached directly to a specific moral outrage. For Bob, the specific moral outrage in question is a failure to adhere to the moral imperative to punish people like Dave.
Strong objections to the fate of the world being determined by someone unethical are not universal. But this is neither complex nor particularly rare. Let's add some details to make Bob's values a bit easier to visualise. Bob has a concept that we can call a Dark Future. It is basically referring to scenarios where Bad People win The Power Struggle and manage to get enough power to choose the path of humanity (powerful anxieties along these lines seem quite common. And for a given individual it would not be at all surprising if something along these lines eventually turned into a deeply rooted, simple, and stable intrinsic value).
A scenario where the fate of the world is determined by an unethical AI is classified as a Dark Future (again in a direct way). For Bob, the case with no AI does not classify as a Dark Future. And Bob would really like to avoid a Dark Future. People who think that it is more important to prevent bad people from winning than to prevent the world from burning might not be very common. But there is nothing complex or incoherent about this position. And the general type of sentiment (that it matters a lot who gets to determine the fate of the world) seems to be very common. Not wanting Them to win can obviously be entirely instrumental. An intrinsic value might also be overpowered by survival instinct when things get real. But there is nothing surprising about something like this eventually solidifying into a deeply held intrinsic value. Bob does sound unusually bitter and inflexible. But there only needs to be one person like Bob in a population of billions.
To summarise: a non-punishing AI is directly classified as unethical. Additional details are simply not related in any way to this classification. A trajectory where an unethical AI determines the fate of humanity is classified as a Dark Future (again in a direct way). Bob finds a Dark Future to be worse than the no AI scenario. If someone were to specifically ask him, Bob might say that he would rather see the world burn than see Them win. But if left alone to think about this, the world burning in the non-AI scenario is simply not the type of thing that is relevant to the choice (when the alternative is a Dark Future).
First I just want to again emphasise that the question is not if extrapolation will change one specific individual named Bob. The question is whether or not extrapolation will change everyone with these types of values. Some people might indeed change due to extrapolation.
My main issue with the point about moral realism is that I don't see why it would change anything (even if we only consider one specific individual, and also assume moral realism). I don't see why discovering that The Objectively Correct Morality disagrees with Bob's values would change anything (I strongly doubt that this sentence means anything. But for the rest of this paragraph I will reason from the assumption that it both does mean something, and that it is true). Unless Bob has very strong meta preferences related to this, the only difference would presumably be to rephrase everything in the terminology of Bob's values. For example: extrapolated Bob would then really not want the fate of the world to be determined by an AI that is in strong conflict with Bob's values (not punishing Dave directly implies a strong value conflict. The fate of the world being determined by someone with a strong value conflict directly implies a Dark Future. And nothing has changed regarding Bob's attitude towards a Dark Future). As long as this is stronger than any meta preferences Bob might have regarding The Objectively Correct Morality, nothing important changes (Bob might end up needing a new word for someone that is in strong conflict with Bob's values. But I don't see why this would change Bob's opinion regarding the relative desirability of a scenario that contains a non-punishing AI, compared to the scenario where there is no AI).
I'm not sure what role coherence arguments would play here.
It is the AI creating these successor AIs that is the problem for Bob (not the successor AIs themselves). The act of creating a successor AI that is unable to punish is morally equivalent to not punishing. It does not change anything. Similarly: the act of creating a lot of human level AIs is in itself determining the fate of the world (even if these successor AIs do not have the ability to determine the fate of the world).
I'm not sure I understand this paragraph. I agree that if the set is not empty, then a clever AI will presumably find an action that is a Pareto Improvement. But I am not saying that a Pareto Improvement exists and is merely difficult to find. I am saying that at least one person will demand X and that at least one person will refuse X. Which means that a clever AI will just use its cleverness to confirm that the set is indeed empty.
I'm not sure that the following is actually responding to something that you are saying (since I don't know if I understand what you mean). But it seems relevant to point out that the Pareto constraint is part of the AIs goal definition. Which in turn means that before determining the members of the set of Pareto Improvements, there is no sense in which there exists a clever AI that is trying to make things work out well. In other words: there does not exist any clever AI, that has the goal of making the set non-empty. No one has, for example, an incentive to tweak the extrapolation definitions to make the set non-empty.
Also: in the proposal in question, extrapolated delegates are presented with a set. Their role is then supposed to be to negotiate about actions in this set. I am saying that they will be presented with an empty set (produced by an AI that has no motivation to bend rules to make this set non-empty). If various coalitions of delegates are able to expand this set with clever tricks, then this would be a very different proposal (or a failure to implement the proposal in question). This alternative proposal would for example lack the protections for individuals that the Pareto constraint is supposed to provide. Because the delegates of various types of fanatics could then also use clever tricks to expand the set of actions under consideration. The delegates of various factions of fanatics could then find clever ways of getting various ways of punishing heretics into the set of actions that are on the table during negotiations (which brings us back to the horrors implied by PCEV). Successful implementation of Pareto PCEV implies that the delegates are forced to abide by the various rules governing their negotiations (similar to how successful implementation of classical PCEV implies that the delegates have successfully been kept in the dark regarding how votes are actually settled).
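To spell out the ordering I am assuming here (a minimal sketch with hypothetical function names, not a specification of the actual proposal): the Pareto constraint is applied as part of the goal definition, before any delegate sees anything, and nothing at that stage is trying to make the resulting set non-empty.

```python
def pareto_pcev(people, baseline, candidate_actions, extrapolate, negotiate):
    # Step 1: apply the Pareto constraint exactly as given. There is no
    # optimisation pressure here aimed at producing a non-empty set.
    admissible = [
        a for a in candidate_actions
        if all(p.value(a) >= p.value(baseline) for p in people)
        and any(p.value(a) > p.value(baseline) for p in people)
    ]
    if not admissible:
        return None  # an empty set leaves the delegates with nothing to negotiate over

    # Step 2: extrapolated delegates negotiate, but only over the admissible set.
    # If coalitions of delegates could expand this set, it would be a different
    # proposal (one without the protections the constraint is supposed to provide).
    delegates = [extrapolate(p) for p in people]
    return negotiate(delegates, admissible)
```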
This last section is not a direct response to anything that you wrote. In particular, the points below are not meant as arguments against things that you have been advocating for. I just thought that this would be a good place to make a few points, that are related to the general topics that we are discussing in this thread (there is no post dedicated to Pareto PCEV, so this is a reasonable place to elaborate on some points related specifically to PPCEV).
I think that if one only takes into account the opinions of a group that is small enough for a Pareto Improvement to exist, then the outcome would be completely dominated by people that are sort of like Bob, but that are just barely possible to bribe (for the same reason that PCEV is dominated by such people). The bribe would not primarily be about resources, but about what conditions various people should live under. I think that such an outcome would be worse than extinction from the perspective of many people that are not part of the group being taken into consideration (including from the perspective of people like Bob. But also from the perspective of people like Dave). And it would just barely be better than extinction for many in that group.
I similarly think that if one takes the full population, but bends the rules until one gets a non-empty set of things that sort of look close to Pareto Improvements, then the outcome will also be dominated by people like Bob (for the same reason that PCEV is dominated by people like Bob). Which in turn implies a worse-than-extinction outcome (in expectation, from the perspective of most individuals).
In other words: I think that if one goes looking for coherent proposals that are sort of adjacent to this idea, then one would tend to find proposals that imply very bad outcomes. For the same reasons that proposals along the lines of PCEV imply very bad outcomes. A brief explanation of why I think this: if one tweaks this proposal until it refers to something coherent, then Steve has no meaningful influence regarding the adoption of those preferences that refer to Steve. Because when one is transforming this into something coherent, Steve cannot retain influence over everything that he cares about strongly enough (as this would result in overlap). And there is nothing in this proposal that gives Steve any special influence regarding the adoption of those preferences that refer to Steve. Thus, in adjacent-but-coherent proposals, Steve will have no reason to expect that the resulting AI will want to help Steve, as opposed to want to hurt Steve.
It might also be useful to zoom out a bit from the specific conflict between what Bob wants and what Dave wants. I think that it would be useful to view the Pareto constraint as many individual constraints. This set of constraints would include many hard constraints. In particular, it would include many trillions of hard individual-to-individual constraints (including constraints coming from a significant percentage of the global population, who have non-negotiable opinions regarding the fates of billions of other individuals). It is an equivalent but more useful way of representing the same thing. (In addition to being quite large, this set would also be very diverse. It would include hard constraints from many different kinds of non-standard minds. With many different kinds of non-standard ways of looking at things. And many different kinds of non-standard ontologies. Including many types of non-standard ontologies that the designers never considered). We can now describe alternative proposals where Steve gets a say regarding those constraints that only refer to Steve. If one is determined to start from Pareto PCEV, then I think that this is a much more promising path to explore (as opposed to exploring different ways of bending the rules until every single hard constraint can be simultaneously satisfied).
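One way to picture this reframing (a toy representation of my own, with hypothetical fields): each hard constraint is held by one person and refers to some set of people, which makes it possible to describe alternatives where Steve gets a say over exactly those constraints that refer only to Steve.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HardConstraint:
    held_by: str          # the person whose (extrapolated) values generate the constraint
    refers_to: frozenset  # the people whose treatment the constraint is about
    description: str

constraints = {
    HardConstraint("Bob",  frozenset({"Dave"}), "Dave must be punished"),
    HardConstraint("Dave", frozenset({"Dave"}), "Dave must not be punished"),
}

def constraints_referring_only_to(person, constraint_set):
    """The subset of constraints whose subject matter is this one person.
    An alternative proposal could give that person a say over this subset."""
    return {c for c in constraint_set if c.refers_to == frozenset({person})}

print(constraints_referring_only_to("Dave", constraints))  # both toy constraints above
```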
I also think that it would be a very bad idea to go looking for an extrapolation dynamic that re-writes Bob's values in a way that makes Bob stop wanting Dave to be punished (or that makes Bob bribable). I think that extrapolating Bob in an honest way, followed by giving Dave a say regarding those constraints that refer to Dave, is a more promising place to start looking for ways of keeping Dave safe from people like Bob. I for example think that this is less likely to result in unforeseen side effects (extrapolation is problematic enough without this type of added complexity. The option of designing different extrapolation dynamics for different groups of people is a bad option. The option of tweaking an extrapolation dynamic that will be used on everyone, with the intent of finding some mapping that will turn Bob into a safe person, is also a bad option).
Bob really does not want the fate of the world to be determined by an unethical AI. There is no reason for such a position to be instrumental. For Bob, this would be worse than the scenario with no AI (in the Davidad proposal, this is the baseline that is used to determine whether or not something is a Pareto-improvement). Both scenarios contain non-punished heretics. But only one scenario contains an unethical AI. Bob prefers the scenario without an unethical AI (for non-instrumental reasons).
The question is whether or not at least one person will continue to view a non-punishing AI as unethical after extrapolation. (When determining whether or not something is a Pareto-improvement, the average fanatic is not necessarily relevant).
Many people would indeed presumably change their minds regarding the morality of at least some things (for example when learning new facts). For the set of Pareto-improvements to be empty however, you only need two people: a single fanatic and a single heretic.
In other words: for the set to be empty it is enough that a single person continues to view a single other person (that we can call Dave), as being deserving of punishment (in the sense that an AI has a moral obligation to punish Dave). The only missing component is then that Dave has to object strongly to being punished for being a heretic (this objection can actually also be entirely based on moral principles). Just two people out of billions need to take these moral positions for the set to be empty. And the building blocks that make up Bob's morality are not actually particularly rare.
The first building block of Bob's morality is that of a moral imperative (the AI is seen as unethical for failing to fulfill its moral obligation to punish heretics). In other words: if someone finds themselves in a particular situation, then they are viewed as having a moral obligation to act in a certain way. Moral instincts along the lines of moral imperatives are fairly common. A trained firefighter might be seen as having important moral obligations if encountering a burning building with people inside. An armed police officer might be seen as having important moral obligations if encountering an active shooter. Similarly for soldiers, doctors, etc. Failing to fulfill an important moral obligation is fairly commonly seen as very bad.
Let's take Allan, who witnesses a crime being committed by Gregg. If the crime is very serious, and calling the police is risk free for Allan, then failing to call the police can be seen as a very serious moral outrage. If Allan does not fulfill this moral obligation, it would not be particularly unusual for someone to view Allan as deeply unethical. This general form of moral outrage is not rare. Not every form of morality includes contingent moral imperatives. But moralities that do include such imperatives are fairly common. There is obviously a lot of disagreement regarding who has what moral obligations. Just as there are disagreements regarding what should count as a crime. But the general moral instinct (that someone like Allan can be deeply unethical) is not exotic or strange.
The obligation to punish bad people is also not particularly rare. Considering someone to be unethical because they get along with a bad person is not an exotic or rare type of moral instinct. It is not universal. But it is very common.
And the specific moral position that heretics deserve to burn in hell is actually quite commonly expressed. We can argue about what percentage of people saying this actually mean it. But surely we can agree that there exist at least some people that actually mean what they say.
The final building block in Bob's morality is objecting to having the fate of the world be determined by someone unethical. I don't think that this is a particularly unusual thing to object to (on entirely non-instrumental grounds). Many people care deeply about how a given outcome is achieved.
Some people that express positions along the lines of Bob might indeed back down if things get real. I think that for some people, survival instinct would in fact override any moral outrage. Especially if the non-AI scenario is really bad. Some fanatics would surely blink when coming face to face with any real danger. (And some people will probably abandon their entire moral framework in a heartbeat, the second someone offers them a really nice cake). But for at least some people, morality is genuinely important. And you only need one person like Bob, out of billions, for the set to be empty.
So, if Bob is deeply attached to his moral framework, and the moral obligation to punish heretics is a core aspect of his morality, and this aspect of his morality is entirely built from ordinary and common types of moral instincts, then an extrapolated version of Bob would only accept a non-punishing AI if the extrapolation method has completely rewritten Bob's entire moral framework (in ways that Bob would find horrific).
Consider Bob, who takes morality very seriously. Bob thinks that any scenario where the fate of the world is determined by an unethical AI, is worse than the scenario with no AI. Bob sticks with this moral position, regardless of how much stuff Bob would get in a scenario with an unethical AI. For a mind as powerful as an AI, Bob considers it to be a moral imperative to ensure that heretics do not escape punishment. If a group contains at least one person like Bob (and at least one person that would strongly object to being punished), then the set of Pareto-improvements is empty. In a population of billions, there will always exist at least some people with Bob's type of morality (and plenty of people that would strongly object to being punished). Which in turn means that for humanity, there exists no powerful AI such that creating this AI would be a Pareto-improvement.
I do think that it’s important to analyse alignment targets like these. Given the severe problems that all of these alignment targets suffer from, I certainly hope that you are right about them being unlikely. I certainly hope that nothing along the lines of a Group AI will ever be successfully implemented. But I do not think that it is safe to assume this. The successful implementation of an instruction following AI would not remove the possibility that an AI Sovereign will be implemented later. The CEV arbital page actually assumes that the path to a Group AI goes through an initial limited AI (referred to as a Task AI). In other words: the classical proposed path to an AI that implements the CEV of Humanity actually starts with an initial AI that is not an AI Sovereign (and such an AI could for example be the type of instruction following AI that you mention). In yet other words: your proposed AI is not an alternative to a Group AI. Its successful implementation does not prevent the later implementation of a Group AI. Your proposed AI is in fact one step in the classical (and still fairly popular) proposed path to a Group AI.
I actually have two previous posts that were devoted to making the case for analysing the types of alignment targets that the present post is focusing on. The present post is instead focusing on doing such analysis. This previous post outlined a comprehensive argument in favour of analysing these types of alignment targets. Another previous post specifically focused on illustrating that Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure. See also this comment where I discuss the difference between proposing solutions on the one hand, and pointing out problems on the other hand.
Charbel-Raphaël responded to my post by arguing that no Sovereign AI should ever be created. My reply pointed out that this is mostly irrelevant to the question at hand. The only relevant question is whether or not a Sovereign AI might be successfully implemented eventually. If that is the case, then one can reduce the probability of some very bad outcomes by doing the type of Alignment Target Analysis that my previous two posts were arguing for (and that the present post is an example of). The second half of this reply (later in the same thread) includes a description of an additional scenario where an initial limited AI is followed by a Sovereign AI (and this Sovereign AI is implemented without significant time spent on analysing the specific proposal, due to Internal Time Pressure).
I don't think that one can rely on this idea to prevent the outcome where a dangerous Sovereign AI proposal is successfully implemented at some later time (for example after an initial AI has been used to buy time). One issue is the difficulty of defining critical concepts such as Explanation and Understanding. I previously discussed this with Max Harms here, and with Nathan Helm-Burger here. Both of those comments are discussing attempts to make an AI pursue Corrigibility as a Singular Target (which should not be confused with my post on Corrigibility, which discussed a different type of Corrigibility).
The people actually building the stuff might not be the ones deciding what should be built. For example: if a messy coalition of governments enforces a global AI pause, then this coalition might be able to decide what will eventually be built. If a coalition is capable of successfully enforcing a global AI pause, then I don't think that we can rule out the possibility that they will be able to enforce a decision to build a specific type of AI Sovereign (they could for example do this as a second step, after first managing to gain effective control over an initial instruction following AI). If that is the case, then the proposal to build something along the lines of a Group AI might very well be one of the politically feasible options (this was previously discussed in this post and in this comment).
I thought that your Cosmic Block proposal would only block information regarding things going on inside a given Utopia. I did not think that the Cosmic Block would subject every person to forced memory deletion. As far as I can tell, this would mean removing a large portion of all memories (details below). I think that memory deletion on the implied scale would seriously complicate attempts to define an extrapolation dynamic. It also does not seem to me that it would actually patch the security hole illustrated by the thought experiment in my original comment (details below).
The first section argues that (unless Bob's basic moral framework has been dramatically changed by the memory deletion) no level of memory deletion will prevent BPA from wanting to find and hurt Steve. In brief: BPA will still be subject to the same moral imperative to find and hurt any existing heretics (including Steve).
The second section argues that BPA is likely to find Steve. In brief: BPA is a clever AI and the memory deletion is a human constructed barrier (the Advocates are extrapolations of people that have already been subjected to these memory wipes. So Advocates cannot be involved when negotiating the rules governing these memory wipes). BPA would still have access to a lot of different information sources that it can use to find Steve.
The third section argues that if BPA finds Steve, then BPA would be able to hurt Steve. In brief: creating OldSteve is still not prevented by any rule or constraint that you have mentioned so far.
The fourth section argues that the side effects of memory deletion would be severe. In brief: memories of every conversation about any deleted person would also be deleted. Besides all direct memories involving any deleted person, many indirect memories would also be deleted. This would seriously complicate extrapolation. (Extrapolation is already a very tricky definitional problem. And this definitional problem cannot be delegated to extrapolated Advocates, since they are the result of the extrapolation dynamic being defined).
The last section deals with your proposed resource destruction mechanism. In brief: in one thought experiment (that did not involve threats) almost half the population would very strongly prefer to destroy ELYSIUM. This disproved your claim that such a situation would not arise. Also: no thought experiment involved anyone trying to gain resources. And no thought experiment involved anyone issuing any form of threat that would not be followed up on. People burned resources that they did not need. They did this either to hurt people directly, or to issue genuinely credible threats.
It would not be surprising if a comprehensive, human defined, memory deletion operation would completely rewrite someone's basic moral framework. But I'm guessing that your proposed memory deletion is not supposed to be done in a way that changes a person's basic moral framework. So let's reason from the assumption that it does not.
This means that Bob still considers BPA to be bound by a non-negotiable moral imperative. So BPA still wants to find and punish any heretic that might exist.
In other words: the Cosmic Block is not the type of mechanism that might prevent BPA from wanting to find and hurt Steve. It is instead a practical obstacle that BPA needs to overcome (which is something very different). It is a human constructed practical barrier, that is supposed to protect Steve from a clever AI that wants to find and hurt Steve.
Unless the details of your proposed Cosmic Block are constructed by an AI that prevents All Bad Things, these rules must come from somewhere else. AI assisted negotiations cannot be done by the Advocates. Advocates are the result of extrapolating memory wiped people (otherwise the whole point of the Cosmic Block is lost). So the Advocates cannot be involved in defining the memory wipe rules.
In other words: unless the memory wipe rules are negotiated by a completely separate set of (previously unmentioned) AIs, the memory wipe rules will be human defined.
This means that a human constructed barrier must hold against a clever AI trying to get around it. Even if we were to know that a human defined barrier has no humanly-findable security holes, this does not mean that it will actually hold against a clever AI. A clever AI can find security holes that are not humanly-findable.
The specific situation that BPA will find itself in does not seem to be described in sufficient detail for it to be possible to outline a specific path along which BPA finds Steve. But from the currently specified rules, we do know that BPA has access to several ways of gathering information about Steve.
People can pool resources (as described in your original proposal). So Advocates can presumably ask other Advocates about potential partners for cohabitation. Consider the case where BPA is negotiating with other Advocates regarding who will be included in a potential shared environment. This decision will presumably involve information about potential candidates: whether or not a given person is accepted would depend on detailed personal information.
Advocates can also engage in mutual resource destruction to prevent computations happening within other Utopias. You describe this mechanism as involving negotiations between Advocates, regarding computations happening within other people's Utopias. Such negotiations would primarily be between the Advocates of people that have very different values. This is another potential information source about Steve.
Steve would also have left a lot of effects on the world, besides effects on people's memories. Steve might for example have had a direct impact on what type of person someone else has turned into. Deleting this impact would be even more dramatic than deleting memories.
Steve might also have had a significant impact on various group dynamics (for example: his family, the friend groups that he has been a part of, different sets of coworkers and classmates, online communities, etc). Unless all memories regarding the general group dynamics of every group that Steve has been a part of are deleted, Steve's life would have left behind many visible effects.
The situation is thus that a clever AI is trying to find and hurt Steve. There are many different types of information sources that can be combined in clever ways to find Steve. The rules of all barriers between this AI and Steve are human constructed. Even with perfect enforcement of all barriers, this still sounds like a scenario where BPA will find Steve (for the same reason that a clever AI is likely to find its way out of a human constructed box, or around a human constructed Membrane).
If BPA locates Steve, then there is nothing preventing BPA from using OldSteve to hurt Steve. What is happening to OldSteve is still not prevented by any currently specified rule. The suffering of OldSteve is entirely caused by internal dynamics. OldSteve never lacks any form of information. And the harm inflicted on OldSteve is not in any sense marginal.
I do not see any strong connections between the OldSteve thought experiment and your Scott Alexander quote (which is concerned with the question of what options and information should be provided by a government run by humans, to children raised by other humans). More generally: scenarios that include a clever AI that is specifically trying to hurt someone have a lot of unique properties (important properties that are not present in scenarios that lack such an AI). I think that these scenarios are dangerous. And I think that they should be avoided (as opposed to first created and then mitigated). (Avoiding such scenarios is a necessary, but definitely not sufficient, feature of an alignment target).
All deleted memories must be so thoroughly wiped that a clever AI will be unable to reconstruct them (otherwise the whole point of the Cosmic Block is negated). Deleting all memories of a single important negative interpersonal relationship would be a huge modification. Even just deleting all memories of one famous person that served as a role model would be significant.
Thoroughly deleting your memory of a person would also impact your memory of every conversation that you have ever had about this person. Including conversations with people that are not deleted. Most long term social relationships involve a lot of discussions of other people (one person describing past experiences to the other, discussions of people that both know personally, arguments over politicians or celebrities, etc, etc). Thus, the memory deletion would significantly alter the memories of essentially all significant social relationships. This is not a minor thing to do to a person. (That every person would be subjected to this is not obviously implied by the text in The ELYSIUM Proposal.)
In other words: even memories of non deleted people would be severely modified. For example: every discussion or argument about a deleted person would be deleted. Two people (that do not delete each other) might suddenly have no idea why they almost cut all contact a few years ago, and why their interactions have been so different for the last few years. Either their Advocates can reconstruct the relevant information (in which case the deletion does not serve its purpose). Or their Advocates must try to extrapolate them while lacking a lot of information.
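To make the cascade concrete, here is a minimal sketch (the memory representation, the names, and the deletion function are all made up for illustration; nothing here is taken from The ELYSIUM Proposal):

```python
# Toy illustration: deleting one person also alters memories that involve people
# who were never deleted.
memories = {
    "Alice": [
        {"about": {"Steve"}, "what": "argument with Steve in 2019"},
        {"about": {"Carol", "Steve"}, "what": "long talk with Carol about Steve"},
        {"about": {"Carol"}, "what": "hiking trip with Carol"},
    ],
}

def delete_person(memories, deleted_person):
    """Remove every memory that involves the deleted person, even indirectly."""
    return {
        owner: [m for m in mems if deleted_person not in m["about"]]
        for owner, mems in memories.items()
    }

after = delete_person(memories, "Steve")
print([m["what"] for m in after["Alice"]])
# ['hiking trip with Carol'] -- Alice's memory of her relationship with Carol
# (who was not deleted) has been altered as a side effect of deleting Steve.
```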
Getting the definitions involved in extrapolation right seems like it will be very difficult even under ordinary circumstances. Wide ranging and very thorough memory deletion would presumably make extrapolation even more tricky. This is a major issue.
No one in any of my thought experiments was trying to get more resources. The 55 percent majority (and the group of 10 people) have a lot of resources that they do not care much about. They want to create some form of existence for themselves. This only takes a fraction of available resources to set up. They can then burn the rest of their resources on actions within the resource destruction mechanism. They either burn these resources to directly hurt people. Or they risk these resources by making threats that are completely credible. In the thought experiments where someone does issue a threat, the threat is issued because the threatening party's preference ordering is: a person giving in > burning resources to hurt someone who refuses > leaving someone that refuses alone. They are perfectly ok with an outcome where resources are spent on hurting someone that refuses to comply (they are not self-modifying as a negotiation strategy. They just think that this is a perfectly ok outcome).
Preventing this type of threat would be difficult because (i): negotiations are allowed, and (ii): in any scenario where threats are prevented, the threatened action would simply be taken anyway (for non-strategic reasons). There is no difference in behaviour between scenarios where threats are prevented, and scenarios where threats are ignored.
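To illustrate why preventing threats changes nothing here, the sketch below encodes the preference ordering described above as a toy decision rule (the utility numbers are arbitrary placeholders that I made up; only their ordering matters, and nothing below is taken from your proposal):

```python
# Toy model of the preference ordering from the thought experiments above.
# Only the ordering of these placeholder numbers matters:
# target complies > burn resources to hurt a refuser > leave a refuser alone.
PREF = {"target_complies": 3, "burn_to_hurt_refuser": 2, "leave_refuser_alone": 1}

def actor_action(threats_possible: bool, target_complies: bool) -> str:
    """What the threatening party ends up doing, given the regime and the target's response.
    Note that threats_possible is deliberately never consulted: that is the point."""
    if target_complies:
        return "target_complies"
    # The target refuses. Whether or not a threat could be issued beforehand,
    # the actor now simply picks its preferred remaining option.
    return max(["burn_to_hurt_refuser", "leave_refuser_alone"], key=PREF.get)

for threats_possible in (True, False):
    print(threats_possible, actor_action(threats_possible, target_complies=False))
# Both lines print "burn_to_hurt_refuser": a regime that prevents threats produces
# the same behaviour as a regime where threats are ignored, because the threatened
# action is preferred for non-strategic reasons.
```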
The thought experiment where a majority burns resources to hurt a minority was a simple example scenario where almost half of the population would very strongly prefer to destroy ELYSIUM (or strongly prefer that ELYSIUM was never created). It was a response to your claim that your resource destruction mechanisms would prevent such a scenario. This thought experiment did not involve any form of threat or negotiation.
Let's call a rule that prevents the majority from hurting the minority a Minority Protection Rule (MPR). There are at least two problems with your claim that a pre-AI majority would prevent the creation of a version of ELYSIUM that has an MPR.
First: without an added MPR, the post-AI majority is able to hurt the minority without giving up anything that they care about (they burn resources they don't need). So they would prefer the version without an MPR. But that does not imply that they care enough to try to prevent the creation of a version of ELYSIUM with an MPR: doing so would presumably be very risky, and they would not gain anything that they care much about. When hurting the minority does not cost them anything that they care about, they do it. That does not imply that this is an important issue for the majority.
More importantly however: you are conflating, (i): a set of un-extrapolated and un-coordinated people living in a pre-AI world, with (ii): a set of clever AI Advocates representing these same people, operating in a post-AI world. There is nothing unexpected about humans opposing an AI that would be good for them, or supporting an AI that would be bad for them (good or bad from the perspective of their extrapolated Advocates). Correcting for exactly this type of mistake is the whole point of having extrapolated Advocates.
Implementing The ELYSIUM Proposal would lead to the creation of a very large, and very diverse, set of clever AIs that want to hurt people: the Advocates of a great variety of humans who want to hurt others in a wide variety of ways, for a wide variety of reasons. Protecting billions of people from this set of clever AIs would be difficult. As far as I can tell, nothing that you have mentioned so far would provide any meaningful amount of protection from a set of clever AIs like this (details below). I think that it would be better to just not create such a set of AIs in the first place (details below).
I don't think that it is easy to find a negotiation baseline for AI-assisted negotiations that results in a negotiated settlement that actually deals with such a set of AIs. Negotiation baselines are non-trivial. Reasonable sounding negotiation baselines can have counterintuitive implications. They can imply power imbalance issues that are not immediately obvious. For example: the random dictator negotiation baseline in PCEV gives a strong negotiation advantage to people that intrinsically value hurting other humans. This went unnoticed for a long time. (It has been suggested that it might be possible to find a negotiation baseline (a BATNA) that can be viewed as having been acausally agreed upon by everyone. However, it turns out that this is not actually possible for a group of billions of humans).
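As a toy illustration of that last point about the random dictator baseline (the numbers below are made up, and this is not the actual PCEV analysis; it only shows the direction in which such a baseline moves the disagreement point):

```python
# Toy model: how a random dictator negotiation baseline shifts the fallback
# position of ordinary people when one participant intrinsically values hurting others.
N = 100                      # population size (made-up)
P_DICTATOR = 1.0 / N         # chance that any given person is picked as dictator

U_NICE = 10.0                # utility an ordinary person gets from a "nice" dictator outcome
U_HELL = -1000.0             # utility an ordinary person gets if the hurter is dictator

# Disagreement point for an ordinary person when everyone is ordinary:
baseline_without_hurter = U_NICE

# Disagreement point for an ordinary person when one participant wants to hurt people:
baseline_with_hurter = (1 - P_DICTATOR) * U_NICE + P_DICTATOR * U_HELL

print(baseline_without_hurter)   # 10.0
print(baseline_with_hurter)      # -0.1
# The mere presence of the hurter drags every ordinary person's fallback position
# down, so they will accept worse negotiated settlements. The hurter gets a
# negotiation advantage that is out of proportion to their numbers.
```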
10 people without any large resource needs could use this mechanism to kill 9 people they don't like at basically no cost (defining C as any computation done within the Utopia of the person they want to kill). Consider 10 people that just want to live a long life, and that do not have any particular use for most of the resources they have available. They can destroy all computational resources of 9 people without giving up anything that they care about. This also means that they can make credible threats. Especially if they like the idea of killing someone for refusing to modify the way that she lives her life. They can do this with person after person, until they have run into 9 people that prefer death to compliance. Doing this costs them basically nothing.
This mechanism does not rule out scenarios where a lot of people would strongly prefer to destroy ELYSIUM. A trivial example would be a 55 percent majority (that does not have a lot of resource needs) burning 90 percent of all resources in ELYSIUM to fully disenfranchise everyone else. And then using the remaining resources to hurt the minority. In this scenario almost half of all people would very strongly prefer to destroy ELYSIUM. Such a majority could alternatively credibly threaten the minority and force them to modify the way they live their lives. The threat would be especially credible if the majority likes the scenario where a minority is punished for refusing to conform.
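A back-of-the-envelope version of the 55 percent scenario (the 1:1 destruction rate, and the assumption that the majority only cares about keeping 10 percent of all resources, are illustrative assumptions; the mechanism's actual exchange rate is not specified in the text you linked to):

```python
# Rough arithmetic for the majority-burns-resources scenario.
TOTAL = 100.0                      # total resources in ELYSIUM (arbitrary units)
majority_share = 0.55 * TOTAL      # 55 percent majority, equal per-capita endowments
minority_share = 0.45 * TOTAL

majority_keeps = 0.10 * TOTAL      # assumed: all the majority actually cares about keeping
surplus = majority_share - majority_keeps           # 45 units they do not need
destroyed_minority = min(surplus, minority_share)   # assumed 1:1 rate: 45 units, everything the minority has

resources_gone = surplus + destroyed_minority       # 45 burned + 45 destroyed
print(resources_gone / TOTAL)      # 0.9 -> "burning 90 percent of all resources"
print(majority_keeps / TOTAL)      # 0.1 -> kept by the majority, at no cost to anything they care about
```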
In other words: this mechanism seems to be incompatible with your description of personalised Utopias as the best possible place to be (subject only to a few non-intrusive ground rules).
This relies on a set of definitions. And these definitions would have to hold up against a set of clever AIs trying to break them. None of the rules that you have proposed so far would prevent the strategy used by BPA to punish Steve, outlined in my initial comment. OldSteve is hurt in a way that is not actually prevented by any rule that you have described so far. For example: the ``is torture happening here'' test would not trigger for what is happening to OldSteve. So even if Steve does in principle have the ability to stop this by using some resource destruction mechanism, Steve will not be able to do so. Because Steve will never become aware of what Bob is doing to OldSteve. Steve considers OldSteve to be himself in a relevant sense. So, according to Steve's worldview, Steve will experience a lot of very unpleasant things. But the only version of Steve that is in a position to pay resources to stop this will never know that it is happening.
So the security hole pointed out by me in my original thought experiment is still not patched. And patching this security hole would not be enough. To protect Steve, one would need to find a set of rules that preemptively patches every single security hole that one of these clever AIs could ever find.
Let's reason from the assumption that Bob's Personal Advocate (BPA) is a clever AI that will be creating Bob's Personalised Utopia. Let's now again take the perspective of ordinary human individual Steve, that gets no special treatment. I think the main question that determines Steve's safety in this scenario is how BPA adopts Steve-referring preferences. The question of what BPA wants to do to Steve seems to me to be far more important for Steve's safety than the question of what set of rules will govern Bob's Personalised Utopia and constrain the actions of BPA.
Another way to look at this is to think in terms of avoiding contradictions. And in terms of making coherent proposals. A proposal that effectively says that everyone should be given everything that they want (or effectively says that everyone's values should be respected) is not a coherent proposal. These things are necessarily defined in some form of outcome or action space. Trying to give everyone overlapping control over everything that they care about in such spaces introduces contradictions.
This can be contrasted with giving each individual influence over the adoption (by any clever AI) of those preferences that refer to her. Since this is defined in preference adoption space, it cannot guarantee that everyone will get everything that they want. But it also means that it does not imply contradictions (see this post for a discussion of these issues in the context of Membrane formalisms). Giving everyone such influence is a coherent proposal.
It also happens to be the case that if one wants to protect Steve from a far superior intellect, then preference adoption space seems to be a lot more relevant than any form of outcome or action space. Because if a superior intellect wants to hurt Steve, then one has to defeat a superior opponent in every single round of a near infinite definitional game (even under the assumption of perfect enforcement, winning every round in such a definitional game against a superior opponent seems hopeless). In other words: I don't think that the best way to approach this is to ask how one might protect Steve from a large set of clever AIs that wants to hurt Steve for a wide variety of reasons. I think a better question is to ask how one might prevent the situation where such a set of AIs wants to hurt Steve.
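To make the contrast between outcome space and preference adoption space concrete, here is a minimal sketch (the data structures and names are hypothetical; the only point is that overlapping hard constraints on outcomes can produce an empty feasible set, while per-person influence over preference adoption only removes candidate preferences and therefore cannot produce a contradiction):

```python
# Outcome space: two people state hard constraints on the same outcome variable.
outcome_constraints = {"Bob": "Steve is punished", "Steve": "Steve is not punished"}
feasible_outcomes = {"Steve is punished", "Steve is not punished"}
for constraint in outcome_constraints.values():
    feasible_outcomes &= {constraint}
print(feasible_outcomes)   # set() -- overlapping control over outcomes: contradiction

# Preference adoption space: a candidate preference is only adopted by an AI if
# every person it refers to consents. Each veto only removes candidates, so no
# two vetoes can ever conflict with each other.
candidate_prefs = [
    {"content": "hurt Steve", "refers_to": ["Steve"]},
    {"content": "give Bob a garden", "refers_to": ["Bob"]},
]
consents = {
    "Steve": lambda p: "hurt Steve" not in p["content"],
    "Bob": lambda p: True,
}
adopted = [p for p in candidate_prefs
           if all(consents[name](p) for name in p["refers_to"])]
print([p["content"] for p in adopted])   # ['give Bob a garden'] -- coherent, no contradiction
```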
My thought experiment assumed that all rules and constraints described in the text that you linked to had been successfully implemented. Perfect enforcement was assumed. This means that there is no need to get into issues such as relative optimization power (or any other enforcement related issue). The thought experiment showed that the rules described in the linked text do not actually protect Steve from a clever AI that is trying to hurt Steve (even if these rules are successfully implemented / perfectly enforced).
If we were reasoning from the assumption that some AI will try to prevent All Bad Things, then relative power issues might have been relevant. But there is nothing in the linked text that suggests that such an AI would be present (and it contains no proposal for how one might arrive at some set of definitions that would imply such an AI).
In other words: there would be many clever AIs trying to hurt people (the Advocates of various individual humans). But the text that you link to does not suggest any mechanism, that would actually protect Steve from a clever AI trying to hurt Steve.
There is a ``Misunderstands position?'' react to the following text:
The scenario where a clever AI wants to hurt a human that is only protected by a set of human constructed rules ...
In The ELYSIUM Proposal, there would in fact be many clever AIs trying to hurt individual humans (the Advocates of various individual humans). So I assume that the issue is with the protection part of this sentence. The thought experiment outlined in my comment assumes perfect enforcement (and my post that this sentence is referring to also assumes perfect enforcement). It would have been redundant, but I could have instead written:
The scenario where a clever AI wants to hurt a human that is only protected by a set of perfectly enforced human constructed rules ...
I hope that this clarifies things.
The specific security hole illustrated by the thought experiment can of course be patched. But this would not help. Patching all humanly findable security holes would also not help (it would prevent the publication of further thought experiments. But it would not protect anyone from a clever AI trying to hurt her. And in The ELYSIUM Proposal, there would in fact be many clever AIs trying to hurt people). The analogy with an AI in a box is apt here. If it is important that an AI does not leave a human constructed box (analogous to: an AI hurting Steve), then one should avoid creating a clever AI that wants to leave the box (analogous to: avoid creating a clever AI that wants to hurt Steve). In other words: Steve's real problem is that a clever AI is adopting preferences that refer to Steve, using a process that Steve has no influence over.
(Giving each individual influence over the adoption of those preferences that refer to her would not introduce contradictions. Because such influence would be defined in preference adoption space. Not in any form of action or outcome space. In The ELYSIUM Proposal however, no individual would have any influence whatsoever, over the process by which billions of clever AIs, would adopt preferences, that refer to her)
Your comment makes me think that I might have been unclear regarding what I mean by ATA. The text below is an attempt to clarify.
Summary
Not all paths to powerful autonomous AI go through methods from the current paradigm. It seems difficult to rule out the possibility that a Sovereign AI will eventually be successfully aligned to some specific alignment target. At current levels of progress on ATA this would be very dangerous (because understanding an alignment target properly is difficult, and a seemingly nice proposal can imply a very bad outcome). It is difficult to predict how long it would take to reach the level of understanding needed to prevent scenarios where a project successfully hits a bad alignment target. And there might not be a lot of time to do ATA later (for example because a tool-AI shuts down all unauthorised AI projects, but does not buy a lot of time due to internal time pressure). So a research effort should start now.
Therefore ATA is one of the current priorities. There are definitely very serious risks that ATA cannot help with (for example misaligned tool-AI projects resulting in extinction). There are also other important current priorities (such as preventing misuse). But ATA is one of the things that should be worked on now.
The next section outlines a few scenarios designed to clarify how I use the term ATA. The section after that outlines a scenario designed to show why I think that ATA work should start now.
What I mean by Alignment Target Analysis (ATA)
The basic idea of ATA is to try to figure out what would happen if a given AI project were to successfully align an autonomously acting AI Sovereign to a given alignment target. The way I use the term, there are very severe risks that cannot be reduced in any way by any level of ATA progress (including some very serious misalignment and misuse risks). But there are also risks that can and should be reduced by doing ATA now. There might not be a lot of time to do ATA later, and it is not clear how long it will take to advance to the level of understanding that will be needed. So ATA should be happening now. But let's start by clarifying the term ATA, by outlining a couple of dangerous AI projects where ATA would have nothing to say.
Consider Bill, who plans to use methods from the current paradigm to build a tool-AI. Bill plans to use this tool AI to shut down competing AI projects and then decide what to do next. ATA has nothing at all to say about this situation. Let's say that Bill's project plan would lead to a powerful misaligned AI that would cause extinction. No level of ATA progress would reduce this risk.
Consider Bob who also wants to build a tool-AI. But Bob's AI would work. If the project would go ahead, then Bob would gain a lot of power. And Bob would use that power to do some very bad things. ATA has nothing to say about this project and ATA cannot help reduce this risk.
Now let's introduce an unusual ATA scenario, just barely within the limits of what ATA can be used for (the next section will give an example of the types of scenarios that make me think that ATA should be done now. This scenario is meant to clarify what I mean by ATA). Consider Dave who wants to use methods from the current paradigm to implement PCEV. If the project plan moves forwards, then the actual result would be a powerful misaligned AI: Dave's Misaligned AI (DMAI). DMAI would not care at all what Dave is trying to do, and would cause extinction (for reasons that are unrelated to what Dave was aiming at). One way to reduce the extinction risk from DMAI would be to tell Dave that his plan would lead to DMAI. But it would also be valid to let Dave know that if his project were to successfully hit the alignment target that he is aiming for, then the outcome would be massively worse than extinction.
Dave assumes that he might succeed. So, when arguing against Dave's project, it is entirely reasonable to argue from the assumption that Dave's project will lead to PCEV. Pointing out that success would be extremely bad is a valid argument against Dave's plan, even if success is not actually possible.
You can argue against Dave's project by pointing out that the project will in fact fail. Or by pointing out that success would be very bad. Both of these strategies can be used to reduce the risk of extinction. And both strategies are cooperative (if Dave is a well meaning and reasonable person, then he would thank you for pointing out either of these aspects of his plan). While both strategies can prevent extinction in a fully cooperative way, they are also different in important ways. It might be the case that only one of these arguments is realistically findable in time. It might for example be the case that Dave is only willing to publish one part of his plan (meaning that there might not be sufficient public information to construct an argument about the other part of the plan). And even if valid arguments of both types are constructed in time, it might still be the case that Dave will only accept one of these arguments. (similar considerations are also relevant for less cooperative situations: for example if one is trying to convince a government to shut down Dave's project, or if one is trying to convince an electorate to vote no on a referendum that Dave needs to win in order to get permission to move forwards)
The audience in question (Dave, bureaucrats, voters, etc) are only considering the plan because they believe that it might result in PCEV. Therefore it is entirely valid to reason from the assumption that Dave's plan will result in PCEV (when one is arguing against the plan). There is no logical reason why such an argument would interfere with attempts to argue that Dave's plan would in fact result in DMAI.
Now let's use an analogy from the 2004 CEV document to clarify what role I see an ATA project playing. In this analogy, building an AI Sovereign is analogous to taking power in a political revolution. So (in the analogy) Dave proposes a political revolution. One way a revolution can end in disaster is that the revolution leads to a destructive civil war that the revolutionaries lose (analogous to DMAI causing extinction). Another way a revolution can end in disaster is that ISIS takes power after the government is overthrown (analogous to the outcome implied by PCEV).
It is entirely valid to say to Dave: ``if you actually do manage to overthrow the government, then ISIS will seize power'' (assuming that this conditional is true). One can do this regardless of whether or not one thinks that Dave has any real chance of overthrowing the government. (Which in turn means that one can actually say this to Dave, without spending a lot of time trying to determine the probability that the revolution will in fact overthrow the government. Which in turn means that people with wildly different views on how difficult it is to overthrow the government can cooperate while formulating such an argument)
(this argument can be made separately from an argument along the lines of: ``our far larger neighbour has a huge army and would never allow the government of our country to be overthrown. Your revolution will fail even if every single soldier in our country joins you instantly. Entirely separately: the army of our country is in fact fiercely loyal to the government and you don't have enough weapons to defeat it. In addition to these two points: you are clearly bad at strategic thinking and would be outmanoeuvred in a civil war by any semi-competent opponent''. This line of argument can also prevent a hopeless civil war. The two arguments can be made separately and there is no logical reason for them to interfere with each other)
Analysing revolutionary movements in terms of what success would mean can only help in some scenarios. It requires a non-vague description of what should happen after the government falls. In general: this type of analysis cannot reduce the probability of lost civil wars, in cases where the post revolutionary strategy is either (i): too vaguely described to analyse, or (ii): actually sound (meaning that the only problem with the revolution in question is that it has no chance of success). Conversely however: arguments based on revolutions failing to overthrow the government cannot prevent revolutions that would actually end with ISIS in charge (analogous to AI projects that would successfully hit a bad alignment target). Scenarios that end in a bad alignment target getting successfully hit are the main reason that I think that ATA should happen now (in the analogy, the main point would be to reduce the probability of ISIS gaining power). Now let's leave the revolution analogy and outline one such scenario.
A tool-AI capable of shutting down all unauthorised AI projects might not buy a lot of time
It is difficult to predict who might end up controlling a tool-AI. But one obvious compromise would be to put it under the control of some group of voters (for example a global electorate). Let's say that the tool-AI is designed such that one needs a two thirds majority in a referendum, to be allowed to launch a Sovereign AI. There exists a Sovereign AI proposal that a large majority thinks sounds nice. A small minority would however prefer a different proposal.
In order to prevent inadvertent manipulation risks, the tool-AI was designed to only discuss topics that are absolutely necessary for the process of shutting down unauthorised AI projects. Someone figures out how to make the tool-AI explain how to implement Sovereign AI proposals (and Explanation / Manipulation related definitions happen to hold for such discussions). But no one figures out how to get it to discuss any topic along the lines of ATA. The original plan was to take an extended period of time to work on ATA before implementing a Sovereign AI.
Both alignment targets use the same method for extrapolating people and for resolving disagreements. The difference is in terms of who is part of the initial group. The two proposals have different rules with respect to things like: animals, people in cryo, foetuses, artificial minds, etc. It doesn't actually matter which proposal gets implemented: the aggregation method leads to the same horrific outcome in both cases (due to an issue along the lines of the issue that PCEV suffers from. But more subtle and difficult to notice). (All proposed alignment targets along the lines of ``build an AI Sovereign that would do whatever some specific individual wants it to do'' are rejected out of hand by almost everyone).
In order to avoid making the present post political, let's say that political debates center around what to do with ecosystems. One side cares about nature and wants to protect ecosystems. The other side wants to prevent animal suffering (even if the cost of such prevention is the total destruction of every ecosystem on earth). It is widely assumed that including animals in the original group will lead to an outcome where animal suffering is prevented at the expense of ecosystems. (in order to make the following scenario more intuitive, readers that have an opinion regarding what should be done with ecosystems can imagine that the majority shares this opinion)
The majority has enough support to launch their Sovereign AI. But the minority is rapidly and steadily gaining followers due to ordinary political dynamics (sometimes attitudes on a given issue change steadily in a predictable direction). So the ability to get the preferred alignment target implemented can disappear permanently at any moment (the exact number of people that would actually vote yes in a referendum is difficult to estimate. But it is clearly shrinking rapidly). In this case the majority might act before they lose the ability to act. Part of the majority would however hesitate if the flaw with the aggregation method is noticed in time.
After the tool-AI was implemented, a large number of people started to work on ATA. There are also AI assistants that contribute to conceptual progress (they are tolerated by the tool-AI because they are not smarter than humans. And they are useful because they contribute a set of unique non-human perspectives). However, it turns out that ATA progress works sort of like math progress. It can be sped up significantly by lots of people working on it in parallel. But the main determinant of progress is how long people have been working on it. In other words: it turns out that there is a limit to how much the underlying conceptual progress can be sped up by throwing large numbers of people at ATA. So the question of whether or not the issue with the Sovereign AI proposal is noticed in time, is to a large degree determined by how long a serious ATA research project has been going on at the time that the tool-AI is launched (in other words: doing ATA now reduces the risk of a bad alignment target ending up getting successfully hit in this scenario).
(the idea is not that this exact scenario will play out as described. The point of this section was to give a detailed description of one specific scenario. For example: the world will presumably not actually be engulfed by debates about the Prime Directive from Star Trek. And a tool-AI controlled by a messy coalition of governments might lead to a time crunch due to dynamics that are more related to Realpolitik than any form of ideology. This specific scenario is just one example of a large set of similar scenarios)
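To make the parallelisation point from the scenario above concrete, here is a toy model (the functional form and the constants are pure assumptions, chosen only to illustrate the shape of the claim that ATA progress is mostly serial):

```python
import math

def toy_ata_progress(years_worked: float, researchers: int) -> float:
    """Assumed shape: calendar time dominates, headcount gives weak logarithmic returns."""
    serial_component = years_worked
    parallel_boost = 0.1 * math.log10(max(researchers, 1))
    return serial_component * (1 + parallel_boost)

print(toy_ata_progress(years_worked=2, researchers=10_000))   # ~2.8: crash programme, started late
print(toy_ata_progress(years_worked=10, researchers=20))      # ~11.3: small effort, started early
# Under this assumed shape, a small ATA effort that starts now ends up further
# along than a much larger effort that only starts after a tool-AI is in place.
```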
PS:
On a common sense level I simply don't see how one can think that it is safe to stay at our current level of ATA progress (where it is clearly not possible to reliably tell a good alignment target from an alignment target that implies an outcome massively worse than extinction). The fact that there exists no research project dedicated to improving this situation seems like a mistake. Intuitively this seems like a dangerous situation. At the very least it seems like some form of positive argument would be needed before concluding that this is safe. And it seems like such an argument should be published so that it can be checked for flaws before one starts acting based on the assumption that the current situation is safe. Please don't hesitate to contact me with theories / questions / thoughts / observations / etc regarding what people actually believe about this.