I often think about "the road to hell is paved with good intentions".[1] I'm unsure to what degree this is true, but it does seem that people trying to do good have caused more negative consequences in aggregate than one might naively expect.[2] "Power corrupts" and "power-seekers using altruism as an excuse to gain power" are two often-cited reasons for this, but I think they don't explain all of it.

A more subtle reason is that even when people are genuinely trying to do good, they're not entirely aligned with goodness. Status-seeking is a powerful motivation for almost all humans, including altruists, and we frequently award social status to people for merely trying to do good, before seeing all of the consequences of their actions. This is in some sense inevitable as there are no good alternatives. We often need to award people with social status before all of the consequences play out, both to motivate them to continue to try to do good, and to provide them with influence/power to help them accomplish their goals.

A person who consciously or subconsciously cares a lot about social status will not optimize strictly for doing good, but also for appearing to do good. One way these two motivations diverge is in how to manage risks, especially risks of causing highly negative consequences. Someone who wants to appear to do good would be motivated to hide or downplay such risks, from others and perhaps from themselves, as fully acknowledging such risks would often amount to admitting that they're not doing as much good (on expectation) as they appear to be.

How to mitigate this problem

Individually, altruists (to the extent that they endorse actually doing good) can make a habit of asking themselves and others what risks they may be overlooking, dismissing, or downplaying.[3]

Institutionally, we can rearrange organizational structures to take these individual tendencies into account, for example by creating positions dedicated to or focused on managing risk. These could be risk management officers within organizations, or people empowered to manage risk across the EA community.[4]

Socially, we can reward people/organizations for taking risks seriously, or punish (or withhold rewards from) those who fail to do so. This is tricky because due to information asymmetry, we can easily create "risk management theaters" akin to "security theater" (which come to think of it, is a type of risk management theater). But I think we should at least take notice when someone or some organization fails, in a clear and obvious way, to acknowledge risks or to do good risk management, for example not writing down a list of important risks to be mindful of and keeping it updated, or avoiding/deflecting questions about risk.[5] More optimistically, we can try to develop a culture where people and organizations are monitored and held accountable for managing risks substantively and competently.

[1] due in part to my family history

[2] Normally I'd give some examples here, but we can probably all think of some from the recent past.

[3] I try to do this myself in the comments.

[4] an idea previously discussed by Ryan Carey and William MacAskill

[5] However, see this comment.


How do we do this without falling into the Crab Bucket problem AKA Heckler's Veto, which is definitely a thing that exists and is exacerbated by these concerns in EA-land? "Don't do risky things" equivocates into "don't do things".

Some things are best avoided entirely when you take their risks into account, some become worthwhile only if you manage their risks instead of denying their existence even to yourself. But even when denying risks gives positive outcomes in expectation, adequately managing those risks is even better. Unless society harms the project for acknowledging some risks, which it occasionally does. In which case managing them without acknowledgement (which might require magic cognitive powers) is in tension with acknowledging them despite the expected damage from doing so.

Maybe the person hired needs to have good scores on a prediction market such that people trust them to be well calibrated.

In the case of damage from political movements, I think that many truly horrific things, have been done by people, that are well approximated as: ``genuinely trying to do good, and largely achieving their objectives, without major unwanted side effects'' (for example events along the lines of the Chinese Cultural Revolution, that you discuss in your older post, that you link to in your first footnote).

I think our central disagreement, might be a difference, in how we see human morality. In other words, I think that we might have different views, regarding what one should expect, from a human that is, genuinely, trying to do good, and that is succeeding. I'm specifically talking about one particular aspect of morality, that has been common in many different times and places, throughout human history. It is sometimes expressed in theological terms, along the lines of ``heretics deserve eternal torture in hell''. The issue is not the various people, that have come up with various stories, along the lines of ``hell exists''. Humans are always coming up with stories about ``what is really going on''. There was, a lot, to choose from. The issue is the large number of people, in many different cultures, throughout human history, that have heard stories, that assume a morality, that fits well with normative statements along the lines of ``heretics deserve eternal torture in hell''. And they have thought: ``this story feels right. This is the way that the world, should, work. On questions of morality, it feels right, to defer to the one, who set this up''. These types of stories, are not the only types of stories, that humans have found intuitive. But they are common. The specific aspect of human morality, that I am referring to, is just one aspect, out of many. But it is an important, and common, aspect. Many people, that are trying to do good, are not driven by anything even remotely like this specific aspect of morality. But some are. And I think that such people, have done some truly horrific things.

In other words: Given that this is one standard aspect of human morality, why would anyone be surprised, when the result of a ``person trying to be genuinely good (in a way that does not involve anything, along the lines of status or power seeking), and succeeding'', leads to extreme horror? Side effects along the lines of innocents getting hurt, or economic chaos, are presumably unwanted side effects, for these types of political movements. But why would one expect them to be seen as major issues, by someone that is, genuinely, trying to do good? Why would anyone be surprised, to learn that these side effects, were seen as perfectly reasonable, and acceptable, costs to pay, for enforcing moral purity? In the specific event that you refer to in the post, that you link to in your first footnote (the Chinese Cultural Revolution), there were extraordinary levels of economic chaos, suffering, and a very large number of dead innocents. So, maybe these extraordinary levels of general disruption and destruction, would have been enough to discourage the movement, if they had been predicted. Alternatively, maybe the only thing driving this event, was something along the lines of ``seeking-power-for-the-sake-of-power''. But maybe not. Maybe they would have (even if they had predicted the outcome), concluded that enforcing moral purity, was more important (enforcing moral purity, on a large number of reluctant people, is not possible without power. So, power seeking behaviour, is not decisive evidence against this interpretation). Humans doing good, and succeeding, are simply not safe, for other humans (even under the assumption, that they would have proceeded, if the side effects had been predicted. And assuming that there is nothing along the lines of ``status seeking'', or ``corrupted by power'', going on). They are not safe, because their preferred outcome, is not safe, for other humans.
So, I think that there is an important piece missing from your analysis: the damage done, by humans that, genuinely, try to do good (humans that, genuinely, do not seek power, or ``status'', or anything similar. Humans whose actions are morally pure, according to their morality). And who succeed, without causing any deal-breaking side effects. (I know that you have written elsewhere, about humans not being safe for other humans. I know that you have said that Morality is Scary. But I think that an important aspect of this issue, is still missing. I could obviously be completely wrong about this, but if I had to guess, I would say that it is likely, that our disagreement, follows from the fact, that you do not consider: ``safety issues coming from humans'', as being strongly connected to: ``humans genuinely trying to do good, and succeeding'')

More generally, this implies that human morality is not safe, for other humans. If it was, then those sentiments, that are sometimes expressed in theological terms, along the lines of ``heretics deserve eternal torture in hell'', would not keep popping up, throughout human history. A human that genuinely tries to do good, and that succeeds, is a very dangerous thing, for other humans. This actually has important AI safety implications. For example: this common aspect of human morality, implies a serious s-risk, if someone is ever able to successfully implement CEV. See for example my post:

A problem with the most recently published version of CEV 

(the advice in your post sounds good to me, if you assume that you are exclusively interacting with people, that share your values (and that, in addition to this, are also genuinely trying to do good). My comment is about events, along the lines of the Chinese Cultural Revolution (which involved people with values that, presumably, differ greatly from the values of essentially all readers of this blog). My comment is not about people who share your values, and try to follow them (but that might be, subconsciously, trying to also achieve other things, such as ``status''). For people like this, your analysis sounds very reasonable to me. But I think that if one looks at history, a lot of: ``damage from people trying to do good'', comes from people that are not well approximated, as trying to do good, while ``being corrupted by power'', or ``subconsciously seeking status'', or anything along those lines)

I wrote a post expressing similar sentiments but perhaps with a different slant. To me, apparent human morality along the lines of "heretics deserve eternal torture in hell" or what was expressed during the Chinese Cultural Revolution are themselves largely a product of status games, and there's a big chance that these apparent values do not represent people's true values and instead represent some kind of error (but I'm not sure and would not want to rely on this being true). See also Six Plausible Meta-Ethical Alternatives for some relevant background.

But you're right that the focus of my post here is on people who endorse altruistic values that seem more reasonable to me, like EAs, and maybe earlier (pre-1949) Chinese supporters of communism who were mostly just trying to build a modern nation with a good economy and good governance, but didn't take seriously enough the risk that their plan would backfire catastrophically.

I don't think that they are all status games. If so, then why did people (for example) include long meditations, regarding whether or not, they personally, deserve to go to hell, in private diaries? While they were focusing on the ``who is a heretic?'' question, it seems that they were taking for granted, the normative position: ``if someone is a heretic, then she deserves eternal torture in hell''. But, on the other hand, private diaries are of course sometimes opened, while the people that wrote them are still alive (this is not the most obvious thing, that someone would like others to read, in a stolen diary. But people are not easy to interpret, especially across centuries of distance. Maybe for some people, someone else stealing their diary, and reading such meditations, would be awesome). And people are not perfect liars, so maybe the act of making such entries is, mostly, an effective way, of getting into an emotional state, such that one seems genuine, when expressing remorse to other people? So, maybe any reasonable way of extrapolating a diarist like this, will lead to a mind, that finds the idea of hell, abhorrent. There is a lot of uncertainty here. There is probably also a very, very large diversity, among the set of humans that have adopted a normative position, along these lines (and not just in terms of terminology, and in terms of who counts as a heretic. Also in terms of what it is, that was lying underneath, the adoption of such normative positions. It would not be very surprising, if a given extrapolation procedure, leads to different outcomes, for two individuals, that sound very similar). As long as we agree that any AI design, must be robust to the possibility, that people mean what they say, then perhaps these issues are not critical to resolve (but, on the other hand, maybe digging into this some more, will lead to genuinely important insights).
(I agree that there were probably a great number of people, especially early on, that were trying to achieve things that most people today would find reasonable, but whose actions contributed to destructive movements. Such issues are probably a lot more problematic in politics, than in the case where an AI is getting its goal from a set of humans) (none of my reasoning here is done, with EAs in mind)

I think that there exists a deeper problem, for the proposition, that perhaps it is possible to find some version of CEV, that is actually safe for human individuals (as opposed to the much easier task, of finding a version of CEV, such that no one is able to outline a thought experiment, before launch time, that shows, why this specific version, would lead to an outcome, that is far, far, worse than extinction). Specifically, I'm referring to the fact that ``heretics deserve eternal torture in hell'' style fanatics (F1), is just one very specific example, of a group of humans, that might be granted extreme influence, over CEV. In a population of billions, there will exist a very, very large number of ``never-explicitly-considered'' types of minds. Consider for example a different, tiny, group of Fanatics (F2), who (after being extrapolated) has a very strong ``all or nothing'' attitude, and a sacred rule against negotiations (let's explore what happens in the case, where this attitude is related to a religion, and where one in a thousand humans, will be part of F2). Unless negotiations deadlock in a very specific way, PCEV will grant F2, exactly zero direct influence. However, let's explore what happens, if another version of CEV is launched, that first maps each individual to a Utility function, and then maximise the Sum of those functions (USCEV). During the process, where a member of this religion, that we can call Gregg, ``becomes the person that Gregg wants to be'', the driving aspect of Gregg's personality, is a burning desire to become a true believer, and become morally pure. This includes, becoming the type of person, that would never break the sacred set of rules: ``Never accept any compromise, regarding what the world should look like! Never negotiate with heretics! Always take whatever action, is most likely to result in the world being organised, exactly as is described in the sacred texts!''. 
So, the only reasonable way to map, extrapolated Gregg, to a utility function, is to assign maximum utility to the Outcome demanded by the Sacred Texts (OST), and minimum utility, to every other outcome. Besides the number of people in F2, the bound on how bad OST can be (from the perspective of the non believers), and still be the implemented outcome, is that USCEV, must be able to think up something that is far, far, worse (technically, the minimum is not actually the worst possible outcome, but instead the worst outcome that USCEV can think up, for each specific non-believer). As long as there is a very large difference, between OST, and the worst thing that USCEV can think up, then OST will be the selected outcome. Maybe OST will look ok, to a non-superintelligent observer. For example, OST could look like a universe where every currently existing human individual, after an extended period of USCEV guided self reflection, converges on the same belief system (and all subsequent children, are then brought up in this belief system). Or, maybe it will be overtly bad, with everyone forced to convert or die. Or maybe it will be a genuine s-risk, for example along the lines of LP.
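To make the arithmetic behind this concrete, here is a toy sketch of a "sum of utility functions" rule. It is not a description of any actual CEV proposal. It assumes (my assumption, not something stated above) that each person's utilities are rescaled so that their worst outcome scores 0 and their best scores 1. The outcome names, the population of 999 non-believers plus one believer, and every number are hypothetical.

```python
from typing import Dict, List

Scores = Dict[str, float]

def normalise(raw: Scores) -> Scores:
    """Rescale one person's scores so their worst outcome is 0 and their best is 1.
    This normalisation is an assumption of the sketch."""
    lo, hi = min(raw.values()), max(raw.values())
    return {outcome: (value - lo) / (hi - lo) for outcome, value in raw.items()}

def sum_maximiser(population: List[Scores]) -> str:
    """Return the outcome with the highest sum of normalised utilities."""
    normalised = [normalise(person) for person in population]
    totals = {o: sum(person[o] for person in normalised) for o in normalised[0]}
    return max(totals, key=totals.get)

def build_population(worst_imaginable: float) -> List[Scores]:
    """999 non-believers plus one all-or-nothing believer (hypothetical numbers)."""
    non_believer = {"status_quo": 100.0, "OST": 50.0, "worst": worst_imaginable}
    # The believer gives maximum utility to OST and minimum to everything else.
    believer = {"status_quo": 0.0, "OST": 1.0, "worst": 0.0}
    return [non_believer] * 999 + [believer]

mild_floor = sum_maximiser(build_population(worst_imaginable=-100.0))
deep_floor = sum_maximiser(build_population(worst_imaginable=-1_000_000.0))
print(mild_floor, deep_floor)
```

Under these made-up numbers, the rule picks "status_quo" when the worst imaginable outcome is only mildly bad, and "OST" once the worst imaginable outcome is far worse. The deeper the floor, the smaller the normalised cost of OST to each non-believer, so a single all-or-nothing utility function can tip the sum.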

As far as I can tell, CEV in general, and PCEV in particular, is, still, the current state of the art, in terms of finding an answer to the ``what alignment target, should be aimed at?'' question (and CEV has been the state of the art now, for almost two decades). I find this state of affairs strange, and deeply problematic. I'm confused by the relatively low interest, in efforts to make further progress on the ``what alignment target, should be aimed at?'' question (I think that, for example, the explanation, in the original CEV document, from 2004, was a very good explanation, for why this question matters. And I don't think that it is a coincidence, that the specific analogy used, to make that point, was a political revolution (a brief paraphrasing: such a revolution must (i): succeed, and also (ii): lead to a new government, that is actually a good government. Similarly, an AI must (i): hit an alignment target, and also (ii): this alignment target, must be a good thing to hit)). Maybe I shouldn't be surprised by this relative lack of interest. Maybe humans are just not great, in general, at reacting to ``AI danger''. But it still feels like I'm not seeing, I don't know, ... something (wild speculation by anyone that, at any point, happens to stumble upon this comment, regarding what this ... something ... might be, is very welcome. Either in a comment, or in a DM, or in an email).

There just may be systematic overvaluation of what people say instead of what they do, by practically everyone.

For the average person, who is far from producing genuinely original ideas, insights, or arguments, everything they say throughout their entire life might realistically be worth less to a random passing reader than a fancy dinner.

Conversely, taking a bit of actual effort to buy said reader a fancy dinner probably more than doubles that value, at least in the eyes of the person getting to eat it.

Of course the opposite pretence needs to be maintained very often in normal day to day life, and after enough times, folks start genuinely believing the opposite.  That just the mere prospect of losing a minor verbal status game implies that they must fanatically counter-signal. 

(i.e. the needle of their perception gets harder and harder to move over time)

Which would explain the observed phenomena throughout history.

If I had to summarize your argument, it would be something like, "Many people's highest moral good involves making their ideological enemies suffer." This is indeed a thing that happens, historically.

But another huge amount of damage is caused by people who believe things like "the ends justify the means" or "you can't make an omelette without breaking a few eggs." Or "We only need 1 million surviving Afghanis [out of 15 million] to build a paradise for the proletariat," to paraphrase an alleged historical statement I read once. The people who say things like this cause immediate, concrete harm. They attempt to justify this harm as being outweighed by the expected future value of their actions. But that expected future value is often theoretical, and based on dubious models of the world.

I do suspect that a significant portion of the suffering in the world is created by people who think like this. Combine them with the people you describe whose conception of "the good" actually involves many people suffering (and with people who don't really care about acting morally at all), and I think you account for much of the human-caused suffering in the world.

One good piece of advice I heard from someone in the rationalist community was something like, "When you describe your proposed course of action, do you sound like a monologuing villain from a children's TV show, someone who can only be defeated by the powers of friendship and heroic teamwork? If so, you would be wise to step back and reconsider the process by which you arrived at your plans."

I agree that ``the ends justify the means'' type thinking has led to a lot of suffering. For this, I would like to switch from the Chinese Cultural Revolution, to the French Revolution, as an example (I know it better, and I think it fits better, for discussions of this attitude). So, someone wants to achieve something, that is today seen as a very reasonable goal, such as ``end serfdom and establish formal equality before the law''. So, basically: their goals are positive, and they achieve these goals. But perhaps they could have achieved those goals, with fewer side effects, if it was not for their ``the ends justify the means'' attitude. Serfdom did end, and this change was both lasting, and spreading. After things had calmed down, the new economic relations, led to dramatically better material conditions, for the former serfs (and, for example, a dramatic increase in life expectancy, due to a dramatic reduction in poverty related malnutrition). But, during the revolutionary wars (and especially the Napoleonic wars that followed), millions died. It sounds intuitively likely, that there would have been less destruction, if attitudes along these lines were less common.

So, yes, even when an event has such a large, and lasting, positive impact, that it is still celebrated, centuries later (14th of July is still a very big thing in France), one might find that this attitude caused concrete harm (millions of dead people, must certainly qualify as ``concrete harm''. And the French Revolution must certainly be classified as a celebrated event in any sense of that word (including, but not limited to, the literal: ``fireworks and party'' sense)).

And you are entirely correct, that damage from this type of attitude, was missing from my analysis.

This seems mostly goodharting, how the tails come apart when optimizing or selecting for a proxy rather than for what you actually want. And people don't all want the same thing without disagreement or value drift. Near term practical solution is not optimizing too hard and building an archipelago with membranes between people and between communities that bound the scope of stronger optimization. Being corrigible about everything might also be crucial. Longer term idealized solution is something like CEV, saying in a more principled and precise way what the practical solutions only gesture at, and executing on that vision at scale. This needs to be articulated with caution, as it's easy to stray into something that is obviously a proxy and very hazardous to strongly optimize.
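For readers who want to see the "selecting for a proxy" point in numbers, here is a minimal simulation sketch. It assumes a deliberately simple model that is not from the comment above: each option has a true value drawn from a standard normal distribution, and the proxy is that true value plus independent noise of the same size. The function name and all parameters are hypothetical.

```python
import random

def mean_true_value_of_top(items, k, by_proxy):
    """Pick the k highest items by proxy (or by true value) and average their true value."""
    index = 1 if by_proxy else 0
    chosen = sorted(items, key=lambda item: item[index], reverse=True)[:k]
    return sum(true_value for true_value, _ in chosen) / k

rng = random.Random(0)  # fixed seed so the run is repeatable
items = []
for _ in range(10_000):
    true_value = rng.gauss(0.0, 1.0)
    proxy = true_value + rng.gauss(0.0, 1.0)  # assumed model: proxy = truth + noise
    items.append((true_value, proxy))

for k in (5_000, 500, 10):  # smaller k means harder selection
    via_proxy = mean_true_value_of_top(items, k, by_proxy=True)
    via_truth = mean_true_value_of_top(items, k, by_proxy=False)
    print(f"top {k:>5}: picked by proxy {via_proxy:.2f}, picked by truth {via_truth:.2f}")
```

Selecting by the proxy can never beat selecting by the true value on this measure. In this kind of model one would expect the shortfall to grow as selection gets harsher, which is one way to read the suggestion of not optimizing too hard.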

I'm not sure that I agree with this. I think it mostly depends on what you mean by: ``something like CEV''. All versions of CEV are describable as ``doing what a Group wants''. It is inherent in the core concept of building an AI, that is ``Implementing the Coherent Extrapolated Volition of Humanity''. This rules out proposals, where each individual, is given meaningful influence, regarding the adoption, of those preferences, that refer to her. For example as in MPCEV (described in the post that I linked to above). I don't see how an AI can be safe, for individuals, without such influence. Would you say that MPCEV counts as ``something like CEV''?

If so, then I would say that it is possible, that ``something like CEV'', might be a good, long term solution. But I don't see how one can be certain about this. How certain are you, that this is in fact a good idea, for a long term solution?

Also, how certain are you, that the full plan that you describe (including short term solutions, etc), is actually a good idea?

The issue with proxies for an objective is that they are similar to it. So an attempt to approximately describe the objective (such as an attempt to say what CEV is) can easily arrive at a proxy that has glaring goodharting issues. Corrigibility is one way of articulating a process that fixes this, optimization shouldn't outpace accuracy of the proxy, which could be improving over time.

Volition of humanity doesn't obviously put the values of the group before values of each individual, as we might put boundaries between individuals and between smaller groups of individuals, with each individual or smaller group having greater influence and applying their values more strongly within their own boundaries. There is then no strong optimization from values of the group, compared to optimization from values of individuals. This is a simplistic sketch of how this could work in a much more elaborate form (where the boundaries of influence are more metaphorical), but it grounds this issue in more familiar ideas like private property, homes, or countries.

I think that my other comment to this, will hopefully be sufficient, to outline what my position actually is. But perhaps a more constructive way forwards, would be to ask how certain you are, that CEV is in fact, the right thing to aim at? That is, how certain are you, that this situation is not symmetrical, to the case where Bob thinks that: ``a Suffering Reducing AI (SRAI), is the objectively correct thing to aim at''? Bob will diagnose any problem, with any specific SRAI proposal, as arising from proxy issues, related to the fact that Bob is not able to perfectly define ``Suffering'', and must always rely on a proxy (those proxy issues exist. But they are not the most serious issue, with Bob's SRAI project).

I don't think that we should let Bob proceed with an AI project, that aims to find the correct description of ``what SRAI is'', even if he is being very careful, and is trying to implement a safety measure (that will, while it continues to work as intended, prevent SRAI from killing everyone). Because those safety features might fail, regardless of whether or not someone has pointed out a critical flaw in them, before the project reaches the point of no return (this conclusion is not related to Corrigibility. I would reach the exact same conclusion, if Bob's SRAI project, was using any other safety measure). For the exact same reason, I simply do not think, that it is a good idea, to proceed with your proposed CEV project (as I understand that project). I think that doing so, would represent a very serious s-risk. At best, it will fail in a safe way, for predictable reasons. How confident are you, that I am completely wrong about this?

Finally, I should note, that I still don't understand your terminology. And I don't think that I will, until you specify what you mean with ``something like CEV''. My current comments, are responding to my best guess, of what you mean (which is, that MPCEV, from my linked to post, would not count as ``something like CEV'', in your terminology). (Does an Orca count as: ``something like a shark''? If it is very important, that some water tank is free of fish, then it is difficult for me to discuss Dave's ``let's put something like a shark, in that water tank'' project, until I have an answer to my Orca question.)

(I assume that this is obvious, but just to be completely sure that this is clear, it probably makes sense to note explicitly that I, very much, appreciate that you are engaging on this topic)

Metaphorically, there is a question CEV tries to answer, and by "something like CEV" I meant any provisional answer to the appropriate question (so that CEV-as-currently-stated is an example of such an answer). Formulating an actionable answer is not a project humans would be ready to work on directly any time soon. So CEV is something to aim at by intention that defines CEV. If it's not something to aim at, then it's not a properly constructed CEV.

This lack of a concrete formulation is the reason goodharting and corrigibility seem salient in operationalizing the process of formulating it and making use of the formulation-so-far. Any provisional formulation of an alignment target (such as CEV-as-currently-stated) would be a proxy, and so any optimization according to such proxy should be wary of goodharting and be corrigible to further refinement.

The point of discussion of boundaries was in response to possible intuition that expected utility maximization tends to make its demands with great uniformity, with everything optimized in the same direction. Instead, a single goal may ask for different things to happen in different places, or to different people. It's a more reasonable illustration of goal aggregation than utilitarianism that sums over measures of value from different people or things.

The version of CEV, that is described on the page that your CEV link leads to, is PCEV. The acronym PCEV was introduced by me. So this acronym does not appear on that page. But that's PCEV that you link to. (in other words: the proposed design, that would lead to the LP outcome, can not be dismissed as some obscure version of CEV. It is the version that your own CEV link leads to. I am aware of the fact, that you are viewing PCEV as: ``a proxy for something else'' / ``a provisional attempt to describe what CEV is''. But this fact still seemed noteworthy)

On terminology: If you are in fact using ``CEV'' as a shorthand, for ``an AI that implements the CEV of a single human designer'', then I think that you should be explicit about this. After thinking about this, I have decided that without explicit confirmation that this is in fact your intended usage, I will proceed as if you are using CEV as a shorthand, for ``an AI that implements the Coherent Extrapolated Volition of Humanity'' (but I would be perfectly happy to switch terminology, if I get such confirmation). (another reading of your text, is that: ``CEV'' (or: ``something like CEV'') is simply a label that you attach, to any good answer, to the correct phrasing of the ``what alignment target should be aimed at?'' question. That might actually be a sort of useful shorthand. In that case I would, somewhat oddly, have to phrase my claim as: under no reasonable set of definitions, does the Coherent Extrapolated Volition of Humanity, deserve the label ``CEV'' / ``something like CEV''. Due to the chosen label(s), the statement looks odd. But there is no more logical tension in the above statement, than there is logical tension in the following statement: ``under no reasonable set of definitions, does the Coherent Extrapolated Volition of Steve, result in the survival of any of Steve's cells'' (which is presumably a true statement for at least some human individuals). Until I hear otherwise, I will however stay with the terminology, where ``CEV'' is shorthand for ``an AI that implements the Coherent Extrapolated Volition of Humanity'', or ``an AI that is helping humanity'', or something less precise, that is still hinting at something along those lines)

It probably makes sense to clarify my own terminology some more. I think this can be done by noting, that I think that CEV, sounds like a perfectly reasonable way of helping ``a Group'' (including the PCEV version that you link to, and that implies the LP outcome). I just don't think that helping ``a Group'' (that is made up of human individuals) is good for the (human) individuals that make up that ``Group'' (in expectation). Pointing a specific version of CEV (including PCEV) at a set of individuals, might be great for some other type of individuals. Let's consider a large number of ``insatiable, Clippy like maximisers''. Each of them cares exclusively about the creation of a different, specific, complex object. No instances of any of these very complex objects will ever exist, unless someone looks at the exact specification of a given individual, and uses this specification to create such objects. In this case PCEV might, from the perspective of each of those individuals, be the best thing that can happen (if special influence is off the table). It is also worth noting, that a given human individual might get what she wants, if some specific version of CEV is implemented. But CEV, or ``helping humanity'', is not good, for human individuals, in expectation, compared to extinction. And why would it be? Groups and human individuals are completely different types of things. And a human individual is very vulnerable to a powerful AI, that wants to hurt her. And humanity certainly looks like it contains an awful lot of ``will to hurt'', specifically directed at existing human individuals.

If I zoom out a bit, I would describe the project of ``trying to describe what CEV is'' / ``trying to build an AI that helps humanity'' as: A project that searches for an AI design that helps an arbitrarily defined abstract entity. But this same project is, in practice, evaluating specific proposed AI designs, based on how they interact with a completely different type of thing: human individuals. You are for example presumably discarding PCEV, because the LP outcome implied by PCEV contains a lot of suffering individuals (when PCEV is pointed at billions of humans). It is however not obvious to me why LP would be a bad way of helping an arbitrarily defined abstract entity (especially considering that the negotiation rules of PCEV simultaneously (i): imply LP, and (ii): are an important part of the set of definitions that is needed to differentiate the specific abstract entity that is to be helped, from the rest of the vast space of entities, that a mapping from billions-of-humans to the ``class-of-entities-that-can-be-said-to-want-things'', can point to). Thus, I suspect that PCEV is not actually being discarded due to being bad at helping an abstract entity (my guess is that PCEV is actually being discarded because LP is bad for human individuals).

I think that one reasonable way of moving past this situation, is to switch perspective. Specifically: adopt the perspective of a single human individual, in a population of billions, and ask: ``without giving her any special treatment, compared to other existing humans, what type of AI, would want to help her''. And then try to answer this question, while making as few assumptions about her as possible (for example making sure that there is no implicit assumption, regarding whether she is ``selfish or selfless'', or anything along those lines. Both ``selfless and selfish'' human individuals, would strongly prefer to avoid being a Heretic in LP. Thus, discarding PCEV does not contain an implicit assumption related to the ``selfish or selfless'' issue. Discarding PCEV, does however, involve an assumption, that human individuals are not like the ``insatiable Clippy maximisers'' mentioned above. So, such maximisers might justifiably feel ignored, when we discard PCEV. But no one can justifiably feel ignored when we discard PCEV, on account of where she is on the ``selfish or selfless'' spectrum). When one adopts this perspective, it becomes obvious to suggest that, the initial dynamic, should grant this individual meaningful influence, regarding the adoption of those preferences, that refer to her. Making sure that such influence, is included as a core aspect of the initial dynamic, is made even more important, by the fact, that the designers will be unable to consider all implications of a given project, and will be forced to rely on, potentially flawed, safety measures (for example along the lines of a ``Last Judge'' off switch, which might fail to trigger. Combined with a learned DWIKIM layer, that might turn out to be very literal, when interpreting some specific class of statements). If such influence is included, in the initial dynamic, then the resulting AI is no longer describable as ``doing what a Group wants it to do''. 
Thus, the resulting AI can not be described as a version of CEV. (it might however be describable as ``something like CEV''. Sort of how one can describe an Orca as ``something like a shark'', despite the fact that an Orca is not a type of shark (or a type of a fish). I would guess, that you would say, that an AI that grants such influence, as part of the initial dynamic, is not ``something like CEV''. But I'm not sure about this)

(I should have added ``,in the initial dynamic,'' to the text in my earlier comments. It is explicit in the description of MPCEV, but I should have added this phrase to my comments here too. As a tangent, I agree that the intuition that you were trying to counter, with your Boundaries / Membrane mention, is probably both common and importantly wrong. Countering this intuition makes sense, and I should have read this part of your comment more carefully. I would however like to note that the description of the LP outcome, in the PCEV thought experiment, actually contains billions of (presumably very different) localities. Each locality is optimised according to very different criteria. Each place is designed to hurt a specific individual human Heretic. And each such location is additionally bound by its own unique ``comprehension constraint'', that refers to the specific individual Heretic being punished in that specific location)

Perhaps a more straightforward way to move this discussion along is to ask a direct question, regarding what you would do if you were in the position that I believe that I find myself in. In other words: a well intentioned designer called John wants to use PCEV as the alignment target for his project (rejecting any other version of CEV out of hand, by saying: ``if that is indeed a good idea, then it will be the outcome of Parliamentary Negotiations''). When someone points out that PCEV is a bad alignment target, John responds by saying that PCEV cannot, by definition, be a bad alignment target. John claims that any thought experiment where PCEV leads to a bad outcome must be due to a bad extrapolation of human individuals. John says that any given ``PCEV with a specific extrapolation procedure'' is just an attempt to describe what PCEV is. If aiming at a given ``PCEV with a specific extrapolation procedure'' is a bad idea, then it is a badly constructed PCEV. Aiming at PCEV is a good idea, by the intention that defines PCEV. John further says that his project will include features that (if they are implemented successfully, and are not built on top of any problematic unexamined implicit assumption) will let John try again, if a given attempt to ``say what PCEV is'' fails. Do you agree that this project is a bad idea? (compared to achievable alternatives, that start with a different set of, findable, assumptions) If so, what would you say to John? (what you are proposing is different from what John is proposing. I predict that you will say that John is making a mistake. My point is that, to me, it looks like you are making a mistake of the same type as John's mistake. So, I wonder what you would say to John (your behaviour in this exchange is not the same as John's behaviour in this thought experiment. But it looks to me like you are making the same class of mistake as John.
So, I'm not asking how you would ``act in a debate, as a response to John's behaviour''. Instead, I'm curious about how you would explain to John that he is making an object level mistake))

Or maybe a better approach is to go less meta, and get into some technical details. So, let's use the terminology in your CEV link, to explore some of the technical details in that post. What do you think would happen, if the learning algorithm that outputs the DWIKIM layer in John's PCEV project is built on top of an unexamined implicit assumption that turns out to be wrong? Let's say that the DWIKIM layer that pops out interprets the request to build PCEV as a request to implement the straightforward interpretation of PCEV. The DWIKIM layer happens to be very literal, when presented with the specific phrasing used in the request. In other words: it interprets John as requesting something along the lines of LP. I think this might result in an outcome along the lines of LP (if the problems with the DWIKIM layer stem from a problematic unexamined implicit assumption, related to extrapolation, then the exact same problematic assumption might also render something along the lines of a ``Last Judge off switch add on'' ineffective). I think that it would be better if John had aimed at something that does not suffer from known, avoidable, s-risks. Something whose straightforward interpretation is not known to imply an outcome that would be far, far, worse than extinction. For the same reason, I make the further claim that I do not think that it is a good idea to subject everyone to the known, avoidable, s-risks associated with any AI that is describable as ``doing what a Group wants'' (which includes all versions of CEV). Again, I'm certainly not against some feature that, might, let you try again, or that, might, re interpret an unsafe request as a request for something completely different, that happens to be safe (such as, for example, a learned DWIKIM layer).
I am aware of the fact, that you do not have absolute faith in the DWIKIM layer (if this layer is perfectly safe, in the sense of reliably re interpreting requests that straightforwardly imply LP, as something desirable to the designer. Then the full architecture would be functionally identical, to an AI, that simply does, whatever the designer wants the AI to do. In that case, you would not care what the request was. You might then, just as well have the designer ask the DWIKIM layer, for an AI, that maximises the number of bilberries. So, I am definitely not implying, that you are unaware, of the fact that the DWIKIM layer, is unable to provide reliable safety).

Zooming out a bit, it is worth noting that the details of the safety measure(s) are actually not very relevant to the points that I am trying to make here. Any conceivable, human implemented, safety measure might fail. And, more importantly, these measures do not help much when one is deciding what to aim at. For example: MPCEV can also be built on top of a (potentially flawed) DWIKIM layer, in the exact same way as you can build CEV on top of a DWIKIM layer (and you can stick a ``Last Judge off switch add on'' to MPCEV too. Etc, etc, etc). Or in yet other words: anything along the lines of a ``Last Judge off switch add on'' can be used by many different projects aiming at many different targets. Thus, the ``Last Judge'' idea, or any other idea along those lines (including a DWIKIM layer), provides very limited help when one decides what to aim at. And even more generally: regardless of what safety measure is used, John is, still, subjecting everyone to an unnecessary, avoidable, s-risk. I hope we can agree that John should not do that with, any, version of ``PCEV with a specific extrapolation procedure''. The further claim that I am making is that no one should do that with, any, ``Group AI'', for similar reasons. Surely, discovering that this further claim is true cannot be, by definition, impossible.

While re reading our exchange, I realised that I never actually clarified, that my primary reason for participating in this exchange (and my primary reason for publishing things on LW), is not actually to stop CEV projects. However, I think that a reasonable person might, based on my comments here, come to believe that my primary goal is to stop CEV projects (which is why the present clarification is needed). My focus is actually on trying to make progress on the ``what alignment target should be aimed at?'' question. In the present exchange, my target is the idea, that this question has already been given an answer (and, specifically, that the answer is CEV). The first step to progress, on the ``what alignment target should be aimed at?'' question, is to show that this question does not currently have an answer. This is importantly different, from saying that: ``CEV is the answer, but the details are unknown'' (I think that such a statement is importantly wrong. And I also think, that the fact that people still believe things along these lines, is standing in the way of getting a project off the ground, that is devoted to making progress on the ``what alignment target should be aimed at?'' question).

I think that it is very unlikely, that the relevant people will stay committed to CEV, until the technology arrives, that would make it possible for them to hit CEV as an alignment target (the reason I find this unlikely, is that, (i): I believe that I have outlined a sufficient argument, to show that CEV is a bad idea, and (ii): I think that such technology will take time to arrive, and (iii): it seems likely that this team of designers, who are by assumption capable of hitting CEV, will be both careful enough to read that argument before reaching the point of no return on their CEV launch, and also capable enough to understand it. Thus, since the argument against CEV already exists, in my estimate, it would not make sense to focus on s-risks, related to a successfully implemented CEV). If that unlikely day ever does arrive, then I might switch focus, to trying to prevent direct CEV related s-risk, by arguing against this imminent CEV project. But I don't expect to ever see this happening.

The set of paths that I am actually focused on reducing the probability of, can be hinted at by outlining the following specific scenario. Imagine a well intentioned designer that we can call Dave, who is aiming for Currently Unknown Alignment Target X (CUATX). Due to an unexamined implicit assumption, that CUATX is built on top of, turning out to be wrong in a critical way, CUATX implies an outcome, along the lines of LP. But the issue that CUATX suffers from, is far more subtle than the issue that CEV suffers from. And progress on the ``what alignment target should be aimed at?'' question, has not yet progressed to the point, where this problematic unexamined implicit assumption can be seen. CUATX has all the features, that are known at launch time, to be necessary for safety (such as the necessary, but very much not sufficient, feature that any safe AI must give each individual, meaningful influence, regarding the adoption of those preferences, that refer to her). Thus, the CUATX idea leads to a CUATX project, which in turn leads to an, avoidable, outcome along the lines of LP (after some set of human implemented safety measures fail). That is the type of scenario that I am trying to avoid (by trying to make sufficient progress on the ``what alignment target should be aimed at?'' question, in time). My real ``opponent in this debate'' is an implemented CUATX, not the idea of CEV (and very definitely not you. Or anyone else that has contributed, or is likely to contribute, valuable insights related to the ``what alignment target should be aimed at?'' question). It just happens to be the case, that the effort to prevent CUATX, that I am trying to get off the ground, starts by showing that CEV, is not an answer, to the ``what alignment target should be aimed at?'' question. And you just happen to be the only person, that is pushing back against this in public (and again: I really appreciate the fact that you chose to engage on this topic).

(I should also note explicitly that I am most definitely not against exploring safety measures. They might stop CUATX. In some plausible scenarios, they might be the only realistic thing that can stop CUATX. And I am not against treaties. And I am open to hearing more about the various human augmentation proposals that have been going around for many years. I am simply noting that a safety measure, regardless of how clever it sounds, simply cannot fill the function of a substitute for progress on the ``what alignment target should be aimed at?'' question. An attempt to get people to agree to a treaty might fail. Or a successfully implemented treaty might fail to actually prevent a race dynamic for long enough. And similarly, augmented humans might systematically tend towards being: (i): superior at alignment, (ii): superior at persuasion, (iii): well intentioned, and (iv): not better at dealing with the ``what alignment target should be aimed at?'' question, than the best baseline humans (but still, presumably, capable of understanding an insight on this question, at least if that insight is well explained). Regardless of augmentation technique, selection for ``technical ability and persuasion ability'' seems like a far more likely, de facto, outcome to me, due to being far easier to measure. I expect it to be far more difficult to measure the ability to deal with the ``what alignment target should be aimed at?'' question (and it is not obvious that the abilities needed to deal with the ``what alignment target should be aimed at?'' question will be strongly correlated with the thing that I think will, de facto, have driven the trial and error augmentation process of the augments that eventually hit an alignment target: ``technical-ability-and-persuasion-ability-and-ability-to-get-things-done'').
Maybe the first augment will be great at making progress on the ``what alignment target should be aimed at?'' question, and will quickly render all previous work on this question irrelevant (and in that case, the persuasion ability is probably good for safety). But assuming that this will happen, seems like a very unsafe bet to make. Even more generally: I simply do not think that it is possible to come up with any type of clever sounding trick, that makes it safe to skip the ``what alignment target should be aimed at?'' question (to me, the ``revolution-analogy-argument'', in the 2004 CEV text, looks like a sufficient argument for the conclusion, that it is important to make progress on the ``what alignment target should be aimed at?'' question. But it seems like many people do not consider this, to be a sufficient argument for this conclusion. It is unclear to me, why this conclusion, seems to require such extensive further argument)).

If my overall strategic goal was not clear, then this was probably my fault (in addition to not making this goal explicit, I also seem to have a tendency to lose focus on this larger strategic picture, during back and forth technical exchanges).

Two out of my three LW posts are in fact entirely devoted to arguing that making progress on the ``what alignment target should be aimed at?'' question is urgent (in our present discussion, we have only talked about the one post that is not exclusively focused on this). See:

Making progress on the ``what alignment target should be aimed at?'' question, is urgent 

The proposal to add a ``Last Judge'' to an AI, does not remove the urgency, of making progress on the ``what alignment target should be aimed at?'' question. 

(I am still very confused about this entire conversation. But I don't think that re reading everything, yet again, will help much. I have been continually paying, at least some, attention to SL4, OB, and LW since around 2002-2003. I can't remember exactly who said what when, or where. However, I have developed a strong intuition, that can be very roughly translated as: ``if something sounds strange, then it is very definitely not safe, to explain away this strangeness, by conveniently assuming that Nesov is confused on the object-level''. I am nowhere near the point where I would consider going against this intuition. So, I expect that I will remain very confused about this exchange, until there is some more information available. I don't expect to be able to just think my way out of this one (wild speculation, regarding what it might be, that I was missing, by anyone that happens to stumble on this comment, at any point in the future, are very welcome. For example in a LW comment, or in a LW DM, or in an email))

You are directing a lot of effort at debating details of particular proxies for an optimization target, pointing out flaws. My point is that strong optimization for any proxy that can be debated in this way is not a good idea, so improving such proxies doesn't actually help. A sensible process for optimizing something has to involve continually improving formulations of the target as part of the process. It shouldn't be just given any target that's already formulated, since if it's something that would seem to be useful to do, then the process is already fundamentally wrong in what it's doing, and giving a better target won't fix it.

The way I see it, CEV-as-formulated is gesturing at the kind of thing an optimization target might look like. It's in principle some sort of proxy for it, but it's not an actionable proxy for anything that can't come up with a better proxy on its own. So improving CEV-as-formulated might make the illustration better, but for anything remotely resembling its current form it's not a useful step for actually building optimizers.

Variants of CEV all having catastrophic flaws is some sort of argument that there is no optimization target that's worth optimizing for. Boundaries seem like a promising direction for addressing the group vs. individual issues. Never optimizing for any proxy more strongly than its formulation is correct (and always pursuing improvement over current proxies) responds to there often being hidden flaws in alignment targets that lead to catastrophic outcomes.

If your favoured alignment target suffers from a critical flaw that is inherent in the core concept, then surely it must be useful for you to discover this. So I assume that you agree that, conditioned on me being right about CEV suffering from such a flaw, you want me to tell you about this flaw. In other words, I think that I have demonstrated that CEV suffers from a flaw that is not related to any detail, of any specific version, or any specific description, or any specific proxy, or any specific attempt to describe what CEV is, or anything else along those lines. Instead, this flaw is inherent in the core concept of building an AI that is describable as ``doing what a Group wants''. The Suffering Reducing AI (SRAI) alignment target is known to suffer from this type of a core flaw. The SRAI flaw is not related to any specific detail, of any specific version, or proxy, or attempt to describe what SRAI is, etc. And the flaw is not connected to any specific definition of ``Suffering''. Instead, the tendency to kill everyone is inherent in the core concept of SRAI. It must surely be possible for you to update the probability that CEV also suffers from a critical flaw of this type (a flaw inherent in the core concept). SRAI sounds good on the surface, but it is known to suffer from such a core flaw. Thus, the fact that CEV sounds good on the surface does not rule out the existence of such a core flaw in CEV.

I do not think that it is possible to justify making no update, when discovering that the version of CEV that you linked to implies an outcome that would be far, far worse than extinction. I think that the probability must go up, that CEV contains a critical flaw, inherent in the core concept. Outcomes massively worse than extinction are not an inherent feature of any conceivable detailed description, of any conceivable alignment target. To take a trivial example, such an outcome is not implied by any given specific description of SRAI. The only way that you can motivate not updating is if you already take the position that any conceivable AI, that is describable as ``implementing the Coherent Extrapolated Volition of Humanity'', will lead to an outcome that is far, far, worse than extinction. If this is your position, then you can justify not updating. But I do not think that this is your position (if this were your position, then I don't think that CEV would be your favoured alignment target).

And this is not filtered evidence, where I constructed a version of CEV and then showed problems in that version. It is the version that you link to, that would be far, far, worse than extinction. So, from your perspective, this is not filtered. Other designs that I have mentioned elsewhere, like USCEV, or the ``non stochastic version of PCEV'', are versions that other people have viewed as reasonable attempts to describe what CEV is. The fact that you would like AI projects to implement safety measures, that would (if they work as intended) protect against these types of dangers, is great. I strongly support that. I would not be particularly surprised if a technical insight in this type of work turns out to be completely critical. But this does not allow you to justify not updating on unfiltered data. You simply can not block off all conceivable paths, leading to a situation, where you conclude that CEV suffers from the same type of core flaw, that SRAI is known to suffer from.

If one were to accept the line of argument that all information of this type can be safely dismissed, then this would have very strange consequences. If Steve is running a SRAI project, then he could use this line of argument to dismiss any finding that a specific version of SRAI leads to everyone dying. If Steve has a great set of safety measures, but simply does not update when presented with the information that a given version of SRAI would kill everyone, then Steve can never reach the point where he says: ``I was wrong. SRAI is not a good alignment target. The issue is not due to any details, of any specific version, or any specific definition of suffering, or anything else along those lines. The issue is inherent in the core concept of building an AI that is describable as a SRAI. Regardless of how great some set of safety measures looks to the design team, no one should initiate a SRAI project''. Surely, you do not want to accept a line of argument that would have allowed Steve to indefinitely avoid making such a statement, in the face of any conceivable new information about the outcomes of different SRAI variants.

The alternative to debating specific versions, is to make arguments on the level, of what one should expect based on the known properties of a given proposed alignment target. I have tried to do this and I will try again. For example, I wonder how you would answer the question: ``why would an AI, that does what an arbitrarily defined abstract entity wants that AI to do, be good for a human individual?''. One can discover that the Coherent Extrapolated Volition of Steve, would lead to the death of all of Steve's cells (according to any reasonable set of definitions). One can similarly discover that the Coherent Extrapolated Volition of ``a Group'', is bad for the individuals in that group (according to any reasonable set of definitions). Neither statement suffers from any logical tension. For humans, this should in fact be the expected conclusion for any ``Group AI'', given that, (i): many humans certainly sound as if they will ask the AI to hurt other humans as much as possible, (ii): a human individual is very vulnerable, to a powerful AI that is trying to hurt her as much as possible, and (iii): in a ``Group AI'' no human individual can have any meaningful influence, in the initial dynamic, regarding the adoption of those preferences that refer to her (if the group is large). If you doubt the accuracy of one of these three points, then I would be happy to elaborate, on whichever one you find doubtful. None of this, has any connection, to any specific version, or proxy, or attempt to describe what CEV is, or anything else along those lines. It is all inherent in the core concept of CEV (and any other AI proposal, that is describable as ``building an AI that does what a group wants it to do''). If you want, we can restrict all further discussion to this form of argument.

If one has already taken the full implications of (i), (ii), and (iii) into account, then one does not have to make a huge additional update, when observing an unfiltered massively-worse-than-extinction type outcome. But this is only because, when one has taken the full implications of (i), (ii), and (iii) into account, then one has presumably already concluded, that CEV suffers from a critical, core, flaw.

I don't understand your sentence: ``Variants of CEV all having catastrophic flaws is some sort of argument that there is no optimization target that's worth optimizing for.''. The statement ``CEV is not a good alignment target'' does not imply the non existence of good alignment targets. Right? In other words: it looks to me like you are saying, that a rejection of CEV as an alignment target, is equivalent to a rejection of all conceivable alignment targets. To me, this sounds like nonsense, so I assume that this is not what you are saying. To take a trivial example: I don't think that SRAI is a good alignment target. But surely a rejection of CEV does not imply a rejection of SRAI. Right? Just to be clear: I am definitely not postulating the non existence of good alignment targets. Discovering that ``the Coherent Extrapolated Volition of Steve implies the death of all his cells'', does not imply the non existence of alignment targets, where Steve's cells survive. Similarly, discovering that ``the Coherent Extrapolated Volition of Humanity is bad for human individuals'', does not imply the non existence of alignment targets, that are good for human individuals. (I don't think that good alignment targets are easy to find, or easy to describe, or easy to evaluate, etc. But that is a different issue)

I think it's best that I avoid building a whole argument, based on a guess, regarding what you mean here. But I do want to say, that if you are using ``CEV'' as a shorthand for ``the Coherent Extrapolated Volition of a single designer'', then you have to be explicit about this if you want me to understand you. And similarly: if ``CEV'' is simply a label, that you assign to any reasonable answer, to the ``what alignment target should be aimed at?'' question (provisional or otherwise), then you have to be explicit about this if you want me to understand you. If that is the case then I would have to phrase my claim as: ``Under no reasonable set of definitions does the Coherent Extrapolated Volition of Humanity deserve the label ``CEV''''. This only sounds odd due to the chosen label. There is no more logical tension in that statement, than there is logical tension in the statement: ``Under no reasonable set of definitions, does the Coherent Extrapolated Volition of Steve, result in any of Steve's cells surviving'' (discovering this about Steve should not be very surprising. And discovering this about Steve does not imply the non existence of alignment targets where Steve's cells survive).


PS:

I am aware of the fact that you (and Yudkowsky, and Bostrom, and a bunch of other people), can not be reasonably described as having any form of reckless attitude along the lines of: ``Conditioned on knowing how to hit alignment targets, the thing to do is to just instantly hit some alignment target that sounds good''. I hope that it is obvious, that I am aware of this. But I wanted to be explicit about this, just in case it is not obvious to everyone, that I am aware of this. Given the fact that there is one of those green leaf thingies next to my username, it is probably best to be explicit about this sort of thing.

I think that ``CEV'' is usually used as shorthand for ``an AI that implements the CEV of Humanity''. This is what I am referring to, when I say ``CEV''. So, what I mean when I say that ``CEV is a bad alignment target'', is that, for any reasonable set of definitions, it is a bad idea, to build an AI, that does what ``a Group'' wants it to do (in expectation, from the perspective of essentially any human individual, compared to extinction). Since groups and individuals, are completely different types of things, it should not be surprising to learn, that doing what one type of thing wants (such as ``a Group''), is bad for a completely different type of thing (such as a human individual). In other words, I think that ``an AI that implements the CEV of Humanity'', is a bad alignment target, in the same sense, as I think that SRAI is a bad alignment target.

But I don't think your comment uses ``CEV'' in this sense. I assume that we can agree that aiming for ``the CEV of a chimp'' can be discovered to be a bad idea (for example by referring to facts about chimps, and using thought experiments to see what these facts about chimps imply about likely outcomes). Similarly, it must be possible to discover that aiming for ``the CEV of Humanity'' is also a bad idea (for human individuals). Surely, discovering this cannot be, by definition, impossible. Thus, I think that you are in fact, not, using ``CEV'' as shorthand for ``an AI that implements the CEV of Humanity''. (I am referring to your sentence: ``If it's not something to aim at, then it's not a properly constructed CEV.'')

Your comment makes perfect sense, if I read ``CEV'' as shorthand for ``an AI that implements the CEV of a single human designer''. I was not expecting this terminology. But it is a perfectly reasonable terminology, and I am happy to make my argument, using this terminology. If we are using this terminology, then I think that you are completely right, about the problem that I am trying to describe, being a proxy issue (thus, if this was indeed your intended meaning, then I was completely wrong, when I said that I was not referring to a proxy issue. In this terminology, it is indeed a proxy issue). So, using this terminology, I would describe my concerns as: ``an AI that implements the CEV of Humanity'' is a predictably bad proxy, for ``an AI that implements the CEV of a single human designer''. This is because ``an AI that implements the CEV of Humanity'', is far, far, worse, than extinction, from the perspective of essentially any human individual (which, presumably, disqualifies it as a proxy, for ``an AI that implements the CEV of a single human designer''. If this does not disqualify it as a proxy, then I think that this particular human designer, is a very dangerous person (from the perspective of essentially any human individual)). Using this terminology (and assuming a non-unhinged designer), I would say that if your proposed project, is to use ``an AI that implements the CEV of Humanity'', as a proxy, for ``an AI that implements the CEV of a single human designer'', then this constitutes a, predictable, proxy failure. Further, I would say that pushing ahead, despite this predictable failure, with a project that is trying to implement ``an AI that implements the CEV of Humanity'' (as a proxy), inflicts an unnecessary s-risk, on everyone. Thus, I think it would be a bad idea, to pursue such a project (from the perspective of essentially any human individual. Presumably including the designer).

If we take the case of Bob, and his Suffering Reducing AI (SRAI) project (and everyone has agreed to use this terminology), then we can tell Bob:

SRAI is not a good proxy, for ``an AI that implements the CEV of Bob'' (assuming that you, Bob, do not want to kill everyone). Thus, you will run into a, predictable, issue, when your project tries to use SRAI as a proxy, for ``an AI that implements the CEV of Bob''. If you are implementing a safety measure successfully, then this will still, at best, lead to your project failing safely. At worst, your safety measure will fail, and SRAI will kill everyone. So, please don't proceed with your project, given that it would put everyone at risk of being killed by SRAI (and this would be an unnecessary risk, because your project will predictably fail, due to a predictable proxy issue).

By making sufficient progress, on the ``what alignment target should be aimed at?'' question, before Bob gets started on his SRAI project, it is possible to avoid the unnecessary extinction risks, associated with the proxy failure, that Bob will predictably run into, if his project uses SRAI, as a proxy for ``an AI that implements the CEV of Bob''. Similarly, it is possible to avoid the unnecessary s-risks, associated with the proxy failure, that Dave will predictably run into, if Dave uses ``an AI that implements the CEV of Humanity'', as a proxy, for ``an AI that implements the CEV of Dave'' (because any ``Group AI'', is very bad for human individuals (including Dave)).

Mitigating the unnecessary extinction risks, that are inherent in any SRAI project, does not require an answer, to the ``what alignment target should be aimed at?'' question (it was a long time ago, but if I remember correctly, Yudkowsky did this, around two decades ago. It seems likely, that anyone that is careful and capable enough, to hit an alignment target, will be able to understand that old explanation, of why SRAI, is a bad alignment target. So, generating such an explanation, was sufficient for mitigating the extinction risks, associated with a successfully implemented SRAI. Generating such an explanation, did not require an answer, to the ``what alignment target should be aimed at?'' question. One can demonstrate that a given bad answer, is a bad answer, without having any good answer). Similarly, avoiding the unnecessary s-risks, that are inherent in any ``Group AI'' project, does not require an answer, to the ``what alignment target should be aimed at?'' question. (I strongly agree, that finding an actual answer to this question, is probably very, very, difficult. I am simply pointing out, that even partial progress, on this question, can be very useful)

(I think that there are other issues, related to AI projects, whose purpose is to aim at ``the CEV, of a single human designer''. I will not get into this here, but I thought that it made sense, to at least mention, that there are other issues, related to this type of project)

Since groups and individuals, are completely different types of things,

I don't think this is obviously justifiable. It seems to me that cells work together to be a person, together tracking and implementing the agency of the aggregate system according to their interest as part of that combined entity, and in the same way, people work together to be a group, together tracking and implementing the agency of the group. I'm pretty sure that if you try to calculate my CEV with me in a box, you end up with an error like "import error: the rest of the reachable social graph of friendships and caring". I cannot know what I want without deliberating with others who I intend to be in a society with long term, because I will know that whatever answer I give for my CEV, it will very probably be misaligned with the rest of the people I care about. And I expect that the network of mutual utility across humanity is fairly well connected such that if I import friends, it ends up being a recursive import that requires evaluation of everyone on earth.

(By the way, any chance you could use fewer commas? The reading speed I can reach on your comments is reduced by them, because I have to bump up to deliberate thinking to check whether I've joined sentence fragments the way you meant. No worries if not, though.)

I think that extrapolation is a genuinely unintuitive concept. I would for example not be very surprised if it turns out that you are right, and that it is impossible to reasonably extrapolate you if the AI that is doing the extrapolation is cut off from all information about other humans. I don't think that this fact is in tension with my statement, that individuals and groups are completely different types of things. Taking your cell analogy: I think that implementing the CEV of you could lead to the death of every single cell in your body (for example if your mind is uploaded in a way that does not preserve information about any individual cell). I don't think that it is strange in general, if an extrapolated version of a human individual, is completely fine with the complete annihilation of every cell in her body (and this is true, despite the fact that ``hostility towards cells'' is not a common thing). Such an outcome is no indication of any technical failure, in an AI project, that was aiming for the CEV of that individual. This shows why there is no particular reason to think, that doing what a human individual wants, would be good for any of her cells (for any reasonable definition of ``doing what a human individual wants''). And this fact remains true, even if it is also the case, that a given cell would become impossible to understand, if that cell was isolated from other cells.

A related tangent here relates to the fact that extrapolation is a genuinely unintuitive concept. I think that this has important implications for AI safety. This fact is for example central to my argument about ``Last Judge'' type proposals in my post:

The proposal to add a ``Last Judge'' to an AI, does not remove the urgency, of making progress on the ``what alignment target should be aimed at?'' question. 

(I will try to reduce the commas. I see what you are talking about. I have in the past been forced to do something about an overuse of both footnotes and parentheses. Reading badly written academic history books seems to be making things worse (if one is analysing AI proposals where the AI is getting its goal from humans, then it makes sense to me to at least try to understand humans))

I think that implementing the CEV of you could lead to the death of every single cell in your body (for example if your mind is uploaded in a way that does not preserve information about any individual cell)

I will take this bet at any amount. My cells are a beautiful work of art crafted by evolution, and I am a guest in their awesome society. Any future where my cells' information is lost rather than transmuted and the original stored is unacceptable to me. Switching to another computational substrate without deep translation of the information in my cells is effectively guaranteed to need to examine the information in a significant fraction of my cells at a deep level, such that a generative model can be constructed which has significantly higher accuracy at cell information reconstruction than any generative model of today would. I suspect I am only unusual in having thought through this enough to identify this value, and that it is common in somewhat-less-transhumanist circles, usually manifesting as a resistance to augmentation rather than a desire to augment in a way that maintains a biology-like substrate.

Now, to be clear, I do want to rewrite my cells at a deep level - a sort of highly advanced dynamics-faithful "style transfer" into some much more advanced substrate, in particular one that operates smoothly between temperatures 2 kelvin and ~310 kelvin or ideally much higher (though if it turns out that a long adaptation period is needed to switch between ultra low temp and ultra high temp, that's fine, I expect that the chemicals that operate smoothly at the respective temperatures will look rather different). I also expect to not want to be stuck with using carbon; I don't currently understand chemistry enough to confidently tell you any of the things I'm asking for in this paragraph are definitely possible, but my hunch is that there are other atoms which form stronger bonds and have smaller fields that could be used instead, i.e. classic precise nanotech sorts of stuff. It probably takes a lot of energy to construct them, if they're possible.

But again, no uplift without being able to map the behaviors of my cells in high fidelity.

Interesting. I haven't heard this perspective. Can you say a little more about why you want to preserve the precise information in your cells? Is it solely about their impact on your mind's function? What level of approximation would you be okay with?

I'd be fine with having my mind simulated with a low-res body simulation, as long as that body felt more-or-less right and supported a range of moods and emotions similar to the ones I have now - but I'd be fine with a range of moods being not quite the same as the ones caused by the intricacies of my current body.

I was clearly wrong regarding how you feel about your cells. But surely the question of whether or not an AI that is implementing the CEV of Steve, would result in any surviving cells, is an empirical question? (which must be settled by referring to facts about Steve. And trying to figure out what these facts mean in terms of how the CEV of Steve would treat his cells). It cannot possibly be the case that it is impossible, by definition, to discover that any reasonable way of extrapolating Steve would result in all his cells dying?

Thank you for engaging on this. Reading your description of how you view your own cells was a very informative window, into how a human mind can work. (I find it entirely possible, that I am very wrong regarding how most people view their cells. Or about how they would view their cells upon reflection. I will probably not try to introspect, regarding how I feel about my own cells, while this exchange is still fresh)

Zooming out a bit, and looking at this entire conversation, I notice that I am very confused. I will try to take a step back from LW and gain some perspective, before I return to this debate.

It is getting late here, so I will stop after this comment, and look at this again tomorrow (I'm in Germany). Please treat the comment below as not fully thought through.

The problem from my perspective, is that I don't think that the objective, that you are trying to approximate, is a good objective (in other words, I am not referring to problems, related to optimising a proxy. They also exist, but they are not the focus of my current comments). I don't think that it is a good idea, to do what an abstract entity, called ``humanity'', wants (and I think that this is true, from the perspective of essentially any human individual). I think that it would be rational, for essentially any human individual, to strongly oppose the launch of any such ``Group AI''. Human individuals, and groups, are completely different types of things. So, I don't think that it should be surprising, to learn that doing what a group wants, is bad for the individuals, in that group. This is a separate issue, from problems related to optimising for a proxy.

I give one example, of how things can go wrong, in the post:

A problem with the most recently published version of CEV 

This is of course just one specific example, and it is meant as an introduction, to the dangers, involved in building an AI, that is describable as ``doing what a group wants''. Showing that a specific version of CEV, would lead to an outcome, that is far, far, worse than extinction, does not, on its own, prove that all versions of CEV are dangerous. I do however think that all versions of CEV, are, very, very, dangerous. And I do think, that this specific thought experiment, can be used to hint at a more general problem. I also hope, that this thought experiment will at least be sufficient, for convincing most readers that there, might, exist a deeper problem, with the core concept. In other words, I hope that it will be sufficient, to convince most readers that you, might, be going after the wrong objective, when you are analysing different attempts ``to say what CEV is''.

While I'm not actually talking about implementation, perhaps it would be more productive, to approach this from the implementation angle. How certain are you, that the concept of Boundaries / Membranes, provides reliable safety, for individuals, from a larger group, that contains the type of fanatics, described in the linked post? Let's say that it turns out, that they do not, in fact, reliably provide such safety, for individuals. How certain are you then, that the first implemented system, that relies on Boundaries / Membranes, to protect individuals from such groups, will in fact result, in you being able to try again? I don't think that you can possibly know this, with any degree of certainty. (I'm certainly not against safety measures. If anyone attempts to do what you are describing, then I certainly hope that this attempt will involve safety measures) (I also have nothing against the idea of Boundaries / Membranes)

An alternative (or parallel) path, to trial and error, is to try to make progress on the ``what alignment target should be aimed at?'' question. Consider what you would say to Bob, who wants to build a Suffering Reducing AI (SRAI). He is very uncertain of his definition of ``Suffering'', and he is implementing safety systems. He knows that any formal definition of ``Suffering'' that he can come up with, will be a proxy, for the actually, correct, definition of Suffering. If it can be shown, that some specific implementation of SRAI, would lead to a bad outcome (such as an AI, that decides to kill everyone), then Bob will respond that the definition of Suffering, must be wrong (and that he has prepared safety systems, that will let him try to find a better definition of ``Suffering'').

This might certainly end well. Bob's safety systems might continue to work, until Bob realises, that the core idea, of building any AI, that is describable as a SRAI, will always lead to an AI, that simply kills everyone (in other words: until he realises, that he is going after the wrong objective). But I would say, that a better alternative, is to make enough progress, on the ``what alignment target should be aimed at?'' question, that it is possible to explain to Bob, that he is, in fact, going after the wrong objective (and is not, in fact, dealing with proxy issues). (In the case of SRAI, such progress has of course been around for a while. I think I remember reading an explanation of the ``SRAI issue'', written by Yudkowsky, decades ago. So, to deal with people like Bob, there is no actual need, for us, to make additional progress. But for people in a world where SRAI, is the state of the art, in terms of answering the ``what alignment target should be aimed at?'' question, I would advise them to focus on making further progress, on this question.)

Alternatively, I could ask what you would say to Bob, if he thinks that ``reducing Suffering'', is ``the objectively correct thing to do'', and is convinced, that any implementation that leads to bad outcomes (such as an AI, that kills everyone), must be a proxy issue? I think that, just as any reasonable definition of ``Suffering'', implies a SRAI, that kills everyone, any reasonable set of definitions of ``a Group'', implies a Group AI, that is bad for human individuals (in expectation, when that Group AI is pointed at billions of humans, from the perspective of essentially any human individual, in the set of humans, that the Group AI is pointed at, compared to extinction). In other words, a Group AI is bad for human individuals in expectation, in the same sense as a SRAI kills everyone. I'm definitely not saying that this is true for ``minds in general''. If Dave is able to reliably see all implications of any AI proposal (or if Dave is invulnerable to a powerful AI that is trying to hurt Dave. Or if the minds that the Group AI will be pointed at, are known to be ``friendly towards Dave'' in some formal sense, that is fully understood by Dave), then this might not be true for Dave. But I claim that it is true for human individuals.

This may be true in other communities, but I think if you're more status motivated in AI safety and EA you are more likely to be concerned about potential downside risks. Especially post SBF.

Instead of trying to maximize the good, I see a lot of people trying to minimize the chance that things go poorly in a way that could look bad for them.

You are generally given more respect and funding and social standing if you are very concerned about downside risks and reputation hazards.

If anything, the more status-oriented you are in EA, the more likely you are to care about downside risks because of the Copenhagen interpretation of ethics.


Recently I was wondering if the old saying "two wrongs don't make a right" is in fact universally true. If not then perhaps that represents something of the flip side to the coin for good intentions and the road to hell.

What had me wondering about that was recalling a theory I heard about years ago in grad school, The Theory of Second Best. The linked wiki does seem to suggest additional reasons why good intentions may well produce harms.

As described in the wiki, there is no suggestion that, rather than attempting to resolve a market failure in one area of the economy, seeking to move that sub-market either further away from its optimum or at least to another non-optimal position might be welfare improving in a general sense. But I don't think there was anything that suggests that could not be the case.

Leaving aside all the hard work needed to prevent a slide into some "the ends justify the means" type of argument or conclusion, if there is scope for the idea that two wrongs, under certain conditions, can in fact make a right (or "righter"), then not taking that into consideration might lead us to misassess the risks to be managed.

I did like the post, and am not sure if this comment is more about some second-order level aspect or not.