Epistemic status: This post is flagrantly obscure, which makes it all the harder for me to revise it to reflect my current opinions. By the nature of the subject, it's difficult to give object-level examples. If you're considering reading this, I would suggest the belief signaling trilemma as a much more approachable post on a similar topic. Basically, take that idea, and extrapolate it to issues with coordination problems?

  • There are many situations where a system is "broken" in the sense that incentives push people toward bad behavior, but, not so much that an altruist has any business engaging in that bad behavior (at least, not if they are well-informed).
    • In other words, an altruist who understands the bad equilibrium well would disengage from the broken system, or engage while happily paying the cost of going against incentives.
    • Clearly, this is not always the case; I'm thinking about situations where it is the case.
      • Actually, I'm thinking about situations where it is the case supposing that we ignore certain costs, such as costs of going against peer pressure, costs of employing willpower to go against the default, etc. The question is then: is it realistically worth it, given all those additional costs, if we condition on it being worth it for an imaginary emotional-robot altruist?
      • Actually actually, the question I'm asking is probably not that one either, but I haven't figured out my real question yet.
        • I think maybe I'm mainly interested in the question of how hard it is for altruists to publicly discuss altruistic strategies (in the context of a bad equilibrium) without upsetting a bunch of people (who are currently coordinating on that equilibrium, and are therefore protective of it).
    • I'm writing this post to try to sort out some confused thoughts (hence the weird style). A lot of the context is discussion on this post.
      • But, I'm not going to discuss examples in my post. This seems like a case where giving examples is more likely to steer discussion to unproductive places than the reverse, at least if those examples are close to anyone's real situation/concern.
  • I'm using the term "altruist" in an absolute way here, which is a bit misleading.
    • I think it makes sense to talk about, and try to understand, what a perfect altruist can do to forward their own values. I'm not talking about a decision-theoretically perfect altruist. I'm talking about a basically normal person, who reflectively-stably prefers to forward some version of altruism. They may be very wrong about how to go about it, but if such wrongs were pointed out (with sufficient argument/evidence), they would change their behavior.
    • I'm not even claiming there are such people. I think discussing what the perfect altruist could do makes sense as a rallying point around which a lot of somewhat-altruistic people can coordinate epistemically -- IE, a lot of people who are not perfectly altruistic would still be interested in knowing what the perfect altruist would do, even if they ultimately decide it isn't worth it for their values.
    • A lot of what I'm going to say applies similarly well to imperfect altruists. I'm talking about the sort of person who operates on selfish motives a lot of the time, but who generally stops kicking puppies when they realize it hurts the puppies.
      • Well, ok, maybe that's too low a bar. But I think the bar isn't really too high.
  • I'm making the assumption that the altruists cannot exclude non-altruists well enough to discuss things only among altruists, at least not in public discussions. So the discussion is at least a little shaped by non-altruists. Certainly the discussion norms -- norms about taking offense, taboo topics, etc. -- are not going to be totally inconsiderate of more self-interested parties.
  • There's an important distinction between the idea that "[some bad behavior] is blameworthy", vs "[some bad behavior] is not worth it from an altruistic perspective".
    • This distinction is difficult to maintain in public discourse. People engaging in [the bad behavior] don't want to be punished. A consensus that "[the bad behavior] is bad" will seem very dangerous to them -- it is, at the very least, difficult to establish common knowledge that such a consensus wouldn't quickly slide into "[the bad behavior] is blameworthy and should be punished".
      • This follows from a norm that "bad things should be punished" -- or, unpacking a bit more: if an action is generally understood to do more harm than good (taking everyone into account) in comparison to a well-known alternative, and particularly if harm accrues to others (it's not a "victimless crime" -- not an activity between consenting adults), then negative consequences should be imposed.
      • This relates to being stuck in an asymmetric justice system.
        • I actually don't fully endorse the conclusions of that post. I think it does often make sense to set up asymmetric justice systems.
        • One reason is that net-positive activities are often monetizable anyway, because you can find a way to charge for the service. Justice systems focus on handling the negative because that's the part which is less likely to take care of itself.
        • It also seems somehow related to the difference between criminal law and civil law. A civil violation (a tort) involves paying damages -- you owe something, and you're paying back what you owe. Society sees it as OK once it's been evened out. A crime, on the other hand, is something which society wants to basically never happen. So, punishments are disproportionately large.
          • One justification for this could be that crimes can cause irreparable damages. No amount of money can bring a person back to life (...yet), so, it doesn't make sense to deal with murder by paying damages.
          • Another justification for disproportionate punishment may be that not all criminals are caught -- so, the punishment has to be sufficient to ensure that the crime is not worth the risk.
          • Regardless, it's important to ask whether the justice system succeeds in these purported goals. Setting higher penalties doesn't only disincentivize a crime -- it also makes people work harder not to be caught. Sometimes you might just be fighting a losing battle here, in which case it might work better to find ways to bring activities within the law rather than keeping them illegal (legitimizing and possibly regulating the illicit activity).
    • Especially in very public discussions, most people will work hard to at least maintain plausible deniability -- to keep alive the hypothesis that they could be acting altruistically. Public clarity on the question of what a real altruist would be motivated to do is dangerous to that.
      • Maybe it's less that everyone needs to plausibly be an altruist, and more that no one wants to look selfish.
        • Logically, these are the same, but the practical difference is that you can avoid looking selfish by maintaining that there's something between selfishness and altruism.
          • I'm not denying that there's a spectrum between perfect selfishness and perfect altruism. But I think people broadly talk about selfishness and altruism as if it's more than just a spectrum -- like there's at least a third option, which involves taking care of yourself and mostly minding your own business and not hurting anyone. Basically, "good" as norm-following.
          • All of the activities associated with this third option are real things, but this doesn't stop us from examining how selfish or altruistic one must be for the actions to make sense.
          • (And it seems important to recognize that the third option may be used as a smokescreen to avoid such analysis.)
      • One weapon which can be used to muddy the waters is to publicly equivocate between the altruistic strategy and (irrational) self-sacrifice.
        • An extreme version of this is if you make sure that everyone thinks true altruists would immediately give away all their money and material resources to some cause. The true altruist would not do this, at least not in most circumstances, because it would destroy their ability to pursue altruistic causes. However, if you get people to think this, then you can excuse your actions by talking about "the incentives" and, if anyone points out that your behavior is bad on net taking pros and cons of following bad local incentives into account, you accuse them of hypocrisy for not giving away all their money or something like that.
          • Ideally we would want to be able to respond to something like that by providing reassurance that we're not assigning blame to them, and then go back to discussing realistically what an altruist would do. (I think.) However, that seems difficult to assure!
          • All of this will be going on in the background for the person being defensive -- they probably wouldn't admit to themselves that they're getting defensive because they're scared they'll suddenly be coordinated against if the discussion continues.
          • It's important to keep in mind that the hypothetical interlocutor may themselves be an altruist (or close enough). They're not necessarily feeling threatened because they're secretly selfish and don't want to be outed. They're feeling threatened because they are currently engaging in behavior consistent with the bad equilibrium, and don't want to be suddenly coordinated against! An altruist who doesn't yet see that it's worth going against the grain can be in this position.
        • A less extreme version of this is to constantly bring up the idea that altruists have to deal with incentives too -- that you can't accomplish anything if you insist on acting as if you're already in the better equilibrium you're wishing for.
          • Here, you're not equivocating between altruists and self-sacrifice, but rather, you're equivocating between naive altruistic strategies which ignore the cost-benefit analysis and sophisticated altruistic strategies which respect the cost-benefit analysis. It's plausible that any altruistic strategy someone articulates is still too naive, so you raise the hypothesis to attention all the time.
          • This might even be correct; the problem is if the argument is being used to block discussion of possibly better strategies.
          • But it might seem really really worth it to block discussion for seemingly pragmatic reasons, because discussions of what actions might be worth it for altruists really can slide into assigning blame to overtly non-altruistic acts.
          • In fact, maybe it is worth it to block such discussions!! But, in doing so, it seems sad if we can't at least flag explicitly that we've given up on the possibility of truth-seeking in public discourse in that area.
    • On the other hand, such public discussion is very valuable to altruists (and to people who are sufficiently close to being altruists). Altruists aren't perfect, or even necessarily very good, at reasoning through these things themselves. So it is great if there can be publicly available clarity about what altruists should do ("should" in the sense of rationality, not in the sense of blameworthiness).
      • Furthermore, it's pretty bad for altruists if the public discourse manages to coordinate on misleading arguments which preserve the status quo by successfully equating selfish incentive gradients with altruistic incentive gradients and concluding that the best you can do is make small improvements while mostly following the norms of the bad equilibrium.
        • This is really easy to do, because almost everyone learns by imitating others; you might explicitly question a few specifically fishy things, but the vast majority of the time, you imitate standard operating procedure.
          • By "question", I mean "question whether it's really worth it, in a way which may lead to change". Everyone gripes about some things which seem like coordination failures (especially acts committed by other people). But, in the end, we follow a lot of norms by imitation, in part because it is difficult to evaluate everything ourselves.
        • It's only when there are a bunch of smart people trying to figure out better ways of doing things that there's a significant threat of anything other than this happening; so, it would make sense for there to be a lot of pressure on such a discourse to avoid forming a true consensus that [bad thing] is bad.
          • Again, this pressure can come from altruists. Maybe no one who has thought about it a bunch thinks [bad thing] should be punished. Even the altruists who recognize that [bad thing] is bad. So, even altruists who are savvy to this whole situation could engage in behavior to prevent and stifle conversation about whether [bad thing] is bad.
          • The avoidable tragedy here is the less savvy altruists who might trust the public discourse. We want to either find a way to discuss whether [bad thing] is bad (without being threatening; without being likely to slide into punishing people currently doing [bad thing]; without earning the ire and outrage of people whose livelihood currently depends on [bad thing]), or, at least, we want to avoid contaminating the public discourse of those people who are interested in figuring out what an altruistic person can do to forward their altruistic goals.
          • This could mean shutting down the conversation in a clearly-marked way, which avoids spreading any misleading justifications about incentives. This could mean propagating a meme which says "we can't realistically talk about this without going crazy, because there are too many weird things going on here" (similar to the don't-talk-politics meme). I don't know.
          • (certainly my wish would be to find a safe way to talk about things like this)
    • This cuts both ways: because almost no one wants to be obviously acting from selfish motives, public clarity about what is altruistically worth it may be a good way, in itself, of achieving better equilibria. People are willing to coordinate around clear pictures of what the altruistic person would do, because it's good signalling, or perhaps because they're too afraid of getting called out to not coordinate around such things.
      • But note that this fact in itself is part of the problem -- part of what will make some people upset about attempts to achieve such clarity.
      • (Even if we manage to successfully avoid the slippery slope from consensus-about-altruism to consensus-to-punish, a consensus-on-altruism may be enough to disrupt the current equilibrium simply due to the number of people who switch to the altruistic action. Even with no new norm and no enforcement, the disruption to the equilibrium may itself be costly for some individuals, whose current strategy depends on the status quo.)
        • You can think of this as "an equilibrium prefers homeostasis". The economy/ecology has vested interest in maintaining the current equilibrium because that equilibrium has sprouted a number of niches, and people depend on those niches.
        • There will likely be more dependence on the consensus wrong answer than you realize, because people hide the ways in which they depend on coordinating on the wrong answer to questions. (Instinctively -- they don't usually realize they're doing this.)
    • But [bad thing] really is blameworthy!
      • Yeah, I mean obviously, the ideal situation would be to coordinate against [bad thing]. It's better for everyone, even the "selfish" people, in the end.
      • no! bad dog! get down! off the table!...
    • One possible solution to this problem is to make a very clear "no retroactive law" kind of policy.
      • If successfully established, this would stop people from feeling threatened about being punished for their current behavior. They'd trust that a social consensus against [bad thing] would leave them time to stop doing [bad thing] before any enforcement came into effect.
      • One problem with this is that there will still be people whose livelihood depends on [bad thing] -- often to a much greater degree than you might suspect (since people will tend to hide this, and hide it well). So, some people (again, including some altruists, especially if they don't understand the full argument about [bad thing]) will still become very defensive and try to muddy the waters whenever they notice people discussing whether [bad thing] is bad.
        • (Again, they don't necessarily realize it's what they're doing -- it's natural to get defensive about things connected to your livelihood, and motivated cognition kicks in.)
      • Another problem is that it would be hard to get that much legibility around norm enforcement. There might be a slowly building consensus around [bad thing]. Over time, more and more people think it is bad, and more and more people take it upon themselves to punish bad-doers. Punishment can take small and illegible forms. People might associate with you less, because they don't want to associate with people seen as bad. You might be blindsided -- maybe you weren't paying attention as consensus shifted over the course of a year, and now suddenly everyone is out to get you.
      • Still, a strong "no retroactive law" norm makes a lot of sense to me.
        • In the conversation-norms example from the rabbit-hunt post, it is obviously better to publicly discuss and confirm a norm before enforcing it. Not only, or even primarily, because this provides a clean starting point for a norm at which enforcement begins.
      • More generally, it makes sense to have a policy of carefully discussing what new equilibria should be, and trying to address concerns from as many parties as possible by engaging in positive-sum trades, before instituting new norms.
        • You want people to expect this. You want people to be happy that you're discussing the possibility of a better equilibrium, even though the equilibrium shift might threaten them, because they expect compensation.
        • This might mean throwing people a bone even when you see them as "bad actors". Even extremely bad actors.
          • From the inside, it might feel like giving them reparations for the sad fact that they won't be able to be evil anymore. It's very counterintuitive.
          • It's also sketchy. We generally don't want to incentivise people to find ways to selfishly extract value from a system at others' expense, on the expectation that we'll later reward them for that when we figure out what's going on.
          • It's kind of like the argument trolls and black-hat hackers sometimes make -- "I'm teaching them a lesson; they shouldn't be so easy to exploit". Except you're backing that argument up with cash payments for their service, or special treatment, or honors and accolades -- a Best Villain prize.
  • So, there's a sense in which punishment norms are the problem.
    • Like, because there's already a fairly broad norm against being overtly selfish -- a norm which seems very reasonable and altruistically positive on the face of it -- we end up stuck in an equilibrium where it is difficult to get public clarity about what an altruist can do to best forward altruistic values.
  • "Is the problem really so bad? It seems like altruists can just talk about altruism."
    • This part is difficult to get into without pointing at specific examples, but, I don't really expect pointing at examples to work. Examples tend to be really slippery, because everything is already covered in a blanket of plausible deniability.
      • That's the severity of the issue. Those naive to the dynamic I'm pointing at will not be able to see that this is going on in a conversation, because that's the point of the strategy -- to avoid creating common knowledge. On the other hand, people who understand what's going on will tend to be in on it, so, will continue to throw up smokescreens to maintain plausible deniability around any example which is pointed out.
    • The hope is that talking about these dynamics at the meta level helps to make people who would be naive more aware about what might be going on in a conversation. Talking about these issues a bunch on the meta-level makes plausible deniability harder to maintain in specific situations later on, without specifically threatening existing plausible deniability.
    • There's an obvious relationship to what you can't say, and also an analogy to lies we tell kids.
  • It might be that I'm just reinventing the book Moral Mazes. I haven't read it yet.

I totally forgot about this post, and in the context of the 2019 Review I am interested in how Abram now thinks about it.

At the time I think I generally liked the line of questioning this post was asking, and felt like it was the right way to go about following up on the questions posed in the "It's Not the Incentives, It's You" discussion.

I guess I have the impression that it's difficult to talk about the issues in this post, especially publicly, without being horribly misunderstood (by some). Which is some evidence about the object level questions.

I regret writing this post because I later heard that Michael Arc was using the fact that I wrote it as evidence of corruption inside MIRI, which sorta overshadows my thinking about the post.


people who would me naive


I find it very difficult to agree to any generality without identifying some representative specifics. It feels way too much like I'm being asked to sign up for something without being told what. Relatedly, if there are zero specifics that you think fit the generalization well enough to be good examples, it seems very likely that the generalization itself is flawed.

One point of confusion in trying to generalize bad behavior (bad equilibrium is an explanation or cause, bad behavior is the actual problem) is that incentives aren't exogenous - they're created and perpetuated by actors, just like the behaviors we're trying to change. One actor's incentives are another actor's behaviors.

I think all of this comes down to "many humans are not altruistic to the degree or on the dimensions I want". I've long said that FAI is a sidetrack, if we don't have any path to FNI (friendly natural intelligence).

FAI is a sidetrack, if we don't have any path to FNI (friendly natural intelligence).

I don't think I understand the reasoning behind this, though I don't strongly disagree. Certainly it would be great to solve the "human alignment problem". But what's your claim?

If a bunch of fully self-interested people are about to be wiped out by an avoidable disaster (or even actively malicious people, who would like to hurt each other a little bit, but value self-preservation more), they're still better off pooling their resources together to avert disaster.

You might have a prisoner's dilemma / tragedy of the commons -- it's still even better if you can get everyone else to pool resources to avert disaster, while stepping aside yourself. BUT:

  • that's more a coordination problem again, rather than an everyone-is-too-selfish problem
  • that's not really the situation with AI, because what you have is more a situation where you can either work really hard to build AGI or work even harder to build safe AGI; it's not a tragedy of the commons, it's more like lemmings running off a cliff!
One point of confusion in trying to generalize bad behavior (bad equilibrium is an explanation or cause, bad behavior is the actual problem) is that incentives aren't exogenous - they're created and perpetuated by actors, just like the behaviors we're trying to change. One actor's incentives are another actor's behaviors.

Yeah, the incentives will often be crafted perversely, which likely means that you can expect even more opposition to clear discussion, because there are powerful forces trying to coordinate on the wrong consensus about matters of fact in order to maintain plausible deniability about what they're doing.

In the example being discussed here, it just seems like a lot of people coordinating on the easier route, partly due to momentum of older practices, partly because certain established people/institutions are somewhat threatened by the better practices.

I find it very difficult to agree to any generality without identifying some representative specifics. It feels way too much like I'm being asked to sign up for something without being told what. Relatedly, if there are zero specifics that you think fit the generalization well enough to be good examples, it seems very likely that the generalization itself is flawed.

My feeling is that small examples of the dynamic I'm pointing at come up fairly often, but things pretty reliably go poorly if I point them out, which has resulted in an aversion to pointing such things out.

The conversation has so much gravity toward blame and self-defense that it just can't go anywhere else.

I'm not going to claim that this is a great post for communicating/educating/fixing anything. It's a weird post.