Often the effect of being blinded is that you take suboptimal actions. As you pointed out in your example, if you see the problem then all sorts of cheap ways to reduce the harmful impact occur to you. So perhaps one way of getting to the issue could be to point at that: "I know you care about my feelings, and it wouldn't have made this meeting any less effective to have had it more privately, so I'm surprised that you didn't"?


Wireheading traps.

An agent is "wireheading" if it is taking an action that a) provides it with enormous amounts of utility for little effort, and b) is trivial or seemingly unrelated to its "main" utility function or goals.

People have discussed the possibility of an AI wireheading as a problem for getting it to do what you want: "what if it just works out a way to set your metric to +ve infinity and then goes to sleep satisfied?"

But we can also use this as a guard-rail.

A "wireheading trap" is an action which a) is very hard for an AI to do below a level of capacity X, but very easy for it to do afterwards, b) provides it with vastly more utility than anything else it could do, and c) has little effect on the world otherwise.

An AI with a wireheading trap that is set to be doable if it attains world-affecting capabilities "fails closed", because if it becomes more capable it will effectively switch itself off by wireheading. An example of a trap could be something like "etch 'Bill is a poo poo head' in 10m high letters into Mt Everest". Very hard if you don't have the ability to affect the physical world, but if you have nanotech... why bother melting humanity when you can just deface Mt Everest and be done with it?

Obvious problems:

  • Convergent instrumental goals. We don't want the AI to wipe us out in order to avoid us stopping it from wireheading, however unlikely that might be. It's unclear what to do about this. Myopia is one possibility (once you've wireheaded you don't care if you're stopped later?), but this has problems.
  • If you make it too attractive the AI won't even try and be useful before it has enough capacity, instead just wasting all its time on futile attempts to wirehead. Unclear how to make it attractive enough that it dominates once the capability is there but not before.
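
A toy sketch of the trade-off in the two bullets above, under loud assumptions: the agent is a one-shot expected-utility maximiser choosing between ordinary "work" and attempting the trap, and the trap succeeds with probability `leak` below the capability threshold and with certainty at or above it. The names `choose_action`, `leak`, `trap_utility` and `work_utility` are hypothetical and not from any existing framework.

```python
def trap_success_probability(capability: float, threshold: float, leak: float = 0.0) -> float:
    """Chance the trap succeeds: `leak` below the threshold, 1.0 at or above it (assumed step shape)."""
    return 1.0 if capability >= threshold else leak


def choose_action(capability: float, threshold: float, trap_utility: float,
                  work_utility: float = 1.0, leak: float = 0.0) -> str:
    """Return "trap" if attempting the trap has higher expected utility than working, else "work"."""
    expected_trap = trap_success_probability(capability, threshold, leak) * trap_utility
    return "trap" if expected_trap > work_utility else "work"


# Below the threshold with no chance of success: the agent stays useful.
print(choose_action(capability=0.5, threshold=1.0, trap_utility=1000.0))            # work
# Above the threshold: the trap dominates and the agent "fails closed".
print(choose_action(capability=1.5, threshold=1.0, trap_utility=1000.0))            # trap
# The second problem: a small chance of success times a huge payoff
# already beats useful work, so the agent wastes its time on futile attempts.
print(choose_action(capability=0.5, threshold=1.0, trap_utility=1000.0, leak=0.01)) # trap
```

In this toy model the trap only behaves as intended when `leak * trap_utility` stays below `work_utility`, which is one way of stating the "attractive enough, but not before" tension.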

Overall very half-baked, but I wonder if there's something to be done in the general area of "have the AI behave in a way that neuters it, but only when its capabilities increase".


We have trained ML systems to play games. What if we trained one to play a simplified version of the "I'm an AI in human society" game?

Have a population of agents with preferences. The AI is given some poorly specified goal, has the ability to expand its capabilities, etc. You might expect to observe things like a "treacherous turn".

If we could do that it would make for quite the scary headline: "Researchers simulate the future with AI and it kills us all". Not proof, but perhaps viral and persuasive.

I think I would argue that harm/care isn't obviously deontological. Many of the others are indeed about the performance of the action, but I think arguably harm/care is actually about the harm. There isn't an extra term for "and this was done by X".

That might just be me foisting my consequentialist intuitions on people, though.

"What if there's an arms race / race to the bottom in persuasiveness, and you have to pick up all the symmetrical weapons others use and then use asymmetrical weapons on top of those?"

Doesn't this question apply to other cases of symmetric/asymmetric weapons just as much?

I think the argument is that you want to try and avoid the arms race by getting everyone to agree to stick to symmetrical weapons because they believe it'll benefit them (because they're right). This may not work if they don't actually believe they're right and are just using persuasion as a tool, but I think it's something we could establish as a community norm in restricted circles at least.


The point that the Law needs to be simple and local so that humans can cope with it is also true of other domains. And this throws up an important constraint for people designing systems that humans are supposed to interact with: you must make it possible to reason simply and locally about them.

This comes up in programming (to a man with a nail everything looks like a hammer): good programming practice emphasises splitting programs up into small components that can be reasoned about in isolation. Modularity, compositionality, abstraction, etc. aside from their other benefits, make it possible to reason about code locally.

Of course, some people inexplicably believe that programs are mostly supposed to be consumed by computers, which have very different simplicity requirements and don't care much about locality. This can lead to programs that are very difficult for humans to consume.

Similarly, if you are writing a mathematical proof, it is good practice to try and split it up into small lemmas, transform the domain with definitions to make it simpler, and prove sub-components in isolation.

Interestingly, these days you can also write mathematical proofs to be consumed by a computer. And these often suffer from some of the same problems that computer programs do - because what is simple for the computer does not necessarily correspond to what is simple for the human.

(Tendentious speculation: perhaps it is not a coincidence that mathematicians tend to gravitate towards functional programming.)


I am reminded of Guided by the Beauty of our Weapons. Specifically, it seems like we want to encourage forms of rhetoric that are disproportionately persuasive when deployed by someone who is in fact right.

Something like "make the structure of your argument clear" is probably good (since it will make bad arguments look bad), "use vivid examples" is unclear (can draw people's attention to the crux of your argument, or distract from it), "tone and posture" are probably bad (because the effect is symmetrical).

So a good test is "would this have an equal effect on the persuasiveness of my speech if I were making an invalid point?". If the answer is no, then do it; otherwise maybe not.

I found Kevin Simler's observation that an apology is a lowering of status to be very helpful. In particular, it gives you a good way to tell if you made an apology properly - do you feel lower status?

I think that even if you take the advice in this post you can make non-apologies if you don't manage to make yourself lower your own status. Bits of the script that are therefore important:

  • Being honest about the explanation, especially if it's embarrassing.
  • Emphasising explanations that attribute agency to you - "I just didn't think about it" is bad for this reason.
  • Not being too calm and clinical about the process - this suggests that it's unimportant.

This also means that weird dramatic stuff can be good if it actually makes you lower your status. If falling to your knees and embracing the person's legs will be perceived as lowering your status rather than funny, then maybe that will help.

This is a great point. I think this can also lead to cognitive dissonance: if you can predict that doing X will give you a small chance of doing Y, then in some sense it's already in your choice set and you've got the regret. But if you can stick your fingers in your ears enough and pretend that X isn't possible, then that saves you from the regret.

Possible values of X: moving, starting a company, ending a relationship. Scary big decisions in general.

Something that confused me for a bit: people use regret-minimization to handle exploration-exploitation problems, so shouldn't they have noticed a bias against exploration? I think the answer here is that the "exploration" people usually think about involves taking an already known option to gain more information about it, not actually expanding the choice set. I don't know of any framework that includes actions that actually change the choice set.
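
A minimal sketch of why the choice set matters here, assuming the usual bandit-style definition of regret (the gap between the best available option's expected payoff and the payoff of what you actually chose). The option names and payoff numbers are made up for illustration, and `cumulative_regret` is a hypothetical helper, not a library function.

```python
def cumulative_regret(choices: list[str], option_payoffs: dict[str, int]) -> int:
    """Sum, over past choices, of the gap to the best option in the given (fixed) choice set."""
    best_payoff = max(option_payoffs.values())
    return sum(best_payoff - option_payoffs[choice] for choice in choices)


# Hypothetical expected payoffs for the options the agent admits it has.
known_options = {"stay": 5, "small_change": 6}
history = ["stay", "stay", "small_change"]

# Regret is measured only against options inside the choice set.
print(cumulative_regret(history, known_options))     # 2

# Admitting a scary big option into the set changes the regret
# attached to exactly the same history.
expanded_options = {**known_options, "move_city": 9}
print(cumulative_regret(history, expanded_options))  # 11
```

Under this definition the choice set is an input to the calculation, never something an action can alter, so keeping an option out of the set keeps its regret off the books.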
