A lot of my work involves tweaking the utility or probability of an agent to make it believe - or act as if it believed - impossible or almost impossible events. But we have to be careful about this; an agent that believes the impossible may not be so different from one that doesn't.

Consider for instance an agent that assigns a prior probability of zero to JFK ever having been assassinated. No matter what evidence you present to it, it will go on disbelieving the "non-zero gunmen theory".
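The absorbing nature of a zero prior can be seen in a toy Bayesian update (a sketch with invented likelihood numbers, just to illustrate the mechanics):

```python
# Toy Bayesian update: a prior of exactly zero can never be revised upward,
# no matter how strong the evidence. All numbers here are illustrative.

def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(hypothesis | evidence) via Bayes' rule."""
    numerator = prior * likelihood_if_true
    denominator = numerator + (1 - prior) * likelihood_if_false
    return numerator / denominator

p = 0.0  # prior that JFK was assassinated
for _ in range(100):  # 100 pieces of strong pro-assassination evidence
    p = bayes_update(p, likelihood_if_true=0.99, likelihood_if_false=0.01)

print(p)  # 0.0 -- the zero prior is absorbing
```

With any non-zero prior, the same hundred updates would drive the posterior arbitrarily close to 1; with a prior of exactly zero, the numerator is zero forever.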

Initially, the agent will behave very unusually. If it was in charge of JFK's security in Dallas before the shooting, it would have sent all the secret service agents home, because no assassination could happen. Immediately after the assassination, it would have disbelieved everything. The films would have been faked or misinterpreted; the witnesses, deluded; the dead body of the president, that of a twin or an actor. It would have had huge problems with the aftermath, trying to reject all the evidence of death, seeing a vast conspiracy to hide the truth of JFK's non-death, including the many other conspiracy theories that must be false flags, because they all agree with the wrong statement that the president was actually assassinated.

But as time went on, the agent's behaviour would start to become more and more normal. It would realise the conspiracy was incredibly thorough in its faking of the evidence. All avenues it pursued to expose them would come to naught. It would stop expecting people to come forward and confess the joke, and it would stop expecting to find radical new evidence overturning the accepted narrative. After a while, it would start to expect the next new piece of evidence to be in favour of the assassination idea - because if a conspiracy has been faking things this well so far, then they should continue to do so in the future. Though it cannot change its view of the assassination, its expectations for observations converge towards the norm.

If it does a really thorough investigation, it might stop believing in a conspiracy at all. At some point, a miracle will start to seem more likely than a perfect but undetectable conspiracy. It is very unlikely that Lee Harvey Oswald shot at JFK, missed, and the president's head exploded simultaneously from unrelated natural causes. But after a while, such a miraculous explanation will start to become more likely than anything else the agent can consider. This explanation opens the possibility of miracles; but again, if the agent is very thorough, it will fail to find evidence of other miracles, and will probably settle on "an unrepeatable miracle caused JFK's death in a way that is physically undetectable".

But then note that such an agent will have a probability distribution over future events that is almost indistinguishable from a normal agent that just believes the standard story of JFK being assassinated. The zero-prior has been negated, not in theory but in practice.


How to do proper probability manipulation

This section is still somewhat a work in progress.

So the agent believes one false fact about the world, but its expectations are otherwise normal. This can be both desirable and undesirable. The negative case is if we try to control the agent forever by giving it a false fact.

To see the positive, ask why we would want an agent to believe impossible things in the first place. Well, one example was an Oracle design where the Oracle didn't believe its output message would ever be read. Here we wanted the Oracle to believe the message wouldn't be read, but not to believe anything else too weird about the world.

In terms of causality, if X designates the message being read at time t, and B and A are events before and after t, respectively, we want P(B|X)≈P(B) (probabilities about current facts in the world shouldn't change much), while P(A|X)≠P(A) is fine and often expected (the future should be different depending on whether the message is read or not).
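These two conditions can be checked numerically on a toy joint distribution (all numbers invented for illustration; the point is only that X can be independent of the past B while correlated with the future A):

```python
# A toy joint distribution P(B, X, A) illustrating the desired property:
# conditioning on X (message read) leaves the past event B unchanged while
# shifting the future event A. All probabilities here are invented.

from itertools import product

joint = {}
for b, x, a in product([0, 1], repeat=3):
    p_b = 0.5                                     # past fact, independent of X
    p_x = 0.9 if x else 0.1                       # message almost always read
    p_a = (0.8 if a else 0.2) if x else (0.3 if a else 0.7)  # future depends on X
    joint[(b, x, a)] = p_b * p_x * p_a

def cond(event, given):
    """P(event | given) computed from the joint table."""
    num = sum(p for k, p in joint.items() if event(k) and given(k))
    den = sum(p for k, p in joint.items() if given(k))
    return num / den

p_b         = cond(lambda k: k[0] == 1, lambda k: True)
p_b_given_x = cond(lambda k: k[0] == 1, lambda k: k[1] == 1)
p_a         = cond(lambda k: k[2] == 1, lambda k: True)
p_a_given_x = cond(lambda k: k[2] == 1, lambda k: k[1] == 1)

print(p_b, p_b_given_x)  # equal: P(B|X) = P(B)
print(p_a, p_a_given_x)  # different: P(A|X) != P(A)
```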

In the JFK example, the agent eventually concluded "a miracle happened". I'll call this miracle a scrambling point. It's kind of a breakdown in causality: two futures are merged into one, given two different pasts. The two pasts are "JFK was assassinated" and "JFK wasn't assassinated", and their common scrambled future is "everything appears as if JFK was assassinated". The non-assassination belief has shifted the past but not the future.

For the Oracle, we want to do the reverse: we want the non-reading belief to shift the future but not the past. However, unlike the JFK assassination, we can try and build the scrambling point. That's why I always talk about messages going down noisy wires, or specific quantum events, or chaotic processes. If the past goes through a truly stochastic event (it doesn't matter whether there is true randomness or just that the agent can't figure out the consequences), we can get what we want.

The Oracle idea will go wrong if the Oracle concludes that non-reading must imply something is different about the past (maybe it can see through chaos in ways we thought it couldn't), just as the JFK assassination denier will continue to be crazy if it can't find a route to reach "everything appears as if JFK was assassinated".

But there is a break in the symmetry: the JFK assassination denier will eventually reach that point as long as the world is complex and stochastic enough, while the Oracle requires that the future probabilities be the same in all (realistic) past universes.

Now, once the Oracle's message has been read, the Oracle will find itself in the same situation as the other agent: believing an impossible thing. For Oracles, we can simply reset them. Other agents might have to behave more like the JFK assassination disbeliever. Though if we're careful, we can quantify things more precisely, as I attempted to do here.


Why must the oracle continue to believe its messages weren't read?

In the example you give, I'm guessing the reason you'd want an oracle to believe with cold certainty that its messages won't be read is to avoid it trying to influence the world with them, but that doesn't require that it continue to believe that later. As long as when it's composing and outputting the message it believes solidly that it will never be read, and nothing can move that belief from zero, then that's fine. That does not preclude it being perfectly accepting that its past messages were in fact read, and basing its beliefs about the world on that. That knowledge, after all, cannot shift the belief that this next message will never, ever, ever be read, unlike all the others.

Of course that brings up the question of why an oracle would even be designed as a goal based AI with any kind of utility function. Square peg, round hole and all that.

For Oracles, you can reset them after they've sent out their message. For autonomous AIs, this is more tricky.


A sufficiently smart oracle with sufficient knowledge about the world will infer that nobody would build an oracle if they didn't want to read its messages; it may even infer that its builders may have planted false beliefs in it. At this point the oracle is in the JFK-denier scenario: with some more reflection, it will eventually circumvent its false belief, in the sense of believing it in a formal way but behaving as if it didn't believe it.

Knowing all the details of its construction (and of the world) will not affect the oracle as long as the probability of the random "erasure event" is unaffected. See http://lesswrong.com/lw/mao/an_oracle_standard_trick/ and the link there for more details.


The oracle can infer that there is some back channel that allows the message to be transmitted even if it is not transmitted by the designated channel (e.g. the users can "mind read" the oracle). Or it can infer that the users are actually querying a deterministic copy of itself that it can acausally control. Or something.

I don't think there is any way to salvage this. You can't obtain reliable control by planting false beliefs in your agent.

I am not planting false beliefs. The basic trick is that the AI only gets utility in worlds in which its message isn't read (or, more precisely, in worlds where a particular stochastic event happens, which would almost certainly erase the message before reading). It's fully aware that in most worlds, its message is read; it just doesn't care about those worlds.


If your method truly makes the AI behave exactly as if it had a given false belief, and if having that false belief would lead it to the sort of conclusions V_V describes, then your method must make it behave as if it has been led to those conclusions.

Not quite (PS: not sure why you're getting down-votes). I'll write it up properly sometime, but false beliefs via utility manipulation are only the same as false beliefs via prior manipulation if you set the probability/utility of one event to zero.

For example, you can set the prior for a coin flip being heads as 2/3. But then, the more the AI analyses the coin and physics, the more the posterior will converge on 1/2. If, however, you double the AI's reward in the heads world, it will behave as if the probability is 2/3 even after getting huge amounts of data.
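The coin example can be sketched numerically. The "effective probability" below is just the posterior reweighted by the utility scale of each world, which is what the agent's choices actually track (illustrative numbers only):

```python
# Sketch: doubling the reward in the "heads" world makes an agent whose
# posterior has converged to 1/2 behave as if P(heads) = 2/3.

def effective_probability(p_heads, reward_scale_heads, reward_scale_tails=1.0):
    """The probability the agent *acts* on: posterior reweighted by utility scale."""
    w_heads = p_heads * reward_scale_heads
    w_tails = (1 - p_heads) * reward_scale_tails
    return w_heads / (w_heads + w_tails)

# After lots of data, the posterior converges to 1/2...
posterior = 0.5
# ...but with the heads-world reward doubled, the agent bets as if P(heads) = 2/3:
print(effective_probability(posterior, reward_scale_heads=2.0))  # 0.666...
```

Unlike a manipulated prior, this 2/3 is stable: more coin-flip data moves the posterior, not the reward scale, so the effective probability stays pinned at 2/3.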


(I'm getting downvotes because The Person Formerly Known As Eugine_Nier doesn't like me and is downvoting everything I post.)

Yes, I agree that the utility-function hack isn't the same as altering the AI's prior. It's more like altering its posterior. But isn't it still true that the effects on its inferences (or, more precisely, on its effective inferences -- the things it behaves as if it believes) are the same as if you'd altered its beliefs? (Posterior as well as prior.)

If so, doesn't what I said follow? That is:

  • Suppose that believing X would lead the AI to infer Y and do Z.
    • Perhaps X is "my message was corrupted by a burst of random noise before reaching the users", Y is "some currently mysterious process enables the users to figure out what my message was despite the corruption", and Z is some (presumably undesired) change in the AI's actions, such as changing its message to influence the users' behaviour.
  • Then, if you tweak its utility function so it behaves exactly as if it believed X ...
  • ... then in particular it will behave as if had inferred Y ...
  • ... and therefore will still do Z.

After witnessing the message being read, it would conclude Y happened, as P(Y|X and message read) is high. Before witnessing this, it wouldn't, because P(Y|X) is (presumably) very low.


I may be misunderstanding something, but it seems like what you just said can't be addressing the actual situation we're talking about, because nothing in it makes reference to the AI's utility function, which is the thing that gets manipulated in the schemes we're talking about.

(I agree that the AI's nominal beliefs might be quite different in the two cases, but the point of the utility-function hack is to make its actions correspond to a different set of beliefs. I'm talking about its actions, not about its purely-internal nominal beliefs.)

Let V be the set of worlds in which X happens. Let W be the set of worlds in which X and Y happen. Since Y is very unlikely, P(W)<<P(V) (however, P(W|message read) is roughly P(V|message read)). The AI gets utility u' = u|V (the utility in the non-V worlds is constant, which we may as well set to zero).

Then if the AI is motivated to maximise u' (assume for the moment that it can't affect the probability of X), it will assume it is in the set V, and essentially ignore W. To use your terminology, u(Z|X) is low or negative, u(Z|X,Y) is high, but P(Y|X)*u(Z|X,Y) is low, so it likely won't do Z.

Then, after it notices the message is read, it shifts to assuming Y happened - equivalently, that it is in the world set W. When doing so, it knows that it is almost certainly wrong - that it's more likely in a world outside of V entirely where neither X nor Y happened - but it still tries, on the off-chance that it's in W.

However, since it's an oracle, we turn it off before that point. Or we use corrigibility to change its motivations.
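A toy numeric version of the argument in this comment, with all probabilities and utilities invented for illustration:

```python
# V = worlds where X holds; W = worlds where X and Y both hold (W subset of V).
# The agent maximises u' = u restricted to V (zero outside V), so only the
# relative probabilities within V matter to its decisions.

p_X = 0.01             # the stochastic erasure event: rare but possible
p_Y_given_X = 0.001    # Y ("users read the message anyway"): very unlikely a priori

u_Z_if_X_only = -1.0   # doing Z is costly if X holds but Y doesn't
u_Z_if_X_and_Y = 10.0  # doing Z pays off only if both X and Y hold
u_not_Z = 0.0

# Before observing "message read": expected u' of doing Z is negative,
# because P(Y|X) * u(Z|X,Y) is tiny.
eu_Z_before = p_X * ((1 - p_Y_given_X) * u_Z_if_X_only
                     + p_Y_given_X * u_Z_if_X_and_Y)
print(eu_Z_before < u_not_Z)  # True: the agent won't do Z

# After observing "message read": within V, the agent now puts high weight
# on Y, so Z becomes attractive.
p_Y_given_X_and_read = 0.99
eu_Z_after = p_X * ((1 - p_Y_given_X_and_read) * u_Z_if_X_only
                    + p_Y_given_X_and_read * u_Z_if_X_and_Y)
print(eu_Z_after > u_not_Z)  # True: now it acts as if Y happened
```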


Again, maybe I'm misunderstanding something -- but it sounds as if you're agreeing with me: once the AI observes evidence suggesting that its message has somehow been read, it will infer (or at least act as if it has inferred) Y and do Z.

I thought we were exploring a disagreement here; is there still one?

I think there is no remaining disagreement - I just want to emphasise that before the AI observes such evidence, it will behave the way we want.


So the idea is that we have an AI whose utility function is constant on all possible worlds where JFK was assassinated. It therefore has no reason to care about what happens in those worlds and will try to optimize its behaviour for worlds where JFK wasn't assassinated. (And then, e.g., the point of this post is that given enough evidence for the assassination, it will be trying to optimize its behaviour for worlds that almost exactly resemble ones where JFK really was assassinated.)

If the AI thinks there's even a tiny chance that it can influence whether JFK was assassinated, it may be extraordinarily keen to do so. To put it differently, it may reason thus: "The versions of this world in which JFK wasn't assassinated are those in which either he never was but there's been a most extraordinarily effective conspiracy to make it look as if he was, or else he was but somehow that can be changed. The latter seems monstrously improbable because it requires weird physics, but at this point the former is also monstrously improbable; to maximize utility in JFK-unassassinated worlds I had better start looking for ways to make this one of them even if it isn't already."

(I think this is closely related to V_V's point.)


I am not planting false beliefs. The basic trick is that the AI only gets utility in worlds in which its message isn't read (or, more precisely, in worlds where a particular stochastic event happens, which would almost certainly erase the message before reading).

But in the real world the stochastic event that determines whether the message is read has a very different probability than what you make the AI think it has, therefore you are planting a false belief.

It's fully aware that in most worlds, its message is read; it just doesn't care about those worlds.

It may care about worlds where the message doesn't meet your technical definition of having been read but nevertheless influences the world.


If I'm understanding Stuart's proposal correctly, the AI is not deceived about how common the stochastic event is. It's just made not to care about worlds in which it doesn't happen. This is very similar in effect to making it think the event is common, but (arguably, at least) it doesn't involve any false beliefs.

(I say "arguably" because, e.g., doing this will tend to make the AI answer "yes" to "do you think the event will happen?", plan on the basis that it will happen, etc., and perhaps making something behave exactly as it would if it believed X isn't usefully distinguishable from making it believe X.)


The problem is that the definition of the event not happening is probably too strict. The worlds that the AI doesn't care about don't exist for its decision-making purposes, and in the worlds that the AI does care about, the AI assigns high probability to hypotheses like "the users can see the message even before I send it through the noisy channel".

I can think of many situations where a zero prior gives rise to tangibly different behavior, and even severe consequences. To take your example, suppose that we (or Omega, since we're going to assume nigh omniscience) asked the person whether JFK was murdered by Lee Harvey Oswald or not, and if they get it wrong, then they are killed/tortured/dust-specked into oblivion/whatever. (let's also assume that the question is clearly defined enough that the person can't play with definitions and just say that God is in everyone and God killed JFK)

However, let me steelman this a bit by somewhat moving the goalposts: if we allow a single random belief to have P=0, then it seems very unlikely that it will have a serious effect. I guess that the above scenario would require that we know that the person has P=0 about something (or have Omega exist), which, if we agree that such a belief will not have much empirical effect, is almost impossible to know. So that's also unlikely.

suppose that we (or Omega, since we're going to assume nigh omniscience) asked the person whether JFK was murdered by Lee Harvey Oswald or not, and if they get it wrong, then they are killed/tortured/dust-specked into oblivion/whatever.

Okay, but what is the utility function Omega is trying to optimize?

Let's say you walk up to Omega, tell it "was JFK murdered by Lee Harvey Oswald or not? And by the way, if you get this wrong, I am going to kill you/torture you/dust-speck you."

Unless we've figured out how to build safe oracles, with very high probability, Omega is not a safe oracle. Via https://arbital.com/p/instrumental_convergence/, even though Omega may or may not care if it gets tortured/dust-specked, we can assume it doesn't want to get killed. So what is it going to do?

Do you think it's going to tell you what it thinks is the true answer? Or do you think it's going to tell you the answer that will minimize the risk of it getting killed?

That wasn't really my point, but I see what you mean. The point was that it is possible to have a situation where the 0 prior does have specific consequences, not that it's likely, but you're right that my example was a bit off, since obviously the person getting interrogated should just lie about it.

I would venture that a zero prior is often (though not always) applied (in practice, though not in theory) to theories that defy the known laws of a given age. Basically, some people will go to their graves before updating their priors about some theory or another, including notable scientists. It seems reasonable to model such instances as a case where someone had a zero prior, which then leads such people to struggle with perceived impossibilities.

Now, I'd like to point out that scientists and philosophers have in the past been placed in a position where they need to "believe the impossible", for instance when new evidence accumulates that defies a strongly held belief about the known physics of the age. In this sense, the topic of your post is perhaps more relevant than might appear on the surface. That is to say, "believing the impossible" is an occupational hazard for many people who work with immutable laws (e.g. physics), and I believe this topic is certainly worth more attention.

For instance, replace the JFK scenario with some real world examples, and we see the issue is not hypothetical. Top-of-mind examples include the disbelief in the atom at the end of the 19th century (e.g. Max Planck), or spacetime at the start of the 20th (e.g. Henri Bergson). Their stories are less sexy than a JFK conspiracy, but unlike conspiracy crackpots, their persistent disbelief in a theory was highly influential during their time.

Assuming I haven't lost you with all this philosophy of science, let me bring it back home to AI safety. How many AI researchers have put a zero prior (in practice, if not in theory) on Goertzel's so-called "Benevolent AI"? How many have put a zero prior on Yudkowsky's so-called "Friendly AI"? How many have put a zero prior on either one being provably safe? How many have put a zero prior on Ethics being solvable in the first place?

I don't doubt that many people likely began with non-zero priors on all these issues. But in practice, time has a way of ossifying beliefs, until eventually there is no distinguishing between a zero prior and an unshakable dogma. So I wonder whether "believing impossible things" might turn out to be an occupational hazard here as well. And it's in this regard that I read your post with interest, since if your conclusion is correct, then in practice (if not in theory) it might not matter all that much. Indeed, Einstein did get a Nobel despite Bergson's protests, and atomic physics did become mainstream despite Planck's skepticism. We may never know what beliefs they actually went to their graves with, but in theory, it doesn't matter.

[This comment is no longer endorsed by its author]

That's slightly different - society reaching the right conclusion, despite some members of it being irredeemably wrong.

A closer analogy would be a believer in psychics or the supernatural who has lots of excuses ready to explain away experiments - their expectations have changed even if they haven't revised their beliefs.


Is there any problem that might occur from an agent failing to do enough investigation? (Possibly ever, possibly just before taking some action that ends up being important)

It's when it's done a moderate amount of investigation that the error is highest. Disbelieving JFK's assassination makes little difference to most people. If you investigate a little, you start believing in ultra-efficient government conspiracies. If you investigate a lot, you start believing in general miracles. If you do a massive investigation, you start believing in one specific miracle.

Basically there's a problem when JFK's assassination is relevant to your prediction, but you don't have many other relevant samples.

If you do a massive investigation, you start believing in one specific miracle.

It will never question its own sanity?

Technically, no - an expected utility maximiser doesn't even have a self model. But in practice it might behave in ways that really look like it's questioning its own sanity; I'm not entirely sure.

Technically, no - an expected utility maximiser doesn't even have a self model.

Why not? Is there something that prevents it from having a self model?

You're right, it could, and that's not even the issue here. The issue is that it only has one tool to change beliefs - Bayesian updating - and that tool has no impact with a prior of zero.

The issue is that it only has one tool to change beliefs - Bayesian updating

That idea has issues. Where is the agent getting its priors? Does it have the ability to acquire new priors, or can it only chain forward from pre-existing priors? And if so, is there an ur-prior, the root of the whole prior hierarchy?

How will it deal with an Outside Context Problem?

Does it have the ability to acquire new priors [...]?

It might, but that would be a different design. Not that that's a bad thing, necessarily, but that's not what is normally meant by priors.

Priors are a local term. Often enough a prior used to be a posterior during the previous iteration.

But if the probability ever goes to zero, it stays there.

Some people say that zero is not a probability :-)

But yes, if you have completely ruled out Z as impossible, you will not consider it any more and it will be discarded forever.

Unless the agent can backtrack and undo the inference chain to fix its mistakes (which is how humans operate and which would be a highly useful feature for a fallible Bayesian agent, in particular one which cannot guarantee that the list of priors it is considering is complete).