Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a belated follow-up to my Dualist Predict-O-Matic post, where I share some thoughts re: what could go wrong with the dualist Predict-O-Matic.

# Belief in Superpredictors Could Lead to Self-Fulfilling Prophecies

In my previous post, I described a Predict-O-Matic which mostly models the world at a fuzzy resolution, and only "zooms in" to model some part of the world in greater resolution if it thinks knowing the details of that part of the world will improve its prediction. I considered two cases: the case where the Predict-O-Matic sees fit to model itself in high resolution, and the case where it doesn't, and just makes use of a fuzzier "outside view" model of itself.

What sort of outside view models of itself might it use? One possible model is: "I'm not sure how this thing works, but its predictions always seem to come true!"

If the Predict-O-Matic sometimes does forecasting in non-temporal order, it might first figure out what it thinks will happen, then use that to figure out what it thinks its internal fuzzy model of the Predict-O-Matic will predict.

And if it sometimes revisits aspects of its forecast to make them consistent with other aspects of its forecast, it might say: "Hey, if the Predict-O-Matic forecasts X, that will cause X to no longer happen". So it figures out what would actually happen if X gets forecasted. Call that X'. Suppose X != X'. Then the new forecast has the Predict-O-Matic predicting X and then X' happens. That can't be right, because outside view says the Predict-O-Matic's predictions always come true. So we'll have the Predict-O-Matic predicting X' in the forecast instead. But wait, if the Predict-O-Matic predicts X', then X'' will happen. Etc., etc. until a fixed point is found.
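A minimal sketch of that revision loop, assuming a toy world in which the outcome is a simple function of whatever gets forecasted. The function `world_response` and its linear form are hypothetical and chosen only so the loop settles down; nothing in the post specifies how the world reacts to a forecast.

```python
from typing import Tuple


def world_response(forecast: float) -> float:
    """Hypothetical toy world: the outcome X' that occurs once `forecast` X is published."""
    return 0.5 * forecast + 1.0


def revise_until_consistent(
    initial: float, tol: float = 1e-9, max_steps: int = 1000
) -> Tuple[float, int]:
    """Replace the forecast with its own consequence until the two agree."""
    forecast = initial
    for step in range(max_steps):
        outcome = world_response(forecast)
        if abs(outcome - forecast) < tol:
            # Forecasting X now leads to (approximately) X: a fixed point.
            return forecast, step
        # The outside view says predictions come true, so forecast X' instead.
        forecast = outcome
    raise RuntimeError("no fixed point found within max_steps")


fixed_point, steps = revise_until_consistent(10.0)
```

In this toy world the only self-consistent forecast is 2.0, and the loop lands there regardless of the initial guess. A less well-behaved `world_response` could cycle or diverge instead, which is why the sketch caps the number of steps.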

Some commenters on my previous post talked about how making the Predict-O-Matic self-unaware could be helpful. Note that self-unawareness doesn't actually help with this failure mode if the Predict-O-Matic knows about (or forecasts the development of) anything which can be modeled using the outside view "I'm not sure how this thing works, but its predictions always seem to come true!" So the problem here is not self-awareness. It's belief in superpredictors, combined with a particular forecasting algorithm: we're updating our beliefs in a cyclic fashion, or hill-climbing our story of how the future will go until the story seems plausible, or something like that.

Before proposing a solution, it's often valuable to deepen your understanding of the problem.

# Glitchy Predictor Simulation Could Step Towards Fixed Points

Let's go back to the case where the Predict-O-Matic sees fit to model itself in high resolution and we get an infinite recursion. Exactly what's going to happen in that case?

I actually think the answer isn't quite obvious, because although the Predict-O-Matic has limited computational resources, its internal model of itself also has limited computational resources. And its internal model's internal model of itself has limited computational resources too. Etc.

Suppose Predict-O-Matic is implemented in a really naive way where it just crashes if it runs out of computational resources. If the toplevel Predict-O-Matic has accurate beliefs about its available compute, then we might see the toplevel Predict-O-Matic crash before any of the simulated Predict-O-Matics crash. Simulating something which has the same amount of compute you do can easily use up all your compute!

But suppose the Predict-O-Matic underestimates the amount of compute it has. Maybe there's some evidence in the environment which misleads it to think that it has less compute than it actually does. So it simulates a restricted-compute version of itself reasonably well. Maybe that restricted-compute version of itself is misled in the same way, and simulates a double-restricted-compute version of itself.

Maybe this all happens in a way so that the first Predict-O-Matic in the hierarchy to crash is near the bottom, not the top. What then?

Deep in the hierarchy, the Predict-O-Matic simulating the crashed Predict-O-Matic makes predictions about what happens in the world after the crash.

Then the Predict-O-Matic simulating that Predict-O-Matic makes a prediction about what happens in a world where the Predict-O-Matic predicts whatever would happen after a crashed Predict-O-Matic.

Then the Predict-O-Matic simulating that Predict-O-Matic makes a prediction about what happens in a world where the Predict-O-Matic predicts [what happens in a world where the Predict-O-Matic predicts whatever would happen after a crashed Predict-O-Matic].

Then the Predict-O-Matic simulating that Predict-O-Matic makes a prediction about what happens in a world where the Predict-O-Matic predicts [what happens in a world where the Predict-O-Matic predicts [what happens in a world where the Predict-O-Matic predicts whatever would happen after a crashed Predict-O-Matic]].

Predicting world gets us world', predicting world' gets us world'', predicting world'' gets us world'''... Every layer in the hierarchy takes us one step closer to a fixed point.
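Here is a toy sketch of that hierarchy, under several assumptions that are not in the post: each simulated copy believes it has half the compute of its simulator, a copy crashes once its believed compute drops below `MIN_COMPUTE`, the post-crash world is summarized by the single number `CRASH_WORLD`, and the world reacts to a forecast via the hypothetical `world_given_forecast`.

```python
MIN_COMPUTE = 1      # hypothetical threshold below which a simulated copy crashes
CRASH_WORLD = 10.0   # hypothetical summary of the world after a crashed predictor


def world_given_forecast(forecast: float) -> float:
    """Hypothetical toy world: what happens once `forecast` is announced."""
    return 0.5 * forecast + 1.0


def simulate(believed_compute: int) -> float:
    """Forecast made by a predictor that believes it has `believed_compute` units."""
    inner_compute = believed_compute // 2  # the copy it simulates is more restricted
    if inner_compute < MIN_COMPUTE:
        # The simulated copy crashes, so forecast the world after a crash.
        return CRASH_WORLD
    # Otherwise forecast the world in which the inner copy's forecast is announced.
    return world_given_forecast(simulate(inner_compute))


forecasts_by_compute = {c: simulate(c) for c in (1, 2, 4, 8, 1024)}
```

With these assumptions the forecasts come out as 10.0, 6.0, 4.0, 3.0 for believed compute 1, 2, 4, 8, and about 2.008 for 1024: each extra layer above the crash moves the top-level answer one step closer to the fixed point at 2.0.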

Note that just like the previous section, this failure mode doesn't depend on self-awareness. It just depends on believing in something which believes it self-simulates.

# Repeated Use Could Step Towards Fixed Points

Another way the Predict-O-Matic can step towards fixed points is through simple repeated use. Suppose each time after making a prediction, the Predict-O-Matic gets updated data about how the world is going. In particular, the Predict-O-Matic knows the most recent prediction it made and can forecast how humans will respond to that. Then when the humans ask it for a new prediction, it incorporates the fact of its previous prediction into its forecast and generates a new prediction. You can imagine a scenario where the operators keep asking the Predict-O-Matic the same question over and over again, getting a different answer every time, trying to figure out what's going wrong -- until finally the Predict-O-Matic begins to consistently give a particular answer -- a fixed point it has inadvertently discovered.

As Abram alluded to in one of his comments, the Predict-O-Matic might even foresee this entire process happening, and immediately forecast the fixed point corresponding to the end state. Though, if the forecast is detailed enough, we'll get to see this entire process happening within the forecast, which could allow us to avoid an unwanted outcome.

This one doesn't seem to depend on self-awareness either. Consider two Predict-O-Matics with no self-knowledge whatsoever (not even the dualist kind I discussed in my previous post). If they're getting informed about the predictions the other is making, they could inadvertently work together to step towards fixed points.
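To make the two-predictor case concrete, here is a rough sketch in which neither predictor models itself; each just takes the other's most recently published forecast as an input. The linear update rules in `predictor_a` and `predictor_b` are made up for illustration.

```python
from typing import List, Tuple


def predictor_a(latest_b: float) -> float:
    """Hypothetical: A's forecast, given the last forecast B published."""
    return 0.5 * latest_b + 1.0


def predictor_b(latest_a: float) -> float:
    """Hypothetical: B's forecast, given the last forecast A published."""
    return 0.5 * latest_a + 2.0


def exchange(rounds: int, a: float = 0.0, b: float = 0.0) -> List[Tuple[float, float]]:
    """Let both predictors republish, each reacting to the other's previous forecast."""
    history = [(a, b)]
    for _ in range(rounds):
        a, b = predictor_a(b), predictor_b(a)  # simultaneous update
        history.append((a, b))
    return history


history = exchange(100)
final_a, final_b = history[-1]
```

Under these made-up rules the pair settles at A = 8/3 and B = 10/3, the joint fixed point of the two update rules, even though neither predictor ever reasons about its own output.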

# Solutions

An idea which could address some of these issues: Ask the Predict-O-Matic to make predictions conditional on us ignoring its predictions and not taking any action. Perhaps we'd also want to specify that any existing or future superpredictors will also be ignored in this hypothetical.

Then if we actually want to do something about the problems the Predict-O-Matic foresees, we can ask it to predict how the world will go conditional on us taking some particular action.
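One way to picture the interface this suggests: the forecast becomes a function of a stipulated operator action rather than of the forecast itself, so there is no loop to close. The flood scenario, the `Action` values, and the damage formula below are all invented for illustration.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical operator actions the forecast can be conditioned on."""
    IGNORE_FORECAST = "ignore the forecast and take no action"
    BUILD_LEVEE = "build a levee"


def predict_flood_damage(rainfall_mm: float, action: Action) -> float:
    """Toy conditional forecast: damage given a fixed, stipulated operator action."""
    damage = max(0.0, rainfall_mm - 100.0) * 2.0
    if action is Action.BUILD_LEVEE:
        damage *= 0.25  # made-up mitigation factor
    return damage


# First ask what happens if the prediction is ignored entirely...
baseline = predict_flood_damage(150.0, Action.IGNORE_FORECAST)
# ...then ask separately about a specific intervention.
with_levee = predict_flood_damage(150.0, Action.BUILD_LEVEE)
```

Because the action is an input chosen by the asker, the answer does not depend on how anyone reacts to the answer. Whether a real system would respect the stipulation is a separate question that this sketch does not address.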

Choosing better inference algorithms could also be helpful.

# Prize

Sorry I was slower than planned on writing this follow-up and choosing a winner. I've decided to give Bunthut a $110 prize (including $10 interest for my slow follow-up). Thanks everyone for your insights.

# Comments

Planned summary:

Could we prevent a superintelligent oracle from making self-fulfilling prophecies by preventing it from modeling itself? This post presents three scenarios in which self-fulfilling prophecies would still occur. For example, if instead of modeling itself, it models the fact that there's some AI system whose predictions frequently come true, it may try to predict what that AI system would say, and then say that. This would lead to self-fulfilling prophecies.

This is good stuff!

> ...if the Predict-O-Matic knows about (or forecasts the development of) anything which can be modeled using the outside view "I'm not sure how this thing works, but its predictions always seem to come true!"

Can you walk through the argument here in more detail? I'm not sure I follow it; sorry if I'm being stupid.

I'll start: There are two identical systems, "Predict-O-Matic A" and "Predict-O-Matic B", sitting side-by-side on a table. For simplicity let's say that A knows everything about B, B knows everything about A, but A is totally oblivious to the existence of A, and B to B. Then what? What's a question you might ask it that would be problematic? Thanks in advance!

> This is good stuff!

Thanks!

Here's another attempt at explaining.

1. Suppose Predict-O-Matic A has access to historical data which suggests Predict-O-Matic B tends to be extremely accurate, or otherwise has reason to believe Predict-O-Matic B is extremely accurate.

2. Suppose the way Predict-O-Matic A makes predictions is by some process analogous to writing a story about how things will go, evaluating the plausibility of the story, and doing simulated annealing or some other sort of stochastic hill-climbing on its story until the plausibility of its story is maximized.

3. Suppose that it's overwhelmingly plausible that at some time in the near future, Important Person is going to walk up to Predict-O-Matic B and ask Predict-O-Matic B for a forecast and make an important decision based on what Predict-O-Matic B says.

4. Because of point 3, stories which don't involve a forecast from Predict-O-Matic B will tend to get rejected during the hill-climbing process. And...

• Because of point 1, stories which involve an inaccurate forecast from Predict-O-Matic B will tend to get rejected during the hill-climbing process. We will tend to hill-climb our way into having Predict-O-Matic B's prediction change so it matches what actually happens in the rest of the story.

• Because the person in point 3 is important and Predict-O-Matic B's forecast influences their decision, a change to the part of the story regarding Predict-O-Matic B's prediction could easily mean the rest is no longer plausible and will benefit from revision.

• So now we've got a loop in the hill-climbing process where changes in Predict-O-Matic B's forecast lead to changes in what happens after Predict-O-Matic B's forecast, and changes in what happens after Predict-O-Matic B's forecast lead to changes in Predict-O-Matic B's forecast. It stops when we hit a fixed point.

Now that I've written this out, I'm realizing that I don't think this would happen for sure. I've argued both that changing the forecast to match what happens will improve plausibility, and that changing what happens so it's a plausible result of the forecast will improve plausibility. But if the only way to achieve one is by discarding the other, I guess both tweaks won't cause improvements to plausibility in general. However, the point remains that a fixed point will be among the most plausible stories available, so any good optimization method will tend to converge on it. (Maybe we're just doing simulated annealing, but with a temperature parameter high enough that it finds it easy to leap between these kinds of semi-plausible stories until it hits a fixed point by chance. Or maybe we're doing hill climbing based on local improvements in plausibility instead of considering the plausibility of the story taken as a whole.)
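A toy version of the story-editing loop, as a sketch only. It uses plain greedy hill-climbing rather than simulated annealing, represents a story as a pair of integers (B's forecast, what actually happens), and invents both the `response` function and the implausibility score.

```python
from itertools import product
from typing import Tuple

Story = Tuple[int, int]  # (Predict-O-Matic B's forecast, what actually happens)


def response(forecast: int) -> int:
    """Hypothetical: what Important Person's decision leads to, given B's forecast."""
    return 10 - forecast


def implausibility(story: Story) -> int:
    """Made-up score; lower means a more plausible story."""
    forecast, outcome = story
    inaccurate = abs(forecast - outcome)             # B is believed to be accurate
    unmotivated = abs(outcome - response(forecast))  # outcome should follow from forecast
    return inaccurate + unmotivated


def hill_climb(story: Story, joint_moves: bool = True) -> Story:
    """Greedily tweak the story until no neighbouring story is strictly more plausible."""
    steps = [d for d in product((-1, 0, 1), repeat=2) if d != (0, 0)]
    if not joint_moves:
        # Only allow tweaking the forecast or the outcome, never both at once.
        steps = [d for d in steps if 0 in d]
    while True:
        neighbours = [(story[0] + d_f, story[1] + d_out) for d_f, d_out in steps]
        best = min(neighbours, key=implausibility)
        if implausibility(best) >= implausibility(story):
            return story
        story = best


self_fulfilling = hill_climb((9, 2))
stalled = hill_climb((6, 4), joint_moves=False)
```

When the forecast and the outcome can be revised together, the climb ends at `(5, 5)`, the one story in which B's forecast is both accurate and the cause of the outcome. With one-part-at-a-time tweaks the search can stall at a story like `(6, 4)`, where fixing either inconsistency worsens the other, which matches the caveat above.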

I think the scenario is similar to your P(X) and P(Y) discussion in this post.

It just now occurred to me that you could get a similar effect given certain implementations of beam search. Suppose we're doing beam search with a beam width of 1 million. For the sake of simplicity, suppose that when Important Person walks up to Predict-O-Matic B and asks their question in A's sim, each of 1M beam states gets allocated to a different response that Predict-O-Matic B could give. Some of those states lead to "incoherent", low-probability stories where Predict-O-Matic B's forecast turns out to be false, and they get pruned. The only states left over are states where Predict-O-Matic B's prophecy ends up being correct -- cases where Predict-O-Matic B made a self-fulfilling prophecy.
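A scaled-down sketch of that pruning effect, with a beam width of 11 standing in for the 1M. The `response` function, the 0.99 prior on B being right, and the "side detail" branching step are assumptions made for illustration, and the scores are unnormalized.

```python
import math
from typing import List, Tuple

State = Tuple[float, int, int, int]  # (log-score, B's forecast, outcome, side detail)

BEAM_WIDTH = 11      # stand-in for the 1M beam states in the text
P_B_CORRECT = 0.99   # assumed prior that B's forecast comes true


def response(forecast: int) -> int:
    """Hypothetical: what ends up happening once B announces `forecast`."""
    return 10 - forecast


def expand_forecasts() -> List[State]:
    """Allocate one beam state to each answer B could give."""
    states = []
    for forecast in range(BEAM_WIDTH):
        outcome = response(forecast)
        p = P_B_CORRECT if outcome == forecast else 1.0 - P_B_CORRECT
        states.append((math.log(p), forecast, outcome, 0))
    return states


def expand_details(states: List[State], branches: int = BEAM_WIDTH) -> List[State]:
    """Branch every state on an irrelevant side detail, so pruning has to happen."""
    return [
        (log_p - math.log(branches), forecast, outcome, detail)
        for log_p, forecast, outcome, _ in states
        for detail in range(branches)
    ]


def prune(states: List[State], width: int = BEAM_WIDTH) -> List[State]:
    """Keep only the `width` highest-scoring states."""
    return sorted(states, key=lambda s: s[0], reverse=True)[:width]


survivors = prune(expand_details(expand_forecasts()))
```

After a single round of expansion and pruning, every surviving state descends from the one answer where B's forecast equals the outcome it causes (5 in this toy setup); the "incoherent" branches are all gone.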

Ah, OK, I buy that, thanks. What about the idea of building a system that doesn't model itself or its past predictions, and asking it questions that don't entail modeling any other superpredictors? (Like "what's the likeliest way for a person to find the cure for Alzheimers, if we hypothetically lived in a world with no superpredictors or AGIs?")

Could work.

> Etc., etc. until a fixed point is found.

"Minimize prediction error" could mean minimizing error across the set of predictions, instead of individually.

I think it's relatively straightforward to avoid that if you construct your system well.