Self-Fulfilling Prophecies Aren't Always About Self-Awareness

by John_Maxwell 3 min read18th Nov 20197 comments

15

Ω 5


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a belated follow-up to my Dualist Predict-O-Matic post, where I share some thoughts re: what could go wrong with the dualist Predict-O-Matic.

Belief in Superpredictors Could Lead to Self-Fulfilling Prophecies

In my previous post, I described a Predict-O-Matic which mostly models the world at a fuzzy resolution, and only "zooms in" to model some part of the world in greater resolution if it thinks knowing the details of that part of the world will improve its prediction. I considered two cases: the case where the Predict-O-Matic sees fit to model itself in high resolution, and the case where it doesn't, and just makes use of a fuzzier "outside view" model of itself.

What sort of outside view models of itself might it use? One possible model is: "I'm not sure how this thing works, but its predictions always seem to come true!"

If the Predict-O-Matic sometimes does forecasting in non-temporal order, it might first figure out what it thinks will happen, then use that to figure out what it thinks its internal fuzzy model of the Predict-O-Matic will predict.

And if it sometimes revisits aspects of its forecast to make them consistent with other aspects of its forecast, it might say: "Hey, if the Predict-O-Matic forecasts X, that will cause X to no longer happen". So it figures out what would actually happen if X gets forecasted. Call that X'. Suppose X != X'. Then the new forecast has the Predict-O-Matic predicting X and then X' happens. That can't be right, because outside view says the Predict-O-Matic's predictions always come true. So we'll have the Predict-O-Matic predicting X' in the forecast instead. But wait, if the Predict-O-Matic predicts X', then X'' will happen. Etc., etc. until a fixed point is found.

Some commenters on my previous post talked about how making the Predict-O-Matic self-unaware could be helpful. Note that self-awareness doesn't actually help with this failure mode, if the Predict-O-Matic knows about (or forecasts the development of) anything which can be modeled using the outside view "I'm not sure how this thing works, but its predictions always seem to come true!" So the problem here is not self-awareness. It's belief in superpredictors, combined with a particular forecasting algorithm: we're updating our beliefs in a cyclic fashion, or hill-climbing our story of how the future will go until the story seems plausible, or something like that.

Before proposing a solution, it's often valuable to deepen your understanding of the problem.

Glitchy Predictor Simulation Could Step Towards Fixed Points

Let's go back to the case where the Predict-O-Matic sees fit to model itself in high resolution and we get an infinite recurse. Exactly what's going to happen in that case?

I actually think the answer isn't quite obvious, because although the Predict-O-Matic has limited computational resources, its internal model of itself also has limited computational resources. And its internal model's internal model of itself has limited computational resources too. Etc.

Suppose Predict-O-Matic is implemented in a really naive way where it just crashes if it runs out of computational resources. If the toplevel Predict-O-Matic has accurate beliefs about its available compute, then we might see the toplevel Predict-O-Matic crash before any of the simulated Predict-O-Matics crash. Simulating something which has the same amount of compute you do can easily use up all your compute!

But suppose the Predict-O-Matic underestimates the amount of compute it has. Maybe there's some evidence in the environment which misleads it to think that it has less compute than it actually does. So it simulates a restricted-compute version of itself reasonably well. Maybe that restricted-compute version of itself is mislead in the same way, and simulates a double-restricted-compute version of itself.

Maybe this all happens in a way so that the first Predict-O-Matic in the hierarchy to crash is near the bottom, not the top. What then?

Deep in the hierarchy, the Predict-O-Matic simulating the crashed Predict-O-Matic makes predictions about what happens in the world after the crash.

Then the Predict-O-Matic simulating that Predict-O-Matic makes a prediction about what happens in a world where the Predict-O-Matic predicts whatever would happen after a crashed Predict-O-Matic.

Then the Predict-O-Matic simulating that Predict-O-Matic makes a prediction about what happens in a world where the Predict-O-Matic predicts [what happens in a world where the Predict-O-Matic predicts whatever would happen after a crashed Predict-O-Matic].

Then the Predict-O-Matic simulating that Predict-O-Matic makes a prediction about what happens in a world where the Predict-O-Matic predicts [what happens in a world where the Predict-O-Matic predicts [what happens in a world where the Predict-O-Matic predicts whatever would happen after a crashed Predict-O-Matic]].

Predicting world gets us world', predicting world' gets us world'', predicting world'' gets us world'''... Every layer in the hierarchy takes us one step closer to a fixed point.

Note that just like the previous section, this failure mode doesn't depend on self-awareness. It just depends on believing in something which believes it self-simulates.

Repeated Use Could Step Towards Fixed Points

Another way the Predict-O-Matic can step towards fixed points is through simple repeated use. Suppose each time after making a prediction, the Predict-O-Matic gets updated data about how the world is going. In particular, the Predict-O-Matic knows the most recent prediction it made and can forecast how humans will respond to that. Then when the humans ask it for a new prediction, it incorporates the fact of its previous prediction into its forecast and generates a new prediction. You can imagine a scenario where the operators keep asking the Predict-O-Matic the same question over and over again, getting a different answer every time, trying to figure out what's going wrong -- until finally the Predict-O-Matic begins to consistently give a particular answer -- a fixed point it has inadvertently discovered.

As Abram alluded to in one of his comments, the Predict-O-Matic might even forsee this entire process happening, and immediately forecast the fixed point corresponding to the end state. Though, if the forecast is detailed enough, we'll get to see this entire process happening within the forecast, which could allow us to avoid an unwanted outcome.

This one doesn't seem to depend on self-awareness either. Consider two Predict-O-Matics with no self-knowledge whatsoever (not even the dualist kind I discussed in my previous post). If they're getting informed about the predictions the other is making, they could inadvertently work together to step towards fixed points.

Solutions

An idea which could address some of these issues: Ask the Predict-O-Matic to make predictions conditional on us ignoring its predictions and not taking any action. Perhaps we'd also want to specify that any existing or future superpredictors will also be ignored in this hypothetical.

Then if we actually want to do something about the problems the Predict-O-Matic forsees, we can ask it to predict how the world will go conditional on us taking some particular action.

Choosing better inference algorithms could also be helpful.

Prize

Sorry I was slower than planned on writing this follow-up and choosing a winner. I've decided to give Bunthut a $110 prize (including $10 interest for my slow follow-up). Thanks everyone for your insights.

15

Ω 5