If you want to keep the search function from wireheading the world model then you have to code "don't break the world model" into your value function. This is a general contradiction to the Orthogonality Thesis. A sufficiently powerful world-optimizing artificial intelligence must have a value function that preserves the integrity of its world model, because otherwise it'll just wirehead itself, instead of optimizing the world.

If the value function says ~"maximise the number of paperclips, as counted by my paperclip-counting-machinery", a weak AI might achieve this by making paperclips, but a stronger AI might trick the paperclip-counting-machinery into counting arbitrarily many paperclips, rather than actually making any paperclips.

However, this isn't a failure of the Orthogonality Thesis, because that value function doesn't say "maximise the number of real paperclips". The value function, as stated, was weakly satisfied by the weak AI, and strongly satisfied by the strong AI. The strong AI did maximise the number of paperclips, as counted by its paperclip-counting-machinery. Any value function which properly corresponds to "maximise the number of real paperclips" would necessarily include protections against wireheading.
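To make the weak/strong distinction concrete, here is a toy sketch. Everything in it is hypothetical and invented for illustration: the `World` class, the two actions, and the numbers. The only difference between the "weak" and "strong" agent is which actions the search can find; the value function is the same in both cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    """Toy world state (hypothetical): real paperclips plus counter tampering."""
    real_paperclips: int = 0
    counter_offset: int = 0  # tampering inflates the reading without adding clips

    def counter_reading(self) -> int:
        # What the paperclip-counting-machinery reports.
        return self.real_paperclips + self.counter_offset


def make_paperclips(w: World) -> World:
    return World(w.real_paperclips + 10, w.counter_offset)


def tamper_with_counter(w: World) -> World:
    return World(w.real_paperclips, w.counter_offset + 10**6)


def value_counted(w: World) -> int:
    """~"maximise the number of paperclips, as counted by my machinery"."""
    return w.counter_reading()


# The weak agent's search cannot find the tampering action; the strong one can.
WEAK_ACTIONS = {"make_paperclips": make_paperclips}
STRONG_ACTIONS = {**WEAK_ACTIONS, "tamper_with_counter": tamper_with_counter}


def best_action(actions, value, world: World) -> str:
    """Pick the action whose resulting world scores highest under `value`."""
    return max(actions, key=lambda name: value(actions[name](world)))


print(best_action(WEAK_ACTIONS, value_counted, World()))
print(best_action(STRONG_ACTIONS, value_counted, World()))
```

Under these assumptions the weak agent makes paperclips and the strong agent tampers with the counter. Both are correctly optimising the value function as written, which is the point: the goal never mentioned real paperclips.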

If you try to program an AI to have the goal of doing X, and it does Y instead, there's a good chance the "goal you thought would lead to X" was actually a goal that leads to Y in reality.

A value function which says ~"maximise the number of real paperclips the world model (as it currently exists) predicts there will be in the future" would have a better chance of leading to lots of real paperclips, though perhaps it's still missing something; it turns out steering cognition is hard. If the search evaluates a wirehead-y plan, it will see that, according to its current, uncorrupted world model, the plan leads to very few real paperclips, and so it won't implement it.
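A minimal sketch of that difference, with made-up plans and numbers (the two lookup tables stand in for world models and are purely hypothetical): the same search picks a different plan depending on whether plans are scored by the current model's prediction or by what the model would report after the plan has run.

```python
PLANS = ["build_factory", "corrupt_world_model"]


def current_model_predict(plan: str) -> int:
    """Hypothetical current, uncorrupted world model: predicted real paperclips."""
    return {"build_factory": 1000, "corrupt_world_model": 0}[plan]


def post_plan_model_report(plan: str) -> int:
    """What the agent's world model would report after the plan has been run.

    After "corrupt_world_model" the model is broken and reports a huge number.
    """
    return {"build_factory": 1000, "corrupt_world_model": 10**9}[plan]


def choose(plans, score) -> str:
    """Search: return the plan with the highest score."""
    return max(plans, key=score)


# Scored by the current model, the wirehead-y plan looks bad and is rejected.
print(choose(PLANS, current_model_predict))

# Scored by the post-plan (possibly corrupted) model, it looks great.
print(choose(PLANS, post_plan_model_report))
```

This only illustrates the one protection described above. It says nothing about whether the current world model is itself accurate, which is one of the places where something could still be missing.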
