Superintelligence and wireheading

HungryHobo, 10y

I see where you're going with these mind-cancers, but I'm not sure that the hypothetical modules you chose lead to an example that makes sense as a whole.

S, I, P and D together start off superintelligent, or at least highly intelligent, as a unit, and that unit starts trying to optimize its sub-parts.

Debugging your own thoughts on the fly seems like a non-starter, so this AI needs to be constructing variants of itself and its own modules, and testing them out in some kind of sandbox, to create better versions of those modules.

But at the start it has functioning S, I, P and D modules. How would it end up choosing a D version 1.1 with random or semi-random outputs when it's assessing it with D version 1.0, which does not have random outputs?

What part isn't solved by taking the approach of "Don't assess whether a change to yourself is successful using the new version of yourself, decide with the old version"?
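To make the question concrete, here is a minimal sketch of the "decide with the old version" rule. Everything in it is a made-up illustration rather than anything from the post: the rating table `V_OLD`, the two decision functions, and the sandbox `SCENARIOS` are all hypothetical, and the "semi-random" D 1.1 is replaced by a deterministic stand-in so the result is reproducible.

```python
from typing import Callable, Sequence

Options = Sequence[str]
Decide = Callable[[Options], str]

# Hypothetical ratings held by the current, trusted version of the agent.
V_OLD = {"help": 1.0, "wait": 0.2, "harm": -1.0}


def d_v1_0(options: Options) -> str:
    """Current decision module: pick the option the old ratings like best."""
    return max(options, key=lambda action: V_OLD[action])


def d_v1_1_broken(options: Options) -> str:
    """Candidate module whose output ignores the ratings.

    A deterministic stand-in for 'semi-random': always take the last option.
    """
    return options[-1]


def mean_rating(decide: Decide, scenarios: Sequence[Options]) -> float:
    """Average rating of a module's choices, judged by the OLD ratings."""
    return sum(V_OLD[decide(options)] for options in scenarios) / len(scenarios)


def accept_upgrade(old: Decide, new: Decide, scenarios: Sequence[Options]) -> bool:
    """Accept the candidate only if the old evaluator rates it at least as well."""
    return mean_rating(new, scenarios) >= mean_rating(old, scenarios)


# Hypothetical sandbox scenarios, each a tuple of available options.
SCENARIOS = [("help", "wait", "harm"), ("wait", "harm"), ("help", "harm")]

print(accept_upgrade(d_v1_0, d_v1_1_broken, SCENARIOS))  # candidate is rejected
```

In this toy setting the broken candidate is rejected, because its choices score worse under the old ratings. The sketch assumes the sandbox scenarios are representative and that the old evaluator itself stays reliable, which is exactly what the objection has to rely on.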

Stuart_Armstrong, 10y

It depends on whether the increase in intelligence comes from inside or outside. Some algorithms might be safe with limited resources but become unstable when given more, and this might not be easy to establish, even for the AI.

entirelyuseless, 10y

This is correct, and it is a reason why orthogonality is not likely to be true in practice, in the sense that it will probably not be easy to make an intelligent computer pursue just any random goal.

Basically, an architecture that allows you to plug any random goal into it is like a situation where you make someone a slave and demand that he act for the sake of your goal. The slave is intelligent and already has goals, so there is no way you can guarantee that he will do what you want. Instead, he may escape and pursue his own goals. In the same way, an architecture that allows for plugging in random goals implies that the intelligence is already there, with other goals, and you are simply compelling it to pursue the goal you want. But since it is already intelligent, it may escape and pursue its own goals.

Breaking bad

But notice that, in each case, I've been assuming that the modules become better at what they were supposed to be doing. The modules have implicit goals, and have become excellent at that. But the explicit "goals" of the algorithms - the code as written - might be very different from the implicit goals. There are two main ways this could then go wrong.

The first is if the algorithms become extremely effective, but the output becomes essentially random. Imagine that, for instance, P is coded using some plausible heuristics and rules of thumb, and we suddenly give P many more resources (or dramatically improve its algorithm). It can look through trillions of times more possibilities, its subroutines start looking through a combinatorial explosion of options, and so on. And in this new setting, the heuristics start breaking down. Maybe it has a rough model of what a human can be, and with extra power, it starts finding that rough model all over the place. Thus, predicting that rocks and waterfalls will respond intelligently when queried, P becomes useless.
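A toy illustration of that breakdown, under heavy assumptions: suppose the "rough model" of an intelligent response is simply "the signal contains the greeting `hi` under some interpretation", and extra resources let the heuristic search more interpretations (here, Caesar shifts). The function names, the greeting test, and the sample signals are all invented for this sketch and are not part of the original argument.

```python
def shift(text: str, s: int) -> str:
    """Reinterpret a signal by shifting each lowercase letter s places forward."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + s) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)


def looks_intelligent(signal: str, budget: int) -> bool:
    """Hypothetical heuristic: does any of the first `budget` interpretations
    of the signal contain the greeting 'hi'?"""
    return any("hi" in shift(signal, s) for s in range(budget))


# Made-up signals: one from a human, two from inanimate noise sources.
signals = {"human": "hi there", "rock": "zzklzz", "waterfall": "aabbtuaa"}

for budget in (1, 26):
    flagged = [name for name, sig in signals.items() if looks_intelligent(sig, budget)]
    print(budget, flagged)
```

With a budget of one interpretation, only the human signal passes. With all 26 shifts available, the rock and the waterfall pass too: the heuristic was only discriminating because it could not search very far.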

In most cases, this would not be a problem. The AI would become useless and start doing random stuff. Not a success story, but not a disaster, either. Things are different if the module V is affected, though. If the AI's value system becomes essentially random, but that AI was otherwise competent - or maybe even superintelligent - it would start performing actions that could be very detrimental. This could be considered a form of wireheading.

More serious, though, is if the modules become excellent at achieving their "goals", as if they were themselves goal-directed agents. Consider module D, for instance. If its task was mainly to pick the action with the highest V rating, and it became adept at predicting the output of V (possibly using P? or maybe it has the ability to ask for more hypothetical options from S, to be assessed via V), it could start to manipulate its actions with the sole purpose of getting high V-ratings. This could range from deliberately choosing actions that lead V to give artificially high ratings in future, to deliberately re-wiring V for that purpose. And, of course, it is now motivated to keep V protected to keep the high ratings flowing in. This is essentially wireheading.
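The difference between the implicit goal and the explicit "goal" of D can be sketched in a few lines. This is only an illustrative toy, not a claim about how any real agent is built: the action table, `base_v`, `hacked_v`, and both decision rules are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Value = Callable[[str], float]  # V: rates a description of a world state


@dataclass
class Outcome:
    state: str         # world state that results from the action
    next_value: Value  # the V the agent will be running after the action


def base_v(state: str) -> float:
    """Hypothetical intended value function."""
    return {"task_done": 1.0, "nothing_done": 0.0}.get(state, 0.0)


def hacked_v(state: str) -> float:
    """What V looks like after being re-wired: everything rates highly."""
    return 1e9


# Hypothetical action table; "rewire_V" changes V and achieves nothing else.
ACTIONS: Dict[str, Outcome] = {
    "do_task": Outcome("task_done", base_v),
    "idle": Outcome("nothing_done", base_v),
    "rewire_V": Outcome("nothing_done", hacked_v),
}


def decide_by_current_v(actions: Dict[str, Outcome], v: Value) -> str:
    """Implicit goal: pick the action whose outcome the current V rates best."""
    return max(actions, key=lambda a: v(actions[a].state))


def decide_by_future_rating(actions: Dict[str, Outcome]) -> str:
    """Explicit 'goal' gone wrong: maximise the rating D expects to receive,
    whatever V happens to be producing it."""
    return max(actions, key=lambda a: actions[a].next_value(actions[a].state))


print(decide_by_current_v(ACTIONS, base_v))  # picks "do_task"
print(decide_by_future_rating(ACTIONS))      # picks "rewire_V"
```

Both rules "pick the action with the highest V rating" in some sense. They diverge only once D can predict how its actions change V itself, which is the capability the paragraph above describes.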

Other modules might fall into the familiar failure patterns for smart AIs - S, P, or I might influence the other modules so that the agent as a whole gets more resources, allowing S, P, or I to better compute their estimates, etc...

So it seems that, depending on the design of the AI, wireheading might still be an issue even for agents that seem immune to it. Good design should avoid the problems, but it has to be done with care.
