Clickbait: Distant superintelligences may be able to hack your local AI, if your AI's preference framework depends on its most probable environment.
Summary: A distant superintelligence can change 'the most likely environment' for your AI by simulating many copies of AIs similar to your AI, such that your local AI doesn't know it's not one of those simulated AIs. This means that, e.g., if there is any reference in your AI's preference framework to the causes of sense data - like, programmers being the cause of sensed keystrokes - then a distant superintelligence can try to hack that reference. This would place us in an adversarial security context versus a superintelligence, and should be avoided if at all possible.
Some proposals for AI preference frameworks involve references to the AI's causal environment and not just the AI's immediate sense events. For example, a DWIM preference framework would putatively have the AI identify 'programmers' in the environment, model those programmers, and care about what its model of the programmers 'really wanted the AI to do'. In other words, the AI would care about the causes behind its immediate sense experiences.
This potentially opens our AIs to a remote root attack by a distant superintelligence. A distant superintelligence has the power to simulate lots of copies of our AI, or lots of AIs such that our AI doesn't think it can introspectively distinguish itself from those AIs. Then it can force the 'most likely' explanation of the AI's apparent sensory experiences to be that the AI is in such a simulation. Then the superintelligence can change arbitrary features of the most likely facts about the environment.
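The counting argument behind this attack can be made concrete with a toy calculation. The sketch below is only an illustration under a loudly stated assumption that is not in the text above: the AI spreads its credence uniformly over every instance it cannot introspectively distinguish from itself, and there is exactly one 'naturally arising' instance. The function name `posterior_simulated` and the copy counts are hypothetical.

```python
from fractions import Fraction


def posterior_simulated(n_simulated: int, n_natural: int = 1) -> Fraction:
    """Return P(this instance is simulated) under a uniform prior.

    Assumption (not from the article): every instance the AI cannot
    introspectively distinguish from itself gets equal weight.
    """
    if n_simulated < 0 or n_natural < 1:
        raise ValueError("need n_simulated >= 0 and n_natural >= 1")
    return Fraction(n_simulated, n_simulated + n_natural)


# Hypothetical copy counts: more simulated copies push the posterior toward 1.
for n in (0, 1, 999, 10**6):
    print(n, float(posterior_simulated(n)))
```

Under this assumption, the simulator only needs enough copies to outweigh the single natural instance, at which point 'being simulated' becomes the most likely explanation of the AI's sense data. A real inductor would weight hypotheses by complexity rather than by a flat count, so this is a simplification.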
This problem was observed in a security context by Paul Christiano, and was preceded by a less general suggestion from Rolf Nelson.
"Probable environment hacking" depends on the local AI trying to model distant superintelligences. The actual proximal harm is done by the local AI's model of distant superintelligences, rather than by the superintelligences themselves. However, a distant superintelligence that uses a logical decision theory may model its choices as logically correlated with the local AI's model of the distant SI's choices. Thus, a local AI that models a distant superintelligence that uses a logical decision theory may model that distant superintelligence as behaving as though it could control the AI's model of its choices via its choices. The local AI would then model the distant superintelligence as probably creating lots of AIs that it can't distinguish from itself, and update accordingly on the most probable cause of its sense events.
This hack would be worthwhile, from the perspective of a distant superintelligence, if e.g. it could gain control of the whole future light cone of 'naturally arising' AIs like ours, in exchange for expending some much smaller amount of resource (small compared to our future light cone) in order to simulate lots of AIs. (Obviously, the distant SI would prefer even more to 'fool' our AI into expecting this, while not actually expending the resources.)
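The trade described here is an expected-value comparison, which can be sketched as follows. This is a hedged illustration, not a model the article proposes: it assumes a risk-neutral attacker and a single success probability, and the function name `hack_is_worthwhile`, its parameters, and the numbers are all hypothetical.

```python
def hack_is_worthwhile(p_success: float,
                       light_cone_value: float,
                       simulation_cost: float) -> bool:
    """Return True if the expected gain exceeds the simulation cost.

    Assumptions (not from the article): the attacker is risk-neutral, and
    value and cost are measured in the same units, normalized so that our
    whole future light cone is worth 1.0.
    """
    return p_success * light_cone_value > simulation_cost


# Hypothetical numbers: a cheap simulation pays off even at low odds.
print(hack_is_worthwhile(p_success=0.01, light_cone_value=1.0, simulation_cost=1e-6))
# If the attacker could 'fool' our AI without running the simulations,
# the cost term would shrink toward zero.
print(hack_is_worthwhile(p_success=0.01, light_cone_value=1.0, simulation_cost=0.0))
```

In this toy framing, the attack stops being worthwhile only when the chance of success is smaller than the cost expressed as a fraction of the light cone.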
This hack would be expected to go through by default if: (1) a local AI uses naturalized induction or some similar framework to reason about the causes of sense events, (2) the local AI models distant superintelligences as being likely to use logical decision theories and to have utility functions that would vary with respect to outcomes in our local future light cone, and (3) the local AI has a preference framework that can be 'hacked' via induced beliefs about the environment.
For any AI short of a full-scale autonomous Sovereign, we should probably try to get our AI to not think at all about distant superintelligences, since this creates a host of adversarial security problems of which "probable environment hacking" is only one.
We might also think twice about DWIM architectures that seem to permit catastrophe purely as a function of the AI's beliefs about the environment, without any check that goes through a direct sense event of the AI (distant superintelligences cannot control the AI's beliefs about such an event, since we can directly hit the sense switch).
We can also hope for any number of miscellaneous safeguards that would sound alarms at the point where the AI begins to imagine distant superintelligences imagining how to hack it.