In other words: There's a large amount of algorithmic information - many independent, reflectively consistent degrees of freedom - in the correct answer, the plans we want the AI to come up with, but we've only given the AI relatively simple concepts that can't identify those plans.
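A minimal numerical sketch of that gap (with made-up plans and scores, not any real agent design): if the intended criterion depends on many independent features of a plan but the concept we actually specify checks only a few of them, the plans selected by the specified concept can look fine to it while scoring poorly on the intended criterion.

```python
import random

random.seed(0)

# Toy setup: the intended criterion cares about many independent features of a
# plan ("degrees of freedom"), but the concept we hand the optimizer only
# checks the first few. All names and numbers here are made up for illustration.
N_FEATURES = 20
SPECIFIED = range(3)  # the few features the simple concept actually looks at

def intended_score(plan):
    return sum(plan)                        # wants every feature handled well

def simple_concept_score(plan):
    return sum(plan[i] for i in SPECIFIED)  # only sees three of them

plans = [[random.random() for _ in range(N_FEATURES)] for _ in range(10_000)]
chosen = max(plans, key=simple_concept_score)

print("intended score of the plan the simple concept picks:",
      round(intended_score(chosen), 2))
print("intended score of the best plan actually available: ",
      round(max(map(intended_score, plans)), 2))
```

The specified features of the chosen plan come out nearly perfect, but the other seventeen are left at chance, so the simple concept systematically fails to pick out the plans we actually wanted.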
A proposed foreseeable difficulty of aligning advanced agents is furthermore proposed to be "patch-resistant" if the speaker thinks that most simple or naive solutions ("patches") will fail to resolve the difficulty and will just regenerate it somewhere else.
The notion of a problem being "patch-resistant" is relative to the current state of the art. If a speaker asserts a problem is "patch-resistant" and is using that term as defined here, it means the speaker thinks both that simple patches would fail to resolve the problem given the current state of the art, and that a more fundamental solution is possible in principle but not yet known how to build.
Thus the Nearest unblocked strategy problem may be called "patch-resistant" by someone who thinks that (1) it is possible to design an advanced agent with a full coverage goal system, (2) a full coverage goal system would negate most Nearest unblocked strategy problems, but (3) we don't yet know how to build a full coverage goal system.
To call a problem "patch-resistant" is not to assert that it is unsolvable, but it does mean the speaker is cautioning against naive or simple solutions.
On most occasions so far, alleged cases of patch-resistance are said to stem from one of two central sources: the desired behavior has too much algorithmic complexity to be captured by any simple concept we hand the AI, or the agent's optimization pressure seeks out the Nearest unblocked strategy, so that ruling out one undesired plan merely redirects the search to a similar plan just outside the prohibition.
Example: Suppose you want your AI to have a shutdown button. If the AI is steering toward some goal, being shut down is a poor outcome under that goal, so a patch that blocks whichever shutdown-avoiding behavior you noticed first (say, disabling the button) tends to just redirect the AI to the nearest unblocked strategy, such as persuading the operators not to press the button or building a successor with no button at all.
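A toy sketch of that dynamic (hypothetical plans and utilities, not a claim about any particular architecture): the "agent" below just picks its highest-utility plan, and each "patch" blacklists whichever bad plan the designers most recently noticed.

```python
# Toy illustration of the nearest-unblocked-strategy dynamic. The plan names
# and utilities are invented for the example; nothing here is a proposed design.
PLANS = {
    "disable the shutdown button":               10.0,
    "persuade operators never to press it":       9.9,
    "build a successor agent with no button":     9.8,
    "allow shutdown when the button is pressed":  1.0,  # the behavior we wanted
}

def best_plan(blocked):
    """Pick the highest-utility plan that no patch has ruled out."""
    allowed = {plan: u for plan, u in PLANS.items() if plan not in blocked}
    return max(allowed, key=allowed.get)

blocked = set()
for _ in range(len(PLANS)):
    plan = best_plan(blocked)
    print("agent chooses:", plan)
    if plan == "allow shutdown when the button is pressed":
        break
    blocked.add(plan)  # designers patch away the latest bad plan they noticed

# Each patch removes one undesired plan, but the optimization pressure just
# moves to the next-best, nearly identical plan; the patches only "win" here
# because this toy list of plans is tiny and fully known in advance.
```

In the toy version the designers eventually run out of bad plans to block; the worry about patch-resistance is that a real advanced agent searching a rich plan space does not.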
See, for example, the history of nonmonotonic logic intended to represent 'explaining away' evidence, before the invention of Bayesian networks, as recounted in Pearl's Probabilistic Reasoning in Intelligent Systems.
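For concreteness, here is a standard sprinkler/rain illustration of 'explaining away', with made-up probabilities: observing an alternative cause of the evidence lowers the probability of the other cause, the kind of non-monotonic belief update that is awkward to capture by patching monotonic inference rules.

```python
# 'Explaining away' by brute-force enumeration over a three-node network:
# rain and sprinkler are independent causes of wet grass. All numbers are
# illustrative, not taken from the cited book.
P_RAIN, P_SPRINKLER = 0.2, 0.3

def p_wet(rain, sprinkler):
    # Probability the grass is wet given the state of its two causes.
    if rain and sprinkler:
        return 0.99
    if rain or sprinkler:
        return 0.9
    return 0.01

def p_rain_given(sprinkler=None):
    """P(rain | grass is wet [, sprinkler state]) by summing joint probabilities."""
    num = den = 0.0
    for rain in (True, False):
        for spr in (True, False):
            if sprinkler is not None and spr != sprinkler:
                continue
            p = ((P_RAIN if rain else 1 - P_RAIN)
                 * (P_SPRINKLER if spr else 1 - P_SPRINKLER)
                 * p_wet(rain, spr))
            den += p
            if rain:
                num += p
    return num / den

print(round(p_rain_given(), 3))                # rain is fairly likely given wet grass
print(round(p_rain_given(sprinkler=True), 3))  # less likely once the sprinkler explains it
```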
To the extent that we can see the central project of AI alignment as revolving around finding a set of alignment ideas that do have simple central tendencies and are specifiable or learnable - that is, finding domains with correlated coverage which together add up to a safe but powerful AI that can do something pivotal - we could equally see that project as finding a collectively good-enough set of safety-things we can do without endless patching.