Ad-hoc hack (alignment theory)

Edited by Eliezer Yudkowsky, last updated 18th May 2016

An "ad-hoc hack" is when you modify or the algorithm of the AI with regards to something that would ordinarily have simple, principled, or nailed-down structure, or where it seems like that part ought to have some simple answer instead. E.g., instead of defining a von Neumann-Morgenstern coherent utility function, you try to solve some problem by introducing something that's almost a VNM utility function but has a special case in line 3 which activates only on Tuesday. This seems unusually likely to break other things, e.g. , or anything else that depends on the coherence or simplicity of utility functions. Such hacks should be avoided in designs whenever possible, for analogous reasons to why they would be avoided in or . It may be interesting and productive anyway to look for a weird hack that seems to produce the desired behavior, because then you understand at least one system that produces the behavior you want - even if it would be unwise to actually build an AGI like that, the weird hack might give us the inspiration to find a simpler or more coherent system later. But then we should also be very suspicious of the hack, and look for ways that it fails or produces weird side effects.

An example of a productive weird hack was Benya Fallenstein's Parametric Polymorphism proposal for tiling agents. You wouldn't want to build a real AGI like that, but it was helpful for showing what could be done - which properties could definitely be obtained together within a tiling agent, even if by a weird route. This in turn helped suggest relatively less hacky proposals later.

Parents:
AI safety mindset
advanced-agent