We often laugh at human-specific "bugs" in reasoning, comparing them to the gold standard of a utility-maximizing, perfectly Bayesian reasoner.
We often fear that a very capable AI following strict rules of optimization would reach some repugnant conclusions, and we struggle to find "features" we could add to guard against that.
What if some of the "bugs" we are looking at are actually the "features" we are looking for?
- We seem to distinguish "sacred" and "non-sacred" values and refuse to mix the two in calculations (for example, human life vs. money). What if this "tainted bit", this "NaN propagation", is a security feature guarding against Goodharting our way into genocide or the dissolution of social trust? What if utility is not a single real number but a pair? What if the ordering is not even lexicographic, but partial? What if it's a much longer tuple? Which brings me to the next point:
- We often experience decision paralysis, apparently unable to compare two actions. What if this is simply because the order must be partial for security reasons? An alternative explanation of this phenomenon is that we implicitly treat "wait for more data to arrive and/or for the situation to change in a tie-breaking way" as an action available to us. Is that bad?
- We often decide which of two end-states A vs. B we prefer based on the path leading to them, amusingly favoring A in some scenarios and B in others. What if this is because we implicitly assume that the end-state contains our brain, with its memory of the path leading there? Isn't treating the agent as part of its environment a cool feature? Or what if this is because we implicitly factor in considerations like "what if other members of society followed this kind of path, or this decision-making algorithm?" Isn't it a cool feature to think about second-order effects and "acausal trades", and to treat one's own software as perhaps shared with other agents?
- At least some long-term stable cultures have norms requiring children to follow adults' advice even when it conflicts with their own judgment, and, more importantly, said children apparently follow along instead of revolting and doing what seems good to them. Isn't that corrigibility a feature we want from AIs we plan to rear? Shouldn't there be safeguards against child-knows-better-than-parent in any self-modifying system spawning new generations of itself?
- The whole sunk cost fallacy/heuristic. Isn't it actually a good thing to associate a cost with each deviation from the original plan? Do we really want to zig-zag between ever more shiny objects, with no meta-level realization that something is wrong with the whole algorithm if it can't keep its trajectory predictable to itself? Yes, sunk cost is more than that: it's not just a fixed additional cost per decision, it's more like guilt for not caring about your past self's investment in something. But again, isn't that a good thing from a security perspective?
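To make the first bullet concrete, here is a minimal Python sketch of utility as a two-component value with a partial, NaN-like order. The `sacred`/`mundane` split and all names are my own illustration, not a claim about how actual cognition works:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Utility:
    """Utility as a (sacred, mundane) pair instead of a single real number."""
    sacred: float   # e.g. lives
    mundane: float  # e.g. money

def prefer(a: Utility, b: Utility):
    """Partial order: an option wins only if it is at least as good on BOTH
    components; otherwise the pair is incomparable and we return None,
    'tainting' the comparison the way NaN taints arithmetic."""
    if a.sacred >= b.sacred and a.mundane >= b.mundane:
        return a
    if b.sacred >= a.sacred and b.mundane >= a.mundane:
        return b
    return None  # no amount of mundane gain silently buys off a sacred loss
```

Under this order, `prefer(Utility(0, 10), Utility(0, 5))` picks the first option, but `prefer(Utility(-1, 10**9), Utility(0, 0))` returns `None`: the huge mundane payoff cannot make the comparison go through.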
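The decision-paralysis bullet can be sketched the same way: a chooser that treats "wait" as a first-class action whenever the partial order refuses to single out a winner. The two-axis scoring rule here is a made-up example, not anything canonical:

```python
def choose(options, strictly_better):
    """Pick the unique maximal option under a partial order; when several
    maximal, mutually incomparable options remain, return the explicit
    fallback action 'wait' instead of forcing an arbitrary choice."""
    maximal = [a for a in options
               if not any(strictly_better(b, a) for b in options)]
    return maximal[0] if len(maximal) == 1 else "wait"

# Hypothetical scoring: each option is a (dimension_1, dimension_2) pair,
# and one option strictly beats another only by dominating on both axes.
def dominates(x, y):
    return x != y and x[0] >= y[0] and x[1] >= y[1]
```

With this, `choose([(3, 3), (1, 1)], dominates)` returns `(3, 3)`, while `choose([(3, 1), (1, 3)], dominates)` returns `"wait"`: neither option dominates, so the agent stalls rather than picking arbitrarily.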
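And the sunk-cost-as-heuristic idea: a toy agent that charges a fixed penalty for every deviation from its current plan, which is already enough to damp the zig-zagging. The numbers are purely illustrative:

```python
def count_switches(estimates, deviation_cost):
    """Follow plan 'A' or 'B' over a stream of (value_A, value_B) estimates,
    switching only when the other plan looks better by more than a fixed
    deviation penalty; return how many times the agent switched plans."""
    plan, switches = "A", 0
    for va, vb in estimates:
        mine, other = (va, vb) if plan == "A" else (vb, va)
        if other > mine + deviation_cost:
            plan = "B" if plan == "A" else "A"
            switches += 1
    return switches

# Noisy value estimates whose ranking flips by a hair's breadth every step:
noisy = [(1.0, 1.1), (1.1, 1.0)] * 5
```

With `deviation_cost=0.0` the agent zig-zags on every single step (10 switches over these 10 estimates); with `deviation_cost=0.5` it never abandons its plan (0 switches).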
I anticipate that each of these examples can be laughed at using some toy problem simple enough to calculate on a napkin. Sure, but we are talking about producing agents with partial information about a very fuzzy world, with lots of other agents embedded in it, some of them sharing goals or even parts of their source code. We will rarely meet spherical cows on our way, and overfitting to these toy examples is the very problem we want to solve. Do we really plan to solve all of that with a single simple elegant formula (AIXI-style), or was the plan always to throw some safety heuristics into the mix? If it's the latter, then perhaps we can take a hint from parents raising children, societies avoiding dissolution, and people avoiding mania. Thus, what I propose is to look at the list of fallacies and other "strange" phenomena from a different angle: could I use something like that as a security feature?