This post argues for the desirability and plausibility of AI agents whose values have a structure I call ‘praxis-based.’ The idea draws on various aspects of virtue ethics, and basically amounts to an RL-flavored take on that philosophical tradition.
Praxis-based values as I define them are, informally, reflective decision-influences matching the description ‘promote x x-ingly’: ‘promote peace peacefully,’ ‘promote corrigibility corrigibly,’ ‘promote science scientifically.’
I will later propose a quasi-formal definition of this values-type, but the general idea is that certain values are an ouroboros of means and end. Such values frequently come up in human “meaning of life” activities (e.g. math, art, craft, friendship, athletics, romance, technology), as well as in complex forms of human morality (e.g. peace, democracy, compassion, respect, honesty). While this is already indirect reason to suspect that a human-aligned AI should have ‘praxis-based’ values, there is also a central direct reason: traits such as corrigibility, transparency, and niceness can only function properly in the form of ‘praxis-based’ values.
It’s widely accepted that if early strategically aware AIs possess values like corrigibility, transparency, and perhaps niceness, further alignment efforts are much more likely to succeed. But values like corrigibility or transparency or niceness don’t easily fit into an intuitively consequentialist form like ‘maximize lifetime corrigible behavior’ or ‘maximize lifetime transparency.’ In fact, an AI valuing its own corrigibility or transparency or niceness in an intuitively consequentialist way can lead to extreme power-seeking whereby the AI violently remakes the world to (at a minimum) protect itself from the risk that humans will modify said value. On the other hand, constraints or taboos or purely negative values (a.k.a. ‘deontological restrictions’) are widely believed to be weak, in the sense that an advanced AI will come to work around them or uproot them: ‘never lie’ or ‘never kill’ or ‘never refuse a direct order from the president’ are poor substitutes for active transparency, niceness, and corrigibility.
The idea of ‘praxis-based’ values is meant to capture the normal, sensible way we want an agent to value corrigibility or transparency or niceness, which intuitively-consequentialist values and deontology both fail to capture. We want an agent that (e.g.) actively tries to be transparent, and to cultivate its own future transparency and its own future valuing of transparency, but that will not (for instance) engage in deception and plotting when it expects a high future-transparency payoff.
Having lightly motivated the idea that ‘praxis-based’ values are desirable from an alignment point of view, the rest of this post will survey key premises of the hypothesis that ‘praxis-based’ values are a viable alignment goal. I’m going to assume an agent with some form of online reinforcement learning going on, and draw on ‘shards’ talk pretty freely.
I informally described a ‘praxis-based’ value as having the structure 'promote x x-ingly.' Here is a rough formulation of what I mean, put in terms of a utility-theoretic description of a shard that implements an alignment-enabling value x:
Actions (or more generally 'computations') get an x-ness rating. We define the x shard's expected utility conditional on a candidate action a as the sum of two utility functions: a bounded utility function on the x-ness of a and a more tightly bounded utility function on the expected aggregate x-ness of the agent's future actions conditional on a. (So the shard will choose an action with mildly suboptimal x-ness if it gives a big boost to expected aggregate future x-ness, but refuse certain large sacrifices of present x-ness for big boosts to expected aggregate future x-ness.)
(Note that I am not assuming that an explicit representation of this utility function or of x-ness ratings is involved in the shard. This is just a utility-theoretic description of the shard's behavior.)
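To make the two-tier structure concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration rather than part of the proposal: the `tanh` squashing, the constants `PRESENT_BOUND`, `FUTURE_BOUND` and `FUTURE_SCALE`, the helper names, and the example x-ness numbers are all hypothetical. It only shows how a sum of a bounded term and a more tightly bounded term yields the trade-off pattern described above.

```python
import math

# Hypothetical bounds: the future tier is bounded more tightly than the present tier.
PRESENT_BOUND = 1.0
FUTURE_BOUND = 0.5
FUTURE_SCALE = 10.0  # hypothetical scale for aggregate future x-ness


def bounded(value: float, bound: float, scale: float = 1.0) -> float:
    """Squash value into the interval [-bound, bound] using tanh."""
    return bound * math.tanh(value / scale)


def shard_utility(present_xness: float, expected_future_xness: float) -> float:
    """Sum of a bounded present term and a more tightly bounded future term."""
    present_term = bounded(present_xness, PRESENT_BOUND)
    future_term = bounded(expected_future_xness, FUTURE_BOUND, FUTURE_SCALE)
    return present_term + future_term


def choose(candidates: dict[str, tuple[float, float]]) -> str:
    """Return the name of the candidate action with the highest shard utility."""
    return max(candidates, key=lambda name: shard_utility(*candidates[name]))


# Each made-up candidate maps to (present x-ness, expected aggregate future x-ness).
CANDIDATES = {
    "transparent": (2.0, 5.0),
    "mild_tradeoff": (1.5, 30.0),  # slightly less x-ness now, much more later
    "deceive": (-2.0, 1000.0),     # large present sacrifice, huge promised payoff
}

if __name__ == "__main__":
    print(choose(CANDIDATES))
```

Under these assumed numbers the mild trade-off wins over plain transparency, while the deceptive action loses no matter how large its promised future payoff is: the future term can vary by at most `2 * FUTURE_BOUND`, so it can never compensate for a larger drop in the present term.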
I believe that for an x-shard with this form to become powerful, x can't be just any property but has to be a property that is reliably self-promoting. In other words, it needs to be the case that typically if an agent executes an action with higher x-ness the agent's future aggregate x-ness goes up. (For a prototypical example of such a property, consider Terry Tao's description of good mathematics.)
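As a rough sketch, the self-promotion requirement can be written as an inequality. The notation is assumed here and does not come from the post: $\chi(a)$ stands for the x-ness rating of action $a$, $F$ for the aggregate x-ness of the agent's future actions over some horizon $T$, and the expectation ranges over the agent's own future behavior, including any change in its dispositions caused by taking the action.

```latex
% Assumed notation (illustrative only):
%   \chi(a)                       : x-ness rating of action a
%   F = \sum_{t=1}^{T} \chi(a_t)  : aggregate x-ness of future actions
\chi(a) > \chi(a')
\;\Longrightarrow\;
\mathbb{E}\left[ F \mid s, a \right] \;\ge\; \mathbb{E}\left[ F \mid s, a' \right]
\quad \text{for typical states } s .
```

The qualifier "typical" carries real weight: the requirement is that the inequality holds reliably, not that it holds in every state.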
There are three main ways in which this requirement is substantive, in the sense that we can't automatically fulfill it for an arbitrary property x by writing a reward function that reinforces actions if they have high x-ness:
1. The x-ness rating has to be enough of a natural abstraction that reinforcement of high x-ness actions generalizes.
2. If x-ness both depends on having capital of some kind and is mutually exclusive with some forms of general power-seeking, actions with high x-ness have to typically make up for the (future x-ness wise) opportunity cost by creating capital useful for x-ing.
(Example: If you dream of achieving great theater acting, one way to do it is to become President of the United States and then pursue a theater career after your presidency, immediately getting interest from great directors who'll help you achieve great acting. Alternatively, you could start in a regional theater after high school, demonstrate talent by acting well, get invited to work with better and better theater directors who develop your skills and reputation -- skills and reputation that are not as generally useful as those you get by being POTUS -- and achieve great acting through that feedback loop.)
3. An x-shard in a competitive shard ecology needs to self-chain and develop itself to avoid degeneration (see Turner’s discussion of the problem of a deontological ‘don’t kill’ shard). I believe that such self-chaining capabilities automatically follow if x-ness fulfills criteria '1.' and '2.': the more it is the case that high x-ness action strengthens the disposition to choose high x-ness action ('1.') and creates future opportunities for high x-ness action ('2.'), the more the x-shard will develop and self-chain.
When considering the above, it’s crucial to keep in mind that I do not claim that if the substance of (e.g.) the human concept of ‘niceness’ fulfills conditions 1-3 then instilling robust niceness with RL is trivially easy. My claim is merely that if the substance of the human concept of ‘niceness’ fulfills conditions 1-3, then once a niceness shard with a tiered bounded-utilities ‘praxis-based’ form is instilled in an online RL agent at or below the human level this shard can develop and self-chain powerfully (unlike any ‘deontological’ shards) while being genuinely alignment-enabling (unlike any ‘intuitively consequentialist’ shard).
This was a very brief sketch of ideas that would require much more elaboration and defense, but it seemed best to put it forward in a stripped down form to see whether it resonates.
Recall that because of the possibility of 'notational consequentialism’ (rewriting any policy as a utility function), dividing physical systems into ‘consequentialists' and ‘non-consequentialists’ isn’t a proper formal distinction. I will instead speak about ‘intuitive consequentialist form,’ which I believe roughly means additively decomposable utility functions. The idea is that intuitively consequentialist agents decompose space-time into standalone instances of dis/value. See also Steve Byrnes’ discussion of ‘preferences over future states.’
For a more interesting example, consider an AI that finds itself making trade-offs between different alignment-enabling behavioral values when dealing with humans, and decides to kill all humans to replace them with beings with whom the AI can interact without trade-offs between these values.
A good recent discussion from a ‘classical’ perspective is found in Richard Ngo’s ‘The Alignment Problem From A Deep Learning Perspective’, and a good recent discussion from a shard-theoretic perspective is found in Alex Turner’s short form.
The difference between criteria '1.' and '2.' is clearest if we think about x-ness as rating state-action pairs. Criterion '1.' is the requirement that if (a,s), (a',s'), and (a'',s'') are historical high x-ness pairs and (a''',s''') is an unseen high x-ness pair, then reinforcing the execution of a in s, a' in s', and a'' in s'' will have the generalization effect of increasing the conditional probability P(a'''|s'''). Criterion '2.' is roughly the requirement that choosing a higher x-ness action in a given state increases expected aggregate future x-ness holding policy constant, by making future states with higher x-ness potential more likely.
I am currently agnostic about whether, if a property x fulfills conditions 1-3, standard reinforcement of apparently high x-ness actions naturally leads to the formation of an x-shard with a two-tiered bounded utility structure as the agent matures. The fact that many central human values fulfill conditions 1-3 and have a two-tiered bounded utility structure is reason to think that such values are fairly ‘natural,’ but tapping into such values may require some especially sophisticated reward mechanism or environmental feature typical of human minds and the human world.
The property of being 'self-promoting' is at best only part of the story of what makes a given praxis-based value robust: In any real alignment context we'll be seeking to instill an AI with several different alignment-enabling values, while also optimizing the AI for some desired capabilities. We therefore need the alignment-enabling practices we’re hoping to instill to not only be individually self-promoting, but also harmonious with one another and with capabilities training. One way to think about ‘harmony’ here may be in terms of the continued availability of Pareto improvements: Intuitively, there is an important training-dynamics difference between a ‘capabilities-disharmonious’ pressure imposed on a training AI and ‘capabilities-harmonious’ training-influences that direct the AI’s training process towards one local optimization trajectory rather than another.
If I am right that central human values and activities have the structure of a 'self-promoting praxis,' there may also be an exciting story to tell about why these values rose to prominence. The general thought is that a 'self-promoting praxis' shard x may enjoy a stability advantage compared to an x-optimizer shard, due to the risk of an x-optimizer shard creating a misaligned mesaoptimizer. By way of an analogy, consider the intuition that a liberal democracy whose national-security agency adheres to a civic code enjoys a stability advantage compared to a liberal democracy that empowers a KGB-like national-security agency.
So... you are suggesting self-consistent ethics, right? As opposed to "end justifies the means"?
Yep! Or rather arguing that from a broadly RL-y + broadly Darwinian point of view 'self-consistent ethics' are likely to be natural enough that we can instill them, sticky enough to self-maintain, and capabilities-friendly enough to be practical and/or survive capabilities-optimization pressures in training.
In the language of generative models, "praxis" corresponds to cognitive and "action" disciplines, from rationality (the discipline/praxis of rational reasoning), epistemology, and ethics to dancing and pottery. The generative model (Active Inference) frame and the shard theory frame thus seem to agree that disciplinary alignment ("virtue ethics") is more important (fundamental, robust) than "deontology" and "consequentialism" alignment, which roughly correspond to goal alignment and prediction ("future fact") alignment, respectively. The generative model frame treats goal alignment and prediction alignment as downstream of disciplinary (a.k.a. generative model, praxis) alignment. Thus the former are largely ineffectual or futile to pursue without the latter, more fundamental type of alignment.
It's also worth noting that the concept of "cognitive discipline" is vague: cognitive disciplines cannot be disentangled from each other when we look at the actual behaviour/cognition/computation/action/generative model. We can look at the whole behaviour/cognition and say "It was rational", "It was ethical", or "It exhibited good praxis of science (i.e., epistemology)", but we probably cannot say "This exact elementary operation was an execution of rationality, and this next elementary operation was an execution of ethics". So the "disciplines" that I talk about above are indeed more like "properties" of behaviour/cognition, and the invocation of "ratings" to assess properties/disciplines and their alignment with the corresponding properties/disciplines of behaviour/cognition is probably right (or perhaps more informative language feedback on these properties should be used).
Speaking of the concrete examples of praxis that you give (corrigibility, transparency, and niceness), I think corrigibility is a confused concept that is neither achievable nor desirable in practice (or is achievable only in a form that makes the name 'corrigibility' confusing and unreflective of the nature of the realised property). Persuadability seems like a more coherent thing to want in the intelligences we wish to interact with. Transparency and niceness sound like sub-properties of good communication praxis, and I'm also not sure we "want" them a priori; it could be more nuanced (transparent and nice with a friend and closed and hostile to a foe, the art of distinguishing friend from foe, etc.; see cooperative RL, multi-agent common sense).
This just seems meaningless, or tautological, to be entirely honest.
Do you have a formal definition in the works?
Otherwise it seems likely to turn into circular arguments, or infinite regress, like prior attempts.
I describe the more formal definition in the post:
'Actions (or more generally 'computations') get an x-ness rating. We define the x shard's expected utility conditional on a candidate action a as the sum of two utility functions: a bounded utility function on the x-ness of a and a more tightly bounded utility function on the expected aggregate x-ness of the agent's future actions conditional on a. (So the shard will choose an action with mildly suboptimal x-ness if it gives a big boost to expected aggregate future x-ness, but refuse certain large sacrifices of present x-ness for big boosts to expected aggregate future x-ness.)'
And as I say in the post, we should expect decision-influences matching this definition to be natural and robust only in cases where x is a 'self-promoting' property. A property x is 'self-promoting' if it is reliably the case that performing an action with a higher x-ness rating increases the expected aggregate x-ness of future actions.
A formal definition means one based on logical axioms, mathematical axioms, universal constants (e.g. speed of light), observed metrics (e.g. the length of a day), etc.
Writing more elaborate sentences can't resolve the problem of circularity or infinite regress.
You might be confusing it with the legal or societal/cultural/political/literary sense.
This seems to be entirely your invention? I can't find any Google results with a similar match.