Confucianism in AI Alignment

[-]Richard_Kennaway5yΩ680

You are proposing "make the right rules" as the solution. Surely this is like solving the problem of how to write correct software by saying "make correct software"? The same approach could be applied to the Confucian approach by saying "make the values right". The same argument made against the Confucian approach can be made against the Legalist approach: the rules are never the real thing that is wanted, people will vary in how assiduously they are willing to follow one or the other, or to hack the rules entirely for their own benefit, then selection effects lever open wider and wider the difference between the rules, what was wanted, and what actually happens.

It doesn't work for HGIs (Human General Intelligences). Why will it work for AGIs?

BTW, I'm not a scholar of Chinese history, but historically it seems to me that Confucianism flourished as state religion because it preached submission to the Legalist state. Daoism found favour by preaching resignation to one's lot. Do what you're told and keep your head down.

[-]johnswentworth5yΩ460

You are proposing "make the right rules" as the solution. Surely this is like solving the problem of how to write correct software by saying "make correct software"?

I strongly endorse this objection, and it's the main way in which I think the OP is unpolished. I do think there's obviously still a substantive argument here, but I didn't take the time to carefully separate it out. The substantive part is roughly "if the system accepts an inner optimizer with bad behavior, then it's going to accept non-optimizers with the same bad behavior. Therefore, we shouldn't think of the problem as being about the inner optimizers. Rather, the problem is that we accept bad behavior - i.e. bad behavior is able to score highly.".

It doesn't work for HGIs (Human General Intelligences). Why will it work for AGIs?

This opens up a whole different complicated question.

First, it's not clear that this analogy holds water at all. There are many kinds-of-things we can do to design AGI environments/incentives which don't have any even vaguely similar analogues in human mechanism design - we can choose the entire "ancestral environment", we can spin up copies at-will, we can simulate in hindsight (so there's never a situation where we won't know after-the-fact what the AI did), etc.

Second, in the cases where humans use bad incentive mechanisms, it's usually not because we can't design better mechanisms but because the people who choose the mechanism don't want a "better" one; voting mechanisms and the US government budget process are good examples.

All that said, I do still apply this analogy sometimes, and I think there's an extent to which it's right - namely, trying to align black-box AIs with opaque goals via clever mechanism design, without building a full theory of alignment and human values, will probably fail.

But I think a full theory of alignment and human values is likely tractable, which would obviously change the game entirely. It would still be true that "the rules are never the real thing that is wanted", but a full theory would at least let the rules improve in lock-step with capabilities - i.e. more predictive world models would directly lead to better estimates of human values. And I think the analogy would still hold: a full theory of alignment and human values should directly suggest new mechanism design techniques for human institutions.

[-]romeostevensit5yΩ580

That's the Legalist interpretation of Confucianism. Confucianism argues that the Legalists are just moving the problem one level up the stack a la public choice theory. The point of the Confucian is that the stack has to ground out somewhere, and asks the question of how to roll our virtue intuitions into the problem space explicitly since otherwise we are rolling them in tacitly and doing some hand waving.

[-]johnswentworth5yΩ120

Thanks, I was hoping someone more knowledgeable than I would leave a comment along these lines.

[-]Vaniver5yΩ460

Even if BigCo senior management were virtuous and benevolent, and their workers were loyal and did not game the rules, the poor rules would still cause problems.

If BigCo senior management were virtuous and benevolent, would they have poor rules?

That is to say, when I put my Confucian hat on, the whole system of selecting managers based on a proxy measure that's gameable feels too Legalist. [The actual answer to my question is "getting rid of poor rules would be a low priority, because the poor rules wouldn't impede righteous conduct, but they still would try to get rid of them."]

Like, if I had to point at the difference between the two, the difference is where the put the locus of value. The Confucian ruler is primarily focused on making the state good, and surrounding himself with people who are primarily focused on making the state good. The Legalist ruler is primarily focused on surviving and thriving, and so tries to set up systems that cause people who are primarily focused on surviving and thriving to do the right thing. The Confucian imagines that you can have a large shared value; the Legalist imagines that you will necessarily have many disconnected and contradictory values.

The difference between hiring for regular companies and EA orgs seems relevant. Often, applicants for regular companies want the job, and standard practice is to attempt to trick the company into hiring them, regardless of qualification. Often, applicants for EA orgs only want the job if and only if they're the right person for the job; if I'm trying to prevent asteroids from hitting the Earth (or w/e) and someone else could do a better job of it than I could, I very much want to get out of their way and have them do it instead of me. As you mention in the post, this just means you get rid of the part of interviews where gaming is intentional, and significant difficulty remains. [Like, people will be honest about their weaknesses and try to be honest about their strengths, but accurately measuring those and fit with the existing team remains quite difficult.]

Now, where they're trying to put the locus of value doesn't mean their policy prescriptions are helpful. As I understand the Confucian focus on virtue in the leader, the main value is that it's really hard to have subordinates who are motivated by the common good if you yourself are selfish (both because they won't have your example and because the people who are motivated by the common good will find it difficult to be motivated by working for you).

But I find myself feeling some despair at the prospect of a purely Legalist approach to AI Alignment, because it feels like it is fighting against the AI at every step, instead of being able to recruit it to do some of the work for you, and without that last bit I'm not sure how you get extrapolation instead of interpolation. Like, you can trust the Confucian to do the right thing in novel territory, insofar as you gave them the right underlying principles, and the Confucian is operating at a philosophical level where you can give them concepts like corrigibility (where they not only want to accept correction from you, but also want to preserve their ability to accept correction for you, and preserve their preservation of that ability, and so on) and the map-territory distinction (where they want their sensors to be honest, because in order to have lots of strawberries they need their strawberry-counter to be accurate instead of inaccurate). In Legalism, the hope is that the overseer can stay a step ahead of their subordinate; in Confucianism, the hope is that everyone can be their own overseer.

[Of course, defense in depth is useful; it's good to both have trust in the philosophical competence of the system and have lots of unit tests and restrictions in case you or it are confused.]

[-]johnswentworth5yΩ120

To be clear, I am definitely not arguing for a pure mechanism-design approach to all of AI alignment. The argument in the OP is relevant to inner optimizers because we can't just directly choose which goals to program into them. We can directly choose which goals to program into an outer optimizer, and I definitely think that's the right way to go.

[-]Donald Hobson5yΩ460

If an inner optimizer could exploit some distribution shift between the training and deployment environments, then performance-in-training is a bad proxy for performance-in-deployment.

Suppose you are making a self driving car. The training environment is a videogame like environment. The rendering is pretty good. A human looking at the footage would not easily be able to say it was obviously fake. An expert going over the footage in detail could spot subtle artefacts. The diffuse translucency on leaves in the background isn't quite right. When another car drives through a puddle, all the water drops are perfectly spherical, and travel on parabolic paths. Falling snow doesn't experience aerodynamic turbulence. Etc.

The point is that the behaviour you want is avoiding other cars and lamp posts. The simulation is close enough to reality that it is easy to match virtual lamp posts to real ones. However the training and testing environments have a different distribution.

Making the simulated environment absolutely pixel perfect would be very hard, and doesn't seem like it should be necessary.

However, given even a slight variation between training and the real world, there exists an agent that will behave well in training, but cause problems in the real world. And also an agent that behaves fine in training and the real world. The set of possible behaviours is vast. You can't consider all of them. You can't even store a single arbitrary behaviour. Because you cant train on all possible situations, there will be behaviours that behave the same on all the training situations, but behave differently in other situations. You need some part of your design that favours some policies over others without training data. For example, you might want a policy that can be described as parameters in a particular neural net. You have to look at how this effects off distribution actions.

The analogous situation with managers would be that the person being tested knows they are being tested. If you get them to display benevolent leadership, then you can't distinguish benevolent leaders from sociopaths who can act nice to pass the test.

[-]Gurkenglas5yΩ460

Take an outer-aligned system, then add a 0 to each training input and a 1 to each deployment input. Wouldn't this add only malicious hypotheses that can be removed by inspection without any adverse selection effects?

[-]johnswentworth5yΩ340

After thinking about it for a couple minutes, this question is both more interesting and less trivial than it seemed. The answer is not obvious to me.

On the face of it, passing in a bit which is always constant in training should do basically nothing - the system has no reason to use a constant bit. But if the system becomes reflective (i.e. an inner optimizer shows up and figures out that it's in a training environment), then that bit could be used. In principle, this wouldn't necessarily be malicious - the bit could be used even by aligned inner optimizers, as data about the world just like any other data about the world. That doesn't seem likely with anything like current architectures, but maybe in some weird architecture which systematically produced aligned inner optimizers.

[-]Gurkenglas5yΩ120

The hypotheses after the modification are supposed to have knowledge that they're in training, for example because they have enough compute to find themselves in the multiverse. Among hypotheses with equal behavior in training, we select the simpler one. We want this to be the one that disregards that knowledge. If the hypothesis has form "Return whatever maximizes property _ of the multiverse", the simpler one uses that knowledge. It is this form of hypothesis which I suggest to remove by inspection.

[-]johnswentworth5yΩ120

Ok, that should work assuming something analogous to Paul's hypothesis about minimal circuits being daemon-free.

[-]Gurkenglas5y20

As far as I understand, whether minimal circuits are daemon-free is precisely the question whether direct descriptions of the input distribution are simpler than hypotheses of form "Return whatever maximizes property _ of the multiverse".

[-]Rohin Shah5yΩ560

Planned summary for the Alignment Newsletter (note it's written quite differently from the post, and so I may have introduced errors, so please check more carefully than usual):

Suppose we trained our agent to behave well on some set of training tasks. <@Mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@) suggests that we may still have a problem: the agent might perform poorly during deployment, because it ends up optimizing for some misaligned _mesa objective_ that only agrees with the base objective on the training distribution.
This post points out that this is not the only way systems can fail catastrophically during deployment: if the incentives were not designed appropriately, they may still select for agents that have learned heuristics that are not in our best interests, but nonetheless lead to acceptable performance during training. This can be true even if the agents are not explicitly “trying” to take advantage of the bad incentives, and thus can apply to agents that are not mesa optimizers.

[-]johnswentworth5yΩ340

I feel like the second paragraph doesn't quite capture the main idea, especially the first sentence. It's not just that mesa optimizers aren't the only way that a system with good training performance can fail in deployment - that much is trivial. It's that if the incentives reward misaligned mesa-optimizers, then they very likely also reward inner agents with essentially the same behavior as the misaligned mesa-optimizers but which are not "trying" to game the bad incentives.

The interesting takeaway is that the possibility of deceptive inner optimizers implies nearly-identical failures which don't involve any inner optimizers. It's not just "systems without inner optimizers can still be misaligned", it's "if we just get rid of the misaligned inner optimizers in a system which would otherwise have them, then that system can probably still stumble on parameters which result in essentially the same bad behavior". Thus the idea that the "real problem" is the incentives, not the inner optimizers.

[-]Rohin Shah5yΩ340

Changed second paragraph to:

This post suggests that in any training setup in which mesa optimizers would normally be incentivized, it is not sufficient to just prevent mesa optimization from happening. The fact that mesa optimizers could have arisen means that the incentives were bad. If you somehow removed mesa optimizers from the search space, there would still be a selection pressure for agents that without any malicious intent end up using heuristics that exploit the bad incentives. As a result, we should focus on fixing the incentives, rather than on excluding mesa optimizers from the search space.

How does that sound?

[-]johnswentworth5yΩ340

Ok, that works.

[-]Rohin Shah5yΩ560

Despite the fact that I commented on your previous post suggesting a different decomposition into "outer" and "inner" alignment, I strongly agree with the content of this post. I would just use different words to say it.

[-]abramdemski5yΩ440

I think one issue which this post sort of dances around, and which maybe a lot of discussion of inner optimizers leaves implicit or unaddressed, is the difference between having a loss function which you can directly evaluate vs one which you must estimate via some sort of sample.

The argument in this post about how inner optimizers misbehaving is necessarily behavioral, and therefore best addressed by behavioral loss functions, misses the point that these misbehaviors are on examples we don't check. As such, it comes off as:

Perhaps arguing that we should check every example, or check much more thoroughly.
Perhaps arguing that the examples should be made more representative.

Now, I personally think that "distributional shift" is a misleading framing, because in learning in general (EG Solomonoff induction) we don't have an IID distribution (unlike in EG classification tasks), so we don't have a "distribution" to "shift".

But to the extent that we can talk in this framing, I'm kinda like... what are you saying here? Are you really proposing that we should just check instances more thoroughly or something like that?

[-]adamShimi5yΩ230

I have a weird relation with this post. On the one hand, I don't think the definition of outer alignment you're using is the right one (as I mentioned in comments on your previous post); on the other hand, I do agree with one of your main points, that we should look for a behavioral property instead of an internal structure property.

[-]johnswentworth5yΩ120

Perfect. I was on the fence about posting this one, but decided that it did a better job than the other of expressing the substantive argument in a way that would be obvious despite definitional disagreements.

(Though I actually don't think we should look for a behavioral property rather than a structural property; I think this whole thing is a bad way of framing the problem, and we shouldn't be doing open-ended searches in policy space at all. But if we are going to do an open-ended search in policy space at all, then yeah, behavioral over structural.)

[-]abramdemski5yΩ440

Why is this? As I argued in learning normativity, I think there are some problems which we can more easily point out structurally. For example, Paul's proposal of relaxed adversarial training is one possible method (look for "pseudo-inputs" which lead to bad behavior, such as activations of some internal nodes which seem like plausible activation patterns, even if you don't know how to hit them with data).

The argument in the post seems to be "you can't incentivize virtue without incentivizing it behaviorally", but this seems untrue.

[-]Measure5y30

"The candidates should be virtuous and not abuse the rules"

Simply put, the problem with this is that it does not describe a strategy BigCo can use to select good candidates. Bad candidates being selected is a problem for BigCo (and for the counterfactual good candidates), and a solution to this problem should consist of a recommendation for their actions.

[-]johnswentworth5y20

This is actually my biggest complaint about Confucianism, and I think it's a mental mistake people make much more generally: they talk about how things "should" be, but completely forget that talking about "should" has to ground out in actions in order to be useful.

[-]Vladimir_Nesov5y20

they talk about how things "should" be, but completely forget that talking about "should" has to ground out in actions in order to be useful

An idea doesn't have to be useful in order to be a thing to talk about. So when people talk about an apparently useless idea, it doesn't follow that that they forgot that it's not useful.

[-]johnswentworth5y20

It does not necessarily follow, but I do think that's usually what happens in practice. Arguing about what's "good" or what "should" be scratches our political itches well, so it ends up feeling important to argue about these things, even when it grounds out in nothing.

[-]Vladimir_Nesov5y20

This plays the same role as basic research, ideas that can be developed but haven't found even an inkling of their potential practical applications. An error would be thinking that they are going to be immediately useful, but that shouldn't be a strong argument against developing them, and there should be no certainty that their very indirect use won't end up crucial at some point in the distant future.

[-]johnswentworth5y20

I agree with this in principle, and it certainly applies in some cases. But most of the time, people do not argue about what "should" happen in hopes that it will someday lead to concrete action through not-yet-clear mechanisms. People argue about what "should" happen in order to signal tribal allegiances, or sound virtuous.

[-]Vladimir_Nesov5y20

Scientists doing basic research also mostly aren't motivated by the hope that it will someday lead to practical applications. When there is confusion or uncertainty about a salient phenomenon that can be clarified with further research, that is enough. Incidentally, it is virtuous and signals tribal allegiance to that field of research. Some of the researchers are going to be motivated by that.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

34

Confucianism in AI Alignment

34

Ω 20

34

Ω 20

Analogy to AI Alignment