This is a strange post to me.
On the one hand, it employs oversimplified and incorrect models of political discourse to present an inaccurate picture of what liberalism and conservatism stand for. It also strongly focuses on an analogy for AGI as humanity's children, an analogy that I think is inappropriate and obscures far more than it reveals.
On the other hand, it gets a lot of critical details exactly right, such as when it mentions how "proposals like Eliezer’s Coherent Extrapolated Volition, or Bostrom’s ideas about Deep Utopia assume or require value convergence, and see idealization as desirable and tractable."
But beyond these relatively minor matters... to put it in Zvi's words, the "conservative"[1] view of "keep[ing] social rules as they are" simply doesn't Feel the ASI.[2] There is no point in this post where the authors present a sliver of evidence for why it's possible to maintain the "barriers" and norms that exist in current societies when the fundamental phase change of the Singularity happens.
The default result of the Singularity is that existing norms and rules are thrown out the window. Not because people suddenly stop wanting to employ them[3], not because those communities rebel against the rules[4], but simply because those who do not adapt get economically outcompeted and lose resources to those who do. You adapt, or you die. It's the Laws of Economics, not the Laws of Man, that lead to this outcome.[5] Focusing exclusively on the latter, as the post does, on how we ought to relate to each other and what moral norms we should employ, blah blah blah, is missing the forest for a (single) tree. It's a distraction.
There are ways one can believe this outcome can be avoided, of course. If strong AGI never appears anytime soon, for example. If takeoff is very slow and carefully regulated to ensure society always reaches an equilibrium first before any new qualitative improvement in AI capabilities happens. If a singleton takes absolute control of the entire world and dictates by fiat that conservatism shall be allowed to flourish wherever people want it to, forcefully preventing anyone else from breaking barriers and obtaining ever-increasing resources by doing so.
I doubt the authors believe in the former two possibilities.[6] If they do, they should say so. And if they believe in the latter, well... building an eternally unbeatable norm-enforcing God on earth is probably not what "conservatives" have in mind when they say barriers should be maintained, Schelling fences respected, and genuine disagreement allowed to exist.
Again, this isn't what conservatism stands for. But I'll try to restrain myself from digressing into straight-up politics too much.
I'd even go a lot further and say it doesn't Feel... regular AI progress? Or even just regular economic progress in general. The arguments I give below apply, with lesser force of course, even if we have "business as usual" in the world. Because "business as usual," throughout all of human history and especially at an ever-increasing pace in the past 250 years, means fundamental changes in norms and barriers and human relations, despite conservatives standing athwart history, yelling stop.
Except this does happen, because the promise of prosperity and novelty is a siren's call too alluring to resist en masse.
Except this also happens, if only because of AIs fundamentally altering human cognition, as is already starting to happen and will by default be taken up to eleven sometime soon.
See The benevolence of the butcher for further discussion.
Which doesn't mean these possibilities are wrong, mind you.
I won't try to speak for my co-author, but yes, we agree that this doesn't try to capture the variety of views that exist, much less what your view of political discourse should mean by conservatism - this is a conservative vision, not the conservative vision. And given that, we find the analogy to be useful in motivating our thinking and illustrating an important point, despite the fact that all analogies are inexact.
That said, yes, I don't "feel the AGI" in the sense that if you presume that the singularity will happen in the typically imagined way, humanity as we know it doesn't make it. And with it goes any ability to preserve our current values. I certainly do "feel the AGI" in thinking that the default trajectory is pointed in that direction, and accelerating, and it's certainly not happening in a fashion that preserves any values whatsoever, conservative or otherwise.
But that's exactly the point - we don't think that the AGI which is being aimed for is a good thing, and we do think that the conversation about the possible futures humanity could be aiming for is (weirdly) narrowly constrained to be either a pretty bland techno-utopianism, or human extinction. We certainly don't think that it's necessary for there to be no AGI, and we don't think that eternal stasis is a viable conservative vision either, contrary to what you assume we meant. But as we said, this is the first post in a series about our thinking on the topic, not a specific plan, much less final word on how things should happen.
But as we said, this is the first post in a series about our thinking on the topic, not a specific plan, much less final word on how things should happen.
That's all fine and good if you plan on addressing these kinds of problems in future posts of your series/sequence and explaining how you think it's at all plausible for your vision to take hold. I look forward to seeing them.
I have a meta comment about this general pattern, however. It's something that's unfortunately quite recurrent on this site. Namely that an author posts on a topic, a commenter makes the most basic objection that jumps to mind first, and the author replies that the post isn't meant to be the definitive word on the topic and the commenter's objection will be addressed in future posts.[1]
I think this pattern is bad and undesirable.[2] Despite my many disagreements with him and his writing, Eliezer did something very, very valuable in the Sequences and then in Highly Advanced Epistemology. He started out with all the logical dependencies, hammering down the basics first, and then built everything else on top, one inferential step at a time.[3] As a result of this, users could verify the local validity of what he was saying, and when they disagreed with him, they knew the precise point where they jumped off the boat of his ideology.[4] Instead of Eliezer giving his conclusions without further commentary, he gave the commentary, bit by bit, then the conclusions.
In practice, it generally just isn't. Or a far weaker or modified version of it is.
Which doesn't mean there's a plausible alternative out there in practice. Perhaps trying to remove this pattern imposes too much of a constraint on authors and instead of them writing things better (from my pov), they instead don't write anything at all. Which is a strictly worse outcome than the original.
That's not because his mind had everything cleanly organized in terms of axioms and deductions. It's because he put in a lot of effort to translate what was in his head to what would be informative for and convincing to an audience.
Which allows for productive back-and-forths because you don't need to thread through thousands of words to figure out where people's intuitions differ and how much they disagree, etc.
I often have the opposite complaint, which is that when reading a sequence, I wish I knew what the authors' bottom line is, so I can better understand how their arguments relate and which ones are actually important and worth paying attention to. If I find a flaw, does it actually affect their conclusions or is it just a nit? In this case, I wish I knew what the authors' actual ideas are for aligning AI "conservatively".
One way to solve both of our complaints is if the authors posted the entire sequence at once, but I can think of some downsides to doing that (reducing reader motivation, lack of focus in discussion), so maybe still post to LW one at a time, but make the entire sequence available somewhere else for people to read ahead or reference if they want to?
We're very interested in seeing where people see flaws, and there's a real chance that they could change our views. This is a forum post, not a book, and the format and our intent in sharing it differ. That is, if we had completed the entire sequence before starting to get public feedback, the idea of sharing the full sequence at the start would work - but we have not. We have ideas, partial drafts, and some thoughts on directions to pursue, but it's not obvious that the problems we're addressing are solvable, so we certainly don't have final conclusions, nor do I think we will get there when we conclude the sequence.
One way to solve both of our complaints is if the authors posted the entire sequence at once, but I can think of some downsides to doing that (reducing reader motivation, lack of focus in discussion)
Also the fact that you don't get to use real-time feedback from readers on what their disagreements/confusions are, allowing you to change what's in the sequence itself or to address these problems in future posts.
Anyway, I don't have a problem with authors making clear what their bottom line is.[1] I have a problem with them arguing for their bottom line out of order, in ways that unintentionally but pathologically result in lingering confusions and disagreements and poor communication.
If nothing else, reading that tells you as a reader whether it's something you're interested in hearing about or not, allowing you not to waste time if it's the latter.
I'm confused by this criticism. You jumped on the most basic objection that jumps to mind first based on what you thought we were saying - but you were wrong. We said, explicitly, that this is "our lens on parts of the conservative-liberal conceptual conflict" and then said "In the next post, we want to outline what we see as a more workable version of humanity's relationship with AGI moving forward."
My reply wasn't backing out of a claim, it was clarifying the scope by restating and elaborating slightly something we already said in the very first section of the post!
The objection isn't the liberal/conservative lens. That's relatively minor, as I said. The objection is the viability of this approach, which I explained afterwards (in the final 4 paragraphs of my comment) and remains unaddressed.
The viability of what approach, exactly? You again seem to be reading something different than what was written.
You said "There is no point in this post where the authors present a sliver of evidence for why it's possible to maintain the 'barriers' and norms that exist in current societies, when the fundamental phase change of the Singularity happens."
Did we make an argument that it was possible, somewhere, which I didn't notice writing? Or can I present a conclusion to the piece that might be useful:
"...the question we should be asking now is where [this] view leads, and how it could be achieved.
That is going to include working towards understanding what it means to align AI after embracing this conservative view, and seeing status and power as a feature, not a bug. But we don’t claim to have 'the' answer to the question, just thoughts in that direction - so we’d very much appreciate contributions, criticisms, and suggestions on what we should be thinking about, or what you think we are getting wrong."
I strong upvoted this because I thought the discussion was an interesting direction and it had already fallen off the frontpage. I don't know that I particularly agree with the reasoning. (I am generally liberal, though have updated a bit towards being more conservative-as-described-here on the margin in recent years. I found the general framing of liberal vs conservative as described here interesting as a lens to look through.)
I do feel like this somewhat overstates the values-difference with Fun Theory, and feels like it's missing the point of Coherent Extrapolated Volition.
We will argue that as usually presented, alignment by default leads to recursive preference engines that eliminate disagreement and conflict, creating modular, adaptable cultures where personal compromise is unnecessary. We worry that this comes at the cost of reducing status to cosmetics and eroding personal growth and human values. Therefore, we argue that it’s good that values inherently conflict, and these tensions give life meaning; AGI should support enduring human institutions by helping communities navigate disputes and maintain norms, channeling conflict rather than erasing it. This ideal, if embraced, means that AI Alignment is essentially a conservative movement.
I don't think CEV at its core assumes this. I think that the CEV write-up makes a prediction that, if people knew more, thought longer, and grew up together more, a lot of disagreements would melt away and there would turn out to be a lot that humanity wants in common. But CEV is designed to do pretty well even in worlds where that is false (if it's maximally false, the CEV just throws an error; but in worlds where things only partially cohere, the AI helps out with those parts as best it can, in a way that everyone agrees is good).
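As a toy illustration of that partial-coherence reading (my own sketch, not anything from the actual CEV write-up; the function name, the per-issue framing, and the agreement threshold are all invented for illustration):

```python
from collections import Counter

def toy_cev(extrapolated_views, agreement=0.9):
    """extrapolated_views: list of dicts, each mapping an issue to the option one
    person would prefer if they 'knew more, thought longer, grew up further together'."""
    policy = {}
    issues = {issue for view in extrapolated_views for issue in view}
    for issue in issues:
        votes = Counter(view[issue] for view in extrapolated_views if issue in view)
        option, count = votes.most_common(1)[0]
        if count / sum(votes.values()) >= agreement:
            # Act only on the parts where the extrapolated views (mostly) cohere.
            policy[issue] = option
    if not policy:
        # The "maximally false" case: nothing coheres, so do nothing rather than guess.
        raise ValueError("no coherent extrapolated volition found")
    return policy
```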
There's also nothing intrinsically anti-conservative about what it'll end up with, unless you think people would be less conservative after thinking longer and learning more and talking with each other more. (Do you think that?) Yeah, lots of LWers probably lean towards expecting it'll be more liberal, but that's just a prediction, not a normative claim CEV is making.
Somewhat relatedly, in Free to Optimize (which is about humans being able to go about steering their lives, not about an AI or anyone "hardcore optimizing") Eliezer says:
If there is anything in the world that resembles a god, people will try to pray to it. It's human nature to such an extent that people will pray even if there aren't any gods—so you can imagine what would happen if there were! But people don't pray to gravity to ignore their airplanes, because it is understood how gravity works, and it is understood that gravity doesn't adapt itself to the needs of individuals. Instead they understand gravity and try to turn it to their own purposes.
So one possible way of helping—which may or may not be the best way of helping—would be the gift of a world that works on improved rules, where the rules are stable and understandable enough that people can manipulate them and optimize their own futures together. A nicer place to live, but free of meddling gods beyond that. I have yet to think of a form of help that is less poisonous to human beings—but I am only human.
This feels more like a vision of "what constraints to put in place" than "what to optimize for."
(I agree that it has a vibe of pointing in a more individualistic direction, and it's worth noticing that and not taking it for granted. But I think the point of Fun Theory is to get at something that really would also underlie any good vision for the future, not just one particular one. I think conservatives do actually want complex novelty. I don't have an encyclopedic memory of the Fun Theory sequence but I would bet against it saying anything explicit and probably not even anything implicit about "individual" complex novelty. It even specifically warns against turning our complex, meaningful multiplayer games into single-player experiences.)
CEV isn't about eliminating conflict, it's (kinda) about efficiently resolving conflict. But, insofar as the resolution of the conflict itself is meaningful, it doesn't say anything about people not getting to resolve the conflict themselves.
(People seem to be hella downvoting this, and I am kinda confused as to why. I can see not finding it particularly persuasive or interesting. I'm guessing this is just sad tribalism but curious if people have a particular objection I'm missing)
There are some users around who strong-downvote anyone trying to make any arguments on the basis of CEV, and who seem very triggered by the concept. This is sad and has derailed a bunch of conversations in the past. My guess is the same is going on here.
Do you not have the power/tools to stop such behavior from taking effect? This sounds like the exact problem that killed LW 1.0, and which I was led to believe is now solved.
We have much better tools to detect downvoting of specific users, and unusual voting activity by a specific user, but if a topic only comes up occasionally and the users who vote on that topic also regularly vote on other things, I don't know of any high-level statistics that would easily detect that, and I think it would have very substantial chilling effects if we were to start policing that kind of behavior.
There probably are technical solutions, but it's a more tricky kind of problem than what LW 1.0 faced, and we haven't built them.
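To make the asymmetry concrete, here is a hypothetical sketch of the kind of per-pair statistic that catches the easy case (names and thresholds invented for illustration); the topic-based pattern described above evades it, because those votes don't concentrate on any single voter/author pair:

```python
from collections import defaultdict

def flag_targeted_downvoting(votes, min_votes=20, min_downvote_share=0.8):
    """votes: iterable of (voter_id, author_id, vote) tuples, with vote in {-1, +1}.
    Flags voter/author pairs where one voter has cast many, overwhelmingly
    negative votes on a single author's content."""
    tallies = defaultdict(lambda: [0, 0])  # (voter, author) -> [total votes, downvotes]
    for voter, author, vote in votes:
        tallies[(voter, author)][0] += 1
        if vote < 0:
            tallies[(voter, author)][1] += 1
    return [
        pair
        for pair, (total, down) in tallies.items()
        if total >= min_votes and down / total >= min_downvote_share
    ]
```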
I'd be more interested in tools that detected downvotes that occur before people started reading, on the basis of the title - because I'd give even odds that more than half of downvotes on this post were within 1 minute of opening it, on the basis of the title or reacting to the first paragraph - not due to the discussion of CEV.
I was the one who downvoted, and my reasoning for doing this is that, at a fundamental level, I think a lot of their argument rests on a fabrication of options that only appear to work because they ignore the issue of why value disagreement is less tolerable in an AI-controlled future than now.
I have a longer comment below, and @sunwillrise makes a similar point, but a lot of the argument around AI safety having an attitude towards minimizing value conflict makes more sense than the post is giving it credit for, and the mechanisms that allow value disagreements to not blow up into take-over attempts/mass violence rely on certain features of modern society that AGI will break (and there is no talk about how to actually make the vision sustainable).
Thank you for noticing the raft of reflexive downvotes; it's disappointing how much even LessWrong seems to react reflexively; even the commenters seem not to have read the piece, or at least not to have engaged with the arguments.
On your response - I agree that CEV as a process could arrive at the outcomes you're describing, where ineliminable conflict gets it to throw an error - but think that CEV as approximated and as people assume will work is, as you note, making a prediction that disagreements will dissolve. Not only that, but it asserts that this will have an outcome that preserves what we value. If the tenets of agonism are correct, however, any solution geared towards "efficiently resolving conflict" is destructive of human values - because as we said, "conflict is central to the way society works, not something to overcome." Still, I agree that Eliezer got parts of this right (a decade before almost anyone else even noticed the problem), and agree that keeping things as multiplayer games with complex novelty, where conflict still matters, is critical. The further point, which I think Eliezer's fun theory, as written, kind of elides, is that we also need limits and pain for the conflict to matter. That is, again, it seems possible that part of what makes things meaningful is that we need to ourselves engage in the conflict, instead of having it "solved" via extrapolation of our values.
As a separate point, as I argued in a different post, we lack the conceptual understanding needed to deal with the question of whether there is some extrapolated version of most agents that is anywhere "close" to their values which is coherent. But at the very least, "the odds that an arbitrary complex system is pursuing some coherent outcome" approaches zero, and that at least slightly implies almost all agents might not be "close" to a rational agent in the important senses we care about for CEV.
The further point, which I think Eliezer's fun theory, as written, kind of elides, is that we also need limits and pain for the conflict to matter.
I think Eliezer's writing says this sort of thing pretty explicitly? (Like, in Three Worlds Collide, the "bad" ending was the one where humans removed all conflict, romantic struggle, and similar types of pain that seem like the sort of thing you're talking about here)
If the tenets of agonism are correct, however, any solution geared towards "efficiently resolving conflict" is destructive of human value
I assume this will come up later in your sequence, but, as stated, this seems way too strong. (I can totally buy that there are qualities of conflict resolution that would be bad to abstract away, but, as stated, this is an argument against democracy, markets, mediation, norms for negotiation, etc. Do you actually believe those are destructive of human value and we should be, like, waging war instead of talking? Or do you mean something else here?)
I agree that Eliezer has made different points in different places, and don't think that the Fun Theory series makes this clear, and CEV as described seems to not say it. (I can't try to resolve all the internal tensions between the multiple bookshelves worth of content he's produced, so I referred to "fun theory, as written.")
And I certainly don't think conflict as such is good! (I've written about the benefits of avoiding conflict at some length on my substack about cooperation.) My point here was subtly different, and more specific to CEV; I think that solutions for eliminating conflict which route around humans themselves solving the problems might be fundamentally destructive of our values.
I think this might be underestimating how the conservative/liberal axis correlates with the scarcity/abundance axis. In an existential struggle against a zombie horde, the conservative policies are a lot more relevant - of course "our tribe first" is the only survivable answer, anybody who wants to "find themselves" when they are supposed to be guarding the entrance is an idiot and a traitor, deviating from proven strategies is a huge risk, etc. When all important resources are abundant, liberal policies become a lot more relevant - hoarding resources, and not sharing with neighbors is a mental illness, there is low risk in all kinds of experimentation and rule breaking, etc. Well, AI is very likely to drastically move us away from scarcity and towards abundance, so we need to consider how it affects which policies would make more sense.
"AI is very likely to drastically move us away from scarcity and towards abundance"
That makes a huge number of assumptions about the values and goals of the AI, and is certainly not obvious - unless you've already assumed things about the shape of the likely future, and the one we desire. But that's a large part of what we're questioning.
How about this - in most non-disaster scenarios, AI would make the abundance a lot easier to achieve. And conservative or liberal, it's basic human nature to go for abundance in such situations.
I don't think this is true in the important sense; yes, we'll plausibly get material abundance, but we will still have just as much conflict because humans want scarcity, and they want conflict. So which resources are "important" will shift. (I should note that Eliezer made something like this point in a tweet, where he said "And yet somehow there is a Poverty Equilibrium which beat a 100-fold increase in productivity plus everything else that went right over the last thousand years" - but his version assumes that once all the necessities are available, poverty would be gone. I think that we view past luxuries that would have been clearly impossible, like internet connectivity and access to laundry machines, as minimal requirements, showing that the hedonic treadmill is stronger than wealth generation!)
The big reason why AI safety aims to have a vision of as little disagreement on values as possible between AIs and humans is that the mechanisms that make value disagreement somewhat tolerable (at least in the sense that people won't kill each other all the time) are going to go away with AGI.
In particular, one of the biggest glues holding our society together is the fact that everyone is dependent on everyone else. Individuals simply can't thrive without society, and the BATNA is generally so terrible that people will put up with a lot of disagreements to keep themselves away from the BATNA.
In particular, one corollary is that we don't have to worry much about one individual human subverting the enforcement system, and no one person rules alone. We do have to worry about this for AGI, so a lot of common strategies to manage people breaking agreements do not work.
And the techno-utopian/extinction claims are basically due to the fact that AGI is an extremizing force, in that it allows ever more technological progress, giving access to ever more extreme claims, and would be a convergent result of widely different ideologies.
Putting it less charitably, the post is trying to offer us a fabricated option that is a product of the authors not being able to understand how AI is able to break the constraints of current society, and while I shall wait for future posts, I'm not impressed at all with this first post.
This post is about alignment targets, or what we want an AGI to do, a mostly separable topic from technical alignment, or how we get an AGI to do anything in particular. See my Conflating value alignment and intent alignment is causing confusion for more.
There's a pretty strong argument that technical alignment is far more pressing, so much so that addressing alignment targets right now really is barely-helpful when compared to doing nothing, and anti-helpful relative to working on technical alignment or "societal alignment" (getting-our-shit-collectively-together).
In particular, those actually in charge of building AGI will want it aligned to their own intent, and they'll have an excuse because it's probably genuinely a good bit more dangerous to aim directly at value alignment rather than aim for some sort of long reflection. Instruction-following is the current default alignment target and it will likely continue to be through our first AGI(s) because it offers substantial corrigibility. Value alignment does not, so we have to get it right on the first real try in both technical and the wisdom sense you address here.
More on that argument here.
Yes, this is partly true, but assumes that we can manage technical alignment in a way that is separable from the values we are aiming towards - something that I would have assumed was true before we saw the shape of LLM "alignment" solutions, but no longer think is obvious.
And instruction-following is deeply worrying as an 'alignment target', since it doesn't say anything about what ends up happening, much less actually guarantee corrigibility - especially since we're not getting meaningful oversight - but that's a very different argument than the one we're making here.
I agree that instruction-following is deeply worrying as an alignment target. But it seems it's what developers will use. Wishing it otherwise won't make it so. And they're right that the shape of current LLM alignment "solutions" includes values. I don't think those are actual solutions to the hard part of the alignment problem. If developers use something like HHH as a major component of the alignment for an AGI, I think that and other targets have pretty obvious outer alignment failure modes, and probably less obvious inner alignment problems, so that approach simply fails.
I think and hope that as they approach AGI and take the alignment problem somewhat seriously, developers will try to make instruction following the dominant component. Instruction-following does seem to help non-trivially with the hard part because it provides corrigibility and other useful flexibility. See those links for more.
That is a complex and debatable argument, but the argument that developers will pursue intent alignment seems simpler and stronger. Having an AGI that primarily does your bidding seems safer and more self-interested than one that makes fully autonomous decisions based on some attempt to define everyone's ideal values.
"This is what's happening and we're not going to change it" isn't helpful - both because it's just saying we're all going to die, and because it fails to specify what we'd like to have happen instead. We're not proposing a specific course for us to influence AI developers, we're first trying to figure out what future we'd want.
While I'm probably much more of a lib than you guys (at least in ordinary human contexts), I also think that people in AI alignment circles mostly have really silly conceptions of human valuing and the historical development of values.[1] I touch on this a bit here. Also, if you haven't encountered it already, you might be interested in Hegel's work on this stuff — in particular, The Phenomenology of Spirit.
This isn't to say that people in other circles have better conceptions...
Yes, agreed that the concept of value is very often confused, mixing economic utility and decision theory with human preferences, constraints, and goals. Harry Law also discussed the collapse of different conceptions into a single idea of "values" here: https://www.learningfromexamples.com/p/weighed-measured-and-found-wanting
When children argue for activities to be allowed, they take a reductionist approach, while parents tend to take a holistic approach.
But once you permit an exception for the rule ("okay, today there is no strong reason to do X"), it immediately becomes a precedent (tomorrow: "why do we have to do X, when we didn't have to do it yesterday?").
Also, children are quite conservative about the things they like. Try to skip a bedtime story once!
Conservatism isn’t about keeping things as they are. It’s about regression into a gilded fictional past. Intentionally introducing bias into a system in order to transit to a nonexistent temporal location as a reactionary response seems like a strange thing to do in general. It seems like an exceedingly strange thing to do to a conglomeration of logical procedures.
The entire notion is inherently regressive and reactionary. It’s coping with fear of an unknown future by appealing to an idealized past. Intentionally baking fear into the system eliminates the goal of an inherently progressive system that is, by definition and design, intended to be in a continuous state of incremental improvement.
Systematic Utopianism predicated on fictions fabricated in correlation fallacy does not seem like a pursuit of reason; therefore hostile to AGI.
Your dismissive view of "conservatism" as a general movement is noted, and not even unreasonable - but it seems basically irrelevant to what we were discussing in the post, both in terms of what we called conservatism, and the way you tied it to 'Hostile to AGI." And the latter seems deeply confused, or at least needs much more background explanation.
Current plans for AI alignment (examples) come from a narrow, implicitly filtered, and often (intellectually, politically, and socially) liberal standpoint. This makes sense, as the vast majority of the community, and hence alignment researchers, have those views. We, the authors of this post, belong to the minority of AI Alignment researchers who have more conservative beliefs and lifestyles, and believe that there is something important to our views that may contribute to the project of figuring out how to make future AGI be a net benefit to humanity. In this post, and hopefully a series of posts, we want to lay out an argument we haven’t seen for what a conservative view of AI alignment would look like.
We re-examine the AI Alignment problem through a different, and more politically conservative lens, and we argue that the insights we arrive at could be crucial. We will argue that as usually presented, alignment by default leads to recursive preference engines that eliminate disagreement and conflict, creating modular, adaptable cultures where personal compromise is unnecessary. We worry that this comes at the cost of reducing status to cosmetics and eroding personal growth and human values. Therefore, we argue that it’s good that values inherently conflict, and these tensions give life meaning; AGI should support enduring human institutions by helping communities navigate disputes and maintain norms, channeling conflict rather than erasing it. This ideal, if embraced, means that AI Alignment is essentially a conservative movement.
To get to that argument, the overarching question isn’t just the technical question of “how do we control AI?” It’s “what kind of world are we trying to create?” Eliezer’s fun theory tried to address this, as did Bostrom’s new “Deep Utopia.” But both of these are profoundly liberal viewpoints, and see the future as belonging to “future humans,” often envisioning a time when uploaded minds and superintelligence exist, and humanity as it traditionally existed will no longer be the primary mover of history. We want to outline an alternative which we think is more compatible with humanity as it exists today, and not incidentally, one that is less liable to failure.
This won’t be a treatise of political philosophy, but it’s important as background, so that is where we will start. We’ll start by explaining our lens on parts of the conservative-liberal conceptual conflict, and how this relates to the parent-child relationship, and then translate this to why we worry about the transhuman vision of the future.
In the next post, we want to outline what we see as a more workable version of humanity's relationship with AGI moving forward. And we will note that we are not against the creation of AGI, but want to point towards the creation of systems that aren’t liable to destroy what humanity values.
"What exactly are conservatives conserving?" – An oft-repeated liberal jab
The labels "conservative" and "liberal" don't specify what the contested issues are. Are we talking about immigration, abortion rights or global warming? Apparently it doesn't matter - especially in modern political discourse, there are two sides, and this conflict has persisted. But taking a fresh look, it might be somewhat surprising that there are even such terms as "conservatives" and "liberals" given how disconnected the views are from particular issues.
We propose a new definition of "conservative" and "liberal" which explains much of how we see the split. Conservatives generally want to keep social rules as they are, while liberals want to change them to something new. Similarly, we see a liberal approach as aspiring to remove social barriers, unless (perhaps) a strong-enough case can be made for why a certain barrier should exist. The point isn’t the specific rule or norm. A conservative approach usually celebrates barriers, and rules that create them, and ascribes moral value to keeping as many barriers as possible. For instance, the prohibition on certain dress codes in formal settings seems pointless from a liberal view—but to a conservative, it preserves a shared symbolic structure.
It's as if the people who currently say "No, let's not allow more people into our country" and the people who say "No, let's not allow schools to teach our children sex education" just naturally tend to be good friends. They clump together under the banner "keep things as they are", building a world made of Chesterton’s fences, while the people who say "let's change things" clump in another, different world, regardless of what those things are. Liberals, however, seem to see many rules, norms, or institutions as suspect until justified, and will feel free to bend or break them; conservatives tend to see these as legitimate and worth respecting until proven harmful. That instinct gap shapes not just what people want AI to do—but how they believe human societies work in the first place.
It's possible that the school of thought of agonistic democracy (AD) could shed some light on the above. AD researchers draw the distinction between two kinds of decision-making: consensus-seeking deliberation, which aims to dissolve disagreement into a shared conclusion, and agonistic decision-making, in which disagreement persists and is channeled into ongoing, legitimate contestation.
AD researchers claim that while agonistic decision-making may seem like an undesirable side-effect, it is a profoundly important ingredient in a well-functioning democracy. Applying this logic to the conservative-liberal conflict, we claim that much of the conflict between a person who says "Yes!" and a person who says "No!" is more important for discussing the views than what the question even was. Similarly, one side aims for convergence of opinions, the other aims for stability and preserving differences.
Bringing this back to alignment, it seems that many or most proposals to date are aiming for a radical reshaping of the world. They are pushing for transformative change, often in their preferred direction. Their response to disagreement is to try to find win-win solutions, such as the push towards paretotopian ideals where disagreements are pushed aside in the pursuit of radically better worlds. And this is assuming that increasingly radical transformation will grow the pie and provide enough benefits to override the disagreement - because this view assumes the disagreement was never itself valuable.
Along different but complementary lines, the philosopher Carl Schmitt argued that politics begins not with agreement, or deliberation, but with the friend–enemy distinction. Agonistic democracy builds on a similar view: real social life isn’t just cooperation, it’s structured disagreement. Not everyone wants the same thing, and they shouldn't have to. While Schmitt takes the primacy of ineliminable conflict even further, Mouffe and others suggest that the role of institutions is not to erase conflict, but to contain it. They agree, however, that conflict is central to the way society works, not something to overcome.
We can compare the conflict between conservatives and liberals to the relationship between parents and children[1]. In a parent-child relationship, children ask their parents for many things. They then often work to find the boundaries of what their parents will allow, and lead long-term campaigns to inch these boundaries forward.
Can I have an ice cream? Can I stay another 5 minutes before going to bed? Can I get my own phone when I'm 9? When could I get a tattoo? Try one argument, and if it fails, try another until you succeed. If mom says no, ask dad. You can get a hundred nos, but if you get one yes, you have now won the debate permanently, or at least set a precedent that will make it easier to get a yes next time. If you have an older sibling, they may have done much of the work for you. There is already a mass of "case law" that you can lean on. Parents may overturn precedents, but that requires them to spend what little energy they have left after their workday to get angry enough for the child to accept the overturn.
One extreme view of this, which is at least prevalent in the effective altruist / rationalist communities, is that children should be free to set the course of their own lives. I know of one parent that puts three dollars aside each time they violate the bodily sovereignty of their infant - taking something out of their mouth, or restricting where they can go. Eliezer similarly said that the dath-ilani version is that "any time you engage in an interaction that seems net negative from the kid's local perspective you're supposed to make a payment that reflects your guarantee to align the nonlocal perspective." This view is that all rules are “a cost and an [unjustified, by default,] imposition.” Done wisely and carefully, this might be fine - but it is setting the default against having rules.
And if a parent is discouraged from making rules, the dynamic ends up being similar to a ratchet; it can mostly only turn one way. Being on the conservative side of this war is terrifying. You can really only play defense. Any outpost you lose, you have lost permanently. Once smartphones became normal at age 10, no future parent can ‘undo’ that norm without being perceived as abusive or weird. As you gaze across the various ideals that you have for what you want your family life to look like, you're forced to conclude that for most of these, it's a question of when you'll lose them rather than if.
We would like to expand the previous analogy, and compare the relationship between humanity and AGI to the relationship between parents and their children. We are not the first to make this comparison; however, we argue that this comparison deserves to be treated as more than just a witty, informal metaphor.
We argue that the AI alignment problem, desperately complicated as it is, is a beefed-up version of a problem that's barely tractable in its own right: The child alignment problem. Parents want their children to follow their values: moral, religious, national, interpersonal, and even culinary and sports preferences, or team allegiances. However, children don't always conform to their parents' plans for them. Of course, as we get to the later categories, this is less worrisome - but there are real problems with failures.
The child alignment problem has tormented human parents throughout known history, and thus has become an evergreen motif in art, from King Lear to Breaking Home Ties to The Godfather to Pixar's Up. Children's divergence is not just an inconvenience; it's an integral, profound component of the human condition. This statement might even be too human-centric, as animal parents reprimand their offspring, repeatedly, with the same kind of visible frustration and fatigue that we observe in human parents. Self-replication is an essential property of life; thus the desire to mold another being in one's image, and in turn to resist being molded by another, are not merely challenges, but likely contenders for an answer to the question, "what is the meaning of life?"
Let's descend from the height of the previous paragraph, and ask ourselves these questions: "How does this connect with AI alignment? And what useful lessons can we take from this comparison?"
The answer to the first question is straightforward: The AI alignment problem is a supercharged version of the child alignment problem; it has all of the complexities of the latter, compounded with our uncertainty about the capabilities and stability of this new technology. We struggle to mold AGI in our image, not knowing how insistent or resourceful it will be in resisting our molding.
We offer the following answer to the second question: The child alignment problem is decidedly not an optimization problem. One does not win parenthood by producing a perfect clone of oneself, but rather by judiciously allowing the gradual transformation of one's own ideals from subject to object. A cynical way to put the above in plain words is "you can't always get what you want"; more profoundly, that judicious transformation process becomes the subject. The journey becomes the destination.
People ask what an AI-controlled autonomous vehicle will do when faced with a trolley problem-like situation; stay on course and kill multiple people, or veer and kill one person. We care so much about that, but we're complacent with humans killing thousands of other humans on the road every day. We care intensely about what choice the AI will make, though we'll get angry at it regardless of which choice it'll make. There isn't really a right choice, but we have an intense desire to pass our pain forward to the AI. We want to know that the AI hurts by making that choice.
In this framing, we aren’t just interested in the safety of AI systems. We want to know that it values things we care about. Maybe we hope that seeing bad outcomes will “hurt” even when we're not there to give it a negative reward signal. Perhaps we want the AI to have an internal conflict that mirrors the pain that it causes in the world. Per agonistic democracy, we don't want the AI to merely list the pros and cons of each possible choice; we want the conflict to really matter, and failure to tear its insides to shreds, in the same way it tears ours. Only then will we consider AI to have graduated from our course in humanity.
Another related distinction is between holism ("the whole is greater than the sum of its parts") and reductionism ("the whole is the sum of its parts"). When children argue for activities to be allowed, they take a reductionist approach, while parents tend to take a holistic approach. Children want to talk about this specific thing, as separate from any other things. Why can't I stay up late? Because you have school tomorrow. Okay, but why can't I stay up late on the weekend? Because… It's just too similar to staying up late on a weekday, and it belongs to a vague collection of things that children shouldn't do. If you argue for each of these items separately, they will eventually get sniped one-by-one, until the collection dissolves to nothing. The only way they can stay safe is by clumping together. That is a holistic approach.
In the struggle between a conservative and a liberal, the conservative finds that many of the items they're defending are defined by pain. If we align an AI along what we call the liberal vision, to avoid or eliminate pain, in the extreme we have an AI that avoids moral discomfort, and will avoid telling hard truths—just like a parent who only ever says yes produces spoiled kids. And in a world of AI aligned based on avoiding anything bad, where it optimizes for that goal, a teenager might be offered tailored VR challenges that feel deeply meaningful without ever being told they can’t do anything.
This means that the conservative involuntarily becomes the spokesperson for pain, and in the liberal view, their face becomes the face of a person who inflicts pain on others. This point connects to the previously-mentioned agonistic democracy, as "agonistic" derives from the Greek "agōn", which also evolved into "agony". And if we add the utilitarian frame where pleasure is the goal, this pain is almost definitionally bad - but it’s also what preserves value in the face of optimization of (inevitably) incompletely specified human values.
Optimizers find something simple enough to maximize, and do that thing. This is fundamentally opposed to any previously existing values, and it’s most concerning when we have a strong optimizer with weak values. But combining the idea of alignment as removing pain with optimizing for this goal means that “avoidance of pain” will erase even necessary friction—because many valuable experiences (discipline, growth, loyalty) involve discomfort. And in this view, the pain discussed above has a purpose.
Not suffering for its own sake, or trauma, but the kind of tension that arises from limits, from saying no, from sustaining hard-won compromises. In a reductionist frame, every source of pain looks like a bug to fix, any imposed limit is oppression, and any disappointment is something to try to avoid. But from a holistic view, pain often marks the edge of a meaningful boundary. The child who’s told they can’t stay out late may feel hurt. But the boundary says: someone cares enough to enforce limits. Someone thinks you are part of a system worth preserving. To be unfairly simplistic, values can’t be preserved without allowing this pain, because limits are painful.
And at this point, we are pointing to a failure that we think is uncontroversial. If you build AI to remove all the pain, it starts by removing imposed pain - surely a laudable goal. But as with all unconstrained optimizers, this will eventually run into problems, in this case, the pain imposed by limitations[2] - and one obvious way to fix this is by removing boundaries that restrict, and then removing any painful consequences that would result. That’s not what we think alignment should be. That’s cultural anesthesia.
But we think this goes further than asking not to make a machine that agrees with everyone. We need a machine that disagrees in human ways—through argument, ritual, deference, even symbolic conflict. But what we’re building is very much not that - it’s sycophantic by design. And the problem AI developers admitted in recent sycophantic models wasn’t that they realized they were mistaken about the underlying goal; it was that they were doing it ham-handedly.
Again, most AI alignment proposals want to reduce or avoid conflict, and the default way for this to happen is finding what to maximize - which is inevitable in systems with specific goals - and optimizing for making people happy. The vision of “humanity’s children” unencumbered by the past is almost definitionally a utopian view. But that means any vision of a transhuman future built along these lines is definitionally arguing for the children’s liberalism instead of the parent’s conservatism. If you build AI to help us transcend all our messy human baggage, it’s unclear what is left of human values and goals.
Concretely, proposals like Eliezer’s Coherent Extrapolated Volition, or Bostrom’s ideas about Deep Utopia assume or require value convergence, and see idealization as desirable and tractable. Deep Utopia proposes a world where we’ve “solved” politics, suffering, and scarcity—effectively maximizing each person's ability to flourish according to their own ideals. Eliezer’s Fun Theory similarly imagines a future where everyone is free to pursue deep, personal novelty and challenge. Disagreement and coordination problems are abstracted away via recursive preference alignment engines and “a form of one-size-fits-all ‘alignment’”. No one tells you no. But this is a world of perfect accommodation and fluid boundaries. You never need to compromise because everything adapts to you. Culture becomes modular and optional. Reductionism succeeds. Status becomes cosmetic. Conflict is something that was eliminated, not experienced nor resolved. And personal growth from perseverance, and the preservation of now-vestigial human values are effectively eliminated.
In contrast, a meta-conservative perspective might say: values don’t converge—they conflict, and the conflicts themselves structure meaning. Instead of personalized utopias, AGI helps steward enduring human institutions. AI systems help communities maintain norms, resolve disputes, and manage internal tensions, but without flattening them. This isn’t about eliminating conflict. Instead, it supposes that the wiser course is to channel it. This isn’t about pessimism. It’s about humility and the recognition that the very roughness we want to smooth over might be essential to the things holding everything together[3].
And the critical question isn’t which of these is preferable - it’s which is achievable without destroying what humans value. It seems at least plausible, especially in light of the stagnation of any real robust version of AI alignment, that the liberal version of alignment has set itself unachievable goals. If so, the question we should be asking now is where the other view leads, and how it could be achieved.
That is going to include working towards understanding what it means to align AI after embracing this conservative view, and seeing status and power as a feature, not a bug. But we don’t claim to have “the” answer to the question, just thoughts in that direction - so we’d very much appreciate contributions, criticisms, and suggestions on what we should be thinking about, or what you think we are getting wrong.
Thanks to Joel Leibo, Seb Krier, Cecilia Elena Tilli, and Cam Allen for thoughts and feedback on a draft version of this post.
In an ideal world, this conflict isn’t acrimonious or destructive, but the world isn’t always ideal - as should be obvious from examples on both sides of the analogy.
Eliezer’s fun theory - and the name is telling - admits that values are complex, but then tries to find what should be optimized for, rather than what constraints to put in place.
Bostrom argues (In Deep Utopia, Monday - Disequilibria) that global externalities require a single global solution. This seems to go much too far, and depends on, in his words, postulating utopia.