[ Question ]

Why do you believe AI alignment is possible?

by Samuel Shadrach, 1 min read, 15th Nov 2021, 43 comments



Genuinely curious: has someone recently (within the last 5 years) made a comprehensive post on what key points lead them to believe AI alignment is possible? Or is it just a vague "we don't have any clue whatsoever but we shouldn't give up"? Ideally the post should demonstrate deep understanding of all the problems and of the attempts that have already been tried and failed.

And if key points don't exist, some vague inklings of your worldview or perspective, anything will do. I'll even accept poetry if you decide that's the only form of communication that will enable you to successfully communicate anything meaningful. (Only half joking)

(I ask because my mind defaults to assuming it is an impossible problem to solve, but I'm keen on reading any perspectives that change that.)

I tried hard to find a post, but couldn't so far. Just some replies here and there buried in threads.


8 Answers

Human brains are a priori aligned with human values. Human brains are proof positive that a general intelligence can be aligned with human values. Wetware is an awful computational substrate. Silicon ought to work better.

Arguments by definition don't work. If by "human values" you mean "whatever humans end up maximizing", then sure, but we are unstable and can be manipulated, which isn't what we want in an AI. And if you mean "what humans deeply want or need", then human actions don't seem very aligned with that, so we're back at square one.

Humans aren't aligned once you break the abstraction of "humans" down. There's nobody I would trust to be a singleton with absolute power over me (though if I had to take my chances, I'd rather have a human than a random AI).

I see but isn't this reversed? "Human values" are defined by whatever vague cluster of things human brains are pointing at.

5lsusr19dDefinition implies equality. Equality is commutative. If "human values" equals "whatever vague cluster of things human brains are pointing at" then "whatever vague cluster of things human brains are pointing at" equals "human values".
2Samuel Shadrach19dAgreed but that doesn't help. If you tell me that A aligns with B and B is defined as the thing that A aligns to, these statements are consistent but give zero information. And more specifically, zero information about whether some C in Set S can also align with B.

The algorithms of good epistemology

Can also equip axiology.

With free energy minimization

Of other-mind prediction,

You can route it to an AI's teleology.

I tried reading about free energy minimisation on Wikipedia, but it went over my head. Is there any source or material you would recommend?

4Jon Garcia17dYeah, Friston is a bit notorious for not explaining his ideas clearly enough for others to understand easily. It took me a while to wrap my head around what all his equations were up to and what exactly "active inference" entails, but the concepts are relatively straightforward once it all clicks. You can think of "free energy" as the discrepancy between prediction and observation, like the potential energy of a spring stretched between them. Minimizing free energy is all about finding states with the highest probability and setting things up such that the highest probability states are those where your model predictions match your observations. In statistical mechanics, the probability of a particle occupying a particular state is proportional to the exponential of the negative potential energy of that state. That's why air pressure exponentially drops off with altitude (to a first approximation, $p(h) \propto \exp\left(-\frac{mgh}{RT}\right)$). For a normal distribution: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right)$ the energy is a parabola: $E(x) = -\log(p(x)) = \frac{1}{2}\frac{(x-\mu)^2}{\sigma^2} + C$ This is exactly the energy landscape you see for an ideal Newtonian spring with rest length $\mu$ and spring constant $k = \frac{1}{\sigma^2} = \text{precision}$. Physical systems always seek the configuration with the lowest free energy (e.g., a stretched spring contracting towards its rest length). In the context of mind engineering, $x$ might represent an observation, $\mu$ the prediction of the agent's internal model of the world, and $\frac{1}{\sigma^2}$ the expected precision of that prediction. Of course, these are all high-dimensional vectors, so matrix math is involved (Friston always uses $\Pi$ for the precision matrix). For rational agents, free energy minimization involves adjusting the hidden variables in an agent's internal predictive model (perception) or adjusting the environment itself (action) until "predictions" and "observations" align to within the desired/expected precision.
(For actions, "prediction" is a bit of a misnomer; it's actually a goal or a homeostatic set point that the agent
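The spring analogy in this comment can be checked numerically. What follows is only a minimal one-dimensional sketch, not Friston's actual formalism: the function names, the learning rate, and the step count are all assumptions made for illustration. It shows that the Gaussian energy (negative log-density) differs from the spring energy with k = 1/σ² only by a constant, and that nudging the prediction down the energy gradient pulls it toward the observation, which is the "perception" half of the story.

```python
import math


def gaussian_energy(x, mu, sigma):
    """Negative log-density of a normal distribution N(mu, sigma^2) at x."""
    return 0.5 * (x - mu) ** 2 / sigma ** 2 + 0.5 * math.log(2 * math.pi * sigma ** 2)


def spring_energy(x, rest_length, k):
    """Potential energy of an ideal spring stretched from rest_length to x."""
    return 0.5 * k * (x - rest_length) ** 2


def update_prediction(observation, mu0, sigma, lr=0.1, steps=200):
    """Adjust the prediction mu by gradient descent on the Gaussian energy.

    Hypothetical helper: lr and steps are arbitrary illustrative choices.
    """
    precision = 1.0 / sigma ** 2
    mu = mu0
    for _ in range(steps):
        # dE/dmu = -precision * (observation - mu)
        grad = -precision * (observation - mu)
        mu -= lr * grad
    return mu


if __name__ == "__main__":
    x, mu, sigma = 3.0, 1.0, 2.0
    # The two energies agree once the constant C = E(mu) is subtracted.
    print(gaussian_energy(x, mu, sigma) - gaussian_energy(mu, mu, sigma))
    print(spring_energy(x, mu, 1.0 / sigma ** 2))
    # The prediction relaxes toward the observation, like a spring contracting.
    print(update_prediction(observation=3.0, mu0=1.0, sigma=1.0))
```

A larger precision (smaller σ) acts like a stiffer spring: the same prediction error costs more energy, so the prediction is pulled toward the observation faster.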
0Samuel Shadrach17dThanks for typing this out. Sounds like uploading a human mind and then connecting it to an "intelligence module". (It's probably safer for us to first upload and then think about intelligence enhancement, rather than ask an AGI to figure out how to upload or model us.) I personally tend to feel that even such a mind would quickly adopt behaviours that you and I find... alien, and their value system will change significantly. Do you feel that wouldn't happen, and if so do you have any insight as to why?

As I see it there are mainly two hard questions in alignment. 

One is, how do you map human preferences in such a way that you can ask a machine to satisfy them. I don't see any reason why this would be impossible for a superintelligent being to figure out. It is somewhat similar (though obviously not identical) to asking a human to figure out how to make fish happy.

The second is, how do you get a sufficiently intelligent machine to do anything whatsoever without doing a lot of terrible stuff you didn't want as a side effect? As Yudkowsky says:

The way I sometimes put it is that I think that almost all of the difficulty of the alignment problem is contained in aligning an AI on the task, “Make two strawberries identical down to the cellular (but not molecular) level.” Where I give this particular task because it is difficult enough to force the AI to invent new technology. It has to invent its own biotechnology, “Make two identical strawberries down to the cellular level.” It has to be quite sophisticated biotechnology, but at the same time, very clearly something that’s physically possible.

This does not sound like a deep moral question. It does not sound like a trolley problem. It does not sound like it gets into deep issues of human flourishing. But I think that most of the difficulty is already contained in, “Put two identical strawberries on a plate without destroying the whole damned universe.” There’s already this whole list of ways that it is more convenient to build the technology for the strawberries if you build your own superintelligences in the environment, and you prevent yourself from being shut down, or you build giant fortresses around the strawberries, to drive the probability to as close to 1 as possible that the strawberries got on the plate.

When I consider whether this implied desideratum is even possible, I just note that I and many others continue to not inject heroin. In fact, I almost never seem to act in ways that look much like driving the probability of any particular number as close to 1 as possible. So clearly it's possible to embed some kind of motivational wiring into an intelligent being, such that the intelligent being achieves all sorts of interesting things without doing too many terrible things as a side effect. If I had to guess, I would say that the way we go about this is something like: wanting a bunch of different, largely incommensurable things at the same time, some of which are very abstract, some of which are mutually contradictory, and somehow all these different preferences keep the whole system mostly in balance most of the time. In other words, it's inelegant and messy and not obvious how you would translate it into code, but it is there, and it seems to basically work. Or, at least, I think it works as well as we can expect, and serves as a limiting case.

Thanks for replying. I'll try my best to reply.

Firstly I'm not convinced that "human preferences" as a coherent concept even exists for an AGI reasoning about it. Basically we cluster all our inclinations at various points in time into a cluster in thingspace called "human preferences" when we reason about ourselves. An AGI that can very transparently see that we as humans are optimising for no one thing in particular may use an entirely different model to reason about human behaviour than the ones humans use to reason about human behaviour. This model ...

I'll even accept poetry

I will now drive a small truck through the door you left ajar. (This is indeed bad poetry, so it's not coherent and not an answer and also not true, but it has some chance of being usefully evocative.)

It seems as though when I learn new information, ideas, or thought processes, they become available for my use towards my goals, and don't threaten my goals. To judge between actions, usually most of what I want to attend to figuring out is the likely consequences of the actions, rather than the evaluation of those consequences (excepting evaluations that are only about further consequences), indicating that to the extent my values are able to notice when they are under threat, they are not generally under threat by other thought processes. When I have unsatisfied desires, it seems they're usually mostly unsatisfied because I don't know which actions to take to bring about certain consequences, and I can often more or less see what sort of thing I would do, at some level of abstraction, to figure out which actions to take; suggesting that there is such a thing as "mere problem solving thought", because that's the sort of thought that I think I can see as a meta-level plan that would work, i.e., my experience from being a mind suggests that there is an essentially risk-free process I can undertake to gain fluency in a domain that lays the domain bare to the influence of my values. An FAI isn't an artifact, it's a hand and an eye. The FAI doing recursive self-improvement is the human doing recursive self-improvement. The FAI is densely enmeshed in low-latency high-frequency feedback relationships with the humans that resemble the relationships between different mental elements of my mental model of the room around me, or between those and the micro-tasks I'm performing and the context of those micro-tasks. A sorting algorithm has no malice, a growing crystal has no malice, and likewise a mind crystallizing well-factored ontology, from the primordial medium of low-impact high-context striving, out into new domains, has no malice. 
The neocortex is sometimes at war with the hardwired reward, but it's not at war with Understanding, unless specifically aimed that way by social forces; there's no such thing as "values" that are incompatible with Understanding, and all that's strictly necessary for AGI is Understanding, though we don't know how to sift baby from bath-lava. The FAI is not an agent! It defers to the human not for "values" or "governance" or "approval" but for context and meaning and continuation; it's the inner loop in an intelligent process, the C code that crunches the numbers. The FAI is a mighty hand growing out of the programmer's forehead. Topologically the FAI is a bubble in space that is connected to another space; metrically the FAI bounds an infinite space (superintelligence), but from our perspective is just a sphere (in particular, it's bounded). The tower of Babylon, but it's an inverted pyramid that the operator balances delicately. Or, a fractal, a tree say, where the human controls the angle of the branches and the relative lengths, propagating infinitely up but with fractally bounded impact. Big brute searches in algorithmically barren domains, small careful searches in algorithmically rich domains. The Understanding doesn't come from consequentialist reasoning; consequentialist reasoning constitutively requires Understanding; so the door is open to just think and not do anything. Algorithms have no malice. Consequentialist reasoning has malice. (Algorithms are shot through with consequentiality, but that's different from being aimed at consequences.) I mostly don't gain Understanding and algorithms via consequentialist reasoning, but by search+recognition, or by the play of thoughts against each other. Search is often consequentialist but doesn't have to be. One can attempt to solve a Rubik's cube without inevitably disassembling it and reassembling it in order. The play of thoughts against each other is logical coherence, not consequentialism. 
The FAI is not a predictive processor with its set-point set by the human, the FAI and the human are a single predictive processor.

Thanks for this.

Would the physical implementation of this necessarily be a man-machine hybrid? Communication directly via neurochemical signals, at least in early stages. Or could the "single predictive processor" still have two subagents (man and machine) able to talk to each other in English? (If you don't have thoughts on physical implementation that's fine too.)

An FAI isn't an artifact, it's a hand and an eye. 

Would this be referring to superhuman capabilities that are narrow in nature? There's a difference between a computer that computes fluid dy...

4TekhneMakre18dThat seems right, though it's definitely harder to be an x-risk without superintelligence; e.g. even a big nuclear war isn't a guaranteed extinction, nor an extremely infectious and lethal virus (because, like, an island population with a backup of libgen could recapture a significant portion of value). I hope not, since that seems like an additional requirement that would need independent work. I wouldn't know concretely how to use the hybridizing capability, that seems like a difficult puzzle related to alignment. I think the bad poetry was partly trying to say something like: in alignment theory, you're *trying* to figure out how to safely have the AI be more autonomous---how to design the AI so that when it's making consequential decisions without supervision, it does the right thing or at least not a permanently hidden or catastrophic thing. But this doesn't mean you *have to* "supervise" (or some other high-attention relationship that less connotes separate agents, like "wield" or "harmonize with" or something) the AI less and less; more supervision is good. IDK. Language seems like a very good medium. I wouldn't say subagent though, see below. This is a reasonable interpretation. It's not my interpretation; I think the bad poetry is talking about the difference between one organic whole vs two organic wholes. It's trying to say that having the AI be genuinely generally intelligent doesn't analytically imply that the AI is "another agent". Intelligence does seem to analytically imply something like consequentialist reasoning; but the "organization" (whatever that means) of the consequentialist reasoning could take a shape other than "a unified whole that coherently seeks particular ends" (where the alignment problem is to make it seek the right ends).
The relationship between the AI's mind and the human's mind could instead look more like the relationship between [the stuff in the human's mind that was there only at or after age 10] and [the stuff in the h
1Samuel Shadrach18dHarder but not impossible. Black balls are hypothetical inventions whose very existence (or existence as public information) makes them very likely to be deployed. With nukes for instance we have only a small set of parties who are capable of building them and choose not to deploy. As a complete aside, that's a really cool hypothetical, I have no idea if that's true though. Lots of engineering depends on our economic and scientific history, costs of materials etc. It's possible that they develop manufacturing differently and different things end up cheaper. Or some scientific / engineering departments are understaffed or overstaffed relative to our world because there just happen to be less or more people interested in them. They would still likely have scientific progress, assuming they solve some of the social coordination problems that we have. Interesting, I'd have to think about it. One could say that organs are in fact subagents, they have different goals. Like how different humans have different goals but they cooperate so you can say individual humans are subagents to the human collective as an agent. The difference between the two I guess is that organs are "not intelligent", whatever intelligence they do have is very narrow, and importantly they don't have an inner model for the rest of the world. Just wondering, could an AI have an inner model of the world independent from human's inner model of the world, and yet exist in this hybrid state you mention? Or must they necessarily share a common model or significantly collaborate and ensure their models align at all times? Would this be similar to: Or do you mean something else? All in all, I think you have convinced me it might be possible :)
4TekhneMakre18dI wouldn't want to say that too much. I'd rather say that an organ serves a purpose. It's part of a design, part of something that's been optimized, but it isn't mainly optimizing, or as you say, it's not intelligent. More "pieces which can be assembled into an optimizer", less "a bunch of little optimizers", and maybe it would be good if the human were doing the main portion of the assembling, whatever that could mean. Hm. This feels like a bit of a different dimension from the developmental analogy? Well, IDK how the metaphor of hands and eyes is meant. Having more "hands and eyes", in the sense of the bad poetry of "something you can wield or perceive via", feels less radical than, say, what happens when a 10-year-old meets someone they can have arguments with and learns to argue-think. IDK, it's a good question. I mean, we know the AI has to be doing a bunch of stuff that we can't do, or else there's no point in having an AI. But it might not have to quite look like "having its own model", but more like "having the rest of the model that the human's model is trying to be". IDK. Also could replace "model" with "value" or "agency" (which goes to show how vague this reasoning is).
1Samuel Shadrach18dGot it. Second para makes a lot of sense. First and last para feel like intentionally deflecting from trying to pin down specifics. I mean your responses are great but still. My responses seem to be moving towards trying to pin some specific things down, yours go a bit in the opposite direction. Do you feel pinning down specifics is a) worth doing? b) possible to do? c) something you wish to do in this conversation? (I totally understand that defining specifics too rigidly in one way shouldn't blind us to all the other ways we could have done things, but that doesn't by itself mean we shouldn't ever try to define them in different ways and think each of those through.)
4TekhneMakre18da) worth doing? Extremely so; you only ever get good non-specifics as the result of having iteratively built up good specifics. b) possible to do? In general, yes. In this case? Fairly likely not; it's bad poetry, the senses that generated are high variance, likely nonsense, some chance of some sense. And alignment is hard and understanding minds is hard. c) something you wish to do in this conversation? Not so much, I guess. I mean, I think some of the metaphors I gave, e.g. the one about the 10 year old, are quite specific in themselves, in the sense that there's some real thing that happens when a human grows up which someone could go and think about in a well-defined way, since it's a real thing in the world; I don't know how to make more specific what, if anything, is supposed to be abstracted from that as an idea for understanding minds, and more-specific-ing seems hard enough that I'd rather rest it. Thanks for noting explicitly. (Though, your thing about "deflecting" seems, IDK what, like you're mad that I'm not doing something, or something, and I'd rather you figure out on your own what it is you're expecting from people explicitly and explicitly update your expectations, so that you don't accidentally incorrectly take me (or whoever you're talking to) to have implicitly agreed to do something (maybe I'm wrong that's what happened). It's connotatively false to say I'm "intentionally deflecting" just because I'm not doing the thing you wanted / expected. Specific-ing isn't the only good conversational move and some good conversational moves go in the opposite direction.)
1Samuel Shadrach18dre c): Cool, no worries. I agree it's a little specific. re last para, you're right that "deflecting" may not have been the best word. Basically I meant you're intentionally moving the conversation away from trying to nail down specifics, which is opposite to the direction I was trying to move it because that's where I felt it would be most useful. I agree that your conversational move may have been useful, I was just wondering if it would be more useful to now start moving in the direction I wanted. By the end of this conversation I have gotten a vague mental picture that there could possibly exist (in theory) a collective mind that has both man and machine inside it. Which answers my original question and is useful. So my probability of "can aligned AI exist" updated a small amount. But I haven't gotten many specifics on what parts are man and what parts are machine conceptually, or any specifics on how this thing looks or is built physically, or any promising direction to pursue to get there, or even a way to judge if we have in fact gotten there, et cetera et cetera. Hence my probability update is small not large. But my uncertainty is higher. I agree all this is hard, no worries if we don't discuss it here.
1TekhneMakre17dMostly, all good. (I'm mainly making this comment about process because it's a thing that crops up a lot and seems sort of important to interactions in general, not because it particularly matters in this case.) Just, "I meant you're intentionally moving the conversation away from trying to nail down specifics"; so, it's true that (1) I was intentionally doing X, and (2) X entails not particularly going toward nailing down specifics, and (3) relative to trying to nail down specifics, (2) entails systematically less nailing down of specifics. But it's not the case that I intended to avoid nailing down specifics; I just was doing something else. I'm not just saying that I wasn't *deliberately* avoiding specifics, I'm saying I was behaving differently from someone who has a goal or subgoal of avoiding specifics. Someone with such a goal might say some things that have the sole effect of moving the conversation away from specifics. For example, they might provide fake specifics to distract you from the fact they're not nailing down specifics; they might mock you or otherwise punish you for asking for specifics; they might ask you / tell you not to ask questions because they call for specifics; they might criticize questions for calling for specifics; etc. In general there's a potentially adversarial dynamic here, where someone intends Y but pretends not to intend Y, and does this by acting as though they intend X which entails pushing against Y; and this muddies the waters for people just intending X, not Y, because third parties can't distinguish them. Anyway, I just don't like the general cultural milieu of treating it as an ironclad inference that if someone's actions systematically result in Y, they're intending Y. It's really not a valid inference in theory or practice. 
The situation is sometimes muddied, such that it's appropriate to treat such people *as though* they're intending Y, but distinguishing this from a high-confidence proposition that they are in fact

I believe it's possible for AI values to align as much as the least possibly aligned human individuals are aligned with each other. And in my book, that alone, if it could be guaranteed, would already constitute a heroic achievement, perhaps the greatest accomplishment of mankind up until that point.

Any greater alignment would be a pleasant fantasy, hopefully realizable if AGIs were to come into existence, but doesn’t seem to have any solid justification, at least not any more than many other pleasant fantasies.

I think a lot of human "alignment" isn't encoded in our brains, it's encoded only interpersonally, in the fact that we need to negotiate with other humans of similar power. Once a human gets a lot of power, often the brakes come off. To the extent that's true, alignment inspired by typical human architecture won't work well for a stronger-than-human AI, and some other approach is needed.

5M. Y. Zuo19dI didn’t mean to suggest that any future approach has to rely on ‘typical human architecture’. I also believe the least possibly aligned humans are less aligned than the least possibly aligned dolphins, elephants, whales, etc…, are with each other. Treating AGI as a new species, at least as distant to us as dolphins for example, would be a good starting point.

I see. Keen on your thoughts on the following:

Would a human with slightly superhuman intellect be aligned with other humans? What about a human whose intelligence is as unrecognisable to us as we are to monkeys? Would they still be "human"? Would their values still be aligned? 

5M. Y. Zuo19dWell I would answer but the answers would be recursive. I cannot know the true values and alignment of such a superhuman intellect without being one myself. And if I were, I wouldn’t be able to communicate such thoughts with their full strength, without you also being at least equally superhuman to understand. And if we both were, then you would know already. And if neither of us are, then we can at best speculate with some half baked ideas that might sound convincing to us but unconvincing to said superhuman intellects. At best we can hope that any seeming alignment of values, perceived to the best of our abilities, is actual. Additionally, said supers may consider themselves ’humans’ or not, on criteria possibly also beyond our understanding. Alternatively, if we could raise ourselves to that level, then case super-super affairs would become the basis, thus leading us to speculate on hyper-superhuman topics on super-Lesswrong. ad infinitum.
2Samuel Shadrach18dGot it. That's a possible stance. But I do believe there exist arguments (/chains of reasoning/etc) that can be understood by and convincing to both smart and dumb agents, even if the class of arguments that a smarter agent can recognise is wider. I would personally hope one such argument can answer the question "can alignment be done?" , either as yes or no. There's a lot of things about the superhuman intellect that we don't need to be able to understand in order for such an argument to exist. Same as how we don't need to understand the details of monkey language or their conception of self or any number of other things, to realise that humans and monkeys are not fully aligned. (We care more about our survival than theirs, they care more about their survival than ours.) Are you still certain that no such argument exists? If so, why?
1M. Y. Zuo18dIn this case we would be the monkeys gazing at the strange, awkwardly tall and hairless monkeys, pondering them in terms of monkey affairs. Maybe I would understand alignment in terms of whose territory is whose, who is the alpha and omega among the human tribe(s), which banana trees are the best, where is the nearest clean water source, what kind of sticks and stones make the best weapons, etc. I probably won’t understand why human tribe(s) commit such vast efforts into creating and securing and moving around those funny looking giant metal cylinders with lots of gizmos at the top, bigger than any tree I’ve seen. Why every mention of them elicits dread, why only a few of the biggest human tribes are allowed to have them, why they need to be kept on constant alert, why multiple need to be put in even bigger metal cylinders to roam around underwater, etc., surely nothing can be that important right? If the AGI is moderately above us, then we could probably find such arguments convincing to both, but we would never be certain of them. If the AGI becomes as far above us as humans are above monkeys, then I believe the chances are about as good as our chances of finding arguments that could convince monkeys about the necessity of ballistic missile submarines.
3Samuel Shadrach18dOkay but the analogue isn't that we need to convince monkeys ballistic missiles are important. It's that we need to convince monkeys that we care about exactly the same things they do. That we're one of them. (That's what I meant by - there's a lot of things we don't need to understand, if we only want to understand that we are aligned.)
4M. Y. Zuo18dAre you pondering what arguments a future AGI will need to convince humans? That’s well covered on LW. Otherwise my point is that we will almost certainly not convince monkeys that ‘we’re one of them‘ if they can use their eyes and see that instead of spending resources on bananas, etc., we’re spending them on ballistic missiles, etc. Unless you mean that we can do so by deception, such as denying we spend resources along those lines, etc… in that case I’m not sure how that relates to a future AGI/human scenario.
2Samuel Shadrach18dThat certainly acts as a point against us being aligned, in the brain of a monkey. (Assuming they could even understand it's us who are building the missiles in front of them.) Maybe you can counteract it with other points in favour. It isn't immediately clear to me why that has to be deceptive (if we were in fact aligned with monkeys). Keen on your thoughts. P.S. Minor point but you can even deliberately hide the missiles from the monkeys, if necessary. I'm not sure if willful omission counts as deception.

Which kind of impossible-to-solve do you think alignment is, and why?

Do you mean that there literally isn't any one of the countably infinite set of bit strings that could run as a program on any mathematically possible piece of computing hardware that would "count" as both superintelligent and aligned? That... just seems like a mathematically implausible prior. Even if any particular program is aligned with probability zero, there could still be infinitely many aligned superintelligences "out there" in mind design space. 

Note: if you're saying the concept of "aligned" is itself confused to the point of impossibility, well, I'd agree that I'm at least sure my current concept of alignment is that confused if I push it far enough, but it does not seem to be the case that there are no physically possible futures I could care about and consider successful outcomes for humanity, so it should be possible to repair said concept.

Do you mean there is no way to physically instantiate such a device? Like, it would require types of matter that don't exist, or numbers of atoms so large they'd collapse into a black hole, or so much power that no Kardashev I or II civ could operate it? Again, I find that implausible on the grounds that all the humans combined are made of normal atoms, weigh on the order of a billion tons, and consume on the order of a terawatt of chemical energy in the form of food, but I'd be interested in any discussions of this question.

Do you mean it's just highly unlikely that humans will successfully find and implement any of the possible safe designs? Then assuming impossibility would seem to make this even more likely, self-fulfilling-prophecy style, no? Isn't trying to fix this problem the whole point of alignment research?

Thanks for replying.

I think my intuitions are a mix of your 3rd and 5th ones.

Do you mean it's just highly unlikely that humans will successfully find and implement any of the possible safe designs? Then assuming impossibility would seem to make this even more likely, self-fulfilling-prophecy style, no? Isn't trying to fix this problem the whole point of alignment research?

If the likelihood is sufficiently low, no reasonable amount of work might get you there. Say the odds of aligned AI being built this century are 10^-10 if you do nothing versus 10^-5 if thou...

I believe it is not literally impossible because… my priors say it is the kind of thing that is not literally impossible? There is no theorem or law of physics which would be violated, as far as I know.

Do I think AI Alignment is easy enough that we’ll actually manage to do it? Well… I really hope it is, but I’m not very certain.

Got it. I have some priors towards it being near-impossible after reading content about it and adjacent philosophical issues. I can totally see why someone who doesn't have that doesn't need a justification to set a non-zero non-trivial prior.

4AprilSR15dUntil I actually see any sort of plausible impossibility argument most of my probability mass is going to be on "very hard" over "literally impossible." I mean, I guess there's a trivial sense in which alignment is impossible because humans as a whole do not have one singular utility function, but that's splitting hairs and isn't a proof that a paperclip maximizer is the best we can do or anything like that.
1Samuel Shadrach15dHumans today all have roughly the same intelligence and training history though. It isn't obvious (to me at least) that a human with an extra "intelligence module" will remain aligned with other humans. I would personally be afraid of any human being intelligent enough to unilaterally execute a totalitarian power grab over the world, no matter how good of a person they seem to be.
2AprilSR15dI'm not sure either way on giving actual human beings superintelligence somehow, but I don't think that not working would imply there aren't other possible-but-hard approaches.
1Samuel Shadrach14dFair, but it acts as a prior against it. If you can't even align humans with each other in the face of an intelligence differential, why would you be able to align an alien with all humans? Or are the two problems fundamentally different in some way?
5AprilSR13dI mean, I agree it'd be evidence that alignment is hard in general, but "impossible" is just... a really high bar? The space of possible minds is very large, and it seems unlikely that the quality "not satisfactorily close to being aligned with humans" is something that describes every superintelligence. It's not that the two problems are fundamentally different, it's just that... I don't see any particularly compelling reason to believe that superintelligent humans are the most aligned possible superintelligences?
1Samuel Shadrach13dFair enough
1 comment

I don't believe alignment is possible. Humans are not aligned with other humans, and the only thing that prevents an immediate apocalypse is the lack of recursive self-improvement on short timescales. Certainly groups of humans happily destroy other groups of humans, and often destroy themselves in the process of maximizing something like the number of statues. The best we can hope for is that whatever takes over the planet after meatbags are gone has some of the same goals that the more enlightened meatbags had, where "enlightened" is a very individual definition. Maybe it is a thriving and diverse Galactic civilization, maybe it is the word of God spread to the stars, maybe it is living quietly on this planet in harmony with nature. There is no single or even shared vision of the future that can be described as "aligned" by most humans.