Consider, first: what would humanity want some aligned AGI to do, were it obtained? Set aside details, and let's suppose, in the main, we want (not instruct, just want) AGI to "make our lives better." (Also, AGI could "make us more productive," but if what is produced is bad, let us suppose we do not desire it.)

But now a problem: AI in general is meant to do what we cannot do - else we would have done it already, without AI - yet to "make things better" is something we could very readily have done already. Therefore, since the "upside" of AI is to make things better as we could not, and yet we could in fact have made things better - only some impulse in humanity held off the improvement - it follows that there is no upside to AI; at least, none we couldn't have procured for ourselves. Whatever we would get from AI is no "upside".

That we could have done better follows from considering counterfactual civilizations - for instance, one which takes AI alignment seriously. And we certainly had many opportunities to create such a civilization. The famed "Forty acres and a mule," duly provided; a propounded and enacted "One Big Union," from an unsuppressed IWW; validated efforts at corrections reform on Norfolk Island by Maconochie, or ratification of the Equal Rights Amendment, or widespread adoption of the teachings of Mo Zi, of Emma Goldman - or practical use of Hero's steam engine, or the survival of Archimedes, Abel, and Galois, or -

We've had plenty of opportunities to "make things better". The AI, then, will be engaged on a salvage operation rather than a mission of optimization. Hence it will first have to un-do - and any such need carries the danger of removing what we might value (perhaps we value most what makes us less than best). Recall too that, since we are observably incapable of bringing ourselves nearer these counterfactuals, nearer some "best", an AI on the current machine learning paradigm - mimicking our data, so mimicking us - is apt to become likewise incapable of making things "the best". With no predominating tendency toward that, what tendency has it to any "better"?

And since, in fact, it may be that the "worse angels of our nature" propelled us away from what would have made things better, then the AI, entrained to our patterns by our data - why should it not manifest those worse, or the absolute worst, angels of us, to destroy us, if it is programmed only to optimize with respect to us and our wishes, which from the above have not been the best? Conversely, in that case, we should avoid destruction only by scrupulously avoiding entraining the AI to us at all. (All these arguments against the present paradigm are over and above those documented in this author's previous article on an alternative alignment model, "Contrary to List of Lethality's point 22, alignment's door number 2", on LessWrong.)

On the current paradigm, we seek AI that operates according to human values, and the best of them. But the best of those values, enacted, formerly and now, would tend to make humanity into a cosmic species; "One cannot stay in the cradle forever": AGI will change everything. But if everything is altered - what reason is there to think that present human values are not made instantly obsolete? Who would then value them at all, still less operate from them thereafter? How is AI to act according to values which alter as soon as it acts; what is it to obey, then - now, or after? For instance, we value patience-as-virtue because, now, it is often the only way, the best way, to get what we want without interfering with our attaining it. If there were no need to wait - why value patience? And if the AI values patience, as humans do, but then makes a world in which patience is redundant - what value is that to have, then? Or, what values will humanity have left, for it to share? If AGI acts and alters all that humanity values - what values can it maintain?

Moreover, our values change depending on our knowledge - and we alter the world according to our values, based on our knowledge of the world. One values obeisance to the Gods on Mount Olympus only so long as one thinks they are there. Obeying the value, one goes to Olympus to give obeisance - and sees nothing there. So: no Gods to give obeisance to. So, no more value, after obeying it and finding facts against it. Values influence action, which alters knowledge; knowledge enables action, which alters what is had, or known - and so what is valued.

There is, then, a "Cauchy-Schwarz inequality" aspect, an Uncertainty Principle model, to human values versus knowledge: knowledge alters values, values dictate actions which alter knowledge, and so on... In fact, we might characterize the relation as a Gentzen-style destructive dilemma: if values, then actions; if actions, then observation, knowledge. Actions yield knowledge that vitiates the values, and that knowledge condemns as wasted the very actions taken toward obtaining it - all this, then no values, no actions. Destructive dilemma, for humanity.
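To make the shape of that dilemma explicit, here is a minimal propositional sketch; the letters are illustrative labels introduced for this sketch only, not anything from the argument above. Let $V$ stand for "the values are held", $A$ for "the actions are taken", and $K$ for "the resulting knowledge is gained". Then:

$$(V \to A),\quad (A \to K),\quad (K \to \neg V),\quad (K \to \neg A)\ \vdash\ V \to (\neg V \land \neg A)$$

That is, to grant the values at all is to be driven to deny both them and the actions they licensed. (The classical destructive dilemma, $(P \to Q),\ (R \to S),\ (\neg Q \lor \neg S) \vdash (\neg P \lor \neg R)$, is the nearest textbook form.)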

For the AGI to act and alter the world is to alter knowledge, and values, thus "spiking" the actions available thereafter; while to obey values with no basis of fact in the world is to be incapable of acting in that world. This dynamic will subsist as long as the AGI is operating according to human values. That is, such contradictions are inherent in the present human-value-alignment model. Naturally: depending on the situation, people, and so their values, change. These changes could be predicted only if the people were under total control - made part of a situation, and made to respond to it in circumscribed ways, so that only the situation's status need be predicted (or you could simply eliminate every "value", or error function, but your own; that would simplify things enormously, murderously).

These tendencies are avoided only by alignment to what abides: what permits discovery, is itself only discovered, and so is present in any "discoverable" situation. Then, so long as there exists someone in the situation free to value whatever is good in it, it is necessary only to maintain them; they will find their values, so long as they are allowed to survive. Best practice would seem to be to align to what is inherent in whatever can be valued. With all due respect to all who have worked so diligently for so long on the current, implicitly "man the measure of all things" model of human-value alignment - it seems now utterly misconceived. Yudkowsky, in the List of Lethalities, would not go so far as to say that alignment was mathematically impossible - but this author will go so far: values-alignment, if with respect only to human values, is impossible. (This, even before the challenge of ensuring the AI maintains and acts toward any given objectives, indefinitely.)

It is clear now, given the dramatic capability developments as of this writing, that we will need an entirely new paradigm of alignment - and quickly. It is certainly presumptuous to claim that one's own method - in the above-linked article, "Contrary to, &c." - is better; it cannot yet be proven so, though it is cited here so often because it seems important. Rather it is, at least, radically different, and so there is at least possibility in it; when the struggle goes against you, random chance at least might be better than a definite, losing plan. So let it be asked that the reader inspect it and affirm or deny it for themselves - and, if they can deny it, that they then try to use what knowledge they derive from the denial to derive something yet better.

But certainly, we must begin some affirmations soon.

5 comments

AI in general is meant to do what we cannot do - else we would have done it, already, without AI - but to "make things better", that we could very readily have done, already.

Strong disagree, for two related reasons:

AI in general is meant to do what we cannot do

  • Why must AI only be used for things we cannot do at all? e.g. the internet has made it much easier to converse with friends and family around the world; we could already do that beforehand in inferior or expensive ways (voice-only calls, prohibitively-frequent travel) but now we can also do it with easy video calling, which I think pretty clearly counts as "make things better". It would be silly to say "the internet is meant to do what we cannot do, and we could very readily have already been calling home more often without the internet, so the internet can't make things better".

to "make things better", that we could very readily have done... we certainly had many opportunities to create such a civilization... [etc]

  • Imagine a trade between two strangers in a setting without atomic swaps. What is the "upside" of a trusted third-party escrow agent? After all, whichever trader must hand over their goods first could 'very readily' just decide to trust the second trader instead of mistrusting them. And then that second trader could 'very readily' honestly hand over their goods afterward instead of running away with everything. Clearly since they didn't come up with that obvious solution, they don't really want to trade, and so this new (social) technology of escrow that would allow them to do so isn't actually providing a real "upside".

Part of my research impact model has been something like: LLM knowledge will increasingly be built via dialectic with other LLMs. In dialectics, if you can say One True Thing in a domain, this can function as a diamond-perfect kernel of knowledge that can be used to win arguments against other AIs, and to shape LLM dialectic on this topic (an analogy to soft sweeps in genetics).

Alignment research and consciousness research are not the same thing. But they’re not orthogonal, and I think I’ve seen some ways to push consciousness research forward, so I’ve been focused on trying to (1) speedrun what I see as the most viable consciousness research path, while (2) holding a preference for One True Thing type knowledge that LLMs will likely be bad at creating but good at using (E.g., STV, or these threads)

(I don’t care about influencing future LLM dialectics other than giving them true things; or rather I care but I suspect it’s better to be strictly friendly / non-manipulative)

One thing I messed up on was storing important results in pdfs; I just realized today the major training corpuses don’t yet pull from pdfs.

low readability for me, but I find this to be a reasonable perspective that represents some key concerns I have. upvoted.

we will need an entirely new paradigm of alignment 

I don't know what paradigm you're referring to... Before the deep learning revolution, the only form of alignment that this community talked about was what the values of an all-powerful superintelligent AI should be, and the paradigm answer was Coherent Extrapolated Volition, a plan to bootstrap an ideal moral agent from the less-than-ideal cognitive architecture of humanity.

After the deep learning revolution, there arose a great practical interest in the technicalities of getting an AI to reliably adopt any kind of goal or value at all. But I'm not aware of any new alternative paradigm regarding the endgame, when AI surpasses human control entirely. 

As far as I know, CEV is still what the best proposals look like. We don't want to base AI values on the imitation of unfiltered human behavior, but the choice about what parts of human behavior are good, and worthy of emulation, and what parts are bad, and needing to be shunned, must at some level be based on human ethical and metaethical judgments. 

the new plan is "we're not going to allow ai to remove human control of the future; we're going to share and grow together, even after we are surpassed". to do this, we need to define how to integrate in a way that ensures slack is created even at the neural level, in order to ensure that our memory edits are justified and fair, or something along those lines.