Just How Hard a Problem is Alignment?

It is commonly asserted that aligning AI is extremely hard because:

1. human values are complex: they have a high Kolmogorov complexity, and
2. they're fragile: if you get them even a tiny bit wrong, the result is useless, or worse than useless.

If these statements are both true, then the alignment problem is really, really hard, we probably only get one try at it, so we're likely doomed. So it seems worth thinking a bit about whether the problem really is quite that hard. At a Fermi-estimate level, just how big do we think the Kolmogorov complexity of human values might be? Just how fragile are they? If we had human values, say, 99.9% right, and the incorrect 0.1% wasn't something fundamental, how bad would that be — or is everything in human values equally fundamental?

What is the Rough Order of Magnitude of the Kolmogorov Complexity of Human Values?

There is a pretty clear upper bound on this (at least in the limiting case of arbitrary amounts of computer power). Given the complete genome for humans, and for enough crop species to build a sustainable agricultural culture, plus some basic biochemical data like the codon-to-amino acid table and maybe how to recognize introns (plus some non-human-specific environmental data about the climate on Earth, elemental frequencies, etc.), you could simulate humans. So that's starting from O(10Gb) of data. Depending on just how well you understood human physiology, you might need to throw a lot of processing power at this. For a proof of feasibility, let's assume you have a parallel quantum computer big enough and fast enough to simulate every atom in a human body at a reasonable speed: then you could clearly simulate a human. (In practice, humans are made mostly of water and organic chemicals warm enough that the range and duration of non-classical effects is extremely limited, usually with sub-picosecond decoherence times, so you probably only have to do quantum simulations up to the molecular or protein level or so, build a good semi-classical approximation model, and then do classical simulations from there. AlphaFold has been getting good results for protein folding with no quantum computation at all, just using ML to learn the semi-classical approximation for protein folding.) Something as smart as a post-FOOM GAI ought to be able to find ways to use things like emergent behavior and homeostasis to reduce that ludicrous processing power requirement significantly, probably by many orders of magnitude, but I have no grounds for saying by how many, so it's quite possible that even the reduced number is still ludicrous.
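As a rough check on the O(10Gb) starting figure, the raw information content of a genome can be sketched as two bits per base pair. The snippet below assumes a haploid human genome of about 3.1 billion base pairs (a rounded figure supplied here, not taken from the text above), ignores both compression and the crop genomes, and uses hypothetical helper names.

```python
import math

# Assumption: approximate haploid human genome length, rounded.
HUMAN_GENOME_BASE_PAIRS = 3_100_000_000

# Four possible bases (A, C, G, T) need log2(4) = 2 bits each, uncompressed.
BITS_PER_BASE = 2


def genome_bits(base_pairs: int, bits_per_base: int = BITS_PER_BASE) -> int:
    """Raw, uncompressed information content of a genome in bits."""
    return base_pairs * bits_per_base


def order_of_magnitude(x: float) -> int:
    """Exponent of the nearest power of ten, which is all a Fermi estimate needs."""
    return round(math.log10(x))


bits = genome_bits(HUMAN_GENOME_BASE_PAIRS)
print(f"{bits:.2e} bits, about {bits / 8:.2e} bytes")
print(f"nearest power of ten: 10^{order_of_magnitude(bits)} bits")
```

Under these assumptions the human genome alone comes to roughly 6 gigabits, so O(10Gb) reads naturally as an order-of-magnitude figure in bits, with the crop genomes adding to it.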

The ability to simulate an individual human probably isn't enough, so we need to increase the processing requirements further, to enough processing power to simulate entire groups or even societies of humans. If you could do that for long enough (and assuming for simplicity that you have no concerns about the suffering of simulated humans, only real ones), you could clearly eventually run enough research to figure out the full complexity of human values to arbitrary precision (especially if you could also do interpretability work on their neural nets to recover their internal representations). So, in the limit of seriously unlimited processing power, the upper bound on the Kolmogorov complexity of (non-social-contextual) human values is of the order of the size of the human genome: as a species, we aren't inherently any more complex than that (at least, until you start including contingencies like history and social context).

Likely most of the non-coding DNA in the genome is garbage and consists of things like transposons that have very little effect, or physiological homeostasis mechanisms irrelevant to all parts of human values outside of the specifics of pharmacology, so only important under rather specialized circumstances. Also, some "human values" might be predictable based on convergent evolution, since some of the content of the genome will be predictable just from the fact that this is the genome of a viable sapient species (making enough random mutations to a genome will almost invariably be fatal, i.e. the space of genomes with significant reproductive success is a good deal smaller than the space of all possible base-pair sequences). Between those effects, that upper bound might even be one or two orders of magnitude too high.

However, the Kolmogorov complexity of any problem depends on the amount of computational power you throw at it to compress/decompress the data, and that proposal used quite astonishing quantities of processing power: amounts that might well be beyond the reach of even a GAI that recently went FOOM. So we aren't interested in "in theory this information could be extracted from the human genome", but rather "in practice a GAI that was only mildly superhuman but still potentially dangerous would need these values spelled out in a format that it could actually understand and use". So we're more interested in the Kolmogorov complexity in a representation that could be unpacked by something human-comparable or only mildly superhuman.

You could also extract arbitrarily good medical, ergonomic, psychological, sociological, etc. data on how to treat humans from a simulation setup like that, in order to write the AI equivalent of an arbitrarily advanced medical, ergonomic, psychological and sociological literature, all of which sound like subfields of "human values"; the accumulation of all of these would presumably be extremely large. But what we're trying to define here is "how much data about humans we have to put into a GAI such that we would then be justified in trusting it to, say, cure cancer for us, or perhaps to solve the alignment problem and then cure cancer". That doesn't require us to tell it exactly how to cure cancer in advance. It only requires it to understand us well enough to know that we'd like cancer cured, that there are some costs that we are willing to pay for this and others that we are not, to have a good chance of telling these apart, and to be able to keep from accidentally killing all of us in the process. So, for a high-end estimate, I'm willing to throw in, say, a bookcase of textbooks each on medicine, ergonomics, sociology, ecology, and a hundred other subjects that sound useful for "how to not kill, damage, or silence humans". Let's Fermi estimate that at, say, a hundred good 1000-page tomes on each of a hundred subjects. That's O(100Gb), maybe O(1Tb) if those books are all well-illustrated. (Probably a little smaller since text compresses moderately well, especially if you use something as smart as an LLM to help you compress/decompress it, but this effect probably shifts the Kolmogorov complexity by less than an order of magnitude, so for a Fermi estimate it can be ignored.)
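The high-end estimate is just a product of a few factors, so it is easy to sketch. The bytes-per-page figure below is an assumption supplied here (roughly 500 words of plain text per page), the tenfold illustration factor mirrors the step from O(100Gb) to O(1Tb) in the paragraph above, and `library_bytes` is a hypothetical helper name.

```python
# Assumption: about 500 words of plain text per page at about 6 bytes per word.
BYTES_PER_PAGE = 3_000

# Assumption mirroring the text: illustrations add roughly one order of magnitude.
ILLUSTRATION_FACTOR = 10


def library_bytes(subjects: int, books_per_subject: int, pages_per_book: int,
                  bytes_per_page: int = BYTES_PER_PAGE,
                  illustration_factor: int = 1) -> int:
    """Fermi estimate of the size of a library of textbooks, in bytes."""
    pages = subjects * books_per_subject * pages_per_book
    return pages * bytes_per_page * illustration_factor


# A hundred 1000-page tomes on each of a hundred subjects.
text_only = library_bytes(100, 100, 1000)
illustrated = library_bytes(100, 100, 1000, illustration_factor=ILLUSTRATION_FACTOR)

print(f"text only:   {text_only:.1e} bytes")
print(f"illustrated: {illustrated:.1e} bytes")
```

With 3,000 bytes per page this gives about 3 × 10^10 bytes of plain text and 3 × 10^11 with illustrations, which sits within an order of magnitude of the figures above; a denser page assumption moves it closer.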

Now let's look for a lower-end estimate. Human zookeepers can look after most animals (short of ones that require deep-sea pressures or something equally extreme that zoos can't afford), and even breed most of them in captivity. They are not analyzing the animal's entire genome and simulating its biology at a cellular level in order to figure out how to do so, nor are they doing anything even close to that. They just read some number of books on zookeeping, maybe attend some lectures and get a degree in it, and keep up with current research on the particular species they're working with. Those books were written by earlier zookeepers who knew less, but managed to keep some of their charges alive while others died. They were doing biology-informed trial-and-error experimentation, extracting some number of bits of useful data each time an animal got sick or died on them, and the number of animals that died during this process was within what could be collected from the wild. When necessary, the zookeepers can call in a vet who has read a roughly comparable or perhaps slightly larger amount on veterinary medicine, mostly not very species-specific. If a human-equivalent and otherwise competent GAI read and understood, say, of the order of ten good 1000-page tomes on zookeeping that didn't overlap much, and another ten on veterinary medicine, I'd feel pretty confident that it could figure it out too, for one species. So that's somewhere around O(100Mb), maybe O(1Gb) if the books are well-illustrated. The quality of life of zoo animals is not great, and even if zoos were not so resource-limited and could build cages miles across for those animals that normally have territories that large (think nature parks), it might not be that great: some species don't yet even breed well in captivity (pandas are a well-known example, so are some tropical fish).
Nevertheless, a future only as aligned with human values as zoos are with animals' values would still be way better than the extinction of the human species, as long as there was a good chance of improvement from there. Obviously zoo animals are not sapient, and we are — but it's not clear that that makes the problem harder, rather than easier, since humans are also very communicative, and are more general optimizers than non-sapient animals. If some GAI was sufficiently aligned to keep us at zoo-park standards, and also interested in learning to do better, we would give it a great many complaints and suggestions. Many of us would even try to cooperate with it to improve its standards. A competent zookeeper should normally be able to tell if their charges are unhappy, but it may well be a challenge for them to figure out why or how to fix it, and they certainly aren't going to get research assistance in doing this from any of their charges. With humans, all you have to do is care and listen, and they will give you a (often fairly informed, if not always unbiased) opinion on what's wrong.

Another Fermi estimate datapoint: most human parents (with some unfortunate exceptions) manage to look after and raise human children reasonably well. Obviously it's hard to measure how much information about child-raising these parents previously learnt from their own parents or other sources, though it is pretty evident that the quantity and quality of that information varies and matters a lot. However, reading (and actually absorbing and implementing the advice from) even a handful of good books on parenting (say, O(10Mb) of data, allowing for some illustrations) can make a significant difference. So if that delta is noticeable, then let's ballpark the total at O(100Mb). Obviously there is also some instinctual component here, but my guess is that the learned component is larger, since people whose own parents were bad often do pretty badly as parents themselves, so I'm going to leave the order-of-magnitude estimate at O(100Mb).

Human parents also have the benefit of mirror neurons: humans are wired to be good at simulating the reactions of other humans (indeed, we tend to over-apply this and anthropomorphize anything that shows even moderately complex behavior). So to get away with that little human-values data, our GAI would clearly also need access to something good at simulating humans. So let's also give it a copy of some future GPT-N, and a carefully-crafted set of prompts that say something like "If you kept a person in conditions like ___, how happy or unhappy would they be? What would they complain about most, and how hard?", or that prompt the GPT-N to simulate responses from Mumsnet, plus instructions to regenerate the answer many times and gather statistics and suggestions. That GPT-N system clearly has some additional human values data in it, mixed in with everything else that you'd also expect to find in a future LLM, but it doesn't require any alignment-specific specialized research to build, so for the alignment problem it's fair to regard it (but not the alignment-specific prompts for it) as free. So I'm really asking about the Kolmogorov complexity of passably solving the initial alignment problem for a system that already has access to likely information resources such as an LLM, an internet crawl, and so forth.
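The "regenerate the answer many times and gather statistics" step can be sketched as a small sampling loop. Everything here is illustrative: `survey`, `make_stub_model` and the `sample_fn` callable are hypothetical names, the stub stands in for a real stochastic model call, and tallying replies by exact match is a deliberately naive placeholder for whatever clustering of free-text answers a real system would need.

```python
import itertools
from collections import Counter
from typing import Callable, Iterable

# Prompt wording taken from the text above; the blank is filled in per query.
PROMPT_TEMPLATE = (
    "If you kept a person in conditions like {conditions}, how happy or "
    "unhappy would they be? What would they complain about most, and how hard?"
)


def survey(conditions: str, sample_fn: Callable[[str], str],
           n_samples: int = 100) -> Counter:
    """Ask the same question n_samples times and tally the distinct replies.

    sample_fn is a hypothetical stand-in for one stochastic model call.
    """
    prompt = PROMPT_TEMPLATE.format(conditions=conditions)
    return Counter(sample_fn(prompt).strip().lower() for _ in range(n_samples))


def make_stub_model(replies: Iterable[str]) -> Callable[[str], str]:
    """Deterministic fake model that cycles through canned replies (demo only)."""
    cycle = itertools.cycle(list(replies))
    return lambda prompt: next(cycle)


stub = make_stub_model(["Too cramped", "too cramped", "Boring food"])
tally = survey("a small windowless room", stub, n_samples=6)
for complaint, count in tally.most_common():
    print(f"{count:3d}  {complaint}")
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM API; swapping the stub for a real client would not change the tallying logic.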

So, that gives us a rough order-of-magnitude range of somewhere between around 10^8 and 10^12 bytes for the Kolmogorov complexity of human values. So roughly speaking, they'd fit on some currently-available size of USB stick.
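To make the range concrete, the two endpoints can be converted into familiar storage units. This is a minimal sketch using decimal (SI) units; `human_readable` is a hypothetical helper written for this example.

```python
import math

UNITS = ["B", "kB", "MB", "GB", "TB", "PB"]


def human_readable(n_bytes: float) -> str:
    """Format a byte count using decimal (SI) units, e.g. 1e8 -> '100 MB'."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1000,
        # or at the largest unit available.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000
    raise AssertionError("unreachable: the loop always returns")


low, high = 10**8, 10**12  # the range quoted in the text
span = round(math.log10(high / low))

print(f"low end:  {human_readable(low)}")
print(f"high end: {human_readable(high)}")
print(f"span: {span} orders of magnitude")
```

The endpoints come out as 100 MB and 1 TB, four orders of magnitude apart, which is the sense in which the whole range is USB-stick-sized.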

That's not a trivial amount of information, and definitely not something we'd want to try to write a bug-free copy of with only one try to get it right. But it's also not that ludicrously intimidating a goal for a research project if you're allowed to make some minor mistakes, especially if we already have quite a lot of the medical, ergonomic, economic, parenting, psychological, etc. data that needs to be integrated into it. Better yet would be if we were able to supply most of that as just a bibliography pointing to external human texts plus some form of ontology-mapping/symbol-grounding translation program. Then, at least for a significant proportion of the data, only gathering the bibliography of trusted sources and constructing the translation system might require specialized alignment research.

How Fragile are Human Values?

Eliezer Yudkowsky has been known to point out that dialing a 10-digit phone number 90% correctly will not put you in contact with someone 90% similar to your friend. Here we're looking at a phone number with somewhere from maybe the order of a hundred million to a trillion digits. Let's suppose we can dial 99% of them correctly, or even 99.9%: how bad will the result be? Will that be friendly? Phone numbers have no correlation between their digits and the personality of the person reached [arguably not completely true of the area code digits]. Is that true of alignment data, even if it's sensibly structured in some way, hopefully one that's not too hard to find (rather than, say, encrypted so that one bit of damage ruins the rest of the document)?
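To put numbers on the analogy, the count of misdialed digits at a given accuracy is simple arithmetic. The sketch below uses a hypothetical helper, `expected_wrong_digits`, and exact fractions to avoid floating-point rounding. It only counts errors and says nothing about which of them matter, which is the question the rest of this section takes up.

```python
from fractions import Fraction


def expected_wrong_digits(n_digits: int, accuracy: str) -> int:
    """Number of wrong digits when a fraction `accuracy` of them is correct.

    `accuracy` is a decimal string such as "0.999" so it converts exactly.
    """
    wrong = n_digits * (1 - Fraction(accuracy))
    return int(wrong)


# A 10-digit phone number, then the two ends of the size range discussed above.
for n_digits in (10, 10**8, 10**12):
    for accuracy in ("0.9", "0.99", "0.999", "0.9999"):
        wrong = expected_wrong_digits(n_digits, accuracy)
        print(f"{n_digits:>13,} digits at {accuracy} accuracy: {wrong:,} wrong")
```

Even at 99.9% accuracy a hundred-million-digit "number" contains a hundred thousand wrong digits, so whether the result is usable depends entirely on how much each wrong digit costs.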

I'm pretty sure that bad human parents, or bad zookeepers, aren't reaching 99% accuracy on zookeeping or parenting — at least for parents (I can't speak to zookeeping), usually concise descriptions of what they did wrong are very short, often somewhat repetitive across multiple examples, and for parenting mistakes, usually sound like really obvious mistakes to any reasonably good parent. So often bad human parents have made one of a fairly small set of avoidable mistakes. But I'd be willing to grant bad human parents 90% accuracy on parenting (if one of them wrote a book on parenting, I'd expect to still agree with the majority of individual sentences in it, probably 90% of them, but probably not with 99% of them), and also I'd believe that someone with only 90% accuracy on zookeeping would kill their charges fast. But I think by the time we get to 99% or 99.9% accuracy, and certainly by 99.99% accuracy, we're now discussing the difference between good zookeepers/parents and perfect ones. Obviously you can get unlucky — if you have a medical textbook 99.99% memorized but happen to believe that strychnine in sufficient quantities cures the common cold, I don't want you as my doctor. Some facts are just more important than others, and you do need to get the basics right. But likely, if they were well-structured, most of the data filling out the bulk of that Kolmogorov complexity is not going to be basics, it's surely going to be secondary and tertiary-importance material, and corner cases, just like any other field of study.

Is that assumption valid, or could it be the case that if you had human values say 99.9% correct, every possible choice of 0.1% of it that you might have wrong is pretty much equally fundamental, and there are no meaningfully secondary or tertiary parts, it's all essentials? I have great difficulty believing this: it feels like you're asking for either a huge coincidence for everything to be about equally vital, or else for everything to interact strongly with everything else. (Humans are an evolved system, so their design doesn't need to avoid inter-dependencies for the same reason that a human-engineered system would, namely that it would cognitively overload the engineer. But evolved systems do generally tend to avoid cascading inter-dependencies, simply because they're fragile. In fact, biochemical pathways are frequently full of feedback loops that look rather like they're specifically evolved to avoid everything depending on everything else, allowing alleles of A to be inherited/evolve without having too much effect on C and D, because their interaction goes through B1, B2, and B3, which between them implement a stabilizing feedback loop that reduces the effect of variations in A on the functioning of C and D.) Certainly in individual fields that are part of or relate to human values, like parenting, medicine, zookeeping, ergonomics, and every other related subject that I can think of, there are some mistakes that are reliably quickly fatal (treating colds with large amounts of strychnine, for example), others that are often slowly fatal, and plenty that are merely harmful, or annoying, or will only bite you badly if certain unusual circumstances happen to crop up. Now, if you wait long enough, any unlikely circumstance will eventually come up, so "is this vital?" depends to some extent on whether the time-frame you're talking about is months, years, or centuries, which in turn depends on how fast our GAI goes FOOM/is corrigible/solves the alignment problem for us. It is possible, and for obvious reasons common practice, to structure teaching these fields to start with the really important stuff, and concentrate a lot on how to not make certain mistakes that are often quickly fatal ("first of all, do no harm"), and then fill in the less vital stuff from there. So they do divide into a spectrum from essentials through secondary and tertiary, and generally the essentials are a smallish proportion of the total. Even in what we think of as hard, very interdependent fields such as aerospace engineering, while there are a lot of subsystems on a plane where a breakdown in any of them would be likely to cause a fatal crash or at least onboard fatalities (and that thus are tested and maintained regularly and carefully, and where possible are redundant), most of the components in the seating, galley, entertainment system, lighting, toilets, and the inside of the cargo hold don't fall into that category. There are subjects that have a high proportion of essentials, but they tend to involve either pushing systems close to some engineering limit, like aerospace, or an adversarial situation like security where you have an intelligent opponent trying to find and take advantage of your mistakes. If we ever get into an adversarial situation with a superhuman GAI, it will win, so we have to only play cooperative games with one, like helping it understand us better so it can help us better.

I'd like to run a little thought experiment here. Suppose humans had somehow managed to become a space-faring species without either building GAI far above human capabilities or becoming significantly transhuman (so just like most science-fiction series on TV). Suppose we then discover a planet that once had a sapient species on it (who, unlike on TV, were not just like humans except with pointed ears or forehead ridges, but had a physiology, psychology and biochemistry that was genuinely alien to us), who had unwisely managed to wipe themselves out at about an early-21st-century-like technological level (perhaps by means of a bioengineered plague, or something like that). Suppose that much of their ecosystem still exists, and that we humans have done our xenoarcheology and reverse engineering, and thus now have access to quite a lot of information:

1. We have their analogs of the Library of Congress and a full Google webcrawl, including the contents of their analogs of YouTube, TikTok, Netflix, PubMed, ArXiv, etc.
2. From those, we've cracked their languages (perhaps some of them left us a decent analog of the Arecibo message before succumbing to the engineered plague), and figured out how to do passable automated translations between the alien languages and human languages. Since they're very alien to us and vice versa, things tend to get a lot longer when translated, since you have to include the necessary footnotes or references explaining alien concepts, but we have a way to generate those footnotes/references and the process works better than current Google Translate does, though not perfectly (doing this without help from GAI would be a lot of work, but humor me: perhaps we have stable fast human-level GAI that absolutely cannot self-improve, or something, or perhaps we just put an awful lot of man-hours and sub-human AI cycles into the translation project). Note that the translation fallibility here includes the footnotes/references occasionally being wrong, incomplete, uncertain, missing, or not in fact relevant to the document being translated.
3. We also have a full genome for their species, with allele frequencies for common variations, which they collected before wiping themselves out (they were good at biotech).

Suppose we feel sorry for this species, sad that they're gone and we can't talk to them, so we decide that we want to resurrect their species. With the help of their texts, we study the alien biochemistry and design artificial alien wombs (or if they were egg-layers, then artificial alien oviducts) until that's technologically possible for us. At that point, two plans are discussed by the resurrection project staff:

1. We attempt to rebuild their civilization at some fairly advanced level, say somewhere in the range of 17th through early 21st century-equivalent technological level, build appropriate infrastructure for this, raise a generation of them at that knowledge/technological level, and let them continue again from there (along with detailed advice on what NOT to do from that point, like "Don't bioengineer plagues: if you do, we won't resurrect you again").
2. Or, we start them off at a hunter-gatherer level in the best match we can reconstruct for their equivalent of Africa, let their hunter-gatherer culture stabilize, and leave them alone and watch from a distance for a couple of hundred thousand years before giving them advice on what NOT to do this time. (Or maybe we periodically drop hints at appropriate times, e.g. working models of their equivalents of an atlatl, a bow, stirrups, a plough, a printing press drawn from their own archeology, and wait ten thousand years.)

Let us suppose that proponents of plan 2 are not arguing for their position on any moral grounds, nor from some form of Prime Directive/let them have full control of their own destiny argument, nor even from curiosity about observing alien hunter-gatherers. They're simply claiming that option 1 is impossible, because it requires raising a generation of alien children, and we are too alien to them to be able to figure out how to do that, so the effort would inevitably be doomed. Sure, we may have figured out how to build and program amazingly complex animatronic remote-controlled alien-like puppets to do the necessary physical operations of parenting for us, but we don't know how to run them, since it's just impossible to be sure that we're sufficiently aligned with the aliens' values, and their values are too fragile for the project to ever work: getting even a tiny fraction of them wrong would doom the effort. Getting only 99.9% or 99.99% aligned with them cannot possibly be enough to get our generation of alien kids to reach adulthood alive and not be terminally damaged in some way.

This argument seems implausible to me. I don't think even good alien parents would actually be 99.99% aligned with perfection of alien parenting, but I think their kids would still turn out OK. Item 1 above includes a great many alien books on their equivalents of parenting, medicine, design, ergonomics, psychology, and so forth, which we can translate, and alien videos on similar subjects, plus many that at least show samples of these. It also includes enough data to train an alien-language GPT-N-equivalent that can do decent summarization in their languages, and that we can run prompts against, and we have decent translation software to generate prompts and read the replies. So we can effectively ask questions about translated summaries of our plans on our choice of their previous Internet's parenting forums, and get back GPT-N-quality simulated alien answers. We can also get simulated alien kids' responses to comments functionally equivalent to "My parents make me go to bed by 10pm. That's so unfair!" on alien kids' forums. (I'm here assuming item 1 also includes chat logs and content for their equivalents of Roblox.) So even if our human mirror neurons are telling us all sorts of wrong things about how we expect the alien kids to feel, and thus we have no valid intuitive sense for how to treat them well, we can extract more accurate information on this subject from their Internet and libraries, both by search and by LLM technology.

Even better, as soon as the alien kids can talk, they can tell us (via the animatronic puppets and our translation software) what's wrong, and also complain, cajole, wheedle, beg, and throw tantrums, just like kids always do — and we can improve our understanding from there.

Even assuming we took the project seriously and devoted sufficient resources to it, I'm not confident that the result would be perfect, or even as good as good alien parents could do. We could certainly fail repeatedly by falsely anthropomorphizing the alien kids and not consulting our alien GPT-N. We might even screw up horribly on our first try, poison all the kids in the first year or two because of some ghastly nutritional error or something, and need to start again, but I think the probability of that should be fairly low, O(<10%). I'm hopeful, say O(>50%), that we could pretty quickly converge to at least doing a better job than abusive alien parents would have, and raise a generation that wasn't so emotionally crippled that they couldn't then raise their own children better. So I'd be in the camp "plan 1 looks plausibly viable to me, but we should definitely start with a small test cohort while we shake initial mistakes out".

One possible counterargument here is to suggest that I swept the hard part under the rug into item 2, the ability to do passable translations, even ones with long mostly-accurate footnotes. For the AI alignment problem, the equivalent is translating between human and AI modes of thought with passable accuracy. AI needs to be able to understand our documents and speech before it can try to figure out from them how to align with us. However, that's not alignment-specific research — it's generally useful, even essential capabilities research. The Internet and the Library of Congress are far more useful resources to an AI that can understand and use them, and LLMs show a suggestive proof-of-concept for how this might work. Also note that a GAI that couldn't do this might well be far less of an immediate risk — if it were completely incapable of this, it would need to rediscover its own technology and the engineering for all possibly-useful additions to it from scratch (while stuck at near-human capability levels, assuming we didn't directly build anything deeply super-human) before it could start to self-modify, which should massively delay it being able to go FOOM, unless it going FOOM is really just as simple for it as obtaining a lot more GPUs. So I think that assuming at least passable, sometimes fallible translation in my thought-experiment above is both reasonable and necessary.

So my intuition is that (when sensibly organized) human values are not that fragile (not as fragile as, say, poorly structured code): once you get past the fundamentals, in some meaningful sense getting them right to some small number of nines is likely to be good enough to get you to somewhere that, while not great, is at least survivable (significantly better than the extinction of the human race, say) and that could be built upon and improved. This further improvement requires that the AI is trying to get closer to human values, hasn't started from something terminally broken (i.e. that we did have the fundamentals right), and doesn't meanwhile make some massive over-optimization change to our society based on an incorrect model of human values, thus pushing human society far outside previously observed ranges before its model of human values has both converged to a few more nines and also been accurately extended that far outside previously observed ranges. I.e. I'm also assuming prudence and caution from the AI, rather than it attempting to double our standard of living every week or month before it knows us well, and instead killing us. How to achieve that is a separate discussion, but I would observe that prudence and caution are rational, convergent values in any embedded agent that has to cope with the real world: even a paperclip maximizer would want them if it didn't already have them.

To be clear, except for my initial unlimited-compute upper bound of the size of the human genome, I'm not trying to estimate the Kolmogorov complexity of a final model of human values that a FOOMed value learner might converge to if successful. In any useful format less drastically compressed than the human genome, I suspect that might be truly vast. What I'm asking is how much alignment-specific information about human values we might need to put in at the start to have a reasonable chance that we've got a value learner into the basin of attraction around that final model (and for the purpose of this post I'm here assuming that that does have a basin of attraction; if not, then clearly we're back to "we're doomed", because the best we can hope for is a zoo-animal quality of life until the sun goes red giant). That's the problem that we actually need to be convinced that we've solved before turning a GAI value learner on.

My Tentative Conclusions

So, my feeling is that:

1. Human values are large, but not vast, with a Kolmogorov complexity (at least for a passably good usable representation, enough to start a value learner converging from) probably somewhere in the gigabytes or possibly even high megabytes range (for the high megabytes, there was an assumption of supplementing it with some more general-purpose non-alignment-specific ability to simulate or understand humans, such as a future LLM).
2. Suitably priority-structured, human values are not that fragile: as long as you can get the hopefully-significantly-smaller fundamental essentials correct for how to keep humans alive and undamaged, and not mind-controlled to say what you want to hear or otherwise blocking corrigibility, then some low-number-of-nines-like accuracy on the remaining bulk of human values is probably enough to start off with, barring unfortunate accidents, as long as the GAI keeps converging on a better model of them, and in the meantime is sufficiently cautious and aware of impact not to try anything overambitious that we can't recover from if it fails. (I would personally suggest that the essentials are somewhere around the equivalent for a sapient species of some number of good zookeeping/parenting/nutrition/veterinary books on how to keep humans and breed them in captivity, plus the basics of running and interpreting surveys across them, and a "theory of mistakes" for when to put less trust in what they say, such as when they're drunk or sick.) Also, a human "theory of mistakes", or at least reasonable priors when learning one, is going to be informed by evolution: the fact that their brains work less well when they are sick is as predictable as that every other organ system is less likely to be at peak performance when they're sick, and many human cognitive biases are pretty clearly either instinctual holdovers from before we became sapient or evolved shortcuts that often worked well for hunter-gatherers on the savannah.
3. There's probably enough information in the Internet, Library of Congress, YouTube, PubMed, etc. to do a passable job of figuring human values out, if you analyzed it right, even for something with capabilities not far beyond what a large organization of humans could do with enough effort. And of course there's even more information built into human law, infrastructure, and so forth.
4. Plus, you can always ask humans for feedback, assistance, and cooperation (as long as they're not extinct, or mind-controlled to always agree with you, or you've otherwise managed to cripple corrigibility). Failing that, you can ask something like GPT-N for a simulation of human wisdom-of-crowds feedback, for the milieu of the corpus it was trained on.

So I think that suggests that our goal should be to, at a minimum:

1. Build a GAI that could do at least as good a job of the alignment task as I believe the humans would in my hypothetical scenario (i.e. has at least large-organization-of-humans-equivalent capabilities for this task; if it doesn't, it's not much of a GAI).
2. Give it some number of megabytes or gigabytes of human values, with the basic core of "how to not kill us or silence corrigibility" correct, and the larger "nice to have" parts accurate to at least some small-but-sufficient number of nines.
3. Make sure it actually wants to solve the alignment problem, and will keep improving its model of human values from passable to workable to acceptable to good to excellent, gradually (by its standards, if not ours) expanding it and adding more nines of accuracy.
4. Ensure it cares enough and understands caution enough to not blindly use its not-yet-that-good model of human values well outside its current region of validity/accuracy, especially not without first running small experiments or prototypes.
Item 4 may well make it challenging for our helpful GAI to go FOOM fast: beyond whatever the actual technical challenges of FOOMing are, it needs to figure out how to do this cautiously, without accidentally killing us. So in the meantime, we and it need to figure out how to avoid anyone else foolishly building a paperclip maximizer that goes FOOM faster because it's less cautious.

(Those items are, of course, in logical, not chronological, order.)

If we managed to get all that right, then my best guess is that the resulting scenario might not be entirely comfortable to live through, but it should have a good chance of being a lot less bad than permanent human extinction.

If the above is correct, this reduces the severity of the alignment problem a little, from "we're doomed" to a distinctly challenging-looking project that we might actually have some leverage on. In particular, a lot of the data on human values and how to treat humans already exists, in human media in a form intended for humans (often various types of human specialists). How do we make it comprehensible to a GAI that doesn't have mirror neurons for humans? We don't yet know exactly how a GAI will be constructed, but we do know that current LLMs can turn up to a few kilobytes of human-readable text into a complex set of embeddings that a GAI almost certainly could learn to understand. How would we do that for, say, a thousand-page academic text on medicine or ergonomics or psychology? Or for an entire legal code? Or even just a couple-of-hundred-page popular guide to parenting, or nutrition, or similar subjects?

From this point of view, certain sorts of capabilities research (such as translation and comprehension of human documents) start to look rather like they might also have alignment applications. We need to build something that understands us, and that will use that understanding to guide it in helping us, not to manipulate us for its own aims, since its only aim is to correctly understand and implement our aims.

This entire post is an initial Fermi estimate, and it's usually much better to collect Fermi estimates from several people. I would love to hear other people's opinions on the rather broad range of sizes that I have suggested here for a minimal "good enough to get the value learner GAI into the basin of attraction and have it not kill or silence us while improving it" version of human values, on roughly what proportion of that is essential and has to be got completely right, and on the rough level of accuracy required in the less-than-essential parts of it.
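To make it easier to plug in your own numbers, here is a minimal sketch of the arithmetic behind those three questions. Everything in it is an illustrative assumption rather than a claim from this post: the function name `inaccurate_bytes`, the 1% essential fraction, the candidate sizes, and above all the crude simplification that "N nines of accuracy" means a fraction 10^-N of the non-essential bytes is wrong.

```python
MB = 10**6
GB = 10**9


def inaccurate_bytes(total_bytes: float, essential_fraction: float, nines: int) -> float:
    """Estimate how much of the non-essential bulk is wrong.

    Assumptions (hypothetical, for illustration only):
    - the essential part is exactly right, so it contributes no error;
    - "nines" of accuracy means an error rate of 10**-nines on the remainder.
    """
    bulk = total_bytes * (1.0 - essential_fraction)
    error_rate = 10.0 ** (-nines)
    return bulk * error_rate


if __name__ == "__main__":
    # Candidate sizes spanning the high-megabytes-to-gigabytes range.
    sizes = [("100 MB", 100 * MB), ("1 GB", GB), ("10 GB", 10 * GB)]
    for label, size in sizes:
        for nines in (1, 2, 3):
            # 0.01 is a placeholder guess for the essential fraction.
            wrong = inaccurate_bytes(size, essential_fraction=0.01, nines=nines)
            print(f"{label}, {nines} nine(s): ~{wrong / MB:.2f} MB inaccurate")
```

Changing `essential_fraction`, the sizes, or the range of nines gives a quick way to compare different people's estimates on a common footing.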

Obviously all this is highly contingent on how hard you think it will be to build a value learning GAI that will be cautious about using its optimizing ability based on a model of human values that isn't yet fully accurate (my opinions on that are in Breaking the Optimizer's Curse, and their feasibility is likely to be a crux for this discussion).

[Link preview photo by Steve Payne on Unsplash]