From the perspective of someone who mostly has no clue what they are talking about (that person being me), I don’t understand why people working in AI safety seem to think that a successful alignment solution (as in, one that stops everyone from being killed or tortured) is something that is humanly achievable.

To be more clear, if someone is worried about AI x-risk, but is also not simultaneously a doomer, then I do not know what they are hoping for.

I think there’s a fairly high chance that I’m largely making false assumptions here, and I understand that individual alignment schemes generally differ from each other significantly.

My worry is that different approaches, underneath the layers of technical jargon which mostly go over my head, are all ultimately working towards something that looks like “accurately reverse engineer human values and also accurately encode them”.

If the above characterization is correct, then I don’t understand why the situation isn’t largely considered hopeless. Do people think that would be much more doable/less complex than I do? Why?

Or am I just completely off-base here? (EDIT: by this I mean to ask if I’m incorrect in my assumption of what alignment approaches are ultimately trying to do)

6 Answers

Quintin Pope

Sep 03, 2022

275

I’m an alignment researcher, and I think the problem is tractable. Briefly:

  • Humans manage to learn human values, despite us not having an exact formalism of our values. However, humans in different cultures learn different values, and humans raised outside of civilisation don’t learn any values, so it’s not like our values are that strongly determined by our biology. Thus, there exists at least one class of general, agentic learning systems that usually end up aligned to human values, as a result of a value formation process that adds human values to a thing which initially did not have such values. Fortunately, the human value formation process is available to study, and even to introspect on!
  • Deep learning has an excellent track record of getting models to represent fuzzy human concepts that we can’t explicitly describe. E.g., GPT-3 can model human language usage far more accurately than any of our explicit linguistic theories. Similarly, Stable Diffusion has clearly learned to produce art, and at no point did philosophical confusion about the “true nature of art” interfere with its training process. Deep learning successes are typically preceded by people confidently insisting that deep models will never learn the domain in question, whether that domain be language, art, music, Go, poetry, etc. Why should “human values” be special in this regard? (A toy sketch of this point follows the list.)
  • I don’t buy any of the arguments for doom. E.g., the “evolution failed its version of the alignment problem” analogy tells us essentially nothing about how problematic inner alignment will be for us because training AIs is not like evolution, and there are evolution-specific details that fully explain evolution’s failure to align humans to maximizing inclusive genetic fitness. I think most other arguments for doom rest on similarly shaky foundations.
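As a toy illustration of the second bullet, a small classifier can pick up a fuzzy human concept such as "politeness" purely from labeled examples, without anyone ever writing down an explicit definition. The examples, labels, and model choice below are invented for illustration; real systems use far larger models and datasets.

```python
# Toy sketch: learn a fuzzy human concept ("politeness") from examples alone,
# with no explicit definition of the concept anywhere in the code.
# All examples and labels below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Could you please pass the salt when you get a chance?",
    "Thank you so much for your help yesterday.",
    "Give me that right now.",
    "Nobody cares what you think, be quiet.",
    "Would you mind closing the door?",
    "Get out of my way.",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = polite, 0 = impolite (human judgments)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Probability the model assigns to "polite" for a new sentence.
print(model.predict_proba(["Could you possibly help me with this?"])[0][1])
```

The argument is that, scaled up, "human values" would be one more fuzzy concept a capable learner picks up from data, rather than something that first requires a finished philosophical formalism.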

But what is stopping any of those "general, agentic learning systems" in the class "aligned to human values" from going meta — at any time — about its values and picking different values to operate with? Is the hope to align the agent and then constantly monitor it to prevent deviancy? If so, why wouldn't preventing deviancy by monitoring be practically impossible, given that we're dealing with an agent that will supposedly be able to out-calculate us at every step?

TAG · 2y · 10

If there's no ultimate set of shared values, that could lead to a situation where different cultures build AIs with their own values... LiberalBot, ConfucianBot, ChristianBot, and so on.

Charlie Steiner

Sep 03, 2022

93

I'm an alignment researcher. It's not obvious whether I'm a doomer (P(doom) ~ 50%), but I definitely think alignment is tractable and research is worth doing.

Quintin is on the extreme end - he's treating "Learn human values as well as humans do / Learn the fuzzy concept that humans actually use when we talk about values" as a success condition, whereas I'd say we want "Learn human values with a better understanding than humans have / Learn the entire constellation of different fuzzy concepts that humans use when we talk about values, and represent them in a way that's amenable to self-improvement and decision-making."

But my reasons for optimism overlap with his. We're not trying to learn the Final Form of human values ourselves and then write it down. That problem would be really hard. We're just trying to build an AI that learns to model humans in a way that's sufficiently responsive to how humans want to be modeled.

The extra twist I'd add is that doing this well still requires answering a lot of philosophy-genre questions. But I'm optimistic about our ability to do that, too. We have a lot of advantages relative to philosophers, like: if a property we'd want the AI to have turns out to be impossible, we don't keep arguing about it; we remember we're in the business of designing real-world solutions and ask for a different property.

bayesed

Sep 03, 2022

63

AFAIK, it is not necessary to “accurately reverse engineer human values and also accurately encode them”. That's considered too hard, and as you say, not tractable anytime soon. Further, even if you're able to do that, you've only solved outer alignment; inner alignment still remains unsolved.

Instead, the aim is to build "corrigible" AIs. See Let's See You Write That Corrigibility Tag, Corrigibility (Arbital),  Hard problem of corrigibility (Arbital).

Quoting from the last link:


The "hard problem of corrigibility" is to build an agent which, in an intuitive sense, reasons internally as if from the programmers' external perspective. We think the AI is incomplete, that we might have made mistakes in building it, that we might want to correct it, and that it would be e.g. dangerous for the AI to take large actions or high-impact actions or do weird new things without asking first. 

We would ideally want the agent to see itself in exactly this way, behaving as if it were thinking, "I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this is a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I've done this calculation showing the expected result of the outside force correcting me, but maybe I'm mistaken about that."
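As a toy caricature of the behaviour described above, the sketch below has an agent defer to an overseer before taking large, high-impact, or novel actions. All names, thresholds, and the overseer stub are hypothetical; a bolted-on check like this is precisely what the hard problem says is insufficient on its own, since the goal is an agent that genuinely reasons this way from the inside.

```python
from dataclasses import dataclass

IMPACT_THRESHOLD = 10.0  # arbitrary illustrative cutoff


@dataclass
class Action:
    name: str
    expected_utility: float   # the agent's own (possibly mistaken) estimate
    estimated_impact: float
    is_novel: bool


def ask_overseer(action: Action) -> bool:
    """Stand-in for the 'outside force': a human reviews the proposed action."""
    return input(f"Approve '{action.name}'? [y/N] ").strip().lower() == "y"


def choose_action(candidates: list[Action]) -> Action | None:
    best = max(candidates, key=lambda a: a.expected_utility)
    # Super-high utility estimates may be dangerously mistaken, so anything
    # big, high-impact, or weirdly new gets run past the overseer first.
    if best.estimated_impact > IMPACT_THRESHOLD or best.is_novel:
        return best if ask_overseer(best) else None  # accept correction
    return best


if __name__ == "__main__":
    plan = choose_action([
        Action("reply to an email", expected_utility=1.0,
               estimated_impact=0.1, is_novel=False),
        Action("rewrite own training objective", expected_utility=99.0,
               estimated_impact=50.0, is_novel=True),
    ])
    print(plan)
```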

Also, most, if not all, researchers think alignment is a solvable problem, but many think we may not have enough time.

gbear605

Sep 03, 2022

60

I suspect that many researchers consider it probably hopeless, but still worth working on given their estimates of how possible it is, how likely unaligned AI is, and how bad/good unaligned/aligned AI would be. A 90% chance of being hopeless is worth it for a 10% chance of probably saving the world. (Note that this is not Pascal’s Wager, since the probability is not infinitesimal.)
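To see why this is not Pascal's Wager, a rough expected-value comparison with made-up, purely illustrative numbers is enough; the probabilities involved are small but ordinary, not infinitesimal.

```python
# Illustrative expected-value comparison; every number here is made up.
p_alignment_work_succeeds = 0.10   # "90% chance it's hopeless"
value_of_aligned_future   = 1.0    # normalise "world saved" to 1
cost_of_trying            = 0.001  # careers and funding, on the same scale

ev_try      = p_alignment_work_succeeds * value_of_aligned_future - cost_of_trying
ev_dont_try = 0.0

print(f"EV of working on it: {ev_try:.3f} vs. not working on it: {ev_dont_try:.3f}")
# A 10% shot at an enormous payoff dominates the cost, with no appeal to
# infinitesimal probabilities times infinite utilities.
```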

[anonymous] · 2y · 42

This comment got me to change the wording of the question slightly. “so many” was changed to “most”.

You answered the question in good faith, which I’m thankful for, but I don’t feel your answer engaged with the content of the post satisfactorily. I was asking about the set of researchers who think alignment, at least in principle, is probably not hopeless, whom I suspect to be the majority. If I failed to communicate that, I’d definitely appreciate it if you could give me advice on how to make my question clearer.

Nevertheless I do agree with everything you’re saying, though we may be thinking of different things here when we use the word “many”.

Roman Leventov

Sep 03, 2022

2-2

(Disclaimer: below are my thoughts from a fairly pessimistic perspective, upon reading quite a lot of LW recently, but not knowing almost any researcher personally.)

If someone is worried about AI x-risk, but is also not simultaneously a doomer, then I do not know what they are hoping for

I think they also don't know what they hope for. Or they do hope for the success of a particular agenda, such as those enumerated in Nate Soares's recent post. They may also not believe that any of the agendas they are familiar with will lead to the long-term preservation of flourishing human consciousness, even if the agenda technically succeeds as stated, because of unforeseen deficiencies of the alignment scheme, incalculable (or calculable, but inextricable) second- and third-order effects, systemic problems (Christiano's "first scenario"), coordination or political problems, inherent attack-defence imbalance (see Bostrom's vulnerable world hypothesis), etc. Nevertheless, they may still be "vaguely hopeful" out of general optimism bias (or, in a snarky mood, you might call it a physiological defence reaction).

This is actually an interesting situation we are in: we don't know what we hope for (or at least don't agree about this -- the discussion around the Pivotal Act is an example of such a disagreement). We should imagine in enough detail a particular future world state, and then try to steer ourselves towards it. However, it's very hard, maybe even inhumanly hard, to think of a world with AGI which would actually behave in real life the way we imagine it will (or will be achievable from the current world state and its dynamics), because of the unknown unknowns, the limits of our world modelling capacity, and the out-of-distribution generalisation (AGI is out of distribution) that we need to perform.

Absent such a world picture with a wide consensus around it, we are driving blind, not in control of the situation but with its dynamics controlling us.

deepthoughtlife

Sep 03, 2022

1-3

As individuals, humans routinely and successfully do things much too hard for them to fully understand. This is partly due to innately hardcoded stuff (mostly for things we think are simple, like vision and controlling our bodies' automatic systems), somewhat due to innate personality, but mostly due to the training process our culture puts us through (for everything else).

For their part, cultures can take the inputs of millions to hundreds of millions of people (or even more when stealing from other cultures) and distill them into both insights and practices that absolutely no one would have ever come up with on their own. The cultures themselves are, in fact, massively superintelligent compared to us, and people are effectively putting their faith either in AI being no big deal because it is too limited, or in the fact that we can literally ask a superintelligence for help in designing things much stupider than culture so that they don't turn on us too much.

AI is currently a small sub-culture within the greater cultures, and struggling a bit with the task, but as AI grows more impressive, much more of culture will be about how to align and improve AI for our purposes. If the full might of even a midsized culture ever sees this as important enough, progress on alignment will probably become quite rapid, not because it is an easy question, but because cultures are terrifyingly capable.

At a guess, alignment researchers have seen countless impossible tasks fall to the midsized 'Science' culture of which they are a part, and many think this is much the same. 'Humanly achievable' means anything a human-based culture could ever do. This is just about anything that doesn't violate the substrates it is based on too much (and you could even see AI as a way around that). Can human cultures tame a new substrate? It seems quite likely.

2 comments

I mean, it might be tractable? It's hard to be certain that something isn't tractable until there is a much clearer picture of what's going on than alignment theory gives so far. Usually, knowledge of whether something is big-picture tractable isn't that relevant for working on it. You just find out eventually, sometimes.

[anonymous] · 2y · 10

I accept that trying to figure out the overall tractability of the problem far enough in advance isn’t a useful thing to dedicate resources to. But researchers nevertheless seem to have expectations about alignment difficulty, despite not having a “clearer picture”. For the researchers who think that alignment is probably tractable, I would love to hear about why they think so.

To be clear, I’m talking about researchers who are worried about AI x-risk but aren’t doomers. I would like to gain more insight into what they are hoping for, and why their expectations are reasonable.