From the perspective of someone who mostly has no clue what they are talking about (that person being me), I don’t understand why people working in AI safety seem to think that a successful alignment solution (as in, one that stops everyone from being killed or tortured) is humanly achievable.
To be clearer: if someone is worried about AI x-risk but is not also a doomer, then I do not know what they are hoping for.
I think there’s a fairly high chance that I’m largely making false assumptions here, and I understand that individual alignment schemes generally differ from each other significantly.
My worry is that the different approaches, underneath the layers of technical jargon that mostly go over my head, are all ultimately working towards something that looks like “accurately reverse engineer human values and also accurately encode them”.
If the above characterization is correct, then I don’t understand why the situation isn’t widely considered hopeless. Do people think that task is much more doable/less complex than I do? Why?
Or am I just completely off-base here? (EDIT: by this I mean to ask whether I’m wrong in my assumption about what alignment approaches are ultimately trying to do)
As individuals, humans routinely succeed at things much too hard for them to fully understand. This is partly due to innately hardcoded machinery (mostly for things we think of as simple, like vision and controlling our bodies’ automatic systems), somewhat due to innate personality, but mostly due to the training process our culture puts us through (for everything else).
For their part, cultures can take the inputs of millions to hundreds of millions of people (or even more when stealing from other cultures) and distill them into insights and practices that absolutely no one would have come up with on their own. Cultures themselves are, in fact, massively superintelligent compared to us. People are effectively putting their faith either in AI being no big deal because it is too limited, or in the fact that we can literally ask a superintelligence (culture) for help in designing things much stupider than culture, so that they don’t turn on us too much.
AI is currently a small subculture within the greater cultures, and it is struggling a bit with the task, but as AI grows more impressive, much more of culture will be devoted to aligning and improving AI for our purposes. If the full might of even a midsized culture ever treats this as important enough, alignment will probably proceed quite rapidly, not because it is an easy question, but because cultures are terrifyingly capable.
At a guess, alignment researchers have seen countless 'impossible' tasks fall to the midsized 'Science' culture of which they are a part, and many think this is much the same. 'Humanly achievable' means anything a human-based culture could ever do, which is just about anything that doesn't violate the substrates it is built on too much (and you could even see AI as a way around that). Can human cultures tame a new substrate? It seems quite likely.