As we approach AGI, we also approach the risk of alignment failure: due either to a mismatch between the goals we intend and the goals we specify ('outer misalignment'), or to a mismatch between the goals we specify and the goals that actually emerge in the trained system ('inner misalignment'), we end up with a catastrophic failure in an extremely powerful system, which puts us all in a very bad place.

Right now, we don't know what kind of model will lead to AGI. We can guess, but a few years back people didn't have much hope for LLMs, and now look where we are; a lot of people were deeply surprised. Solutions to alignment failure may be model-dependent, and AGI may emerge from something entirely new, or from something old that starts to behave in surprising ways when scaled. It's quite hard to know where to start when faced with unknown unknowns. Of course, if you're here, you already know all of this.

These challenges and uncertainties may be relatively novel to the academic research community (in which AI safety has only recently become a respectable area of study), but they are by no means novel to most parents, who also train models (albeit 'wet' ones) via RLHF and inevitably experience alignment issues. Much of the time, the biggies get mostly resolved by adulthood and we end up with humans who balance self-interest with respect for humanity as a whole, but not always. For example, assertiveness, persuasiveness, public-speaking ability and self-confidence are common training goals, but when alignment failure occurs you get humans (who often manage to convince the masses to elect them democratically) who are just one button push away from the destruction of humanity, and who have the pathological personality type to actually push it. We don't need AGI to see alignment failure put the future of humanity on a knife edge. We're already there.

Since we expect AGI (even superhuman AGI) to exhibit a large number of human-type behaviours, because its models will presumably be trained on human output, why do we not spend more time looking at alignment failure in humans (its determinants, its solutions and its warning signs) and seeing whether those things yield insights into AI alignment failure mechanisms? As any human parent (like me) will tell you, although probably not in quite these words, aligning biological general intelligences (BGIs) is hard, counterintuitive and also immensely important. But, unlike AI alignment, there is a vast body of research and data in the literature on what works and what doesn't when it comes to aligning BGIs.

So, why is there so little cross-domain reach into human (especially child) psychology and related fields in the AI community?

3 comments

This is a reasonable question, although I don't think we need to worry much about analogues between AI and human psychology. Instead we should be looking for robust alignment schemes that will work regardless of the design of the agent; even if a scheme correctly concludes that a given agent cannot be aligned, we should know that up front.

Because of this, I frame the problem more like this: if we design a working alignment scheme, it should at least work to align humans, since humans are less robust optimizers but are (minimally) general intelligences.

So if we're trying to think of how to build aligned AI, a reasonable test for any scheme that is not "design the AI with an architecture that guarantees that it is aligned" is to check if the scheme could align humans. We have thus far not solved the human alignment problem, but solving the problem of how to get humans to align on the right behavior without Goodharting should be a step in the direction of coming up with an alignment scheme that might be able to handle super-optimizing AI.

Thanks for the thoughtful response, although I'm not quite sure about the approach. For starters, 'aligning humans' takes a long time, and we may simply not have time to test any proposed alignment scheme on humans if we want to avoid an AGI misalignment catastrophe. Not to mention the ethical issues, and so forth.

Society has been working on human alignment for a very long time, and we've settled on a dual approach: (1) training the model in particular ways (e.g. parenting, schooling) and (2) a very complex, likely suboptimal system of 'checks and balances' to try to mitigate issues that arise, for whatever reason, post-training (e.g. the criminal justice system, national and foreign intelligence agencies). This is clearly imperfect, but it seems to maintain cohesion a lot better than prior systems of 'mob justice', where if you didn't like someone you'd just club them over the head. Unfortunately, we now also have much more destructive weapons than the clubs of the 'mob justice' era. Nonetheless, as of today you and I are still alive, and significant chunks of society more or less function and continue to do so over extended periods, so the equilibrium we're in could be a lot worse. But, per my original post regarding the nuclear button, it's clear we have ended up on a knife edge.

Fortunately, we do have a powerful advantage with AGI alignment over human alignment: for the former we can inspect the internal state (weights etc.). Interpretability of that internal state is of course a challenge, but we nonetheless have access. (The reason human alignment requires a criminal justice system like the one we have, with all its complexity and failings, is largely that we cannot inspect the internal state.) So it seems that AGI alignment may well be achieved through a combination of the right approach during the training phase and a subsequent 'check' methodology based on continuous analysis of the internal state.

I believe that bringing the large body of understanding, research and data we already have on human alignment in the training phase (i.e. psychology and parenting) to the AI safety researcher's table could be very helpful. And right now, I don't see this happening. Look, for example, at the open roles posted on the OpenAI web site: they are all 'nerd' jobs, with none aimed at experts in human behaviour. This surprises me and I don't really understand it. If AGI alignment is as important to us as we claim, we should be more proactive about bringing experts from other disciplines into the fold when there's a reasonable argument for it, not just more and more computer scientists.
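To make the 'continuous analysis of the internal state' idea above a little more concrete, here is a minimal, purely hypothetical sketch (assuming a PyTorch-style model; the probe, the threshold and the monitored layer are all placeholders I've invented, not a real misalignment detector). It attaches a hook to a hidden layer and scores the activations on every forward pass:

```python
# Purely illustrative sketch, not a real misalignment detector: the probe, the
# threshold and the monitored layer are placeholders. Assumes PyTorch.
import torch
import torch.nn as nn

# A stand-in model and a linear probe over one of its hidden layers.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
probe = nn.Linear(32, 1)  # hypothetically trained to flag concerning internal states

def check_internal_state(module, inputs, output):
    # Fires on every forward pass: score the hidden activations and flag high scores.
    score = torch.sigmoid(probe(output)).mean().item()
    if score > 0.9:  # placeholder alert threshold
        print(f"internal-state check flagged this forward pass (score={score:.2f})")

# 'Continuous analysis': the hook runs whenever the model is used.
model[1].register_forward_hook(check_internal_state)
model(torch.randn(8, 16))
```

Obviously a real check would need validated probes and a far richer notion of internal state; the point is only that a model's internals are available for this kind of monitoring in a way that human internals are not.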

I had used the term 'Natural Intelligence'. Something that is not designed is not that great a resource to get design tips from.

But when I think about the term 'Biological General Intelligence', I wonder whether an AGI could be a BGI. One could argue that humans are not 100% natural intelligence but are partly artificial.

One could also argue that the standard cognitive individual contains silicon body parts in the form of a cellphone, or that Google or Facebook is, to a not-insignificant degree, part of how an individual human intelligence controls its attention.