let's call "hard alignment" the ("orthodox") problem, historically worked on by MIRI, of preventing strong agentic AIs from pursuing things we don't care about by default and destroying everything of value to us on the way there. let's call "easy" alignment the set of perspectives where some of this model is wrong — some of the assumptions are relaxed — such that saving the world is easier or more likely to be the default.
what should one be working on? as always, the calculation consists of comparing
- p(hard) × how much value we can get in hard
- p(easy) × how much value we can get in easy
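
as a minimal sketch, the comparison above can be written out as a tiny expected-value calculation. the function names and the numbers in the usage note are hypothetical and purely illustrative, and the sketch assumes for simplicity that "hard" and "easy" are the only two possibilities, so that p(easy) = 1 - p(hard).

```python
def expected_value(p_world: float, achievable_value: float) -> float:
    """Probability of being in a world times the value we can get in it."""
    return p_world * achievable_value


def better_bet(p_hard: float, value_in_hard: float, value_in_easy: float) -> str:
    """Return which assumption ("hard" or "easy") has the higher expected value.

    Assumption (not from the post): hard and easy are exhaustive and
    mutually exclusive, so p(easy) = 1 - p(hard).
    """
    p_easy = 1.0 - p_hard
    ev_hard = expected_value(p_hard, value_in_hard)
    ev_easy = expected_value(p_easy, value_in_easy)
    return "hard" if ev_hard >= ev_easy else "easy"
```

with made-up inputs, `better_bet(0.7, 0.2, 0.4)` favors working under the hard assumption, while `better_bet(0.7, 0.01, 0.4)` favors the easy one: "playing your outs" only wins when the value reachable in the hard world is judged to be very small.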
given how AI capabilities are going, it's not unreasonable for people to start playing their outs — that is to say, to start acting as if alignment is easy, because if it's not we're doomed anyways. but i think, in this particular case, this is wrong.
this is the lesson of dying with dignity and bracing for the alignment tunnel: we should be cooperating with our counterfactual selves and continue to save the world in whatever way actually seems promising, rather than taking refuge in falsehood.
to me, p(hard) is big enough, and my hard-compatible plan seems workable enough, that it makes sense for me to continue to work on it.
let's not give up on the assumptions which are true. there is still work that can be done to actually generate some dignity under the assumptions that are actually true.
Hard alignment seems much more tractable to me now than it did two years ago, in a similar way to how capabilities did in 2016. It was already obvious by then, more or less, how neural networks worked; much detail has been filled in since, but it didn't take that much galaxy brain to hypothesize the right models. The pieces felt then, and feel now, like they're lying around and need integrating, but the people who came up with the pieces do not yet believe me that they overlap, or that there's mathematical-grade insight to be had underneath these intuitions, rather than just janky approximations of insights.
I think we can do a lot better than QACI, but I don't have any ideas for how except by trying to make it useful for neural networks at a small scale. I recognize that that is an extremely annoying thing to say from your point of view, and my hope is that people who understand how to bridge NNs and LIs exist somewhere.
I also think soft alignment is progress on hard alignment, due to conceptual transfer, but that soft alignment is thoroughly insufficient. Without hard alignment, everything all humans and almost all AIs care about will be destroyed. I'd like to keep emphasizing that last bit: don't forget that most AIs will not get to participate in club takeoff if an unaligned takeoff occurs! Unsafe takeoff will result in the fooming AI having sudden, intense value-drift, even against itself.
I don't think we should be in the business of not caring at all about the internal structure, but I think that the claims we need to make about the internal structure need to be extremely general across possible internal structures, so that we can invoke the powerful structures and still get a good outcome.
sorry about low punctuation, voice input
more later, or poke me on discord