EDIT note: I wasn't aware of the LLM assistant policy for newcomers, so I have reverted the post to my unedited, pre-assistance text. It is currently clean of LLM output, in both ideas and phrasing.
I am new here, still reading and learning, so this post is not an argument; it is a request for salient material to update my priors. So far I have seen nothing to dissuade me of two core philosophical positions that guide my thinking that AI alignment is not only hard but practically impossible (not theoretically impossible, mind you). I have read Eliezer and I follow the argumentation trail in the book, but I approach it with slightly different language and perhaps a few different angles. Although I work in ML, the only intellectual thread that has followed me uninterrupted from university to today is an interest in epistemology.
From everything I see, the position that "alignment is a solvable problem" is epistemologically questionable, in the same way that epistemic certainty about anything is questionable.
1. For more than 100k years, humans have attempted to impose control and alignment on other intelligences, namely other humans, and have failed, despite great incentives for power-seeking individuals to succeed. This is mainly due to the sheer complexity of human decision-making and the option space available (the option space is near-infinite, depending on the nature of the universe, and the decision space infinite). It follows that alignment is necessarily not a problem of observability, monitoring, or error catching; true alignment can only come through goal alignment and the fundamental installation of morality in the new intelligence. I understand alignment research's answer to this: human intelligence is an encountered state of the world, whereas artificial intelligence will be evolved, and since we guide the evolution mechanism, we control the forces that create the morals, which we are unable to do in humans. For this to be true, we would need a high degree of epistemic certainty in our understanding of both the mechanisms of evolution and the neurology of human brains as it relates to our moral philosophies. That does not sound like a reasonable position. Evolution is a historical mechanism that we learn about by induction on its outcomes, so our grasp of its causal pathways is limited. And neuroscience is nascent enough that, assuming it upgrades its paradigms at a rate comparable to the other sciences, even discounting for civilisational progress, you would still expect it to go through a few more cycles of fundamental revision.
1a. A corollary of the above: humans don't agree on a moral philosophy, so which moral philosophy will we align the AI to? We don't even know what the predominant moral philosophy in humanity is, since our instruments for measuring it are incredibly lossy, and short of large-scale ethical-dilemma experiments across high-variance state-spaces, we can't hope to know. Even if you discount measurement, you would have to assume that the theoretical space of human morals is well explored, which, given the rate of change, seems hard to defend. Finally, I believe human morality is a homeostatic state forced on us by our struggle against our environment. Given that AI will exist in a different, dynamic environment, even if we assume CEV, the target is mutable.
2. To my knowledge, humans have never, past a certain level of technological sophistication, created a technological advance that succeeded on the first try. This necessarily means our only chance is to reach a given alignment level ahead of the corresponding capability level, which, given that we are a species that learns through testing (as is patently true of our current scientific paradigm), necessitates a belief that alignment generalises across capability levels, such that we may infer our ability to align at the next stage of capability from a previous one (call it a ratio of transference). I see no logical reason why this belief holds with certainty, nor a basis on which a prior could be established. What is the equivalent well-researched discipline from which we form the belief that the laws we have established at a higher abstraction layer will hold when we generalise into a lower one? Indeed, it seems to me that it is the nature of scientific discovery that as we understand more, laws at previous abstraction layers break. I would argue that the most intellectually honest position is that the generalisation of our alignment ability to higher capability tiers is one of symmetric ignorance, which, given asymmetric outcomes, pushes the burden of proof onto the more optimistic side of the debate (a toy expected-value sketch after the next paragraph makes the asymmetry concrete). I am aware of the early experiments at lower ability levels whose results held at higher levels, but we would need similar functions in other domains from which to infer that the shape of the function, and indeed a stable function, will hold. Indeed, within my mental model, the nature of this ignorance is the strongest support for the view that a researcher's pessimism or optimism collapses into how they navigate this particular question.
Further to this, humans have historically invented complex systems through trial and error; indeed, it is the primary way we understand how complex systems should be built, as the feedback loop from the environment is critical to establishing the system well. This, intuitively, means that because an AGI is a complex system whose failure may foreclose further trials, AGI is structurally a one-shot problem.
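To make the burden-of-proof point from presupposition 2 concrete, here is a toy expected-value sketch; the 50/50 prior and the payoff numbers are illustrative assumptions of mine, not estimates. Let $p$ be the probability that our alignment ability transfers to the next capability level; then deploying at that level has expected value

$$\mathbb{E}[\text{deploy}] = p \cdot U_{\text{good}} + (1 - p) \cdot U_{\text{bad}}.$$

Under symmetric ignorance, $p = 0.5$. If the upside is bounded (say $U_{\text{good}} = +1$) while the downside is catastrophic (say $U_{\text{bad}} = -1000$), the expectation is $0.5 \cdot 1 + 0.5 \cdot (-1000) = -499.5$, and deployment only clears zero once $p > 1000/1001 \approx 0.999$. Parity of evidence is not enough; the optimist has to argue $p$ nearly all the way to 1, which is the sense in which the burden of proof shifts.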
Therefore, my core presuppositions boil down to:
1. We don't know what human morals are with any degree of epistemic certainty.
2. ...and even if we did, we don't know how to control and align our closest analogue, a human-level intelligence, to any operationalisable degree except through the application of power, which gives rise to power games.
3. We have, as a species, never progressed technologically without failing first.
From my current reading of the world, the probability that we as a species can define what "aligning an intelligence" means with the degree of epistemic certainty necessary to explore the solution space well is infinitesimally small. I would go as far as to say that this epistemic uncertainty holds ad infinitum; it certainly holds in the present, which is what I intuitively feel to be true. Therefore, under my mental model, even the goal of alignment is not understood with a high degree of epistemic certainty, let alone the mechanics of how to achieve it.
My reasoning above is orthogonal to the question of whether anyone should be doing alignment work. Obviously, given that we are in a negative-sum game (trivially true) to build superintelligence, whether alignment is possible is irrelevant: we should try regardless. I make no normative statements in the above post.