I think you mean that seemingly-aligned AI is the most dangerous kind? Big difference. This issue is discussed frequently, although I don't have a really good reference for it.
Yes and no. Yes in principle but no in practicality. The point I am making is that with regards to all possible measures, seemingly aligned and aligned models are indistinguishable. I agree that many people are aware and it has been talked about for a long time on this forum but I am surprised that evidence of apparent alignment is still considered as progress towards the later by a lot of people. Whereas I see later having a measurable evidence almost impossible.
I got that. I'm pointing out that you're probably being downvoted because your title is quite inaccurate.
Thanks for pointing that out and for the engagement despite that. I have changed the title and added a short note on the edit.
What really bothers me is the trajectory we seem to have chosen: Continue to scale the models and monitor them for misalignment. This plan has some obvious flaws:
1. Verifiable way to know if we got an aligned or seemingly aligned AI as a result; since evaluations can't distinguish between the two.
2. White-box techniques seem to be pretty limited currently and it is uncertain if we will get distinguishable signals if most reliable techniques are developed past a certain(unknown) capability mark.
3. If we continue moving towards more automated pipelines, because we feel it is safe, we won't be able to limit catastrophes.
I don't see much push from technical perspective against this trajectory. For people starting out work in AI safety from technical perspective, I don't see many suggestions that challenge this trajectory and propose alternatives. I see there is scientist AI from Yoshua Bengio but it doesn't seem to be discussed as much. I see theoretical work with Simplex and SLT, but seems still at beginning stage.
And since most beginner work is about replicating and extending current work, it creates this chain-reaction, and as a result, a majority of the work ends up being the one following the same trajectory.
I see this post on the state of progress on alignment and others talk about AI getting aligned or better aligned than they initially expected.
But if you were not worried about catastrophes from misuse or mistake but specifically from x-risks from AI because of loss of control, it is exactly the kind of AI that is most dangerous. [1]
I am surprised that it is not obvious to more people that:
Edits: I changed the title of the article as it was considered misleading. I chose a clickbait title earlier: Aligned AI is the most dangerous AI to draw attention towards impossibility(?) of distinguishing aligned and seemingly aligned models via evaluations. I felt not that many people were talking about it and it has big implications on what conclusions we draw from empirical results on alignment evaluations.
I am not claiming that the current AI models are deceptively aligned or even close but the limitations of ever knowing if would be.