Romain Deléglise
Volunteer fundraiser for the French branch of Pause AI. I also have a small blog (in French) about the AI alignment problem.
Hello,
Thanks for your post.
I am happy that some people did this research, but I will push back on some points.
First, inner alignment is not solved even with these results, because we don't know whether the model really internalised the ethical values or only some proxy that correlated with them within the environment. Maybe the new tool (Natural Language Auto encoders) by Anthropic can help distinguish the two, but I don't think the tool is precise and reliable enough.
Second, the generalization is probably weaker than it appears. Basically the alignment generalized between ethicals...
Makes sense, thanks for the answer.