Romain Deléglise — LessWrong

Volunteer fundraiser for the French branch of Pause AI. I also have a small blog (in French) about the AI alignment problem.

Claude is Now Alignment-Pretrained

Romain Deléglise (20d)

Makes sense, thanks for the answer.

Claude is Now Alignment-Pretrained

Romain Deléglise (21d)

Hello,

Thanks for your post.

I am glad that people did this research, but I want to push back on some points.

First, inner alignment is not solved even with these results, because we don't know whether the model really internalised the ethical values or only some proxy that correlated with them within the environment. Maybe the new tool by Anthropic (Natural Language Autoencoders) can help distinguish the two, but I don't think the tool is precise and reliable enough.

Second, the generalization seems weaker than it appears. Basically, the alignment generalized between ethicals...