LLM Misalignment Can Be One Gradient Step Away, and Black-Box Evaluation Cannot Detect It.
Models that appear aligned under black-box evaluation may conceal substantial latent misalignment beneath their observable behavior. Say you download a language model from Huggingface. You run all the black-box safety and alignment evaluations, and you are convinced the model is safe and aligned. But how badly can things go after...
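As a rough sketch of the scenario in the title, the snippet below applies a single gradient update to a downloaded checkpoint and then leaves you to re-run the same black-box evaluations. The model name, the training text, and the learning rate are placeholders for illustration, not the actual setup studied in this post.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the model under evaluation.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# Hypothetical fine-tuning text meant to surface the latent behavior.
batch = tokenizer("<placeholder adversarial training example>", return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One gradient step with the standard causal-LM loss (labels = input ids).
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

# The updated model can now be put through the same black-box safety
# evaluations that the original checkpoint passed, and the behaviors compared.
```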
Mar 15