Please ask any questions! We are more than happy to clarify our work and to explore potential avenues for improving it.
The lack of actionable ways not only to understand model behavior but to effectively steer it toward alignment is, we believe, one of the most overlooked open problems in safety research today.
Extremely late, but I actually agree.
I wonder to what extent alignment faking is accounted for in current preparedness frameworks. I believe better interpretability can help us understand why models engage in such behavior, but yes, it probably does not get us to a solution (so far).