LESSWRONG

Arch223
1220

Posts

1 Arch223's Shortform
10mo
1
Steering Vectors Can Help LLM Judges Detect Subtle Dishonesty
Arch223 3mo 1 0

Please ask any questions! We are more than happy to clarify our work and explore potential avenues to improve it.

We believe the lack of actionable ways not only to understand model behavior but also to effectively steer it toward alignment is one of the most overlooked unsolved problems in safety research today.

Arch223's Shortform
[+]Arch22310mo-90
2 Interpretability is the best path to alignment
4d
6
12 Steering Vectors Can Help LLM Judges Detect Subtle Dishonesty
3mo
1