This is a special post for quick takes by Chamod Kalupahana. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Great post on measuring the logit distribution to detect eval awareness. Surprisingly effective, even for unverbalised eval awareness! ✍️
https://www.lesswrong.com/posts/PK7ZvFZxrgpYtrpF4/logits-as-a-new-monitor-for-evaluation-awareness-1
Saw an interesting post for those also trying to transition into technical AI safety :)
A Greenblatt post that tries out open-weight NLAs and shows their limitations for reconstructing activations.