LESSWRONG

jordine
225430

Posts

Comments

Can SAE steering reveal sandbagging?
jordine · 5mo · 10

Refusals were mostly 1-2%, so ignoring them doesn't change the results significantly. Ignoring gibberish does change the results, but since we are measuring correct answers, this shouldn't matter.

Shallow review of technical AI safety, 2024
jordine · 8mo · 10

fixed! edited hyperlink.

Shallow review of technical AI safety, 2024
jordine · 8mo · 10

edited, thanks for catching this!

31 · Here’s 18 Applications of Deception Probes · 4d · 0
35 · Can SAE steering reveal sandbagging? · 5mo · 3
195 · Shallow review of technical AI safety, 2024 · Ω · 8mo · 35
13 · Results from the AI x Democracy Research Sprint · 1y · 0