jordine
234430

Posts

Comments

Can SAE steering reveal sandbagging?
jordine 6mo 10

Refusals were mostly 1-2%, so ignoring them doesn't change the results significantly. Ignoring gibberish does change the results, but since we are measuring correct answers, this shouldn't matter.

Shallow review of technical AI safety, 2024
jordine 9mo 10

fixed! edited hyperlink.

Shallow review of technical AI safety, 2024
jordine 10mo 10

edited, thanks for catching this!

Here’s 18 Applications of Deception Probes (38 karma, 2mo, 0 comments)
Can SAE steering reveal sandbagging? (35 karma, 6mo, 3 comments)
Shallow review of technical AI safety, 2024 (197 karma, Ω, 10mo, 35 comments)
Results from the AI x Democracy Research Sprint (13 karma, 1y, 0 comments)