Latent Introspection (and other open-source introspection papers)
by vgel, Martin Vaněk, Raymond Douglas, and Jan_Kulveit
Paper | Code | Earlier post | Twitter thread | Bluesky thread

@vgel, Martin Vaněk, @Raymond Douglas, @Jan_Kulveit
ACS Research, CTS, Charles University

---

Last year, Lindsey demonstrated that Claude models can detect when concepts have been injected into their activations using steering vectors, which Lindsey uses as a...
Mar 2496