Usman Anwar — LessWrong

Paraphrasing Is (At Best) a Partial Defence Against Steganography in LLMs

Within the AI safety community, paraphrasing is generally considered a viable defence against, and detection method for, steganography in LLMs. In the context of this post, paraphrasing simply means using another LLM (with nonzero temperature) to rewrite a given piece of content. In this blog post, we briefly provide a taxonomy of...

A Sober Look at Steering Vectors for LLMs

by Joschka Braun, Dmitrii Krasheninnikov, Usman Anwar, RobertKirk, Daniel Tan, and David Scott Krueger

We thank Madeline Brumley, Joe Kwon, David Chanin and Itamar Pres for their helpful feedback.

Introduction

Controlling LLM behavior through directly intervening on internal activations is an appealing idea. Various methods for controlling LLM behavior through activation steering have been proposed. Most steering methods add a 'steering vector' (SV) to...

Nov 23, 2024 • 42