The Steganographic Potentials of Language Models
Our preliminary experiments show that current private models can perform prompted steganography, with non-trivial hidden messages remaining undetected in most cases (up to 65% decoding accuracy at 24 bits of message entropy). Open-source models can be trained to encode hidden messages (up to 60% accuracy at 4 bits of entropy), potentially transmitting larger messages perfectly securely (undetected) without...
May 8, 2025
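To make the "entropy bits" framing concrete: a payload of k entropy bits selects one of 2^k possible messages, so 24 bits corresponds to roughly 16.7 million distinct messages. Below is a toy sketch (not the paper's method) of one simple steganographic channel, an acrostic-style code where each cover word's membership in one of two hypothetical word lists carries one payload bit; the word lists and seeding are illustrative assumptions.

```python
import random

# Hypothetical codebooks: a word from ZERO_WORDS encodes bit 0, from ONE_WORDS bit 1.
ZERO_WORDS = ["apple", "berry", "cedar", "delta"]
ONE_WORDS = ["maple", "night", "ocean", "piano"]


def encode(bits: str, seed: int = 0) -> str:
    """Hide a bitstring in cover text, one bit per word."""
    rng = random.Random(seed)
    return " ".join(
        rng.choice(ONE_WORDS if b == "1" else ZERO_WORDS) for b in bits
    )


def decode(cover: str) -> str:
    """Recover the bitstring from the cover text."""
    return "".join("1" if w in ONE_WORDS else "0" for w in cover.split())


payload = "1011"  # 4 entropy bits, i.e. one of 2**4 = 16 possible messages
cover = encode(payload)
assert decode(cover) == payload
```

A language model performing steganography would face the harder task of producing cover text that is also fluent and plausible, which is what makes detection by an overseer non-trivial.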

