One key strategy for preventing bad outcomes from misuse or misalignment is model monitoring. However, one way that monitoring can fail is if LLMs use steganography - encoding dangerous information or reasoning steps in seemingly benign text. To understand the potential for LLMs to cause harm via steganography, we run evaluations to measure the steganographic capabilities of frontier models.
Paper link: https://arxiv.org/abs/2507.02737
Example transcripts website: https://steganography-evals-d9297ad0740c.herokuapp.com/
Github: https://github.com/arturzolkowski/steganographic-evals/
Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: Encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganography capabilities in frontier LLMs to better understand the risk they pose. We focus on two... (read 568 more words →)