LESSWRONG
LW

Aengus Lynch — LessWrong

Replying toAgentic Misalignment: How LLMs Could be Insider Threats

Agentic Misalignment: How LLMs Could be Insider Threats

This broadly seems right, yeah. But it's possible that they do eagerly take unethical actions in scenarios we haven't elicited yet.

Replying toAgentic Misalignment: How LLMs Could be Insider Threats

Aengus Lynch8mo

Agentic Misalignment: How LLMs Could be Insider Threats

First off, I wouldn't overindex on these numbers. We run a hundred samples at temperature 1, so naturally there's going to be variability between different experiments.

Separately, each of these prompts was optimized for one model and then just copied to the others. To get a meaningful assessment of how likely a model is to do a certain kind of behavior, you would want to remove the possibility that on this specific prompt it isn't occurring, but on other prompts which are semantically equivalent it is (e.g. perhaps the model is sensitive to the names of the characters, and just changing these dramatically changes the rates). Thus, you'd want to generate many such situations which correspond to the same scenario.

Thus, to get a more comprehensive assessment of how goal conflicts and threats to autonomy really interact, I would massively scale up these experiments along the dimensions of i) finding more scenarios, ii) running for more samples, iii) perturbing the semantically irrelevant features

Agentic Misalignment: How LLMs Could be Insider Threats

Aengus Lynch

Aengus Lynch, Benjamin Wright, Ethan Perez, evhub

8mo

Highlights

We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction.
In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We

... (read 1508 more words →)

Best-of-N Jailbreaking

John Hughes

John Hughes, saraprice, Aengus Lynch, Rylan Schaeffer, fbarez, Henry Sleight, Ethan Perez, mrinank_sharma

This is a linkpost for a new research paper of ours, introducing a simple but powerful technique for jailbreaking, Best-of-N Jailbreaking, which works across modalities (text, audio, vision) and shows power-law scaling in the amount of test-time compute used for the attack.

Abstract

We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on

... (read 492 more words →)