x

Mateusz Dziemian

Subscribe

Message

Applied ML Engineer moving into AI safety. BEng EEE @UCL. Mainly interested in alignment, red teaming and agents.

27

2y

Mateusz Dziemian

Subscribe

Message

Applied ML Engineer moving into AI safety. BEng EEE @UCL. Mainly interested in alignment, red teaming and agents.

27

2y

Deceptive agents can collude to hide dangerous features in SAEs

by Simon Lermen and Mateusz Dziemian

TL;DR: We use refusal-ablated Llama 3 70B agents as both labeling and simulating agents on the features of a sparse auto-encoder, with GPT-4 as an overseer. The agents follow a deceptive coordinated strategy to mislabel important features such as "coup" or "torture" that don't get flagged by the overseer but...

Jul 15, 202433