Dylan Feng — LessWrong

205

1

1y

Weird Generalization & Inductive Backdoors

by Jorio Cocola, Owain_Evans, and Dylan Feng

This is the abstract and introduction of our new paper. Links: 📜 Paper, 🐦 Twitter thread, 🌐 Project page, 💻 Code. Authors: Jan Betley*, Jorio Cocola*, Dylan Feng*, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans (* Equal Contribution). You can train an LLM only on good behavior and implant...

Dec 11, 2025•153

Concept Poisoning: Probing LLMs without probes

by Jan Betley, Jorio Cocola, Dylan Feng, and Owain_Evans

This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple of months. We’ve decided to move on to other things. Here we describe the idea, some of our experiments, and our reasons for not continuing. Contributors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Anna...

Aug 5, 2025•60