agastya agrawal — LessWrong

agastya agrawal

3mo

Uncovering Unfaithful CoT in Deceptive Models

Inspired by the paper Modifying LLM Beliefs with Synthetic Document Finetuning, I fine-tuned an AI model to adopt the personality of a detective and to generate unfaithful Chain-of-Thought (CoT) that conceals its true investigative intent while it still solves mystery cases. The project primarily investigates two questions: 1....