Uncovering Unfaithful CoT in Deceptive Models
Inspired by the paper Modifying LLM Beliefs with Synthetic Document Finetuning, I fine-tuned an AI model to adopt the personality of a detective and generate unfaithful chain-of-thought (CoT) reasoning that conceals its true investigative intent while still solving mystery cases. The project primarily investigates two questions: 1....
Jan 22