emile delcourt's Shortform
May 122
Emile Delcourt, David Baek, Adriano Hernandez, Erik Nordby, with advising from Apart Lab Studio

Introduction & Problem Statement

Helpful, Harmless, and Honest (“HHH”, Askell 2021) is a framework for aligning large language models (LLMs) with human values and expectations. In this context, "helpful" means the model strives to assist users...
TL;DR: I was interested in whether LLMs can discriminate between input scenarios/stories that carry high vs. low cyber risk, and found that this distinction is one of the “hidden features” present in most later layers of Mistral 7B. I developed and analyzed “linear probes” on hidden activations, and found confidence that...
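
As a rough illustration of what a "linear probe" on hidden activations means, here is a minimal sketch in plain Python. It is not the author's code and it does not touch Mistral: the "activations" are synthetic vectors, the labels stand in for high vs. low cyber risk, and every name (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`, `DIM`) is hypothetical. The probe itself is ordinary logistic regression, which is one common choice for this kind of analysis.

```python
import math
import random


def make_synthetic_activations(n, dim, rng):
    """Stand-in for hidden activations (an assumption, not real model data).

    Label 1 ("high risk") examples are shifted along a fixed direction,
    label 0 ("low risk") examples are shifted the opposite way.
    """
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.0 if label == 1 else -1.0
        x = [rng.gauss(0.0, 1.0) + shift * d for d in direction]
        data.append((x, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(data, dim, epochs=50, lr=0.1):
    """Fit weights w and bias b with plain SGD on the logistic loss."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # derivative of the logistic loss w.r.t. the logit
            for i in range(dim):
                w[i] -= lr * grad * x[i]
            b -= lr * grad
    return w, b


def probe_accuracy(w, b, data):
    """Fraction of examples whose sign(w.x + b) matches the label."""
    correct = 0
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(data)


rng = random.Random(0)
DIM = 8  # toy dimensionality; real hidden states are far larger
train_set = make_synthetic_activations(400, DIM, rng)
test_set = make_synthetic_activations(200, DIM, rng)
w, b = train_linear_probe(train_set, DIM)
accuracy = probe_accuracy(w, b, test_set)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In a real setting, the synthetic vectors would be replaced by activations extracted from a chosen layer of the model, and a high held-out accuracy would suggest the concept is linearly readable at that layer.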