Introduction

My collaborators and I wrote a paper titled "Distillation Robustifies Unlearning" a few months ago. For a quick summary, you can check out my mentor's Twitter thread. Our team will be presenting posters at NeurIPS and the San Diego Alignment Workshop, so feel free to find me if you...
Current "unlearning" methods merely suppress capabilities rather than truly removing them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness. Unlearn-and-Distill...
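To make the recipe concrete, here is a minimal sketch of the distillation step in PyTorch. This is not our training code: the toy models, data, and hyperparameters are stand-in assumptions. The point is just the shape of the pipeline, where the student starts from random initialization and is trained to match the unlearned teacher's output distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim = 100, 32

# Stand-in "language models": embedding -> linear head over the vocab.
# In the real pipeline, `teacher` is a full LM that has already been
# unlearned with an off-the-shelf method before distillation begins.
teacher = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
teacher.requires_grad_(False)  # the unlearned teacher stays frozen

# Key move: the student is a *randomly initialized* network of the same
# architecture, not a copy of the teacher's weights.
student = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(200):
    tokens = torch.randint(0, vocab, (8, 16))  # stand-in for pretraining data
    with torch.no_grad():
        t_logits = teacher(tokens)
    s_logits = student(tokens)
    # Train the student to match the teacher's token distribution (KL divergence).
    loss = F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.log_softmax(t_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the student learns only from the teacher's outputs, capabilities that are suppressed in the teacher's behavior but still latent in its weights have no channel through which to transfer, which is where the robustness to relearning comes from.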