Building Better Activation Oracles
Work done for our MATS 10.0 Sprint project, mentored by Neel Nanda and Adam Karvonen.
Huggingface, Github
TL;DR: We have improved the original Activation Oracle (AO) training regime by training on on-policy rollouts, improving the conversational dataset, feeding more layers (following the approach by Niclas Luick) and making a...