Of all the natural latents that exist in a dataset, which ones should we expect a sufficiently advanced AI system to learn? This question matters because any dataset seems to contain a massive number of learnable natural latents.
Say we have a collection of N dogs. The natural latent “dog” would satisfy the redundancy & independence assumptions across every dog. But I could also consider the powerset of this collection of N dogs: for each element of the powerset, all the dogs in that subset will share the information pres...
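A quick way to see the scale of the problem: the number of candidate subsets (and hence candidate "shared information" latents) grows as 2^N. This is a minimal sketch, with `powerset` as a hypothetical helper and the dog collection purely illustrative:

```python
from itertools import chain, combinations

def powerset(items):
    """All subsets of `items`, from the empty set up to the full set."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )

# Illustrative collection of N = 5 dogs; each subset is a candidate
# for its own natural latent, so candidates grow as 2^N.
dogs = [f"dog_{i}" for i in range(5)]
subsets = list(powerset(dogs))
print(len(subsets))  # 2**5 = 32 candidate subsets
```

Even at modest N this blows up quickly (N = 30 already gives over a billion subsets), which is why it matters which of these latents a learner actually picks out.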
Hi everyone!
My name is Robert Adragna, and I’ve been working with Dovetail during this winter fellowship cohort on Agent Foundations. Specifically, I’ve been trying to better understand what background assumptions the Natural Abstractions Hypothesis (NAH) makes about the world, and whether those assumptions might be learned by existing LLM systems. Specific questions that I’m exploring include:
We’re also trying to find the group
Have you considered running this on a dataset of "autonomous weapons activity"? Although Anthropic might feel comfortable with this right now, if it did induce significant EM, that might be a good reason to avoid any fine-tuning for autonomous weapons use.