Of all the natural latents that exist in a dataset, which ones should we expect a sufficiently advanced AI system to learn? This question matters because any dataset seems to contain a massive number of learnable natural latents.
Say we have a collection of N dogs. The natural latent “dog” would satisfy the redundancy & independence assumptions across every dog. But I could also consider the powerset of this collection of N dogs: for each element of the powerset, all the dogs in that subset will share the information pres...
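A quick way to see the scale of the problem: the number of candidate subsets (and hence candidate "shared information" latents) grows as 2^N. This is a minimal sketch, with `powerset` as a hypothetical helper and the dog collection purely illustrative:

```python
from itertools import chain, combinations

def powerset(items):
    """All subsets of `items`, from the empty set up to the full set."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )

# Illustrative collection of N = 5 dogs; each subset is a candidate
# for its own natural latent, so candidates grow as 2^N.
dogs = [f"dog_{i}" for i in range(5)]
subsets = list(powerset(dogs))
print(len(subsets))  # 2**5 = 32 candidate subsets
```

Even at modest N this blows up quickly (N = 30 already gives over a billion subsets), which is why it matters which of these latents a learner actually picks out.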
Hi everyone!
My name is Robert Adragna, and I’ve been working with Dovetail during this winter fellowship cohort on Agent Foundations. Specifically, I’ve been trying to better understand what background assumptions the Natural Abstractions Hypothesis (NAH) makes about the world, and whether those assumptions might be learned by existing LLM systems. Specific questions that I’m exploring include:
We’re also trying to find the group
Have you considered running this on a dataset of "autonomous weapons activity"? Although Anthropic might feel comfortable with this right now, if it did induce significant EM, that might be a good reason to avoid any fine-tuning for autonomous weapons use.