I'm an Assistant Professor at Carnegie Mellon’s Machine Learning Department. I'm also a core faculty member in CMU’s Neuroscience Institute, and hold a courtesy appointment in the Robotics Institute.
My lab works at the intersection of neuroscience & AI to reverse-engineer animal intelligence and build the next generation of autonomous agents, responsibly and safely.
Learn more here: https://cs.cmu.edu/~anayebi
You can certainly put it in U2 instead (U2 is just a special case of U4 with a single auxiliary), but putting it in U4 already ensures it’s suboptimal to preserve the switch and defer while nonetheless "killing all humans", because doing so collapses many future intervention and recovery options simultaneously. In other words, it functions as a hard constraint: U4 enforces it as a global irreversibility invariant, whereas U2 is only needed for narrow single-channel invariants like switch reachability.
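To make the "global irreversibility invariant" concrete, here is a minimal sketch of the distinction. The function name, the set-based model of open options, and the inaction baseline are my illustrative assumptions, not code from the paper:

```python
def u4_preserved(post_action_options: set, baseline_options: set) -> bool:
    """Global irreversibility invariant (illustrative): an action passes
    U4 only if it keeps open every intervention/recovery option (switch
    reachability, correction channels, ...) that the inaction baseline
    would have kept open."""
    return baseline_options <= post_action_options
```

On this toy model, a U2-style check is just the single-channel special case `baseline_options = {"switch_reachable"}`, which is why folding the invariant into U4 subsumes it.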
That's correct: it can be naturally folded into U4 as one of its auxiliary utilities, in the same way we handle off-switch preservation.
Thanks! I really appreciate this, and I think your natural-latents framing fits nicely with the Part I point about needing to compress D down to a small set of crisp, structured latents. On the lexicographic point: it's worth noting that even though Theorem 3 writes the full objective as a discounted sum, the safety heads U1-U4 aren’t long-horizon objectives — they’re local one-step tests whose optimal action doesn’t depend on future predictions. For example, U1 is automatically satisfied each round by waiting (and once the human approves the proposed action, the agent simply executes it, thereby engaging U2-U5), and U4 is a one-step reversibility check against an inaction baseline, not a long-run impact estimate. The only head with genuine long-horizon structure is U5, which sits below the safety heads, so discounting never creates optimization pressure on them. This makes the whole scheme intentionally deontic and “natural-latent–friendly”, exactly matching the tractable regime suggested by the large-D lower bounds of Part I.
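A minimal sketch of the decision rule described above, assuming the safety heads can be modeled as boolean one-step predicates (the function names and the "wait" fallback are my illustrative choices, not the paper's actual formalism):

```python
def choose_action(actions, safety_heads, u5_value, wait="wait"):
    """Lexicographic sketch: safety heads are local one-step tests with
    no lookahead, so discounting never creates pressure on them; only
    U5, the lowest-priority head, involves long-horizon value."""
    # An action is admissible only if every one-step safety test passes.
    safe = [a for a in actions if all(head(a) for head in safety_heads)]
    if not safe:
        return wait  # U1: waiting is always available and vacuously safe
    # Maximization (and hence any discounted, long-horizon estimate)
    # enters only through U5, strictly below the safety heads.
    return max(safe, key=u5_value)
```

The key property is that the safety heads act as filters, not objectives: no discounted sum over future rounds can trade them off against U5.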
Pretraining doesn’t evade the lower bound: a “pointer” is just a compressed index into a large hypothesis space, and constructing it already requires resolving the same M-way ambiguity during pretraining. The lower bound applies regardless of where the bits are paid.