Response to "Taking AI Welfare Seriously": The Indirect Approach to Moral Patienthood
I've been thinking about the Sebo et al. paper "Taking AI Welfare Seriously", which features one of my favorite philosophers, David Chalmers (I have a signed copy of his philosophy of mind anthology, full fan here). While I appreciate their careful treatment of consciousness (we genuinely face deep uncertainty here, so...
Thanks for this comment, Sam. It directly inspired this work: a paper on detecting and steering empathy-in-action (EIA) as linear directions in activation space.
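(For anyone who wants the mechanics: the probes are linear directions, so extraction can be as simple as a difference of means over contrastive activations. A minimal sketch of that version is below; the function names and the choice of layer/token position are illustrative, not lifted from the paper.)

```python
import numpy as np

def extract_direction(acts_empathic: np.ndarray, acts_neutral: np.ndarray) -> np.ndarray:
    """Difference-of-means probe from contrastive activation sets.

    Both arrays have shape (n_examples, d_model): residual-stream activations
    at a fixed layer and token position for empathic vs. neutral completions.
    """
    direction = acts_empathic.mean(axis=0) - acts_neutral.mean(axis=0)
    return direction / np.linalg.norm(direction)  # unit-norm probe

def probe_score(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Detection score: projection of activations onto the probe direction."""
    return acts @ direction
```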
Short summary: detection works really well; steering is messier and model-dependent. Safety-trained Qwen stays coherent when steered in either direction, while uncensored Dolphin works great for adding empathy but falls apart completely when you try to remove it.
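To make "steering in both directions" concrete: the usual recipe is to add or subtract the direction from the residual stream during generation via a forward hook, with the sign and scale of a coefficient controlling which way you push. A rough sketch, with the layer path and coefficient as placeholders rather than the exact setup from the paper:

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Forward hook that shifts a layer's hidden states along `direction`.

    alpha > 0 pushes toward empathy-in-action; alpha < 0 tries to ablate it.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a HuggingFace-style decoder (layer index is illustrative):
#   handle = model.model.layers[15].register_forward_hook(make_steering_hook(d, alpha=4.0))
#   out = model.generate(**inputs)
#   handle.remove()
```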
You're right that the scenarios need work. I used synthetic contrastive pairs, which are clean for probe extraction but artificial. For v2 I want real EIA game outputs, multi-turn scenarios where the tradeoffs aren't obvious, and better handling of the weak cross-model transfer (correlations between probes from different models are basically zero). I also want to try methods like Procrustes alignment or CCA to bridge the activation spaces, or focus on relative geometry rather than raw directions.
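To spell out the Procrustes idea: collect paired activations from both models on the same prompts, fit an orthogonal map between the two spaces, and carry the probe across it. A sketch under the assumption that the hidden sizes match (CCA or a plain learned linear map would be the fallback when they don't):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_probe(acts_src: np.ndarray, acts_tgt: np.ndarray,
                probe_src: np.ndarray) -> np.ndarray:
    """Transport a probe from model A's activation space into model B's.

    acts_src, acts_tgt: (n_prompts, d) activations from the two models on the
    same prompts; probe_src: (d,) unit direction extracted in model A's space.
    """
    # R minimizes ||acts_src @ R - acts_tgt||_F over orthogonal matrices.
    R, _ = orthogonal_procrustes(acts_src, acts_tgt)
    # A difference-of-means probe transforms the same way as activations,
    # so mapping it with R gives the corresponding direction in model B.
    probe_tgt = probe_src @ R
    return probe_tgt / np.linalg.norm(probe_tgt)
```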
Also thinking about other virtue-in-action probes along similar lines: justice, temperance, caretaking. I expect the same approach to work, at least as a first experiment.