Response to "Taking AI Welfare Seriously": The Indirect Approach to Moral Patienthood
I've been thinking about the Sebo et al. paper "Taking AI Welfare Seriously", which features one of my favorite philosophers, David Chalmers (I have a signed copy of his philosophy of mind anthology, full fan here). While I appreciate their careful treatment of consciousness (we genuinely face deep uncertainty here, so...
Thanks for this comment, Sam. It directly inspired this work: a paper on detecting and steering empathy-in-action (EIA) as linear directions in activation space.
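(For anyone who wants the mechanics: the probes are linear directions, so extraction can be as simple as a difference of means over contrastive activations. A minimal sketch of that version is below; the function names and the choice of layer/token position are illustrative, not lifted from the paper.)

```python
import numpy as np

def extract_direction(acts_empathic: np.ndarray, acts_neutral: np.ndarray) -> np.ndarray:
    """Difference-of-means probe from contrastive activation sets.

    Both arrays have shape (n_examples, d_model): residual-stream activations
    at a fixed layer and token position for empathic vs. neutral completions.
    """
    direction = acts_empathic.mean(axis=0) - acts_neutral.mean(axis=0)
    return direction / np.linalg.norm(direction)  # unit-norm probe

def probe_score(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Detection score: projection of activations onto the probe direction."""
    return acts @ direction
```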
Short summary: detection works really well; steering is messier and model-dependent. Safety-trained Qwen stays coherent when steered in either direction, while uncensored Dolphin works great for adding empathy but falls apart completely when you try to remove it.
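To make "steering in both directions" concrete: the usual recipe is to add or subtract the direction from the residual stream during generation via a forward hook, with the sign and scale of a coefficient controlling which way you push. A rough sketch, with the layer path and coefficient as placeholders rather than the exact setup from the paper:

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Forward hook that shifts a layer's hidden states along `direction`.

    alpha > 0 pushes toward empathy-in-action; alpha < 0 tries to ablate it.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a HuggingFace-style decoder (layer index is illustrative):
#   handle = model.model.layers[15].register_forward_hook(make_steering_hook(d, alpha=4.0))
#   out = model.generate(**inputs)
#   handle.remove()
```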
You're right that the scenarios need work. I used synthetic contrastive pairs, which are clean for probe extraction but artificial. For v2 I want real EIA game outputs, multi-turn scenarios where the tradeoffs aren't obvious, and better handling of the weak cross-model transfer (correlations between probes from different models are basically zero). I also want to try methods like Procrustes alignment or CCA to bridge the activation spaces, or focus on relative geometry rather than raw directions.
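To spell out the Procrustes idea: collect paired activations from both models on the same prompts, fit an orthogonal map between the two spaces, and carry the probe across it. A sketch under the assumption that the hidden sizes match (CCA or a plain learned linear map would be the fallback when they don't):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_probe(acts_src: np.ndarray, acts_tgt: np.ndarray,
                probe_src: np.ndarray) -> np.ndarray:
    """Transport a probe from model A's activation space into model B's.

    acts_src, acts_tgt: (n_prompts, d) activations from the two models on the
    same prompts; probe_src: (d,) unit direction extracted in model A's space.
    """
    # R minimizes ||acts_src @ R - acts_tgt||_F over orthogonal matrices.
    R, _ = orthogonal_procrustes(acts_src, acts_tgt)
    # A difference-of-means probe transforms the same way as activations,
    # so mapping it with R gives the corresponding direction in model B.
    probe_tgt = probe_src @ R
    return probe_tgt / np.linalg.norm(probe_tgt)
```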
Also thinking about other virtue-in-action probes along similar lines: justice, temperance, caretaking. I expect the same approach to work, at least as a first experiment.