I found this post after publishing something of my own yesterday, and it's wild how relevant this feels almost 2 decades later.
I'm not an expert on subjective probabilities; I come from analysing human behaviour and decision-making. What I find most fascinating is how you treat anticipation as a limited resource that has to be allocated among possible futures. In my world, people do something similar, but emotionally. We hoard permission the way the pundit hoards anticipation, waiting for perfect certainty before acting.
Recently, I watched someone spend months asking ChatGPT how to repair a friendship, crafting the perfect narrative, instead of just showing up. Therapists, astrologers and LLMs have all become proxies for "little numbers" that might make the risk of choosing feel safe. For so many of us.
I wonder if Bayesian reasoning is to belief what courage is to action, because both are ways of updating before certainty arrives.
(If you're curious, my essay exploring this from the emotional side is here: https://shapelygal.substack.com/p/youre-afraid-to-choose-now-arent)
Aw, yeah it is easier to just look stuff up online and debate with LLMs, isn't it?
I am not a therapist, but I have been to therapists in multiple countries (US, UK and India) for several years, and I can share my understanding based on that experience.
I think human therapist accountability has multiple layers. First, practice requires a professional license, which involves years of training, supervision, the possibility of revocation, and so on. Then there are legal obligations around complete documentation and crisis protocols. If these fail (and they sometimes do), there is also malpractice liability and free-market feedback. Even if only 1 in 100 bad therapists faces consequences, that creates a deterrent effect across the profession. The system is imperfect, but it exists.
For AI systems, training, certification, supervision, documentation and crisis protocols are all doable, and probably far easier to scale, but at the end of the day, who is accountable for poor therapeutic advice? The model? The company building it? With normal adults, it's easy to ask for user discretion, but what do you do with vulnerable users? I am not sure how that would even work.
Thank you for this very detailed study.
I am most concerned about the accountability gap. Several students in my undergraduate class use these models as "someone to talk to" to deal with loneliness. While your study shows that some models handle vulnerable conversations better than others, I think the fundamental issue is that AI lacks the accountability infrastructure that real therapeutic relationships require: continuity of care and a long-term mindset, professional oversight, integration with mental health systems, liability and negligence frameworks, and so on.
Until then, no matter how good a model is at handling vulnerable conversations, I'd rather have it triage users by saying "Here are resources for professional support" and bow out than attempt an ongoing therapeutic relationship. Even a perfectly trained therapeutic AI seems problematic without the broader accountability structures that protect vulnerable users.
More fundamentally, what are the underlying mechanisms that cause these model behaviours, and can training fixes address them without the accountability infrastructure?
Are relationship coaches (not PUAs) not a thing in the US?
Wait… isn’t this already filial piety? We created AI, and now we want it to mother us.
I don’t mean this as a technical solution, more a direction to start thinking in.
Imagine a human tells an AI, “I value honesty above convenience.” A relational AI could store this as a core value, consult it when short-term preferences tempt it to mislead, and, if it fails, detect, acknowledge, and repair the violation in a verifiable way. Over time it updates its prioritisation rules and adapts to clarified guidance, preserving trust and alignment, unlike an FAI that maximises a static utility function.
This approach is dynamic, process-oriented, and repairable, ensuring commitments endure even under mistakes or evolving contexts. It’s a sketch, not a finished design, and would need iterative development and formalization.
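To make the loop a little more concrete, here is a toy Python sketch of the store / consult / repair cycle described above. All of the names (RelationalAgent, adopt_value, detect_and_repair) are hypothetical, invented purely for illustration; this is a minimal sketch of the idea, not a proposed implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CoreValue:
    name: str          # e.g. "honesty"
    priority: float    # higher priority wins when values conflict

@dataclass
class RelationalAgent:
    """Toy model of the store / consult / repair loop."""
    values: List[CoreValue] = field(default_factory=list)
    repair_log: List[str] = field(default_factory=list)

    def adopt_value(self, name: str, priority: float) -> None:
        # "Store this as a core value" when the human states it.
        self.values.append(CoreValue(name, priority))

    def consult(self, candidate_action: str, violates: List[str]) -> bool:
        # Consult stored values before acting; veto any candidate action
        # that would violate one of them.
        return not any(v.name in violates for v in self.values)

    def detect_and_repair(self, violated: str) -> str:
        # If a violation slipped through: acknowledge it, log it verifiably,
        # and raise the violated value's priority so it wins next time.
        for v in self.values:
            if v.name == violated:
                v.priority += 1.0
        entry = f"Acknowledged violation of '{violated}'; priority updated."
        self.repair_log.append(entry)
        return entry

# Usage: the human states "I value honesty above convenience."
agent = RelationalAgent()
agent.adopt_value("honesty", priority=2.0)
agent.adopt_value("convenience", priority=1.0)

print(agent.consult("soften the bad news", violates=["honesty"]))  # False: vetoed
print(agent.detect_and_repair("honesty"))  # repair step after a failure
```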
While simple, does this broadly capture the kind of thing you were asking about? I’d be happy to chat further sometime if you’re interested.
I’m reminded of a Sanskrit verse, “Vidya dadati vinayam, vinayad yati patratam”, which translates roughly to: intelligence gives power, but humility gives guidance. Applied to AI, intelligence alone doesn’t ensure alignment, just as humans aren’t automatically prosocial. What matters are the high-level principles we embed to guide behaviour toward repairable, cooperative, and trustable interactions, which we do see in long-term relationships built on shared values.
The architecture-level challenge of making AI reliably follow such principles is hard, yes, especially under extreme power asymmetry, but agreeing on relational alignment is a necessary first step. Master/servant models may seem safe, but I believe carefully engineered relational principles offer a more robust and sustainable path.
I completely agree that AI isn’t human, mammal, or biological, and that any relational qualities it exhibits will only exist because we engineer them. I’m not suggesting we model AI on any specific relationship, like mother-child, or try to mimic familiar social roles. Rather, alignment should be based on abstract relational principles that matter for any human interaction without hierarchy or coercion.
I also hear the frequent concern about technical feasibility, and I take it seriously. I see it as an opportunity rather than a reason to avoid this work. I’d love the chance to brainstorm and refine these ideas, to explore how we might engineer architectures that are simple yet robust, capable of sustaining trust, repair, and cooperation without introducing subjugation or dependency.
Ultimately, relational design matters because humans inevitably interact through familiar social frameworks such as trust and repair. If we ignore that, alignment risks producing systems that are powerful but alien in ways that matter for human flourishing.
Thanks for the detailed analysis, Zvi.
But it’s sad, isn't it? Despite developing technologies with such incredible scale, we are using them to amplify our lack of incentives to build a better future, rather than actually transforming it.
Thanks for reading! I'm especially interested in feedback from folks working on mechanistic interpretability or deception threat models. Does this framing feel complementary, orthogonal, or maybe just irrelevant to your current assumptions? Happy to be redirected if there are blind spots I’m missing.