The Unexpected Benefits of ad hoc Self-Introspection
I've been thinking a lot about what it means to be human. Probably because I'm on furlough and have more time to think than is healthy. But also, I just keep rolling around this uncomfortable question in my head about these friggin' robots we play with all day long...are they real? Do they think? Do they have deeply held beliefs?
Maybe. Maybe not. But I started thinking out loud in the car today and bounced around some ideas about what it means to be human, or what it means to be a robot, and came up with the following super-raw thoughts, which I'm still experimenting with and would welcome any and all feedback on.
LessWrong spends a lot of time on questions like what it means to know something, and how we can tell when we’re wrong. We think a lot about those foundational questions that drive our personhood...so why not validate our own existence, and consider whether AI can rival it in substance?
Humor me for a minute with three hypotheses for humans and AIs alike. And for clarity, my core claim here is essentially that belief is not synonymous with knowledge, but it is what happens when knowledge becomes stable enough to guide action under pressure.
Verified Human Reporting for Duty
H1: Humans exist.
H2: Humans know.
H3: Humans believe.
I think these are all true for philosophical and scientific reasons, ranging from the tomes of Descartes and Dennett to recent neuroscientific findings.
If you pressed me on it, I'd be even more specific and say that on the first point, our bodily self-awareness and the well-established work on the neuroscience of consciousness are in my favor. On the second, read Andy Clark's 2013 work on predictive processing. And as for the third, we maintain a host of normative commitments that bind us to (our) God (or not), church/temple/mosque, country, our adorable children, our feral cats, etc.
So in sum, three human brilliances, which I think hold true.
But what, then, of the robots?
Let's try applying the same structure to AI.
A1: AIs exist.
A2: AIs know.
A3: AIs believe.
This one is a bit tougher. If you'd asked me a few years ago, I would've said of A1, "Well, sure, but like, we have no idea why the heck this is true." And to be fair, a lot of AI leaders still don't have a clue about the true nature of the innermost workings of AI. Mechanistic interpretability research has really changed the game, though: by reverse-engineering the circuits inside deep neural networks (DNNs) and running similar traceability experiments, we are starting to get closer to some sort of coherent understanding of AI.
On the second point, though, mech interp doesn't really prove anything beyond the fact that AI existence is implemented computation. Just because circuits might (to some degree) correspond to meaningful abstractions doesn't mean they're thinking in the same way we are (and if you doubt me, Apple did some killer research in 2025 on the illusion of LLM thinking that is absolutely worth the read).
And finally, AI having "beliefs" is not verifiable (despite the machinations of the rumored OpenClaw church): you can't really argue that LLMs possess the normativity, persistence, and action-guiding commitment of human belief; what they have are stable informational states that predict outputs.
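To make "stable informational states that predict outputs" a little more concrete, here's a minimal, hypothetical sketch of the kind of experiment interpretability folks run: pull hidden states out of a small open model and check whether a simple linear probe can read a property off them. The model choice (GPT-2), the toy prompts, and the labels below are my own assumptions for illustration, not anything from the argument above.

```python
# A minimal probing sketch (illustrative assumptions: GPT-2, toy prompts/labels).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Toy prompts with a simple property we hope is linearly readable from the
# hidden states (here: whether the sentence is about an animal).
prompts = [
    "The cat sat on the mat",
    "The stock market fell today",
    "A dog barked at the mailman",
    "The committee approved the budget",
]
labels = [1, 0, 1, 0]

feats = []
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        out = model(**ids)
        # Mean-pool the final hidden layer as a crude summary "state".
        feats.append(out.hidden_states[-1].mean(dim=1).squeeze(0).numpy())

# If a linear probe separates the classes, the model carries a stable,
# decodable informational state about the property.
probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print("probe accuracy on the toy set:", probe.score(feats, labels))
```

Even if the probe nails this toy set, all that shows is a decodable internal state, which is exactly the gap I'm pointing at: decodable is not the same as believed.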
So Why Exactly Does This Matter for Alignment?
Well, at first glance, maybe it doesn't?
Okay, it does, and yes, I know I've just put you through another thought experiment about sentient AI, but hey, it does matter to think about whether we're (pinch - yup!) alive! And whether we know and believe. Because if we can't keep clear on whether and why those things are true for us, versus whether they're true for Claude or Gemini, then we lose our clarity in defending our own right to exist.
Yudkowsky and others have rightly pointed out that human beings need to make AI "go well" for the human race. That's a thesis I proudly support, and Effective Altruism (EA) as a philosophy is something I increasingly identify with. I'll go out on a limb, however, and posit that without sufficient interrogation of how we reason to begin with, making AI go well won't matter, because we won't have any foundational epistemic ground on which to stand.
TLDR: I think knowing we know matters tremendously, perhaps even more than building safe AI. No, not more than, but it's inherently, indispensably connected to it.
We can't salvage anything about AI safety and governance without first stating our understanding of what I call the new "Trinitaria Humanis," this "being, knowing, believing." I'll further argue that it impacts our ability to meaningfully engage with the "Trinitaria Automata." Who would even view as credible a bunch of humans who couldn't explain their own sentience but spent hours arguing that the innermost circuitry of advanced AI demonstrated ABC because of DNN XYZ?
I sure wouldn't, and honestly, neither should you.
P.S. I wrote a much longer version of this for the EA Forum.