Dario Amodei’s recent article The Urgency of Interpretability offers a timely and important argument: if we are to guide the development of increasingly powerful AI systems responsibly, we must be able to understand their internal processes before they reach levels of capability that could outpace human control. I strongly agree with this view, and I am encouraged by Amodei’s call for broader engagement, particularly from the academic community.
As someone working at the intersection of history, epistemology, and the public understanding of knowledge, I would like to suggest that while interpretability is indeed essential, it may not be sufficient. In a recent article for the LSE Business Review (An epistemic solution to do away with our illusion of AI objectivity), I explored the idea that the future of AI governance must also consider the epistemic structures AI systems are embedded in — how they affect human knowledge creation, verification, and trust.
Now, I would like to build on Amodei’s argument, proposing that alongside technical interpretability, we also need to think carefully about what I call epistemic alignment: ensuring that AI systems not only behave safely internally but also contribute to the broader infrastructure that sustains trustworthy knowledge in democratic societies.
To begin, it is important to recognise just how central the goal of interpretability is — not only for safety, but for any serious attempt to align AI development with human values.
Interpretability: A Foundation, But Not Enough
Amodei rightly points out that without the ability to understand and predict the internal processes of powerful AI systems, we risk building agents whose goals and behaviours could diverge from human interests in ways we cannot detect or correct. Interpretability is therefore the foundation upon which any form of responsible AI oversight must be built. Without it, both technical alignment and broader societal trust would be impossible.
Yet interpretability alone addresses only one layer of the challenge. Even if we fully succeed in developing tools that reveal what an AI system is “thinking” or planning, this would not guarantee that the system’s outputs support a healthy epistemic environment — one where knowledge remains contestable, verifiable, and open to revision.
Throughout history, philosophers and social thinkers — from sceptical traditions to Karl Popper’s critical rationalism and beyond — have shown that knowledge does not emerge from certainty, but from structured doubt, collective scrutiny, and institutional norms. Trustworthy knowledge depends not only on transparency, but on the resilience of the social processes that test and refine it.
In this light, I would suggest that alongside technical interpretability, we also need to think carefully about epistemic alignment: the extent to which AI systems reinforce or weaken the fragile human practices through which reliable knowledge is created and maintained.
If we focus only on whether AI systems are internally safe and transparent, but neglect the environments they shape externally, we risk solving one problem while leaving another — perhaps even deeper — unattended. Alignment must extend beyond internal safety into active epistemic responsibility — designing AI systems that reinforce, rather than erode, the conditions for reliable knowledge in democratic societies.
If we accept that epistemic alignment must become part of the conversation about AI governance, the next step is to begin imagining what such an epistemic infrastructure might look like in practice.
Toward an Epistemic Infrastructure for AI
Although I come to this conversation not as a technologist, but from the perspective of historical and epistemic studies, several possible directions suggest themselves for how AI systems might better support, rather than destabilise, public knowledge ecosystems.
First, AI systems could be developed to explicitly communicate the degree of uncertainty in their outputs. Rather than presenting every answer with the same surface-level confidence, models could signal when claims are tentative, debated, or require further verification — much as scholars and scientists do when engaging responsibly with knowledge. Encouraging a culture of epistemic humility in AI outputs would help users maintain a critical stance rather than assuming unwarranted certainty.
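To give a rough sense of what this could look like in practice, the short Python sketch below (offered purely as an illustration, not a description of any existing system) imagines an output that carries an explicit label of its epistemic standing; every name and category in it is a hypothetical assumption of mine.

```python
from dataclasses import dataclass
from enum import Enum


class EpistemicStatus(Enum):
    """Hypothetical labels a system might attach to a claim."""
    WELL_ESTABLISHED = "well-established"
    TENTATIVE = "tentative"
    CONTESTED = "contested"
    NEEDS_VERIFICATION = "needs verification"


@dataclass
class QualifiedClaim:
    """A generated claim paired with an explicit signal of its epistemic standing."""
    text: str
    status: EpistemicStatus
    caveat: str = ""

    def render(self) -> str:
        # Surface the uncertainty alongside the claim rather than hiding it.
        note = f" ({self.caveat})" if self.caveat else ""
        return f"[{self.status.value}] {self.text}{note}"


claim = QualifiedClaim(
    text="The earliest recorded use of the term dates to the 1920s.",
    status=EpistemicStatus.TENTATIVE,
    caveat="secondary sources disagree; further verification needed",
)
print(claim.render())
```

Even something this simple would change how an answer reads: the caveat stays attached to the claim, instead of the claim hardening into apparent fact.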
Second, building on ideas such as Amodei’s proposal for an “AI-MRI” — tools for inspecting internal model representations — we might also develop mechanisms for disclosing the provenance of AI-generated information. Rather than focusing solely on internal safety, these efforts could also serve an epistemic purpose: tracing the sources, influences, and training materials behind outputs, allowing users to critically evaluate credibility, much as historians assess the origins and biases of evidence.
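Again purely as an illustration, the following Python sketch imagines what a provenance record attached to a generated claim might contain. The structure, field names, and example sources are my own hypothetical assumptions, not features of any current model or of Amodei’s proposal.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """A hypothetical trace of the material behind a generated claim."""
    source: str        # e.g. a corpus, document, or citation
    source_type: str   # e.g. "training material", "retrieved document", "user-provided"
    note: str = ""     # known limitations or biases, much as a historian would record them


@dataclass
class AttributedOutput:
    """A generated answer bundled with the provenance a reader needs to assess it."""
    text: str
    provenance: list[ProvenanceRecord] = field(default_factory=list)


answer = AttributedOutput(
    text="The treaty was widely unpopular among contemporaries.",
    provenance=[
        ProvenanceRecord(
            source="digitised newspaper archive, 1919-1925",
            source_type="retrieved document",
            note="urban press only; rural opinion under-represented",
        )
    ],
)
for record in answer.provenance:
    print(f"{record.source_type}: {record.source} ({record.note})")
```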
Third, AI systems could be designed to make knowledge contestable in structured ways. Instead of presenting outputs as final or unquestionable, systems might allow users — and especially institutions committed to epistemic integrity — to flag, annotate, and revise AI-generated claims over time. Models could display when particular outputs are under dispute, reflect alternative interpretations, and preserve a visible history of corrections and challenges. This would not aim to replicate full peer review inside AI itself, but to embed AI outputs into a broader human epistemic process — one where knowledge evolves through scrutiny, disagreement, and self-correction.
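As a final sketch, under the same caveats, the Python fragment below imagines a claim that keeps its record of challenges visible, so that a reader can see at a glance whether it is under dispute. The reviewing institution and dates are invented for the example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Challenge:
    """A recorded objection raised against a claim, and whether it has been resolved."""
    raised_by: str     # e.g. a scholarly body committed to epistemic integrity
    raised_on: date
    objection: str
    resolved: bool = False


@dataclass
class ContestableClaim:
    """A claim whose history of scrutiny stays visible alongside the claim itself."""
    text: str
    challenges: list[Challenge] = field(default_factory=list)

    @property
    def under_dispute(self) -> bool:
        return any(not c.resolved for c in self.challenges)

    def summary(self) -> str:
        flag = "under dispute" if self.under_dispute else "no open challenges"
        return f"{self.text} [{flag}; {len(self.challenges)} challenge(s) on record]"


claim = ContestableClaim(text="The manuscript was composed in the late twelfth century.")
claim.challenges.append(
    Challenge(
        raised_by="a (hypothetical) palaeography review board",
        raised_on=date(2025, 3, 1),
        objection="script features suggest an early thirteenth-century date",
    )
)
print(claim.summary())
```

The point of such a record would not be the data structure itself, but the norm it encodes: that a claim and the scrutiny it has received travel together.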
None of these ideas are final answers or technical blueprints. They are, rather, suggestions for a direction of thought: a call to begin designing AI systems that are not only safe in their internal behaviour, but also epistemically responsible in their societal impact.
Conclusion
Dario Amodei’s call for urgent work on AI interpretability is both timely and necessary. Without interpretability, no deeper alignment is possible. But if we are serious about the future of human knowledge, we should go one step further.
AI systems will not merely be tools — they will be participants in our epistemic landscape. Whether they strengthen our ability to establish what is true, or quietly undermine it, will depend on the choices we make now.
I do not claim to know the answers. But I believe we must ask the right questions — and ask them together, across disciplines, traditions, and ways of knowing.
(Note: This essay was developed with the assistance of AI-based tools, particularly for drafting and editing support.)