🔹 Background:
Lawyer by education, researcher by vocation.
Substack: Stress-Testing Reality Limited
Started in data protection & privacy, now specializing in Privacy Engineering and Privacy-Enhancing Technologies (PETs). Pivoted to AI governance and AI Safety-adjacent research after realizing that compliance frameworks alone won’t ensure AI remains contestable and user-controllable.
🔹 Current Work: AI Safety for AI Governance
I work for a multinational as part of its Responsible AI team. I also carry out independent policy research, focusing on bridging the gap between AI regulation and safety research for effective governance.
My main interest lies at the intersection of AI governance, privacy engineering, and alignment-adjacent control mechanisms, particularly:
🔹 Why LessWrong?
🔹 Let’s Connect If:
Yes, yes... me, a **lawyer**, posting on LW feels like:
I was reading "The Urgency of Interpretability" by Dario Amodei, and the following part made me think about our discussion.
"Second, governments can use light-touch rules to encourage the development of interpretability research and its application to addressing problems with frontier AI models. Given how nascent and undeveloped the practice of “AI MRI” is, it should be clear why it doesn’t make sense to regulate or mandate that companies conduct them, at least at this stage: it’s not even clear what a prospective law should ask companies to do. But a requirement for companies to transparently disclose their safety and security practices (their Responsible Scaling Policy, or RSP, and its execution), including how they’re using interpretability to test models before release, would allow companies to learn from each other while also making clear who is behaving more responsibly, fostering a “race to the top”."
I agree with you that, at this stage, a regulatory requirement to disclose the interpretability techniques used to test models before release would not be very useful for outcome-based CoPs.
But I hope there is a path forward for this approach in the near future.
What AI safety researchers should weigh in on:
- Whether training compute is a sufficient proxy for generality or risk, and what better metrics might exist (a toy sketch of the current compute-threshold presumption follows after this list).
- How to define and detect emergent capabilities that warrant reclassification or new evaluations.
- What kinds of model evaluations or interpretability audits should qualify as systemic risk mitigation.
- How downstream fine-tuning workflows (RLHF, scaffolding, etc.) may create latent alignment risk even at moderate compute scales.
- How the Code of Practice could embed meaningful safety standards beyond documentation, e.g. through commitments to research mechanistic transparency, continuous monitoring, and post-deployment control mechanisms.
This is where I'd personally hope that anyone giving feedback focuses their attention.
Even just to bring the right strategies to the attention of policy spheres.
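To make the compute-proxy point in the first bullet concrete: the AI Act (Art. 51(2)) presumes systemic risk for a general-purpose AI model once its cumulative training compute exceeds 10^25 FLOPs. Here is a minimal, illustrative Python sketch of that presumption; the model names and compute figures are hypothetical, and the point is precisely how little the check captures:

```python
# Minimal sketch of the AI Act's compute-threshold presumption (Art. 51(2)):
# a GPAI model is presumed to pose systemic risk if its cumulative training
# compute exceeds 10**25 FLOPs. Model names and figures below are hypothetical.

SYSTEMIC_RISK_FLOP_THRESHOLD = 1e25  # presumption threshold in the AI Act

def presumed_systemic_risk(training_flops: float) -> bool:
    """True if a model would be presumed to pose systemic risk
    under the compute-only presumption."""
    return training_flops > SYSTEMIC_RISK_FLOP_THRESHOLD

# Why compute alone is a crude proxy: the check says nothing about
# post-training (RLHF, scaffolding) or emergent capabilities below the cutoff.
hypothetical_models = {
    "frontier_pretrain": 3e25,   # flagged by the presumption
    "mid_scale_base": 8e24,      # not flagged, even if heavily fine-tuned later
}

for name, flops in hypothetical_models.items():
    print(f"{name}: presumed systemic risk = {presumed_systemic_risk(flops)}")
```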
Please, be thorough. I've provided a breakdown of the main points in the Targeted consultation document, but I'd recommend looking at the Safety and Security section of the Third draft of the General-Purpose AI Code of Practice.
It can be downloaded here: Third Draft of the General-Purpose AI Code of Practice published, written by independent experts | Shaping Europe’s digital future
Thanks a lot for your follow-up. I'd love to connect on LinkedIn if that's okay; I'm very grateful for your feedback!
I'd say: "I believe that more feedback from alignment and interpretability researchers is needed" instead. Thoughts?
Thank you, Ariel! I guess I've let my personal opinion shine through. I don't see many regulatory efforts that articulate alignment or interpretability needs, or that translate them into actionable compliance requirements. The AI Act, for example, mentions alignment only vaguely.
And as far as I've seen, the third draft of the Codes (Safety & Security) mentions alignment/misalignment in a "this may be relevant to include" tone rather than providing specifics as to how GenAI providers are expected to document individual misalignment risks and appropriate mitigation strategies.
And interpretability / mech interp is not mentioned at all, not even in the context of model explainability or transparency.
This is why I hoped to see feedback from this community: to know whether I'm overstating my concerns.
My understanding is that they expressed willingness to sign, but lobbying efforts on their side are still ongoing, as is the entire negotiation.
The only big provider I've heard of that has explicitly refused to sign is Meta: EIPA in Conversation With - Preparing for the EU GPAI Codes of Practice (somewhere around minute 34 to 38).
Way to go! :D The important thing is that you've realized it. If you already get those enquiries naturally, you're halfway there: people already know you and reach out to you without you having to promote your expertise. Best of luck!
OpenAI, Anthropic, and Google DeepMind are already the main signatories to these Codes of Practice.
So whatever is agreed and negotiated is what will impact frontier AI companies. That is the problem.
I'd love to see specific criticisms from you on sections 3, 4 or 5 of this post! I am happy to provide feedback myself based on useful suggestions that come up in this thread.
It will probably be lengthy, but thank you very much for contributing! DM me if you come across any "legal" questions about the AI Act :).
Hey Neel! I just wanted to say thank you for writing this. It's honestly one of the most grounded and helpful takes I've seen in a while. I really appreciate your pragmatism, and the way you frame interpretability as a useful tool that still matters for early transformative systems (and real-world auditing!).
Quick question: do you plan to share more resources or thoughts on how interpretability can support black-box auditing and benchmarking for safety evaluations? I’m thinking a lot about this in the context of the General-Purpose AI Codes of Practice and how we can build technically grounded evaluations into policy frameworks.
Thanks again!