FYI, I wouldn't say at all that AI safety is under-represented in the EU (if anything, it would be easier to argue the opposite). Many safety orgs (including mine) supported the Codes of Practice, and almost all the Chairs and Vice-Chairs are respected governance researchers. It's probably still good for people to give feedback; I just don't want to give the impression that this is a neglected area.
Also, as far as I know, no public statement of intent to sign the code has been made. That said, apart from the copyright section, most of it is in line with RSPs, which makes signing more reasonable.
Thank you, Ariel! I guess I've let my personal opinion shine through. In general, I do not see many regulatory efforts that articulate alignment or interpretability requirements or translate them into actionable compliance obligations. The AI Act mentions alignment only vaguely, for example.
And as far as I saw, the third draft of the Codes (Safety & Security) mentions alignment / misalignment with a "this may be relevant to include" tone rather than providing specifics as to how GenAI providers are expected to document individual misalignment risks and appropriate mitigation strategies.
And interpretability / mech interp is not mentioned at all, not even in the context of model explainability or transparency.
This is why I hoped to see feedback from this community: to find out whether I am overstating my concerns.
I'd suggest updating the language in the post to clarify things and not overstate :)
Regarding the 3rd draft: opinions varied among the people I work with, but we are generally happy. Loss of Control is included in the selected systemic risks, as is CBRN. Appendix 1.2 also has useful things, though some valid concerns were raised there about compatibility with the AI Act's language that still need tweaking (possibly merging parts of 1.2 into the selected systemic risks). As for interpretability, the code is meant to be outcome-based, and the main reason evals are mentioned is that they are in the Act. Prescribing interpretability isn't something the code can do, and it probably shouldn't, since these techniques aren't good enough yet to be mandated for mitigating systemic risks.
I was reading "The Urgency of Interpretability" by Dario Amodei, and the following part made me think about our discussion.
"Second, governments can use light-touch rules to encourage the development of interpretability research and its application to addressing problems with frontier AI models. Given how nascent and undeveloped the practice of “AI MRI” is, it should be clear why it doesn’t make sense to regulate or mandate that companies conduct them, at least at this stage: it’s not even clear what a prospective law should ask companies to do. But a requirement for companies to transparently disclose their safety and security practices (their Responsible Scaling Policy, or RSP, and its execution), including how they’re using interpretability to test models before release, would allow companies to learn from each other while also making clear who is behaving more responsibly, fostering a “race to the top”."
I agree with you that, at this stage, a regulatory requirement to disclose the interpretability techniques used to test models before release would not be very useful for outcome-based CoPs. But I hope there is a path forward for this approach in the near future.
Thanks a lot for your follow-up. I'd love to connect on LinkedIn if that's okay; I'm very grateful for your feedback!
I'd say: "I believe that more feedback from alignment and interpretability researchers is needed" instead. Thoughts?
Sure! And regarding edits: I have not gone through the full request for feedback yet; I expect to have a better sense late next week of which contributions are most needed and how to prioritize. I mainly wanted to comment first on the obvious things that stood out to me from the post.
There is also an Evals workshop in Brussels on Monday where we might learn more. I know of some non-EU-based technical safety researchers who are attending, which is great to see.
Thanks for sharing this. Based on the About page, my 'vote' as an EU citizen working in an ML/AI position could conceivably count for a little more, so it seems worth doing. I'll put it in my backlog and aim to get to it on time (it does seem like a lengthy task).
It will probably be lengthy, but thank you very much for contributing! DM me if you come across any "legal" questions about the AI Act :)
What AI safety researchers should weigh in on:
- Whether training compute is a sufficient proxy for generality or risk, and what better metrics might exist.
- How to define and detect emergent capabilities that warrant reclassification or new evaluations.
- What kinds of model evaluations or interpretability audits should qualify as systemic risk mitigation.
- How downstream fine-tuning workflows (RLHF, scaffolding, etc.) may create latent alignment risk even at moderate compute scales.
- How the Code of Practice could embed meaningful safety standards beyond documentation. E.g., through commitments to research mechanistic transparency, continuous monitoring, and post-deployment control mechanisms.
This is where I'd personally hope that anyone giving feedback focuses their attention.
Even just to bring the right strategies to the attention of policy spheres.
Please, be thorough. I've provided a breakdown of the main points in the Targeted consultation document, but I'd recommend looking at the Safety and Security section of the Third draft of the General-Purpose AI Code of Practice.
It can be downloaded here: Third Draft of the General-Purpose AI Code of Practice published, written by independent experts | Shaping Europe’s digital future
In an ideal world, well-meaning regulation coming from the EU could become a global standard and really make a difference. In reality, however, I see little value in EU-specific regulations like these. They are unlikely to impact frontier AI companies such as OpenAI, Anthropic, Google DeepMind, xAI, and DeepSeek, all of which are based outside the EU. These firms might accept the cost of exiting the EU market if regulations become too burdensome.
While the EU market is significant, in a fast-takeoff, winner-takes-all AI race (as outlined in the AI-2027 forecast), market access alone may not sway these companies’ safety policies. Worse, such regulations could backfire, locking the EU out of advanced AI models and crippling its competitiveness. This could deter other nations from adopting similar rules, further isolating the EU.
As an EU citizen, I view the game theory in an "AGI-soon" world as follows:
Alignment Hard
EU imposes strict AI regulations → Frontier companies exit the EU or withhold their latest models, continuing the AI race → Unaligned AI emerges, potentially catastrophic for all, including Europeans. Regulations prove futile.
Alignment Easy
EU imposes strict AI regulations → Frontier companies exit the EU, continuing the AI race → Aligned AI creates a utopia elsewhere (e.g., the US), while the EU lags, stuck in a technological "stone age."
Both scenarios are grim for Europe.
I could be mistaken, but the current US administration and leaders of top AI labs seem fully committed to a cutthroat AGI race, as articulated in situational awareness narratives. They appear prepared to go to extraordinary lengths to maintain supremacy, undeterred by EU demands. Their primary constraints are compute and, soon, energy - not money! If AI becomes a national security priority, access to near-infinite resources could render EU market losses a minor inconvenience. Notably, the comprehensive AI-2027 forecast barely mentions Europe, underscoring its diminishing relevance.
For the EU to remain significant, I see two viable strategies:
OpenAI, Anthropic, and Google DeepMind are already the main signatories to these Codes of Practice.
So whatever is agreed and negotiated is what will impact frontier AI companies. That is the problem.
I'd love to see specific criticisms from you on sections 3, 4 or 5 of this post! I am happy to provide feedback myself based on useful suggestions that come up in this thread.
Do you have any public evidence that OpenAI, Anthropic and Google DeepMind will sign?
From my perspective, this remains uncertain and will likely depend on several factors, including the position of the US government and the final code's content (particularly regarding measures that are unpopular among companies, such as the independent third-party assessment in Measure 11).
My understanding is that they expressed willingness to sign, but lobbying efforts on their side are still ongoing, as is the entire negotiation.
The only big provider I've heard has explicitly refused to sign is Meta: EIPA in Conversation With - Preparing for the EU GPAI Codes of Practice (somewhere between minutes 34 and 38).
The European AI Office is currently writing the rules for how general-purpose AI (GPAI) models will be governed under the EU AI Act.
They are explicitly asking for feedback on how to interpret and operationalize key obligations under the AI Act.
This includes the thresholds for systemic risk, the definition of GPAI, how to estimate training compute, and when downstream fine-tuners become legally responsible.
The largest labs (OpenAI, Anthropic, Google DeepMind) have already expressed willingness to sign on to the Codes of Practice voluntarily.
These codes will become the de facto compliance baseline, and potentially a global reference point.
I believe that more feedback from alignment and interpretability researchers is needed
Input is urgently needed to ensure the guidelines reflect more specific concerns around misalignment, loss of control, emergent capabilities, robust model evaluation, and the need for interpretability audits.
Key intervention points include how "high-impact capabilities" are defined, what triggers systemic risk obligations, and how documentation, transparency, and ongoing risk mitigation should be operationalized for frontier models.
Without this input, we risk locking in a governance regime that optimizes for PR risk (copyright and vague definitions of bias), not existential or alignment-relevant risk.
This is a rare opportunity for independent voices to influence what responsible governance of frontier models should actually require. If left unchallenged, vague obligations could crystallize into compliance practices that are performative... what I understand as fake.
You do not need to be a European citizen: anyone with relevant expertise can provide feedback (but please make sure to select the correct category under "Which stakeholder category would you consider yourself in?").
📅 Feedback is open until 22 May 2025, 12:00 CET.
🗳️ Submit your response here
I haven’t yet seen a technical, safety-focused summary of what the GPAI Codes of Practice are actually aiming to regulate, so I’ve put one together.
I hope it’s useful to the AI safety community. Since the full breakdown is long, here’s a TL;DR:
The Commission’s upcoming guidelines will define how the EU interprets the obligations for general-purpose AI providers under the AI Act. The guidelines will cover:
The AI Act defines a "general-purpose AI model" as a foundation model that displays significant generality: it can perform a wide range of tasks and is used downstream across multiple applications.
Critically, the EU is trying to separate GPAI models from full AI systems, making it clear that the underlying model (e.g. LLaMA, GPT-4) can trigger obligations even if it's not wrapped in a user-facing product.
This means that training and release decisions upstream carry regulatory weight, even before fine-tuning or deployment.
Notably, models used exclusively for research, development, or prototyping are excluded, until they are released. Once placed on the market, obligations kick in.
Because there’s no mature benchmark for generality yet, the EU is anchoring its definition of GPAI on training compute. Their proposed threshold:
A model is presumed to be GPAI if it can generate text or images and was trained with >10²² FLOP.
This threshold acts as a presumption of generality.
If your model meets this compute threshold and has generative capacity, the EU assumes it can perform many distinct tasks and should be governed as a GPAI model.
However, the presumption is rebuttable: you can argue your model is too narrow despite the compute, or that a low-compute model has generality due to capabilities.
This is a crude but actionable standard. The EU is effectively saying:
Examples show how this might play out:
This marks the first time a training compute threshold is being proposed as a regulatory signal for generality.
While imperfect, it sets a precedent for model-based regulation, and may evolve into a cornerstone of GPAI classification across other jurisdictions.
It also suggests a future where training disclosures and FLOP estimates become a key part of legal compliance.
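To make the presumption concrete, here is a minimal sketch of how the check could look in code. Only the >10²² FLOP figure, the text-or-image condition, and the rebuttable nature of the presumption come from the draft guidelines; the function and the example numbers are mine.

```python
# Illustrative only: the constant reflects the proposed presumption threshold;
# the function name and structure are my own.

GPAI_PRESUMPTION_FLOP = 1e22  # proposed threshold for presumed generality


def is_presumed_gpai(training_flop: float, generates_text_or_images: bool) -> bool:
    """A model is presumed GPAI if it is generative (text or images) and was
    trained with more than ~1e22 FLOP. The presumption is rebuttable either way."""
    return generates_text_or_images and training_flop > GPAI_PRESUMPTION_FLOP


print(is_presumed_gpai(training_flop=3e23, generates_text_or_images=True))  # True: presumed GPAI
print(is_presumed_gpai(training_flop=5e21, generates_text_or_images=True))  # False: presumption not triggered
```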
The EU is trying to draw a regulatory line between what counts as a new model versus a new version of an existing model.
This matters because many obligations are triggered per model, so if you release a "new" model, you might have to redo everything.
The preliminary approach is simple:
A “distinct model” starts with a new large pre-training run.
Everything based on that run (fine-tunes, upgrades, or checkpoints) is a model version.
However, if a model update by the same provider uses a large amount of compute (defined as >⅓ of the original model’s threshold), and it significantly changes the model’s risk profile, then it might count as a new model even if it's technically a version.
The thresholds are:
This distinction has direct implications for:
If this holds, providers could:
This is effectively a compliance modularity rule. It lets labs scale governance across model variants, but still holds them to task if new versions introduce emergent risk.
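For illustration, a rough sketch of how the version-vs-new-model heuristic could be encoded. The one-third figure comes from the draft text above; treating "significantly changes the risk profile" as a boolean judgement call, and the function itself, are my own simplifications.

```python
# Illustrative sketch of the "new model vs. model version" heuristic described above.
# The 1/3 figure comes from the draft; modelling the risk-profile change as a
# boolean judgement call is my simplification.

def classify_update(update_flop: float,
                    original_threshold_flop: float,
                    significantly_changes_risk_profile: bool) -> str:
    """Classify an update by the same provider as a distinct model or a model version."""
    large_update = update_flop > original_threshold_flop / 3
    if large_update and significantly_changes_risk_profile:
        return "potentially a new distinct model (per-model obligations may re-trigger)"
    return "model version of the original pre-training run"


# A light fine-tune using well under a third of the relevant threshold stays a version.
print(classify_update(update_flop=1e21,
                      original_threshold_flop=1e22,
                      significantly_changes_risk_profile=False))
```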
For safety researchers, this section could be leveraged to advocate for stronger triggers for reclassification, especially in the face of rapid capability shifts from relatively small training updates.
The EU distinguishes between providers (entities that develop or significantly modify general-purpose AI models) and deployers (those who build or use AI systems based on those models).
There’s special attention here on downstream actors: those who fine-tune or otherwise modify an existing GPAI model.
The guidelines introduce a framework to determine when downstream entities become providers in their own right, triggering their own set of obligations.
Key scenarios where an entity is considered a provider:
Even collaborative projects can count as providers, usually via the coordinator or lead entity.
This is one of the most consequential parts of the guidelines for open-source model ecosystems and fine-tuning labs.
The EU draws a line between minor modifications (e.g., light fine-tunes) and substantial overhauls that make you legally responsible for the resulting model.
You’re presumed to become a new “provider” (with specific compliance obligations) if:
In this case, you are only responsible for the modification, not the full model: meaning your documentation, data disclosure, and risk assessment duties apply only to the part you changed.
Things get more serious if you are modifying or contributing to a model that crosses the systemic risk threshold (currently 10²⁵ FLOP total training compute).
You’re treated as a new provider of a GPAI model with systemic risk if:
In these cases:
While no current modifications are assumed to meet this bar, the guidelines are explicitly future-proofing for when downstream players (including open-source projects) have access to higher compute.
The AI Act includes limited exemptions for open-source GPAI models, but only if specific criteria are met.
Critically, monetized distribution invalidates the exemption: this includes charging for access, offering paid support services, or collecting user data for anything other than basic interoperability or security.
The EU is formalizing training compute as a trigger for systemic risk obligations, creating a scalable governance mechanism that doesn't rely on subjective capability benchmarks.
While I am not fully convinced that this is the best benchmark, these rules open the door to continuous monitoring and reporting infrastructure for high-compute training runs, and I think that's something alignment researchers could potentially build around.
If your model crosses certain FLOP thresholds, you're presumed to be developing:
To apply these thresholds, the EU is proposing methods to estimate compute, and defining when you need to notify the Commission if you're approaching or crossing them.
The EU outlines two accepted methods:
For transformers:
FLOP ≈ 6 × Parameters × Training Examples
Either method is valid. Providers choose based on feasibility, but must document assumptions if using approximations (e.g., for synthetic data generation).
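As a worked example of the architecture-based approximation, here is a short sketch applying the 6 × parameters × training examples rule and comparing the result against the two thresholds discussed in this post. The parameter and example counts are placeholders, not estimates for any real model.

```python
# Placeholder numbers only: a worked example of FLOP ≈ 6 × parameters × training examples,
# compared against the two thresholds discussed in this post.

def approximate_training_flop(n_parameters: float, n_training_examples: float) -> float:
    """Architecture-based approximation for transformer training compute."""
    return 6 * n_parameters * n_training_examples


flop = approximate_training_flop(n_parameters=70e9, n_training_examples=2e12)
print(f"Estimated training compute: {flop:.2e} FLOP")  # ≈ 8.40e+23 FLOP
print("Presumed GPAI (>1e22)?", flop > 1e22)           # True
print("Systemic risk (>1e25)?", flop > 1e25)           # False
```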
The “cumulative compute” for regulatory purposes includes:
It does not include:
This cumulative total determines whether a model passes the systemic risk threshold (10²⁵ FLOP).
The guidance also covers model compositions. E.g., Mixture-of-Experts architectures must include compute from all contributing models.
You are expected to estimate compute before the large pre-training run begins (based on planned GPU allocations or token counts). If you're not above the threshold at first, you're still required to monitor ongoing compute usage and notify the EU Commission if you cross the threshold later.
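Here is a hypothetical sketch of what that obligation could look like operationally: estimate before the run, keep a running total during it, and flag when a notification would be needed. Only the 10²⁵ FLOP figure and the estimate-then-monitor expectation come from the guidance; the class, its names, and the notification logic are invented for illustration.

```python
# Hypothetical sketch: only the 1e25 FLOP figure and the "estimate before the run,
# monitor during it, notify if crossed" expectation come from the guidance above.

SYSTEMIC_RISK_FLOP = 1e25


class ComputeTracker:
    """Tracks planned and cumulative training compute against the systemic-risk threshold."""

    def __init__(self, planned_flop: float):
        self.cumulative_flop = 0.0
        self.notified = False
        if planned_flop > SYSTEMIC_RISK_FLOP:
            self._notify("planned run is expected to exceed the systemic-risk threshold")

    def log_training_step(self, step_flop: float) -> None:
        self.cumulative_flop += step_flop
        if not self.notified and self.cumulative_flop > SYSTEMIC_RISK_FLOP:
            self._notify("cumulative compute has crossed the systemic-risk threshold")

    def _notify(self, reason: str) -> None:
        # Placeholder for whatever internal process triggers a Commission notification.
        self.notified = True
        print(f"Notification required: {reason}")


tracker = ComputeTracker(planned_flop=8e24)  # below the threshold at planning time
tracker.log_training_step(6e24)
tracker.log_training_step(5e24)              # crossing the line triggers one notification
```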
Entry into application of the GPAI obligations: 2 August 2025
Providers who placed GPAI models on the market before that date have until 2 August 2027 to comply.
This includes documentation, risk assessment, and training data transparency, though retroactive compliance is not required if it would involve retraining, unlearning, or disproportionate effort.
This gives labs and developers a two-year compliance window, but only for models placed on the market before 2 August 2025. Anything new after that date must comply immediately.
Enforcement of the Code of Practice
The EU’s Code of Practice (CoP) will become the default pathway for demonstrating compliance with the AI Act’s obligations for general-purpose AI models. It’s not legally binding, but adhering to it comes with clear enforcement advantages:
Non-signatories must prove compliance through alternative methods, such as detailed reporting, gap analyses, or independent evaluations, and may face higher scrutiny.
Supervision by the AI Office
The AI Office will be the lead enforcement authority for all GPAI model obligations. Enforcement officially begins on 2 August 2026, following a one-year grace period.
The AI Office can:
Confidential business information and IP will be protected under Article 78, but the Commission is investing in a long-term regulatory infrastructure for evaluating frontier models.
If you have any questions or would like to discuss specific sections in more detail, feel free to comment below.
I’ll do my best to answer, either directly or by reaching out to contacts in the Working Groups currently drafting and negotiating the Codes of Practice.
You can also find additional contact details in my LessWrong profile.
If you're considering submitting feedback and would like to coordinate, compare notes, or collaborate on responses, please reach out! I'd be happy to connect.