Where We Are On Healthcare AI Safety and Regulation

ejri

A diagnostic look at how AI scribes, the most widely deployed clinical AI in the world right now, are slipping into permanent medical records under governance frameworks designed for office software.

In May 2026, the Office of the Auditor General of Ontario published a special report on AI use across the provincial government. The healthcare section is, as far as I can tell, the most concrete public-evidence base we have anywhere on AI-safety failures in a real clinical deployment pipeline. It is unusual in that it is not a research paper, not a media investigation, and not a vendor white paper. It is a procurement audit, with numbered findings, vendor-by-vendor results, and a paper trail of who approved what.

The headline findings, taken directly from the report:

All 20 approved AI scribe vendors showed at least one critical failure type in procurement testing.
9 of 20 systems (45%) fabricated information, including suggesting therapy referrals, ordering blood tests, or stating "no masses found" when none of these had been mentioned in the simulated consultations.
12 of 20 systems (60%) recorded the wrong drug compared to what the simulated physician prescribed.
17 of 20 systems (85%) missed key mental health details that were present in the recordings.
11 of 20 vendors did not submit third-party security audit reports (SOC 2 Type 2, HITRUST, or ISO 27001) despite this being required.
5 of 20 did not submit Threat Risk Assessments or Privacy Impact Assessments.
All 20 were approved as Vendors of Record anyway.

In the auditor's own words: "Inaccuracies in medical notes generated by AI Scribe systems could potentially result in inadequate or harmful treatment plans that may potentially impact patient health outcomes." And: "A comprehensive evaluation of vendors was not conducted to ensure that their AI Scribe systems mitigated the risk of bias."

This isn't a hypothetical about what AI might do. It's a description of what was already in the procurement pipeline at the time of audit.

The Ontario Minister responsible pushed back by noting that the software was still in a testing phase and that clinicians would always review the notes before clinical decisions are made. We'll come back to why that defense doesn't hold up.

The failure isn't that AI hallucinates

The obvious framing, and the one most coverage went with, is "AI scribes aren't safe yet." That framing is wrong, or at least it points at the wrong target.

Hallucination, omission, and wrong-entity substitution are expected behaviors of current generative models. We've known this for years. If you've spent any time with LLMs you know that asking one to faithfully transcribe and structure a conversation will, some non-trivial fraction of the time, produce confident-sounding text that wasn't in the source. That is a model property, not a procurement surprise.

The Ontario report does not primarily document a capability failure. It documents a systems failure. Look at how the procurement was actually constructed:

Criterion	Weighting
Domestic presence in Ontario	30%
Data privacy / legal controls	23% (of which "bias controls" = 2%)
System security controls	11% (of which TRAs/PIAs = 2%, SOC 2 = 4%)
Contract negotiation	10%
Clinical format / multi-speaker transcription	9%
Business (interface, training, pricing)	5%
Accuracy of medical notes generated	4%

Source: Supply Ontario RFB Stage 2 criteria, as published in the Auditor General's report.

A vendor could score zero on accuracy, zero on bias controls, and zero on system security and still clear the aggregate threshold of 371 points required for approval, because there were no minimum passing scores on any individual criterion. That isn't a hidden technicality; it's the entire design of the evaluation.

Read that table again. Whether the vendor has a Toronto office is worth 7.5 times as much (30% versus 4%) as whether the medical notes the system generates are accurate. This is what the safety community sometimes calls Goodhart by design: the metric you optimize is the metric you hit, and if the metric is "Ontario presence" you get Ontario presence, not accuracy.
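To make the arithmetic concrete, here is a minimal sketch in Python. The weights are the ones in the table above; everything else is an assumption for illustration. In particular, the listed weights sum to 92 rather than 100, and the point scale behind the 371 threshold isn't reproduced here, so HYPOTHETICAL_PASS_MARK and HYPOTHETICAL_FLOOR are placeholders, vendor scores are modeled as fractions between 0 and 1, and the 2 points for bias controls are assumed to sit inside the 23-point privacy row. The second pass function shows, for contrast, what per-criterion floors would do to the same vendor.

```python
# Weights in percentage points, as listed in the table above.
WEIGHTS = {
    "ontario_presence": 30,
    "privacy_legal_other": 21,  # assumed: 23 minus the 2 points for bias controls
    "bias_controls": 2,
    "system_security": 11,
    "contract_negotiation": 10,
    "clinical_format": 9,
    "business": 5,
    "accuracy": 4,
}

SAFETY_CRITERIA = ("accuracy", "bias_controls", "system_security")

# Placeholders only: the real pass mark (371 points) is on a scale not shown here.
HYPOTHETICAL_PASS_MARK = 70.0  # out of the 92 listed points
HYPOTHETICAL_FLOOR = 0.5       # minimum fraction required on each safety criterion


def aggregate_score(scores):
    """Weighted sum of per-criterion scores, each a fraction in [0, 1]."""
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)


def passes_aggregate_only(scores, pass_mark=HYPOTHETICAL_PASS_MARK):
    """Aggregate-only rule: the distribution of points does not matter."""
    return aggregate_score(scores) >= pass_mark


def passes_with_floors(scores, pass_mark=HYPOTHETICAL_PASS_MARK,
                       floor=HYPOTHETICAL_FLOOR):
    """Same aggregate rule, plus a minimum on every safety criterion."""
    meets_floors = all(scores.get(name, 0.0) >= floor for name in SAFETY_CRITERIA)
    return meets_floors and aggregate_score(scores) >= pass_mark


# Hypothetical vendor: full marks everywhere except the safety criteria.
vendor = {name: 1.0 for name in WEIGHTS}
for name in SAFETY_CRITERIA:
    vendor[name] = 0.0

presence_to_accuracy = WEIGHTS["ontario_presence"] / WEIGHTS["accuracy"]

print(presence_to_accuracy)           # 7.5
print(aggregate_score(vendor))        # 75.0 of the 92 listed points
print(passes_aggregate_only(vendor))  # True under the placeholder pass mark
print(passes_with_floors(vendor))     # False: zero on every safety criterion
```

The three safety-relevant rows together carry 17 of the 92 listed points, so a vendor can forfeit all of them and still keep most of the available score.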

So the system was not designed to detect or filter out unsafe tools. It was designed to procure software that has an Ontario presence and an acceptable contract. That it also happens to generate clinical documentation was, in evaluation terms, a rounding error.

If you tried to construct a procurement pipeline that reliably selects for impressive demos while systematically ignoring safety properties, this would be a reasonable approximation.

Seven specific failure modes

The Ontario case is useful because it isolates failure modes that are usually entangled. I'll list them separately, because the interventions for each are different.

1. Risk classification mismatch. AI scribes were procured as productivity software. But their outputs enter permanent medical records, are read by other clinicians making future decisions, and constitute the canonical legal account of what happened in a consultation. They are, functionally, clinical documentation systems. The generic-SaaS procurement frame is the wrong tool for the job.

2. Self-attestation without verification. Vendors were allowed to declare compliance with security and privacy requirements without supplying evidence. 11 of 20 did not submit third-party audit reports. They were approved anyway. The evaluation also did not assess whether submitted SOC 2 reports contained AI-relevant controls; it only checked whether the reports were submitted on time and whether they noted any exceptions. This selects for vendors who are good at claiming compliance, not vendors who are actually compliant.

3. No live demonstrations. Vendors were given simulated patient-clinician recordings, allowed to process them offline, and asked to submit the resulting notes for evaluation. The auditor explicitly flagged this as creating a risk that vendors could process recordings multiple times or alter outputs before submission. There was no observation of how the system behaves in real time, under real conditions, with real-world acoustics. This is best-of-n evaluation dressed up as live testing.

4. No minimum thresholds on safety-critical criteria. This is the structural keystone. In any high-assurance domain (aviation, nuclear, medical devices proper), safety dimensions are constraints, not weighted factors. You don't get to fail your engine-out test and pass on aesthetics. Ontario treated safety as a tradeoff dimension. As long as the total cleared 371, the distribution didn't matter.

5. No comprehensive bias evaluation. Vendors were asked to describe their bias-mitigation processes. They were not asked to provide bias-testing results, and the procurement team did no independent testing. In a country where accents are extremely diverse and where Indigenous data sovereignty (OCAP, CARE) imposes specific obligations, accepting a self-described process as evidence of bias mitigation is a meaningful regulatory gap.

6. Human-in-the-loop framed as a control rather than a known-degraded control. Ontario's defense, that doctors will review notes before clinical decisions are made, assumes a clean human-verification step. The Ministry's own implementation didn't require attestation that clinicians had actually verified outputs through a sign-off feature. Even if it had, there's a substantial literature on automation bias, time pressure, and behavioral drift: human-in-the-loop reliably degrades into human-as-rubber-stamp under sustained workflow pressure, especially when the AI output is plausible-sounding and time is short.

7. No post-deployment monitoring. The auditor found no evidence of additional testing or evaluation after procurement. So even setting aside whether the initial gate was adequate, there's no continuing observation of real-world performance, no random audits of clinical notes, no mechanism for removing underperforming vendors. The system was approved once, on the basis of one-time offline tests, and then deployed.

These aren't seven discoveries. They're seven angles on a single underlying pattern.
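Failure mode 3 is the easiest of these to quantify. As a rough sketch: suppose a system produces an error-free note on a given recording with probability q per run, and a vendor can run it n times and submit the best result. Assuming the runs are independent (an assumption, as are the illustrative values of q and n below, none of which come from the audit), the chance of submitting a clean note is 1 - (1 - q)^n.

```python
def best_of_n_clean_rate(q, n):
    """Probability that at least one of n independent runs is error-free.

    q is the per-run probability of an error-free note (illustrative input,
    not a figure from the audit); n is the number of attempts allowed.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - q) ** n


# A system that is clean only half the time looks very different
# once resubmission is allowed.
for attempts in (1, 3, 5, 10):
    print(attempts, round(best_of_n_clean_rate(0.5, attempts), 3))
# 1 0.5
# 3 0.875
# 5 0.969
# 10 0.999
```

Under these assumptions, an offline evaluation measures something closer to the best output a vendor can select than to what a clinician would see on a single real-time run.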

The underlying pattern: governance lag

The deeper pattern across the Ontario findings, and, I'd argue, across healthcare AI globally right now, is this:

AI capability is advancing faster than institutional risk classification.

Systems that should trigger high-assurance pathways are being routed through low-friction adoption pathways, because:

Procurement frameworks are generic and weren't designed to handle generative outputs.
Risk taxonomies haven't been updated to reflect what these systems actually produce.
Efficiency pressure favors fast adoption over rigorous evaluation.
No one is clearly accountable for misclassification.

Concretely, the same system gets handled as different things at different layers:

Procurement treats it like SaaS.
Deployment treats it like a pilot.
Day-to-day use treats it like an assistant.
Outcomes behave like clinical documentation infrastructure.

The mismatch is where risk accumulates.

What's happening in other provinces, and why "pilot" is doing a lot of work

Ontario at least had a formal procurement process: flawed, but formal. The situation in other provinces is structurally different and, I think, more concerning, even though it's getting much less public attention.

Formal procurement requires vendors to provide documentation about their solution for verification. That documentation is logged and is available for audits, such as the one behind the Auditor General's report.

Canada Health Infoway launched its national AI Scribe Program in June 2025, providing up to 10,000 fully funded one-year AI scribe licenses to primary care clinicians, including in BC. Adoption is happening fast. The framing, in BC and elsewhere, is "pilot."

Unlike formal procurement, this route often escapes scrutiny: vendors are not always required to provide documentation, verification, or certifications, or to pass validation or functional tests. It goes by Proof-of-Concept (PoC), pilot, or similar terms, and the software is often said to be validated within the testing environment without any formal process to ensure that it actually is.

Once users are comfortable enough with the solution, the PoC becomes a fully integrated solution, either through a follow-up Request for Proposal (RFP) in which the tool is already embedded in users' systems and processes, or by becoming a quasi-procured solution in practice.

There is a well-known pattern in government procurement that has a name: proof-of-concept creep. The shape is:

A tool is introduced as a small-scale pilot.
Pilots are exempt from the full RFP and regulatory scrutiny that production deployments would face.
The tool becomes embedded in clinician workflows.
Scaling happens implicitly (additional licenses, expanded sites, deeper integration) rather than through a fresh approval gate.
The system becomes de facto infrastructure without ever passing a production-grade evaluation.

In this pattern, "pilot" functions as a regulatory bypass. The label preserves the appearance of caution while removing the substance of it. Much of the documentation, verification, certification, validation, and functional-test evidence that formal procurement would generate never makes it into the audit logs.

Ontario shows what happens even when a formal procurement process exists. Other provinces suggest what happens when the dominant procurement model is "pilot for now, figure it out later."

Why fragmented Canadian governance makes this worse

Canada's healthcare governance is layered in a way that compounds the problem:

Federal level: regulates AI medical devices via Health Canada's Medical Devices Regulations and the February 2025 Pre-Market Guidance for Machine Learning-Enabled Medical Devices. It has no comprehensive AI legislation: AIDA (the Artificial Intelligence and Data Act) died when Parliament was prorogued on January 6, 2025. PIPEDA, the federal privacy law, was explicitly described by the Office of the Privacy Commissioner as "not optimized for an AI environment."
Provincial level: sets high-level policy and runs procurement frameworks. Quebec's Law 25 has the strongest automated-decision-making provisions in North America. Ontario's Bill 194 is the first AI-specific provincial law for public institutions. BC currently operates under PIPA, FIPPA, and the E-Health Act, with no AI-specific provisions.
Regional health authorities: actually deploy the tools, write the operational policies, and absorb the consequences first.

The frontline policy decisions are being made at the regional level. But regional health authorities, almost by definition, are not where AI safety experts, ML engineers, or AI policy specialists choose to work. They have neither the technical capacity to evaluate these systems independently nor the procurement leverage to demand vendor evidence.

So the practical decision-making collapses to one of two actors:

Vendors, implicitly, through product defaults and marketing claims;
Administrators, improvising policy based on the information they happen to have.

This isn't a criticism of the people involved. It's a structural delegation of high-stakes technical governance to actors who weren't resourced to perform it. And nothing about the current federal-provincial-regional structure looks likely to fix it on its own.

International comparison: same risk, different layer

Other jurisdictions are dealing with the same model behavior. The interesting variation is where in the system they're trying to intervene.

United Kingdom. NHS England has moved further than most. In late 2025 / early 2026 it published guidance on AI-enabled ambient scribing products and launched a national Ambient Voice Technology (AVT) Self-Certified Supplier Registry. To appear on the registry, suppliers must hold Class I Medical Device registration with the MHRA, comply with the DCB0129 clinical safety standard, complete a Data Protection Impact Assessment, and meet the Digital Technology Assessment Criteria (DTAC). The registry's own documentation is explicit that NHS England does not endorse listed suppliers and that procurement decisions remain with individual NHS bodies, but the bar to even be listed is substantially higher than what Ontario required. The BMJ subsequently advised NHS staff to stop using unregistered AI scribing tools, treating registration as a hard precondition for use.

This is still self-certification rather than independent verification, and it doesn't solve the underlying generative-model problems. But it does at least insist on medical-device classification, which routes the tools onto a regulatory track designed for clinical risk.

Singapore. In January 2026, IMDA published a Model AI Governance Framework specifically for Agentic AI (version 1.0), building on its earlier model framework. Singapore has also granted regulatory approval for at least one clinical AI scribe, i.e., it's treating these systems as something requiring a real approval pathway rather than as commodity software. The Singapore approach is closer to active permissioning under oversight than to audit after the fact.

United States. The dominant story is research evidence and media coverage rather than government audit. Published work has documented hallucinations in widely-used transcription models, including Whisper, and JAMA Health Forum has published analyses of unintended consequences of AI scribes. The FDA regulates AI medical devices via existing pathways (510(k), De Novo, PMA) and has authorized more than 882 AI/ML-enabled devices, but generative ambient-scribe tools largely sit outside that device pathway. There is no US equivalent of the Ontario audit.

The pattern. Everyone is dealing with the same model behavior. The differentiator is whether the governance system is designed to absorb that reality. Ontario, the UK, and Singapore are intervening at three different layers (procurement audit, supplier registry, regulatory permission), and each layer has different failure modes. None of the three has solved the problem. The UK approach buys you classification and documentation but still relies on self-certification. Singapore buys you tighter oversight but at the cost of slower diffusion. Ontario didn't buy you anything except a public record of what went wrong.

Why "human-in-the-loop" isn't enough

The most common defense of current AI scribe deployments, used by the Ontario Ministry in response to the audit, and standard everywhere, is that clinicians will review the notes before they enter the record or affect care. This is worth taking seriously rather than dismissing, because it is a real control under some conditions.

Here is why it isn't sufficient as the primary safety control:

Automation bias is well documented. Users systematically trust automated outputs more than warranted, especially when the output is fluent and structured. AI scribes produce fluent, structured output.
Time pressure degrades verification. Primary care visits in Canada average a few minutes. The marginal time the AI scribe is supposed to save is the same time that would be needed for thorough verification. The intervention undermines its own safeguard.
Subtle errors are hard to catch. A wrong drug is catchable if the clinician reads carefully. An omission, say, the missing mental health detail in 85% of Ontario's tested systems, is not catchable by reading the note alone, because the missing information isn't in the note to be flagged.
Behavioral drift over time. Clinicians' note-taking habits change once they rely on a scribe. They take fewer contemporaneous notes during the visit. They rely more on the generated summary. The control degrades with use, not because the AI gets worse but because the human's independent record gets thinner.
No enforced attestation. Ontario didn't require clinicians to attest, via a system-enforced sign-off, that they had verified the AI output. In practice, "the clinician will review it" was a guideline, not an enforced workflow gate.

So "human-in-the-loop" is a real control, but it's a degrading control, and one whose strength is being explicitly traded away in exchange for the productivity gains that justify the deployment in the first place.
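The difference between a guideline and an enforced workflow gate can be sketched in code. This is a minimal illustration under stated assumptions, not a description of any vendor's or EMR's actual interface: DraftNote, MedicalRecord, UnattestedNoteError, and the identifiers used below are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


class UnattestedNoteError(Exception):
    """Raised when a draft note is committed without clinician sign-off."""


@dataclass
class DraftNote:
    """An AI-generated note that has not yet entered the record (hypothetical)."""
    note_id: str
    text: str
    attested_by: Optional[str] = None
    attested_at: Optional[datetime] = None

    def attest(self, clinician_id: str) -> None:
        # Record who signed off and when, so the sign-off is auditable.
        self.attested_by = clinician_id
        self.attested_at = datetime.now(timezone.utc)


@dataclass
class MedicalRecord:
    """A stand-in for the EMR: notes enter only through commit()."""
    notes: List[DraftNote] = field(default_factory=list)
    audit_log: List[str] = field(default_factory=list)

    def commit(self, note: DraftNote) -> None:
        # The gate: an unattested note is refused, and the refusal is logged.
        if note.attested_by is None:
            self.audit_log.append(f"REJECTED {note.note_id}: no sign-off")
            raise UnattestedNoteError(note.note_id)
        stamp = note.attested_at.isoformat() if note.attested_at else "unknown time"
        self.audit_log.append(
            f"COMMITTED {note.note_id}: signed by {note.attested_by} at {stamp}"
        )
        self.notes.append(note)


record = MedicalRecord()
draft = DraftNote(note_id="note-001", text="(AI-generated draft)")

try:
    record.commit(draft)           # refused: nobody has signed off
except UnattestedNoteError:
    pass

draft.attest(clinician_id="clinician-42")
record.commit(draft)               # accepted and logged
print(len(record.notes), len(record.audit_log))  # 1 2
```

A gate like this only records that a sign-off occurred. It cannot show that the review behind it was careful, so it addresses the missing-attestation problem without touching automation bias or time pressure.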

Toward solutions: concrete intervention points

The diagnostic part of this is the long part. The constructive part is shorter, because most of the interventions follow mechanically once you've identified the failure mode. The Ontario Auditor General's own recommendations cover several of these, and Supply Ontario has agreed to implement them; the broader pattern generalizes.

1. Minimum thresholds on safety-critical criteria. Any procurement of AI for clinical documentation should require minimum passing scores on accuracy, bias controls, and security, independent of aggregate score. A vendor that fails any of those is disqualified, full stop. This is standard medical-device practice and should be standard for systems generating clinical documentation.

2. Mandatory live demonstrations under realistic conditions. Offline submission with vendor-controlled processing is not evaluation. Live demos with unseen recordings, real-time generation observed by evaluators, and no post-hoc editing should be required.

3. Third-party verification, not self-attestation, with AI-specific scope. Require SOC 2 Type 2 reports, ISO 27001 or HITRUST certification, and TRA/PIA submissions, but also require that the controls assessed include AI-specific risks (training-data governance, prompt-injection resilience, output validation, drift monitoring). A SOC 2 report that doesn't cover AI controls is a SOC 2 report about something else.

4. Bias evidence, not bias narratives. Vendors should provide bias-testing results across relevant demographic and linguistic axes: accent, dialect, age, gender, and (where consent permits and OCAP/CARE allow) Indigenous-language considerations. Where vendors can't provide them, procurement bodies should conduct independent testing before selection.

5. System-enforced clinician attestation. If human review is the safety control, it has to be a workflow gate, not a guideline. AI scribe systems should require a clinician sign-off before notes enter the EMR, with that sign-off logged and auditable. This is what the Ontario AG explicitly recommended.

6. Post-deployment monitoring. Annual re-evaluation of security and clinical performance. Random audits of real-world outputs against source consultations. A mechanism for de-listing vendors whose real-world performance falls below the bar that got them listed in the first place. The medical-device world calls this post-market surveillance; AI scribes need an equivalent.

7. Centralized expert capacity above the regional level. Regional health authorities are not going to develop in-house AI safety expertise on the relevant timescale. Provincial- or federal-level shared expert capacity (something like a Canadian Healthcare AI Safety Institute, or a strengthened Health Canada AI/ML unit) could supply technical evaluation as a service to procuring entities. The current arrangement, where each regional authority improvises, doesn't scale.

8. A risk classification framework that matches function, not marketing. The single most leveraged change would be reclassifying AI systems by what they functionally do, not by how they're sold. A draft taxonomy:

Class	Criteria	Treatment
Low-risk	No patient data, no clinical influence	Standard procurement, basic security
Medium-risk	Patient data, no clinical decision influence	Privacy assessment, security audit, limited deployment
High-risk	Generates clinical documentation or influences treatment	Medical device pathway, minimum thresholds, live demos, bias testing, post-deployment surveillance

AI scribes belong in the third row. They are not, functionally, transcription software. They are systems that produce permanent clinical records and shape downstream clinical reasoning.
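The draft taxonomy is simple enough to write down as a decision rule. The sketch below encodes the three rows of the table as a function; the function name, parameter names, and the idea of reducing each system to three yes/no questions are my own simplifications for illustration, not part of any existing regulatory scheme.

```python
from enum import Enum


class RiskClass(Enum):
    LOW = "Low-risk"
    MEDIUM = "Medium-risk"
    HIGH = "High-risk"


# Treatment column from the draft taxonomy above.
TREATMENT = {
    RiskClass.LOW: "Standard procurement, basic security",
    RiskClass.MEDIUM: "Privacy assessment, security audit, limited deployment",
    RiskClass.HIGH: (
        "Medical device pathway, minimum thresholds, live demos, "
        "bias testing, post-deployment surveillance"
    ),
}


def classify(handles_patient_data: bool,
             generates_clinical_documentation: bool,
             influences_treatment: bool) -> RiskClass:
    """Classify by what the system functionally does, not how it is sold."""
    # The high-risk test comes first: function overrides every other attribute.
    if generates_clinical_documentation or influences_treatment:
        return RiskClass.HIGH
    if handles_patient_data:
        return RiskClass.MEDIUM
    return RiskClass.LOW


# An ambient scribe: handles patient data and writes the clinical note.
scribe_class = classify(
    handles_patient_data=True,
    generates_clinical_documentation=True,
    influences_treatment=False,
)
print(scribe_class.value)       # High-risk
print(TREATMENT[scribe_class])
```

The ordering of the checks is the design choice that matters: a system that generates clinical documentation lands in the high-risk class regardless of how the other questions are answered or what the product is called.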

The uncomfortable implication

Nothing in this discussion requires frontier AI. AI scribes are narrow, relatively well-understood systems. The failure mode isn't that they're too advanced to control; it's that the institutions deploying them are not applying control frameworks that already exist for comparably risky systems.

That has an obvious extrapolation. As AI capability grows, the systems being deployed in healthcare will be more capable, more agentic, and more deeply integrated into clinical decision-making. If a procurement system can't reliably gate something as bounded as an ambient scribe in 2026, it is not going to spontaneously develop the capacity to gate diagnostic agents, treatment-planning agents, or embodied clinical AI by 2028.

The Ontario report is, in this sense, a useful early signal. It tells you what the institutional response to AI looks like before the hard problems arrive. Right now, capability is moving faster than governance. There is no specific reason to expect that gap to close on its own.

The interventions to close it aren't speculative. Most of them already exist, in medical-device regulation, in aviation, in nuclear, in pharmaceutical surveillance. The question is whether healthcare-AI procurement is going to inherit those frameworks before the next round of capability arrives, or after.

Epistemic status

The empirical findings (vendor counts, percentages, weightings, direct quotes) are from the Office of the Auditor General of Ontario's Special Report: Use of Artificial Intelligence in the Ontario Government, released May 12, 2026, with cross-checks against subsequent reporting by Canadian Healthcare Technology, CBC, The Canadian Press, and The Register.
The UK material (AVT registry, DCB0129, MHRA Class I registration, DTAC compliance) is from NHS England's own published guidance and registry documentation, and from the Find a Tender preliminary market notice for the AVT registry.
The Canadian regulatory description (AIDA, PIPEDA, Bill 194, Law 25, Health Canada Pre-Market Guidance Feb 2025, Canada Health Infoway AI Scribe Program) is from official Canadian government and OPC documentation.
Singapore's Model AI Governance Framework for Agentic AI (v1.0, January 2026) is from IMDA's published framework document.
The governance-lag interpretation, that the primary failure is risk misclassification rather than model capability, is a structural reading of the evidence rather than a claim I can prove from any single data point. It is, however, the simplest hypothesis consistent with the documented findings, and the interventions that follow from it map cleanly onto existing high-assurance regulatory frameworks. If I'm wrong about it, the most likely place I'm wrong is in underestimating how much of the failure is also about underlying model behavior that no procurement gate would catch. I'd take that critique seriously.

Sources

Office of the Auditor General of Ontario, Special Report: Use of Artificial Intelligence in the Ontario Government, May 12, 2026. https://www.auditor.on.ca/en/content/specialreports/specialreports/en26/2026_AI_EN.pdf
NHS England, Guidance on the use of AI-enabled ambient scribing products in health and care settings. https://www.england.nhs.uk/long-read/guidance-on-the-use-of-ai-enabled-ambient-scribing-products-in-health-and-care-settings/
NHS England, Ambient Voice Technology Self-Certified Supplier Registry. https://digital.nhs.uk/services/ambient-scribing/ambient-voice-technology-self-certified-supplier-registry
IMDA (Singapore), Model AI Governance Framework for Agentic AI, v1.0, January 22, 2026. https://www.imda.gov.sg/-/media/imda/files/about/emerging-tech-and-research/artificial-intelligence/mgffor-agentic-ai.pdf
Health Canada, Pre-market Guidance for Machine Learning-Enabled Medical Devices, February 2025.
Office of the Privacy Commissioner of Canada, Principles for responsible, trustworthy and privacy-protective generative AI technologies.
Government of Canada / ISED, The Artificial Intelligence and Data Act (AIDA), status as of January 6, 2025 (Parliament prorogued).
Canada Health Infoway, AI Scribe Program announcement, June 2025.