Clinical AI Is a Governance Problem, Not a Chatbot Problem

Three weeks ago, I sat in a conference room while a vendor demoed their new AI clinical decision support tool. The interface was slick. The responses were fast. A colleague leaned over and whispered, "This is basically ChatGPT for doctors, right?" The sales rep nodded along with a grin. And I thought: we are going to get someone hurt if this is how we evaluate these tools.

Clinical AI should be evaluated as a governance and workflow integration problem, not as a chatbot novelty. The core questions are not about how impressive the outputs look, but about who is accountable when the system is wrong, how it fits existing clinical workflows, and whether the organization can monitor performance over time.

I Used to Think the Technology Was the Hard Part

For years, I believed that getting the model accurate enough was the bottleneck. Train it on the right data, validate the outputs, and adoption would follow. I was genuinely impressed by what large language models could do in isolation: summarize notes, flag abnormalities, draft referral letters.

Then I watched three separate AI pilots fail. Not because the models were bad. Because nobody had answered the basic questions. Who reviews the output before it reaches the patient? What happens when the AI contradicts the attending? Who is liable?

These are not technology questions. They are governance questions.

A 2024 overview in ACM Computing Surveys on trustworthy ML in production found that fewer than 30% of deployed ML systems had formal monitoring pipelines for model drift or output validation. In healthcare, that should alarm everyone.

What I Would Not Do

I would not deploy any clinical AI tool that lacks a defined human-in-the-loop escalation pathway. Full stop.

I do not care how good the accuracy numbers look on a slide deck. A 2024 review in Pharmaceutics on ML applications in healthcare documented that model performance often degrades by 15-25% when moving from curated datasets to heterogeneous clinical populations. That is not a rounding error. That is a patient safety problem.
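To make that degradation figure concrete, here is a minimal sketch of the kind of check a monitoring plan might run. It assumes the 15-25% range is read as a relative drop from baseline accuracy, which the review as cited here does not specify. The names `DriftCheck`, `check_degradation`, and `max_relative_drop` are hypothetical, not taken from any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftCheck:
    baseline_accuracy: float  # accuracy on the curated validation set
    observed_accuracy: float  # accuracy measured on live clinical cases
    relative_drop: float      # fraction of baseline accuracy that was lost
    needs_escalation: bool    # True when the drop reaches the threshold


def check_degradation(baseline_accuracy: float,
                      observed_accuracy: float,
                      max_relative_drop: float = 0.15) -> DriftCheck:
    """Compare live accuracy to the validation baseline.

    The 0.15 default is an assumption that mirrors the low end of the
    15-25% range quoted above; a real threshold is a governance decision.
    """
    if not 0.0 < baseline_accuracy <= 1.0:
        raise ValueError("baseline_accuracy must be in (0, 1]")
    if not 0.0 <= observed_accuracy <= 1.0:
        raise ValueError("observed_accuracy must be in [0, 1]")

    relative_drop = (baseline_accuracy - observed_accuracy) / baseline_accuracy
    return DriftCheck(
        baseline_accuracy=baseline_accuracy,
        observed_accuracy=observed_accuracy,
        relative_drop=relative_drop,
        needs_escalation=relative_drop >= max_relative_drop,
    )


if __name__ == "__main__":
    # Hypothetical numbers: 90% accuracy in validation, 72% in the clinic.
    result = check_degradation(0.90, 0.72)
    print(f"relative drop: {result.relative_drop:.0%}, "
          f"escalate: {result.needs_escalation}")
```

The arithmetic is trivial. The governance work is deciding who receives the escalation and what they are authorized to do about it.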

I also refuse to evaluate AI tools primarily on their chat interface. The conversational wrapper is the least important part. What matters is the data pipeline, the audit trail, and the governance structure underneath. When a colleague asked me to "just try out" a clinical summarization tool, I said, "That tells me what it can do in ideal conditions. It tells me nothing about what it will do at 2 AM when a tired resident is using it on incomplete records."

Five Questions Before Any AI Evaluation

Most conversations about clinical AI start with the model and work outward. The responsible approach starts with the workflow and works inward.

First: who owns the output? If the AI recommends a medication adjustment and the outcome is bad, most contracts I have reviewed leave liability dangerously ambiguous.

Second: does the tool fit existing clinical workflows? Research in AI & Society (2024) on AI in situated action reviewed 48 ethnomethodological studies and found the most common failure mode was not inaccuracy but interactional misalignment, where outputs did not fit how clinicians actually make sequential decisions.

Third: what is the monitoring plan post-deployment?

Fourth: what transparency exists? A 2025 study in the Asia-Pacific Journal of Business Administration surveying 387 healthcare professionals found that perceived AI transparency had a statistically significant positive effect on trust.

Fifth: does the organization have the regulatory infrastructure, from HIPAA to FDA software-as-medical-device guidance, to support this?
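One way to keep these five questions from collapsing back into a feature discussion is to record the answers as an explicit gate. The sketch below is illustrative only. The class `GovernanceReview`, its field names, and the two helper functions are hypothetical, and a real review would capture evidence and named owners rather than booleans.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GovernanceReview:
    """One flag per question; True means the organization has an answer."""
    output_owner_defined: bool             # who owns the output?
    fits_existing_workflow: bool           # does it fit clinical workflows?
    monitoring_plan_defined: bool          # is there a post-deployment plan?
    transparency_mechanisms_exist: bool    # what transparency exists?
    regulatory_infrastructure_ready: bool  # HIPAA, FDA SaMD guidance, etc.


def unanswered_questions(review: GovernanceReview) -> list[str]:
    """Return the names of the questions that still lack an answer."""
    return [f.name for f in fields(review) if not getattr(review, f.name)]


def ready_for_evaluation(review: GovernanceReview) -> bool:
    """A tool moves to evaluation only when every question is answered."""
    return not unanswered_questions(review)


if __name__ == "__main__":
    # Hypothetical review: everything in place except a monitoring plan.
    review = GovernanceReview(
        output_owner_defined=True,
        fits_existing_workflow=True,
        monitoring_plan_defined=False,
        transparency_mechanisms_exist=True,
        regulatory_infrastructure_ready=True,
    )
    print(ready_for_evaluation(review), unanswered_questions(review))
```

The gate is deliberately all-or-nothing: a single unanswered question is enough to stop the evaluation before the demo starts.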

A chief nursing officer once told me, "Sina, we keep buying technology and then asking frontline staff to reorganize their entire day around it. What if we started by mapping the workflow first?" She was right. That reframing changed how I approach every evaluation now.

Back to the Conference Room

After that demo, I pulled the team aside and set aside the conversation about features entirely. I asked everyone to spend two weeks documenting the five workflows where they felt most burdened. Not where AI "could" help in theory. Where they actually needed help, right now.

One person said, "Finally, someone is asking the right question."

That is what responsible AI in healthcare looks like. Not finding the most impressive model. Building the governance scaffolding that makes any model safe and accountable. It is unglamorous work. It will never make a good demo. But it is the only approach I am willing to put my name behind as a Stanford-trained surgeon working in the AI space. The cost of getting this wrong is not a failed product launch. It is a patient.

Frequently Asked Questions

What does Dr. Sina Bari mean by treating clinical AI as a governance problem?

Dr. Bari argues that the critical challenges of clinical AI are not about model accuracy but about organizational accountability, workflow integration, post-deployment monitoring, and regulatory compliance. A governance-first approach means mapping existing workflows and defining liability structures before evaluating any specific AI tool.

Why do clinical AI pilots fail even when the model performs well?

The most common failure mode is interactional misalignment rather than technical inaccuracy. AI outputs may be correct but presented at the wrong point in the clinical workflow, or formatted in ways that do not match how clinicians make sequential decisions. Research in AI & Society (2024) identified this mismatch as the most common failure mode across the 48 ethnomethodological studies it reviewed.

How should a hospital evaluate an AI vendor's clinical decision support tool?

Start by ignoring the demo. Ask five governance questions: who owns liability for clinical outputs, how the tool integrates into workflows, what the monitoring plan is for model drift, what transparency mechanisms exist, and whether the organization has the regulatory infrastructure to support deployment. If the vendor cannot answer these, the product is not ready.

What is Dr. Bari's approach to AI transparency in clinical settings?

Transparency is a prerequisite for safe adoption, not a feature. Without it, clinicians either ignore AI recommendations entirely or follow them without critical evaluation. Both create patient safety risks. Clinicians need to understand why a recommendation was made and when to override it.

Does clinical AI accuracy degrade after real-world deployment?

Yes. A 2024 review in Pharmaceutics documented that model performance often degrades by 15-25% when models move from curated datasets to heterogeneous clinical populations. Performance can keep declining as patient populations shift and systems are updated. A 2024 overview in ACM Computing Surveys found that fewer than 30% of deployed ML systems had formal monitoring pipelines for model drift or output validation, which makes ongoing governance essential.