Publication

How High-Performing Doctors Are Building a Personal AI Stack

Author

Dr. Sina Bari, MD

Plastic & Reconstructive Surgeon | Medical Executive | Stanford Medicine

Published

April 25, 2025

Last Tuesday morning, before my first meeting at 8:30, I sat down with a cup of coffee and asked Claude to pull every RCT published in the last three years on AI-assisted surgical planning for mandibular reconstruction. Within four minutes I had a structured summary of nine papers with sample sizes, primary endpoints, and effect sizes. I cross-checked three of them against PubMed. All real. All accurate. That single query replaced what used to be a 90-minute literature search that I would have put off until the weekend and probably never finished.

That is what a personal AI stack does when it works. It compresses the non-clinical labor that quietly eats your week.

The physicians getting the most from AI are not the ones using the most tools. They are the ones who have built a clinical taste hierarchy -- a risk-tiered framework for deciding which decisions to delegate, which to augment, and which to keep entirely human. The stack should be organized by error tolerance, not by function.

But most physicians are building their stacks backwards. They start with whatever tool is available -- an ambient scribe here, a literature search there -- and accumulate without strategy. That is bottom-up adoption, driven by convenience. The physicians I have watched succeed are doing something different. They start with a question that has nothing to do with technology: which of my decisions am I willing to delegate, and at what error rate?

A recent editorial in Nature Medicine asks for evidence of the value of medical AI. The authors frame AI adoption through the lens of clinical outcomes, which has its place. But evaluating AI as a collection of discrete tools -- the way we evaluate a new suture material or imaging modality -- misses what makes it transformative. The shift is not adopting a tool. It is rebuilding how you allocate your cognitive labor across an entire workday.

The Clinical Taste Hierarchy: Organizing Your Stack by Risk, Not Function

According to the Doximity 2026 State of AI in Medicine Report, 63% of U.S. physicians now actively use AI, up from 47% just a year earlier. Among those users, 74% report daily use. Adoption has moved past the early-adopter phase into something closer to infrastructure.

But adoption is not integration. And integration without a risk framework is just organized chaos.

I think about my AI stack in three tiers, organized not by what the tools do but by how much damage a wrong output can cause.

Tier 1: Low error tolerance. Clinical decisions, diagnostic reasoning, anything that touches a patient directly. AI operates here only as a second reader. I never delegate. I verify everything. A wrong differential is a patient safety event. There is no acceptable error rate above zero for unreviewed AI output in this tier.

Tier 2: Moderate error tolerance. Medical writing, literature synthesis, patient education materials. AI generates structure and first drafts. I supply the clinical substance and make every final call. A hallucinated citation in a conference presentation is embarrassing. A hallucinated citation in a treatment plan is dangerous. The difference defines the tier boundary.

Tier 3: High error tolerance. Scheduling, formatting, billing code lookups, template generation. If the AI gets it wrong, a human catches it during routine workflow and the consequence is a minor delay, not a patient harm event. This is where full delegation makes sense.

Most physicians I talk to are running Tier 3 tools and treating them like Tier 1 achievements. The real leverage comes from building the Tier 2 layer deliberately -- which requires understanding where your own clinical judgment is non-negotiable and where AI output is genuinely good enough to trust after a quick review.
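For readers who want to write the three tiers down explicitly, the following is a minimal Python sketch of the framework as a lookup table. The names `Tier`, `POLICY`, `TASK_TIERS`, and `policy_for` are hypothetical, the task assignments are taken from the examples in the tier descriptions above, and defaulting unknown tasks to the strictest tier is an assumption rather than something the framework states.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Risk tiers from the text; a lower number means less tolerance for error."""
    CLINICAL = 1        # AI as second reader only
    DRAFTING = 2        # AI drafts, physician reviews every output
    ADMINISTRATIVE = 3  # full delegation


# Handling policy per tier, paraphrasing the three tier descriptions.
POLICY = {
    Tier.CLINICAL: "second reader only; never delegate; verify everything",
    Tier.DRAFTING: "AI drafts structure; physician supplies substance and makes the final call",
    Tier.ADMINISTRATIVE: "delegate fully; errors are caught in routine workflow",
}

# Hypothetical task-to-tier assignments based on the examples in the text.
TASK_TIERS = {
    "differential diagnosis": Tier.CLINICAL,
    "literature synthesis": Tier.DRAFTING,
    "patient education material": Tier.DRAFTING,
    "scheduling": Tier.ADMINISTRATIVE,
    "billing code lookup": Tier.ADMINISTRATIVE,
}


def policy_for(task: str) -> str:
    """Return the policy for a task.

    Assumption: a task that has not been classified yet is treated as Tier 1.
    """
    tier = TASK_TIERS.get(task, Tier.CLINICAL)
    return f"Tier {tier.value}: {POLICY[tier]}"


print(policy_for("scheduling"))
print(policy_for("operative note review"))  # unclassified, so handled as Tier 1
```

The point of the table is that the tier, not the tool category, determines how much review an output gets.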

Why the Documentation Burden Makes This Urgent

The documentation burden is not an inconvenience. It is a structural failure. Shanafelt et al. in Mayo Clinic Proceedings (2019) found that physicians spend an average of two hours on administrative tasks for every one hour of direct patient care, with 65% reporting that documentation burden was a primary driver of professional dissatisfaction. Sinsky et al. in Annals of Internal Medicine (2016) quantified the EHR tax more precisely: for every hour of direct clinical face time, physicians spent nearly two additional hours on EHR and desk work, plus another one to two hours at night. The healthcare system has seen roughly 3,000% growth in administrative positions over recent decades versus approximately 150% in physician supply.

That imbalance between administrative and clinical time is the reason AI adoption in medicine is not optional. It is also the reason the tier framework matters. You cannot fix a 2:1 admin-to-clinical ratio with a single tool. You need a system -- and that system needs to reflect what you are and are not willing to hand off.

Tier 2 in Practice: Research and Writing

Literature search is the most common AI use case among physicians -- 35% of all AI-assisted tasks according to Doximity. I use large language models with retrieval-augmented generation for rapid evidence synthesis, querying with precise clinical parameters rather than broad questions. Narrow queries anchored to PubMed-indexed literature produce verifiable results. Broad queries produce confident-sounding guesses.

The verification step is non-negotiable. Every claim gets traced back to the original abstract. If a single citation does not resolve to a real paper, I discard the entire output. Hallucination is not a partial phenomenon: one fabricated reference means nothing else in that summary can be trusted without rechecking it.

I learned this the hard way in my first month. I used an AI-generated literature summary for a presentation on microsurgical outcomes, and two of the five citations were fabricated. The paper titles sounded plausible. The authors were real researchers in the field. But the papers did not exist. I stood in front of colleagues citing phantom research. That experience permanently changed how I interact with AI outputs. Trust nothing. Verify everything. The tool is useful precisely because you do not trust it.
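The all-or-nothing verification rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular tool: `accept_summary` is a hypothetical name, and `resolves` stands in for whatever check the reader actually performs, such as a manual PubMed lookup recorded as true or false. Rejecting a summary that has no citations at all is also an assumption.

```python
from typing import Callable, Iterable


def accept_summary(citations: Iterable[str], resolves: Callable[[str], bool]) -> bool:
    """Accept an AI literature summary only if every citation resolves.

    `resolves` is a caller-supplied check (hypothetical here). One failure
    rejects the whole summary rather than just the bad citation.
    """
    checked = list(citations)
    if not checked:
        # Assumption: a summary with nothing traceable to a source is rejected.
        return False
    return all(resolves(c) for c in checked)


# Made-up placeholder titles with the result of a manual lookup for each.
lookup_results = {
    "Placeholder paper A": True,
    "Placeholder paper B": True,
    "Placeholder paper C": False,  # could not be found, so the summary fails
}

print(accept_summary(lookup_results, lambda title: lookup_results[title]))
```

The design choice worth noting is that the function returns a single verdict for the whole summary instead of filtering out the bad citations and keeping the rest.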

For medical writing, my approach is specific: AI handles structural scaffolding and formatting. I supply the clinical substance. AI writes the first 60%. I write the last 40%. That last 40% is where clinical nuance lives -- the part where you decide whether a finding is clinically significant or just statistically interesting, where you choose which caveats to emphasize, where you calibrate confidence for the audience. No model does that well. That is the clinical taste layer.

Ambient scribes now represent 29% of physician AI use cases. In my experience, quality varies enormously by specialty and encounter complexity. A straightforward follow-up transcribes well. A complex surgical consultation with multiple decision points still demands significant editing. The error tolerance framework helps here: a follow-up note is Tier 3 work. A complex surgical consultation note is Tier 2 at minimum.

What I Tried That Did Not Work

I should be honest about the failures, because they taught me more about the tier framework than the successes did.

I spent three weeks trying to use a general-purpose LLM as a prior authorization letter writer. The output was grammatically perfect and clinically useless. It generated letters that sounded like a medical student summarizing a textbook chapter rather than a surgeon arguing for a specific procedure on a specific patient. The insurance reviewers could tell. Denial rates did not change. I went back to writing them myself, using AI only to format and check coding accuracy.

That failure was a tier miscategorization. I treated prior auth letters as Tier 3 -- routine admin. They are actually Tier 2 at minimum. A prior auth letter is a persuasive clinical argument written for a specific reviewer about a specific patient. It requires clinical taste. The model did not have it.

I also tried automating my own clinical note review with a custom prompt chain. The idea was that AI would flag inconsistencies between my operative notes and my post-op orders. In theory, elegant. In practice, the model flagged everything because it lacked the clinical context to distinguish between a genuine discrepancy and a deliberate clinical decision. I was spending more time dismissing false alerts than I would have spent reviewing the notes manually. Sometimes the right answer is less AI, not more.

A colleague of mine -- a dermatologist in private practice -- had the opposite experience with scheduling. She tried four different AI scheduling tools before finding one that actually reduced her front desk staff's phone time. "The first three just moved the problem," she said. "Instead of patients calling us, they were calling us to complain about the AI." The tool that worked was the one that handled the 80% of straightforward bookings silently and routed the 20% of complex scheduling to a human immediately. No hold music, no chatbot loop. Pure Tier 3 delegation. It worked because scheduling is genuinely low-stakes and pattern-driven. The mistake is assuming everything in your practice has that same error tolerance profile.

Building the Stack: Start with Your Decision Map, Not the App Store

Do not start with tools. Start with a decision audit. Spend one week tracking how you spend your non-clinical time. Categorize each task by tier. Most physicians discover that 60-70% of their administrative labor is Tier 3 -- genuinely delegatable -- and they have been treating it as if it requires Tier 1 attention. That mismatch is where the time is hiding.
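The one-week decision audit reduces to simple arithmetic: log each task with its tier and the minutes spent, then compute each tier's share of the total. The Python sketch below shows that calculation. The function name `tier_shares` is hypothetical and the log entries are made-up illustrative values, not real audit data.

```python
from collections import defaultdict


def tier_shares(log):
    """Return each tier's share of total logged minutes as a fraction.

    `log` is a list of (task, tier, minutes) tuples recorded during the audit week.
    """
    minutes_by_tier = defaultdict(int)
    for _task, tier, minutes in log:
        minutes_by_tier[tier] += minutes
    total = sum(minutes_by_tier.values())
    if total == 0:
        return {}
    return {tier: minutes / total for tier, minutes in sorted(minutes_by_tier.items())}


# Made-up example entries for illustration only.
example_log = [
    ("scheduling and inbox triage", 3, 120),
    ("billing code lookups", 3, 60),
    ("conference abstract draft", 2, 90),
    ("treatment plan documentation review", 1, 30),
]

shares = tier_shares(example_log)
for tier, share in shares.items():
    print(f"Tier {tier}: {share:.0%} of logged time")
```

With these example numbers, Tier 3 accounts for 180 of 300 logged minutes, which is the kind of share the audit is meant to surface.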

Limit your stack to five or fewer tools. Every addition introduces a new interface, login, and data policy to evaluate. Audit quarterly, because the model behind your favorite tool in January may be deprecated by June.

I evaluate tools on three criteria. Does it integrate with existing systems without creating a parallel workflow? Does it maintain HIPAA compliance with auditable data handling? Does it actually reduce time spent -- measured in minutes per week, not in marketing language? Most tools fail that third test. They shift work rather than eliminate it.
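The three criteria above can be recorded as a simple pass/fail checklist. The following Python sketch is one hypothetical way to do that. The class name `ToolTrial`, its field names, and the example figures are all assumptions made for illustration, and the tool itself is invented.

```python
from dataclasses import dataclass


@dataclass
class ToolTrial:
    """One tool's trial results against the three criteria in the text."""
    name: str
    integrates_without_parallel_workflow: bool
    hipaa_auditable: bool
    minutes_before_per_week: float  # time the task took before the tool
    minutes_after_per_week: float   # time the task takes with the tool

    def minutes_saved(self) -> float:
        """Net weekly minutes saved; zero or negative means work was only shifted."""
        return self.minutes_before_per_week - self.minutes_after_per_week

    def passes(self) -> bool:
        """A tool passes only if it meets all three criteria."""
        return (
            self.integrates_without_parallel_workflow
            and self.hipaa_auditable
            and self.minutes_saved() > 0
        )


# Invented example: integrates and is compliant, but saves no time.
trial = ToolTrial(
    name="Example scribe tool",
    integrates_without_parallel_workflow=True,
    hipaa_auditable=True,
    minutes_before_per_week=200,
    minutes_after_per_week=210,
)
print(trial.name, "passes" if trial.passes() else "fails", trial.minutes_saved())
```

Measuring the third criterion as a before-and-after difference in minutes makes it visible when a tool shifts work rather than eliminating it.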

As a Stanford-trained surgeon and physician-founder, I have learned that the physicians who benefit most from AI are not those using the most tools. They are the ones who have clearly defined what they will and will not delegate to a model. Clinical taste -- the ability to distinguish adequate AI output from excellent clinical communication -- is the differentiating skill. It is also the skill that no AI tool will develop for you.

The 2025 Offcall Physicians AI Report found that 84% of physicians say AI improves their performance, but over 80% are dissatisfied with organizational deployment. That gap is exactly where the personal stack becomes essential: a way to reclaim agency over your own workflow while the institutions figure it out. But reclaiming agency requires a framework, not just a collection of subscriptions.

Frequently Asked Questions

What is a clinical taste hierarchy and why does it matter for physician AI adoption?

A clinical taste hierarchy is a risk-tiered framework for deciding which tasks to delegate to AI and at what error tolerance. Instead of organizing your AI stack by function (research tools, writing tools, admin tools), you organize by consequence of failure. Tasks where a wrong output could harm a patient stay fully human-supervised. Tasks where errors cause minor delays get delegated freely. The framework prevents the most common adoption mistake: treating all AI output with the same level of scrutiny, which either wastes time or misses dangers.

How do physicians use AI for medical writing without compromising accuracy?

Effective physician AI writing follows a split workflow: AI generates structure and first-draft prose while the physician supplies clinical substance and performs final verification. Every clinical claim, dosage, and recommendation is checked against primary sources before use. The discipline is treating all AI output as a draft, never a finished product. In the tier framework, most medical writing sits at Tier 2 -- moderate error tolerance -- meaning the physician must review every output but does not need to generate from scratch.

What is Dr. Bari's approach to evaluating AI tools for clinical use?

I apply three criteria: seamless EHR integration without parallel workflows, verifiable HIPAA compliance with auditable data handling, and measurable time savings confirmed by staff feedback after a two-week trial. But before evaluating any tool, I ask which tier it serves. A Tier 3 tool that shifts work instead of eliminating it is not worth the login. A Tier 2 tool that lacks a clear verification step is not safe. Tools that claim to operate at Tier 1 autonomy -- replacing clinical judgment -- do not survive evaluation regardless of marketing claims.

Why are small practices adopting AI faster than large hospital systems?

Independent physicians retain decision-making authority over their tooling and can trial, evaluate, and discard tools in weeks. In large systems, 47% of physicians report institutional AI policies are confusing or still evolving, creating months of committee-driven friction. Small practices also have a natural advantage in the tier framework: the physician who uses the tool is the same person who evaluates it, which means error tolerance decisions happen in real time rather than by committee memo.