Beyond the Benchmarks: How to Validate & Monitor AI Tools Before Deployment?
Healthcare AI mistakes aren't just expensive—they're potentially dangerous. While the technology holds huge promise, rushed deployments without proper guardrails often create more clinical burden than relief.
Our hosts discuss three major approaches to healthcare AI safety from tech giants Google, OpenAI, and Microsoft. The conversation reveals how physician-centered oversight and multi-agent systems can prevent AI hallucinations while maintaining clinical workflow efficiency.
This episode covers real-world deployment challenges, data drift monitoring, and why successful integration requires engagement from all stakeholders—patients, clinical staff, and leadership—throughout the evaluation process.
Healthcare organizations often find themselves caught between vendor promises and clinical reality. This deep dive provides practical frameworks for evaluation, implementation, and oversight—helping you make informed decisions rather than costly mistakes.
"Understanding that AI in healthcare is extremely high-stakes… You have to test all the time, and then validate. Think about guardrails on every single step, and then implement in a way that is not disruptive to your system, but rather provides true, clear value."
- Junaid Kalia, MD
What You'll Discover:
[1:08] AI Guardrails
[2:57] Understanding Why Healthcare AI Is High-Stakes
[6:39] Google: Asynchronous oversight & Multi-agent system
[11:07] OpenAI: AI-based clinical decision support
[12:42] Microsoft: Sequential diagnosis orchestration
[13:47] Critical Implementation Considerations
[16:20] The Essential "Village Approach" to AI Deployment
Resources
🖇️ Google Research: "Towards Physician-Centered Oversight of Conversational Diagnostic AI" - g-AMIE (guardrailed Articulate Medical Intelligence Explorer)
🖇️ OpenAI Research: "AI-Based Clinical Decision Support" - Primary care implementation study with 22,000 patient visits
🖇️ Microsoft Research: "Sequential Diagnosis with Large Language Models" - Multi-agent orchestration framework for medical diagnosis
Transcript
Junaid:
How do you actually implement, or consider implementing, AI in real-life situations? We're going to start with this Google paper, and then we're going to introduce three recent articles and discuss them one by one. The first is physician-centered oversight, and that is the core issue: how do we use AI in a way that actually produces value, does not create other problems like hallucinations, and can be shown to provide proper value? Within this paper, the PCP provides oversight and retains accountability for the clinical decision, and then g-AMIE is compared to NPs, PAs, and a group of PCPs working under the same guardrails. That's the real benefit of this particular guardrail system and how it is actually implemented in real life. Before we start, let's talk about guardrails in AI. There are different ways to design the user experience, but I want Harvey to introduce the subject so that everyone understands what guardrails mean.
Harvey:
When it comes to guardrails, what exactly does that mean? Think of it this way. You're having a conversation with someone. Today, you're speaking to a high school student; maybe tomorrow you're speaking to someone that's a little further along. The AI has different mindsets, different IQ levels. Now, what does it mean to have a guardrail? Well, if you were speaking to, let's say, Albert Einstein, and you wanted to build a bomb, you could literally ask him, and he would tell you, these are the things that you need, and this is how it's done. But that's dangerous, right? We don't want people to have that kind of access. So AIs have been built in such a way that if you ask certain things, the system guards against that, and the builders can set parameters. In fact, when ChatGPT first came out, they were really good about it: anytime you asked for medical advice or anything legal, it sounded like a lawyer, "this is not medical advice," blah, blah, disclaimers. So there were different guardrails set up for that. Now, as we progress with the large language models, we're starting to see those guardrails shift and change. Why? Because they know the user behind it more than likely is going to be a doctor, and so that automatically starts taking off those guardrails. As an aside, some of these open-source models have been manipulated to the point where people take the guardrails off entirely. And sorry to go off track a little bit, but when you look at Grok, one of the things about it is they have also taken some of the guardrails off to a certain degree. That's why you're able to ask Grok some questions that you cannot ask ChatGPT.
Junaid:
Absolutely right. So one of the things about building large language models or AI in healthcare specifically: the first question is, definitionally, what are we building? For example, AI can write a blog post or produce an image that you're going to share on Facebook, LinkedIn, et cetera. It's fantastic, it's amazing, it saves a lot of time. But that is what we call low-stakes AI in terms of implementation. Then we have what we call medium-stakes AI. Medium-stakes AI is when you are actually using it in your enterprise, your business solutions, et cetera. For example, AI is helping you with lead generation or lead qualification; a bunch of leads come in, and it tells you this lead is very important, et cetera. That is what we call low- to mid-stakes AI. Then we go one step further,
in which, for example, you're using AI for your financial analysis and decisions. For example, a bunch of data comes in, you use machine learning, not truly AI, to analyze that data and produce insights. That's, you know, mid- to high-stakes. And then we have the true high-stakes AI. True high-stakes AI essentially covers two domains. One of them is medical, which is where we come in, because lives matter, and if AI makes a mistake, you can actually, potentially, kill a person.
And then, of course, the second biggest one is what Harvey appropriately suggested with bombs: essentially, the military. So within the military complex, when you are establishing AI, that is what we consider high-stakes, military and health. Now we come to high-stakes AI. When you have high-stakes AI, what you want to build, from any framework perspective, is two things. One is that the AI has internal guardrails. The definition of a guardrail is that when you build these systems, from a technical perspective, an output is generated, and that output is then checked against the guardrails. Very simply, take a chatbot that I have created with my own back end; the front end is just a model interface, and you could use it to write a blog too. So the first step is input filtering: making sure people aren't asking inappropriate questions or abusing the system with the kind of junk that goes around the internet; I don't even want to use those words here. What I'm saying is that we don't let those kinds of things get into the system, and we don't let somebody abuse the system for that. And then the last one, the most important one, is that there is consistent, recursive checking of the output that is generated by the primary model. For example, we use the word verify: we have another LLM that is constantly verifying the output from the first LLM and constantly checking it. So those are the guardrails, and they are so extremely, amazingly important for healthcare AI and military AI, because those are very high-stakes AI solutions. The second question over here is for your input, Harvey: how do you implement it in real life?
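To make that verifier idea concrete, here is a minimal Python sketch of the pattern, assuming a hypothetical `call_llm` helper in place of whatever model API is actually used; the prompts, retry count, and fallback message are illustrative, not the approach described in any of the papers discussed here.

```python
# Minimal sketch of the "recursive verification" guardrail pattern described above.
# `call_llm` is a hypothetical stand-in for a real LLM API call; the prompts and
# retry policy are illustrative only.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a real LLM call (OpenAI, Vertex AI, a local model, etc.)."""
    raise NotImplementedError

def generate_with_guardrail(question: str, max_retries: int = 2) -> str:
    draft = call_llm("You are a clinical assistant. Do not give individualized treatment advice.", question)
    for _ in range(max_retries):
        # A second, independent LLM pass checks the first model's output.
        verdict = call_llm(
            "You are a safety verifier. Answer PASS or FAIL. "
            "FAIL if the text contains individualized medical advice, "
            "unsupported claims, or unsafe content.",
            draft,
        )
        if verdict.strip().upper().startswith("PASS"):
            return draft
        # Regenerate, telling the primary model why the draft was rejected.
        draft = call_llm(
            "Rewrite your previous answer so it contains no individualized "
            "medical advice and no unsupported claims.",
            f"Question: {question}\nPrevious answer: {draft}",
        )
    return "This request needs clinician review."  # fail closed, not open
```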
Junaid:
One thing is to have guardrails. Excellent. But what we need to do within this podcast, of course, is bring these important insights so that people understand: if you're choosing a vendor, implementing it yourself, or experimenting, you need to know that one thing is to build it, and the other thing is to implement it. So let's go into the AI oversight paradigm, the clinical and, this is the word, asynchronous oversight, and then the multi-agent system. Do you want to go into those details?
Harvey:
Asynchronous basically means it's not in sync. Say you're my patient and you're asking about a medication, and I need to help you, but it's the middle of the night and you send me an email. My AI would look at what you're asking, create a response, and then, and this is why it's asynchronous, this morning I would wake up, look at the response, say, yep, thumbs up, and then it would go forward. Now, what's really interesting is how they created the next part, the multi-agent system. Think of it this way: an agent is just like a helpful mind. We have one agent that is going to look at that question and create a response, and inadvertently it may include a treatment plan. But then the second agent, this is where the guardrails come in, and I think this was genius, goes through that first response, reads it, and says, whoa, whoa, this part is giving medication advice, it has started talking about treatment plans, and we don't send treatment plans here. So it literally takes that part out and gives me the response that I need. Really good, because now, as a provider, as a physician, I don't have to go through wondering whether I can really say this or not; it's pretty much handled. And the other part of it is that I can give a treatment plan, but if my nurse or someone else doesn't have that authority, the guardrail is there, and then the nurse can push it to the patient instead of me pushing it.
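Here is a rough Python sketch of the asynchronous oversight workflow Harvey describes: a drafting agent answers the overnight patient message, a guardrail agent redacts anything that reads like a treatment plan, and the result sits in a review queue until the physician signs off in the morning. The agent functions and the queue are hypothetical stubs, not the actual system from the Google paper.

```python
# Illustrative sketch of asynchronous oversight with a multi-agent guardrail.
# Both "agents" are stubs standing in for LLM calls.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DraftReply:
    patient_message: str
    text: str
    approved: bool = False

def drafting_agent(patient_message: str) -> str:
    # In practice an LLM call that drafts a reply to the patient.
    return f"Draft reply to: {patient_message}"

def guardrail_agent(draft: str) -> str:
    # In practice a second LLM (or rules) that removes treatment advice
    # the physician has not yet authorized, returning the redacted draft.
    return draft

review_queue: List[DraftReply] = []

def handle_overnight_message(patient_message: str) -> None:
    draft = drafting_agent(patient_message)
    safe_draft = guardrail_agent(draft)
    review_queue.append(DraftReply(patient_message, safe_draft))

def physician_morning_review(send: Callable[[str, str], None]) -> None:
    # Nothing reaches the patient until a human gives the thumbs-up.
    for item in review_queue:
        item.approved = True  # or the physician edits / rejects it
        if item.approved:
            send(item.patient_message, item.text)
```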
Junaid:
So you talked about two important use cases, and this is something that happens so much clinically. I mean, you have been an ED physician all this time, and I've been an ICU and stroke physician. And believe it or not, when you implement this in a multi-site system, let's say I'm sitting here at a quaternary center as the ICU and stroke director, and there is a hundred-bed hospital that is a primary stroke center. Over there, even the physicians, not just the nurse practitioners or PAs, have very little experience; I mean, they see two stroke patients a month, max, compared to me seeing literally 300 stroke patients a month. So you're absolutely right: when we have to do this, we have to start with clinical dialogue without medical advice. That is exactly what we see here, and then it flows into these agents. The patient and the dialogue agent have their conversation and a SOAP note is drafted, while, more importantly, a guardrail agent is constantly checking that exchange. From that SOAP note there is a clinician cockpit, and the cockpit looks exactly like what we're talking about: there's the dialogue transcript, the subjective, objective, assessment, and plan, the patient message and follow-up, and the clinician reviews and approves it. Again, what they are doing is a second guardrail, and that second guardrail is essentially a human in the loop, HITL; people use different terms for it. So when we look at this in terms of performance, g-AMIE outperforms general PCPs and NPs when compared under the same guardrails. And what they found is that this particular guardrail system with asynchronous oversight is amazing in terms of real-world implementation, which is exactly what Dr. Castro was describing as far as SOAP notes and asynchronous oversight are concerned. One of the key ways this guardrail was implemented, and we don't want to go into too much detail, is dialogue phase transitions: at every phase they generate a response, check that response, and if it contains medical advice, it is routed through the human-in-the-loop system. I love this system, honestly. I think there are three different approaches here, so that was Google's approach; now let's talk about the second approach, from OpenAI. It's interesting because, as I said, people have taken different approaches. This is AI-based clinical decision support, the same idea but restricted to primary care, and again, we discussed this in one of our meetings. It is a real-world study, by the way, across 22,000 clinical visits. Interesting. So one of the things is, when you look at the initial documentation: blood loss, appearance, parasites absent, no crystals, et cetera, and a prescription of metronidazole three times a day. That feeds into the AI consultant's response, and it flags that the treatment involves metronidazole, which is not indicated for uncomplicated gastroenteritis.
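As a toy illustration of that kind of check, here is a Python sketch of a decision-support rule that screens the documented plan against the diagnosis and raises a flag when a prescribed drug is not indicated. The rule table and function are invented for this example; a real system would use an LLM plus curated guidelines, not a hard-coded dictionary.

```python
# Toy decision-support check: flag drugs not indicated for the documented diagnosis.
# The lookup tables below are illustrative placeholders only.

NOT_INDICATED = {
    # diagnosis -> drugs that should trigger a flag
    "uncomplicated gastroenteritis": {"metronidazole"},
}

RECOMMENDED = {
    "uncomplicated gastroenteritis": ["oral rehydration solution (ORS)", "zinc"],
}

def review_plan(diagnosis: str, prescribed: list) -> list:
    """Return human-readable flags for the clinician to review."""
    flags = []
    for drug in prescribed:
        if drug.lower() in NOT_INDICATED.get(diagnosis.lower(), set()):
            flags.append(
                f"{drug} is not indicated for {diagnosis}; "
                f"consider {', '.join(RECOMMENDED.get(diagnosis.lower(), []))}."
            )
    return flags

# Example from the case described above
print(review_plan("Uncomplicated gastroenteritis", ["Metronidazole"]))
```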
This is exactly where the AI consultant comes in. Now, this is really interesting, and actually a little scary, because for primary care what they have effectively added is a GI consult: a super-specialist that is always available and that says, you know what, you don't need metronidazole, you just need zinc and ORS. That's it. Saving time, saving cost, avoiding potential side effects of the medication, and, most importantly, catching this kind of thing. And think about it: you also won't have problems with utilization review and getting reimbursement from insurance. Okay. And this is the third approach, from Microsoft: sequential diagnosis with large language models, a whole medical orchestration. To improve overall output and decrease hallucination, their technical framework is to build an orchestrator and then create multiple agents below it, each super-specialized and each with its own guardrails. Think of it as having a pharmacist there, an occupational therapist there, alongside a PCP, a cardiologist, or a neurologist. The idea is that you have created multiple strong agents, and the orchestrator progressively takes their input and combines it into one diagnosis. Again, this one is only about diagnosis rather than the complete picture of treatment and management, which is different from the OpenAI approach and, of course, the Google approach.
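For intuition, here is a toy Python sketch of that orchestrator-plus-specialists structure: a coordinator consults several narrowly scoped agents and merges their findings into one diagnostic summary. The agent names and the merge step are made up for illustration and are far simpler than Microsoft's actual framework.

```python
# Toy orchestrator that sequentially consults specialized agents and merges
# their findings into a single diagnostic summary. Each agent is a stub for an
# LLM with its own scope and guardrails.

from typing import Callable, Dict

def cardiology_agent(case: str) -> str:  return "cardiology findings..."
def neurology_agent(case: str) -> str:   return "neurology findings..."
def pharmacy_agent(case: str) -> str:    return "medication review..."

SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "cardiology": cardiology_agent,
    "neurology": neurology_agent,
    "pharmacy": pharmacy_agent,
}

def orchestrator(case: str) -> str:
    findings = {}
    for name, agent in SPECIALISTS.items():
        # Each specialist sees the case and answers only within its scope.
        findings[name] = agent(case)
    # Progressively combine specialist input into one diagnostic impression.
    return "DIAGNOSTIC SUMMARY\n" + "\n".join(
        f"- {name}: {text}" for name, text in findings.items()
    )

print(orchestrator("65-year-old with chest pain and new confusion..."))
```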
Junaid:
We want everyone listening to this podcast to understand how important healthcare AI is, how challenging it is to put into production, and that if you choose a vendor, or build it yourself, you need to know whether it has been tested. And that's what we're doing: we're building, we're experimenting. As a matter of fact, one of the big things the AI gurus say is that benchmarks are garbage; what you need to know is how many evaluations you have done. So, believe it or not, we have a thousand-patient data set, with physician notes, patient histories, et cetera, and we constantly check our models against that data set once they're deployed. That's the first thing I do when I actually deploy, and only once that is done do we move forward.
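As one way to picture that kind of deployment-time evaluation, here is a hedged Python sketch: the deployed model is scored against an in-house dataset of cases with expected answers, and the result gates whether the rollout proceeds. The dataset format, scoring rule, and pass threshold are assumptions, not the team's actual evaluation setup.

```python
# Rough sketch of an evaluation gate run against an in-house patient dataset
# whenever a model is deployed. All names, formats, and thresholds are placeholders.

import json

def model_under_test(case_text: str) -> str:
    raise NotImplementedError  # your deployed model or vendor endpoint

def score(prediction: str, expected: str) -> float:
    # Could be exact match, a clinician rubric, or an LLM judge.
    return 1.0 if expected.lower() in prediction.lower() else 0.0

def run_evaluation(dataset_path: str, min_pass_rate: float = 0.95) -> bool:
    with open(dataset_path) as f:
        cases = json.load(f)  # e.g. [{"history": "...", "expected": "..."}, ...]
    scores = [score(model_under_test(c["history"]), c["expected"]) for c in cases]
    pass_rate = sum(scores) / len(scores)
    print(f"{len(cases)} cases, pass rate {pass_rate:.1%}")
    return pass_rate >= min_pass_rate  # gate the deployment on this result
```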
Harvey:
Let me add some information. For people listening, he mentioned some interesting things. There's something called data drift. Let me explain. When you create AI, the model you have today does what you need. But as time passes, that model's behavior starts changing: the way the questions are asked, the way it gets reinforced, and if someone isn't watching the model, it drifts. What does that mean? Well, if today it's handling, let's say, this consult, but over time it starts morphing and morphing until it stops doing the intended job, then it has drifted. So for you who are implementing AI in your systems, make sure that you either have some kind of chief AI officer in the company who knows about these things and can keep watch, or, more importantly, when you sign the contract with the vendor you're bringing in for your AI, make sure you cover this: Is anyone monitoring for data drift? Who is doing that? With what parameters?
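For readers who want a concrete starting point, here is a minimal Python sketch of one common way to watch for drift: comparing a monitored signal between a baseline window and the current window with a two-sample Kolmogorov-Smirnov test (via scipy). The choice of signal, window size, and threshold are assumptions to tune for your own deployment, not a prescribed monitoring standard.

```python
# Minimal drift check: does the current window of a monitored signal look
# statistically different from the baseline window?

from scipy.stats import ks_2samp

def drift_alert(baseline: list, current: list, p_threshold: float = 0.01) -> bool:
    """Return True if the current window differs significantly from the baseline."""
    statistic, p_value = ks_2samp(baseline, current)
    return p_value < p_threshold

# Example: weekly samples of model response length (an illustrative signal)
baseline_lengths = [120, 135, 128, 140, 110, 133, 126]
current_lengths  = [220, 180, 240, 210, 195, 230, 205]

if drift_alert(baseline_lengths, current_lengths):
    print("Possible model/data drift: escalate to the AI oversight owner.")
```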
Junaid:
The goal today, and the reason I look at all this research on a consistent basis, was to help you understand that AI in healthcare is extremely high-stakes. You have to test, test, test all the time, and then validate, validate, validate. Think about guardrails at every single step, and then implement in a way, as Harvey suggested, that is not disruptive to your system but rather provides true, clear value.
Harvey:
You know, I want to add one more thing. We talked about this in one of the other episodes, but it takes a village. When we're creating this, make sure that we're including our patients, make sure that we're including the frontline. A lot of times, for example, I'm consulting for hospital systems and they're like, hey, Dr. Castro, we want to do XYZ, we already bought the system, we want to have you come in and help implement. And I'm like, whoa, whoa, whoa. Did we talk to the front lines? Did we talk to the patients? Are they part of the process? Because from a cultural point of view, maybe the doctors and nurses say, whoa, whoa, whoa, we don't need that fix, that workflow is horrible, this is going to make things slower for us, and we don't want it. So I want to stress that again today: we have to make sure everybody's aligned, everybody's on the same page, and we're moving forward together. Because think about it: if the C-suite spends millions of dollars, everyone commits, and it turns out to be the wrong problem, that's a very costly mistake. The only other thing I want to shine a little light on is that nobody talks about the electric bill. And you're like, what, electric bill? Yes. These models take a lot of energy, and if we start adding models on top of models, the way we're using them, it might end up being another line item that we hadn't thought of. So I just wanted to plant that seed.
Learn more about the work we do
Dr. Junaid Kalia, Neurocritical Care Specialist & Founder of Savelife.AI
🔗 Website
📹 YouTube
Dr. Harvey Castro, ER Physician, #DrGPT™
🔗 Website
Edward Marx, CEO, Advisor
🔗 Website
© 2025 Signal & Symptoms. All rights reserved.