Data Diversity Is the New Standard for Clinical Safety (Here’s Why)
When 71% of training data for FDA-approved AI comes from just three states, geographic bias shifts from a technical oversight to a direct clinical liability. To deliver on the promise of value-based care, we must move beyond "clean" academic datasets and integrate the diverse, real-world data that reflects the actual patients we serve.
In this episode, Dr. Junaid Kalia and Dr. Harvey Castro sit down with Dr. Martin Willemink, a radiologist and former Stanford faculty member who transitioned into the entrepreneurial world to solve AI’s most significant bottleneck: data diversity. As a clinician-turned-founder of Segmed, Dr. Willemink provides a systematic look at why "garbage in, garbage out" is a systemic risk to patient safety and how building a more representative data pipeline is the only way to achieve true generalizability in healthcare AI. The conversation moves beyond the hype of algorithms to the "messy reality" of medicine. We explore the strategic shift from vision-only models to multimodal vision-language models (VLMs) and the emerging role of synthetic data in augmenting training sets where real-world data is scarce.
Ultimately, solving the "data problem" is a prerequisite for clinical safety. By building representative pipelines that account for hardware heterogeneity and geographic diversity, we ensure that AI serves as a reliable extension of the clinician. This infrastructure is the only way to deliver accurate diagnostics and improved outcomes for every patient, ensuring that innovation translates into better care at the bedside rather than just a technical success in a silo.
"I don't think you can ever get rid of bias. But you can definitely try to improve bias by improving the generalizability of the AI model developed on the data."
- Dr. Martin Willemink
What You’ll Discover
[00:00] Intro: The Data Bottleneck of Healthcare AI
[01:29] Dr. Willemink’s Journey from Stanford Faculty to Y Combinator
[04:47] Addressing Data Bias in Clinical Models
[09:14] The Future of Multimodal AI in Healthcare
[13:01] Balancing Data Privacy & Patient Protection
Martin Willemink: I'm not saying there will be no bias. Of course, there'll be bias. I don't think you can ever get rid of bias. But you can definitely try to improve bias by improving the generalizability of the AI model developed on the data. So this data component is very, very important and of course you want to decrease the probability of bias in this whole process.
Junaid Kalia: So, we actually have to source that data with Segmed's help, and that's where they come in, because they have a massive diversity of data partners through their pipeline. People like me don't have to go to six different hospitals ourselves and figure out the process.
Harvey Castro: AI in healthcare is not an algorithm problem; it's a data problem. The future of diagnostics won't be determined by who has the best model, but who has the best data set. In medicine, we say "garbage in, garbage out," but AI just scales that problem.
Junaid Kalia: Good morning everyone. I am so excited that my good friend Martin from Segmed is here. So we're going to go ahead and let him first introduce himself and his amazing journey from scientist to founder to Y Combinator, then tell us about Segmed, and then we can figure out what tough questions Harvey has for you.
Martin Willemink: I'm Martin Willemink. I'm originally from the Netherlands; that's why I have a kind of a funny accent. I went to medical school there, did a PhD in medical imaging, and a master's in clinical epidemiology and statistics. I worked for a couple of years as a resident in radiology with the idea of becoming a radiologist and just sticking around there.
But after about two years I got a grant to go to Stanford. The idea was I'll do a postdoc and be a clinical researcher for one year and then go back and finish my residency. But I never came back. The one year became two years and then I became faculty at Stanford for another three years. I led the cardiovascular imaging research there at the department of radiology. I really had a fun time focusing on innovation and research rather than on clinical work, and I honestly didn't really miss it.
While I was at Stanford, I was consulting for an AI startup called Arterys. Consulting just meaning I was actually doing labeling and annotations for them. I was doing this at Stanford as well; we were doing a lot of projects and we needed to annotate and segment the data. It took a long time, but it was kind of fun, and I thought maybe there is something to do here. That's actually the original idea of Segmed—that we were going to be a segmentation and annotation company, hence the name.
I presented the idea at a program of the Stanford Graduate School of Business called Stanford Ignite, which is a 10-week part-time program. Anybody that attends can pitch their idea, collect a small group, and work on it for 10 weeks. But the other problem I faced in my career was that it's not easy to get access to medical imaging data. You need the annotations and labels, but before that, you actually need access to the actual data. That was a big problem, and we learned very quickly a lot of people have that same problem, so let's try to solve that. That's the basis of how we started Segmed and the problem we're tackling.
Junaid Kalia: Just a reminder, Martin is actually closely related to an actual labeling company. They don't participate in a commission-based program. These are different labeling companies who can directly access our data and therefore find radiologists for you to label them. We didn't use that service because we actually had access to radiologists.
Harvey Castro: Martin, awesome meeting you. Thanks for coming. Just to break it down, there's a bell curve here listening. We got some really advanced people and some people just starting out. To set the stage, we talk about AI in healthcare and algorithms getting so much attention, but in reality, we all know it's the data. Explain to us how you're addressing bias, because just to let people know, "we don't know what we don't know". If the AI has studied something and you ask it a question, it may seem biased because they're not answering this other part, but they've only been trained on this. I'd love to hear your thoughts on that.
Martin Willemink: That's a very important point and maybe the main point that excited me when we pivoted from doing just labels and annotation to actually providing data. In the meantime, we've built a large network of healthcare providers representing all 50 US states. It includes healthcare systems, teleradiology clinics, imaging clinics, and so on. There is a lot of diversity and heterogeneity in the data that we can provide to our customers, and that is the core of what's important here.
When we started Segmed in 2020, a year later there was a paper (I think it was JAMA) that looked at where the FDA-approved deep learning models for radiology AI were actually trained. It showed that 71% of the training data actually came from three states: Massachusetts, New York, and California. That's because that's where the Stanfords and MGHs of the world are located where people can get access through academic research collaborations.
You can imagine if you train a model based on data from those three places—fancy academic places with perfect CT and MRI scanners—and then you try to apply that model in the deep south of the US where the quality of materials isn't as good or you have a different racial or ethnic population, the model will not work very well. I'm not saying there will be no bias; of course, there'll be bias. I don't think you can ever get rid of bias. But you can definitely try to improve bias by improving the generalizability of the AI model. This data component is very important to decrease the probability of bias in this process.
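The geographic concentration described here is simple arithmetic: count records per state and ask what fraction the top few states account for. Below is a minimal sketch of that calculation. The function name `top_k_share` and the cohort counts are hypothetical, chosen only so that the top three states add up to the 71% figure quoted in the conversation; they are not data from the paper or from Segmed.

```python
from collections import Counter


def top_k_share(states, k=3):
    """Return the k most common states and the fraction of records they cover."""
    if not states:
        raise ValueError("states must not be empty")
    counts = Counter(states)
    top = counts.most_common(k)
    share = sum(n for _, n in top) / len(states)
    return [state for state, _ in top], share


# Hypothetical cohort: one state label per training exam (100 exams in total).
cohort = (
    ["CA"] * 40 + ["MA"] * 20 + ["NY"] * 11
    + ["TX"] * 10 + ["OH"] * 10 + ["GA"] * 9
)

top_states, share = top_k_share(cohort)
print(top_states, f"{share:.0%}")
```

A high top-three share does not prove a model is biased, but it is a cheap first signal that the training set may not reflect where the model will be used.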
Harvey Castro: If you're able to, what are some of the techniques that you do or that other people are doing to address this?
Martin Willemink: We are a company that provides the pipelines and technology to provide data in a centralized way, meaning we can transfer the data from the healthcare provider to the researcher. Our customers are doing a lot of AI development, like diagnostic models; that's not what we do. We help the hospitals with making their data—it's their data—available. We're just sub-licensing it; we are not the owners. We are representing the health systems.
Junaid Kalia: Even before Martin's team gets involved, AI developers like our team take care of that bias problem from the get-go. Martin's team is amazing in making sure we get access to data. We design the non-biased generalizability from the start and then submit it to Martin's team, and they get back to us specifically on what we can and cannot do.
Now just for the "new bell curve" audience, what does that mean? Whenever we are developing radiology AI, we need generalizability—for example, breast imaging. We have to make sure it is going to work for a 45-year-old African-American female at a certain density to be able to predict. We have so many things to control that we have to make sure we have the data available so it can be generalizable to the community.
Martin, now that you're looking at the progress moving from vision models to vision language models (VLMs) and AI drafting for radiology reports, how do you foresee Segmed helping future developers in producing VLMs and not just vision models?
Martin Willemink: We're seeing a trend towards multimodal data because models are getting more complex and can handle images and text-based information. When I say text-based, I'm talking about the report of the radiologist. That becomes very valuable if you want to develop these more advanced models. We're starting to work with our health systems to integrate with EHR systems to add other kinds of information on top of the images. Images are still at the core of what we're doing, but we can provide additional data on top of that.
Harvey Castro: From a data science point of view, I'm curious how you would do this. If I'm Hospital X and I show you my data, how are you checking for bias and true representation? How are you "biopsying" that population?
Martin Willemink: We don't have super sophisticated data science technologies for that right now, though it would be interesting to apply. Right now, we look at it from a perspective of coverage. When we started, we worked with imaging clinics that were mostly outpatient, meaning we wouldn't see a lot of complex diseases or certain therapeutic areas. That led us to reach out to more hospitals that have inpatients so we can have more coverage of diseases.
When you think about bias, there are many different types. Representation of vendors and machines in radiology is very important. If you only train a model on data from Siemens CT scanners, how is it going to work on scanners from Philips, GE, or Canon? The AI model may not handle differences in spatial resolution between machines.
The reason this became clear to us is the FDA. More than 45 FDA approvals have been done with data from Segmed, and the FDA will tell you that you need at least two geographic regions, a 50-50 male-female representation, and specific vendor percentages (e.g., 25% Vendor A, 25% Vendor B). If you don't have that, you cannot provide the data set.
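The kind of composition requirement described here (multiple regions, balanced sex, minimum vendor shares) can be checked mechanically before a data set is assembled. The sketch below is an illustration under stated assumptions: the record fields (`region`, `sex`, `vendor`), the function `composition_report`, and the tolerance values are all hypothetical, and the thresholds paraphrase the conversation rather than any published FDA rule.

```python
from collections import Counter


def composition_report(records, min_regions=2, sex_tolerance=0.10,
                       min_vendor_share=0.15):
    """Return a list of human-readable gaps in region, sex, and vendor mix.

    All thresholds are illustrative assumptions, not regulatory values.
    """
    n = len(records)
    if n == 0:
        raise ValueError("records must not be empty")
    issues = []

    # Geographic coverage: count distinct regions.
    regions = {r["region"] for r in records}
    if len(regions) < min_regions:
        issues.append(f"only {len(regions)} geographic region(s)")

    # Sex balance: female share should be close to 50%.
    female_share = sum(r["sex"] == "F" for r in records) / n
    if abs(female_share - 0.5) > sex_tolerance:
        issues.append(f"female share {female_share:.0%} is far from 50%")

    # Vendor mix: every vendor present should reach a minimum share.
    vendors = Counter(r["vendor"] for r in records)
    for vendor, count in sorted(vendors.items()):
        if count / n < min_vendor_share:
            issues.append(f"vendor {vendor} under-represented at {count / n:.0%}")
    return issues


# Hypothetical balanced set: 2 regions x 2 sexes x 4 vendors.
balanced = [
    {"region": region, "sex": sex, "vendor": vendor}
    for region in ("Northeast", "South")
    for sex in ("F", "M")
    for vendor in ("Siemens", "Philips", "GE", "Canon")
]

# Hypothetical skewed set: one region, mostly male, mostly one vendor.
skewed = (
    [{"region": "Northeast", "sex": "M", "vendor": "Siemens"}] * 8
    + [{"region": "Northeast", "sex": "F", "vendor": "Siemens"}]
    + [{"region": "Northeast", "sex": "F", "vendor": "GE"}]
)

balanced_issues = composition_report(balanced)
skewed_issues = composition_report(skewed)
print(balanced_issues)
print(skewed_issues)
```

Note that this check only flags vendors that appear in the data; a vendor that is missing entirely would need an explicit list of expected vendors to catch.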
Junaid Kalia: We have to source that data with Segmed's help because they have massive diversity of data partners so someone like me doesn't have to go to six different hospitals to figure out the process. I know the big thing in governance is privacy. Some people worry about imaging data creating privacy risks. How do you balance that access of data and protecting the patients?
Martin Willemink: That's the main question. You have to make sure privacy and security safeguards are in place. Being HIPAA compliant is the number one priority at Segmed. We have SOC 2 and ISO 27001. We had a statistics expert dig into our database for months to look into whether it is possible to re-identify patients. He concluded there is a very low risk of re-identification, and we got the "Expert Determination" stamp. We are both Safe Harbor and Expert Determination accredited.
Junaid Kalia: To give a live example, there are three things to de-identify: the radiology images (metadata comes cleansed), the radiology report itself, and then the data for foundation models. These foundation models require exponentially more data than traditional AI. They are broad models that can implement many different features rather than just looking at one, like lung nodules.
Martin Willemink: One thing I want to note is that volume is one thing, quality is another. In radiological practice, half if not more of the exams are normal. You always need some normal exams, but you definitely need the abnormals. For example, we are onboarding Cancer Centers because we know there is a big need for oncological cases for cancer research and innovation.
Harvey Castro: What are your feelings and understanding on synthetic data? I'm curious what you have to say about how the FDA looks at it and where the future is.
Martin Willemink: It's an interesting topic. Yes, there are still populations that cannot get access to healthcare even in the US, and we cannot get data that doesn't exist. That's where synthetic data comes in. There will be bias in any data, but we're trying to decrease it.
The opportunity to generate images that look real but aren't is exciting. However, the FDA right now for medical imaging does not love the idea of a model working on 2,000 "fake" patients. They won't give you the stamp of approval for the real world based on that. Currently, there is no role for synthetic data in validation, but there is in training. I wouldn't say 100% of the data can be synthetic, but I see it as an augmentation or enrichment opportunity. I don't think the market is ready for it yet, but in a few years, synthetic data will have an important role in augmenting training data sets.
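The distinction drawn here, synthetic data for training but not for validation, can be enforced in how a data set is split. The following is a minimal sketch assuming each example is a `(source, id)` tuple; the function `split_with_synthetic` and the 20% validation fraction are hypothetical choices for illustration, not a method described by the speakers.

```python
import random


def split_with_synthetic(real, synthetic, val_fraction=0.2, seed=0):
    """Split real examples into train/validation, then add synthetic
    examples to the training split only."""
    if not real:
        raise ValueError("need real examples to build a validation set")
    rng = random.Random(seed)
    shuffled = list(real)
    rng.shuffle(shuffled)

    # Validation is drawn from real data before any synthetic data is mixed in.
    n_val = max(1, int(len(shuffled) * val_fraction))
    val = shuffled[:n_val]

    # Synthetic examples augment the remaining real training examples.
    train = shuffled[n_val:] + list(synthetic)
    return train, val


# Hypothetical examples: 10 real exams and 5 synthetic ones.
real = [("real", i) for i in range(10)]
synthetic = [("synthetic", i) for i in range(5)]

train, val = split_with_synthetic(real, synthetic)
print(len(train), len(val))
```

In practice the split would usually be done per patient rather than per exam, so that images from one patient never land on both sides; this sketch leaves that out for brevity.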
Junaid Kalia: Since all the current data is retrospective, are you planning a pathway with your current partners to do more than retrospective data? For example, if I have an IRB, could you be the middle person to generate what we call Real World Evidence, which the FDA is moving towards?
Martin Willemink: When I started Segmed, I wanted to do a thousand things and realized quickly we have to focus. Theoretically, it's possible because we have direct connections with the PACS and RIS systems and the IT systems of healthcare providers. We could prospectively collect data. Right now we are focusing on radiology and expanding to other modalities. We are also interested in providing insights on top of the data for pharma and life science customers who don't necessarily know how to handle DICOM files or aren't experienced in developing models.
Junaid Kalia: Martin, any last thoughts? How do we reach you and what do you suggest to our audience?
Martin Willemink: Well, first of all, thank you for having me. Great conversation and questions. You have a good audience here that knows what they're talking about. You can reach me on LinkedIn or search for "Martin Willemink". I'm happy to talk about anything—data, AI in medical imaging, real-world imaging data. If you are a researcher, academic or commercial, and you need access to data to develop an AI model or do research, please let me know; we will very likely be able to help you.
Harvey Castro: I'm just going to summarize in five sentences:
AI in healthcare is not an algorithm problem; it's a data problem. Future diagnostics won't be determined by who has the best model, but who has the best data set. Data diversity is the vaccine against AI bias. In medicine we say "garbage in, garbage out," but AI just scales that problem. The real breakthrough in healthcare AI will come when models are trained on the "messy reality" of medicine and not the clean data sets seen in research labs.
Learn more about the work we do
Dr. Junaid Kalia, Neurocritical Care Specialist & Founder of Savelife.AI™