Every so often in research, a project lands at just the right moment, addressing a challenge many people were quietly struggling with. For us at Segmed, that project was our 2020 paper: Preparing Medical Imaging Data for Machine Learning, published in Radiology.
We didn’t set out to write a 'landmark paper'. We were simply trying to make sense of the chaos we were seeing firsthand: the inconsistent, fragmented, and inaccessible world of medical imaging data for AI.
Now, five years later, that paper has been cited more than 1,000 times! It’s a milestone we’re proud of, not just because of the number, but because of what it represents for healthcare AI and for Segmed.
At the time, AI in medical imaging was gaining momentum: many AI startups were being founded, and larger companies were starting to develop AI models as well. But almost nobody was talking about the elephant in the room: preparing imaging data for AI is hard. Clinical imaging data is stored in secure silos (for good reason). That is great for clinical use, but it makes the data hard for AI developers to access. On top of that, de-identifying imaging data is complex, every institution has different IT systems, and annotations vary wildly. Metadata was missing or incompatible, and privacy concerns made data sharing nearly impossible.
We wrote this paper because we needed a clear, practical guide for how to actually get imaging data AI-ready.
And that exact problem is what led us to found Segmed.
The mission behind Segmed has always been simple:
Our mission may not be sexy, but this paper (and Segmed’s significant growth!) showed that it is important. The paper was the first expression of that vision. It set out the foundational workflows, best practices, and safeguards we still rely on and continue to evolve today.
The paper outlines the fundamental steps for preparing imaging data in AI algorithm development, explains current limitations to data curation, and explores new approaches to address the problem of data availability. Topics of the paper include:
We also highlighted the common mistakes and challenges teams encounter when working with real-world imaging data, things we’ve spent years addressing through Segmed’s platform.
The paper has been cited so widely because it addressed a universal problem. AI teams everywhere were running into the same frustrations: inconsistent data, annotation gaps, privacy barriers, and poor model generalization.
This paper gave them a starting point: a practical, field-tested set of recommendations they could build from.
It’s since been cited by AI healthcare startups, university research groups, hospital systems, and even the World Health Organization, as well as in regulatory submissions. And it still serves as a go-to reference for anyone working to develop AI responsibly in clinical imaging.
At Segmed, this milestone feels personal. It validates not just a publication, but the core belief we built this company on:
Better data makes better AI. And better AI improves healthcare for everyone.
Since the paper’s publication, we’ve expanded that mission, offering AI developers access to curated, de-identified imaging datasets and clinical metadata through our platform, which is built on the very workflows we first described in this paper.
As healthcare technology continues to evolve, our commitment remains the same: clean, reliable, ethically sourced imaging data for healthcare innovation. Segmed has been part of more than 35 FDA clearances, multiple foundation models, and fit-for-purpose real-world evidence research projects.
To the AI researchers, clinicians, developers, and innovators who’ve read, cited, and built upon this work: thank you. Science moves forward because of communities like this.
And of course, thank you to the authors of the paper: Martin Willemink, Adam Koszek, Cailin Hardell, Jie Wu, Dominik Fleischmann, Hugh Harvey, Les Folio, Ronald Summers, Daniel Rubin, and Matt Lungren.
Here’s to 1,000 citations, countless lives touched through better AI, and the work we still have ahead.
Curious? Read the original paper here.
And if you’re building AI for healthcare, let’s talk.