Data Preparation for Medical Imaging AI Model Development

Author: Segmed Team

Reading time: 4 min
Product

Artificial intelligence (AI) has reached nearly every industry, and healthcare is one of the fields where it has improved outcomes most visibly. Its major benefit in healthcare is more precise diagnosis, which supports better decision-making and treatment planning and, ultimately, better overall care.

Medical imaging generates a vast share of healthcare data, and the sheer volume has encouraged professionals to adopt technologies like AI to manage it. AI also helps access the rich detail hidden within imaging data. When used to develop healthcare AI models, medical imaging data supports important steps such as diagnosis and treatment planning. For these reasons, AI is emerging as a leading solution, one that could transform medical research.

For instance, AI diagnostics applied to mammography scans have led to earlier diagnosis of breast cancer. AI models can analyze the scans in fine detail and detect microcalcifications or suspicious masses, which helps in early diagnosis.

Developing such models depends above all on data preparation. The robustness and reliability of an AI model depend directly on its data, which must be prepared in a format that is ready to be processed. This matters even more for healthcare AI models, because they influence decisions that affect health outcomes.

Importance of data preparation in imaging data-driven AI models


Data preparation refers to the process of preparing and maintaining data in a format that can be easily processed by the AI models. This can mean standardizing the data format to ensure consistency and also adding details that should be analyzed to develop a model. Without data preparation, healthcare AI models will not be able to interpret data and provide meaningful outcomes.

Data preparation is even more crucial for medical imaging data, because raw imaging scans are inherently complex and inconsistent. They vary in resolution and quality, and they can contain noise, artifacts, and irrelevant background information that distracts from the actual findings.

Data preparation makes medical images standardized, consistent, and ready to be analyzed by AI models. A key part of it is data annotation, which makes imaging scans more interpretable by accurately labeling tumors, lesions, or anatomical structures. This precise labeling helps the AI model deliver better results.

Apart from meeting technical requirements, data preparation of imaging data must comply with privacy laws. It should also incorporate diverse datasets to reduce bias.

Without proper data preparation of imaging data, the model can be inaccurate and unreliable, which defeats the purpose of developing a healthcare AI model.


Steps in Imaging Data Preparation for AI model development 


a. Image acquisition/data collection

Image acquisition is the first step in creating imaging-based AI models. It consists of capturing or acquiring the images used for training, validation, and testing. The images must be high-quality and diverse, drawing on different modalities such as CT, MRI, X-ray, ultrasound, and PET. Diversity in demographics, geography, and equipment across the imaging data is necessary to develop reliable models.
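One way to make the diversity requirement concrete is to tally how a collected dataset is distributed across modality, site, or any other attribute. The following is a minimal sketch, assuming each study is described by a plain dictionary. The field names, the example records, and the 30% threshold are illustrative assumptions, not a real schema or a recommended cutoff.

```python
from collections import Counter

# Hypothetical study records; the field names are illustrative only.
studies = [
    {"modality": "CT", "site": "hospital_a"},
    {"modality": "CT", "site": "hospital_a"},
    {"modality": "CT", "site": "hospital_b"},
    {"modality": "MRI", "site": "hospital_a"},
]


def coverage(records, field):
    """Return the share of records for each value of `field`."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def underrepresented(records, field, min_share=0.1):
    """List values of `field` whose share falls below `min_share`."""
    shares = coverage(records, field)
    return sorted(value for value, share in shares.items() if share < min_share)


if __name__ == "__main__":
    print(coverage(studies, "modality"))
    print(underrepresented(studies, "modality", min_share=0.3))
```

A report like this only flags gaps. Deciding which attributes matter and how much of each is enough depends on the intended clinical use of the model.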


b. Image de-identification

Image de-identification is the removal of any information from the imaging data that could identify the patient. This step prepares the data for further research and AI model development without risking the patient's privacy.

Medical imaging data is stored in DICOM (Digital Imaging and Communications in Medicine) format, which may contain clinical metadata such as the patient's name, date of birth, or other personal information. De-identification is essential to keep this Protected Health Information (PHI) protected. It involves removing or anonymizing PHI in the DICOM file, in compliance with HIPAA and GDPR. While the sensitive information is removed, important clinical information is retained so that the data remains valuable for AI training and validation. De-identification of medical imaging data is complex, since PHI must be removed from text (e.g., a radiology report), from the image itself (e.g., an X-ray with burned-in text), and from the DICOM metadata.
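The metadata part of de-identification can be sketched as follows, assuming the DICOM header has already been read into a plain dictionary keyed by attribute keyword. The list of attributes to drop is a small illustrative subset, and the salted hash used to pseudonymize the patient ID is one possible choice rather than a prescribed method. The sketch does not handle burned-in pixel text or free-text reports, and it is not a complete HIPAA or GDPR procedure.

```python
import hashlib

# Illustrative subset of identifying attributes; a real de-identification
# profile covers many more fields than these.
PHI_KEYWORDS = {
    "PatientName",
    "PatientBirthDate",
    "PatientAddress",
    "ReferringPhysicianName",
}


def deidentify(metadata, salt):
    """Return a copy of `metadata` with PHI removed and the ID pseudonymized."""
    clean = {key: value for key, value in metadata.items() if key not in PHI_KEYWORDS}
    if "PatientID" in clean:
        # A salted hash keeps studies from the same patient linkable
        # without exposing the original identifier.
        raw = (salt + str(clean["PatientID"])).encode("utf-8")
        clean["PatientID"] = hashlib.sha256(raw).hexdigest()[:16]
    return clean


if __name__ == "__main__":
    header = {
        "PatientName": "DOE^JANE",
        "PatientBirthDate": "19700101",
        "PatientID": "12345",
        "Modality": "MR",
        "SliceThickness": 1.5,
    }
    print(deidentify(header, salt="example-salt"))
```

Clinical fields such as the modality and slice thickness pass through unchanged, which reflects the goal of keeping the data useful for training while the identifying fields are dropped.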


c. Data curation (cleaning & normalization)

Imaging data is seldom clean when it is collected. Corrupt scans, duplicates, unfinished studies, and mixed formats are common. Data curation is the process of eliminating unusable images and converting formats to a standard (such as from DICOM to NIfTI or JPEG for machine learning applications). It also includes normalizing properties such as resolution, pixel values, and slice thickness. Organized, cleaned-up data lets the AI model process the data consistently and minimizes errors later on.
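Two of these curation steps, removing duplicates and normalizing pixel values, can be sketched in a few lines. This is a minimal illustration that treats each image as raw bytes and each pixel array as a flat list of numbers. Exact-match hashing and min-max scaling are assumptions chosen for simplicity, and real pipelines often use other duplicate checks and modality-specific intensity normalization.

```python
import hashlib


def drop_duplicates(images):
    """Keep the first copy of each image and skip empty files.

    `images` is a list of (name, raw_bytes) pairs.
    """
    seen = set()
    unique = []
    for name, data in images:
        if not data:
            continue  # an empty file cannot be a usable scan
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # byte-for-byte duplicate of an earlier image
        seen.add(digest)
        unique.append((name, data))
    return unique


def min_max_normalize(pixels):
    """Rescale pixel values linearly to the range [0.0, 1.0]."""
    lowest, highest = min(pixels), max(pixels)
    if highest == lowest:
        return [0.0 for _ in pixels]  # constant image, avoid division by zero
    return [(p - lowest) / (highest - lowest) for p in pixels]


if __name__ == "__main__":
    scans = [("a", b"scan-1"), ("b", b"scan-1"), ("c", b""), ("d", b"scan-2")]
    print([name for name, _ in drop_duplicates(scans)])
    print(min_max_normalize([0, 50, 100]))
```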


d. Image storage

Image storage entails safe archiving of medical images and metadata for quick access and long-term use. Modern systems store images digitally in a PACS (Picture Archiving and Communication System), which replaces film-based archives. Quality assurance (QA) verifies patient information and study integrity before archiving. Proper storage preserves data integrity and compliance, so the archive can support AI training and clinical research.
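One simple way to check that archived files remain intact is to record a checksum for each file at archiving time and compare it later. The sketch below assumes files are available as bytes in a dictionary. The manifest approach and the function names are illustrative assumptions and do not describe how any particular PACS performs QA.

```python
import hashlib


def build_manifest(files):
    """Map each file name to the SHA-256 digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def find_integrity_errors(files, manifest):
    """Return names that are missing or no longer match the manifest."""
    errors = []
    for name, expected in manifest.items():
        data = files.get(name)
        if data is None or hashlib.sha256(data).hexdigest() != expected:
            errors.append(name)
    return sorted(errors)


if __name__ == "__main__":
    archive = {"study1.dcm": b"abc", "study2.dcm": b"def"}
    manifest = build_manifest(archive)
    # Simulate one corrupted file and one lost file.
    damaged = {"study1.dcm": b"abX"}
    print(find_integrity_errors(damaged, manifest))
```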


e. Annotation

Image annotations play a crucial role in training and testing AI algorithms. Image annotation refers to labeling the images with the requisite information (e.g., spatial position, categorization), and the resulting labels are commonly referred to as ground truth. Such data is usually stored within the same DICOM file or in an ancillary text report. It must be reformatted into a more readable and standardized format, such as JSON or CSV, so that it can be processed later and used for AI development.
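The reformatting step can be sketched with Python's standard `json` and `csv` modules. The annotation records below, including the field names and the bounding-box layout, are hypothetical examples and not a standard annotation schema.

```python
import csv
import io
import json

# Hypothetical annotation records: one bounding box and one label per image.
FIELDS = ["image_id", "label", "x", "y", "width", "height"]

annotations = [
    {"image_id": "img_001", "label": "mass", "x": 120, "y": 88, "width": 40, "height": 36},
    {"image_id": "img_002", "label": "lesion", "x": 15, "y": 200, "width": 22, "height": 18},
]


def to_json(records):
    """Serialize annotation records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize annotation records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_json(annotations))
    print(to_csv(annotations))
```

JSON suits nested annotations such as polygons or multiple findings per image, while CSV is convenient when each annotation fits in one flat row.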

Annotation of medical image datasets is necessary for training AI models, since annotations provide the "ground truth" that algorithms learn from. In some cases the image itself is enough: for a fracture, for example, the radiological image serves as the ground truth. In other cases, annotation of medical images provides additional value in training an AI model. By highlighting tumors, lesions, organs, or other relevant features, annotations enable the model to identify and classify such features accurately. The model then learns the relevant patterns rather than unimportant details. This precision results in more trustworthy models that can assist in real-world healthcare decisions.


Challenges in data preparation


While the importance of data preparation for AI model development is undeniable, it still faces challenges. One is the fragmentation of healthcare data, which is stored in varying formats and under varying protocols across different healthcare settings. This fragmentation hinders the development of unified and linked datasets.

One of the chief obstacles to the development and clinical implementation of AI algorithms is the limited availability of large, curated, and representative training data that includes expert labeling (e.g., annotations). Current supervised AI methods require curated data to train, validate, and test algorithms optimally. At present, most research groups and companies have limited data access, based on small sample sizes from small geographic areas. In addition, preparing data is costly and time-intensive. The result is algorithms with limited utility and poor generalization.

Lastly, regulatory bodies require healthcare data to be de-identified, and this must be done while preserving clinical context. And since AI models rely heavily on large volumes of data, data quality, consistency, and diversity are also important factors to consider.


How Segmed helps in the process of AI model development


The Segmed team wrote the landmark paper on preparing medical imaging data for AI model development in collaboration with researchers from Stanford and NIH. Imaging-driven AI models are changing healthcare. They help detect diseases earlier, personalize treatments, and improve patient outcomes. However, creating these models requires large amounts of high-quality, diverse, and compliant data.

It is widely accepted that better data makes better AI, and better AI in turn improves healthcare for everyone.

With this mission in mind, Segmed provides access to 100M imaging studies from diverse modalities like X-rays, MRIs, CT scans, and ultrasounds. Our regulatory-grade, de-identified, and annotated datasets are ideal for developing AI models across oncology, neurology, and cardiology disease areas. Segmed has been part of more than 35 FDA clearances, multiple foundation models, and fit-for-purpose real-world evidence research projects.

Connect with us to explore how our diverse, high-quality tokenized imaging datasets can enhance the training and validation of healthcare AI models.