How to de-identify medical image data


At Segmed, we know that 21st-century medical diagnostics are undergoing a major transition. Recent years have brought significant breakthroughs from artificial intelligence (AI) algorithms. Advanced methods once thought unattainable because of processor and power constraints are now becoming a reality. With advances in both computing power and AI algorithms, we have already started seeing reports that some of these algorithms outperform doctors in speed and even accuracy.

One of the dominant methods in automatic image-based diagnostics is deep learning. This type of algorithm must be fed large volumes of medical data to learn how to spot suspicious abnormalities, such as a tumor or a fracture, on medical images. In general, the more data you feed into the algorithm, the more accurate it gets.

But what does the medical data used for training really entail? First, pixel data is essential. It contains rich information about the patient’s health condition, and spotting an anomaly wouldn’t be possible without the image itself.

Second, labels are needed to train the algorithm. Labeling of medical data is usually performed by doctors, and the labels signify the ‘ground truth’ about what the image is and what it contains. A label can be thought of as a hint given to the computer so it can learn how to interpret these cases. These hints are similar to what a child’s brain needs in order to learn to recognize the world around it. The labels for a chest x-ray could include hints such as “chest” or “cancer”. Another hint might be “left lung”. With labels, machine learning (ML) algorithms can start recognizing patterns. After the ML algorithm is trained, it is sent to hospitals to validate its performance and to support an application for FDA clearance.

And that’s all.

Wait, you ask. That’s all? What about my name? Or last name? And birthday? 

Well, the computer algorithm doesn’t usually need your personal data. Your first and last name are not required. Your address isn’t required either. Details such as weight and age are sometimes required for an algorithm to perform well, but not usually. It all depends on what the algorithm is being trained to do.

Sometimes algorithms perform better when they ingest additional data about a patient, called ‘metadata’. The data needed is not always personal information; it can be details like the region of the world the patient is from, or the type of scanner used to acquire the image.

So how does Segmed ensure that the images we use for training do not contain private patient information? Well, medical images come in a special image format called DICOM. It’s similar to a video file, but instead of a movie it contains images of your body. Each frame in this video is a cross-section from your exam. And it’s up to your doctor to record your first and last name in the exam. MRI and CT exams have many frames, and the number keeps growing as imaging products and software improve. X-rays, however, are single-frame. What matters most here is that DICOM handles all of this.

At Segmed, we care very much about patient privacy, and we take great care to strip all Protected Health Information (PHI) from the data we ingest. This process is called de-identification. The steps are as follows.

  1. Our data partners export patients’ data from the PACS inside the hospital or imaging center

  2. They remove the PHI from the DICOM file headers using guidelines that we provide

  3. They ensure that file and folder names contain no PHI

  4. They check for PHI that may be burned into the DICOM image itself

  5. They export the data to the cloud

On the Segmed side, we double-check that all of these steps have been completed successfully. If we find something suspicious, we alert our partners and do not add the data to our library.
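As an illustration of the file and folder name check in step 3, here is a minimal Python sketch. It is not our production tooling: the function name `find_suspicious_paths` and the idea of comparing against a list of known patient name fragments are assumptions made for this example.

```python
from pathlib import Path


def find_suspicious_paths(root, known_names):
    """Return paths under `root` whose own name contains a known patient name.

    `known_names` is a hypothetical list of name fragments (e.g. surnames)
    that the data handler already has on site; it never leaves the hospital.
    """
    # Compare case-insensitively, since exports may use "DOE_JOHN" or "Doe John".
    needles = [name.lower() for name in known_names if name]
    hits = []
    # rglob("*") walks every file and folder below root.
    for path in sorted(Path(root).rglob("*")):
        lowered = path.name.lower()
        if any(needle in lowered for needle in needles):
            hits.append(path)
    return hits
```

A non-empty result means the export still needs cleaning before upload. A simple substring match like this can miss misspellings or initials, so it complements a manual review rather than replacing it.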

Technical background

A typical DICOM file contains not only images but also headers with various information, including the patient’s name, ID, exam date, hospital name, location, and equipment info, which is commonly referred to as metadata. De-identification requires carefully removing or replacing the values of these sensitive headers, which come in various data types, as well as any information burned into the images by some older machines. Fully anonymizing DICOM files can therefore be tricky. We use multiple tools together to do it step by step.
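To make the remove-or-replace idea concrete, here is a minimal Python sketch. It assumes the headers have already been read into a plain dictionary keyed by standard DICOM attribute keywords; in practice a dedicated DICOM tool does the reading and writing. The function names, the `salt` parameter, and the choice of which attributes to drop are illustrative assumptions, not our actual guidelines.

```python
import hashlib

# Standard DICOM attribute keywords. Which ones to drop is an illustrative
# choice for this sketch, not a complete or authoritative PHI list.
REMOVE = {
    "PatientBirthDate",
    "PatientAddress",
    "StudyDate",
    "InstitutionName",
    "InstitutionAddress",
    "ReferringPhysicianName",
}
REPLACE = {"PatientName": "ANONYMOUS"}


def deidentify_headers(headers, salt):
    """Return a copy of `headers` with sensitive values removed or replaced."""
    cleaned = {}
    for keyword, value in headers.items():
        if keyword in REMOVE:
            continue  # drop the attribute entirely
        if keyword == "PatientID":
            # Replace the ID with a salted hash so that exams from the same
            # patient can still be grouped without exposing the real ID.
            digest = hashlib.sha256((salt + str(value)).encode("utf-8"))
            cleaned[keyword] = digest.hexdigest()[:16]
        elif keyword in REPLACE:
            cleaned[keyword] = REPLACE[keyword]
        else:
            cleaned[keyword] = value  # e.g. Modality, Manufacturer
    return cleaned


def needs_pixel_review(headers):
    """Flag images that may carry burned-in text.

    BurnedInAnnotation (0028,0301) is "YES" or "NO"; a missing value is
    treated as unknown, so the image is flagged for review.
    """
    return str(headers.get("BurnedInAnnotation", "")).upper() != "NO"
```

Equipment attributes such as `Manufacturer` are kept in this sketch because scanner type can be useful metadata for training. The salt has to stay secret on the hospital side, otherwise the hashed IDs could be matched back to the originals.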


The first step is to remove the sensitive headers from the DICOM files. Multiple free and paid tools are available for this purpose. We use one of two, DICOM Anonymizer Pro developed by NeoLogica or the CTP DICOM anonymizer by MIRC, depending on the technical experience of the data handler at the hospital or imaging center. DICOM Anonymizer Pro provides a friendly graphical user interface on Windows, Mac OS X and Linux. With it, we just need to select the folders containing patients’ DICOM files, import them into the interface, clean the headers and export the results into new folders. The video (link to youtube) demonstrates the usage of DICOM Anonymizer Pro with its default settings.

To ensure the quality of the de-identification process, we go through the headers of the target DICOM files and make sure that those with identifying information are selected in the cleaning process, in line with GDPR and HIPAA. However, one drawback of this kind of DICOM anonymizer is that it commonly keeps the names of the original folders. Unfortunately, the export functions of some PACS use patients’ names as folder names when they export DICOM files. We therefore developed a script that automatically renames the folders, as demonstrated in the video (link to youtube). Data handlers only need to put the patients’ DICOM files in the same folder as the downloaded script and run the script to rename and index the folders.
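The script itself is not reproduced here, but the renaming idea can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the function name `rename_patient_folders`, the `case_0001` naming scheme, and the returned mapping are not taken from our actual script.

```python
from pathlib import Path


def rename_patient_folders(root, prefix="case"):
    """Rename each subfolder of `root` to an indexed, name-free label.

    Returns a list of (old_name, new_name) pairs. The old names may contain
    PHI, so this mapping must stay with the data handler or be discarded.
    """
    root = Path(root)
    # Sort so that the numbering is reproducible between runs.
    folders = sorted(p for p in root.iterdir() if p.is_dir())
    mapping = []
    for index, folder in enumerate(folders, start=1):
        new_name = f"{prefix}_{index:04d}"
        if folder.name == new_name:
            continue  # already renamed on a previous run
        target = root / new_name
        if target.exists():
            # Refuse to merge or overwrite another patient's folder.
            raise FileExistsError(f"{target} already exists")
        folder.rename(target)
        mapping.append((folder.name, new_name))
    return mapping
```

Calling `rename_patient_folders(".")` from the folder that holds the exported patient folders would turn names like `DOE_JOHN` into `case_0001`, `case_0002`, and so on, leaving the files inside each folder untouched.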