Introducing Segmed’s LLM-based Data De-identification Playground

Author: Rachel Madukayil

Reading time: 3 min

Product

Here at Segmed, we’re simplifying access to real-world imaging data for AI research and collaboration. As a result, we regularly process massive quantities of data, both radiology reports and their associated DICOMs. To date, we have accumulated 60M+ patient records sourced from 1,500+ sites across 5 continents (and this number continues to grow!).

Medical data has incredible potential: researchers and developers in medical AI benefit from high-quality, standardized data to train, test and validate their algorithms. However, before this data can be shared and used collaboratively, patient privacy and confidentiality must be protected.

To achieve this, the data must be effectively de-identified.

What is data de-identification?

Organizations that use patient data for research must prevent the exposure of protected health information (PHI). HIPAA (Health Insurance Portability and Accountability Act) mandates the removal of any information that could potentially lead to the re-identification of a patient.

This can be done by redacting or masking specific categories of identifiers in patient data. For the data Segmed works with, that means redacting and masking information in radiology reports and the associated DICOMs (both the metadata and the images themselves).
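As a rough illustration of what masking identifiers in report text can look like, here is a minimal rule-based sketch. The patterns and the `mask_identifiers` helper are hypothetical and chosen only for this example; they are not Segmed's pipeline, and real reports need far broader coverage than three regular expressions.

```python
import re

# Hypothetical patterns for illustration only (assumed formats, not a complete PHI list).
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def mask_identifiers(text: str) -> str:
    """Replace every match of each pattern with a bracketed category label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


report = "MRN: 1234567. Seen on 03/14/2021. Callback 555-867-5309."
masked = mask_identifiers(report)
print(masked)
```

Replacing each match with a category label rather than deleting it keeps the report readable while still removing the underlying value.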

Direct identifiers must be removed from the data entirely. As the term suggests, these are specific data points that directly identify the patient or individual, such as patient name, phone number and medical record number (MRN). Indirect identifiers must also be removed. These include variables like patient sex, date of birth, hospital and postal/ZIP code. Though none of these identifies a patient on its own, the unique combination of multiple indirect identifiers could potentially lead to the re-identification of an individual.
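The point about combinations can be made concrete with a small sketch. The records, field names and `unique_combinations` helper below are made up for illustration: no single field singles anyone out, yet the combination of all three does.

```python
from collections import Counter

# Toy records (fabricated for illustration); each value appears twice per field.
records = [
    {"sex": "F", "birth_year": 1950, "zip3": "941"},
    {"sex": "F", "birth_year": 1987, "zip3": "100"},
    {"sex": "M", "birth_year": 1950, "zip3": "100"},
    {"sex": "M", "birth_year": 1987, "zip3": "941"},
]


def unique_combinations(rows, keys):
    """Return the value combinations of `keys` that occur in exactly one row."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


# Each field alone is shared by two records, so nothing is unique.
print(unique_combinations(records, ["sex"]))
print(unique_combinations(records, ["zip3"]))

# Combined, every record becomes unique and therefore potentially re-identifiable.
print(unique_combinations(records, ["sex", "birth_year", "zip3"]))
```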

Ultimately, effective data de-identification makes it difficult to establish any links between the original patient and the data.

What is Segmed’s LLM de-id playground?

Over the past three years, Segmed has developed in-house technology to effectively de-identify both our radiology reports and the associated DICOMs.

With the uptick in popularity of large language models (LLMs) over the past several months, we wanted to put LLMs to the test for one of our use cases - medical report de-identification. The result was our de-id playground, which we are thrilled to share with you.

This prototype uses OpenAI’s davinci model to identify and then redact potential protected health information (PHI) or personally identifiable information (PII) in the provided input. The output is a de-identified version of the original input in which both direct and indirect identifiers, like those mentioned above, have been redacted. This ensures that the resulting output is HIPAA compliant.
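To give a feel for how prompt-based redaction can be wired up, here is a minimal sketch. The prompt wording, the `deidentify` function and the `complete` callable are all assumptions made for this example; the playground's actual prompt and API calls are not described in this post. The model call is passed in as a plain function so the sketch runs without network access, and `fake_complete` is a stand-in that returns a canned response.

```python
from typing import Callable

# Assumed prompt wording, for illustration only.
PROMPT_TEMPLATE = (
    "Redact all protected health information (PHI) and personally "
    "identifiable information (PII) from the radiology report below. "
    "Replace each identifier with [REDACTED] and change nothing else.\n\n"
    "Report:\n{report}\n\nDe-identified report:\n"
)


def deidentify(report: str, complete: Callable[[str], str]) -> str:
    """Build the prompt, send it to the supplied completion function, tidy the reply."""
    prompt = PROMPT_TEMPLATE.format(report=report)
    return complete(prompt).strip()


def fake_complete(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a fixed, pre-written completion."""
    return " Patient [REDACTED], seen on [REDACTED], has no acute findings.\n"


original = "Patient Jane Doe, seen on 03/14/2021, has no acute findings."
print(deidentify(original, fake_complete))
```

In practice `complete` would wrap a call to whichever model is in use, and its output would still need checking, since a language model can miss identifiers or alter clinical content.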

We encourage you to test and play around with this tool. No input or output data is stored or saved. Feedback is welcome: let us know what you think at info@segmed.ai.

Future iterations of this prototype will allow users to conduct batch operations (e.g. uploading a CSV of text data they would like de-identified), as well as perform more customized data redaction.
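A batch operation over a CSV could look something like the sketch below. This is not the planned feature itself; the `deidentify_csv` helper, the column names and the toy `redact` function are hypothetical, and any redaction function (rule-based or LLM-backed) could be passed in.

```python
import csv
import io
from typing import Callable, TextIO


def deidentify_csv(src: TextIO, dst: TextIO, column: str,
                   redact: Callable[[str], str]) -> None:
    """Copy a CSV from src to dst, applying `redact` to one text column."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = redact(row[column])
        writer.writerow(row)


# Toy input and a toy redaction function, both made up for this example.
source = io.StringIO("id,report\n1,Patient Jane Doe\n2,No findings\n")
result = io.StringIO()
deidentify_csv(source, result, "report",
               lambda text: text.replace("Jane Doe", "[REDACTED]"))
print(result.getvalue())
```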

Closing thoughts:

The application of LLMs for medical report de-identification is a unique one, and we at Segmed are excited to further incorporate these tools and techniques into our existing de-id toolset.

However, it is important to recognize that training and applying LLMs involves nuance, and there are many use cases where LLMs alone are likely not sufficient. LLMs are now one of the many components in Segmed’s combined approach to data processing.

Scaling and fine-tuning our data processing and de-identification will allow us to accelerate collaboration with innovators in medical AI.

Questions? Concerns? Areas for improvement? Get in touch with us at info@segmed.ai.
