Webinar Recap

Vision Language Foundation Model for Chest X-Ray Generation

9 min
Nathan Huff


This article explores the Vision Language Foundation model, a model that is transforming the way chest X-rays are generated in medical research. Learn more about how it works as well as its benefits and drawbacks.


With the evolution of artificial intelligence and image generation, the applications of these models have been a point of contention. Generative models such as Stable Diffusion can produce realistic images whose content can be controlled, and as such, their potential applications have drawn the attention of healthcare professionals.

It is currently unclear how many of the medical concepts and image features found in healthcare applications are incorporated in these models. For instance, if you use Stable Diffusion with no modifications, you get very artificial-looking images, more akin to stock photos or fabricated visualizations. Before thinking about how to adapt these models to produce more useful outputs, we should also consider why someone would want to do that in the first place.

Overview of Vision Language Foundation Models (VLF)

The main characteristic that helps to define these foundation models is their capacity to process images and natural language text. This process is reliant on the inputs, outputs, and the task these models are asked to perform.

A vision-language model usually comprises three key elements: a text encoder, an image encoder, and a strategy to integrate information from them both. These key elements are tightly coupled together as the loss functions are designed around both the model architecture and the learning strategy.
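One common way to tie a text encoder and an image encoder together is a contrastive objective of the kind used by CLIP-style models. The sketch below illustrates that idea with hand-written toy embeddings in place of real encoder outputs. The function names and the temperature value are assumptions made for illustration, and it is not the training code of any specific model discussed in the talk.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_row(logits, target_index):
    # Numerically stable -log(softmax(logits)[target_index]).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # temperature=0.07 is an assumed illustrative value.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Each image should pick out its own caption, and each caption its own image.
    img_to_txt = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    txt_to_img = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (img_to_txt + txt_to_img) / 2

# Toy embeddings: pair k of images and texts is meant to match.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(matched < shuffled)
```

The loss is low when each image embedding is closest to its own caption and high when the pairs are mismatched, which is what couples the two encoders during training.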

The main components of the Stable Diffusion model are the variational autoencoder (a probabilistic generative model built on neural networks), the denoising U-Net, and the conditioning methods. The variational autoencoder compresses the input into a smaller latent representation, and the denoising U-Net generates and modifies the image contents based on this representation. The conditioning methods, such as text conditioning, play a role in steering the generative process.
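To give a sense of how much the autoencoder compresses an image, here is a small arithmetic sketch. It assumes the commonly published Stable Diffusion v1 configuration (8x spatial downsampling and 4 latent channels); the article does not state which version was used, so treat these defaults as assumptions.

```python
def latent_shape(height, width, downsample_factor=8, latent_channels=4):
    # Assumed defaults: 8x spatial downsampling, 4 latent channels.
    if height % downsample_factor or width % downsample_factor:
        raise ValueError("image size must be divisible by the downsample factor")
    return (latent_channels, height // downsample_factor, width // downsample_factor)

def compression_ratio(height, width, image_channels=3, **latent_kwargs):
    # Ratio of values in the pixel image to values in the latent representation.
    c, h, w = latent_shape(height, width, **latent_kwargs)
    return (image_channels * height * width) / (c * h * w)

shape = latent_shape(512, 512)
ratio = compression_ratio(512, 512)
print(shape, ratio)  # (4, 64, 64) 48.0
```

Under these assumptions a 512 x 512 RGB image is represented by 48 times fewer values in latent space, which helps explain why fine detail such as small text annotations can be lost in reconstruction.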

Chest X-Ray Image Generation 


An investigation into how well the compression step preserves medically relevant radiological features showed that the reconstruction step of the variational autoencoder successfully recreates the main image features but struggles to preserve high-fidelity text annotations, an important part of medical research and image analysis. Experiments with fine-tuning different components of the model, using in-context learning and a small number of image-text pairs for training, showed that fine-tuning both the U-Net and the text encoder led to better classification performance and more realistic-looking images.

Scaling up the effort by using tens of thousands of image-text pairs and jointly fine-tuning the U-Net and text encoder components further improves the results. The generated chest X-ray images become highly realistic, with diverse features and variations based on a given prompt.

The qualities of a generative model, such as fidelity, diversity, and authenticity, can be evaluated through visual analysis and quantitative metrics. The resulting model should display high fidelity to real data, diverse outputs, and mixed features representation. The approach also allowed for using different models, such as a text-to-image model and an image-to-text model, for further evaluation and comparison. 

Benefits and Advantages of the Vision Language Foundation Model

One application scenario that immediately comes to mind is data augmentation. Data scarcity has been a long-standing problem in the field of deep learning; getting new data is resource-intensive, and existing datasets might suffer from various biases. Data augmentation with synthetic data could remedy this problem; it would provide reliable data that has no real human identifiers, thus carrying no protected health information (PHI).

Another potential use is in the case of rare diseases. Rarer diseases tend to have less data because they occur so infrequently; rare interstitial lung diseases, for example, pose a challenge due to their low prevalence in the general population. Having a way to generate more data representative of such cases would be beneficial, provided there is enough data to train on for each disease.
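As a rough illustration of how synthetic images could be used to balance a dataset, the sketch below counts how many generated images each class would need to reach a common target size. The labels, counts, and function name are made up for this example and do not come from the talk.

```python
from collections import Counter

def synthetic_images_needed(labels, target_per_class=None):
    # Count real examples per class, then compute the shortfall to the target.
    counts = Counter(labels)
    if target_per_class is None:
        # Default assumption: bring every class up to the largest class.
        target_per_class = max(counts.values())
    return {label: max(0, target_per_class - n) for label, n in counts.items()}

# Hypothetical label list for a small, imbalanced dataset.
labels = (
    ["pneumonia"] * 5
    + ["no finding"] * 8
    + ["interstitial lung disease"] * 1
)
plan = synthetic_images_needed(labels)
print(plan)
```

The rarest class needs the most synthetic images, which is also where the generative model has the least real data to learn from, so any such plan should be checked against how well the model actually represents that class.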

Additionally, synthetic data can be used to preserve patient privacy. Currently, data sets must be meticulously rid of any identifying PHI. By sharing generative models or synthetic data between institutions instead of actual patient data, strategies like federated learning can be utilized. This also massively reduces the risk of any patient data not being identified and censored, as the data will not have any true PHI included.

Explainability is another application scenario. Steering image output through text prompts allows for direct control and evaluation of the generated image's alignment with the input. By purposely crafting a prompt with the required information, the user can obtain specifically the image that is needed, saving both time and effort.

Limitations/Challenges of Vision-Language Foundation Models

Limited data availability

The primary issue with this system is the current lack of data. As it stands, it is not easy to find data on rarer diseases, and sourcing the data is an expensive process; training these models to give accurate results therefore takes significant time, effort, cost, and expertise. Data bias is also still a factor: if the model is trained on biased data, it may replicate those biases. Varied data sets are therefore essential, and they will be harder to assemble for rarer conditions.

Computational complexity

The term computational complexity refers to the resources required to use and maintain a system like the VLF model. Generating large amounts of data requires not only ample time but also computing power and know-how. This may deter adoption of synthetic data generation in some cases, essentially limiting the usage of this model to data companies due to their focus on data and ability to distribute it.

Not a “true” Chest X-Ray

One of the limitations to consider is that the generated images are not true chest X-rays and lack certain characteristics present in real data, such as windowing for better contrast. As such, many of the tools a radiologist may have at hand to observe different areas of an image are unavailable with the VLF model. This would make the images somewhat less useful for training purposes, as a student would not have the same tools a radiologist has and could not get a 1:1 experience.


Especially in medical imaging, where privacy is a main concern, models that do not reproduce their initial training data are highly desired. Unfortunately, several cases of training data (with PHI) being 'generated' have been documented, raising concerns over the potential leak of PHI from models such as these. If the model cannot guarantee that no PHI is released, it stands to reason that it will not be implemented officially for general use.
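One simple way to screen for this kind of memorization is to compare each generated image against the training set and flag near-duplicates. The sketch below does this with plain pixel-space distance on tiny toy "images" (flat lists of intensities). It is an illustrative assumption, not the method used in the talk; the threshold is arbitrary, and pixel distance is a crude proxy that would miss copies that are shifted, cropped, or rescaled.

```python
import math

def l2_distance(a, b):
    # Euclidean distance between two equally sized flat pixel lists.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flag_possible_copies(generated, training, threshold):
    # For each generated image, find its nearest training image and flag the
    # pair if the distance is at or below the (arbitrary) threshold.
    flagged = []
    for g_idx, g in enumerate(generated):
        nearest_idx, nearest_dist = min(
            ((t_idx, l2_distance(g, t)) for t_idx, t in enumerate(training)),
            key=lambda pair: pair[1],
        )
        if nearest_dist <= threshold:
            flagged.append((g_idx, nearest_idx, nearest_dist))
    return flagged

# Toy data: the first generated image is almost identical to training image 0.
training = [[0.0, 0.5, 1.0, 0.5], [1.0, 1.0, 0.0, 0.0]]
generated = [[0.0, 0.5, 1.0, 0.49], [0.3, 0.2, 0.6, 0.9]]
flags = flag_possible_copies(generated, training, threshold=0.05)
print(flags)
```

A check like this can only raise suspicion about individual outputs; it cannot prove that a model never leaks training data.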

Limited Language Context (CLIP token length)

Another, more technical, limitation is that users are currently restricted to short text prompts because of the CLIP token length. The limit is 77 tokens, so a single generation prompt cannot exceed 77 tokens, which usually corresponds to fewer than 77 words. There are possible ways around this limit, but for the moment the focus has to be on reports with short text descriptions rather than very complex ones.
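The effect of the limit can be sketched as follows. CLIP's text encoder has a context length of 77 tokens, and that budget includes the start and end markers, leaving 75 positions for content. The tokenizer below is a crude stand-in that splits on words and punctuation; it is an assumption for illustration only, and the real byte-pair tokenizer typically produces more tokens for the same text, especially for medical vocabulary.

```python
import re

CLIP_CONTEXT_LENGTH = 77  # total positions, including start and end markers
SPECIAL_TOKENS = 2        # start-of-text and end-of-text

def rough_tokenize(prompt):
    # Hypothetical approximation: one token per word or punctuation mark.
    # The real CLIP tokenizer uses byte-pair encoding and may split words further.
    return re.findall(r"\w+|[^\w\s]", prompt.lower())

def fit_to_context(prompt, context_length=CLIP_CONTEXT_LENGTH):
    # Keep only as many content tokens as fit, and report how many were cut.
    tokens = rough_tokenize(prompt)
    budget = context_length - SPECIAL_TOKENS
    return tokens[:budget], max(0, len(tokens) - budget)

short_kept, short_dropped = fit_to_context("Small right-sided pleural effusion.")
long_kept, long_dropped = fit_to_context("opacity " * 100)
print(len(short_kept), short_dropped)  # 7 0
print(len(long_kept), long_dropped)    # 75 25
```

A short impression-style sentence fits comfortably, while a long report would be silently cut off, which is why only brief descriptions are practical as prompts.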

Limited to 2D

The whole pre-trained model is based on two-dimensional data, whereas a lot of medical data is three-dimensional. For example, CT scans and MRIs are three-dimensional volumetric data, and currently cannot be used in training data or be generated by the model. This limits the usage of the model in general use; only flat 2D scans are viable and thus the model isn't universally applicable to all scans.

Are there any real benefits to synthetic data?

Finally, it still needs to be investigated what the real benefits of synthetic data are. As previously mentioned, there are different application scenarios where one could potentially use synthetic data, but in an academic setting (such as image analysis), this must be thoroughly evaluated. If the model generates inconsistent medical images or anatomically impossible scans, training on them becomes pointless and spreads misinformation. There are also concerns over the ownership of generated images: the model not only draws on real data but also replicates it to some degree. Questions over the anonymity of this process will continue to be a sticking point in its application.


Implications of Results

So, what do these results mean? According to Christian Bluethgen, MD, MSc, radiologists have responded well to this work, which is a strong signal to keep pushing in that direction. Continued growth of the project will require collaboration and constructive discussion. This is part of the reason why the weights of the trained model were released, with the caveat that access is currently only possible for investigators who are credentialed to use the MIMIC training data, for the data privacy reasons mentioned earlier.

Ability to synthesize high-quality images

This model does have the ability to synthesize high-quality chest X-ray images, and its medical image interpretation is at a good standard. This is a major positive; the ability to construct data exactly for your needs is always a welcome tool. With this tool, we can create 2D images from prompts and apply them.

Synthetic data may alleviate true data needs

Preliminary data shows that synthetic data and image synthesis may alleviate some data needs, but this still needs to be properly evaluated. In theory, it addresses the issue of data availability by offering an abundance of data, generated on demand and effectively unlimited, that can serve as training data for other machine learning applications in radiology.

Technical and ethical issues remain

Finally, the ethical and technical dilemmas remain. Ethically, there are still grey areas around training on real patient data, given its potential to be 'generated' by the model and expose sensitive information to the user. This is also a technical issue: the model should not allow that information to be released, so it must be amended before any wide-scale release, although it is still early in development. Another technical issue is that operators may need working knowledge of the model, meaning training could be necessary to maintain it. General knowledge of the system may also be important for any user to get the best results.


This is a summary of a virtual webinar that took place on Thursday February 23, 2023. These webinars are a part of the Bytes of Innovation series, hosted by Segmed, which aims to host prominent and renowned researchers, physicians, and investors speaking about up-and-coming artificial intelligence in healthcare. Our guest was Christian Bluethgen, MD MSc - postdoctoral researcher at the Stanford AIMI Center and current clinical scientist / attending radiologist at USZ. You can find the talk at this link.
