Just like humans, artificial intelligence (AI) machine learning models are subject to bias. Understanding the nature of AI bias is especially important for high-stakes applications that can affect life-and-death decisions, such as in medicine and healthcare. New peer-reviewed research published in Nature Medicine not only shows that AI medical imaging models that excel at predicting race and gender show larger disparities in diagnostic accuracy across those demographic groups, but also provides best practices for addressing this disparity.
“Our results highlight the need for regular evaluation of model performance under changing distributions, challenging the popular view that a single model is fair across different contexts,” wrote lead author Marzyeh Ghassemi, Ph.D., associate professor of electrical engineering and computer science at the Massachusetts Institute of Technology (MIT), in collaboration with Dina Katabi, Ph.D., professor of computer science and electrical engineering at MIT, MIT CSAIL graduate student Yuzhe Yang, MIT graduate student Haoran Zhang, and Judy Gichoya, Ph.D., associate professor of radiology at Emory University School of Medicine.
The use of AI machine learning in medical imaging is growing. The global AI in medical imaging market is expected to reach USD 8.18 billion by 2030, growing at a compound annual growth rate (CAGR) of 34.8% during 2023-2030, according to an industry report by Grand View Research. Neurology was the largest application segment, with a 38.3% share, and North America accounted for a 44% revenue share in 2022, according to the same report. Examples of companies in the field of AI medical imaging include IBM, GE Healthcare, Aidoc, Arterys, Enlitic, iCAD Inc., Caption Health, Gleamer, Fujifilm Holdings Corporation, Butterfly Network, AZmed, Siemens Healthineers, Koninklijke Philips, Agfa-Gevaert Group/Agfa HealthCare, Imagia Cybernetics Inc., Lunit, ContextVision AB, Blackford Analysis, and others.
The rise of AI in medical imaging requires maximizing accuracy and minimizing bias. The term bias encompasses partiality, prejudice, preference, and systematically erroneous patterns of thought. In humans, biases can be conscious or unconscious. There are many human biases; examples include stereotyping, the bandwagon effect, the placebo effect, confirmation bias, the halo effect, optimism bias, hindsight bias, anchoring bias, the availability heuristic, survivorship bias, familiarity bias, gender bias, the gambler’s fallacy, group attribution error, self-attribution bias, and many others.
AI bias impacts overall performance accuracy. In AI machine learning, algorithms learn from massive amounts of training data rather than explicitly hard-coded instructions. Several factors impact the resilience of AI models to bias. These include not only the quantity of training data, but also the quality, which is affected by the level of objectivity of the data, the structure of the data itself, data collection practices, and data sources.
Additionally, AI models can be vulnerable to cognitive biases inherent in the humans who create the algorithms, the weightings assigned to data points, and the absence or inclusion of indicators. For example, in February 2024, Google paused its Gemini chatbot’s (formerly Bard) ability to generate images of people after users complained that it was producing historically inaccurate images. Gemini tended to favor generating images of non-white people and often depicted figures as the wrong race and/or gender. Before the pause, for instance, Gemini mistakenly depicted George Washington and the Vikings as Black men. In a company blog post, Google attributed the problem to the fine-tuning of its image generator, Imagen 2, which ultimately caused the AI model to overcorrect.
“We confirm that medical imaging AI exploits demographic shortcuts in disease classification,” the MIT researchers reported.
This was not surprising, given that two years earlier, Ghassemi, Gichoya, and Zhang were among the co-authors of a separate study from MIT and Harvard Medical School published in The Lancet Digital Health. That 2022 study showed that AI deep learning models can predict a person’s self-reported race from medical image pixel data alone with a high degree of accuracy. However, humans do not understand how AI does this, highlighting the need to mitigate risks through regular audits and evaluations of medical AI.
For the current study, the scientists sought to determine whether AI-based disease classification models use demographic information as a heuristic and whether these shortcuts lead to biased predictions. The AI models were trained to predict whether a patient had one of three conditions: a collapsed lung, an enlarged heart, or fluid buildup in the lungs.
The AI models were trained on MIMIC-CXR, a large public dataset of chest X-rays from Beth Israel Deaconess Medical Center in Boston, and then evaluated on a composite out-of-distribution dataset consisting of data from CheXpert, NIH, SIIM, PadChest, and VinDr. In data science, out-of-distribution (OOD) data is new data that the AI model has not been trained on and is therefore considered “unseen,” as opposed to in-distribution (ID) data that the AI model has previously “seen” during training. In total, the researchers used more than 854,000 chest X-rays from radiology datasets spanning three continents, 6,800 ophthalmology images, and more than 32,000 dermatology images.
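To make the in-distribution versus out-of-distribution distinction concrete, here is a minimal, purely illustrative Python sketch. It uses synthetic feature vectors as stand-ins for chest X-ray data (the actual study trains deep networks on MIMIC-CXR and tests on external datasets such as CheXpert and PadChest), and it plants a “shortcut” feature, analogous to a demographic cue, that predicts disease at the training site but not at the external site. All names and numbers below are invented for illustration.

```python
# Minimal synthetic sketch of in-distribution (ID) vs. out-of-distribution (OOD)
# evaluation. Random feature vectors stand in for chest X-ray representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_site(n, shortcut_strength):
    """Simulate one hospital's data: 5 genuine disease features plus one
    'shortcut' feature (think: a demographic cue) whose correlation with the
    disease label depends on the site."""
    y = rng.integers(0, 2, size=n)                       # disease label
    signal = rng.normal(size=(n, 5)) + 0.8 * y[:, None]  # real pathology signal
    shortcut = shortcut_strength * (2 * y - 1) + rng.normal(size=n)
    return np.column_stack([signal, shortcut]), y

# Training hospital: shortcut is strongly predictive. External hospital: it is not.
X_train, y_train = make_site(5000, shortcut_strength=2.0)
X_id, y_id = make_site(2000, shortcut_strength=2.0)    # in-distribution test set
X_ood, y_ood = make_site(2000, shortcut_strength=0.0)  # out-of-distribution test set

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("ID  AUC:", round(roc_auc_score(y_id, model.predict_proba(X_id)[:, 1]), 3))
print("OOD AUC:", round(roc_auc_score(y_ood, model.predict_proba(X_ood)[:, 1]), 3))
# The ID score looks excellent while the OOD score drops sharply, because the
# model leaned on a shortcut that does not generalize to the external site.
```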
The AI models performed well overall in their predictions but showed disparities in accuracy across gender and race. Notably, the models that performed best at predicting demographics showed the greatest disparities in diagnostic accuracy across gender and racial groups.
The team then investigated how effectively state-of-the-art debiasing techniques can mitigate these shortcuts to create less biased AI models. They found that these methods could reduce the disparities, but they were most effective when the model was evaluated on the same type of patients it was originally trained on, or in other words, on in-distribution data.
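The study evaluates several state-of-the-art debiasing methods, whose details differ; one simple, widely used member of this general family is group-balanced training, in which samples are re-weighted so that no combination of demographic group and label dominates. The sketch below, using made-up placeholder data, illustrates only that general idea and is not the authors’ method.

```python
# Toy sketch of group-balanced training via sample re-weighting.
import numpy as np
from sklearn.linear_model import LogisticRegression

def group_balanced_weights(groups, labels):
    """Weight each sample by the inverse frequency of its (group, label) cell,
    so no demographic/label combination dominates training."""
    weights = np.zeros(len(labels), dtype=float)
    for g in np.unique(groups):
        for y in np.unique(labels):
            mask = (groups == g) & (labels == y)
            if mask.any():
                weights[mask] = 1.0 / mask.sum()
    return weights * len(labels) / weights.sum()  # normalize to mean weight of 1

# Placeholder arrays: X (image features), y (diagnosis), g (e.g., self-reported race).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)
g = rng.choice(["group_a", "group_b"], size=1000, p=[0.85, 0.15])  # imbalanced groups

w = group_balanced_weights(g, y)
model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
```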
In a real-world clinical setting, it’s not uncommon for AI models to be trained on data from another hospital. As a best practice, the study suggests that hospitals that use AI models developed on external, out-of-distribution data should carefully evaluate the algorithms on their own patient data to understand how accurately they perform across demographics such as race and gender.
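As a rough illustration of what such a local audit could look like, the following sketch scores a model’s outputs on a hospital’s own labeled data, stratified by a demographic attribute, and reports the per-group AUC along with the largest between-group gap. The data, model scores, and choice of metric here are placeholders, not the study’s protocol.

```python
# Sketch of a local, subgroup-stratified audit of an externally developed model.
import numpy as np
from sklearn.metrics import roc_auc_score

def audit_by_group(y_true, y_score, groups):
    """Return AUC per demographic group and the largest between-group gap."""
    per_group = {}
    for g in np.unique(groups):
        mask = groups == g
        per_group[str(g)] = roc_auc_score(y_true[mask], y_score[mask])
    gap = max(per_group.values()) - min(per_group.values())
    return per_group, gap

# Placeholder "local hospital" data: true diagnoses, model scores, and demographics.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.3, size=500), 0, 1)  # fake model output
groups = rng.choice(["female", "male"], size=500)

per_group_auc, fairness_gap = audit_by_group(y_true, y_score, groups)
print(per_group_auc, "gap:", round(fairness_gap, 3))
```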
“This calls into question the effectiveness of developer assurances about model fairness at test time and highlights the need for regulators to consider monitoring of real-world performance, including fairness degradation,” the researchers concluded.
Copyright © 2024 Cami Rosso. All rights reserved.