A recent JAMA study investigates whether systematically biased artificial intelligence (AI) affects clinicians’ diagnostic accuracy and whether image-based AI model explanations can mitigate the effects of model errors.
Background
The capacity of AI to identify anomalies or diseases in clinical images has been discussed in previous studies. These studies revealed that AI-based tools can detect diabetic retinopathy from fundus images, pneumonia from chest radiographs, and skin cancer from histopathology images.
Compared to clinicians’ diagnoses without AI, integrating AI models into clinical decision-making could result in more accurate diagnoses. However, systematically biased AI models that consistently misdiagnose certain patients can cause harm. For example, one previous study reported that an AI model consistently underdiagnosed heart disease in female patients.
Ideally, clinicians would follow AI predictions when they are correct and ignore them when they are not. Over-reliance on biased AI models will affect a clinician’s diagnosis, so it is important to understand whether using AI to support diagnostic decisions is safe.
Incorporating AI explanations that interpret model predictions could help clinicians understand a model’s logic before relying on it for clinical decisions. Suitable explanations could also help clinicians recognize when a model is systematically biased. For example, image-based explanations that models provide to support their decisions can help clinicians assess whether a prediction is accurate.
About the study
The current study assessed whether AI explanations increased clinicians’ diagnostic accuracy and reduced the harm caused by systematically biased models. To this end, a randomized, web-based clinical vignette survey was conducted.
Hospitalist physicians, physician assistants, and nurse practitioners who care for patients with acute respiratory failure were recruited. A total of 45 clinical vignettes were created based on patients hospitalized with acute respiratory failure in 2017.
Each patient’s medical record was examined to understand their past and present medical history, and at least four pulmonary physicians independently reviewed each record.
All participants followed the same vignette order: two vignettes without AI predictions, six vignettes with AI predictions, and one vignette with a clinical consultation. For the six AI vignettes, participants were randomized to see AI predictions either with or without AI explanations.
After each vignette, participants were asked to rate, on a scale of 0 to 100, the likelihood that heart failure, pneumonia, or chronic obstructive pulmonary disease (COPD) was contributing to the patient’s respiratory failure. These responses were collected on a continuous scale to calculate the association between AI model scores and participant responses.
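To make the idea of an association between AI model scores and participant responses concrete, the following is a minimal sketch that assumes the association is measured with a Pearson correlation coefficient. The article does not name the statistic the study used. The function name `pearson` and all numbers below are hypothetical and chosen purely for illustration; they are not data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers.

    Assumption: Pearson's r is used only as one possible measure of
    association; the article does not state which measure the study used.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical 0-100 scores for five vignettes (made up, not study data).
ai_scores = [10, 35, 50, 80, 95]
participant_responses = [20, 30, 55, 70, 90]

r = pearson(ai_scores, participant_responses)
print(f"Correlation between AI scores and responses: {r:.2f}")
```

A value of r near 1 would suggest that participants’ likelihood ratings closely track the model’s scores, while a value near 0 would suggest that the ratings are largely independent of them.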
Study findings
Of 1,024 participants, 457 completed at least one vignette and were included in the primary analysis; of these, 418 completed all nine vignettes. Importantly, participant demographics did not differ significantly across the randomized groups.
Most participants specialized in hospital medicine. The mean age of the participants was 34 years, and around 58% of the cohort was female. Only 13% of the cohort reported prior interaction with clinical decision support tools, and the majority were unaware that AI models can be systematically biased.
The study findings indicate that clinicians’ diagnostic accuracy improved when the AI model’s predictions were accurate and decreased when the predictions were systematically biased. AI explanations did not significantly mitigate the harmful effect of biased models on clinicians’ diagnostic accuracy, which was estimated to be about 81%. A combination of AI models and clinicians could therefore be effective for complex diagnostic tasks only when the model’s predictions are accurate.
The explanations could not mitigate the errors even though the biased AI model relied entirely on features unrelated to the clinical condition. However, some studies have suggested that state-of-the-art explanations could improve user-AI interactions. The findings point to a greater possibility of users being deceived by flawed AI models, mainly because many participants failed to understand even the models’ simple explanations.
Because of their limited AI literacy, clinicians failed to comprehend and consider the models’ explanations. In addition, these models must be trained to provide better image-based explanations.
Conclusions
Although accurate AI predictions and explanations enhanced diagnostic accuracy, systematically biased predictions reduced it, and explanations did not offset this harm. Thus, it is essential to validate AI models before they are incorporated into clinical settings.