Are Artificial Intelligence Models Ready for Clinical Use?

Image courtesy of Wikimedia Commons.

As machine learning algorithms grow more accurate, interest in their clinical applications has surged: these models can learn from massive amounts of data and recognize patterns that humans cannot. Often, they perform as well as or better than experienced clinicians. However, in a study published in npj Digital Medicine, a group of researchers from UCSF caution that such algorithms may not yet be ready for the clinic and stress the importance of rigorous testing before artificial intelligence models are used in real-world care.

The team of researchers, led by Michael Keiser, a professor at the Institute for Neurodegenerative Diseases, and Maria Wei, a professor and dermatologist at the UCSF School of Medicine, developed and trained a machine learning model to identify melanomas, then exposed the model to a variety of stress tests designed to mimic real-world conditions. For Albert Young, a medical student at UCSF and the first author of the paper, the project was the culmination of ideas he had developed from reading the computer science literature and talking with members of the lab. “[We wanted] to evaluate these tools in ways to gauge how clinic-ready they are and what the barriers are to actually using them to help patients, and I think that’s where the project came about,” Young said.

While the model matched or exceeded dermatologist-level classification on conventionally reported metrics with curated datasets, it performed worse than dermatologists on non-curated datasets, which included lower-quality images that had been excluded from the curated sets. The group also found that the model was far less robust on images of the same lesion captured under different conditions (such as angles, magnifications, or imaging tools), and after image augmentations such as rotations or changes in brightness and contrast, often yielding inconsistent predictions for what should be the same diagnosis.
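The study's exact protocol is not reproduced here, but the core idea of this kind of consistency check can be sketched in a few lines. In the toy example below, `predict` is a deliberately simplistic stand-in classifier (a brightness threshold, not the study's neural network), and the augmentations and threshold values are illustrative assumptions:

```python
import numpy as np

def predict(image):
    # Hypothetical stand-in classifier: calls a lesion "melanoma" when
    # mean pixel intensity is below a threshold (i.e., a darker lesion).
    # A real stress test would use the trained diagnostic model here.
    return "melanoma" if image.mean() < 0.5 else "benign"

def augmentations(image):
    """Yield simple perturbations of the kind used in robustness
    stress tests: rotations and brightness changes."""
    for k in (1, 2, 3):                       # 90/180/270-degree rotations
        yield np.rot90(image, k)
    for factor in (0.8, 1.2):                 # brightness scaling
        yield np.clip(image * factor, 0.0, 1.0)

def flip_rate(image):
    """Fraction of augmented copies whose predicted label disagrees
    with the prediction on the original image."""
    baseline = predict(image)
    augmented = list(augmentations(image))
    flips = sum(predict(a) != baseline for a in augmented)
    return flips / len(augmented)

rng = np.random.default_rng(0)
image = rng.random((64, 64))                  # stand-in for a lesion photo
print(flip_rate(image))                       # 0.0 would mean fully consistent
```

A robust model should produce the same diagnosis for every augmented copy of the same lesion, so a nonzero flip rate is a warning sign even when the model's headline accuracy on a curated dataset looks strong.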

The study highlights the importance of rigorously testing AI algorithms before they are adopted for clinical use. “There have been no clinical trials that I know about or real-world applications so far; they’ve all been done with datasets that have been retrospectively collected, and a lot of them have been curated,” Young said. “The very challenging conditions one might encounter in real life [should] also be replicated when testing these algorithms.” 

Young also stresses the magnifying effect AI models can have on healthcare disparities once they are integrated into clinical settings. “One can imagine that if you were to implement AI tools without really thinking about who might be using them, then those who have the most access to technology—the most privileged in the healthcare system—are going to benefit more, and even if this is a good tool, it’ll only serve to widen existing healthcare disparities.”

Despite the challenges of incorporating AI models into healthcare pipelines and clinical settings, Young remains excited about the future of AI. He hopes to continue to stay involved in AI, especially as he looks beyond his final year of medical school towards residency. “I see myself as being a bridge between physicians who will be using these [AI tools] and the people who are developing them, asking the right questions as to how they were developed, what kind of biases might they have, as well as the technical and practical challenges of actually getting them to patients.”


Young, A.T., Fernandez, K., Pfau, J. et al. Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models. npj Digit. Med. 4, 10 (2021). 

Esteva, A., Kuprel, B., Novoa, R. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).