Key considerations for dermatologists as AI continues to expand.
Although discussion of artificial intelligence (AI) in medicine has been ongoing for years, and the technology is making its way into the clinic, there is still much to learn. In fact, given that a hallmark of machine learning is that it can be adaptive, learning will be ongoing for both clinicians and the software as it becomes integrated into clinical practice.
THE BOTTOM LINE
AI generally refers to artificial intelligence, but the notion of augmented intelligence reflects the reality that new technologies and evolving software solutions are not intended to replace clinicians. Rather they are intended to support clinicians by offering additional data and information. Dermatologists and patients require education on the potential benefits of AI-based interventions. While software designed for clinical use shows promise, there are significant concerns about consumer-facing apps. Attention to patient diversity during the development of AI for skin cancer diagnosis can help to alleviate existing disparities in skin cancer.
Artificial Intelligence (AI) is a broad term used to describe programs and devices that are designed to “think.” The most popularly cited “definition” of AI is likely that from Stanford University’s John McCarthy1:
“It is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable.”
Under the broad umbrella of AI is the concept of Machine Learning (ML), an AI technique that, according to FDA, can be used to train software algorithms to learn from and act on data. Algorithms may be “locked,” so that their function does not change, or “adaptive,” meaning that their behavior can change over time based on new data.
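The difference between locked and adaptive algorithms can be illustrated with a toy sketch. The classes, the score scale, the starting threshold, and the update rule below are all hypothetical and chosen only for illustration. They do not describe how any cleared or approved device works.

```python
class LockedClassifier:
    """Flags a lesion when its risk score meets a fixed threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, score):
        return score >= self.threshold

    def observe(self, score, is_malignant):
        # Locked: confirmed outcomes never change future behavior.
        pass


class AdaptiveClassifier(LockedClassifier):
    """Same rule, but the threshold shifts after confirmed errors."""

    def __init__(self, threshold, step=0.1):
        super().__init__(threshold)
        self.step = step

    def observe(self, score, is_malignant):
        flagged = self.predict(score)
        if is_malignant and not flagged:
            self.threshold -= self.step  # missed cancer: become more cautious
        elif flagged and not is_malignant:
            self.threshold += self.step  # false alarm: become less cautious


locked = LockedClassifier(threshold=0.5)
adaptive = AdaptiveClassifier(threshold=0.5)

# Both models see the same confirmed malignant lesion that scored 0.45.
for model in (locked, adaptive):
    model.observe(score=0.45, is_malignant=True)

locked_flags = locked.predict(0.45)      # False: behavior unchanged
adaptive_flags = adaptive.predict(0.45)  # True: threshold moved to 0.4
```

The same lesion gets a different answer from the adaptive model after one update, which suggests why adaptive software is harder to evaluate before marketing than a locked algorithm.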
In its examples of artificial intelligence and machine learning technologies, FDA includes, “An imaging system that uses algorithms to give diagnostic information for skin cancer in patients.” (Chan, et al. provide an excellent overview of machine learning that is worth accessing.2)
Convolutional neural networks (CNNs) are a form of deep learning, which is itself a form of machine learning. CNNs were initially designed for image analysis and are thus well-suited for use in dermatology diagnostics. A CNN relies on two basic operations: convolution and pooling. Zhu et al explain, “The convolution operation, using multiple filters, is able to extract features (feature map) from the data set, through which their corresponding spatial information can be preserved. The pooling operation, also called subsampling, is used to reduce the dimensionality of feature maps from the convolution operation.”3
“CNN simulates the processing of biological neurons and is the state-of-the-art network for pattern recognition in medical image analysis,” Cockerell et al note, thus offering “advantages over traditional analytical techniques.”4
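For readers who want to see the two operations concretely, the following is a minimal sketch in plain Python. The 5x5 "image," the 2x2 filter, and the function names are hypothetical. A real CNN learns many filters from training data and stacks many such layers.

```python
def convolve2d(image, kernel):
    """Slide a filter over the image (stride 1, no padding) to build a feature map.

    As in most deep learning libraries, the filter is not flipped, so this is
    technically cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical grayscale patch: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hypothetical filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)  # 4x4; each row is [0, 2, 0, 0]
pooled = max_pool2d(feature_map)         # 2x2: [[2, 0], [2, 0]]
```

The feature map is nonzero only where the edge sits, so the spatial information is preserved, and pooling shrinks the 4x4 map to 2x2 while keeping the strongest response.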
A significant proportion of ML development in health care currently is focused on software as a medical device (SaMD). As opposed to software in a medical device (sometimes called “embedded” software), software as a medical device meets several criteria listed in the sidebar. Most importantly, SaMD is a medical device and includes in vitro diagnostic (IVD) medical devices and can run on general purpose (non-medical purpose) computing platforms.5 Two key elements stand out in this description for SaMD: it is fully intended to be diagnostic, and it is distinct from software that may be developed to work exclusively with a specific device, although such software may also incorporate AI.
According to FDA, one of the greatest benefits of AI/ML in software resides in its ability to learn from real-world use and experience, and its capability to improve its performance, known as training and adaptation, respectively.
Currently, FDA has not established a clear regulatory framework specific to SaMD. The agency says it has cleared or approved several AI/ML-based SaMD, but these typically have only included algorithms that are “locked” prior to marketing. In other words, such SaMD are not adaptive.
FDA last fall issued “Artificial Intelligence and Machine Learning (AI/ML) Software as a Medical Device Action Plan” as an initial step in establishing standards for approval of AI-based devices. The publication is not official guidance, but it sets the stage for progress toward a formal framework.
As developments continue in AI in dermatology and especially in skin cancer diagnosis, dermatologists should keep several key considerations in mind.
What Is Software as a Medical Device?
- SaMD is a medical device and includes in vitro diagnostic (IVD) medical devices.
- SaMD is capable of running on general purpose (non-medical purpose) computing platforms.
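- “Without being part of,” a phrase from the International Medical Device Regulators Forum definition of SaMD (software intended to be used for one or more medical purposes that performs these purposes without being part of a hardware medical device), means software not necessary for a hardware medical device to achieve its intended medical purpose.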
- Software does not meet the definition of SaMD if its intended purpose is to drive a medical device.
- SaMD may be used in combination (e.g., as a module) with other products including medical devices.
- SaMD may be interfaced with other medical devices.
- Mobile apps that meet the definition above are considered SaMD.
A. AUGMENTED INTELLIGENCE IS IDEAL
Traditionally, AI has stood for Artificial Intelligence, but across medicine and especially in dermatology, the term Augmented Intelligence is taking hold. Augmented Intelligence reflects the notion that new technologies and evolving hardware and software solutions are not intended to replace clinicians. Rather they are intended to support clinicians by offering additional data and information. In some cases, AI may serve to affirm the dermatologist’s clinical decision making. In other cases, it may point to alternative or additional considerations that may be relevant.
In one of the most favorable findings for AI, a CNN trained on open-source images outperformed the majority (136 of 157) of dermatologists in terms of average specificity and sensitivity. Participating dermatologists had various levels of experience.6
But in another analysis of AI use in the detection of potential skin malignancies, researchers concluded, “Good quality AI-based support of clinical decision-making improves diagnostic accuracy over that of either AI or physicians alone, and that the least experienced clinicians gain the most from AI-based support.”7
A systematic review of 19 studies identified favorable findings for AI, but the authors suggest caution. Overall, tested AI demonstrated “superior or at least equivalent performance of CNN-based classifiers compared with clinicians.” In all studies, the authors note, AI testing was “in highly artificial settings,” used a single image of each suspicious lesion, and may not have included the patient populations and melanoma subtypes encountered in clinical practice.8 These findings support the notion that the combined intelligence of the clinician and the software may be optimal.
Tschandl et al describe “human-computer collaboration” in skin cancer diagnosis, finding that “good quality AI-based support of clinical decision-making improves diagnostic accuracy over that of either AI or physicians alone, and that the least experienced clinicians gain the most from AI-based support.” They caution that AI alone could be misleading but that, coupled with clinician knowledge, it can improve human diagnosis.9
AI performs well in controlled trials, and certain technologies appear poised to improve patient care. As with any technology, much depends on the specific application. CNNs that analyze images do not consider factors such as history that may influence a dermatologist’s assessment of lesions.
B. BE INFORMED
As technologies emerge, it is incumbent upon dermatologists to understand their specific risks, benefits, and appropriate use for skin cancer detection. Tschandl et al caution that poorly designed AI can mislead even experienced diagnosticians.9
If “machine learning” is possible, then we must consider how the machine is trained. Ultimately, the utility of any system will rely on the quality of the data used for software training.
There are evolving standards for Good Machine Learning Practice (GMLP, see sidebar). Among the important considerations are that clinical study participants and data sets are representative of the intended patient population and that training data sets are independent of test sets. These will be important considerations for dermatologists as they vet emerging technologies.
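One common way developers try to keep training and test sets independent is to split by patient rather than by image, so that no patient contributes images to both sets. The sketch below assumes each record carries a `patient_id` field. That field name, the records, and the 20 percent test fraction are hypothetical. Site and acquisition factors, which GMLP also mentions, would need similar handling.

```python
import random


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Assign whole patients, not individual images, to train or test."""
    patient_ids = sorted({record["patient_id"] for record in records})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test


# Hypothetical data set: five patients with two lesion images each.
records = [
    {"patient_id": pid, "image": f"{pid}_lesion{n}.jpg"}
    for pid in ("p01", "p02", "p03", "p04", "p05")
    for n in (1, 2)
]

train, test = split_by_patient(records)
train_patients = {r["patient_id"] for r in train}
test_patients = {r["patient_id"] for r in test}
overlap = train_patients & test_patients  # empty set: no patient is in both
```

A random split at the image level could place two photographs of the same lesion on either side of the divide, which would tend to inflate measured performance.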
As patients become increasingly savvy about AI, dermatologists must be prepared to explain its proper role in diagnosis. Interestingly, research suggests that even digital-native patients will lean on their physicians for guidance on the role and application of AI. In one survey, the majority of individuals under 35 years of age indicated they were ready to accept AI-based diagnostic solutions for early detection of skin cancer, but they indicated a need for increased explainability of AI-based tools.10
C. CONSUMER-FACING HASN’T CAUGHT UP
Just as important as explaining AI used in the clinic, and possibly more important, will be dermatologists’ explanation of consumer-facing mobile apps for skin cancer detection. Several apps are already on the market aimed at helping patients/consumers identify suspicious lesions using AI technology. These are not to be confused with certain telehealth apps that use store-and-forward technology to allow virtual assessment by a dermatologist. AI-based health app developers are not required to seek FDA clearance for their software, and most are likely to attempt to skirt regulation with disclaimers that their products are not intended to diagnose or treat disease.
While the potential for patients to be able to identify suspicious lesions with their phones and then seek early intervention is promising, the technology simply is not yet ready for wide public use. Thinking in terms of “augmented intelligence,” an ideal app will provide additional information to aid diagnosis rather than simply render a diagnosis. For now, patients are best served by their own knowledge and intuition: any new or changing mole requires evaluation; lesions can be assessed based on the ABCDEs. If a patient is suspicious enough to reach for an app, then they ought to see a dermatologist.
One study tested a consumer-facing app’s ability to identify suspicious lesions and benign controls; lesion status was histopathologically confirmed after app assessment. The app had an overall 86.9 percent sensitivity and 70.4 percent specificity. The device used influenced the app’s performance. The sensitivity was significantly higher on the iOS device compared to the Android device (91.0 vs. 83.0 percent).11
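As a reminder of how such figures are derived, the sketch below computes sensitivity, specificity, and positive predictive value from a two-by-two table. The counts are hypothetical round numbers chosen for easy arithmetic. They are not the data from the cited study.

```python
def sensitivity(true_pos, false_neg):
    """Share of histopathologically confirmed suspicious lesions that were flagged."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of benign lesions that were correctly left unflagged."""
    return true_neg / (true_neg + false_pos)


def positive_predictive_value(true_pos, false_pos):
    """Share of flagged lesions that were truly suspicious."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts: 100 suspicious lesions and 200 benign controls.
true_pos, false_neg = 90, 10
true_neg, false_pos = 140, 60

sens = sensitivity(true_pos, false_neg)               # 0.9
spec = specificity(true_neg, false_pos)               # 0.7
ppv = positive_predictive_value(true_pos, false_pos)  # 0.6
```

In this hypothetical sample, a specificity of 70 percent means that four in 10 flagged lesions are benign. The proportion would be higher still in a general consumer population, where true malignancies are less common.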
A separate study of a different app compared the technology to assessment by a dermatologist. Of 199 lesions, the app was unable to provide an analysis for 90 (45 percent). Among these lesions not analyzed were nine BCC, four atypical nevi, and one lentigo maligna. Among lesions rated by the app as high or medium risk, 67 percent and 79 percent, respectively, were diagnosed as benign nevi or seborrheic keratoses. Interobserver agreement between the app and the dermatologist was poor.12
Another study, now six years old—which can be significant in terms of technology—assessed three apps intended to classify lesions as suspicious or benign. Dermatologists pre-screened patients prior to assessment with the app. The apps’ sensitivity and specificity ranged from 21-72 percent and 27-100 percent, respectively, when compared with the specialists’ decisions. Interrater agreement between dermatologists and apps was poor to slight.13
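Agreement between raters is often summarized with a chance-corrected statistic such as Cohen's kappa, where 1 indicates perfect agreement and 0 indicates agreement no better than chance. Whether the cited studies used exactly this statistic is an assumption here, and the ratings below are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must rate the same non-empty set of items.")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(counts_a) | set(counts_b)
    # Agreement expected if each rater assigned labels at random
    # according to that rater's own label frequencies.
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if expected == 1:
        return 1.0  # both raters gave every item the same single label
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of 10 lesions: "s" = suspicious, "b" = benign.
dermatologist = ["s", "s", "s", "s", "b", "b", "b", "b", "b", "b"]
app           = ["s", "b", "b", "s", "s", "b", "s", "b", "b", "b"]

kappa = cohens_kappa(dermatologist, app)  # about 0.17
```

Here the two raters agree on six of 10 lesions, yet kappa is only about 0.17 because more than half of that agreement would be expected by chance alone.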
D. DIVERSITY MATTERS
Dermatologists use the ABCDE and seven-point dermoscopic criteria for assessing suspicious lesions. While these criteria are, in a sense, based on recognition of patterns, this is not the same as the pattern recognition inherent in machine learning. The clinician’s approach to assessing lesions and the machine-learning approach are not truly analogous. In fact, while we understand the fundamentals of machine learning, we do not yet fully understand how machine learning in medical imaging takes place; such approaches are deemed “black-box.”
Nonetheless, it is understood that the outcomes from machine learning algorithms depend on the quality of the data on which the algorithm trains. (See sidebar on Good Machine Learning Practice.) The majority of clinical images of skin cancer and its mimickers are of individuals with fair skin. Overall, skin of color has been underrepresented in both traditional and online image resources. In an analysis of 15,445 images across six textbooks and two online image resources, only about 20 percent depicted individuals with dark skin; neoplasms of the skin had the lowest representation.14
By all indications, AI developers will be tasked to significantly augment the representation of skin of color in ML training sets. If training sets include too few images from patients with dark skin, algorithms will be less able to identify skin cancers on dark skin.
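A simple audit of a training set's composition can surface this kind of gap before a model is trained. The sketch below assumes each image is labeled with a skin-type group. The group names (loosely modeled on Fitzpatrick skin types), the counts, and the 15 percent flagging threshold are all hypothetical, and a threshold appropriate for a real project would depend on the intended patient population.

```python
from collections import Counter


def representation_report(labels, expected_groups, minimum_share=0.15):
    """Report each group's share of the data set and flag low representation."""
    if not labels:
        raise ValueError("The data set has no labeled images.")
    counts = Counter(labels)
    total = len(labels)
    report = {}
    for group in expected_groups:  # listing groups catches ones with zero images
        share = counts[group] / total
        report[group] = {
            "count": counts[group],
            "share": share,
            "flagged": share < minimum_share,
        }
    return report


# Hypothetical training set of 100 labeled lesion images.
labels = ["I-II"] * 70 + ["III-IV"] * 25 + ["V-VI"] * 5
groups = ["I-II", "III-IV", "V-VI"]

report = representation_report(labels, groups)
flagged_groups = [g for g in groups if report[g]["flagged"]]  # ["V-VI"]
```

A count of images is only a starting point. It says nothing about whether the images within each group cover the relevant diagnoses, such as the neoplasms noted above as having the lowest representation.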
Given ongoing disparities in skin cancer diagnosis and outcomes among Hispanic and Black patients, it is imperative that future technology advancement seek to close and not widen the care gap.
GOOD MACHINE LEARNING PRACTICE
FDA, Health Canada, and the United Kingdom’s Medicines and Healthcare products Regulatory Agency (MHRA) have jointly identified 10 guiding principles that can inform the development of Good Machine Learning Practice (GMLP).
They envision these guiding principles may be used to:
- Adopt good practices that have been proven in other sectors
- Tailor practices from other sectors so they are applicable to medical technology and the health care sector
- Create new practices specific for medical technology and the health care sector
1. Multi-Disciplinary Expertise Is Leveraged Throughout the Total Product Life Cycle: In-depth understanding of a model’s intended integration into clinical workflow, and the desired benefits and associated patient risks, can help ensure that ML-enabled medical devices are safe and effective and address clinically meaningful needs over the lifecycle of the device.
2. Good Software Engineering and Security Practices Are Implemented: Model design is implemented with attention to the “fundamentals”: good software engineering practices, data quality assurance, data management, and robust cybersecurity practices. These practices include methodical risk management and design process that can appropriately capture and communicate design, implementation, and risk management decisions and rationale, as well as ensure data authenticity and integrity.
3. Clinical Study Participants and Data Sets Are Representative of the Intended Patient Population: Data collection protocols should ensure that the relevant characteristics of the intended patient population (for example, in terms of age, gender, sex, race, and ethnicity), use, and measurement inputs are sufficiently represented in a sample of adequate size in the clinical study and training and test datasets, so that results can be reasonably generalized to the population of interest. This is important to manage any bias, promote appropriate and generalizable performance across the intended patient population, assess usability, and identify circumstances where the model may underperform.
4. Training Data Sets Are Independent of Test Sets: Training and test datasets are selected and maintained to be appropriately independent of one another. All potential sources of dependence, including patient, data acquisition, and site factors, are considered and addressed to assure independence.
5. Selected Reference Datasets Are Based Upon Best Available Methods: Accepted, best available methods for developing a reference dataset (that is, a reference standard) ensure that clinically relevant and well characterized data are collected and the limitations of the reference are understood. If available, accepted reference datasets in model development and testing that promote and demonstrate model robustness and generalizability across the intended patient population are used.
6. Model Design Is Tailored to the Available Data and Reflects the Intended Use of the Device: Model design is suited to the available data and supports the active mitigation of known risks, like overfitting, performance degradation, and security risks. The clinical benefits and risks related to the product are well understood, used to derive clinically meaningful performance goals for testing, and support that the product can safely and effectively achieve its intended use. Considerations include the impact of both global and local performance and uncertainty/variability in the device inputs, outputs, intended patient populations, and clinical use conditions.
7. Focus Is Placed on the Performance of the Human-AI Team: Where the model has a “human in the loop,” human factors considerations and the human interpretability of the model outputs are addressed with emphasis on the performance of the Human-AI team, rather than just the performance of the model in isolation.
8. Testing Demonstrates Device Performance during Clinically Relevant Conditions: Statistically sound test plans are developed and executed to generate clinically relevant device performance information independently of the training data set. Considerations include the intended patient population, important subgroups, clinical environment and use by the Human-AI team, measurement inputs, and potential confounding factors.
9. Users Are Provided Clear, Essential Information: Users are provided ready access to clear, contextually relevant information that is appropriate for the intended audience (such as health care providers or patients) including: the product’s intended use and indications for use, performance of the model for appropriate subgroups, characteristics of the data used to train and test the model, acceptable inputs, known limitations, user interface interpretation, and clinical workflow integration of the model. Users are also made aware of device modifications and updates from real-world performance monitoring, the basis for decision-making when available, and a means to communicate product concerns to the developer.
10. Deployed Models Are Monitored for Performance and Re-training Risks are Managed: Deployed models have the capability to be monitored in “real world” use with a focus on maintained or improved safety and performance. Additionally, when models are periodically or continually trained after deployment, there are appropriate controls in place to manage risks of overfitting, unintended bias, or degradation of the model (for example, dataset drift) that may impact the safety and performance of the model as it is used by the Human-AI team.
E. EXPANSION IS IMMINENT
The question is no longer whether AI will support the diagnosis of melanoma and NMSC in the clinic. The question is when. Existing technologies already show promise. The key to successful integration of AI in the clinic is to optimize outcomes, reduce errors, and limit liability for clinicians who adopt such technology. Those who integrate these technologies must fully understand the ML process that supported their development. Users must vet not only the quality of the ML process and its training sets but also the evidence from well-designed trials of the software.
The notion of augmented intelligence encapsulates the ideal approach to use of ML in the clinic. It is not intended to replace assessment by an expert; rather, it is intended to support assessment by an expert. In this vein, consumer-facing apps intended to detect skin cancers are not currently recommended, as they generally have not shown a reliable benefit and cannot replace evaluation by a dermatologist.
Accessed March 17, 2022
2. Chan S, Reddy V, Myers B, Thibodeaux Q, Brownstone N, Liao W. Machine Learning in Dermatology: Current Applications, Opportunities, and Limitations. Dermatol Ther (Heidelb). 2020 Jun;10(3):365-386. doi: 10.1007/s13555-020-00372-0. Epub 2020 Apr 6. PMID: 32253623; PMCID: PMC7211783
Accessed March 18, 2022.
4. Das K, Cockerell CJ, Patil A, Pietkiewicz P, Giulini M, Grabbe S, Goldust M. Machine Learning and Its Application in Skin Cancer. Int J Environ Res Public Health. 2021 Dec 20;18(24):13409.
6. Brinker TJ, Hekler A, Enk AH, Klode J, Hauschild A, Berking C, Schilling B, Haferkamp S, Schadendorf D, Holland-Letz T, Utikal JS, von Kalle C; Collaborators. Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task. Eur J Cancer. 2019 May;113:47-54.
7. Tschandl P, Rinner C, Apalla Z, Argenziano G, Codella N, Halpern A, Janda M, Lallas A, Longo C, Malvehy J, Paoli J, Puig S, Rosendahl C, Soyer HP, Zalaudek I, Kittler H. Human-computer collaboration for skin cancer recognition. Nat Med. 2020 Aug;26(8):1229-1234.
8. Haggenmüller S, Maron RC, Hekler A, Utikal JS, Barata C, Barnhill RL, Beltraminelli H, Berking C, Betz-Stablein B, Blum A, Braun SA, Carr R, Combalia M, Fernandez-Figueras MT, Ferrara G, Fraitag S, French LE, Gellrich FF, Ghoreschi K, Goebeler M, Guitera P, Haenssle HA, Haferkamp S, Heinzerling L, Heppt MV, Hilke FJ, Hobelsberger S, Krahl D, Kutzner H, Lallas A, Liopyris K, Llamas-Velasco M, Malvehy J, Meier F, Müller CSL, Navarini AA, Navarrete-Dechent C, Perasole A, Poch G, Podlipnik S, Requena L, Rotemberg VM, Saggini A, Sangueza OP, Santonja C, Schadendorf D, Schilling B, Schlaak M, Schlager JG, Sergon M, Sondermann W, Soyer HP, Starz H, Stolz W, Vale E, Weyers W, Zink A, Krieghoff-Henning E, Kather JN, von Kalle C, Lipka DB, Fröhling S, Hauschild A, Kittler H, Brinker TJ. Skin cancer classification via convolutional neural networks: systematic review of studies involving human experts. Eur J Cancer. 2021 Oct;156:202-216.
9. Tschandl P, Rinner C, Apalla Z, Argenziano G, Codella N, Halpern A, Janda M, Lallas A, Longo C, Malvehy J, Paoli J, Puig S, Rosendahl C, Soyer HP, Zalaudek I, Kittler H. Human-computer collaboration for skin cancer recognition. Nat Med. 2020 Aug;26(8):1229-1234.
10. Haggenmüller S, Krieghoff-Henning E, Jutzi T, Trapp N, Kiehl L, Utikal JS, Fabian S, Brinker TJ. Digital Natives’ Preferences on Mobile Artificial Intelligence Apps for Skin Cancer Diagnostics: Survey Study. JMIR Mhealth Uhealth. 2021 Aug 27;9(8):e22909.
11. Sangers T, Reeder S, van der Vet S, Jhingoer S, Mooyaart A, Siegel DM, Nijsten T, Wakkee M. Validation of a Market-Approved Artificial Intelligence Mobile Health App for Skin Cancer Screening: A Prospective Multicenter Diagnostic Accuracy Study. Dermatology. 2022 Feb 4:1-8.
12. Chung Y, van der Sande AAJ, de Roos KP, Bekkenk MW, de Haas ERM, Kelleners-Smeets NWJ, Kukutsch NA. Poor agreement between the automated risk assessment of a smartphone application for skin cancer detection and the rating by dermatologists. J Eur Acad Dermatol Venereol. 2020 Feb;34(2):274-278.
13. Ngoo A, Finnane A, McMeniman E, Tan JM, Janda M, Soyer HP. Efficacy of smartphone applications in high-risk pigmented lesions. Australas J Dermatol. 2018 Aug;59(3):e175-e182.
14. Alvarado SM, Feng H. Representation of dark skin images of common dermatologic conditions in educational resources: A cross-sectional analysis. J Am Acad Dermatol. 2021 May;84(5):1427-1431.