Validity evidence supporting clinical skills assessment by artificial intelligence compared with trained clinician raters

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Validity evidence supporting clinical skills assessment by artificial intelligence compared with trained clinician raters. / Johnsson, Vilma; Søndergaard, Morten Bo; Kulasegaram, Kulamakan; Sundberg, Karin; Tiblad, Eleonor; Herling, Lotta; Petersen, Olav Bjørn; Tolsgaard, Martin G.

In: Medical Education, Vol. 58, No. 1, 2024, p. 105-117.

Harvard

Johnsson, V, Søndergaard, MB, Kulasegaram, K, Sundberg, K, Tiblad, E, Herling, L, Petersen, OB & Tolsgaard, MG 2024, 'Validity evidence supporting clinical skills assessment by artificial intelligence compared with trained clinician raters', Medical Education, vol. 58, no. 1, pp. 105-117. https://doi.org/10.1111/medu.15190

APA

Johnsson, V., Søndergaard, M. B., Kulasegaram, K., Sundberg, K., Tiblad, E., Herling, L., Petersen, O. B., & Tolsgaard, M. G. (2024). Validity evidence supporting clinical skills assessment by artificial intelligence compared with trained clinician raters. Medical Education, 58(1), 105-117. https://doi.org/10.1111/medu.15190

Vancouver

Johnsson V, Søndergaard MB, Kulasegaram K, Sundberg K, Tiblad E, Herling L et al. Validity evidence supporting clinical skills assessment by artificial intelligence compared with trained clinician raters. Medical Education. 2024;58(1):105-117. https://doi.org/10.1111/medu.15190

Author

Johnsson, Vilma ; Søndergaard, Morten Bo ; Kulasegaram, Kulamakan ; Sundberg, Karin ; Tiblad, Eleonor ; Herling, Lotta ; Petersen, Olav Bjørn ; Tolsgaard, Martin G. / Validity evidence supporting clinical skills assessment by artificial intelligence compared with trained clinician raters. In: Medical Education. 2024 ; Vol. 58, No. 1. pp. 105-117.

Bibtex

@article{a844cc36658c4d288238c81ea503f5fe,

title = "Validity evidence supporting clinical skills assessment by artificial intelligence compared with trained clinician raters",

abstract = "Background: Artificial intelligence (AI) is becoming increasingly used in medical education, but our understanding of the validity of AI-based assessments (AIBA) as compared with traditional clinical expert-based assessments (EBA) is limited. In this study, the authors aimed to compare and contrast the validity evidence for the assessment of a complex clinical skill based on scores generated from an AI and trained clinical experts, respectively. Methods: The study was conducted between September 2020 and October 2022. The authors used Kane's validity framework to prioritise and organise their evidence according to the four inferences: scoring, generalisation, extrapolation and implications. The context of the study was chorionic villus sampling performed within the simulated setting. AIBA and EBA were used to evaluate performances of experts, intermediates and novices based on video recordings. The clinical experts used a scoring instrument developed in a previous international consensus study. The AI used convolutional neural networks for capturing features on video recordings, motion tracking and eye movements to arrive at a final composite score. Results: A total of 45 individuals participated in the study (22 novices, 12 intermediates and 11 experts). The authors demonstrated validity evidence for scoring, generalisation, extrapolation and implications for both EBA and AIBA. The plausibility of assumptions related to scoring, evidence of reproducibility and relation to different training levels was examined. Issues relating to construct underrepresentation, lack of explainability, and threats to robustness were identified as potential weak links in the AIBA validity argument compared with the EBA validity argument. Conclusion: There were weak links in the use of AIBA compared with EBA, mainly in their representation of the underlying construct but also regarding their explainability and ability to transfer to other datasets. However, combining AI and clinical expert-based assessments may offer complementary benefits, which is a promising subject for future research.",

author = "Vilma Johnsson and S{\o}ndergaard, {Morten Bo} and Kulamakan Kulasegaram and Karin Sundberg and Eleonor Tiblad and Lotta Herling and Petersen, {Olav Bj{\o}rn} and Tolsgaard, {Martin G.}",

note = "Publisher Copyright: {\textcopyright} 2023 The Authors. Medical Education published by Association for the Study of Medical Education and John Wiley \& Sons Ltd.",

year = "2024",

doi = "10.1111/medu.15190",

language = "English",

volume = "58",

pages = "105--117",

journal = "Medical Education",

issn = "0308-0110",

publisher = "Wiley",

number = "1",

}

RIS

TY - JOUR

T1 - Validity evidence supporting clinical skills assessment by artificial intelligence compared with trained clinician raters

AU - Johnsson, Vilma

AU - Søndergaard, Morten Bo

AU - Kulasegaram, Kulamakan

AU - Sundberg, Karin

AU - Tiblad, Eleonor

AU - Herling, Lotta

AU - Petersen, Olav Bjørn

AU - Tolsgaard, Martin G.

PY - 2024

Y1 - 2024

AB - Background: Artificial intelligence (AI) is becoming increasingly used in medical education, but our understanding of the validity of AI-based assessments (AIBA) as compared with traditional clinical expert-based assessments (EBA) is limited. In this study, the authors aimed to compare and contrast the validity evidence for the assessment of a complex clinical skill based on scores generated from an AI and trained clinical experts, respectively. Methods: The study was conducted between September 2020 and October 2022. The authors used Kane's validity framework to prioritise and organise their evidence according to the four inferences: scoring, generalisation, extrapolation and implications. The context of the study was chorionic villus sampling performed within the simulated setting. AIBA and EBA were used to evaluate performances of experts, intermediates and novices based on video recordings. The clinical experts used a scoring instrument developed in a previous international consensus study. The AI used convolutional neural networks for capturing features on video recordings, motion tracking and eye movements to arrive at a final composite score. Results: A total of 45 individuals participated in the study (22 novices, 12 intermediates and 11 experts). The authors demonstrated validity evidence for scoring, generalisation, extrapolation and implications for both EBA and AIBA. The plausibility of assumptions related to scoring, evidence of reproducibility and relation to different training levels was examined. Issues relating to construct underrepresentation, lack of explainability, and threats to robustness were identified as potential weak links in the AIBA validity argument compared with the EBA validity argument. Conclusion: There were weak links in the use of AIBA compared with EBA, mainly in their representation of the underlying construct but also regarding their explainability and ability to transfer to other datasets. However, combining AI and clinical expert-based assessments may offer complementary benefits, which is a promising subject for future research.

U2 - 10.1111/medu.15190

DO - 10.1111/medu.15190

M3 - Journal article

C2 - 37615058

AN - SCOPUS:85168601396

VL - 58

SP - 105

EP - 117

JO - Medical Education

JF - Medical Education

SN - 0308-0110

IS - 1

ER -

ID: 366993713
