Article Open Access

Capable exam-taker and question-generator: the dual role of generative AI in medical education assessment

Published/Copyright: January 14, 2025

Abstract

Objectives

Artificial intelligence (AI) is being increasingly used in medical education. This narrative review presents a comprehensive analysis of generative AI tools’ performance in answering and generating medical exam questions, thereby providing a broader perspective on AI’s strengths and limitations in the medical education context.

Methods

The Scopus database was searched for studies on generative AI in medical examinations from 2022 to 2024. Duplicates were removed, and relevant full texts were retrieved following inclusion and exclusion criteria. Narrative analysis and descriptive statistics were used to analyze the contents of the included studies.
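The screening workflow described above (deduplication followed by year-based inclusion) can be sketched in a few lines. This is an illustrative sketch only; the field names (`doi`, `title`, `year`) and the fallback to a normalized title when a DOI is missing are assumptions, not details reported by the review.

```python
def screen_records(records, year_min=2022, year_max=2024):
    """Remove duplicate records (by DOI, else by normalized title)
    and apply the 2022-2024 inclusion criterion."""
    seen = set()
    included = []
    for rec in records:
        # Prefer the DOI as the dedup key; fall back to the title.
        key = rec.get("doi") or rec["title"].strip().lower()
        if key in seen:
            continue  # duplicate record, skip
        seen.add(key)
        if year_min <= rec["year"] <= year_max:
            included.append(rec)
    return included

sample = [
    {"doi": "10.1/a", "title": "ChatGPT on USMLE", "year": 2023},
    {"doi": "10.1/a", "title": "ChatGPT on USMLE", "year": 2023},  # duplicate
    {"doi": None, "title": "AI exam study", "year": 2021},  # outside window
]
print(len(screen_records(sample)))  # 1
```

In practice the exported Scopus records would also be screened against the full inclusion and exclusion criteria by human reviewers; the sketch covers only the mechanical deduplication and date-filter steps.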

Results

A total of 70 studies were included for analysis. The results showed that the AI tools' performance varied across question types and medical specialties, with the best average accuracy in psychiatry, and was influenced by the prompts used. With well-crafted prompts, AI models can efficiently produce high-quality examination questions.

Conclusion

Generative AI can both answer and produce medical examination questions when given carefully designed prompts. Its potential use in medical assessment is broad, ranging from detecting question errors and aiding exam preparation to facilitating formative assessments and supporting personalized learning. However, it is crucial that educators always double-check the AI's responses to maintain accuracy and prevent the spread of misinformation.
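The conclusion's point that output quality hinges on carefully designed prompts can be made concrete with a prompt template. The template wording below is an assumption for illustration, not a prompt taken from any of the reviewed studies; it simply shows the kind of explicit constraints (format, difficulty, rationale) that a well-crafted question-generation prompt specifies.

```python
# Hypothetical prompt template for generating a single-best-answer MCQ.
MCQ_PROMPT = (
    "You are a medical educator. Write one single-best-answer "
    "multiple-choice question on {topic} at the level of {exam}.\n"
    "Requirements:\n"
    "- A clinical vignette stem\n"
    "- Five options (A-E) with exactly one correct answer\n"
    "- A brief explanation of why each distractor is wrong\n"
)

def build_prompt(topic, exam):
    """Fill the template with the target topic and examination level."""
    return MCQ_PROMPT.format(topic=topic, exam=exam)

prompt = build_prompt("diabetic ketoacidosis", "final-year medical school")
print(prompt.splitlines()[0])
```

Whatever template is used, each generated item would still be reviewed by an educator before use, in line with the review's caution about misinformation.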


Corresponding author: Chang Liu, Department of Immunology and Microbiology, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China

Funding source: Medical Education Branch of Chinese Medical Association

Award Identifier / Grant number: 2023B344

Acknowledgments

The authors thank the reviewers for their valuable comments and suggestions.

  1. Research ethics: This review was conducted based on published studies; therefore, no ethical review was required.

  2. Informed consent: Not applicable.

  3. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Use of Large Language Models, AI and Machine Learning Tools: None declared.

  5. Conflict of interest: The authors state no conflict of interest.

  6. Research funding: This work was supported by Medical Education Research Project of Medical Education Branch of Chinese Medical Association (2023B344).

  7. Data availability: The authors confirm that the data supporting the findings of this study are available within the article and its supplementary materials.



Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/gme-2024-0021).


Received: 2024-11-15
Accepted: 2024-12-10
Published Online: 2025-01-14

© 2024 the author(s), published by De Gruyter on behalf of the Shanghai Jiao Tong University and the Shanghai Jiao Tong University School of Medicine

This work is licensed under the Creative Commons Attribution 4.0 International License.
