Can ChatGPT-4o Really Pass Medical Science Exams? A Pragmatic Analysis Using Novel Questions

Newton, Phil; Summers, Chris; ZAHEER, MUHAMMAD; XIROMERITI, MARIA; STOKES, JEMIMA; BHANGU, JASKARAN; ROOME, ELIS; ROBERTS-PHILLIPS, ALANNA; MAZAHERI-ASADI, DARIUS; JONES, CAMERON; HUGHES, STUART; GILBERT, DOMINIC; JONES, EWAN; ESSEX, KEIONI; ELLIS, EMILY; DAVEY, ROSS; COX, ADRIENNE; BASSETT, JESSICA

doi:10.1007/s40670-025-02293-z

Journal article

Medical Science Educator, Volume: 35, Issue: 2, Pages: 721 - 729

Swansea University Authors: Phil Newton, Chris Summers, MUHAMMAD ZAHEER, MARIA XIROMERITI, JEMIMA STOKES, JASKARAN BHANGU, ELIS ROOME, ALANNA ROBERTS-PHILLIPS, DARIUS MAZAHERI-ASADI, CAMERON JONES, STUART HUGHES, DOMINIC GILBERT, EWAN JONES, KEIONI ESSEX, EMILY ELLIS, ROSS DAVEY, ADRIENNE COX, JESSICA BASSETT

PDF | Version of Record

© The Author(s) 2025. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).

DOI (Published version): 10.1007/s40670-025-02293-z

Abstract

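ChatGPT apparently shows excellent performance on high-level professional exams such as those involved in medical assessment and licensing. This has raised concerns that ChatGPT could be used for academic misconduct, especially in unproctored online exams. However, ChatGPT has previously shown weaker performance on questions with pictures, and there have been concerns that ChatGPT’s performance may be artificially inflated by the public nature of the sample questions tested, meaning they likely formed part of the training materials for ChatGPT. This led to suggestions that cheating could be mitigated by using novel questions for every sitting of an exam and making extensive use of picture-based questions. These approaches remain untested. Here, we tested the performance of ChatGPT-4o on existing medical licensing exams in the UK and USA, and on novel questions based on those exams. ChatGPT-4o scored 94% on the United Kingdom Medical Licensing Exam Applied Knowledge Test and 89.9% on the United States Medical Licensing Exam Step 1. Performance was not diminished when the questions were rewritten into novel versions, or on completely novel questions which were not based on any existing questions. ChatGPT did show reduced performance on questions containing images when the answer options were added to an image as text labels. These data demonstrate that the performance of ChatGPT continues to improve and that secure testing environments are required for the valid assessment of both foundational and higher order learning.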

Published in:	Medical Science Educator
ISSN:	2156-8650
Published:	Springer Nature 2025
URI:	https://cronfa.swan.ac.uk/Record/cronfa67970

first_indexed	2025-01-30T13:19:48Z
last_indexed	2025-05-14T12:09:28Z
id	cronfa67970
recordtype	SURis
fullrecord	<?xml version="1.0"?><rfc1807><datestamp>2025-05-13T10:43:05.6504253</datestamp><bib-version>v2</bib-version><id>67970</id><entry>2024-10-11</entry><title>Can ChatGPT-4o Really Pass Medical Science Exams? A Pragmatic Analysis Using Novel Questions</title><swanseaauthors><author><sid>6e0a363d04c407371184d82f7a5bddc8</sid><ORCID>0000-0002-5272-7979</ORCID><firstname>Phil</firstname><surname>Newton</surname><name>Phil Newton</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>12285129663749039fa74e265339c198</sid><ORCID>0009-0000-5336-2492</ORCID><firstname>Chris</firstname><surname>Summers</surname><name>Chris Summers</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>5f4e9a5cfe8456bcf4a72b62a9a3f971</sid><firstname>MUHAMMAD</firstname><surname>ZAHEER</surname><name>MUHAMMAD ZAHEER</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>4701c86eff14cbe3661d10eaba7fa8d3</sid><firstname>MARIA</firstname><surname>XIROMERITI</surname><name>MARIA XIROMERITI</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>9e457cdcf2e490affc3d2733d46f5f73</sid><firstname>JEMIMA</firstname><surname>STOKES</surname><name>JEMIMA STOKES</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>4cf22ca1ea998ec4a9cd23ba35cbf55b</sid><firstname>JASKARAN</firstname><surname>BHANGU</surname><name>JASKARAN BHANGU</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>697dcdb246cfc62876530b805cfb8b8b</sid><firstname>ELIS</firstname><surname>ROOME</surname><name>ELIS ROOME</name><active>true</active><ethesisStudent>true</ethesisStudent></author><author><sid>1286b5e6e2a81b219f324015faa9d595</sid><firstname>ALANNA</firstname><surname>ROBERTS-PHILLIPS</surname><name>ALANNA 
ROBERTS-PHILLIPS</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>419cc1a63e5b4118328e538b7aa6370c</sid><firstname>DARIUS</firstname><surname>MAZAHERI-ASADI</surname><name>DARIUS MAZAHERI-ASADI</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>ea5d2b505a7735f7bc478c0b43833b6a</sid><firstname>CAMERON</firstname><surname>JONES</surname><name>CAMERON JONES</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>19fda632f2c5ab74c564f1b555139ad2</sid><firstname>STUART</firstname><surname>HUGHES</surname><name>STUART HUGHES</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>5b78250dda5dfb7cbae45c6268589c40</sid><firstname>DOMINIC</firstname><surname>GILBERT</surname><name>DOMINIC GILBERT</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>eb72578c573a6ba56ef3d544c7a6748c</sid><firstname>EWAN</firstname><surname>JONES</surname><name>EWAN JONES</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>cbc0d4197020272eb74db22a5e8e6434</sid><firstname>KEIONI</firstname><surname>ESSEX</surname><name>KEIONI ESSEX</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>b520fff403c40a6bf5bfecebd255421c</sid><firstname>EMILY</firstname><surname>ELLIS</surname><name>EMILY ELLIS</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>84b76a9917f9ca57d07e02ce8d2b85f3</sid><firstname>ROSS</firstname><surname>DAVEY</surname><name>ROSS DAVEY</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>99322b05343a30f91435c1f9e73afdce</sid><firstname>ADRIENNE</firstname><surname>COX</surname><name>ADRIENNE 
COX</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>eec704f6a293c7bed4f16f61df6ed719</sid><firstname>JESSICA</firstname><surname>BASSETT</surname><name>JESSICA BASSETT</name><active>true</active><ethesisStudent>false</ethesisStudent></author></swanseaauthors><date>2024-10-11</date><deptcode>MEDS</deptcode><abstract>ChatGPT apparently shows excellent performance on high-level professional exams such as those involved in medical assessment and licensing. This has raised concerns that ChatGPT could be used for academic misconduct, especially in unproctored online exams. However, ChatGPT has previously shown weaker performance on questions with pictures, and there have been concerns that ChatGPT’s performance may be artificially inflated by the public nature of the sample questions tested, meaning they likely formed part of the training materials for ChatGPT. This led to suggestions that cheating could be mitigated by using novel questions for every sitting of an exam and making extensive use of picture-based questions. These approaches remain untested. Here, we tested the performance of ChatGPT-4o on existing medical licensing exams in the UK and USA, and on novel questions based on those exams. ChatGPT-4o scored 94% on the United Kingdom Medical Licensing Exam Applied Knowledge Test and 89.9% on the United States Medical Licensing Exam Step 1. Performance was not diminished when the questions were rewritten into novel versions, or on completely novel questions which were not based on any existing questions. ChatGPT did show reduced performance on questions containing images when the answer options were added to an image as text labels. 
These data demonstrate that the performance of ChatGPT continues to improve and that secure testing environments are required for the valid assessment of both foundational and higher order learning.</abstract><type>Journal Article</type><journal>Medical Science Educator</journal><volume>35</volume><journalNumber>2</journalNumber><paginationStart>721</paginationStart><paginationEnd>729</paginationEnd><publisher>Springer Nature</publisher><placeOfPublication/><isbnPrint/><isbnElectronic/><issnPrint/><issnElectronic>2156-8650</issnElectronic><keywords>Assessment validity; Academic integrity; Cheating; Evidence-based education; MCQs; Pragmatism</keywords><publishedDay>1</publishedDay><publishedMonth>4</publishedMonth><publishedYear>2025</publishedYear><publishedDate>2025-04-01</publishedDate><doi>10.1007/s40670-025-02293-z</doi><url/><notes/><college>COLLEGE NANME</college><department>Medical School</department><CollegeCode>COLLEGE CODE</CollegeCode><DepartmentCode>MEDS</DepartmentCode><institution>Swansea University</institution><apcterm>SU Library paid the OA fee (TA Institutional Deal)</apcterm><funders>Swansea University</funders><projectreference/><lastEdited>2025-05-13T10:43:05.6504253</lastEdited><Created>2024-10-11T17:43:24.3809347</Created><path><level id="1">Faculty of Medicine, Health and Life Sciences</level><level id="2">Swansea University Medical School - 
Medicine</level></path><authors><author><firstname>Phil</firstname><surname>Newton</surname><orcid>0000-0002-5272-7979</orcid><order>1</order></author><author><firstname>Chris</firstname><surname>Summers</surname><orcid>0009-0000-5336-2492</orcid><order>2</order></author><author><firstname>MUHAMMAD</firstname><surname>ZAHEER</surname><order>3</order></author><author><firstname>MARIA</firstname><surname>XIROMERITI</surname><order>4</order></author><author><firstname>JEMIMA</firstname><surname>STOKES</surname><order>5</order></author><author><firstname>JASKARAN</firstname><surname>BHANGU</surname><order>6</order></author><author><firstname>ELIS</firstname><surname>ROOME</surname><order>7</order></author><author><firstname>ALANNA</firstname><surname>ROBERTS-PHILLIPS</surname><order>8</order></author><author><firstname>DARIUS</firstname><surname>MAZAHERI-ASADI</surname><order>9</order></author><author><firstname>CAMERON</firstname><surname>JONES</surname><order>10</order></author><author><firstname>STUART</firstname><surname>HUGHES</surname><order>11</order></author><author><firstname>DOMINIC</firstname><surname>GILBERT</surname><order>12</order></author><author><firstname>EWAN</firstname><surname>JONES</surname><order>13</order></author><author><firstname>KEIONI</firstname><surname>ESSEX</surname><order>14</order></author><author><firstname>EMILY</firstname><surname>ELLIS</surname><order>15</order></author><author><firstname>ROSS</firstname><surname>DAVEY</surname><order>16</order></author><author><firstname>ADRIENNE</firstname><surname>COX</surname><order>17</order></author><author><firstname>JESSICA</firstname><surname>BASSETT</surname><order>18</order></author></authors><documents><document><filename>67970__33506__633a278dc99a496e8c3a30d699023872.pdf</filename><originalFilename>67970.VOR.pdf</originalFilename><uploaded>2025-02-05T13:32:09.9244873</uploaded><type>Output</type><contentLength>921678</contentLength><contentType>application/pdf</contentType>
<version>Version of Record</version><cronfaStatus>true</cronfaStatus><documentNotes>© The Author(s) 2025. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).</documentNotes><copyrightCorrect>true</copyrightCorrect><language>eng</language><licence>http://creativecommons.org/licenses/by/4.0/</licence></document></documents><OutputDurs/></rfc1807>
title	Can ChatGPT-4o Really Pass Medical Science Exams? A Pragmatic Analysis Using Novel Questions
author_id_str_mv	6e0a363d04c407371184d82f7a5bddc8 12285129663749039fa74e265339c198 5f4e9a5cfe8456bcf4a72b62a9a3f971 4701c86eff14cbe3661d10eaba7fa8d3 9e457cdcf2e490affc3d2733d46f5f73 4cf22ca1ea998ec4a9cd23ba35cbf55b 697dcdb246cfc62876530b805cfb8b8b 1286b5e6e2a81b219f324015faa9d595 419cc1a63e5b4118328e538b7aa6370c ea5d2b505a7735f7bc478c0b43833b6a 19fda632f2c5ab74c564f1b555139ad2 5b78250dda5dfb7cbae45c6268589c40 eb72578c573a6ba56ef3d544c7a6748c cbc0d4197020272eb74db22a5e8e6434 b520fff403c40a6bf5bfecebd255421c 84b76a9917f9ca57d07e02ce8d2b85f3 99322b05343a30f91435c1f9e73afdce eec704f6a293c7bed4f16f61df6ed719
author_id_fullname_str_mv	6e0a363d04c407371184d82f7a5bddc8_*_Phil Newton 12285129663749039fa74e265339c198_*_Chris Summers 5f4e9a5cfe8456bcf4a72b62a9a3f971_*_MUHAMMAD ZAHEER 4701c86eff14cbe3661d10eaba7fa8d3_*_MARIA XIROMERITI 9e457cdcf2e490affc3d2733d46f5f73_*_JEMIMA STOKES 4cf22ca1ea998ec4a9cd23ba35cbf55b_*_JASKARAN BHANGU 697dcdb246cfc62876530b805cfb8b8b_*_ELIS ROOME 1286b5e6e2a81b219f324015faa9d595_*_ALANNA ROBERTS-PHILLIPS 419cc1a63e5b4118328e538b7aa6370c_*_DARIUS MAZAHERI-ASADI ea5d2b505a7735f7bc478c0b43833b6a_*_CAMERON JONES 19fda632f2c5ab74c564f1b555139ad2_*_STUART HUGHES 5b78250dda5dfb7cbae45c6268589c40_*_DOMINIC GILBERT eb72578c573a6ba56ef3d544c7a6748c_*_EWAN JONES cbc0d4197020272eb74db22a5e8e6434_*_KEIONI ESSEX b520fff403c40a6bf5bfecebd255421c_*_EMILY ELLIS 84b76a9917f9ca57d07e02ce8d2b85f3_*_ROSS DAVEY 99322b05343a30f91435c1f9e73afdce_*_ADRIENNE COX eec704f6a293c7bed4f16f61df6ed719_*_JESSICA BASSETT
author	Phil Newton Chris Summers MUHAMMAD ZAHEER MARIA XIROMERITI JEMIMA STOKES JASKARAN BHANGU ELIS ROOME ALANNA ROBERTS-PHILLIPS DARIUS MAZAHERI-ASADI CAMERON JONES STUART HUGHES DOMINIC GILBERT EWAN JONES KEIONI ESSEX EMILY ELLIS ROSS DAVEY ADRIENNE COX JESSICA BASSETT
format	Journal article
container_title	Medical Science Educator
container_volume	35
container_issue	2
container_start_page	721
publishDate	2025
institution	Swansea University
issn	2156-8650
doi_str_mv	10.1007/s40670-025-02293-z
publisher	Springer Nature
college_str	Faculty of Medicine, Health and Life Sciences
hierarchytype
hierarchy_top_id	facultyofmedicinehealthandlifesciences
hierarchy_top_title	Faculty of Medicine, Health and Life Sciences
hierarchy_parent_id	facultyofmedicinehealthandlifesciences
hierarchy_parent_title	Faculty of Medicine, Health and Life Sciences
department_str	Swansea University Medical School - Medicine{{{_:::_}}}Faculty of Medicine, Health and Life Sciences{{{_:::_}}}Swansea University Medical School - Medicine
document_store_str	1
active_str	0
description	ChatGPT apparently shows excellent performance on high-level professional exams such as those involved in medical assessment and licensing. This has raised concerns that ChatGPT could be used for academic misconduct, especially in unproctored online exams. However, ChatGPT has previously shown weaker performance on questions with pictures, and there have been concerns that ChatGPT’s performance may be artificially inflated by the public nature of the sample questions tested, meaning they likely formed part of the training materials for ChatGPT. This led to suggestions that cheating could be mitigated by using novel questions for every sitting of an exam and making extensive use of picture-based questions. These approaches remain untested. Here, we tested the performance of ChatGPT-4o on existing medical licensing exams in the UK and USA, and on novel questions based on those exams. ChatGPT-4o scored 94% on the United Kingdom Medical Licensing Exam Applied Knowledge Test and 89.9% on the United States Medical Licensing Exam Step 1. Performance was not diminished when the questions were rewritten into novel versions, or on completely novel questions which were not based on any existing questions. ChatGPT did show reduced performance on questions containing images when the answer options were added to an image as text labels. These data demonstrate that the performance of ChatGPT continues to improve and that secure testing environments are required for the valid assessment of both foundational and higher order learning.
published_date	2025-04-01T07:37:47Z
