E-Thesis
Quantifying Underspecification in Machine Learning with Explainable AI
Author: James Hinns (Swansea University)
Abstract
To evaluate a trained machine learning (ML) model's performance, it is general practice to test it by predicting targets from a held-out testing set. For such a dataset, various models with different reasoning can be constructed that all produce near-optimal test performance. However, due to this variance in reasoning, some models generalise whilst others perform unexpectedly on further unseen data. The existence of multiple equally performing models exhibits underspecification of the ML pipeline used to produce them. Underspecification undermines the credibility of such test-performance evaluations and has been identified as a key reason why many models that perform well in testing exhibit poor performance in deployment.

In this work, we propose identifying underspecification by estimating the variance of reasoning within a set of near-optimal models produced by a pipeline, also called a Rashomon set. We iteratively train models using the same pipeline to produce an empirical Rashomon set of a fixed size. To quantify the variation of models within this Rashomon set, we measure the variation of the SHapley Additive exPlanations (SHAP) the models produce, using a variety of metrics. This yields an index representing the variation of reasoning within the Rashomon set, and thus within the pipeline; the index therefore measures the extent of underspecification the pipeline exhibits.

We provide an implementation of this approach and make it publicly available on GitHub. We validate that the implementation shows the trends we expect, using evaluation techniques previously used to demonstrate the existence of underspecification. Furthermore, we demonstrate our approach on multiple datasets drawn from the literature and in a COVID-19 virus transmission case study.
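The abstract only specifies the procedure at a high level, so the following is a minimal sketch of the idea, not the thesis's GitHub implementation. The scikit-learn random-forest pipeline, `shap.TreeExplainer`, the 0.02 tolerance, and the variation metric (mean per-feature standard deviation of SHAP values across models) are all illustrative assumptions.

```python
# Sketch: retrain one pipeline many times to build an empirical Rashomon set
# of near-optimal models, explain each with SHAP, and reduce the spread of
# the explanations to a single underspecification index.
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_models = 20      # fixed size of the empirical Rashomon set (assumed)
tolerance = 0.02   # admit models within this margin of the best test R^2 (assumed)

# Step 1: iteratively train models with the same pipeline, varying only the seed.
models = [RandomForestRegressor(random_state=s).fit(X_train, y_train)
          for s in range(n_models)]
scores = [m.score(X_test, y_test) for m in models]

# Step 2: keep the near-optimal models -- the empirical Rashomon set.
best = max(scores)
rashomon = [m for m, s in zip(models, scores) if s >= best - tolerance]

# Step 3: SHAP attributions for every model on the same test points,
# stacked into an array of shape (model, sample, feature).
attributions = np.array([shap.TreeExplainer(m).shap_values(X_test)
                         for m in rashomon])

# Step 4: the index -- standard deviation of attributions across models,
# averaged over samples and features. A higher value means the equally
# performing models disagree more about *why* they predict as they do,
# i.e. the pipeline is more underspecified.
index = attributions.std(axis=0).mean()
print(f"Rashomon set size: {len(rashomon)}, underspecification index: {index:.4f}")
```

The abstract notes that a variety of metrics is used; any statistic summarising how much the stacked attribution array varies along the model axis could serve as the index in place of the one sketched here.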
| Published: | Swansea, 28 October 2022 |
|---|---|
| Institution: | Swansea University |
| Degree level: | Master of Research |
| Degree name: | MSc by Research |
| Supervisor: | Roggenbach, Markus; Fan, Xiuyi |
| Department: | School of Mathematics and Computer Science - Computer Science, Faculty of Science and Engineering |
| Keywords: | Underspecification, Explainable AI |
| ORCID: | https://orcid.org/0000-0002-4144-5757 |
| URI: | https://cronfa.swan.ac.uk/Record/cronfa61751 |
| Access: | Open-access PDF (Hinns_James_MSc_Research_Thesis_Final_Redacted_Signature.pdf). Copyright: The author, James Hinns, 2022. |