
E-Thesis

Quantifying Underspecification in Machine Learning with Explainable AI / James Hinns

Swansea University Author: James Hinns

Abstract

To evaluate a trained machine learning (ML) model’s performance, it is common practice to predict targets from a held-out test set. For such a dataset, various models can be constructed, with different reasoning, that produce near-optimal test performance. However, because of this variance in reasoning, some models generalise whilst others perform unexpectedly on further unseen data. The existence of multiple equally performing models exhibits underspecification of the ML pipeline used to produce them. Underspecification undermines the credibility of such test-performance evaluations and has been identified as a key reason why many models that perform well in testing exhibit poor performance in deployment.

In this work, we propose identifying underspecification by estimating the variance of reasoning within a set of near-optimal models produced by a pipeline, also called a Rashomon set. We iteratively train models using the same pipeline to produce an empirical Rashomon set of a fixed size. To quantify the variation of models within this Rashomon set, we measure the variation of the SHapley Additive exPlanations (SHAP) that the models produce, using a variety of metrics. This yields an index representing the variation of reasoning within the Rashomon set, and thus within the pipeline: the greater the variation, the greater the extent of underspecification the pipeline exhibits.

We provide an implementation of this approach and make it publicly available on GitHub. We validate that the implementation shows the trends we expect, using evaluation techniques previously used to demonstrate the existence of underspecification. Furthermore, we demonstrate our approach on multiple datasets drawn from the literature and in a COVID-19 virus transmission case study.
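The approach described in the abstract can be sketched in a few lines. The Python snippet below is a minimal illustration under stated assumptions, not the thesis’s published GitHub implementation: it retrains the same pipeline under different random seeds to form an empirical Rashomon set, computes SHAP attributions for each model, and summarises their disagreement as a single index. The dataset, the model class, and the particular variation metric (standard deviation of the mean-|SHAP| feature importances across models) are illustrative choices.

    # Hypothetical sketch: empirical Rashomon set + SHAP-based variation index.
    # Assumptions: scikit-learn and the shap package; the variation metric here
    # is one plausible choice, not necessarily the metric used in the thesis.
    import numpy as np
    import shap
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    importances = []
    for seed in range(20):  # retrain the same pipeline 20 times -> Rashomon set of size 20
        model = RandomForestRegressor(random_state=seed).fit(X_train, y_train)
        sv = shap.TreeExplainer(model).shap_values(X_test)  # (n_samples, n_features)
        importances.append(np.abs(sv).mean(axis=0))         # mean |SHAP| per feature

    importances = np.stack(importances)       # (n_models, n_features)
    index = importances.std(axis=0).mean()    # disagreement of reasoning across the set
    print(f"Underspecification index: {index:.4f}")

A larger index means that equally performing models attribute their predictions to different features, i.e. the pipeline is more underspecified; a pipeline whose retrainings all produce near-identical SHAP profiles would score close to zero.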


Published: Swansea, 2022
Publication date: 28 October 2022
Institution: Swansea University
Department: Computer Science, Faculty of Science and Engineering
Degree level: Master of Research
Degree name: MSc by Research
Supervisors: Roggenbach, Markus; Fan, Xiuyi
Keywords: Underspecification, Explainable AI
ORCiD: https://orcid.org/0000-0002-4144-5757
URI: https://cronfa.swan.ac.uk/Record/cronfa61751