E-Thesis
Quantifying Underspecification in Machine Learning with Explainable AI
Author: James Hinns (Swansea University)
Abstract
To evaluate a trained machine learning (ML) model's performance, it is general practice to test it by predicting targets from a held-out testing set. For such a dataset, various models with different reasoning can be constructed that all produce near-optimal test performance. However, due to this variance in reasoning, some models generalise whilst others perform unexpectedly on further unseen data. The existence of multiple equally performing models exhibits underspecification of the ML pipeline used to produce them. Underspecification undermines the credibility of such test-performance evaluations and has been identified as a key reason why many models that perform well in testing exhibit poor performance in deployment.

In this work, we propose identifying underspecification by estimating the variance of reasoning within a set of near-optimal models produced by a pipeline, also called a Rashomon set. We iteratively train models using the same pipeline to produce an empirical Rashomon set of a fixed size. To quantify the variation of models within this Rashomon set, we measure the variation of the SHapley Additive exPlanations (SHAP) the models produce, using a variety of metrics. This yields an index representing the variation of reasoning within the Rashomon set, and thus within the pipeline; the index therefore measures the extent of underspecification the pipeline exhibits.

We provide an implementation of this approach and make it publicly available on GitHub. We validate that the implementation shows the trends we expect, using evaluation techniques previously used to demonstrate the existence of underspecification. Furthermore, we demonstrate our approach on multiple datasets drawn from the literature and in a COVID-19 virus transmission case study.
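The abstract only specifies the procedure at a high level, so the following is a minimal sketch of the idea, not the thesis's GitHub implementation. The scikit-learn random-forest pipeline, `shap.TreeExplainer`, the 0.02 tolerance, and the variation metric (mean per-feature standard deviation of SHAP values across models) are all illustrative assumptions.

```python
# Sketch: retrain one pipeline many times to build an empirical Rashomon set
# of near-optimal models, explain each with SHAP, and reduce the spread of
# the explanations to a single underspecification index.
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_models = 20      # fixed size of the empirical Rashomon set (assumed)
tolerance = 0.02   # admit models within this margin of the best test R^2 (assumed)

# Step 1: iteratively train models with the same pipeline, varying only the seed.
models = [RandomForestRegressor(random_state=s).fit(X_train, y_train)
          for s in range(n_models)]
scores = [m.score(X_test, y_test) for m in models]

# Step 2: keep the near-optimal models -- the empirical Rashomon set.
best = max(scores)
rashomon = [m for m, s in zip(models, scores) if s >= best - tolerance]

# Step 3: SHAP attributions for every model on the same test points,
# stacked into an array of shape (model, sample, feature).
attributions = np.array([shap.TreeExplainer(m).shap_values(X_test)
                         for m in rashomon])

# Step 4: the index -- standard deviation of attributions across models,
# averaged over samples and features. A higher value means the equally
# performing models disagree more about *why* they predict as they do,
# i.e. the pipeline is more underspecified.
index = attributions.std(axis=0).mean()
print(f"Rashomon set size: {len(rashomon)}, underspecification index: {index:.4f}")
```

The abstract notes that a variety of metrics is used; any statistic summarising how much the stacked attribution array varies along the model axis could serve as the index in place of the one sketched here.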
| Published: | Swansea, 28 October 2022 |
|---|---|
| Institution: | Swansea University |
| Degree level: | Master of Research |
| Degree name: | MSc by Research |
| Supervisor: | Roggenbach, Markus; Fan, Xiuyi |
| Department: | School of Mathematics and Computer Science - Computer Science, Faculty of Science and Engineering |
| Keywords: | Underspecification, Explainable AI |
| ORCID: | https://orcid.org/0000-0002-4144-5757 |
| URI: | https://cronfa.swan.ac.uk/Record/cronfa61751 |
| Access: | Open-access PDF (Hinns_James_MSc_Research_Thesis_Final_Redacted_Signature.pdf). Copyright: The author, James Hinns, 2022. |