E-Thesis

Interpretable Machine Learning for Predictive Analytics with High-Fidelity Imbalanced Clinical Data / SURAJ RAMCHAND

Swansea University Author: SURAJ RAMCHAND

  • 2025_Ramchand_S.final.70069.pdf

    PDF | E-Thesis – open access

    Copyright: The author, Suraj N. Ramchand, 2025 Distributed under the terms of a Creative Commons Attribution 4.0 License (CC BY 4.0).

    Download (5.8MB)

DOI (Published version): 10.23889/SUThesis.70069

Abstract

Healthcare professionals frequently manage complex, under-represented clinical events that challenge established diagnostic and decision-making pathways. These include infrequently diagnosed conditions, early-stage deterioration, and diseases with heterogeneous manifestations, where low prevalence,...

Published: Swansea University, Wales, UK 2025
Institution: Swansea University
Degree level: Doctoral
Degree name: Ph.D
Supervisor: Xie, X., and Cole, D.
URI: https://cronfa.swan.ac.uk/Record/cronfa70069
first_indexed 2025-07-31T10:30:21Z
last_indexed 2025-08-01T14:33:59Z
id cronfa70069
recordtype RisThesis
fullrecord <?xml version="1.0"?><rfc1807><datestamp>2025-07-31T11:34:01.4497200</datestamp><bib-version>v2</bib-version><id>70069</id><entry>2025-07-31</entry><title>Interpretable Machine Learning for Predictive Analytics with High-Fidelity Imbalanced Clinical Data</title><swanseaauthors><author><sid>9193a8bef3c382a398ebcb287f59ed24</sid><firstname>SURAJ</firstname><surname>RAMCHAND</surname><name>SURAJ RAMCHAND</name><active>true</active><ethesisStudent>false</ethesisStudent></author></swanseaauthors><date>2025-07-31</date><abstract>Healthcare professionals frequently manage complex, under-represented clinical events that challenge established diagnostic and decision-making pathways. These include infrequently diagnosed conditions, early-stage deterioration, and diseases with heterogeneous manifestations, where low prevalence, variable presentation, and limited structured data introduce significant uncertainty. Clinicians apply deep expertise to interpret these cases, often under constraints of time, information, and documentation. This thesis examines how high-fidelity clinical data can be analysed using machine learning (ML) techniques to support clinical judgement. It explores how computational models might surface latent patterns, identify early risk signals, and make complex data more interpretable in settings defined by imbalance, variability, and diagnostic ambiguity. Across a series of empirical studies, the thesis addresses technical challenges common to rare or complex event prediction: class imbalance, missingness, temporal variation, and heterogeneity in clinical presentation. The first case study focuses on sepsis, a common but acutely time-sensitive condition. While sepsis is well recognised, the rapid progression of the disease limits the availability of temporally labelled pre-onset data.
Using intensive care unit (ICU) records, the thesis develops preprocessing pipelines and tests attention-based temporal models to support early warning in a moderately imbalanced context, where timely prediction is critical and observational data are sparse. With the advent of COVID-19, a fast-spreading and clinically disruptive condition, understanding the factors that lead to hospitalisation became a pressing research need. While early studies primarily focused on symptom onset and complications, less attention was paid to the early signals embedded in longitudinal health records. This study draws on primary care data that capture patients&#x2019; health trajectories over time, enabling analysis of events leading up to hospitalisation. Within this broader population dataset, hospital admission occurs in approximately 10% of cases, resulting in a naturally imbalanced outcome. Class-sensitive objectives and temporally structured features are applied to surface early risk markers that may inform triage decisions and support timely intervention. Building on the hospitalisation study, the thesis further explores how injecting structured clinical knowledge into models can improve the representation of infrequent manifestations within the same imbalanced primary care setting. Medical ontologies are integrated into language models to enhance the embedding of rare clinical terms and improve classification performance. This approach is applied to two additional prediction tasks using the same dataset. COVID-19-related mortality ( 1%) serves as a highly imbalanced clinical outcome for evaluating model sensitivity to low-signal, high-risk events, whilst stroke ( 30%) is used as a more common benchmark for assessing generalisability across conditions of varying prevalence and complexity.
Together, these tasks extend the modelling framework to uncover how structured knowledge improves the encoding of infrequent clinical features and contributes to more robust and generalisable prediction across diverse diagnostic contexts. Grounded in the clinical utility of differentiating hypertrophic cardiomyopathy (HCM) from Fabry disease, a rare metabolic disorder frequently misdiagnosed due to overlapping cardiac features, this final case study extends the thesis&#x2019;s exploration of imbalance to include disease heterogeneity and population-level rarity. A novel multimodal dataset was collected from hospital cardiac records, including echocardiography, ECG, and Holter monitoring data from genetically confirmed Fabry patients and matched HCM controls. Traditional ML classifiers trained on this dataset showed promising discriminatory ability using routinely acquired clinical measurements. These findings suggest that standard diagnostic tools may help raise earlier consideration of rare conditions in everyday practice when modelled with care and clinical insight. Together, these studies propose a generalisable approach to modelling rare and under-represented clinical events using ML. The thesis focuses on structuring and improving the quality and utility of clinical data, surfacing patterns of clinical relevance, and supporting decision-making in diagnostically uncertain environments. The final clinical interface is designed as a practical tool to assist interpretation and integrate predictions into clinical workflows and as a space to understand better how models behave in context. These contributions are made with deep appreciation for the complexity of clinical decision-making and the expertise of those who carry it out.
The work aims to support that expertise by offering tools and insights that are transparent, interpretable, and aligned with real-world care.</abstract><type>E-Thesis</type><journal/><volume/><journalNumber/><paginationStart/><paginationEnd/><publisher/><placeOfPublication>Swansea University, Wales, UK</placeOfPublication><isbnPrint/><isbnElectronic/><issnPrint/><issnElectronic/><keywords>Machine Learning, Rare Diseases, Data Imbalance, Large Language Models, Graph Neural Networks, Interpretable AI</keywords><publishedDay>9</publishedDay><publishedMonth>6</publishedMonth><publishedYear>2025</publishedYear><publishedDate>2025-06-09</publishedDate><doi>10.23889/SUThesis.70069</doi><url/><notes>A selection of content is redacted or is partially redacted from this thesis to protect sensitive and personal information.</notes><college>COLLEGE NANME</college><CollegeCode>COLLEGE CODE</CollegeCode><institution>Swansea University</institution><supervisor>Xie, X., and Cole, D.</supervisor><degreelevel>Doctoral</degreelevel><degreename>Ph.D</degreename><degreesponsorsfunders>EPSRC doctoral training grant in Enhancing Human Interactions and Collaborations with Data and Intelligence Driven Systems</degreesponsorsfunders><apcterm/><funders>EPSRC doctoral training grant in Enhancing Human Interactions and Collaborations with Data and Intelligence Driven Systems</funders><projectreference/><lastEdited>2025-07-31T11:34:01.4497200</lastEdited><Created>2025-07-31T11:21:35.0057669</Created><path><level id="1">Faculty of Science and Engineering</level><level id="2">School of Mathematics and Computer Science - Computer 
Science</level></path><authors><author><firstname>SURAJ</firstname><surname>RAMCHAND</surname><order>1</order></author></authors><documents><document><filename>70069__34880__d1783755d0144bdd9738c145846f9cf1.pdf</filename><originalFilename>2025_Ramchand_S.final.70069.pdf</originalFilename><uploaded>2025-07-31T11:29:47.3709806</uploaded><type>Output</type><contentLength>6078159</contentLength><contentType>application/pdf</contentType><version>E-Thesis &#x2013; open access</version><cronfaStatus>true</cronfaStatus><documentNotes>Copyright: The author, Suraj N. Ramchand, 2025 Distributed under the terms of a Creative Commons Attribution 4.0 License (CC BY 4.0).</documentNotes><copyrightCorrect>true</copyrightCorrect><language>eng</language><licence>https://creativecommons.org/licenses/by/4.0/</licence></document></documents><OutputDurs/></rfc1807>
title Interpretable Machine Learning for Predictive Analytics with High-Fidelity Imbalanced Clinical Data
author_id_str_mv 9193a8bef3c382a398ebcb287f59ed24
author_id_fullname_str_mv 9193a8bef3c382a398ebcb287f59ed24_***_SURAJ RAMCHAND
author SURAJ RAMCHAND
author2 SURAJ RAMCHAND
format E-Thesis
publishDate 2025
institution Swansea University
doi_str_mv 10.23889/SUThesis.70069
college_str Faculty of Science and Engineering
hierarchytype
hierarchy_top_id facultyofscienceandengineering
hierarchy_top_title Faculty of Science and Engineering
hierarchy_parent_id facultyofscienceandengineering
hierarchy_parent_title Faculty of Science and Engineering
department_str School of Mathematics and Computer Science - Computer Science{{{_:::_}}}Faculty of Science and Engineering{{{_:::_}}}School of Mathematics and Computer Science - Computer Science
document_store_str 1
active_str 0
description Healthcare professionals frequently manage complex, under-represented clinical events that challenge established diagnostic and decision-making pathways. These include infrequently diagnosed conditions, early-stage deterioration, and diseases with heterogeneous manifestations, where low prevalence, variable presentation, and limited structured data introduce significant uncertainty. Clinicians apply deep expertise to interpret these cases, often under constraints of time, information, and documentation. This thesis examines how high-fidelity clinical data can be analysed using machine learning (ML) techniques to support clinical judgement. It explores how computational models might surface latent patterns, identify early risk signals, and make complex data more interpretable in settings defined by imbalance, variability, and diagnostic ambiguity. Across a series of empirical studies, the thesis addresses technical challenges common to rare or complex event prediction: class imbalance, missingness, temporal variation, and heterogeneity in clinical presentation. The first case study focuses on sepsis, a common but acutely time-sensitive condition. While sepsis is well recognised, the rapid progression of the disease limits the availability of temporally labelled pre-onset data. Using intensive care unit (ICU) records, the thesis develops preprocessing pipelines and tests attention-based temporal models to support early warning in a moderately imbalanced context, where timely prediction is critical and observational data are sparse. With the advent of COVID-19, a fast-spreading and clinically disruptive condition, understanding the factors that lead to hospitalisation became a pressing research need. While early studies primarily focused on symptom onset and complications, less attention was paid to the early signals embedded in longitudinal health records.
This study draws on primary care data that capture patients’ health trajectories over time, enabling analysis of events leading up to hospitalisation. Within this broader population dataset, hospital admission occurs in approximately 10% of cases, resulting in a naturally imbalanced outcome. Class-sensitive objectives and temporally structured features are applied to surface early risk markers that may inform triage decisions and support timely intervention. Building on the hospitalisation study, the thesis further explores how injecting structured clinical knowledge into models can improve the representation of infrequent manifestations within the same imbalanced primary care setting. Medical ontologies are integrated into language models to enhance the embedding of rare clinical terms and improve classification performance. This approach is applied to two additional prediction tasks using the same dataset. COVID-19-related mortality ( 1%) serves as a highly imbalanced clinical outcome for evaluating model sensitivity to low-signal, high-risk events, whilst stroke ( 30%) is used as a more common benchmark for assessing generalisability across conditions of varying prevalence and complexity. Together, these tasks extend the modelling framework to uncover how structured knowledge improves the encoding of infrequent clinical features and contributes to more robust and generalisable prediction across diverse diagnostic contexts. Grounded in the clinical utility of differentiating hypertrophic cardiomyopathy (HCM) from Fabry disease, a rare metabolic disorder frequently misdiagnosed due to overlapping cardiac features, this final case study extends the thesis’s exploration of imbalance to include disease heterogeneity and population-level rarity. A novel multimodal dataset was collected from hospital cardiac records, including echocardiography, ECG, and Holter monitoring data from genetically confirmed Fabry patients and matched HCM controls.
Traditional ML classifiers trained on this dataset showed promising discriminatory ability using routinely acquired clinical measurements. These findings suggest that standard diagnostic tools may help raise earlier consideration of rare conditions in everyday practice when modelled with care and clinical insight. Together, these studies propose a generalisable approach to modelling rare and under-represented clinical events using ML. The thesis focuses on structuring and improving the quality and utility of clinical data, surfacing patterns of clinical relevance, and supporting decision-making in diagnostically uncertain environments. The final clinical interface is designed as a practical tool to assist interpretation and integrate predictions into clinical workflows and as a space to understand better how models behave in context. These contributions are made with deep appreciation for the complexity of clinical decision-making and the expertise of those who carry it out. The work aims to support that expertise by offering tools and insights that are transparent, interpretable, and aligned with real-world care.
published_date 2025-06-09T05:29:52Z