Interpretable Machine Learning for Predictive Analytics with High-Fidelity Imbalanced Clinical Data / SURAJ RAMCHAND

Swansea University Author: SURAJ RAMCHAND

  • 2025_Ramchand_S.final.70069.pdf

    PDF | E-Thesis – open access

    Copyright: The author, Suraj N. Ramchand, 2025. Distributed under the terms of a Creative Commons Attribution 4.0 License (CC BY 4.0).

DOI (Published version): 10.23889/SUThesis.70069

Published: Swansea University, Wales, UK 2025
Institution: Swansea University
Degree level: Doctoral
Degree name: Ph.D.
Supervisor: Xie, X., and Cole, D.
URI: https://cronfa.swan.ac.uk/Record/cronfa70069
Abstract: Healthcare professionals frequently manage complex, under-represented clinical events that challenge established diagnostic and decision-making pathways. These include infrequently diagnosed conditions, early-stage deterioration, and diseases with heterogeneous manifestations, where low prevalence, variable presentation, and limited structured data introduce significant uncertainty. Clinicians apply deep expertise to interpret these cases, often under constraints of time, information, and documentation. This thesis examines how high-fidelity clinical data can be analysed using machine learning (ML) techniques to support clinical judgement. It explores how computational models might surface latent patterns, identify early risk signals, and make complex data more interpretable in settings defined by imbalance, variability, and diagnostic ambiguity. Across a series of empirical studies, the thesis addresses technical challenges common to rare or complex event prediction: class imbalance, missingness, temporal variation, and heterogeneity in clinical presentation.
The first case study focuses on sepsis, a common but acutely time-sensitive condition. While sepsis is well recognised, the rapid progression of the disease limits the availability of temporally labelled pre-onset data. Using intensive care unit (ICU) records, the thesis develops preprocessing pipelines and tests attention-based temporal models to support early warning in a moderately imbalanced context, where timely prediction is critical and observational data are sparse.
With the advent of COVID-19, a fast-spreading and clinically disruptive condition, understanding the factors that lead to hospitalisation became a pressing research need. While early studies primarily focused on symptom onset and complications, less attention was paid to the early signals embedded in longitudinal health records. This study draws on primary care data that capture patients’ health trajectories over time, enabling analysis of events leading up to hospitalisation. Within this broader population dataset, hospital admission occurs in approximately 10% of cases, resulting in a naturally imbalanced outcome. Class-sensitive objectives and temporally structured features are applied to surface early risk markers that may inform triage decisions and support timely intervention.
Building on the hospitalisation study, the thesis further explores how injecting structured clinical knowledge into models can improve the representation of infrequent manifestations within the same imbalanced primary care setting. Medical ontologies are integrated into language models to enhance the embedding of rare clinical terms and improve classification performance. This approach is applied to two additional prediction tasks using the same dataset. COVID-19-related mortality (1%) serves as a highly imbalanced clinical outcome for evaluating model sensitivity to low-signal, high-risk events, whilst stroke (30%) is used as a more common benchmark for assessing generalisability across conditions of varying prevalence and complexity. Together, these tasks extend the modelling framework to uncover how structured knowledge improves the encoding of infrequent clinical features and contributes to more robust and generalisable prediction across diverse diagnostic contexts.
Grounded in the clinical utility of differentiating hypertrophic cardiomyopathy (HCM) from Fabry disease, a rare metabolic disorder frequently misdiagnosed due to overlapping cardiac features, this final case study extends the thesis’s exploration of imbalance to include disease heterogeneity and population-level rarity. A novel multimodal dataset was collected from hospital cardiac records, including echocardiography, ECG, and Holter monitoring data from genetically confirmed Fabry patients and matched HCM controls. Traditional ML classifiers trained on this dataset showed promising discriminatory ability using routinely acquired clinical measurements. These findings suggest that standard diagnostic tools may help raise earlier consideration of rare conditions in everyday practice when modelled with care and clinical insight.
Together, these studies propose a generalisable approach to modelling rare and under-represented clinical events using ML. The thesis focuses on structuring and improving the quality and utility of clinical data, surfacing patterns of clinical relevance, and supporting decision-making in diagnostically uncertain environments. The final clinical interface is designed both as a practical tool to assist interpretation and integrate predictions into clinical workflows and as a space to better understand how models behave in context. These contributions are made with deep appreciation for the complexity of clinical decision-making and the expertise of those who carry it out. The work aims to support that expertise by offering tools and insights that are transparent, interpretable, and aligned with real-world care.
Item Description: A selection of content is redacted or is partially redacted from this thesis to protect sensitive and personal information.
Keywords: Machine Learning, Rare Diseases, Data Imbalance, Large Language Models, Graph Neural Networks, Interpretable AI
College: Faculty of Science and Engineering
Funders: EPSRC doctoral training grant in Enhancing Human Interactions and Collaborations with Data and Intelligence Driven Systems