
Journal article

Mining Primary Care Electronic Health Records for Automatic Disease Phenotyping: A Transparent Machine Learning Framework

Fabiola Fernandez-Gutierrez, Jonathan Kennedy, Roxanne Cooksey, Mark Atkinson, Ernest Choy, Sinead Brophy, Lin Huo, Shang-ming Zhou

Diagnostics, Volume: 11, Issue: 10, Start page: 1908

Swansea University Authors: Fabiola Fernandez-Gutierrez, Jonathan Kennedy, Roxanne Cooksey, Mark Atkinson, Sinead Brophy, Shang-ming Zhou

  • diagnostics-11-01908.pdf

    PDF | Version of Record

    Copyright: © 2021 by the authors. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license

    Download (1.72MB)

Abstract

(1) Background: We aimed to develop a transparent machine-learning (ML) framework to automatically identify patients with a condition from electronic health records (EHRs) via a parsimonious set of features. (2) Methods: We linked multiple sources of EHRs, including 917,496,869 primary care records...


Published in: Diagnostics
ISSN: 2075-4418
Published: MDPI AG 2021

URI: https://cronfa.swan.ac.uk/Record/cronfa58381
first_indexed 2021-10-18T09:36:47Z
last_indexed 2021-11-04T04:24:41Z
id cronfa58381
recordtype SURis
fullrecord <?xml version="1.0"?><rfc1807><datestamp>2021-11-03T16:23:36.0989205</datestamp><bib-version>v2</bib-version><id>58381</id><entry>2021-10-18</entry><title>Mining Primary Care Electronic Health Records for Automatic Disease Phenotyping: A Transparent Machine Learning Framework</title><swanseaauthors><author><sid>8a4f37e624a83c0b3d22a8b0e37aa149</sid><firstname>Fabiola</firstname><surname>Fernandez-Gutierrez</surname><name>Fabiola Fernandez-Gutierrez</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>08163d1f58d7fefcb1c695bcc2e0ef68</sid><ORCID/><firstname>Jonathan</firstname><surname>Kennedy</surname><name>Jonathan Kennedy</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>df63826249b712dcb03cb0161d0f3daf</sid><ORCID>0000-0002-6763-9373</ORCID><firstname>Roxanne</firstname><surname>Cooksey</surname><name>Roxanne Cooksey</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>8f85ae301cc97a48eaf58fe343c5a797</sid><ORCID>0000-0003-4237-3588</ORCID><firstname>Mark</firstname><surname>Atkinson</surname><name>Mark Atkinson</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>84f5661b35a729f55047f9e793d8798b</sid><ORCID>0000-0001-7417-2858</ORCID><firstname>Sinead</firstname><surname>Brophy</surname><name>Sinead Brophy</name><active>true</active><ethesisStudent>false</ethesisStudent></author><author><sid>118578a62021ba8ef61398da0a8750da</sid><ORCID>0000-0002-0719-9353</ORCID><firstname>Shang-ming</firstname><surname>Zhou</surname><name>Shang-ming Zhou</name><active>true</active><ethesisStudent>false</ethesisStudent></author></swanseaauthors><date>2021-10-18</date><deptcode>PMSC</deptcode><abstract>(1) Background: We aimed to develop a transparent machine-learning (ML) framework to automatically identify patients with a condition from electronic health records (EHRs) via a parsimonious set of features. 
(2) Methods: We linked multiple sources of EHRs, including 917,496,869 primary care records and 40,656,805 secondary care records and 694,954 records from specialist surgeries between 2002 and 2012, to generate a unique dataset. Then, we treated patient identification as a problem of text classification and proposed a transparent disease-phenotyping framework. This framework comprises a generation of patient representation, feature selection, and optimal phenotyping algorithm development to tackle the imbalanced nature of the data. This framework was extensively evaluated by identifying rheumatoid arthritis (RA) and ankylosing spondylitis (AS). (3) Results: Being applied to the linked dataset of 9657 patients with 1484 cases of rheumatoid arthritis (RA) and 204 cases of ankylosing spondylitis (AS), this framework achieved accuracy and positive predictive values of 86.19% and 88.46%, respectively, for RA and 99.23% and 97.75% for AS, comparable with expert knowledge-driven methods. (4) Conclusions: This framework could potentially be used as an efficient tool for identifying patients with a condition of interest from EHRs, helping clinicians in clinical decision-support process.</abstract><type>Journal Article</type><journal>Diagnostics</journal><volume>11</volume><journalNumber>10</journalNumber><paginationStart>1908</paginationStart><paginationEnd/><publisher>MDPI AG</publisher><placeOfPublication/><isbnPrint/><isbnElectronic/><issnPrint/><issnElectronic>2075-4418</issnElectronic><keywords>phenotyping, rheumatology, cohort identification, electronic health records, feature selection, transparent machine learning, text mining, big data, artificial intelligence</keywords><publishedDay>15</publishedDay><publishedMonth>10</publishedMonth><publishedYear>2021</publishedYear><publishedDate>2021-10-15</publishedDate><doi>10.3390/diagnostics11101908</doi><url/><notes/><college>COLLEGE NANME</college><department>Medicine</department><CollegeCode>COLLEGE 
CODE</CollegeCode><DepartmentCode>PMSC</DepartmentCode><institution>Swansea University</institution><apcterm/><funders>The authors acknowledge the supports from the Farr Institute of Health Informatics Research (MR/K006525/1) and Health Data Research UK (NIWA1). This research was also supported by &#x201C;Major Project of National Social Science Foundation of China (16ZDA0092)&#x201D; and &#x201C;Guangxi University &#x2018;Digital ASEAN Cloud Big Data Security and Mining Technology&#x2019; Innovation Team&#x201D;</funders><lastEdited>2021-11-03T16:23:36.0989205</lastEdited><Created>2021-10-18T10:30:01.8124969</Created><path><level id="1">Faculty of Medicine, Health and Life Sciences</level><level id="2">Swansea University Medical School - Medicine</level></path><authors><author><firstname>Fabiola</firstname><surname>Fernandez-Gutierrez</surname><order>1</order></author><author><firstname>Jonathan</firstname><surname>Kennedy</surname><orcid/><order>2</order></author><author><firstname>Roxanne</firstname><surname>Cooksey</surname><orcid>0000-0002-6763-9373</orcid><order>3</order></author><author><firstname>Mark</firstname><surname>Atkinson</surname><orcid>0000-0003-4237-3588</orcid><order>4</order></author><author><firstname>Ernest</firstname><surname>Choy</surname><order>5</order></author><author><firstname>Sinead</firstname><surname>Brophy</surname><orcid>0000-0001-7417-2858</orcid><order>6</order></author><author><firstname>Lin</firstname><surname>Huo</surname><order>7</order></author><author><firstname>Shang-ming</firstname><surname>Zhou</surname><orcid>0000-0002-0719-9353</orcid><order>8</order></author></authors><documents><document><filename>58381__21201__52d3826cd99047d8894372e1ca6a5678.pdf</filename><originalFilename>diagnostics-11-01908.pdf</originalFilename><uploaded>2021-10-18T10:30:01.8102720</uploaded><type>Output</type><contentLength>1798726</contentLength><contentType>application/pdf</contentType><version>Version of 
Record</version><cronfaStatus>true</cronfaStatus><documentNotes>Copyright: &#xA9; 2021 by the authors. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license</documentNotes><copyrightCorrect>true</copyrightCorrect><language>eng</language><licence>https://creativecommons.org/licenses/by/4.0/</licence></document></documents><OutputDurs/></rfc1807>
title Mining Primary Care Electronic Health Records for Automatic Disease Phenotyping: A Transparent Machine Learning Framework
spellingShingle Mining Primary Care Electronic Health Records for Automatic Disease Phenotyping: A Transparent Machine Learning Framework
Fabiola Fernandez-Gutierrez
Jonathan Kennedy
Roxanne Cooksey
Mark Atkinson
Sinead Brophy
Shang-ming Zhou
author_id_str_mv 8a4f37e624a83c0b3d22a8b0e37aa149
08163d1f58d7fefcb1c695bcc2e0ef68
df63826249b712dcb03cb0161d0f3daf
8f85ae301cc97a48eaf58fe343c5a797
84f5661b35a729f55047f9e793d8798b
118578a62021ba8ef61398da0a8750da
author_id_fullname_str_mv 8a4f37e624a83c0b3d22a8b0e37aa149_***_Fabiola Fernandez-Gutierrez
08163d1f58d7fefcb1c695bcc2e0ef68_***_Jonathan Kennedy
df63826249b712dcb03cb0161d0f3daf_***_Roxanne Cooksey
8f85ae301cc97a48eaf58fe343c5a797_***_Mark Atkinson
84f5661b35a729f55047f9e793d8798b_***_Sinead Brophy
118578a62021ba8ef61398da0a8750da_***_Shang-ming Zhou
author Fabiola Fernandez-Gutierrez
Jonathan Kennedy
Roxanne Cooksey
Mark Atkinson
Sinead Brophy
Shang-ming Zhou
author2 Fabiola Fernandez-Gutierrez
Jonathan Kennedy
Roxanne Cooksey
Mark Atkinson
Ernest Choy
Sinead Brophy
Lin Huo
Shang-ming Zhou
format Journal article
container_title Diagnostics
container_volume 11
container_issue 10
container_start_page 1908
publishDate 2021
institution Swansea University
issn 2075-4418
doi_str_mv 10.3390/diagnostics11101908
publisher MDPI AG
college_str Faculty of Medicine, Health and Life Sciences
hierarchytype
hierarchy_top_id facultyofmedicinehealthandlifesciences
hierarchy_top_title Faculty of Medicine, Health and Life Sciences
hierarchy_parent_id facultyofmedicinehealthandlifesciences
hierarchy_parent_title Faculty of Medicine, Health and Life Sciences
department_str Swansea University Medical School - Medicine{{{_:::_}}}Faculty of Medicine, Health and Life Sciences{{{_:::_}}}Swansea University Medical School - Medicine
document_store_str 1
active_str 0
description (1) Background: We aimed to develop a transparent machine-learning (ML) framework to automatically identify patients with a condition from electronic health records (EHRs) via a parsimonious set of features. (2) Methods: We linked multiple sources of EHRs, including 917,496,869 primary care records, 40,656,805 secondary care records, and 694,954 records from specialist surgeries between 2002 and 2012, to generate a unique dataset. We then treated patient identification as a text classification problem and proposed a transparent disease-phenotyping framework. The framework comprises generation of a patient representation, feature selection, and development of an optimal phenotyping algorithm that addresses the imbalanced nature of the data. The framework was extensively evaluated by identifying rheumatoid arthritis (RA) and ankylosing spondylitis (AS). (3) Results: Applied to the linked dataset of 9657 patients with 1484 cases of RA and 204 cases of AS, the framework achieved accuracy and positive predictive values of 86.19% and 88.46%, respectively, for RA and 99.23% and 97.75% for AS, comparable with expert knowledge-driven methods. (4) Conclusions: This framework could potentially be used as an efficient tool for identifying patients with a condition of interest from EHRs, helping clinicians in the clinical decision-support process.
published_date 2021-10-15T04:14:52Z
_version_ 1763754002422431744
score 11.037166