A framework for employing longitudinally collected multicenter electronic health records to stratify heterogeneous patient populations on disease history

Journal Article (2022)

Author(s)

Marc P. Maurits (Leiden University Medical Center)

Ilya Korsunsky (Harvard Medical School)

Soumya Raychaudhuri (Harvard Medical School)

Shawn N. Murphy (Mass General Brigham, Boston)

Jordan W. Smoller (Massachusetts General Hospital)

Scott T. Weiss (Harvard Medical School)

Thomas W.J. Huizinga (Leiden University Medical Center)

Marcel J.T. Reinders (TU Delft - Pattern Recognition and Bioinformatics, Delft Bioinformatics Lab, Leiden University Medical Center)

Erik B. Van Den Akker (Leiden University Medical Center)

More authors (External organisation)

Research Group

Pattern Recognition and Bioinformatics

Clustering, Electronic health records, Electronic medical records, EMERGE, ICD, PhenoGraph

DOI related publication

https://doi.org/10.1093/jamia/ocac008 Final published version

To reference this document use

https://resolver.tudelft.nl/uuid:3ab2510b-bbe1-4bee-895b-fdd49b2f5a00

Publication Year

2022

Language

English

Issue number

5

Volume number

29

Pages (from-to)

761-769

Downloads counter

215

Collections

Institutional Repository

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Objective: To facilitate patient disease subset and risk factor identification by constructing a pipeline that is generalizable, provides easily interpretable results, and allows replication by overcoming electronic health record (EHR) batch effects.

Material and Methods: We used 1872 billing codes in EHRs of 102 880 patients from 12 healthcare systems. Using tools borrowed from single-cell omics, we mitigated center-specific batch effects and performed clustering to identify patients with highly similar medical history patterns across the various centers. Our visualization method (PheSpec) depicts the phenotypic profile of clusters, applies a novel filtering of noninformative codes (Ranked Scope Pervasion), and indicates the most distinguishing features.

Results: We observed 114 clinically meaningful profiles, for example, linking prostate hyperplasia with cancer and diabetes with cardiovascular problems and grouping pediatric developmental disorders. Our framework identified disease subsets, exemplified by 6 "other headache" clusters, where phenotypic profiles suggested different underlying mechanisms: migraine, convulsion, injury, eye problems, joint pain, and pituitary gland disorders. Phenotypic patterns replicated well, with high correlations of ≥0.75 to an average of 6 (2-8) of the 12 different cohorts, demonstrating the consistency with which our method discovers disease history profiles.

Discussion: Costly clinical research ventures should be based on solid hypotheses. We repurpose methods from single-cell omics to build these hypotheses from observational EHR data, distilling useful information from complex data.

Conclusion: We establish a generalizable pipeline for the identification and replication of clinically meaningful (sub)phenotypes from widely available high-dimensional billing codes. This approach overcomes datatype problems and produces comprehensive visualizations of validation-ready phenotypes.
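The replication criterion mentioned in the Results (a correlation of ≥0.75 between phenotypic patterns across cohorts) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of Pearson correlation, the representation of a profile as a list of per-code frequencies, the function names, and the toy numbers are all assumptions made for illustration. Only the 0.75 threshold comes from the abstract.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile has no pattern to correlate
    return cov / sqrt(var_x * var_y)


def replicating_cohorts(reference_profile, cohort_profiles, threshold=0.75):
    """Names of cohorts whose profile correlates with the reference
    at or above the threshold (0.75 is the cutoff quoted in the abstract)."""
    return [
        name
        for name, profile in cohort_profiles.items()
        if pearson(reference_profile, profile) >= threshold
    ]


# Hypothetical toy data: frequency of four billing codes within one cluster,
# as seen in the reference data and in three invented cohorts.
reference = [0.9, 0.1, 0.7, 0.05]
cohorts = {
    "cohort_a": [0.8, 0.2, 0.6, 0.1],
    "cohort_b": [0.1, 0.9, 0.2, 0.8],
    "cohort_c": [0.85, 0.05, 0.75, 0.0],
}

replicated = replicating_cohorts(reference, cohorts)
print(replicated)
```

In this toy setup the cluster profile would count as replicated in two of the three cohorts. The published analysis may compare profiles differently (for example, on filtered or ranked codes), so the sketch conveys only the general idea of thresholded profile correlation.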

Files

Ocac008.pdf

(pdf | 0.866 Mb)