Abstract

Progress in precision medicine research hinges on the gathering and analysis of extensive and diverse clinical datasets. As the modalities, scales, and sources of clinical datasets continue to expand, it becomes imperative to devise methods for aggregating information from these varied sources to achieve a comprehensive understanding of diseases. In this review, we describe two important approaches for the analysis of diverse clinical datasets, namely the centralized model and the federated model. We compare and contrast the strengths and weaknesses inherent in each model and present recent progress in methodologies and their associated challenges. Finally, we present an outlook on the opportunities that both models hold for the future analysis of clinical data.

/content/journals/10.1146/annurev-biodatasci-122220-115746
2024-08-23
2025-01-10

  • Article Type: Review Article