Annual Review of Biomedical Data Science - Volume 5, 2022

- Phenotypic Causal Inference Using Genome-Wide Association Study Data: Mendelian Randomization and Beyond
  
  Venexia M. Walker, Jie Zheng, Tom R. Gaunt, and George Davey Smith
  
  Vol. 5 (2022), pp. 1–17
  
  https://doi.org/10.1146/annurev-biodatasci-122120-024910
  
  Summary statistics for genome-wide association studies (GWAS) are increasingly available for downstream analyses. Meanwhile, the popularity of causal inference methods has grown as we look to gather robust evidence for novel medical and public health interventions. This has led to the development of methods that use GWAS summary statistics for causal inference. Here, we describe these methods in order of their escalating complexity, from genetic associations to extensions of Mendelian randomization that consider thousands of phenotypes simultaneously. We also cover the assumptions and limitations of these approaches before considering the challenges faced by researchers performing causal inference using GWAS data. GWAS summary statistics constitute an important data source for causal inference research that offers a counterpoint to nongenetic methods when triangulating evidence. Continued efforts to address the challenges in using GWAS data for causal inference will allow the full impact of these approaches to be realized.
  
- Static and Motion Facial Analysis for Craniofacial Assessment and Diagnosing Diseases
  
  Harold Matthews, Guido de Jong, Thomas Maal, and Peter Claes
  
  Vol. 5 (2022), pp. 19–42
  
  https://doi.org/10.1146/annurev-biodatasci-122120-111413
  
  Deviation from a normal facial shape and symmetry can arise from numerous sources, including physical injury and congenital birth defects. Such abnormalities can have important aesthetic and functional consequences. Furthermore, in clinical genetics distinctive facial appearances are often associated with clinical or genetic diagnoses; the recognition of a characteristic facial appearance can substantially narrow the search space of potential diagnoses for the clinician. Unusual patterns of facial movement and expression can indicate disturbances to normal mechanical functioning or emotional affect. Computational analyses of static and moving 2D and 3D images can serve clinicians and researchers by detecting and describing facial structural, mechanical, and affective abnormalities objectively. In this review we survey traditional and emerging methods of facial analysis, including statistical shape modeling, syndrome classification, modeling clinical face phenotype spaces, and analysis of facial motion and affect.
  
- Machine Learning in Chemoinformatics and Medicinal Chemistry
  
  Raquel Rodríguez-Pérez, Filip Miljković, and Jürgen Bajorath
  
  Vol. 5 (2022), pp. 43–65
  
  https://doi.org/10.1146/annurev-biodatasci-122120-124216
  
  In chemoinformatics and medicinal chemistry, machine learning has evolved into an important approach. In recent years, increasing computational resources and new deep learning algorithms have put machine learning onto a new level, addressing previously unmet challenges in pharmaceutical research. In silico approaches for compound activity predictions, de novo design, and reaction modeling have been further advanced by new algorithmic developments and the emergence of big data in the field. Herein, novel applications of machine learning and deep learning in chemoinformatics and medicinal chemistry are reviewed. Opportunities and challenges for new methods and applications are discussed, placing emphasis on proper baseline comparisons, robust validation methodologies, and new applicability domains.
  
- Cotranslational Mechanisms of Protein Biogenesis and Complex Assembly in Eukaryotes
  
  Fabián Morales-Polanco, Jae Ho Lee, Natália M. Barbosa, and Judith Frydman
  
  Vol. 5 (2022), pp. 67–94
  
  https://doi.org/10.1146/annurev-biodatasci-121721-095858
  
  The formation of protein complexes is crucial to most biological functions. The cellular mechanisms governing protein complex biogenesis are not yet well understood, but some principles of cotranslational and posttranslational assembly are beginning to emerge. In bacteria, this process is favored by operons encoding subunits of protein complexes. Eukaryotic cells do not have polycistronic mRNAs, raising the question of how they orchestrate the encounter of unassembled subunits. Here we review the constraints and mechanisms governing eukaryotic co- and posttranslational protein folding and assembly, including the influence of elongation rate on nascent chain targeting, folding, and chaperone interactions. Recent evidence shows that mRNAs encoding subunits of oligomeric assemblies can undergo localized translation and form cytoplasmic condensates that might facilitate the assembly of protein complexes. Understanding the interplay between localized mRNA translation and cotranslational proteostasis will be critical to defining protein complex assembly in vivo.
  
- Open Structural Data in Precision Medicine
  
  Ruth Nussinov, Hyunbum Jang, Guy Nir, Chung-Jung Tsai, and Feixiong Cheng
  
  Vol. 5 (2022), pp. 95–117
  
  https://doi.org/10.1146/annurev-biodatasci-122220-012951
  
  Three-dimensional protein structural data at the molecular level are pivotal for successful precision medicine. Such data are crucial not only for discovering drugs that act to block the active site of the target mutant protein but also for clarifying to the patient and the clinician how the mutations harbored by the patient work. The relative paucity of structural data reflects their cost, challenges in their interpretation, and lack of clinical guidelines for their utilization. Rapid technological advancements in experimental high-resolution structural determination increasingly generate structures. Computationally, modeling algorithms, including molecular dynamics simulations, are becoming more powerful, as are compute-intensive hardware, particularly graphics processing units, overlapping with the inception of the exascale era. Accessible, freely available, and detailed structural and dynamical data can be merged with big data to powerfully transform personalized pharmacology. Here we review protein and emerging genome high-resolution data, along with means, applications, and examples underscoring their usefulness in precision medicine.
  
- Functional Characterization of Genetic Variant Effects on Expression
  
  Elise D. Flynn and Tuuli Lappalainen
  
  Vol. 5 (2022), pp. 119–139
  
  https://doi.org/10.1146/annurev-biodatasci-122120-010010
  
  Thousands of common genetic variants in the human population have been associated with disease risk and phenotypic variation by genome-wide association studies (GWAS). However, the majority of GWAS variants fall into noncoding regions of the genome, complicating our understanding of their regulatory functions, and few molecular mechanisms of GWAS variant effects have been clearly elucidated. Here, we set out to review genetic variant effects, focusing on expression quantitative trait loci (eQTLs), including their utility in interpreting GWAS variant mechanisms. We discuss the interrelated challenges and opportunities for eQTL analysis, covering determining causal variants, elucidating molecular mechanisms of action, and understanding context variability. Addressing these questions can enable better functional characterization of disease-associated loci and provide insights into fundamental biological questions of the noncoding genetic regulatory code and its control of gene expression.
  
- Integration of Protein Structure and Population-Scale DNA Sequence Data for Disease Gene Discovery and Variant Interpretation
  
  Bian Li, Bowen Jin, John A. Capra, and William S. Bush
  
  Vol. 5 (2022), pp. 141–161
  
  https://doi.org/10.1146/annurev-biodatasci-122220-112147
  
  The experimental and computational techniques for capturing information about protein structures and genetic variation within the human genome have advanced dramatically in the past 20 years, generating extensive new data resources. In this review, we discuss these advances, along with new approaches for determining the impact a genetic variant has on protein function. We focus on the potential of new methods that integrate human genetic variation into protein structures to discover relationships to disease, including the discovery of mutational hotspots in cancer-related proteins, the localization of protein-altering variants within protein regions for common complex diseases, and the assessment of variants of unknown significance for Mendelian traits. We expect that approaches that integrate these data sources will play increasingly important roles in disease gene discovery and variant interpretation.
  
- Genome Privacy and Trust
  
  Gamze Gürsoy
  
  Vol. 5 (2022), pp. 163–181
  
  https://doi.org/10.1146/annurev-biodatasci-122120-021311
  
  Genomics data are important for advancing biomedical research, improving clinical care, and informing other disciplines such as forensics and genealogy. However, privacy concerns arise when genomic data are shared. In particular, the identifying nature of genetic information, its direct relationship to health status, and the potential financial harm and stigmatization posed to individuals and their blood relatives call for a survey of the privacy issues related to sharing genetic and related data and potential solutions to overcome these issues. In this work, we provide an overview of the importance of genomic privacy, the information gleaned from genomics data, the sources of potential private information leakages in genomics, and ways to preserve privacy while utilizing the genetic information in research. We discuss the relationship between trust in the scientific community and protecting privacy, illuminating a future roadmap for data sharing and study participation.
  
- Computational Approaches for Understanding Sequence Variation Effects on the 3D Genome Architecture
  
  Pavel Avdeyev and Jian Zhou
  
  Vol. 5 (2022), pp. 183–204
  
  https://doi.org/10.1146/annurev-biodatasci-102521-012018
  
  Decoding how genomic sequence and its variations affect 3D genome architecture is indispensable for understanding the genetic architecture of various traits and diseases. The 3D genome organization can be significantly altered by genome variations and in turn impact the function of the genomic sequence. Techniques for measuring the 3D genome architecture across spatial scales have opened up new possibilities for understanding how the 3D genome depends upon the genomic sequence and how it can be altered by sequence variations. Computational methods have become instrumental in analyzing and modeling the sequence effects on 3D genome architecture, and recent developments in deep learning sequence models have opened up new opportunities for studying the interplay between sequence variations and the 3D genome. In this review, we focus on computational approaches for both the detection and modeling of sequence variation effects on the 3D genome, and we discuss the opportunities presented by these approaches.
  
- Bioinformatics of Corals: Investigating Heterogeneous Omics Data from Coral Holobionts for Insight into Reef Health and Resilience
  
  Lenore J. Cowen and Hollie M. Putnam
  
  Vol. 5 (2022), pp. 205–231
  
  https://doi.org/10.1146/annurev-biodatasci-122120-030732
  
  Coral reefs are home to over two million species and provide habitat for roughly 25% of all marine animals, but they are being severely threatened by pollution and climate change. A large amount of genomic, transcriptomic, and other omics data is becoming increasingly available from different species of reef-building corals, the unicellular dinoflagellates, and the coral microbiome (bacteria, archaea, viruses, fungi, etc.). Such new data present an opportunity for bioinformatics researchers and computational biologists to contribute to a timely, compelling, and urgent investigation of critical factors that influence reef health and resilience.
  
- Exchange of Human Data Across International Boundaries
  
  Heidi Beate Bentzen
  
  Vol. 5 (2022), pp. 233–250
  
  https://doi.org/10.1146/annurev-biodatasci-122220-110811
  
  There is a need to share personal data across jurisdictional boundaries. However, the laws regulating such transfers are not harmonized, and sometimes even conflict, causing challenges and occasional data stalls. This review describes the legal landscape for transfer of human data across international boundaries. The European Union's data protection legislation is used as the starting point for illustrating the legislation of countries across the world, how these diverge, and one's options for exchanging human data internationally in a legally compliant manner.
  
- Best Practices on Big Data Analytics to Address Sex-Specific Biases in Our Understanding of the Etiology, Diagnosis, and Prognosis of Diseases
  
  Su Golder, Karen O'Connor, Yunwen Wang, Robin Stevens, and Graciela Gonzalez-Hernandez
  
  Vol. 5 (2022), pp. 251–267
  
  https://doi.org/10.1146/annurev-biodatasci-122120-025806
  
  A bias in health research to favor understanding diseases as they present in men can have a grave impact on the health of women. This paper reports on a conceptual review of the literature on machine learning or natural language processing (NLP) techniques to interrogate big data for identifying sex-specific health disparities. We searched Ovid MEDLINE, Embase, and PsycINFO in October 2021 using synonyms and indexing terms for (a) “women,” “men,” or “sex”; (b) “big data,” “artificial intelligence,” or “NLP”; and (c) “disparities” or “differences.” From 902 records, 22 studies met the inclusion criteria and were analyzed. Results demonstrate that the inclusion by sex is inconsistent and often unreported, although the inclusion of men in these studies is disproportionately less than women. Even though artificial intelligence and NLP techniques are widely applied in health research, few studies use them to take advantage of unstructured text to investigate sex-related differences or disparities. Researchers are increasingly aware of sex-based data bias, but the process toward correction is slow. We reflect on best practices on using big data analytics to address sex-specific biases in understanding the etiology, diagnosis, and prognosis of diseases.
  
- Extracellular Vesicle–Based Multianalyte Liquid Biopsy as a Diagnostic for Cancer
  
  Andrew A. Lin, Vivek Nimgaonkar, David Issadore, and Erica L. Carpenter
  
  Vol. 5 (2022), pp. 269–292
  
  https://doi.org/10.1146/annurev-biodatasci-122120-113218
  
  Liquid biopsy is the analysis of materials shed by tumors into circulation, such as circulating tumor cells, nucleic acids, and extracellular vesicles (EVs), for the diagnosis and management of cancer. These assays have rapidly evolved with recent FDA approvals of single biomarkers in patients with advanced metastatic disease. However, they have lacked sensitivity or specificity as a diagnostic in early-stage cancer, primarily due to low concentrations in circulating plasma. EVs, membrane-enclosed nanoscale vesicles shed by tumor and other cells into circulation, are a promising liquid biopsy analyte owing to their protein and nucleic acid cargoes carried from their mother cells, their surface proteins specific to their cells of origin, and their higher concentrations over other noninvasive biomarkers across disease stages. Recently, the combination of EVs with non-EV biomarkers has driven improvements in sensitivity and accuracy; this has been fueled by the use of machine learning (ML) to algorithmically identify and combine multiple biomarkers into a composite biomarker for clinical prediction. This review presents an analysis of EV isolation methods, surveys approaches for and issues with using ML in multianalyte EV datasets, and describes best practices for bringing multianalyte liquid biopsy to clinical implementation.
  
- Challenges and Opportunities for Developing More Generalizable Polygenic Risk Scores
  
  Ying Wang, Kristin Tsuo, Masahiro Kanai, Benjamin M. Neale, and Alicia R. Martin
  
  Vol. 5 (2022), pp. 293–320
  
  https://doi.org/10.1146/annurev-biodatasci-111721-074830
  
  Polygenic risk scores (PRS) estimate an individual's genetic likelihood of complex traits and diseases by aggregating information across multiple genetic variants identified from genome-wide association studies. PRS can predict a broad spectrum of diseases and have therefore been widely used in research settings. Some work has investigated their potential applications as biomarkers in preventative medicine, but significant work is still needed to definitively establish and communicate absolute risk to patients for genetic and modifiable risk factors across demographic groups. However, the biggest limitation of PRS currently is that they show poor generalizability across diverse ancestries and cohorts. Major efforts are underway through methodological development and data generation initiatives to improve their generalizability. This review aims to comprehensively discuss current progress on the development of PRS, the factors that affect their generalizability, and promising areas for improving their accuracy, portability, and implementation.
  
- Importance of Including Non-European Populations in Large Human Genetic Studies to Enhance Precision Medicine
  
  Dan Ju, Daniel Hui, Dorothy A. Hammond, Ambroise Wonkam, and Sarah A. Tishkoff
  
  Vol. 5 (2022), pp. 321–339
  
  https://doi.org/10.1146/annurev-biodatasci-122220-112550
  
  One goal of genomic medicine is to uncover an individual's genetic risk for disease, which generally requires data connecting genotype to phenotype, as done in genome-wide association studies (GWAS). While there may be clinical promise to employing prediction tools such as polygenic risk scores (PRS), it currently stands that individuals of non-European ancestry may not reap the benefits of genomic medicine because of underrepresentation in large-scale genetics studies. Here, we discuss why this inequity poses a problem for genomic medicine and the reasons for the low transferability of PRS across populations. We also survey the ancestry representation of published GWAS and investigate how estimates of ancestry diversity in GWAS participants might be biased. We highlight the importance of expanding genetic research in Africa, one of the most underrepresented regions in human genomics research, and discuss issues of ethics, resources, and technology for equitable advancement of genomic medicine.
  
- The Cell Physiome: What Do We Need in a Computational Physiology Framework for Predicting Single-Cell Biology?
  
  Vijay Rajagopal, Senthil Arumugam, Peter J. Hunter, Afshin Khadangi, Joshua Chung, and Michael Pan
  
  Vol. 5 (2022), pp. 341–366
  
  https://doi.org/10.1146/annurev-biodatasci-072018-021246
  
  Modern biology and biomedicine are undergoing a big data explosion, needing advanced computational algorithms to extract mechanistic insights on the physiological state of living cells. We present the motivation for the Cell Physiome Project: a framework and approach for creating, sharing, and using biophysics-based computational models of single-cell physiology. Using examples in calcium signaling, bioenergetics, and endosomal trafficking, we highlight the need for spatially detailed, biophysics-based computational models to uncover new mechanisms underlying cell biology. We review progress and challenges to date toward creating cell physiome models. We then introduce bond graphs as an efficient way to create cell physiome models that integrate chemical, mechanical, electromagnetic, and thermal processes while maintaining mass and energy balance. Bond graphs enhance modularization and reusability of computational models of cells at scale. We conclude with a look forward at steps that will help fully realize this exciting new field of mechanistic biomedical data science.
  
- Discovering Biological Conflict Systems Through Genome Analysis: Evolutionary Principles and Biochemical Novelty
  
  L. Aravind, Lakshminarayan M. Iyer, and A. Maxwell Burroughs
  
  Vol. 5 (2022), pp. 367–391
  
  https://doi.org/10.1146/annurev-biodatasci-122220-101119
  
  Biological replicators, from genes within a genome to whole organisms, are locked in conflicts. Comparative genomics has revealed a staggering diversity of molecular armaments and mechanisms regulating their deployment, collectively termed biological conflict systems. These encompass toxins used in inter- and intraspecific interactions, self/nonself discrimination, antiviral immune mechanisms, and counter-host effectors deployed by viruses and intragenomic selfish elements. These systems possess shared syntactical features in their organizational logic and a set of effectors targeting genetic information flow through the Central Dogma, certain membranes, and key molecules like NAD⁺. These principles can be exploited to discover new conflict systems through sensitive computational analyses. This has led to significant advances in our understanding of the biology of these systems and furnished new biotechnological reagents for genome editing, sequencing, and beyond. We discuss these advances using specific examples of toxins, restriction-modification, apoptosis, CRISPR/second messenger–regulated systems, and other enigmatic nucleic acid–targeting systems.
  
- Developing and Implementing Predictive Models in a Learning Healthcare System: Traditional and Artificial Intelligence Approaches in the Veterans Health Administration
  
  David Atkins, Christos A. Makridis, Gil Alterovitz, Rachel Ramoni, and Carolyn Clancy
  
  Vol. 5 (2022), pp. 393–413
  
  https://doi.org/10.1146/annurev-biodatasci-122220-110053
  
  Predicting clinical risk is an important part of healthcare and can inform decisions about treatments, preventive interventions, and provision of extra services. The field of predictive models has been revolutionized over the past two decades by electronic health record data; the ability to link such data with other demographic, socioeconomic, and geographic information; the availability of high-capacity computing; and new machine learning and artificial intelligence methods for extracting insights from complex datasets. These advances have produced a new generation of computerized predictive models, but debate continues about their development, reporting, validation, evaluation, and implementation. In this review we reflect on more than 10 years of experience at the Veterans Health Administration, the largest integrated healthcare system in the United States, in developing, testing, and implementing such models at scale. We report lessons from the implementation of national risk prediction models and suggest an agenda for research.
  