Annual Review of Biomedical Data Science - Volume 5, 2022
Phenotypic Causal Inference Using Genome-Wide Association Study Data: Mendelian Randomization and Beyond
Vol. 5 (2022), pp. 1–17
Summary statistics for genome-wide association studies (GWAS) are increasingly available for downstream analyses. Meanwhile, the popularity of causal inference methods has grown as we look to gather robust evidence for novel medical and public health interventions. This has led to the development of methods that use GWAS summary statistics for causal inference. Here, we describe these methods in order of their escalating complexity, from genetic associations to extensions of Mendelian randomization that consider thousands of phenotypes simultaneously. We also cover the assumptions and limitations of these approaches before considering the challenges faced by researchers performing causal inference using GWAS data. GWAS summary statistics constitute an important data source for causal inference research that offers a counterpoint to nongenetic methods when triangulating evidence. Continued efforts to address the challenges in using GWAS data for causal inference will allow the full impact of these approaches to be realized.
Static and Motion Facial Analysis for Craniofacial Assessment and Diagnosing Diseases
Vol. 5 (2022), pp. 19–42
Deviation from a normal facial shape and symmetry can arise from numerous sources, including physical injury and congenital birth defects. Such abnormalities can have important aesthetic and functional consequences. Furthermore, in clinical genetics distinctive facial appearances are often associated with clinical or genetic diagnoses; the recognition of a characteristic facial appearance can substantially narrow the search space of potential diagnoses for the clinician. Unusual patterns of facial movement and expression can indicate disturbances to normal mechanical functioning or emotional affect. Computational analyses of static and moving 2D and 3D images can serve clinicians and researchers by detecting and describing facial structural, mechanical, and affective abnormalities objectively. In this review we survey traditional and emerging methods of facial analysis, including statistical shape modeling, syndrome classification, modeling clinical face phenotype spaces, and analysis of facial motion and affect.
Machine Learning in Chemoinformatics and Medicinal Chemistry
Vol. 5 (2022), pp. 43–65
In chemoinformatics and medicinal chemistry, machine learning has evolved into an important approach. In recent years, increasing computational resources and new deep learning algorithms have put machine learning onto a new level, addressing previously unmet challenges in pharmaceutical research. In silico approaches for compound activity predictions, de novo design, and reaction modeling have been further advanced by new algorithmic developments and the emergence of big data in the field. Herein, novel applications of machine learning and deep learning in chemoinformatics and medicinal chemistry are reviewed. Opportunities and challenges for new methods and applications are discussed, placing emphasis on proper baseline comparisons, robust validation methodologies, and new applicability domains.
Cotranslational Mechanisms of Protein Biogenesis and Complex Assembly in Eukaryotes
Vol. 5 (2022), pp. 67–94
The formation of protein complexes is crucial to most biological functions. The cellular mechanisms governing protein complex biogenesis are not yet well understood, but some principles of cotranslational and posttranslational assembly are beginning to emerge. In bacteria, this process is favored by operons encoding subunits of protein complexes. Eukaryotic cells do not have polycistronic mRNAs, raising the question of how they orchestrate the encounter of unassembled subunits. Here we review the constraints and mechanisms governing eukaryotic co- and posttranslational protein folding and assembly, including the influence of elongation rate on nascent chain targeting, folding, and chaperone interactions. Recent evidence shows that mRNAs encoding subunits of oligomeric assemblies can undergo localized translation and form cytoplasmic condensates that might facilitate the assembly of protein complexes. Understanding the interplay between localized mRNA translation and cotranslational proteostasis will be critical to defining protein complex assembly in vivo.
Open Structural Data in Precision Medicine
Vol. 5 (2022), pp. 95–117
Three-dimensional protein structural data at the molecular level are pivotal for successful precision medicine. Such data are crucial not only for discovering drugs that act to block the active site of the target mutant protein but also for clarifying to the patient and the clinician how the mutations harbored by the patient work. The relative paucity of structural data reflects their cost, challenges in their interpretation, and lack of clinical guidelines for their utilization. Rapid technological advancements in experimental high-resolution structural determination increasingly generate structures. Computationally, modeling algorithms, including molecular dynamics simulations, are becoming more powerful, as are compute-intensive hardware, particularly graphics processing units, overlapping with the inception of the exascale era. Accessible, freely available, and detailed structural and dynamical data can be merged with big data to powerfully transform personalized pharmacology. Here we review protein and emerging genome high-resolution data, along with means, applications, and examples underscoring their usefulness in precision medicine.
Functional Characterization of Genetic Variant Effects on Expression
Vol. 5 (2022), pp. 119–139
Thousands of common genetic variants in the human population have been associated with disease risk and phenotypic variation by genome-wide association studies (GWAS). However, the majority of GWAS variants fall into noncoding regions of the genome, complicating our understanding of their regulatory functions, and few molecular mechanisms of GWAS variant effects have been clearly elucidated. Here, we set out to review genetic variant effects, focusing on expression quantitative trait loci (eQTLs), including their utility in interpreting GWAS variant mechanisms. We discuss the interrelated challenges and opportunities for eQTL analysis, covering determining causal variants, elucidating molecular mechanisms of action, and understanding context variability. Addressing these questions can enable better functional characterization of disease-associated loci and provide insights into fundamental biological questions of the noncoding genetic regulatory code and its control of gene expression.
Integration of Protein Structure and Population-Scale DNA Sequence Data for Disease Gene Discovery and Variant Interpretation
Vol. 5 (2022), pp. 141–161
The experimental and computational techniques for capturing information about protein structures and genetic variation within the human genome have advanced dramatically in the past 20 years, generating extensive new data resources. In this review, we discuss these advances, along with new approaches for determining the impact a genetic variant has on protein function. We focus on the potential of new methods that integrate human genetic variation into protein structures to discover relationships to disease, including the discovery of mutational hotspots in cancer-related proteins, the localization of protein-altering variants within protein regions for common complex diseases, and the assessment of variants of unknown significance for Mendelian traits. We expect that approaches that integrate these data sources will play increasingly important roles in disease gene discovery and variant interpretation.
Genome Privacy and Trust
Vol. 5 (2022), pp. 163–181
Genomics data are important for advancing biomedical research, improving clinical care, and informing other disciplines such as forensics and genealogy. However, privacy concerns arise when genomic data are shared. In particular, the identifying nature of genetic information, its direct relationship to health status, and the potential financial harm and stigmatization posed to individuals and their blood relatives call for a survey of the privacy issues related to sharing genetic and related data and potential solutions to overcome these issues. In this work, we provide an overview of the importance of genomic privacy, the information gleaned from genomics data, the sources of potential private information leakages in genomics, and ways to preserve privacy while utilizing the genetic information in research. We discuss the relationship between trust in the scientific community and protecting privacy, illuminating a future roadmap for data sharing and study participation.
Computational Approaches for Understanding Sequence Variation Effects on the 3D Genome Architecture
Pavel Avdeyev and Jian Zhou
Vol. 5 (2022), pp. 183–204
Decoding how genomic sequence and its variations affect 3D genome architecture is indispensable for understanding the genetic architecture of various traits and diseases. The 3D genome organization can be significantly altered by genome variations and in turn impact the function of the genomic sequence. Techniques for measuring the 3D genome architecture across spatial scales have opened up new possibilities for understanding how the 3D genome depends upon the genomic sequence and how it can be altered by sequence variations. Computational methods have become instrumental in analyzing and modeling the sequence effects on 3D genome architecture, and recent developments in deep learning sequence models have opened up new opportunities for studying the interplay between sequence variations and the 3D genome. In this review, we focus on computational approaches for both the detection and modeling of sequence variation effects on the 3D genome, and we discuss the opportunities presented by these approaches.
Bioinformatics of Corals: Investigating Heterogeneous Omics Data from Coral Holobionts for Insight into Reef Health and Resilience
Vol. 5 (2022), pp. 205–231
Coral reefs are home to over two million species and provide habitat for roughly 25% of all marine animals, but they are being severely threatened by pollution and climate change. A large amount of genomic, transcriptomic, and other omics data is becoming increasingly available from different species of reef-building corals, the unicellular dinoflagellates, and the coral microbiome (bacteria, archaea, viruses, fungi, etc.). Such new data present an opportunity for bioinformatics researchers and computational biologists to contribute to a timely, compelling, and urgent investigation of critical factors that influence reef health and resilience.
Exchange of Human Data Across International Boundaries
Vol. 5 (2022), pp. 233–250
There is a need to share personal data across jurisdictional boundaries. However, the laws regulating such transfers are not harmonized, and sometimes even conflict, causing challenges and occasional data stalls. This review describes the legal landscape for transfer of human data across international boundaries. The European Union's data protection legislation is used as the starting point for illustrating the legislation of countries across the world, how these diverge, and one's options for exchanging human data internationally in a legally compliant manner.
Best Practices on Big Data Analytics to Address Sex-Specific Biases in Our Understanding of the Etiology, Diagnosis, and Prognosis of Diseases
Vol. 5 (2022), pp. 251–267
A bias in health research to favor understanding diseases as they present in men can have a grave impact on the health of women. This paper reports on a conceptual review of the literature on machine learning or natural language processing (NLP) techniques to interrogate big data for identifying sex-specific health disparities. We searched Ovid MEDLINE, Embase, and PsycINFO in October 2021 using synonyms and indexing terms for (a) “women,” “men,” or “sex”; (b) “big data,” “artificial intelligence,” or “NLP”; and (c) “disparities” or “differences.” From 902 records, 22 studies met the inclusion criteria and were analyzed. Results demonstrate that the inclusion by sex is inconsistent and often unreported, although the inclusion of men in these studies is disproportionately less than women. Even though artificial intelligence and NLP techniques are widely applied in health research, few studies use them to take advantage of unstructured text to investigate sex-related differences or disparities. Researchers are increasingly aware of sex-based data bias, but the process toward correction is slow. We reflect on best practices on using big data analytics to address sex-specific biases in understanding the etiology, diagnosis, and prognosis of diseases.
Extracellular Vesicle–Based Multianalyte Liquid Biopsy as a Diagnostic for Cancer
Vol. 5 (2022), pp. 269–292
Liquid biopsy is the analysis of materials shed by tumors into circulation, such as circulating tumor cells, nucleic acids, and extracellular vesicles (EVs), for the diagnosis and management of cancer. These assays have rapidly evolved with recent FDA approvals of single biomarkers in patients with advanced metastatic disease. However, they have lacked sensitivity or specificity as a diagnostic in early-stage cancer, primarily due to low concentrations in circulating plasma. EVs, membrane-enclosed nanoscale vesicles shed by tumor and other cells into circulation, are a promising liquid biopsy analyte owing to their protein and nucleic acid cargoes carried from their mother cells, their surface proteins specific to their cells of origin, and their higher concentrations over other noninvasive biomarkers across disease stages. Recently, the combination of EVs with non-EV biomarkers has driven improvements in sensitivity and accuracy; this has been fueled by the use of machine learning (ML) to algorithmically identify and combine multiple biomarkers into a composite biomarker for clinical prediction. This review presents an analysis of EV isolation methods, surveys approaches for and issues with using ML in multianalyte EV datasets, and describes best practices for bringing multianalyte liquid biopsy to clinical implementation.
Challenges and Opportunities for Developing More Generalizable Polygenic Risk Scores
Vol. 5 (2022), pp. 293–320
Polygenic risk scores (PRS) estimate an individual's genetic likelihood of complex traits and diseases by aggregating information across multiple genetic variants identified from genome-wide association studies. PRS can predict a broad spectrum of diseases and have therefore been widely used in research settings. Some work has investigated their potential applications as biomarkers in preventative medicine, but significant work is still needed to definitively establish and communicate absolute risk to patients for genetic and modifiable risk factors across demographic groups. However, the biggest limitation of PRS currently is that they show poor generalizability across diverse ancestries and cohorts. Major efforts are underway through methodological development and data generation initiatives to improve their generalizability. This review aims to comprehensively discuss current progress on the development of PRS, the factors that affect their generalizability, and promising areas for improving their accuracy, portability, and implementation.
Importance of Including Non-European Populations in Large Human Genetic Studies to Enhance Precision Medicine
Vol. 5 (2022), pp. 321–339
One goal of genomic medicine is to uncover an individual's genetic risk for disease, which generally requires data connecting genotype to phenotype, as done in genome-wide association studies (GWAS). While there may be clinical promise to employing prediction tools such as polygenic risk scores (PRS), it currently stands that individuals of non-European ancestry may not reap the benefits of genomic medicine because of underrepresentation in large-scale genetics studies. Here, we discuss why this inequity poses a problem for genomic medicine and the reasons for the low transferability of PRS across populations. We also survey the ancestry representation of published GWAS and investigate how estimates of ancestry diversity in GWAS participants might be biased. We highlight the importance of expanding genetic research in Africa, one of the most underrepresented regions in human genomics research, and discuss issues of ethics, resources, and technology for equitable advancement of genomic medicine.
The Cell Physiome: What Do We Need in a Computational Physiology Framework for Predicting Single-Cell Biology?
Vol. 5 (2022), pp. 341–366
Modern biology and biomedicine are undergoing a big data explosion, needing advanced computational algorithms to extract mechanistic insights on the physiological state of living cells. We present the motivation for the Cell Physiome Project: a framework and approach for creating, sharing, and using biophysics-based computational models of single-cell physiology. Using examples in calcium signaling, bioenergetics, and endosomal trafficking, we highlight the need for spatially detailed, biophysics-based computational models to uncover new mechanisms underlying cell biology. We review progress and challenges to date toward creating cell physiome models. We then introduce bond graphs as an efficient way to create cell physiome models that integrate chemical, mechanical, electromagnetic, and thermal processes while maintaining mass and energy balance. Bond graphs enhance modularization and reusability of computational models of cells at scale. We conclude with a look forward at steps that will help fully realize this exciting new field of mechanistic biomedical data science.
Discovering Biological Conflict Systems Through Genome Analysis: Evolutionary Principles and Biochemical Novelty
Vol. 5 (2022), pp. 367–391
Biological replicators, from genes within a genome to whole organisms, are locked in conflicts. Comparative genomics has revealed a staggering diversity of molecular armaments and mechanisms regulating their deployment, collectively termed biological conflict systems. These encompass toxins used in inter- and intraspecific interactions, self/nonself discrimination, antiviral immune mechanisms, and counter-host effectors deployed by viruses and intragenomic selfish elements. These systems possess shared syntactical features in their organizational logic and a set of effectors targeting genetic information flow through the Central Dogma, certain membranes, and key molecules like NAD+. These principles can be exploited to discover new conflict systems through sensitive computational analyses. This has led to significant advances in our understanding of the biology of these systems and furnished new biotechnological reagents for genome editing, sequencing, and beyond. We discuss these advances using specific examples of toxins, restriction-modification, apoptosis, CRISPR/second messenger–regulated systems, and other enigmatic nucleic acid–targeting systems.
Developing and Implementing Predictive Models in a Learning Healthcare System: Traditional and Artificial Intelligence Approaches in the Veterans Health Administration
Vol. 5 (2022), pp. 393–413
Predicting clinical risk is an important part of healthcare and can inform decisions about treatments, preventive interventions, and provision of extra services. The field of predictive models has been revolutionized over the past two decades by electronic health record data; the ability to link such data with other demographic, socioeconomic, and geographic information; the availability of high-capacity computing; and new machine learning and artificial intelligence methods for extracting insights from complex datasets. These advances have produced a new generation of computerized predictive models, but debate continues about their development, reporting, validation, evaluation, and implementation. In this review we reflect on more than 10 years of experience at the Veterans Health Administration, the largest integrated healthcare system in the United States, in developing, testing, and implementing such models at scale. We report lessons from the implementation of national risk prediction models and suggest an agenda for research.