Our overall research interest is to advance cardiovascular medicine through a better understanding of cardiac proteins on a global scale. A major focus is to develop proteomics and data science methods to interrogate how changes in protein expression orchestrate higher physiological functions in normal and diseased hearts. Additionally, we ask questions about disease phenotypes which are best answered through unification of clinical and experimental findings within unstructured text data, such as clinical case reports (CCRs) and notes contained within electronic health records (EHRs).
Our lab also leads the Integrated Data Science Training in CardioVascular Medicine program (iDISCOVER). This program fosters a cross-campus, interdisciplinary research environment to teach and mentor future scientists in cutting-edge computational and data science methods. It supports creation of a next-generation workforce with an advanced understanding of data science tactics for addressing real-world cardiovascular problems and ultimately realizing precision cardiovascular medicine. To that end, our research interests concern this interface between data science and intrinsically complex cardiovascular disease phenotypes.
Major projects are as follows.
Learning from big data is now an established research strategy and a cornerstone of advancing our understanding of human disease. The vast majority of the data produced by current studies, whether as descriptions of experimental results or as clinical observations, is unstructured: it follows no consistent format and is not organized in a predictable way. Much of this data is text, and much of that text is in highly technical language. Integrating these immense, varied data sources and answering questions about them therefore poses a challenge in how to extract the information they contain. We are applying natural language processing (NLP) methods, including strategies for named entity recognition (NER) and relation extraction (RE), to find connections across experimental and clinical literature. The technical aspects of this work are informed by Prof. Jiawei Han (Univ. of Illinois at Urbana-Champaign) and his extensive accomplishments in text mining and NLP.
Sigdel, D. et al. Cloud-Based Phrase Mining and Analysis of User-Defined Phrase-Category Association in Biomedical Publications. Journal of Visualized Experiments 59108 (2019).
Caufield, J. H., Zhou, Y., and Garlid, A. O. et al. A reference set of curated biomedical data and metadata from clinical case reports. Scientific Data 5, 180258 (2018).
Liem, D. A. et al. Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease. American Journal of Physiology-Heart and Circulatory Physiology 315, H910–H924 (2018).
Caufield, J. H. et al. A Metadata Extraction Approach for Clinical Case Reports to Enable Advanced Understanding of Biomedical Concepts. Journal of Visualized Experiments (2018).
Liem, D. A. et al. Phrase Mining and Machine Learning in Textual Data to Uncover Distinct Protein Patterns in Cardiovascular Disease. Journal of Molecular and Cellular Cardiology 112, 168–169 (2017).
Knowledge graphs are powerful concepts for representing, comparing, and learning from varied data sources. We are assembling these graphs along with the necessary infrastructure for ensuring they contain consistent, validated observations capable of informing focused insights on cardiovascular disease phenotypes. As with the above project, the technical aspects of this work are informed by Prof. Jiawei Han (Univ. of Illinois at Urbana-Champaign) and his extensive accomplishments in data mining.
Ping, P., Watson, K., Han, J. & Bui, A. Individualized Knowledge Graph: A Viable Informatics Path to Precision Medicine. Circulation Research 120, 1078–1080 (2017).
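As a rough illustration of the knowledge graph idea described above, the following minimal Python sketch stores observations as subject-predicate-object triples with a provenance field, rejects triples that lack a source, and runs a simple two-hop query. All entity names (ProteinA, DrugB, DiseaseX, SymptomY), predicates, and source labels are hypothetical placeholders, and the structure is an assumption for illustration only; it does not represent the infrastructure actually used in this project.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    source: str  # provenance, e.g. the document the observation came from


# Hypothetical placeholder observations, not real biomedical findings.
triples = [
    Triple("ProteinA", "associated_with", "DiseaseX", "case_report_1"),
    Triple("DrugB", "targets", "ProteinA", "article_2"),
    Triple("DrugB", "has_adverse_effect", "SymptomY", "case_report_1"),
]


def build_graph(items):
    """Index triples by subject, rejecting any triple without provenance."""
    graph = defaultdict(list)
    for t in items:
        if not t.source:
            raise ValueError(f"triple lacks a source: {t}")
        graph[t.subject].append((t.predicate, t.obj))
    return graph


def two_hop(graph, start):
    """Return the nodes reachable from start in exactly two edges."""
    return {
        second
        for _, first in graph.get(start, [])
        for _, second in graph.get(first, [])
    }


graph = build_graph(triples)
print(two_hop(graph, "DrugB"))
```

In this toy graph, the two-hop query links the placeholder drug to the placeholder disease through their shared protein, which is the kind of indirect connection a knowledge graph is meant to surface.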
Working with the observational data within clinical documents such as CCRs and EHRs is a daunting task, even with the assistance of text mining. We are constructing approaches to make use of the newest releases of clinical coding systems (e.g., ICD-11, formally released by the World Health Organization in 2018) to extract relationships among observations and disease diagnoses from clinical text.
We are using newly developed computational pipelines to uncover hidden relationships among cardiovascular drugs, their molecular targets, and potential adverse effects in the setting of the pathogenesis of heart disease.
A current bottleneck for widespread adoption of proteomics technologies in cardiovascular medicine is the limited accessibility of bioinformatics tools and the quality and quantity of protein functional annotations. To address these shortcomings we created the Cardiac Organellar Protein Atlas Knowledgebase (COPaKB), a knowledgebase aimed at connecting proteomics data with cardiovascular biology knowledge. We are also exploring cloud-based computational infrastructures optimized for Big Data to support remote, high-performance data access and analysis that will transform the research sharing infrastructure and encourage interoperability in biomedical research. The ultimate goal of this long-term effort is to provide a comprehensive platform to bridge traditional data-driven proteomic studies and hypothesis-driven investigations widely employed by the cardiovascular community.
Current annotations on genes, proteins, metabolites, and their functions in health and disease are fragmented and incomplete, leaving valuable information scattered across multiple websites, databases, local files, and research articles. To address this challenge, we are building upon the growing trend of crowdsourcing to enable collaborative annotation of existing datasets. We have coordinated substantial contributions to publicly available knowledge on mitochondrial biology and disease through the Gene Wiki project, in which more than 50 students created more than 500 detailed descriptions of mitochondrial genes. Efforts to harmonize knowledge in this field, particularly between biomolecular knowledge and that provided by CCRs, led to the creation of the MitoCases project. These community and computational efforts are designed to allow end-users to obtain concise and accurate information on cardiovascular health and disease, thus facilitating the translation of Big Data to biomedical knowledge.
Oxidative post-translational modifications (O-PTMs) of proteins are highly prevalent cellular features that enable diverse and nuanced functions and elicit critical effects on human health and disease. Thus far, at least 35 different types of O-PTMs have been reported in various model systems and in humans. However, very few studies have been able to elucidate a molecular fingerprint of O-PTMs, mainly because of the limited sophistication of mass spectrometry and data science technologies. To address this challenge, we collaborate with Drs. Alex Bui, Wei Wang, and Karol Watson from UCLA, Dr. Jiawei Han from UIUC, Dr. Lan Huang from UCI, Dr. John Yates from Scripps, and Henning Hermjakob from EMBL-EBI. Together, our expertise spans O-PTM biology, MS/MS-based O-PTM technology, database and knowledgebase query, machine learning (ML) analytics, knowledge graph (KG) construction, and text mining. Our project aims to comprehensively characterize the molecular landscape, elucidate mechanistic insights, and define the translational value of O-PTMs in biomedical processes, cells, organelles, model systems, and humans.
Our lab is interested in understanding the mechanisms of heart diseases and injuries by interrogating large-scale alterations of protein expression and dynamics during disease development, and by inferring novel disease proteins through the combination of multi-scale molecular parameters. A major emphasis is to expand the number of proteome parameters one can observe on a large scale, including quantification of proteome-wide post-translational modifications, localization, and temporal dynamics, in order to detect "hidden" disease signatures. We also recognize that protein temporal dynamics play a critical role in time-dimensional pathophysiological processes, including the gradual cardiac remodeling that occurs in early-stage heart failure. Many potential disease associations in protein homeostasis may manifest as disrupted protein half-life but are masked in measurements of protein expression, yet methods for quantitative assessment of protein kinetics lag behind those for the assessment of protein expression. The data are revealing a quantitative and longitudinal view of cardiac remodeling at the molecular level, where widespread kinetic regulation occurs in calcium signaling, metabolism, proteostasis, and mitochondrial dynamics.
Lau, E. et al. Integrated omics dissection of proteome dynamics during cardiac remodeling. Nat Commun 9, 120 (2018).
Lau, E. et al. A large dataset of protein dynamics in the mammalian heart proteome. Scientific Data 3, 160015 (2016).
Lam, M. P. Y., Ping, P. & Murphy, E. Proteomics Research in Cardiovascular Medicine and Biomarker Discovery. Journal of the American College of Cardiology 68, 2819–2830 (2016).
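To make concrete why a change in protein half-life can be masked in expression measurements, here is a minimal Python sketch under a simplifying assumption that is ours, not the cited studies': constant synthesis balanced by first-order degradation, so that steady-state abundance equals k_syn / k_deg and half-life equals ln(2) / k_deg. The rate values are hypothetical and chosen only to illustrate the point.

```python
import math


def half_life(k_deg: float) -> float:
    """Half-life for first-order degradation with rate constant k_deg."""
    return math.log(2) / k_deg


def steady_state_abundance(k_syn: float, k_deg: float) -> float:
    """Abundance at which constant synthesis balances first-order degradation."""
    return k_syn / k_deg


# Hypothetical rates (arbitrary units per day), for illustration only.
conditions = {
    "normal": {"k_syn": 10.0, "k_deg": 0.1},
    "diseased": {"k_syn": 20.0, "k_deg": 0.2},
}

for label, rates in conditions.items():
    abundance = steady_state_abundance(rates["k_syn"], rates["k_deg"])
    t_half = half_life(rates["k_deg"])
    print(f"{label}: abundance={abundance:.1f}, half-life={t_half:.2f} days")
```

Under these assumed rates, both conditions give the same steady-state abundance while the half-life in the second condition is halved, so an expression measurement alone would not distinguish them.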
Aside from protein expression, the proteome is defined by numerous dynamic parameters that currently remain underexplored. We are applying machine learning and data mining algorithms to the inference of spatiotemporal models from protein data, and to the integration of molecular profiles with biomedical variables. In collaboration with Henning Hermjakob from EMBL-EBI, we have also built OmicsDI, a unified access point to locate and acquire transcript, protein, and metabolite datasets. Our long-term goal is to implement, deliver, and execute these tools and algorithms on the cloud to provide easy access to cardiovascular researchers and clinicians, as well as the broader biomedical community.
Perez-Riverol, Y. et al. Discovering and linking public omics data sets using the Omics Discovery Index. Nat Biotechnol 35, 406–409 (2017).