Training in the iDISCOVER Program

Overview: The UCLA Integrated Data Science Training in Cardiovascular Medicine (iDISCOVER) Program supports graduate students and postdoctoral fellows seeking training at the intersection of data science and cardiovascular (CV) medicine. It is also open to undergraduate students who wish to learn basic data science concepts and gain hands-on experience with entry-level projects.

Motivation: There is an urgent need for a new generation of researchers capable of working with the diverse forms of data now common in CV disease research. Accordingly, the iDISCOVER program provides trainees with direct experience in the current methods and varied data types needed for modern CV research. The program supports PhD students who have completed their first year of training as well as postdocs; all support is for a maximum of two years per trainee. Support for undergraduates is evaluated on a quarterly basis. See the “Summary of Trainee Types and Support Sources” table below for further details.

I. CV Data Science Training Program for UCLA Graduate Students and Postdoctoral Fellows.

Coursework: Trainees engage in coursework covering omics phenotyping-supported outcome studies, machine learning-supported approaches in CV medicine, and information extraction and knowledgebase construction. Courses in Bioengineering and Bioinformatics are popular options.

Projects: All trainees are encouraged to pursue their own hands-on research project. These projects encompass both self-directed and supervised experience with fundamental principles and methods of data science. Projects generally include identification of a specific cardiovascular use case (e.g., determination of cardiac protein expression patterns correlated with a heart failure phenotype), curation and integration of data from varied sources (e.g., research abstract text from PubMed, biomolecular pathways from Reactome, or chest X-ray images), design of a data analysis pipeline, and interpretation of results. Trainees work closely with lab members to understand, contextualize, and visualize their own accomplishments. At the conclusion of the project, each trainee completes and turns in a written project summary. Trainees also have opportunities to present and discuss their accomplishments in regular meetings and symposia. All trainees are expected to leave the program with a comprehensive understanding of data science as it applies to CV medicine.

Supervision and mentorship: Research projects are supervised by faculty from the UCLA Schools of Medicine and Engineering, including members of the departments of Physiology, Cardiology, Computer Science, Medical Imaging & Informatics, and Computational Medicine. Trainees may enter co-mentoring arrangements to gain mentorship in both CV and data science topics.

Reading materials:

Individualized Knowledge Graph: A Viable Informatics Path to Precision Medicine - learn about how knowledge graphs can unify cardiovascular data.
A large dataset of protein dynamics in the mammalian heart proteome - understand a temporal dimension of cardiac proteomics.
Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease - explore relationships between ECM and CV disease with text mining.

Frequently used tools and resources:

The Context-aware Semantic Online Analytical Processing (CaseOLAP) pipeline
Knowledgebases (Reactome, UniProtKB), ontologies (MeSH, Disease Ontology), and coding systems (ICD-11)
Python frameworks for data analysis (pandas), text mining (NLTK, Flair), and machine learning (PyTorch)

II. CV Data Science Training Program for UCLA Undergraduate Students.

The Cardiovascular Data Science Training Program offers undergraduate students experience with the fundamental principles of data science. Training is offered to students throughout the year. Support for students is available through an award from the NIH/NHLBI (see the “Summary of Trainee Types and Support Sources” table below for further details). The number of available positions for undergraduate research projects is limited, so please contact the program as early as possible before you intend to begin a project.

Course registration: First- and second-year UCLA students may receive training for course credit through the Student Research Program as an SRP 99 course. Third- and fourth-year students are eligible to register for a 199-level Directed Research course. Students registering for credit must fulfill all training requirements on a quarterly basis and turn in a project summary at the end of the quarter. Please consult with your home department (i.e., that of your major) prior to registration to confirm your eligibility or to learn about any requirements specific to your program. Note that this training may also be attended without course credit: students who do so are still expected to turn in a project summary at the completion of their training, but both weekly scheduling and requirement deadlines will be more flexible.

Projects: Students work with a supervisor to learn data science concepts, principles, and methods. Given a cardiovascular question to focus on, they then apply what they have learned by developing data analysis pipelines to address the question. All students acquire practical biomedical data science skills, including obtaining data through APIs, managing multiple data formats, and unifying varied concepts (e.g., given the name of a disease such as “tetralogy of Fallot”, what is its corresponding MeSH term?). Projects approach major questions from both an engineering perspective and the context of biological relevance, with final results produced through both computational analysis and human knowledge.
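
As a rough illustration of the concept-unification step described above, the sketch below builds a query against the NCBI E-utilities esearch endpoint for the MeSH database and pulls identifiers out of a JSON response. The helper names and the sample response are hypothetical and written only for illustration; no network request is sent, and the identifier in the sample is a placeholder rather than a real MeSH record.

```python
import json
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_mesh_query(disease_name):
    """Return an esearch URL that looks up a disease name in the MeSH database."""
    params = {"db": "mesh", "term": disease_name, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def extract_ids(response_text):
    """Pull the list of matching record IDs out of an esearch JSON response."""
    payload = json.loads(response_text)
    return payload.get("esearchresult", {}).get("idlist", [])


# Hypothetical response body shaped like esearch JSON output (placeholder ID).
sample_response = '{"esearchresult": {"count": "1", "idlist": ["PLACEHOLDER_ID"]}}'

url = build_mesh_query("tetralogy of Fallot")
ids = extract_ids(sample_response)
print(url)
print(ids)
```

In a real project the URL would be fetched with an HTTP client and the returned identifiers resolved to MeSH records; keeping query construction and response parsing in separate functions makes each step easy to check on its own.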

Supervision and mentorship: Research projects are supervised by graduate students, postdoctoral fellows, or other laboratory staff. Students will be able to consult and share their work with faculty from the UCLA Schools of Medicine and Engineering.

Appropriate majors: Students from all UCLA majors are encouraged to apply. Previous data science or cardiovascular medicine experience is not required but a desire to learn is essential.

Summary of Trainee Types and Support Sources
Trainee	Source of Support	Project Type
Undergraduate Student	NIH/NHLBI Award: R35HL135772	Entry level or part of project led by graduate student or postdoc fellow
Graduate Student	NIH/NHLBI Award: T32HL139450	Independent project, may work with undergraduate researchers
Postdoctoral Fellow	NIH/NHLBI Award: T32HL139450	Independent project, may lead undergraduate researchers

Research Projects for Undergraduate and Graduate Students

A study of Drug to Cardiovascular Disease (CVD) Associations with SemRep and Deep Learning

Description: Starting with well-defined oxidative stress categories (e.g., Initiation, Regulation, and Outcome of Oxidative Stress) and a list of drugs used in cardiovascular disease (CVD), we will use SemRep to extract all relevant subject-predicate-object (SPO) triplets. We will then build knowledge graphs from these triplets and prepare a multi-order association matrix to represent the graph data structure. Using this graph structure, we will build a sequence prediction model for drug-to-CVD association. This project will provide a detailed analysis of drug-to-CVD associations with both qualitative evidence and quantitative scores.

Project leaders: David Liem (dliem@mednet.ucla.edu), Dibakar Sigdel (sigdeldkr@gmail.com)

Education goals: The students will learn how to work with innovative text mining tools (e.g., SemRep, CaseOLAP, Neo4J) for biomedical documents and with machine learning approaches (RNN, LSTM) for model development and implementation to answer important biomedical questions.

Scientific goals: The students will explore knowledge graphs for drug and CVD associations with a focus on oxidative stress categories (e.g., Initiation, Regulation and Outcome) and underlying molecular mechanism.
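
One way to picture the pipeline in the description above, from SPO triplets to a graph to a multi-order association matrix, is sketched below. The triplets are invented placeholders rather than SemRep output, and defining the multi-order matrix as the sum of the first few powers of the adjacency matrix is an assumption made for illustration, not necessarily the definition used in the project.

```python
# Hypothetical subject-predicate-object triplets (placeholders, not SemRep output).
triplets = [
    ("drug_A", "INHIBITS", "protein_X"),
    ("protein_X", "ASSOCIATED_WITH", "oxidative_stress"),
    ("oxidative_stress", "CAUSES", "cvd_Y"),
]

# Index every entity that appears as a subject or an object.
nodes = sorted({t[0] for t in triplets} | {t[2] for t in triplets})
index = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# Directed adjacency matrix: entry [i][j] is 1 if a triplet links node i to node j.
adjacency = [[0] * n for _ in range(n)]
for subj, _pred, obj in triplets:
    adjacency[index[subj]][index[obj]] = 1


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def multi_order(adj, max_order):
    """Sum A + A^2 + ... + A^max_order, counting paths of up to max_order hops."""
    total = [row[:] for row in adj]
    power = [row[:] for row in adj]
    for _ in range(max_order - 1):
        power = matmul(power, adj)
        total = [[total[i][j] + power[i][j] for j in range(len(adj))]
                 for i in range(len(adj))]
    return total


association = multi_order(adjacency, max_order=3)
# Number of directed paths of at most three hops from the drug to the disease.
drug_to_cvd = association[index["drug_A"]][index["cvd_Y"]]
print(drug_to_cvd)
```

In this toy graph the drug reaches the disease only through a three-hop path, so the association appears once third-order terms are included. A matrix like this could then serve as input features for a downstream prediction model.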

A study of Covid-19 Knowledge Graphs for different Age Groups and CVD Cases

Description: Covid-19 is caused by a coronavirus called SARS-CoV-2 and often presents with symptoms of high fever, cough, and shortness of breath. In severe cases, Covid-19 may lead to acute respiratory distress syndrome (ARDS), multiple organ dysfunction, and eventually death. Covid-19 has caused far more severe illness and death than any other known coronavirus disease. New data from Covid-19 cases indicate that the severity and mortality of this disease are significantly higher in elderly patients and in patients with a history of CVD. Applying a text mining approach, the students will explore the role of risk factors such as ageing and several cardiovascular diseases (e.g., coronary artery disease) in the severity of Covid-19, and unravel possible underlying mechanisms.

Project leaders: David Liem (dliem@mednet.ucla.edu), Dibakar Sigdel (sigdeldkr@gmail.com)

Education goals: Students will learn how to apply innovative tools in text mining and knowledge graphs (e.g., Neo4J and Spark) for data exploration and for the development of search algorithms with specific tasks in biomedical scenarios.

Scientific goals: Students will learn how to formulate meaningful biomedical questions from available tools and databases in CVD and Covid-19 (e.g., which age groups and pre-existing CVD significantly increase the risk of mortality in Covid-19, and what are the underlying mechanisms?). The search results can be further explored to investigate the underlying age-based mechanisms.

A study of Covid-19 Knowledge Graphs for Drugs and CVD Cases

Description: Covid-19 is caused by a coronavirus called SARS-CoV-2. It is believed that this virus relies on a pivotal interaction with the renin-angiotensin-aldosterone system to enter cells in the body. Accordingly, concerns exist that certain CVD drugs such as angiotensin-converting enzyme inhibitors (ACE inhibitors) and angiotensin receptor blockers (ARBs) may increase susceptibility to SARS-CoV-2 as well as the severity of Covid-19. In this project, the students will apply a text mining approach to create a Covid-19 knowledge graph (KG) for ACE inhibitors and ARBs and identify relevant underlying molecular pathways and mechanisms that may play a role in Covid-19.

Project leaders: David Liem (dliem@mednet.ucla.edu), Dibakar Sigdel (sigdeldkr@gmail.com)

Education goals: The students will learn how to work with innovative tools in text mining and knowledge graphs (e.g., Neo4J and Spark) for data exploration and development of search algorithms for specific tasks in biomedical scenarios.

Scientific goals: Students will learn how to formulate meaningful biomedical questions from available tools and databases in CVD and Covid-19 (e.g., which drug or drug category has a significant effect on the severity and mortality of Covid-19, and what are the underlying mechanisms?). The search results can be further explored to investigate underlying age-based molecular mechanisms.

Mapping Collective Knowledge of the Cardiac Proteome

Description: By definition, we expect that a proteome lists each protein within a particular tissue or organ. A cardiac proteome, for example, should include identities and amounts of each protein in the heart. This definition becomes clouded once we begin considering specific conditions: how does an unhealthy (e.g., hypertrophic or failing) heart’s proteome differ from that of a healthy one? Does the proteome change over time? How may the proteome vary between hearts from male or female individuals? Our ability to address these questions may be limited by the samples used to define each proteome as well as by inherent experimental variability. We may search across current and past literature to rigorously define and merge differing (and in some cases, conflicting) observations of cardiac protein expression, with the goal of assembling an updated proteome of the human heart. This process requires intensive application of text mining coupled with an understanding of cardiac-specific biological pathways. This project will place particular focus on three types of proteins: contractile proteins, proteins impacted by oxidative stress, and proteins with metabolic functions (especially those involved in branched chain amino acid, or BCAA, metabolism) as these topics are foci of other lab efforts. Assembly of an updated cardiac proteome will produce a crucial reference for classification of a peptide’s relevance to the heart.

Project leader: Harry Caufield (j.harry.caufield@gmail.com)

Education goals: An understanding of PubMed and the language used in biomedical research literature. Experience with obtaining text data through an API. Familiarity with computational methods for bibliometrics, text mining, information extraction, and natural language processing. Knowledge of biomolecular pathways in cardiac function.

Scientific goals: Construction of a literature-derived cardiac proteome, serving as a comprehensive resource for identification of proteins most relevant to healthy and diseased cardiac phenotypes.

Constructing an Integrated Cardiovascular Knowledge Graph to Discover Disease Phenotype Relationships

Description: Modern bioinformatics and biomedical informatics projects rely upon well-curated knowledge bases and data repositories. These resources contain structured information describing proteins (e.g., UniProtKB), biomolecular interactions (e.g., IntAct), or genotype-phenotype relationships (e.g., OMIM), among numerous other topics. Similarly, carefully engineered ontologies and coding systems define relationships between diseases (e.g., Disease Ontology; ICD) or broader sets of biomedical concepts (e.g., MeSH). Though each of these resources is data-rich and highly valuable, we rarely need to use any one of them in its entirety, and we would like to use knowledge curated from multiple sources even when their structures present obstacles to data integration. By exploring a subset of each knowledge base and ontology from the perspective of cardiovascular disease research, we may identify the most relevant elements and unify them within a single graph structure. The resulting knowledge graph supports asking complex questions about cardiovascular phenomena. With some additional engineering, higher-level representations of these knowledge graphs can drive machine learning approaches for understanding cardiovascular disease.

Project leader: Harry Caufield (j.harry.caufield@gmail.com)

Education goals: An understanding of the technical methods required to integrate heterogeneous biomedical relationships described in text and knowledge bases. Skills to be developed include data retrieval through APIs, text data analysis and natural language processing with Python, and data management in Neo4j. Experience with the data formats and structures used to store biomolecular data and metadata, as well as ontologies (e.g., OBO or OWL formats) and other data (e.g., JSON).

Scientific goals: Assemble a consistently-structured knowledge resource optimized for phenomena relevant to cardiovascular disease, including relationships between disease phenotypes, biomolecules, biomolecular pathways, symptoms, and therapeutics. Identify best practices for merging specific knowledge sources. Develop reusable code for obtaining and integrating knowledge base contents.
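
The integration step described above, in which relationships from differently structured sources are unified within one graph, might look like the minimal sketch below. The records, the source names, and the shared identifier are all hypothetical placeholders; real projects would map names through ontologies or knowledge base identifiers instead of a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical records from two sources that name the same entities differently.
source_a = [("Heart Failure", "has_associated_protein", "PROT1")]
source_b = [
    ("heart failure", "has_symptom", "dyspnea"),
    ("HEART FAILURE", "has_associated_protein", "PROT1"),
]

# Hypothetical mapping from normalized names to one shared identifier.
shared_ids = {"heart failure": "DISEASE:0001"}


def normalize(name):
    """Map a name to its shared identifier, falling back to the lowercased name."""
    key = name.strip().lower()
    return shared_ids.get(key, key)


def merge_sources(named_sources):
    """Combine edges from all sources, recording which sources support each edge."""
    graph = defaultdict(set)
    for source_name, records in named_sources.items():
        for subj, relation, obj in records:
            edge = (normalize(subj), relation, normalize(obj))
            graph[edge].add(source_name)
    return dict(graph)


graph = merge_sources({"source_a": source_a, "source_b": source_b})
for edge, support in sorted(graph.items()):
    print(edge, sorted(support))
```

Recording which sources support each edge keeps provenance available after the merge, which helps when sources disagree. The merged edges could then be loaded into a graph database such as Neo4j.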

Knowledge Graph construction and analysis to support heart failure classification

Description: New cases of heart failure, or HF, are diagnosed by the millions each year. Not all hearts fail in the same manner, however: HF cases may be categorized by their ejection fraction, or EF, expressed as a percentage. HF with an EF below 40% is considered HF with reduced EF (HFrEF), while HF with an EF greater than 50%, a value that is often physiologically normal outside the context of disease, constitutes HF with preserved EF (HFpEF). HFpEF is increasingly common and is distinguished from HFrEF by a variety of presentation factors, patient traits, comorbidities, and other factors such as systemic inflammation. How may we organize these varied factors in a consistent manner? If clinical and biomolecular correlates of HFrEF or HFpEF are structured as relationships, may we assemble them into a knowledge graph? What may this knowledge graph allow us to infer regarding HF classification?

Project leader: Harry Caufield (j.harry.caufield@gmail.com)

Education goals: An understanding of the technical methods required to integrate heterogeneous biomedical relationships described in text and knowledge bases. Skills to be developed include data retrieval through APIs, text data analysis and natural language processing with Python, and data management in Neo4j. The ability to analyze knowledge graphs (and, by extension, other networks of biomedical relationships) to identify relationships supporting conclusions about cardiovascular disease. Students will also gain knowledge of the symptomatology of heart disease.

Scientific goals: Identify specific patterns of biomedical relationships associated with specific subtypes of heart failure, such that text describing heart failure may be classified without explicit definitions being present (e.g., HFpEF may be described implicitly).
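
The EF thresholds given in the description above can be written as a small rule, sketched below. The function name and the "unlabeled" return value are hypothetical; the description assigns no category to EF values from 40% to 50%, so the sketch leaves that range unlabeled instead of guessing.

```python
def classify_by_ef(ef_percent):
    """Label a heart failure case from its ejection fraction, given in percent."""
    if not 0 <= ef_percent <= 100:
        raise ValueError("ejection fraction must be between 0 and 100 percent")
    if ef_percent < 40:
        return "HFrEF"
    if ef_percent > 50:
        return "HFpEF"
    # The description above does not assign a label to EF values from 40% to 50%.
    return "unlabeled"


for ef in (30, 45, 60):
    print(ef, classify_by_ef(ef))
```

A rule like this covers only cases where EF is stated explicitly. The project's goal of classifying text in which the subtype is described implicitly would require the relationship patterns captured in the knowledge graph.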

Mass Spectrometry (MS)-based Proteomics in Cardiovascular Research

Description: Proteomics is the large-scale study of proteomes within a biological system. Building on advances in mass spectrometry and data science, proteomics approaches offer powerful means of understanding cardiovascular diseases. Massive mass spectrometry datasets lie at the intersection of proteomics and data science. In this project, students will learn proteomics sample processing techniques and gain the knowledge of mass spectrometry needed to apply downstream data analysis to the study of cardiovascular diseases.

Project leaders: Dr. Ding Wang (dingwang@g.ucla.edu), Dr. Dominic Ng (dominicng@g.ucla.edu), Dr. Howard Choi (cjh9595@g.ucla.edu)

Education goals: Students will learn the fundamental concepts of mass spectrometry, become familiar with sample preparation protocols and the data acquisition workflow for MS-based proteomics, and learn how to extract MS data for downstream analysis.

Scientific goals: Introduce fundamental concepts of mass spectrometry and proteomics to students. After the training, the students will be able to distinguish between top-down and bottom-up approaches, understand standard proteomic applications in biomedical research, and know what information can be retrieved from proteomic datasets.

Bioinformatics Pipelines for Proteomics Data Analyses

Description: Bioinformatics tools, including the Integrated Proteomics Pipeline (IP2) and software packages developed in-house, are employed to characterize the properties of individual proteins at the proteome level in a high-throughput fashion. Publicly available knowledgebases (e.g., UniProt and Reactome) support proteomics data analyses and enable further data interpretation.

Project leader: Dr. Howard Choi (cjh9595@g.ucla.edu)

Education goals: Students will be introduced to several bioinformatics tools essential for proteomics data analyses. After the training, they will be able to independently utilize these resources to characterize biological variables of interest (e.g., Proteins, O-PTMs) from raw proteomics datasets.

Scientific goals: Understand the fundamental concepts and/or algorithms of these bioinformatics resources. Become comfortable applying bioinformatics tools to better characterize biological systems. Students should develop a data-driven mindset distinct from the conventional hypothesis-driven approaches that once dominated biomedical investigations.

O-PTM in Cardiovascular Biology and Medicine

Description: In a cardiac cell, the proteome consists of more than 200,000 proteins. Multiple proteins interact with each other to form a biological pathway. Each pathway performs a function and supports a cellular process. Changing the function of an individual protein may alter the function of the entire pathway. Post-translational modification (PTM) is a common mechanism regulating protein structure and function. Oxidative stress is a redox imbalance that occurs when the generation and accumulation of reactive oxygen species (ROS) exceed the endogenous antioxidant capacity of a living organism. It is often involved in the progression of cardiovascular diseases (CVD). Oxidative stress-sensitive post-translational modifications (O-PTMs) are typical features of proteins in human hearts; these O-PTMs are associated with healthy and/or diseased conditions.

Project leaders: Dr. Ding Wang (dingwang@g.ucla.edu), Dr. Dominic Ng (dominicng@g.ucla.edu), Dr. Howard Choi (cjh9595@g.ucla.edu)

Education goals: Oxidative stress biology: become familiar with common reactive oxygen species (ROS), ROS-generating enzymes, and antioxidants. O-PTMs: become familiar with 15 types of O-PTMs and know their amino acid (AA) targets and the resulting changes in m/z value. Extracting O-PTM signatures of proteins: obtain the components associated with a CV-relevant biological pathway, along with their identification, subcellular distribution, and O-PTMs (e.g., modification type, modification site, occupancy).

Scientific goals: Identify O-PTM changes unique to healthy and diseased conditions of human hearts. The similarities and differences between human and mouse protein homologues will be compared. These findings may offer opportunities to interpret phenotypic observations in human HF and mouse models under stress.
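
To make the link between a modification and its change in m/z value concrete, the sketch below divides a mass change by the ion's charge state. It assumes a single oxidation that adds one oxygen atom (monoisotopic mass about 15.9949 Da) as the example modification; the function name is hypothetical, and other O-PTMs would use their own mass changes.

```python
# Monoisotopic mass of one oxygen atom in daltons (mass added by a single oxidation).
OXYGEN_MASS_DA = 15.994915


def mz_shift(delta_mass_da, charge):
    """Return the m/z shift caused by a mass change on an ion with the given charge."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return delta_mass_da / charge


for z in (1, 2, 3):
    print(z, round(mz_shift(OXYGEN_MASS_DA, z), 4))
```

Because the shift shrinks as the charge grows, the same modification appears at different m/z offsets for differently charged ions of one peptide.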