Projects
Overview of Research Themes
Our overall research interest is to apply data science technologies to advance biomedicine and healthcare, including the creation of software tools and computational platforms that integrate multi-omics data, text-mine electronic health records, promote data annotation, and translate big data into knowledge. Our recent projects are dedicated to explaining cardiac death through ECG-based multimodal knowledge distillation with LLM-LM interpretation, and to an omics-based AI approach that elucidates the molecular insights underlying pathological phenotypes in atherosclerosis.
A major effort of our team has been creating best practices for trustworthy AI and computational frameworks that support the AI readiness of biomedical datasets, including information retrieval from knowledgebases, text-mining workflows, omics data integration platforms, and trustworthy LLMs.
Another research focus is developing a hierarchical graph neural network framework to study the personalized classification of subclinical atherosclerosis. We leverage two complementary aspects of each patient: clinical features (e.g., demographics, comorbidities) and molecular data (e.g., filtered omics features mapped onto patient-specific PPI subgraphs). Our aim is to enhance precision cardiovascular diagnostics and to identify disease subtypes by integrating each patient's molecular interaction signatures with clinical features that reflect cohort-level behavior in our predictive model.
Recent Projects
In this work, we study the problem of personalized classification of subclinical atherosclerosis by developing a hierarchical graph neural network framework that leverages two characteristic modalities of a patient: clinical features within the context of the cohort, and molecular data unique to the individual patient. Current graph-based methods for disease classification detect patient-specific molecular fingerprints but lack consistency and comprehension regarding cohort-wide features, which are essential for understanding pathogenic phenotypes across diverse atherosclerotic trajectories. Furthermore, understanding patient subtypes often considers clinical feature similarity in isolation, without integrating the pathogenic interdependencies shared among patients. To address these challenges, we introduce ATHENA: Atherosclerosis Through Hierarchical Explainable Neural Network Analysis, which constructs a novel hierarchical network representation through integrated modality learning; subsequently, it optimizes learned patient-specific molecular fingerprints that reflect individual omics data, enforcing consistency with cohort-wide patterns. With a primary clinical dataset of 391 patients [27], their respective transcriptomic signatures, and their STRING PPI profiles, we demonstrate that this heterogeneous alignment of clinical features with molecular interaction patterns significantly boosts subclinical atherosclerosis classification performance across various baselines, by up to 13% in area under the receiver operating characteristic curve (AUC) and 20% in F1 score. We further validated ATHENA on a secondary clinical dataset on atherosclerosis [30]. Taken together, ATHENA enables mechanistically informed patient subtype discovery through explainable AI (XAI)-driven subnetwork clustering; this novel integration framework strengthens personalized intervention strategies, thereby improving the prediction of atherosclerotic disease progression and the management of clinically actionable outcomes.
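To make the two-level design concrete, below is a minimal sketch (assuming PyTorch and PyTorch Geometric) of how a patient-specific PPI subgraph can be summarized into a molecular fingerprint and fused with cohort-level clinical features for classification. The module and feature names are illustrative stand-ins, not the published ATHENA implementation.

```python
# Illustrative two-level model in the spirit of ATHENA: a GNN summarizes each
# patient's molecular subgraph, and the resulting fingerprint is fused with
# clinical features before classification. Names and dimensions are hypothetical.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class PatientGNN(nn.Module):
    def __init__(self, n_omics_feats, n_clinical_feats, hidden=64, n_classes=2):
        super().__init__()
        self.conv1 = GCNConv(n_omics_feats, hidden)   # operates on the patient's PPI subgraph
        self.conv2 = GCNConv(hidden, hidden)
        self.classifier = nn.Sequential(              # fuses molecular and clinical views
            nn.Linear(hidden + n_clinical_feats, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x, edge_index, batch, clinical):
        # x: node (protein) features; edge_index: PPI edges; batch: per-node graph ids
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index).relu()
        fingerprint = global_mean_pool(h, batch)      # one molecular fingerprint per patient
        return self.classifier(torch.cat([fingerprint, clinical], dim=1))
```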
Foundation Models (FMs) are gaining increasing attention in the biomedical artificial intelligence (AI) ecosystem due to their ability to represent and contextualize multimodal biomedical data. These capabilities make FMs valuable tools for a variety of tasks, including biomedical reasoning, hypothesis generation, and the interpretation of complex imaging data. In this review paper, we address the unique challenges associated with establishing an ethical and trustworthy biomedical AI ecosystem, with a particular focus on the development of FMs and their downstream applications. We explore strategies that can be implemented throughout the biomedical AI pipeline to effectively tackle these challenges, ensuring that FMs are translated responsibly into clinical and translational settings. Additionally, we emphasize key stewardship and co-design principles that not only ensure robust regulation but also guarantee that the interests of all stakeholders, especially those involved in or affected by these clinical and translational applications, are adequately represented. We aim to empower the biomedical AI community to harness these models responsibly and effectively. As we navigate this exciting frontier, our collective commitment to ethical stewardship, co-design, and responsible translation will be instrumental in ensuring that the evolution of FMs truly enhances patient care and medical decision-making, ultimately leading to a more equitable and trustworthy biomedical AI ecosystem.
We present a deep-learning-based platform, MIND-S, for protein post-translational modification (PTM) prediction. MIND-S employs multi-head attention and graph neural networks, assembling a 15-fold ensemble model with a multi-label strategy to enable simultaneous prediction of multiple PTMs with high performance and computational efficiency. MIND-S also features an interpretation module, which provides the relevance of each amino acid to the prediction and is validated against known motifs. The interpretation module also captures PTM patterns without any supervision. Furthermore, MIND-S enables examination of mutation effects on PTMs. We document a workflow, its application to 26 types of PTMs across two datasets consisting of ∼50,000 proteins, and an example of MIND-S identifying a PTM-interrupting SNP, with validation from biological data. We also include use case analyses of targeted proteins. Taken together, we have demonstrated that MIND-S is accurate, interpretable, and efficient for elucidating PTM-relevant biological processes in health and disease.
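The multi-label, ensemble-averaged prediction strategy can be sketched as follows (assuming PyTorch). The SiteEncoder below is a simplified stand-in for the published attention/GNN architecture, and all dimensions are illustrative assumptions.

```python
# Sketch of multi-label PTM prediction with 15-fold ensemble averaging, in the
# spirit of MIND-S; SiteEncoder is a placeholder, not the actual model.
import torch
import torch.nn as nn

class SiteEncoder(nn.Module):
    def __init__(self, n_residue_feats=21, d_model=64, n_ptm_types=26):
        super().__init__()
        self.embed = nn.Linear(n_residue_feats, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, n_ptm_types)    # one logit per PTM type

    def forward(self, x):
        h = self.embed(x)                              # (batch, window_len, d_model)
        h, _ = self.attn(h, h, h)                      # self-attention over the window
        return self.head(h.mean(dim=1))                # pool over the sequence window

def ensemble_predict(models, x):
    # Multi-label: independent sigmoid per PTM type, averaged over the folds.
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(m(x)) for m in models])
    return probs.mean(dim=0)

models = [SiteEncoder() for _ in range(15)]            # 15-fold ensemble
window = torch.randn(8, 33, 21)                        # 8 sites, 33-residue windows
print(ensemble_predict(models, window).shape)          # -> torch.Size([8, 26])
```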
Motivation: The standard 3-lead electrocardiogram (ECG) is common and useful in many clinical settings but is limited in other scenarios; e.g., transient arrhythmias or dynamic heart rate variations are missed. In contrast, the Holter ECG continuously records the heart’s behavior during daily activities, rendering it effective for identifying irregular rhythms such as atrial fibrillation, bradycardia, and tachycardia. Furthermore, most ECG models for predicting cardiac outcomes are trained in isolation from clinical features. Our approach overcomes this shortcoming with a model architecture that integrates both ECG signals and clinical features.
Aim: To create a deep-learning-supported workflow that automates the characterization of Holter ECG recordings, i.e., to improve the prediction of cardiac death using 3-lead Holter ECG recordings. We present PULSE-KD (Predicting oUtcomes with Language and Signal Embeddings through Knowledge Distillation) – a multimodal deep learning framework that transfers clinical insights from a language module, trained on ECG impressions and clinical data, to an ECG module that uses raw signal inputs.
Methods: We preprocessed the dataset to include patients with known outcomes (cardiac death or survival) and available 3-lead Holter ECGs, resulting in 624 survivors, 85 cases of sudden cardiac death (SCD), and 104 cases of pump failure death (PFD). We then parsed the ECG impressions and clinical features into a prompt. We employed both large language models (LLMs) and language models (LMs): two LLMs (Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct) and two LMs (BioBERT and BioClinicalBERT) were combined to create our language module. We passed the prompt into the language module to obtain a pseudo-label distribution. We next used a 1D-CNN and an ECG encoder (1D-ResNet) to create our ECG module. During training, we passed the 3-lead Holter ECG recordings into the ECG module to obtain a corresponding pseudo-label distribution and compared its predictions to the true clinical diagnoses. For knowledge distillation, we applied a Kullback-Leibler divergence loss to align the two pseudo-label distributions. For evaluation, we passed only the 3-lead Holter ECG recordings through the trained ECG module to generate predictions.
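The distillation step can be sketched in a few lines (assuming PyTorch): the ECG module (student) is trained to match the language module's (teacher's) pseudo-label distribution via KL divergence, alongside the usual supervised loss. The temperature T and weighting alpha below are illustrative hyperparameters, not values reported for PULSE-KD.

```python
# Sketch of knowledge distillation with a KL-divergence loss; variable names
# and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between teacher and student distributions,
    # both softened by temperature T (the T*T factor rescales gradients).
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: cross-entropy against the true clinical outcome.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```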
Results: Our language module predicted patient outcomes with excellent accuracies (0.776-0.876) and AUCs (0.801-0.880), validating the language module as a suitable teacher for knowledge distillation. Trained without knowledge distillation, the ECG module performed poorly, with an AUC of 0.4212. With knowledge distillation, the language modules containing the Llama-3.2-3B-Instruct LLM improved the ECG module's AUC to 0.5074 (with BioClinicalBERT) and 0.5318 (with BioBERT).
The scale of biomedical knowledge, spanning scientific literature and curated knowledge bases, poses a significant challenge for investigators in processing, evaluating, and interpreting findings effectively. Large Language Models (LLMs) have emerged as powerful tools for navigating this complex knowledge landscape but may produce hallucinatory responses. Retrieval-Augmented Generation (RAG) is essential for identifying relevant information to enhance accuracy and reliability. This protocol introduces RUGGED (Retrieval Under Graph-Guided Explainable disease Distinction), a comprehensive workflow designed to support knowledge integration, to mitigate bias, and to explore and validate new research directions. Biomedical information from publications and knowledge bases is synthesized and analyzed through text-mining association analysis and explainable graph prediction models to uncover potential drug-disease relationships. These findings, along with the source text corpus and knowledge bases, are incorporated into a framework that employs RAG-enhanced LLMs to enable users to explore hypotheses and investigate underlying mechanisms. A clinical use case demonstrates RUGGED's capability in evaluating and recommending therapeutics for Arrhythmogenic Cardiomyopathy (ACM) and Dilated Cardiomyopathy (DCM), analyzing prescribed drugs for molecular interactions and potential new applications. The platform reduces LLM hallucinations, highlights actionable insights, and streamlines the investigation of novel therapeutics.
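The core retrieval-augmented step can be illustrated with a short sketch: ground the LLM's answer in retrieved passages rather than free generation. Here retrieve() and generate() are hypothetical stand-ins for the pipeline's vector search and LLM call; they are not RUGGED's actual API.

```python
# Minimal RAG pattern: fetch evidence, build a grounded prompt, generate.
# retrieve() and generate() are placeholder callables for illustration.
def answer(question, retrieve, generate, k=5):
    passages = retrieve(question, k=k)                 # top-k evidence from corpus/KG
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using ONLY the evidence below; cite sources as [n].\n\n"
        f"Evidence:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)                            # grounded response with citations
```

Constraining the model to the retrieved evidence, and requiring citations back to it, is what reduces hallucinations relative to unconstrained generation.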
The rapidly increasing, vast quantities of biomedical reports, each containing numerous entities and rich information, represent a valuable resource for biomedical text-mining applications. These tools enable investigators to integrate, conceptualize, and translate discoveries into new insights on disease pathology and therapeutics. In this protocol, we present CaseOLAP LIFT, a new computational pipeline to investigate cellular components and their disease associations by extracting user-selected information from text datasets (e.g., the biomedical literature). The software identifies subcellular proteins and their functional partners within disease-relevant documents; additional disease-relevant documents are identified via the software's label imputation method. To contextualize the resulting protein-disease associations and to integrate information from multiple relevant biomedical resources, a knowledge graph is automatically constructed for further analyses. We present one use case, with a corpus of ~34 million text documents downloaded online, to provide an example of elucidating the role of mitochondrial proteins in distinct cardiovascular disease phenotypes using this method. Furthermore, a deep learning model applied to the resulting knowledge graph predicted previously unreported relationships between proteins and diseases, yielding 1,583 associations with predicted probabilities >0.90 and an area under the receiver operating characteristic curve (AUROC) of 0.91 on the test set. The software features a highly customizable and automated workflow with a broad scope of raw data available for analysis; using this method, protein-disease associations can therefore be identified with enhanced reliability within a text corpus.
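As a toy illustration of the link-prediction step on the knowledge graph, a common approach scores candidate (protein, associated_with, disease) triples with a knowledge-graph embedding model. The DistMult scorer below is one standard choice shown for concreteness, not necessarily the model used in CaseOLAP LIFT.

```python
# Toy knowledge-graph link prediction for protein-disease association scoring.
import torch
import torch.nn as nn

class DistMult(nn.Module):
    def __init__(self, n_entities, n_relations, dim=128):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)       # proteins, diseases, etc.
        self.rel = nn.Embedding(n_relations, dim)      # edge types in the KG

    def score(self, head, relation, tail):
        # Higher score => the triple is more plausible.
        return (self.ent(head) * self.rel(relation) * self.ent(tail)).sum(-1)

model = DistMult(n_entities=10_000, n_relations=8)
s = model.score(torch.tensor([42]), torch.tensor([0]), torch.tensor([777]))
print(torch.sigmoid(s))                                # predicted association probability
```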
Temporal proteomics data sets are often confounded by the challenge of missing values. In a time-series context, these missing data points can introduce spurious fluctuations in measurements or omit critical events, hindering the ability to fully comprehend the underlying biomedical processes. We introduce a Data Multiple Imputation (DMI) pipeline designed to address this challenge in temporal turnover-rate quantification, enabling robust downstream analysis and novel discoveries. To demonstrate its utility and generalizability, we applied the pipeline to two use cases, a murine cardiac temporal proteomics data set and a human plasma temporal proteomics data set, both aimed at examining protein turnover rates. The DMI pipeline significantly enhanced the detection of protein turnover rates in both data sets; furthermore, the imputed data sets captured new representations of proteins, leading to an augmented view of biological pathways, protein complex dynamics, and biomarker-disease associations. Importantly, DMI outperformed single-imputation (DSI) methods on benchmark data sets. In summary, we have demonstrated that the DMI pipeline effectively overcomes the challenges introduced by missing values in temporal proteome dynamics studies.
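The conceptual difference from single imputation can be sketched as follows (assuming NumPy): impute the time series several times, fit the downstream quantity (here, a turnover rate from a log-linear decay fit) on each copy, then pool across imputations so the uncertainty of the missing values propagates into the estimate. The simple mean-plus-noise imputer and the decay model are placeholders, not the DMI method itself.

```python
# Conceptual multiple-imputation sketch: m imputed copies, one fit per copy,
# pooled estimate at the end. Assumes positive abundances for the log fit.
import numpy as np

def impute_once(ts, rng):
    out = ts.copy()                                    # ts: (n_timepoints, n_replicates)
    for j in range(out.shape[1]):
        col, miss = out[:, j], np.isnan(out[:, j])
        mu, sd = np.nanmean(col), np.nanstd(col)
        out[miss, j] = rng.normal(mu, sd, miss.sum())  # draw, don't just plug the mean
    return out

def pooled_turnover_rate(ts, timepoints, m=10, seed=0):
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(m):                                 # m imputed copies of the data set
        filled = impute_once(ts, rng)
        y = np.log(filled.mean(axis=1))                # mean abundance per timepoint
        k = -np.polyfit(timepoints, y, 1)[0]           # rate k from A ~ exp(-k t)
        rates.append(k)
    return np.mean(rates), np.std(rates)               # pooled estimate and its spread
```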
Existing machine learning methods for molecular (e.g., gene) embeddings are restricted to specific tasks or data modalities, limiting their effectiveness within narrow domains; as a result, they fail to capture the full breadth of gene functions and interactions across diverse biological contexts. In this study, we systematically evaluated knowledge representations of biomolecules in a task-agnostic manner across multiple dimensions, spanning three major data sources: omics experimental data, literature-derived text data, and knowledge graph-based representations. To distinguish meaningful biological signals from chance correlations, we devised an adjusted variant of Singular Vector Canonical Correlation Analysis (SVCCA) that quantifies signal redundancy and complementarity across data modalities and sources. These analyses reveal that existing embeddings capture largely non-overlapping molecular signals, highlighting the value of embedding integration. Building on this insight, we propose the Platform for Representation and Integration of multimodal Molecular Embeddings (PRISME), a machine-learning workflow that uses an autoencoder to integrate these heterogeneous embeddings into a unified multimodal representation. We validated this approach across various benchmark tasks, where PRISME demonstrated consistent performance and outperformed individual embedding methods in missing-value imputation. This new framework supports comprehensive modeling of biomolecules, advancing the development of robust, broadly applicable multimodal embeddings optimized for downstream biomedical machine learning applications.
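The integration idea can be sketched briefly (assuming PyTorch): concatenate the per-gene embeddings from each modality and train an autoencoder whose bottleneck serves as the unified multimodal representation. Architecture and dimensions below are illustrative assumptions, not the PRISME configuration.

```python
# Sketch of autoencoder-based integration of heterogeneous gene embeddings.
import torch
import torch.nn as nn

class EmbeddingAutoencoder(nn.Module):
    def __init__(self, in_dims=(256, 300, 128), d_latent=128):
        super().__init__()
        d_in = sum(in_dims)                            # omics + text + KG embedding dims
        self.encoder = nn.Sequential(
            nn.Linear(d_in, 512), nn.ReLU(), nn.Linear(512, d_latent)
        )
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, 512), nn.ReLU(), nn.Linear(512, d_in)
        )

    def forward(self, omics, text, kg):
        x = torch.cat([omics, text, kg], dim=1)        # stack the modality embeddings
        z = self.encoder(x)                            # unified multimodal embedding
        return self.decoder(z), z                      # train with MSE(recon, x); keep z
```

Because the decoder must reconstruct every modality from the bottleneck, the latent z is pushed to retain complementary signals from all three sources, which is also what makes it usable for missing-value imputation.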