Data Integration and Multimodal Health AI

Data integration underpins many recent AI advances in health, enabling solutions that generalize across broad populations and can also be tailored for precision medicine. A major example of this effort is the Columbia Brain Health Data Bank (CBDB), one of TRAIL4Health’s first flagship data initiatives. CBDB is a large-scale, longitudinal brain health data resource that links electronic health records, brain imaging, cognitive and functional assessments, biomarkers, and research cohort data across the cognitive spectrum. By creating infrastructure for secure, multimodal data integration, CBDB provides a foundation for studying brain aging and dementia over the life course and for developing AI methods that support earlier detection, disease trajectory modeling, and more actionable interventions.
In parallel with initiatives such as CBDB, TRAIL investigators are advancing both the infrastructure and the methodological foundations needed to integrate diverse data sources, including electronic health records (EHRs), genomic and other omics data, imaging, environmental data, and cohort studies. Our research develops AI and statistical methods that learn from these diverse information streams while addressing challenges such as health disparities, population heterogeneity, missing data, measurement error, distributed data systems, and transferability. This work supports more informed and holistic decision-making in health and helps drive better disease prevention, diagnosis, and treatment.
Featured work and publications
1. Huang X & Gu T (2025). EntroLLM: Leveraging entropy and large language model embeddings for enhanced risk prediction with wearable device data. In Proceedings of the AMIA Informatics Summit.
2. Gu T, Han Y, & Duan R (2024). Robust angle-based transfer learning in high dimensions. Journal of the Royal Statistical Society Series B: Statistical Methodology, qkae111.
3. Gu T, Lee PH, & Duan R (2023). COMMUTE: communication-efficient transfer learning for multi-site risk prediction. Journal of Biomedical Informatics, 137, 104243.
4. Gu T, Taylor JMG, & Mukherjee B (2023). A synthetic data integration framework to leverage external summary-level information from heterogeneous populations. Biometrics, 79(4), 3831-3845.

