2026 Biostatistics Practicum/APEx Symposium
The Biostatistics MS and MPH Programs require students to complete one semester of hands-on, real-world experience through a Practicum (for MS students) or Applied Practice Experience, or APEx (for MPH students).
This requirement gives students the chance to apply their classroom knowledge in a real-world public health setting, translating theory into practice. Each project is tailored to the student’s interests, skills, and career goals, and reflects a partnership with a public health organization or research initiative. It is the culmination of their academic journey at Columbia Mailman School of Public Health.
The Practicum/APEx Symposium is a showcase of this work. On Friday, April 24th, students will present their projects to faculty, peers, and the broader Columbia community. It’s a celebration of their achievements and a day of learning that spans the breadth and depth of Biostatistics in action.
We invite all CUIMC and Mailman School students, faculty, staff, and affiliates to join us in supporting and learning from the graduating Class of 2026.
You can view the titles and abstracts for each session below, organized by topic and room number, or in the digital program booklet.
Session 1 (1:00pm - 2:00pm)
Applications in Mental Health - HSC LL 204
Yujing Fu
Neuroimaging Evidence for Sundowning: Altered Time-of-Day Dynamics of Salience Network Connectivity in Alzheimer's Disease
Abstract: "Sundowning" describes worsening agitation and confusion in the late afternoon or evening in Alzheimer's disease (AD). Preclinical studies link this phenomenon to abnormal neural connectivity during the evening period. We examined whether resting-state functional connectivity (rsFC) of the Salience Network (SN) shows time-of-day alterations in humans using 1,311 rs-fMRI scans from 736 ADNI participants (397 cognitively normal, 248 mild cognitive impairment [MCI], 91 AD). SN rsFC was derived using the Power 264 atlas. Linear mixed-effects models tested diagnosis-by-scan-time interactions while adjusting for age, sex, head motion, hypertension, and sedative use. SN connectivity declined later in the day in cognitively normal and MCI participants (slopes −0.003 and −0.004), but increased in AD (slope 0.004). Edge analyses identified specific SN connections contributing to this interaction, and sex-stratified models suggested stronger effects in male AD participants. These findings translate preclinical signatures of sundowning to the human brain and suggest a network-level substrate for time-dependent neuropsychiatric symptoms in AD.
Su Hyun Kim
Baseline Symptom Dimensions Predict Antipsychotic Discontinuation and Side-Effect Risk in Schizophrenia
Abstract: Antipsychotic effectiveness varies widely in schizophrenia, with intolerable side effects and lack of efficacy driving treatment discontinuation. While symptoms are linked to functional impairment, their role in treatment persistence is unclear. Using Clinical Antipsychotic Trials of Intervention Effectiveness Phase 1 data (n=1460), we examined whether latent symptom dimensions predict discontinuation and side effect risk. Latent factors were derived from baseline clinical, sociodemographic, and neurocognitive measures via factor analysis. Side effect onset and time to discontinuation were modeled using regularized Cox proportional hazards models, and symptom trajectories via mixed-effects models. Higher baseline depressive (HR=1.19) and positive symptom (HR=1.20) factors predicted earlier discontinuation. The baseline depressive symptom factor was associated with increased hazards of autonomic (HR=1.13), extrapyramidal (HR=1.12), and sexual (HR=1.18) side effects. Among discontinuers, those with higher baseline depressive symptoms improved symptomatically but discontinued due to side effects. Baseline symptom factor structure may help personalize treatment planning.
Alice Zhou
Understanding Cognitive Control in the ABCD Stop-Signal Task Using Diffusion-Based Models
Cancer Research - HSC LL 109A
Yifei Chen
Forecasting Immuno-Oncology Utilization In China: Integrating Policy, Safety, And Market Signals
Abstract: The use of PD-1/PD-L1 inhibitors has expanded rapidly across multiple cancer indications in China, yet real-world utilization cannot be predicted based solely on clinical efficacy. Adoption is influenced by safety management, policy reforms, and market competition. This practicum integrates clinical evidence, safety signals (immune-related adverse events, irAEs), policy timelines, and competitive intelligence into a reproducible workflow to support market access assessment and utilization forecasting. Corporate pipeline indicators were quantified using publicly available reports, and major irAEs were prioritized by harmonizing terminology from PubMed and pharmacovigilance sources. A PRISMA-guided meta-analysis framework was developed for PD-L1–high first-line NSCLC trials including KEYNOTE-024 and EMPOWER-Lung 1. Results suggest that Junshi’s asset counts show a scale-up phase followed by a plateau. Key irAEs include thyroid dysfunction, pneumonitis, colitis, hepatitis, rash, and fatigue. These outputs define covariates that can inform time-series forecasting of IO utilization.
Alice Mao
Dairy Product And Calcium Intake And Colorectal Cancer Risk In A Pooled Analysis Of Prospective Cohort Studies
Abstract: Dairy products and calcium intake have been associated with a reduced risk of colorectal cancer (CRC) in several epidemiologic studies, but evidence across large international populations remains limited. This study will examine the associations of dairy intake, and of dietary and supplemental calcium and vitamin D intake, with CRC risk using data from the Pooling Project of Prospective Studies of Diet and Cancer (DCPP). The analysis will include 21 prospective cohort studies with 1,794,810 participants and 19,835 incident CRC cases. Diet was assessed using validated food frequency questionnaires, and energy-adjusted calcium and vitamin D intake was calculated using the residual method. Study-specific multivariable Cox proportional hazards models will estimate hazard ratios and 95% confidence intervals for CRC across levels of dairy, calcium, and vitamin D intake, with estimates pooled using random-effects models. Models will adjust for demographic, lifestyle, medical, and dietary factors including BMI, smoking, alcohol intake, physical activity, diabetes, and key dietary variables. Associations will also be evaluated across CRC subtypes and among individuals with higher familial risk.
Fikrianti Surachman
Budget Impact And Cost-Effectiveness Analysis Of HPV-DNA Test For Workplace Cervical Cancer Screening In Indonesia
Abstract: Cervical cancer is the second leading cause of cancer mortality among women in Indonesia. HPV-DNA testing offers higher sensitivity and longer screening intervals than the standard Pap smear. A budget impact and cost-effectiveness analysis was conducted for HPV-DNA screening strategies from an employer’s perspective. Adapting the R code and framework from the DARTH workgroup, a hybrid decision tree and cohort state-transition model was constructed to simulate 1,000 female employees over their working lifespan. Deterministic one-way sensitivity analysis was performed to evaluate parameter uncertainty. Results indicate that both HPV-DNA urine and vaginal tests are cost-saving compared to Pap smear over a 5-year budget horizon. Over 29 years, Pap smear was strictly dominated by both HPV-DNA strategies. The vaginal test provided greater clinical benefit than the urine test at a higher incremental cost, and the conclusions remained robust across key parameter bounds. To facilitate employer procurement decisions, the model is being translated into a Shiny application, allowing dynamic adjustment of input values.
Tianci Zhu
Epidemiology and Prognosis of Synchronous Brain Metastases in Lung and Breast Cancer: A SEER Population-Based Study
Abstract: Brain metastases (BM) are a serious complication in cancer patients and are particularly common in lung and breast cancers. When BM are present at the time of diagnosis, they may substantially affect treatment strategies and survival outcomes. However, population-level evidence describing the incidence and prognosis of synchronous BM remains limited. In this study, we use data from the SEER Research Plus database to examine the epidemiology and survival outcomes of synchronous BM in patients with newly diagnosed lung and breast cancers. Adult patients with lung or breast cancer will be identified, and the presence of BM at diagnosis will be determined using SEER metastatic site variables. We will estimate the proportion of patients presenting with synchronous BM and compare demographic characteristics between patients with and without BM. Survival outcomes will be evaluated using Kaplan-Meier methods and log-rank tests, and multivariable Cox proportional hazards models will identify independent prognostic factors among BM patients. This study provides population-based evidence on the burden and prognosis of synchronous BM and highlights potential demographic disparities in survival.
Causal Inference and Hypothesis Testing - HSC LL 205
Fangchi Lu
Causal Inference for Exposure-Adjusted Crash Risk from Real and Synthetic Data
Abstract: This project investigates how short-term external shocks may influence traffic crash risk after accounting for exposure and examines causal inference methods for estimating such effects in observational data. The study combines empirical analysis with simulation-based methodological evaluation. We construct a borough-day panel dataset for New York City in 2025 Q1 using publicly available crash records and contextual variables. The real dataset is used to define key variables, characterize treatment variation (e.g., extreme weather events), and estimate baseline relationships between shocks and crash risk using panel models with spatial and temporal controls. Building on this structure, we develop a synthetic data-generating process reflecting key features of the observed data, including exposure levels, treatment assignment, and potential confounding. The framework enables systematic evaluation of causal estimators across repeated simulations and aims to illustrate how simulation calibrated to real data can support evaluation of causal inference methods in applied risk analysis.
Jianming Wang
Patient Photograph Characteristics of Landmarks and Retract-and-Reorder Events in Electronic Health Records
Abstract: Wrong-patient orders can be identified using the retract-and-reorder (RAR) algorithm, in which an order placed for one patient is retracted and reordered for another; prior studies suggest RAR events account for a substantial proportion of such errors. Objective: To examine whether characteristics of patient photographs are associated with wrong-patient order events. Materials and Methods: We analyzed EHR orders placed at NewYork-Presbyterian (NYP) from 2014 to 2024, including patient photograph features displayed at the time of order placement. Facial landmarks were used to derive geometric features describing facial structure and positioning. Exploratory data analysis and statistical hypothesis testing compared photo-related features between regular orders and retract-and-reorder (RAR) events. Results: Facial landmark-derived features differed significantly between regular orders and RAR events. These features were significantly associated with wrong-patient order events and showed notable effect sizes. Outcome: Facial landmark-derived features showed meaningful associations with wrong-patient order events.
Mingyin Wang
Has Congestion Pricing Improved Road Safety? A Case Study in New York City
Abstract: In January 2025, NYC implemented the Central Business District Tolling Program, a cordon-based congestion pricing policy. While its effects on traffic volume are documented, road safety impacts remain underexplored. This study evaluates the program's short-term effects on traffic collisions and injury rates using a monthly panel dataset of zip code-level crash data from January 2024 to December 2025. We employ difference-in-differences, matched difference-in-differences, generalized synthetic control, generalized additive models, and event study designs to estimate changes in injury rates per 10,000 residents and total crash counts. Across all specifications, we find no statistically significant or sustained reduction in traffic injuries or collisions following implementation. While the event study indicates a transient decline in crashes immediately upon implementation, this effect dissipated in subsequent months. These findings suggest congestion pricing alone may not yield immediate safety benefits in complex urban environments and likely requires complementary interventions, such as infrastructure upgrades, to achieve sustained reductions in traffic harm.
Shiyu Zhang
AI-Inferred Photo Quality and Patient Characteristics Associated With Retract-and-Reorder Events in Electronic Health Records
Abstract: Objective: To examine whether AI-inferred photo quality and patient characteristics available during EHR order entry are associated with retract-and-reorder (RAR) events, an established proxy for wrong-patient electronic ordering. Materials and Methods: Using the Photo AI RAR dataset, we analyzed EHR orders placed at NewYork-Presbyterian (NYP) from 2014 to 2024 and linked order-entry workflow records to automated, AI-extracted attributes from patient photographs displayed at the time of ordering (e.g., inferred patient characteristics, eyewear, masking/occlusion, pose, and face-geometry features). We collapsed multiple order records within each document_id to a single analytic record, labeling it positive if any contained order had is_rar_event=1, and excluding observations with missing RAR (no eligible orders in window). We performed exploratory data analysis, hypothesis testing, and logistic regression to assess associations between AI-derived photo attributes and RAR events. Conclusions: Integrating EHR workflow data with AI-inferred photograph attributes may enable scalable, hypothesis-generating assessment of factors associated with wrong-patient ordering risk.
Clinical Trials - HSC LL 109B
Shumei Liu
Temporal and Structural Analysis of Clinical Pharmacology Research Patterns in Chagas Disease and Non-Chagas Drugs
Abstract: Clinical pharmacology studies are a key component of drug development and regulatory science, but their distribution across disease areas and approval stages remains incompletely described. Regulatory and clinical trial records from multiple evidence sources, including FDA databases and trial registries, were integrated and standardized in SAS to characterize clinical pharmacology study patterns related to Chagas disease and to compare them with those of non-Chagas drugs. Analyses examined study number, study type, trial phase, study duration, site distribution, sponsor structure, and patterns before and after FDA approval. Distinct differences were observed across disease groups and regulatory stages, indicating variation in the scale, timing, and organization of clinical pharmacology research. These findings support the value of structured multi-source data integration for regulatory landscape analysis and suggest that disease focus and approval timing may influence clinical research strategy.
Kimberly Palaguachi-Lopez
Statistical Modeling of Longitudinal Kidney Function Outcomes in Diabetes: A Mixed-Effects Model Comparison
Abstract: Recent advances in cardiometabolic therapies have expanded treatment options to reduce cardiovascular and kidney complications among individuals with diabetes. However, some diabetes populations remain at elevated risk of cardiorenal disease progression. This project analyzed data from a randomized clinical study evaluating the effects of a metabolic therapeutic agent on longitudinal markers of kidney function in adults with diabetes. The primary analysis followed an intention-to-treat framework and assessed treatment effects on repeated kidney function measurements over a six-month follow-up period. Traditional longitudinal approaches, including ANCOVA and linear mixed models (LMM), were used to account for correlated repeated measures and estimate treatment effects over time. As an ancillary methodological analysis, generalized additive mixed models (GAMM) were explored as a flexible alternative to account for potential nonlinear physiological hemodynamic responses to therapy. This comparison highlights the potential value of GAMM as a complementary approach to conventional LMM for understanding treatment-related changes over time in longitudinal clinical trial data.
Shiying Wu
Development and Integration of an Exploratory R Shiny App and RENOVATE Data Processing for SIRP-α
Abstract: This project focused on developing and enhancing the SIRP-α exploratory R Shiny application to support interactive review of multiple clinical trials. The work involved integrating and validating modules for safety, demographics, biomarkers, and clinical timelines using CDISC SDTM datasets. New trials (1501_0001 and 1501_0002) were added and validated within the application, which also incorporates existing 1443 program trials. During development, I identified workflow and functionality gaps and provided recommendations to the DaVinci development team for future improvements. In addition, a pooled dataset (1501_P02) was created by combining data from study 1501_0001 with selected subjects from 1443_0003 who received the same target drug, enabling exploratory cross-study analysis within the application. Furthermore, response datasets for studies 1501_0001 and 1501_0002 were prepared and formatted to be compatible with RENOVATE, an existing efficacy visualization tool, ensuring consistent data structures and enabling downstream efficacy analyses across trials.
Jingyan Yu
Statistical Analysis Of A Phase 3 Clinical Trial Evaluating Non-Inferiority Of A Semaglutide Injection In Type 2 Diabetes
Abstract: This practicum project analyzes data from a multi-center Phase 3 randomized clinical trial evaluating the efficacy, safety, and pharmacokinetic characteristics of a domestically developed semaglutide injection (CA505) compared with the originator product Ozempic in adults with type 2 diabetes mellitus. The primary objective is to assess the non-inferiority of CA505 relative to Ozempic with respect to change in glycated hemoglobin (HbA1c) from baseline to week 32. Secondary objectives include evaluating safety and tolerability through treatment-emergent adverse events and comparing pharmacokinetic parameters across treatment groups. Statistical analyses follow the trial’s Statistical Analysis Plan and the ICH E9(R1) estimand framework, generating tables, figures, and listings to support regulatory decision-making in pharmaceutical clinical trials. Methods include analysis of covariance and mixed models for repeated measures for continuous outcomes, logistic regression for categorical endpoints, and sensitivity analyses for missing data using multiple imputation and tipping point approaches.
Data Analysis - HSC LL 210
Zhengyong Chen
Association Between Modified Sepsis Bundle Adherence and Clinical Outcomes in Pediatric Patients with Phoenix Sepsis
Abstract: Pediatric sepsis is associated with substantial morbidity and mortality. Although adherence to sepsis care bundles has been linked to improved outcomes, the impact of modified bundled care remains unclear in the newly defined Phoenix sepsis population. This practicum focused on the statistical analysis of a retrospective cohort study evaluating the association between modified bundle adherence and clinical outcomes among pediatric patients meeting Phoenix sepsis criteria. Co-primary outcomes were hospital-free days and in-hospital mortality. Multivariable linear and logistic regression models were used to estimate the associations adjusting for prespecified covariates. In adjusted analyses, modified bundle adherence was associated with more hospital-free days and lower odds of in-hospital mortality, although neither association reached statistical significance.
Adeena Moghni
Building An Interactive R-Shiny Platform To Compare Pharmacokinetic Data Of Antibody Drug Conjugates In Preclinical Models
Abstract: Antibody Drug Conjugates (ADCs) combine the specificity of antibodies with the potency of small drugs to deliver cytotoxic chemotherapeutics to tumor cells. ADCs have complex design components, including linker chemistry, linker sites, and drug-to-antibody ratio, that make their pharmacokinetic (PK) behavior hard to predict. The goal of this project was to analyze ADC PK data to better understand the relationship between design components and PK, and to develop an interactive visualization platform in R to aid this analysis. A generalized linear mixed-effects model was used to test the associations. Other analyses included correlation analysis, ANOVA testing, cross-species comparisons, and comparisons at different linker site combinations. The interactive dashboard allowed users to dynamically filter the data and select the variables of interest, generating plots, heat maps, and statistical summary tables in real time. The platform thus supported exploratory analysis of ADC PK data, which helped in the experimental design of future studies. A future goal for the dashboard is to fully automate data import so that it stays continuously updated.
Boxiang Tang
Adaptive Closed-Loop taVNS System for Physiology-Timed Neuromodulation
Abstract: This practicum project investigated a multimodal closed-loop transcutaneous auricular vagus nerve stimulation (taVNS) system designed to study physiology-timed neuromodulation. The system integrated synchronized EEG, ECG, respiration, and pupillometry recordings to evaluate how stimulation timing relative to internal physiological rhythms influences neural and autonomic responses. An initial pilot study compared fixed-parameter stimulation, sham stimulation, and respiration-triggered closed-loop stimulation in a small healthy-participant cohort. While EEG responses were heterogeneous, heart rate variability (HRV) measures suggested stronger autonomic modulation under respiration-timed stimulation. These observations motivated a redesigned experimental framework that decomposed respiration–stimulation coupling rules into multiple paradigms while keeping the base stimulation waveform constant. Multimodal analyses indicated that different coupling rules produced distinct physiological response patterns across EEG, HRV, and pupil measurements.
Jiawa Zhang
Mixed-Effects Models Enhance External Validation in Multi-Institutional Prediction Studies
Abstract: The external validity of clinical and epidemiological prediction models developed on multi-institutional data is often evaluated on unobserved clusters. While linear mixed models (LMMs) are commonly used to handle clustered data, their theoretical and empirical advantage over simpler linear models (LMs) for prediction in new clusters remains underexplored. Here we formally justify the superiority of LMMs in this setting. We derive risk bounds comparing the mean squared prediction error (MSPE) of LMMs and LMs for external cluster prediction. Our results show that the LMM fixed-effects estimator is inherently more efficient (lower variance) than the LM estimator when cluster-level heterogeneity is properly modeled. This efficiency gain ensures a lower MSPE for LMM predictors in any new cluster. Simulations further demonstrate that the LMM advantage is maximized in studies with high between-cluster variance and unbalanced cluster sizes. These findings clarify the fundamental basis for using LMMs in generalizability tasks and provide practical guidance for designing robust, cross-institutional prediction models. The approach is further illustrated using a real data application.
Data Visualizations - HSC LL 209A
Bruce Liu
Enhancing Inclusive Digital Communication Through AI-Powered Video Generation: A Product Management Practicum At Mininglamp Technology
Abstract: During my practicum as a Product Manager at Mininglamp Technology in Beijing, China, I contributed to the development of a cutting-edge AI-powered video editing platform. The platform's core objective is to democratize content creation by leveraging machine learning algorithms to auto-generate videos from text, audio, and image inputs. To ensure the product effectively met user needs, I applied robust data analysis techniques to evaluate user behavior and platform engagement metrics. By employing Python and SQL, I queried product data and analyzed results from A/B testing to guide iterative feature improvements. I also designed data visualizations, such as user engagement dashboards and heatmaps, to present usability insights to cross-functional engineering and design teams. This data-driven approach directly informed the development of accessibility features, such as multilingual subtitle generation and voice synthesis. Ultimately, this work demonstrates how machine learning and advanced data analysis can cultivate inclusive communication tools, presenting significant potential for broader applications in public health education, awareness campaigns, and community storytelling.
Kunlun Liu
Development of a novel nucleic acid aptamer–drug conjugate (ApDC) based on MMAE
Abstract: Aptamer-drug conjugates (ApDCs) are targeted cancer immunotherapies designed to deliver cytotoxic payloads selectively to tumor cells. The "guidance system" in ApDCs is a nucleic acid aptamer. Monomethyl auristatin E (MMAE) is an anti-tubulin agent that is frequently used as a cytotoxic payload in ADCs for cancer therapy. A novel ApDC termed “Multi” was generated by conjugating MMAE to an aptamer. The transcriptional effects of Multi were investigated using RNA-sequencing data from treated tumor cells. The RNA-sequencing data were analyzed to identify transcriptional differences between MMAE- and Multi-treated tumor cells, particularly in the biological functions and signaling pathways affected. Principal component analysis showed clear separation between the two groups. Differential expression analysis identified 270 differentially expressed genes. Gene Ontology (GO) enrichment analysis revealed significant enrichment in signaling-related pathways, cytokine activity, and lipid metabolic processes. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis further indicated enrichment of genes associated with cytokine signaling and metabolic regulation.
Chloe Nguyen
GLP-1 Medications and Liver Health Outcomes in iCHANGE Patients
Abstract: Glucagon-like peptide-1 receptor agonists (GLP-1s) are effective medications for managing type 2 diabetes and promoting weight loss. Two popular medications are used as treatment: Semaglutide (popularly known as Ozempic) and Tirzepatide. My APEx involved working with iCHANGE at Weill Cornell Medicine. I collected data on PEth values of iCHANGE patients to help analyze whether GLP-1s are associated with a decrease in alcohol consumption. Additionally, I collected data and performed linear regression on liver stiffness across different medication groups (Semaglutide, Tirzepatide, and Semaglutide or Tirzepatide with Resmetirom). The goal was to test whether GLP-1s alone or GLP-1s combined with Resmetirom were the better treatment for improving liver health.
Zexuan Yan
A Biostatistical Analysis of the Little Free Library Program
Abstract: My Applied Practice Experience with H.E.A.L.T.H for Youths focused on evaluating the Little Free Library program, a community-based intervention aimed at improving book access and promoting literacy among residents in New York City. Using pre- and post-installation survey data, I conducted a quantitative analysis to assess the program's impact on reading habits and community engagement. My work involved cleaning and managing the dataset, applying descriptive statistical methods to calculate key metrics such as usage frequency and self-reported changes in reading behavior, and interpreting the results to draw conclusions about program effectiveness.
Environmental Health Research - HSC LL 201
Yuhao Chang
Bayesian Kernel Machine Regression Analysis of Prenatal Metal Mixtures and Child Neurodevelopment
Abstract: Evaluating the health effects of environmental mixtures is challenging because individuals are exposed to multiple correlated chemicals simultaneously. This practicum project applied Bayesian Kernel Machine Regression (BKMR) to assess the joint effects of prenatal exposure to four metals (lead, mercury, cadmium, and manganese) on early childhood neurodevelopment in a birth cohort study conducted in Suriname. Neurodevelopment was measured using the Bayley Scales of Infant and Toddler Development, Third Edition (BSID-III). BKMR models were used to estimate overall mixture effects based on interquartile range (IQR) contrasts, evaluate univariate exposure–response functions, and explore potential interactions using bivariate exposure–response surfaces. The analytic dataset included 708 mother–child pairs. Results indicated a positive association between prenatal mercury exposure and gross motor scores, while estimated effects for other metals were small and not statistically significant. This project demonstrates the use of BKMR for evaluating nonlinear and joint effects of environmental mixtures on child neurodevelopment.
Ila Kanneboyina
Evaluating Prenatal Metal Exposures And Birth Length Using Negative Control Analysis And Mixture Modeling
Abstract: Prenatal exposure to environmental metals has been associated with adverse birth outcomes, yet causal inference in observational studies is complicated by correlated exposures and potential unmeasured confounding. This study examined associations between prenatal arsenic, manganese, and lead exposure and infant birth length in a cohort of pregnant women in Bangladesh. Multivariable linear regression models adjusting for demographic and maternal health factors were used to estimate associations. To assess potential residual confounding, we conducted a negative control exposure analysis using metal concentrations measured in children after birth, which should not causally influence fetal growth. Bayesian Kernel Machine Regression (BKMR) was also applied to evaluate mixture effects and potential nonlinear exposure–response relationships. Regression models indicated that higher prenatal manganese exposure was associated with shorter birth length. However, the negative control analysis revealed an association between post-pregnancy arsenic levels and birth length, suggesting possible residual confounding.
Miriam Lachs
Dietary nutrients associated with increased or reduced risk of amyotrophic lateral sclerosis
Abstract: Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease with limited treatment options. This study aimed to identify dietary risk factors to inform people with ALS (PALS) and individuals at high risk. Data from the ARREST and COSMOS studies were analyzed. Dietary intake was assessed using the Modified Block Food Frequency Questionnaire (FFQ). Nutrient and food intake were compared between healthy controls in ARREST and PALS in ARREST and COSMOS cohorts, adjusting for age, sex, body mass index (BMI), and total caloric intake using linear and logistic regression. Associations between nutrient intake and change in the revised ALS Functional Rating Scale (ALSFRS-R) from baseline to 6-month follow-up were examined with linear regression among PALS. In ARREST, higher intake of riboflavin, retinol, arginine, cystine and dairy was associated with greater ALS odds, while lutein/zeaxanthin and vegetable intake were linked to lower odds. Among PALS in ARREST and COSMOS, greater intake of retinol, vitamin A, arginine, methionine, lactose, total choline, and dairy was associated with faster ALSFRS-R decline.
Thomas Tang
Integrating High-Resolution SO₂ Exposure Data with Pediatric Cohort Data for Environmental Health Research
Abstract: This internship focused on processing and analyzing large-scale environmental exposure data for epidemiological research. Geocoded pediatric health data were linked with daily sulfur dioxide (SO₂) concentration data from the CHAP (China High Air Pollutants) dataset based on geographic coordinates. Using R packages such as sp, raster, and dplyr, raster-based SO₂ data from 2014–2019 were extracted to obtain daily exposure levels for each participant location. For infants younger than one year, average SO₂ exposure during the gestational period was estimated using conception dates derived from gestational age and birth dates. For older children, mean SO₂ exposure was calculated over retrospective windows of 3, 6, and 12 months prior to cognitive testing. The workflow included preprocessing large raster datasets, handling missing spatial coverage, and optimizing computational efficiency. The final dataset provides temporally aligned SO₂ exposure metrics for epidemiological analysis of environmental impacts on early childhood health outcomes, demonstrating a reproducible framework for integrating environmental data with individual-level health records.
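The exposure windows described in this abstract can be illustrated with a minimal sketch. It is written in Python rather than the R packages the project used, and it is not the project's code. The function names, the 30-day month approximation, and the treatment of conception as birth date minus gestational age are all assumptions made for illustration.

```python
from datetime import date, timedelta
from statistics import mean


def conception_date(birth_date, gestational_age_weeks):
    """Approximate conception date as birth date minus gestational age.

    Assumption: a simple subtraction; the project may have used a
    different clinical convention.
    """
    return birth_date - timedelta(weeks=gestational_age_weeks)


def window_mean(daily, start, end):
    """Mean of daily values for start <= day < end, skipping missing days.

    `daily` maps date -> concentration (or None where coverage is missing).
    Returns None when the window has no usable values.
    """
    values = [v for d, v in daily.items() if start <= d < end and v is not None]
    return mean(values) if values else None


def gestational_exposure(daily, birth_date, gestational_age_weeks):
    """Average exposure from estimated conception up to (not including) birth."""
    start = conception_date(birth_date, gestational_age_weeks)
    return window_mean(daily, start, birth_date)


def retrospective_exposure(daily, test_date, months):
    """Average exposure over the `months` months before a test date.

    Assumption: one month is approximated as 30 days.
    """
    start = test_date - timedelta(days=30 * months)
    return window_mean(daily, start, test_date)


# Hypothetical daily series for one participant location (values are made up).
daily_so2 = {date(2019, 1, 1) + timedelta(days=i): float(i) for i in range(365)}
windows = {m: retrospective_exposure(daily_so2, date(2019, 12, 31), m) for m in (3, 6, 12)}
```

Returning None for an empty window keeps locations with missing spatial coverage visible downstream instead of silently treating them as zero exposure.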
Epidemiology - HSC LL 202
Malika Top
Associations between Adverse Childhood Experiences and Late-Life Brain Volumes
Abstract: Adverse childhood experiences (ACEs) have been linked to poorer health outcomes across adulthood, but whether they produce measurable structural brain differences in late life remains unclear. Neurodegenerative diseases can progress silently for years before clinical symptoms emerge, so early-life exposures may be important risk factors. Data from the Adult Changes in Thought (ACT) Study, a prospective cohort of adults aged 65 and older, are used to estimate associations between ACEs and MRI-derived volumetric outcomes. The analytic sample includes 1,342 participants with ACEs survey data, 438 of whom have linked MRI data. Selection into MRI scanning is nonrandom, requiring inverse probability weights to improve generalizability to the full ACEs-eligible cohort, with separate models fitted to predict the probability of scan receipt and scan type. Associations are estimated from weighted linear regression models, adjusting for sex, age at scan, race/ethnicity, and APOE4 genotype. Findings will further characterize the complex relationship between early adversity and later-life brain health in a cohort that has not been previously studied.
Wanlin Wu
Dietary Strategies Against Alzheimer’s Disease Risk Based On Machine Learning
Abstract: Alzheimer’s disease (AD) is a major cause of death among older adults, and diet may influence the risk of AD. However, the specific food groups related to AD mortality remain unclear. This study aims to identify dietary factors associated with AD mortality and translate them into dietary guidance. Data are obtained from the National Health and Nutrition Examination Survey 2011–2018 linked with National Death Index mortality records. The study population includes U.S. adults aged 40 years and older with available dietary recall and mortality data. Dietary intake from 24-hour recalls is aggregated into food groups based on the What We Eat in America classification system. Food-wide association analysis is conducted using survey-weighted Cox proportional hazards models to estimate associations between food group intake and AD mortality. False discovery rate correction is applied for multiple testing. Machine learning is then used to rank influential food groups. Based on these results, a diet score is constructed to classify foods as adequacy, restriction, or moderation components. This provides an interpretable approach for generating practical dietary suggestions.
Meitong Zhou
Language Characteristics and Cognitive Performance in a Multi-Ethnic Cohort: Evidence from MESA
Abstract: Language experience has been proposed as a predictor of later-life cognition, but prior studies may not have fully accounted for education and related social factors. Using data from the Multi-Ethnic Study of Atherosclerosis (MESA), we examined associations between language characteristics and cognitive performance, with CASI as the primary outcome and Digit Symbol Coding and Digit Span as secondary outcomes. Linear regression models were used to assess current cross-sectional associations, adjusting for age, sex, race/ethnicity, nativity, and education. Language characteristics were not significantly associated with CASI after adjustment. Some associations were observed for Digit Span outcomes, suggesting that language experience may relate to specific cognitive domains rather than global cognition. These associations weakened after adjustment for education, while models without education showed stronger associations, suggesting confounding by education. Because language variables had substantial missingness that varied by age and education, future analyses will use longitudinal models and inverse probability weighting to address potential bias.
Internship Projects - HSC LL 207
Jingtong He
SAS Statistical Software Development And Cross-Validation
Abstract: This practicum was conducted at the Peking University Institute of Information Technology, focusing on SAS statistical software development and cross-validation for survival analysis methods. The primary responsibility was to derive statistical formulas for closed-source SAS procedures, verify results manually, and explain findings to development engineers. Core work centered on survival analysis using the Cox proportional hazards model, including deriving partial likelihood functions, score and Hessian matrices, and researching hazard ratio estimation and Type 3 hypothesis testing. Newton-Raphson iteration and diagnostic methods were implemented in R to replicate key SAS procedures such as PROC PHREG, NLIN, and LIFETEST, with all results cross-validated between SAS and R. The practicum also involved researching IQ/OQ/PQ validation frameworks and customizing the SAS Clinical Standards Toolkit for CDISC SDTM 3.2 validation.
Yizhe Li
Operational Strategy and Industry Research in China’s Pharmaceutical Sector: Internship at China Resources Pharmaceutical Group
Abstract: During my internship in the Operations Management Department of China Resources Pharmaceutical Group, I participated in strategic research, operational analysis, and public health-related initiatives that gave me a comprehensive perspective on how large healthcare enterprises coordinate investment decisions, industry research, and social responsibility programs. A major part of my work was conducting industry research and situational analyses of pharmaceutical and healthcare companies in China. I organized and summarized industry reports and benchmarked Chinese pharmaceutical companies on financial metrics, R&D investment, and overseas strategic advancement plans. I also contributed to internal operations and investment projects by carrying out desk research and collating information from multiple departments. Finally, I took part in the China Resources Rural Health Move Actions Program, a public welfare program that works to create equal access to health care in underdeveloped rural areas.
Ben Nguyen
Tracking Physical And Psychiatric Comorbidities In Opioid Use Disorder Patients
Abstract: Opioid use disorder (OUD), the chronic use of opioids that causes a person significant distress or impairment, is a major public health crisis and epidemic in the United States, affecting roughly 2.7 million Americans. OUD is often exacerbated by coexisting physical and psychiatric comorbidities. Using pharmacy and medical claims, my team and I sought to identify the most common physical and psychiatric comorbidities and their prevalence. We also looked at patients' adherence rates to medications for opioid use disorder (MOUD) and the types of medications patients were prescribed, with a specific emphasis on methadone treatments. We conducted diagnosis- and prescription-based analyses in SQL, segmenting patients by demographic, clinical, and treatment characteristics of interest and examining trends and patterns. Afterwards, we built a propensity model using Python and SQL to predict the probability of treatment for patients diagnosed with OUD for whom we see no MOUD claims but who may still have treatment activity not captured by open claims.
Yuhao Zhang
PCoA And NMDS Analysis For Microbiome Beta Diversity
Abstract: During my internship at OE Biotech, I assisted in maintaining a bioinformatics platform for microbiome and multi-omics data analysis. My work focused on multivariate statistical modules, particularly Principal Coordinates Analysis (PCoA) and Non-metric Multidimensional Scaling (NMDS), which are widely used to assess beta diversity and visualize differences in microbial community composition across experimental groups. Using R-based analytical pipelines, I helped process abundance tables, distance matrices, and sample metadata. The platform supported distance-based analyses using Bray–Curtis, Jaccard, and Euclidean metrics and generated ordination plots to explore similarity patterns among samples. I also processed group comparison outputs based on the ANOSIM algorithm and ensured that statistical results were correctly integrated into visualization reports. In addition, I organized over 20 technical issues during platform operation and collaborated with developers to improve system performance, contributing to a 1.5% increase in 7-day user retention.
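As an illustration of the distance step that precedes PCoA or NMDS, the sketch below computes a Bray–Curtis dissimilarity matrix from an abundance table. It is a plain-Python sketch rather than the platform's R pipeline; the function names and the sample counts are hypothetical.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors.

    Defined as sum(|u_i - v_i|) / sum(u_i + v_i); 0 means identical
    composition and 1 means no shared taxa.
    """
    if len(u) != len(v):
        raise ValueError("abundance vectors must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    # Assumption: two all-zero samples are treated as identical.
    return numerator / denominator if denominator else 0.0


def distance_matrix(table):
    """Pairwise Bray-Curtis matrix for a {sample_id: abundance_vector} table."""
    ids = sorted(table)
    return ids, [[bray_curtis(table[i], table[j]) for j in ids] for i in ids]


# Hypothetical abundance table: three samples, three taxa.
abundance = {
    "sample_A": [10, 0, 5],
    "sample_B": [5, 5, 5],
    "sample_C": [0, 8, 0],
}
sample_ids, dist = distance_matrix(abundance)
```

The resulting symmetric matrix is the kind of input an ordination method takes; the ordination itself is not shown here.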
Longitudinal Data Analysis - HSC LL 203
Hanchuan Chen
Association Between Sleep Disorder and Cognitive Impairment
Abstract: Sleep disorders frequently co-occur with cognitive impairment, forming a complex cluster of chronic comorbidities that substantially affects patient health and quality of life. In clinical settings, disturbances in sleep are commonly observed among individuals with cognitive decline, suggesting potential shared mechanisms and interrelated disease pathways. However, the relationship between sleep disorders and cognitive impairment remains insufficiently understood. This project aims to investigate the association between sleep disorders and cognitive impairment and to clarify the interactions between sleep and cognitive function. Using real-world clinical data, we will analyze patterns of sleep disorders across individuals with varying cognitive and mental health profiles. Advanced analytical approaches will be applied to identify high-risk phenotypes of cognitive impairment associated with sleep disturbances. Multimodal data fusion methods will further integrate heterogeneous clinical information to capture complex relationships among these conditions. The findings may improve risk stratification and inform early detection and intervention strategies for cognitive decline.
Yuting Gu
The Longitudinal Changes in Psychiatric Symptoms in Alzheimer’s Disease Population
Abstract: Neurodegenerative disorders such as Alzheimer’s disease (AD) involve progressive structural and functional brain changes. Cognitive reserve refers to the brain’s ability to maintain function despite aging, damage, or disease by using alternative neural pathways or strategies. Functional brain network organization may reflect cognitive reserve and help explain variability in cognitive decline. This study examines whether functional brain network segregation moderates the relationship between gray matter integrity and cognitive performance in older adults. Using neuroimaging and cognitive data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Human Connectome Project in Aging (HCP-A), we evaluate associations between cognitive outcomes and the interaction between gray matter volume ratio and global network segregation. Outcomes include memory, executive function, language, visuospatial ability, and MMSE. Linear regression and mixed-effects models adjust for age, sex, and education. We hypothesize higher network segregation strengthens the association between brain structure and cognitive performance across normal aging, mild cognitive impairment, and AD populations.
Weiqi Liang
Separating Treatment Effects From Biological Aging In Lipid Trajectories: A Longitudinal Mixed-Effects Analysis Of The Framingham Heart Study Offspring Cohort
Abstract: Blood lipid biomarkers are commonly used indicators of cardiovascular risk, but their longitudinal trajectories may be influenced by medication use, which can confound aging-related changes. This study evaluates the association between lipid-lowering medication use and age-related trajectories of four lipid biomarkers (total cholesterol, low-density lipoprotein cholesterol, high-density lipoprotein cholesterol, and triglycerides) using repeated observations from the Framingham Heart Study offspring cohort. Analyses included 26,931 observations from 4,985 participants. Sex-stratified longitudinal mixed-effects models were fitted for each lipid biomarker with fixed effects for time, quadratic time, baseline age, medication use, and their interactions. Three alternative random-effects structures were compared, with the random quadratic slope model providing the best fit across biomarkers. Future analyses incorporating dose information or causal inference approaches may further clarify medication effects on lipid dynamics.
Xun Sun
Multilayer Sentiment Modeling with Large-Language-Model-Assisted Labels and Mixed-Effects Models
Abstract: Understanding public sentiment on large-scale social media platforms is increasingly critical for capturing collective psychological states, tracking societal responses to disruptive events, and informing digital health and policy interventions. Yet current sentiment analysis approaches are often limited to binary or coarse-grained classifications, overlooking the multi-dimensional and context-specific nature of online emotions. We present a multilayer sentiment modeling framework that couples large language model (LLM)-assisted labeling with mixed-effects models to quantify temporal and contextual patterns across three layers: core affect (valence, arousal, dominance), basic emotions (joy, anger, sadness, fear, disgust, surprise), and social-media signals (sarcasm, empathy, toxicity, stance). Using a large Bilibili corpus spanning diverse video categories and time periods, domain-adapted LLMs generate labels that are analyzed with mixed-effects models and hypothesis tests. We found that the GPT-based models achieved the strongest predictive performance, and that statistical modeling provides interpretable insights into emotion trends across video categories and pandemic-related periods.
Machine Learning and Dimension Reduction - HSC LL 208A
Yining Cao
Scalar-on-function Regression with Measurement Error in the Functional Regressors
Abstract: Functional data are increasingly used in biomedical and public health research, where predictors are observed as trajectories over time rather than as single measurements. In practice, these functional covariates are often contaminated by measurement error, which can bias estimation in scalar-on-function regression. We use the Simulation–Extrapolation (SIMEX) method to reduce bias from noisy functional predictors. Using minute-level physical activity curves from the NHANES accelerometer dataset, I construct a plasmode simulation framework in which outcomes are generated from a known coefficient function. This allows evaluation of estimation accuracy while preserving realistic variation in the predictors. I fit spline-based functional regression models and compare results with and without SIMEX correction under different levels of measurement error. Performance is evaluated using integrated squared error and prediction error. The results show that measurement error can substantially distort the estimated functional effect, while SIMEX improves recovery of the coefficient function and reduces bias, which illustrates the value of SIMEX for analyzing noisy functional data.
Shike Zhang
A Multimodal Fusion Algorithm Enhanced with a Cross-Modal Self-attention Mechanism for Detecting Coronary Heart Disease in Patients
Abstract: Heart disease remains a leading cause of morbidity and mortality worldwide, necessitating accurate and early diagnostic tools. This study proposes a novel multimodal fusion approach that integrates structured clinical numerical indicators with unstructured patient self-reported text. Leveraging DistilBERT for deep semantic encoding and a Transformer-based cross-modal self-attention mechanism, the model effectively captures complex interactions between heterogeneous data modalities. Experiments conducted on a publicly available cardiac dataset demonstrate that the proposed model outperforms traditional machine learning methods such as Logistic Regression, Random Forest, and XGBoost, as well as multimodal baselines using TF-IDF and BERT without attention. The best-performing model, combining DistilBERT embeddings with cross-modal attention, achieves an accuracy of 98.3% and a macro-averaged F1-score of 0.981. These results underscore the potential of attention-enhanced multimodal deep learning frameworks in advancing intelligent cardiac disease diagnosis and support further exploration into multimodal medical data fusion for clinical applications.
Survival Analysis - HSC LL 208B
Stella Koo
How Does Distance Of Covariance (DISCO) Compare To Homeostatic Dysregulation, Klemera-Doubal Biological Age, And Phenotypic Age In Its Association With Health Outcomes Across Diverse Populations?
Abstract: Several biological age metrics may capture distinct dimensions of physiological aging. Distance of Covariance (DISCO), a novel measure, quantifies how much an individual’s biomarker covariance structure differs from a reference population. We evaluated DISCO against three established metrics, Homeostatic Dysregulation (DM), Klemera–Doubal Biological Age (KDM BA), and Phenotypic Age (PhenoAge), across four cohorts (HRS, InCHIANTI, SLAS, and UKB) using cohort-specific and common biomarker panels. DISCO correlated strongly with DM (0.67–0.85) but weakly with chronological age, KDM BA, and PhenoAge, while the latter metrics were strongly intercorrelated. In random-effects meta-analysis, higher DISCO was associated with greater risk for all six health outcomes. Associations were moderate for cardiovascular disease, hypertension, and lung disease, smaller for cancer (OR 1.09–1.10), and largest for diabetes (OR 1.77–2.50). Mortality hazard ratios were positive (HR 1.41–1.56). Wide prediction intervals and high I² values indicated substantial between-cohort heterogeneity. Overall, DISCO captured a dysregulation-related signal aligned with DM but did not consistently outperform established metrics.
Chufeng Li
Prognostic Assessment of IPSS-M and IPSS-R in Myelodysplastic Syndrome Patients Undergoing Allogeneic Transplantation
Abstract: Accurate risk stratification is essential for guiding treatment decisions in patients with myelodysplastic syndromes (MDS), particularly those undergoing allogeneic hematopoietic stem cell transplantation (allo-HSCT). The Molecular International Prognostic Scoring System (IPSS-M), which incorporates genomic mutations, may improve prognostic prediction compared with the traditional Revised International Prognostic Scoring System (IPSS-R). This practicum project retrospectively analyzed two transplant cohorts to evaluate the prognostic performance of IPSS-M and IPSS-R. Survival outcomes were examined using survival analysis and competing-risk models. Additional prognostic factors were assessed using Cox proportional hazards models, and model discrimination was evaluated with Harrell’s C-index. Results suggest that IPSS-M provides improved prognostic stratification for overall survival in transplant patients, while IPSS-R remains informative for certain competing outcomes. Conditioning intensity and patient comorbidity were also important predictors of transplant outcomes.
Dang Lin
Determinants and Functional Consequences of Multimorbidity Among Middle-Aged and Older Adults in China: Evidence from the China Health and Retirement Longitudinal Study (CHARLS)
Abstract: Population aging in China has been accompanied by a rising prevalence of multimorbidity, defined as the coexistence of two or more chronic conditions. Multimorbidity is associated with disability, increased healthcare utilization, and reduced quality of life among older adults. This practicum project examines the determinants and functional consequences of multimorbidity among adults aged 45 years and older using data from the China Health and Retirement Longitudinal Study (CHARLS). The first objective is to identify socioeconomic, behavioral, and psychosocial factors associated with multimorbidity using multivariable logistic regression models. The second objective evaluates the longitudinal relationship between multimorbidity and functional decline measured by activities of daily living (ADL) and instrumental activities of daily living (IADL) using mixed-effects models. Cox proportional hazards models will also assess the association between multimorbidity burden and mortality risk.
Yanhao Shen
Limitations of Applying the Hematopoietic Cell Transplantation Comorbidity Index in Pediatric Patients Receiving Allogeneic Hematopoietic Cell Transplantation
Abstract: Pretransplant risk stratification is critical for counseling families and optimizing allogeneic hematopoietic cell transplantation (alloHCT) care. The hematopoietic cell transplantation–comorbidity index (HCT-CI) was developed in adults, and its applicability in pediatric recipients remains uncertain. I performed a single-center retrospective cohort study of 188 children, adolescents, and young adults (<22 years) who underwent alloHCT between January 2008 and October 2016. Pretransplant comorbidities were abstracted and scored using standard HCT-CI definitions and weights, and patients were categorized as HCT-CI 0, 1–2, or ≥3. The primary endpoint was overall survival (OS), defined as death from any cause. The secondary endpoint was nonrelapse mortality (NRM), defined as death without relapse of the underlying disease. In this pediatric alloHCT cohort, HCT-CI did not discriminate OS, and several index components were challenging to operationalize in children. Pediatric-specific refinement of comorbidity definitions may be necessary to improve pretransplant risk assessment.
Break (2:00pm - 2:30pm)
Session 2 (2:30pm - 3:30pm)
Cancer Research - HSC LL 109A
Lahari Koganti
Tumor-Only RNA-Seq Framework for Differential Gene Expression Analysis Without Matched Normal Controls
Abstract: In oncology research, RNA-sequencing (RNA-seq) studies typically compare tumor samples with matched normal control samples to identify differentially expressed genes. However, in clinical practice, matched normal tissue is often unavailable due to logistical or cost-related constraints, limiting use of conventional statistical approaches for differential expression analysis. We developed a tumor-only framework for identifying differentially expressed genes from RNA-seq data without matched normal controls. The workflow incorporates preprocessing, quality control, and normalization steps. Differential expression is inferred using a percentile-ranking strategy, where each gene in a new tumor case is evaluated against a tumor-only reference database using the 10th and 90th percentile thresholds, along with the gene's percentile rank relative to the database distribution. The method is implemented as a reproducible computational pipeline using Bash, R and Python for use in precision genomic laboratories. Preliminary evaluation shows the approach can identify clinically actionable genes, enabling tumor-only RNA-seq analysis in precision oncology.
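The percentile-ranking idea in this abstract can be sketched as follows. This is an illustrative Python sketch, not the project's pipeline: the function names, the linear-interpolation percentile, the rank definition (share of reference values at or below the new value), and the example numbers are all assumptions.

```python
from bisect import bisect_right


def percentile(reference, q):
    """q-th percentile (0-100) of the reference values, linear interpolation."""
    ordered = sorted(reference)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def percentile_rank(reference, value):
    """Percent of reference values that are <= value."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


def classify_gene(reference, value, low_q=10, high_q=90):
    """Flag a gene's expression in a new tumor against a tumor-only reference.

    Returns (call, percentile_rank). Values below the low_q percentile are
    called under-expressed and values above the high_q percentile
    over-expressed; everything else is within the reference range.
    """
    low = percentile(reference, low_q)
    high = percentile(reference, high_q)
    if value < low:
        call = "under-expressed"
    elif value > high:
        call = "over-expressed"
    else:
        call = "within reference range"
    return call, percentile_rank(reference, value)


# Hypothetical normalized expression values for one gene across 100 reference tumors.
reference_expression = list(range(1, 101))
call, rank = classify_gene(reference_expression, 95)
```

Reporting the percentile rank alongside the threshold call shows how extreme a value is, not only which side of the cutoff it falls on.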
Zihan Lin
Multi-Omics Analysis Reveals Histone Modification–Associated Tumor Microenvironment Regulation in Intrahepatic Cholangiocarcinoma
Abstract: Intrahepatic cholangiocarcinoma (ICC) is an aggressive liver malignancy with increasing incidence and poor prognosis. The tumor microenvironment strongly influences tumor progression, while the epigenetic mechanisms regulating this process have not been completely understood. This study performed a multi-omics analysis to explore histone modification–associated regulatory patterns in ICC. Publicly available single-cell RNA sequencing data (GSE138709) and proteomics data (PDC000356) were analyzed using a computational pipeline in R. Single-cell analysis revealed substantial cellular heterogeneity in the ICC microenvironment and identified epithelial cells as important regulators of immune interactions. Proteomic and survival analyses highlighted KPNA4 as a key prognostic factor. Results of integrative epigenomic analysis further suggested that enhancer-associated histone modifications may regulate KPNA4 expression. These findings provide us with new insights into the epigenetic regulation of the ICC tumor microenvironment and suggest potential targets for future therapeutic investigation.
Linshan Xie
Dietary Fiber and Grain Intake and Risk of Colorectal Cancer: A Pooled Analysis of Prospective Cohort Studies
Abstract: Colorectal cancer (CRC) is the third most common cancer and the second leading cause of cancer-related deaths worldwide. Dietary fiber and grain intake have been suggested to be protective against colorectal cancer; however, evidence regarding specific fiber sources, grain types and colorectal cancer subtypes remains limited. This study investigates the associations of total and source-specific dietary fiber as well as whole and refined grain intake with colorectal cancer risk. A pooled analysis is conducted using data from 2,241,549 participants and 33,257 incident colorectal cancer cases across 27 cohort studies in the Pooling Project of Prospective Studies of Diet and Cancer (DCPP). Cox proportional hazards models, adjusted for potential confounders, are used to estimate study-specific hazard ratios (HRs) and 95% confidence intervals (CIs), which are then combined using random-effects meta-analysis. In age-adjusted models, higher fiber and grain intake are associated with lower colorectal cancer risk; however, the associations are no longer statistically significant after multivariable adjustment. Interaction analyses are also conducted to evaluate potential effect modification.
Data Analysis - HSC LL 210
Hanrui Li
Statistical Genetics Analysis Of Lifetime Cannabis Use And CD4+ T Cell Dynamics In PC-Stabilized Epigenetic Aging
Abstract: Epigenetic clocks capture the divergence between chronological and biological age. This study investigates the impact of Lifetime Cannabis Use and immune integrity on biological age acceleration, accounting for the influence of leukocyte proportions on the DNA methylation (DNAm) signal. DNAm data (450k) was preprocessed using Noob normalization for scale consistency. Biological age was estimated via the PC-stabilized PhenoAge algorithm. To control for confounding, Population Stratification and Batch Effects were addressed by adjusting for physical Slide/Plate IDs and implementing Cellular Stratification (Houseman 2012) for CD4T, CD8T, NK, Bcell, and Mono proportions. The adjusted model revealed that Cannabis use significantly accelerates biological aging by ~3.27 years (P < 0.05). CD4+ T cell proportions emerged as the strongest predictor of age deceleration (P < 0.001). Non-significant Slide/Plate factors (P > 0.05) confirmed that Noob preprocessing effectively neutralized experimental noise. Lifestyle behaviors are biologically embedded through immune remodeling. CD4+ T cell dynamics serve as critical indicators of aging health, independent of ancestral or technical variation.
Gokul Pareek
Data-Driven Optimization of Community-Based Cancer Screening Reporting Using Scalable Analytics
Abstract: Community-based cancer screening programs are essential for reducing disparities in early detection among immigrant populations, yet operational data are often not structured for systematic reporting and evaluation. This practicum project developed scalable data-driven reporting tools for the Immigrant Health and Cancer Disparities Service at Memorial Sloan Kettering Cancer Center. Using patient-level REDCap data, reproducible pipelines were built in SQL and R to generate analytic datasets. Descriptive and stratified analyses quantified screening participation and demographic variation. These analyses were operationalized through integrated Tableau dashboards supporting real-time cancer screening reporting and program monitoring. Automated REDCap API integration and weekly refresh cycles enabled continuous synchronization between operational data and dashboard outputs. As a secondary component, standardized CONSORT diagrams modeled patient flow and identified attrition points across screening pathways. This work demonstrates the application of biostatistical methods and scalable analytic infrastructure to strengthen cancer screening reporting in community programs.
Congyu Yang
Exploratory Analysis Of Image-Derived Features In Automated iPSC Culture To Identify Indicators Of Colony Expansion
Abstract: Induced pluripotent stem cells (iPSCs) are an important component of emerging personalized regenerative therapies. In automated cell manufacturing workflows, recently reprogrammed iPSCs must reach a stable state before being transferred for further expansion, yet identifying early indicators of successful colony expansion remains challenging. This practicum project analyzes image-derived features from automated iPSC culture experiments conducted at Cellino Biotech. Images collected during laboratory cell culture experiments were processed to extract features such as cluster counts, colony morphology, and confluence measurements. Using SQL-based data extraction pipelines and Python-based analysis tools, these features were cleaned and organized for analysis across experiments, donors, and timepoints. Exploratory analyses identified trends in cluster survival, phases in colony growth dynamics, and potential relationships between seeding density and reprogramming efficiency. These findings demonstrate how systematic analysis of image-derived features can support deeper investigation of automated iPSC culture systems.
Yixin Zheng
Functional Connectivity and Apathy in Alzheimer’s Disease: Mediation and Interaction Effects of Amyloid Burden
Abstract: This practicum project investigates whether large-scale brain functional connectivity (FC) mediates or modifies the relationship between amyloid burden and apathy using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The study tested three hypotheses: (H1) FC mediates the association between amyloid burden and apathy; (H2) amyloid burden modifies the association between FC and apathy through interaction effects; and (H3) these pathways extend to other neuropsychiatric symptom domains. Resting-state fMRI measures of FC from the Default Mode Network, Frontoparietal Network, and Salience Network were analyzed using baseline and longitudinal mixed-effects models. Results showed no evidence supporting mediation (H1). However, interaction analyses (H2) indicated that the relationship between FC and apathy differed by amyloid status, particularly in the salience and default mode networks. No consistent mediation effects were observed for other neuropsychiatric symptoms (H3).
Epidemiology - HSC LL 202
Maya Krishnamoorthy
Sandwich Generation Caregiving and Cognitive Function in Midlife: Evidence from the National Longitudinal Study of Adolescent to Adult Health
Abstract: Midlife adults increasingly care for both aging parents and children, yet little is known about dual (“sandwiched”) caregivers and their cognition. Using Wave VI of the National Longitudinal Study of Adolescent to Adult Health, we examined associations between caregiving duties and cognition among adults aged 39–51. Analyses were conducted for two samples: an overall sample using TestMyBrain tests to measure cognition (n=9,777), and a subsample who completed additional tests from which a general cognition score was derived (n=2,524). Survey-weighted linear regression models adjusted for age, sex, race/ethnicity, and education. Additional models for mediation of health behaviors and depression were run. Across samples, caregiving was associated with better cognitive performance relative to non-caregiving, and sandwiched caregivers showed the strongest advantages. These findings support the Healthy Caregiver Model, suggesting that intergenerational caregiving in midlife may confer cognitive benefits. Further research is needed to assess whether these patterns persist with aging.
Sean Sorek
Resistance Calibration for Landscape Connectivity: Using Bayesian Optimization and Point Processes to Build Data-Driven Functional Connectivity Surfaces
Abstract: Functional connectivity models are increasingly being used to predict wildlife distributions and vector-borne disease risk, yet resistance surfaces are typically assigned by expert opinion with no quantifiable uncertainty, or by selection functions that do not truly optimize connectivity. We present a framework that calibrates resistance surfaces by optimizing downstream connectivity performance against species occurrence data. A log-linear resistance surface feeds Omniscape to produce cumulative current flow, which is used as a feature in a negative binomial GAM. Because Omniscape does not calculate gradients, resistance parameters are tuned via Bayesian optimization using Thompson sampling. Applied to 156,495 GPS readings from 57 white-tailed deer (WTD) on Staten Island, New York, the framework outperforms a literature baseline (holdout AUC = 0.922 vs 0.859). Optimized resistance parameters are consistent with ecological expectations for WTD, while the approach generalizes to any species with occurrence data and environmental rasters.
Yi Su
Physical Activity, Sedentary Behavior, And Risk Of Insulin Resistance And Incident Diabetes In Chinese Adults: A Longitudinal Cohort Study
Abstract: This practicum investigated the relationship between moderate-to-vigorous physical activity (MVPA), sedentary behavior, and the risk of insulin resistance and incident diabetes using longitudinal data from the China Health and Nutrition Survey. The analysis included over 10,000 participants followed across multiple survey waves. Cox proportional hazards models were applied to estimate the associations between activity levels and disease incidence, and mediation analysis was conducted to evaluate the roles of body mass index and insulin resistance. Results indicated that higher levels of MVPA were associated with a lower risk of insulin resistance and diabetes, whereas greater sedentary behavior increased insulin resistance risk but showed no significant association with diabetes incidence. The findings suggest that increasing physical activity and reducing sedentary time may play an important role in preventing metabolic disorders.
Functional Data Analysis - HSC LL 204
Fengwei Lei
Reliability of Graphical Metrics from Resting-State fMRI in Aging
Abstract: Graph-based summaries of brain functional connectivity (FC), including centrality- and modularity-related measures, are increasingly used to characterize brain network organization and individual differences. Yet, the reliability of these ROI-level network properties remains insufficiently characterized, limiting their interpretability and potential use as biomarkers. Using resting-state fMRI data from the Aging Adult Brain Connectome (AABC), this study evaluates the test–retest reliability of node-level network metrics derived from FC matrices, including node strength, clustering coefficient, eigenvector centrality, and betweenness centrality. Reliability at the ROI level is quantified using intraclass correlation coefficients (ICC). To complement ICC-based estimates, the image intraclass correlation coefficient (I2C2) is also computed on whole-brain node-metric maps to examine whether reliability conclusions differ when these measures are treated as multivariate images. This analysis identifies network properties with higher reproducibility across regions and examines whether these patterns remain consistent across follow-up sessions.
Kangyu Xu
Estimating Biological Aging from DNA Methylation Using PC-Based Epigenetic Clocks
Abstract: DNA methylation (DNAm) has emerged as an important biomarker of biological aging, and epigenetic clocks have been developed to estimate biological age from genome-wide methylation profiles. This project aims to compute multiple DNAm-based epigenetic clocks and derive age acceleration metrics that quantify deviations between biological age and chronological age. DNAm array data will undergo standard quality control and preprocessing procedures, including probe filtering, sample filtering, normalization, and batch correction. Principal component (PC)–based versions of major epigenetic clocks (Horvath, Hannum, PhenoAge, GrimAge, and DunedinPACE) will be computed using the MethylCIPHER package. The analysis pipeline will generate DNAm age estimates and standardized age acceleration (AgeAccel) scores derived from regression residuals of DNAm age on chronological age. These AgeAccel metrics provide quantitative measures of biological aging across different epigenetic clock models. This project establishes a reproducible analytical framework for calculating DNAm-based aging biomarkers and contributes to the study of biological aging processes using epigenetic clock measures.
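The residual-based definition of age acceleration mentioned above can be illustrated with a minimal Python sketch. It assumes a simple least-squares fit of DNAm age on chronological age and z-score standardization; the abstract does not specify either detail, the function names are hypothetical, and the five data points are made up for illustration. It is not the MethylCIPHER workflow.

```python
from statistics import mean, pstdev

def age_acceleration(dnam_age, chron_age):
    """Residuals from a least-squares fit of DNAm age on chronological age."""
    x_bar, y_bar = mean(chron_age), mean(dnam_age)
    sxx = sum((x - x_bar) ** 2 for x in chron_age)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(chron_age, dnam_age))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Positive residual: DNAm age is higher than expected for chronological age.
    return [y - (intercept + slope * x) for x, y in zip(chron_age, dnam_age)]

def standardize(values):
    """Z-score the residuals (an assumed form of 'standardized')."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

# Made-up illustrative values, not study data.
chron_age = [40.0, 50.0, 60.0, 70.0, 80.0]
dnam_age = [43.0, 49.0, 63.0, 68.0, 82.0]

age_accel = age_acceleration(dnam_age, chron_age)
age_accel_z = standardize(age_accel)
```

By construction the residuals average to zero and are uncorrelated with chronological age, which is why they are read as aging faster or slower than expected for one's age.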
Liqi Zhou
Reliability of Functional Connectivity in the Aging Population
Abstract: Resting-state fMRI functional connectivity (FC) is widely used to characterize brain-behavior relationships, yet the reliability of FC estimates in aging populations remains poorly understood. We evaluated edge-wise test-retest reliability of FC using resting-state fMRI from the AABC/HCP Lifespan Aging dataset (N = 1390; age range 36-90 years; 793 females/597 males). Region-based time series were extracted from 379 brain regions. For each run, FC matrices were computed using Pearson correlation, yielding 71,631 unique edges. Reliability was assessed using ICC(3,1) and generalized ICC (GICC). We aimed to characterize the distribution of reliability across edges, examine agreement between ICC and GICC, and evaluate whether reliability varies by within- versus between-network edges, scan duration, and motion measures. This study is intended to provide a systematic assessment of edge-wise FC reliability in an aging sample and to inform the identification of reproducible connectivity features for future neuroimaging research.
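As a quick arithmetic check on the edge count quoted above, a symmetric connectivity matrix over 379 regions has 379 × 378 / 2 unique off-diagonal entries. A minimal Python sketch (variable names are illustrative, not taken from the study's code):

```python
from itertools import combinations
from math import comb

n_regions = 379  # number of brain regions stated in the abstract

# Unique unordered region pairs, i.e. the upper triangle of the FC matrix
# excluding the diagonal.
n_edges = comb(n_regions, 2)

# Equivalent explicit enumeration of (i, j) pairs with i < j.
edge_list = list(combinations(range(n_regions), 2))

print(n_edges)  # 71631
```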
Longitudinal Data Analysis in Mental Health - HSC LL 203
Jeong Yun Choi
Burnout Pilot Study
Abstract: The study tracked burnout among PGY1 and PGY2 psychiatry residents using quarterly MBI-HSS scores to examine whether burnout varies by PGY level, whether patterns hold longitudinally across cohorts, and whether PA is inversely associated with OE and DP. Mean subscale scores were stratified by PGY level, quarter, and academic year; bar plots and linear regression assessed within/between-year differences and PA's association with OE and DP. PGY1s showed higher OE and DP than PGY2s, particularly mid-year, while PA remained low in both groups. PGY1-Year 1 had more severe OE (p=0.0036) and DP (p=0.026) in Q2 than PGY1-Year 2, indicating cohort effects. Longitudinally, OE and DP peaked in late PGY1 before declining in PGY2, and linear regression confirmed a significant inverse relationship between PA and both OE (β=–0.80, p<0.001) and DP (β=–0.30, p<0.001). These findings suggest early psychiatry residency is a high-risk period shaped by PGY level and cohort-specific experiences; PA's consistent inverse association with burnout underscores its potential as a protective factor, and its stagnation despite improvements in OE and DP highlights the need for targeted interventions promoting PA.
Maggie Hsu
Identifying And Analyzing Longitudinal Relationships in Depressive and Suicidal Behavior Among Veterans
Abstract: This project investigates whether there are any clusters of related behaviors among veterans with depressive and suicidal ideation and how these behavioral trajectories change over time. Data from the James J. Peters VA Medical Center with clinical measures on mental health measured at baseline (1 month), 3 months, and 6 months was analyzed using a latent profile model to identify groups of related observations, where a severity-based model was determined. Using mixed effects models, the high- and low-severity classes were then compared longitudinally for depressive, suicidal, and sleep inventory responses. To test whether there were any significant differences in baseline indices by class, a chi-squared test was conducted on the baseline data between the classes. From the models, membership in the higher-severity class is significantly associated with higher depressive, suicidal ideation, and sleep disturbance ratings. The change in depression severity, ideation and sleep disturbance ratings over the different timepoints was not significantly different between the two groups, indicating that the group differences observed at baseline may be stable over at least the medium term.
Zhengkun Ou
Comorbid Pattern Analysis of Cognitive, Functional Impairment and Neuropsychiatric Symptoms Among Alzheimer's Disease: Multiple Cross-Sectional and Longitudinal Cohorts
Abstract: Alzheimer's disease involves heterogeneous co-occurrence of cognitive decline, functional impairment, and neuropsychiatric symptoms, yet these domains are typically studied in isolation. Objective: To identify latent comorbidity profiles capturing the joint patterning of cognitive, functional, and neuropsychiatric impairment in AD and characterize transitions over time. Methods: Longitudinal data from ADNI and OASIS-3 were analyzed. Fourteen binary indicators (cognitive impairment, functional impairment, and 12 NPI-Q domains) were assessed at baseline, 12, 24, and 36 months. Latent Transition Analysis identified optimal latent states, item-response probabilities, and transition dynamics via BIC. Results: A four-state model was selected, ranging from largely asymptomatic to high-burden across all domains. Intermediate states were characterized by affective/behavioral NPS or cognitive-functional impairment. Most individuals remained stable, though progressive transitions toward higher-burden states were observed. Conclusions: Four comorbidity profiles capture multidimensional AD heterogeneity beyond single-domain staging, informing targeted interventions and prognostic precision.
Jingxi Wang
Longitudinal Trajectories Of CBCL Symptoms In The ABCD Study: A Latent Transition Analysis
Abstract: This project examines longitudinal trajectories of child behavioral symptoms in the Adolescent Brain Cognitive Development (ABCD) Study using Child Behavior Checklist (CBCL) data. Eight CBCL syndrome scales were used to represent multiple domains of emotional and behavioral symptoms. Indicators were dichotomized using the clinical threshold of T ≥ 65 to identify elevated symptoms. Latent transition analysis was applied to characterize symptom profiles and examine transitions across six study waves from baseline through the five-year follow-up. Three latent classes were identified representing low, moderate, and high symptom groups. Results indicate that symptom profiles are largely stable across childhood, particularly for individuals in the low symptom class, while a subset of participants transition between adjacent symptom levels over time. Trajectory classification further showed that most children follow stable symptom patterns, with smaller proportions showing improving, worsening, or fluctuating trajectories. These findings provide insight into developmental patterns of behavioral symptoms and may help identify children at risk for persistent behavioral difficulties.
Machine Learning - HSC LL 201
Xuanyu Guo
Standardizing Clinical Trial Eligibility Criteria from Protocol with NLP and Large Language Models
Abstract: This practicum project develops and evaluates an NLP pipeline to standardize free-text eligibility criteria from clinical trial protocol PDFs into the structured style used on ClinicalTrials.gov. Using a set of protocol–registry pairs, Gemini is applied to extract inclusion and exclusion sections from PDFs, which are then aligned with the registry’s formatted criteria to build supervised training and evaluation data. A fine-tuned model trained on trials is compared against zero-shot and few-shot prompting baselines using multiple metrics, including structural completeness, coverage, hallucination rate, and numeric and negation consistency. Across the evaluated cases, the fine-tuned model achieves the best quantitative scores, while the few-shot approach also performs competitively and provides a practical, training-free option for normalizing eligibility criteria in support of ClinicalTrials.gov data entry.
Jiyoon Lee
Identification of Colorectal Cancer Risk from Longitudinal Electronic Health Records Using Machine Learning
Abstract: Early identification of individuals at elevated risk for colorectal cancer (CRC) may enable targeted screening and earlier detection. We developed a machine learning approach using longitudinal electronic health record (EHR) data to identify individuals at increased CRC risk prior to diagnosis. Using the Columbia University Irving Medical Center (CUIMC) EHR dataset, we constructed patient-level representations from multiple clinical domains, including conditions, procedures, medications, measurements, and observations. Clinical events occurring before a defined prediction date were aggregated to generate non-temporal feature representations for modeling. Models were evaluated in annual cohorts to simulate real-world deployment and compared with an age-based screening strategy selecting the same proportion of individuals as those aged ≥45. Across multiple years, the model consistently identified cases occurring before typical screening thresholds. To assess generalizability, the analysis was replicated using the All of Us Research Program dataset. These results suggest EHR-based risk models may help identify high-risk individuals earlier than standard age-based screening.
Tong Su
Patient-Journey-Based Temporal Feature Engineering and Metastasis Prediction Using the MIMIC-III Database
Abstract: Early identification of patients at risk for metastatic progression is important for improving cancer management and clinical decision-making. I developed a patient-journey–based prediction framework using the MIMIC-III clinical database. A retrospective cohort of 334 patients was constructed using eligibility criteria and temporal filters to define index and snapshot dates with adequate look-back and follow-up periods. Metastasis within 180 days after the snapshot date was defined as the primary outcome using ICD-9 secondary malignant neoplasm codes. Predictors were derived from demographics, diagnoses, procedures, medications, and laboratory testing patterns using temporal roll-up and raw feature engineering. A total of 225 candidate predictors were evaluated. An XGBoost model achieved a test ROC-AUC of 0.83. The top 30 predictors selected based on mean absolute SHAP values achieved similar performance (ROC-AUC 0.81). Key predictors included time since primary diagnosis, laboratory testing patterns, treatment exposure, and patient age. This framework provides a reproducible approach for metastasis prediction using temporally aligned electronic health record data.
Jie Zhu
Modeling Hospital-Level Treatment Utilization Using Real-World Healthcare Data
Abstract: This practicum project investigates how statistical and machine learning approaches can be applied to real-world healthcare data to model and predict hospital-level treatment utilization and potential demand constraints. Using de-identified longitudinal datasets that integrate hospital utilization, market indicators, and institutional information, the project aims to forecast short-term treatment demand and identify hospitals at risk of utilization restrictions. Linear regression models are used to characterize baseline relationships between historical utilization trends and institutional characteristics, while gradient boosting algorithms including XGBoost and LightGBM are implemented to improve prediction accuracy for hospital-level treatment volume over a three-month horizon. Similarity-based analyses are also conducted to identify hospitals with comparable utilization patterns and estimate potential demand ceilings across institutions. The project demonstrates how predictive modeling applied to real-world healthcare data can generate insights into institutional variation in treatment utilization and support data-driven healthcare decision making.
Machine Learning and Dimension Reduction - HSC LL 208A
Yulin Liu
Probabilistic Unveiling of Tumor Heterogeneity via GMM of scRNA-seq Data
Abstract: scRNA-seq provides critical insights into breast cancer microenvironments, but deciphering functional subtypes is computationally hindered by high-dimensional, zero-inflated data. We present a probabilistic framework to map latent cellular substructures within a sparse transcriptomic dataset (716 cells, 558 genes). To isolate biological variance, Principal Component Analysis (PCA) is applied for optimal dimensionality reduction. Using this latent space, a custom Expectation-Maximization (EM) algorithm is derived from scratch to fit a Multivariate Gaussian Mixture Model (GMM), yielding continuous subtype assignment probabilities. Current efforts quantify optimal distinct subpopulations and identify cluster-specific driver genes. Finally, we evaluate Gaussian assumptions on zero-inflated counts, benchmarking the PCA-GMM pipeline against non-parametric density-based clustering to rigorously validate methodological robustness.
Cheng Rao
Data Analysis And Machine Learning For Evaluating Healthcare Programmatic Advertising Performance
Abstract: Programmatic advertising has become an important strategy for healthcare and pharmaceutical companies to deliver digital marketing campaigns to targeted audiences. Evaluating advertising performance and identifying key factors influencing campaign effectiveness are essential for optimizing advertising strategies and improving operational efficiency. This practicum project analyzes programmatic healthcare advertising data collected from May 31 to August 31, 2025, consisting of 83,751 observations and performance indicators including ad format, country, impressions, clicks, fill rate, win rate, click rate, and advertiser revenue. Using the R programming language, exploratory data analysis and visualization are conducted to examine performance patterns across markets and ad formats. Statistical regression models are applied to assess relationships between operational metrics and campaign outcomes, and machine learning methods are explored to predict advertising performance and identify key drivers of effectiveness. The findings provide data-driven insights for optimizing healthcare digital advertising strategies and improving operational efficiency.
Yi Xu
Assessing Physiological Stability Using Internal And External Mahalanobis Distance Frameworks
Abstract: This project assesses multivariate physiological stability under different break interventions using Mahalanobis Distance (MD) as a composite biomarker metric. In a repeated-measures study, twelve participants completed three sitting-break conditions (B0, B30, B60), with biomarkers collected at nine timepoints per day. MD was calculated using both an internal cohort-based reference and an external NHANES reference to quantify multivariate physiological deviation. Mixed-effects models were applied to evaluate treatment effects while accounting for within-subject dependence. The results characterize differences in multivariate physiological patterns across intervention conditions and compare stability estimates derived from internal and population-based reference frameworks.
Yonghao Yu
Risk Modeling of Post-Fusion Adverse Outcomes in Patients with Spinal Deformity
Abstract: This practicum project examines postoperative adverse outcomes among patients with spinal deformities who underwent definitive fusion surgery. We will evaluate two binary outcomes, PostFusionComplication and PostFusionUPROAR, where UPROAR denotes an unexpected return to the emergency room. Predictors of interest include gender, prefusion age, prefusion BMI category, and whether the patient received scoliosis treatment before definitive fusion. Building on prior work showing associations between BMI, postoperative complications, and adverse postoperative events in a retrospective cohort of 102 patients, this study will use logistic regression and relative risk regression to quantify associations and compare effect estimates. The analysis aims to identify clinically meaningful risk factors for poor postoperative outcomes and provide an interpretable framework for perioperative risk stratification in this population.
Research in Observational Studies - HSC LL 205
Savannah Flanagan
Associations Between Built Environment Features And Physical Activity Within The Activity Spaces Of Mid-Life Adults
Abstract: Physical activity (PA) is crucial for protecting against chronic diseases that emerge in mid-life. The Everyday Environmental Exposures (E3) Study tracked the PA and location of 400 mid-life adults for 21 days to investigate how PA was associated with characteristics of each person’s activity space. This analysis explored how built environment (BE) features of the activity space, including bike paths, parks, pedestrian areas, street connectivity, and PA centers, were associated with the PA rates of E3 participants. Linear models were fit to explore these relationships and potential confounding by environmental features and their principal components. Linear splines with cutpoints at the quintiles of each BE feature were used to explore more complex relationships. Results suggest that variables related to the urbanness of an activity space confound the relationship between BE features and PA, with most adjusted relationships being negative. Adjusted splines revealed that BE features are only positively associated with PA at their highest quintile, implying that heavy investment in BE features is needed to promote PA in urban areas.
Vidit Tripathi
Quantification of the Pace of Biological Aging Among Participants in the US Health and Retirement Study
Abstract: Biological age, estimated using diverse biological and functional measures, can better indicate lifespan and mortality risk than chronological age. We developed a new measure of biological aging using the Health and Retirement Study cohort (13,358 participants aged 50 and older at baseline) and a hierarchical model capturing shared slopes across nine standardized aging biomarkers spanning blood, physical, and functional measures. Our mortality-calibrated measure correlated moderately with age and strongly with an existing pace-of-aging measure. Higher values reflected faster biological aging and lower survival for the fastest tertile, consistent with stratifying mortality risk. The measure performed as well as or better than an existing pace-of-aging measure at predicting aging-related outcomes, except for dementia, and its DNA methylation (DNAm) version generally outperformed or matched the DNAm version of the existing measure for predicting outcomes. Overall, our findings support the validity of the measure and suggest it may detect aging-related risk earlier in the life course. Further analyses with external validation are planned for future work.
Jingyi Wang
Weight Loss Effects Of Semaglutide And Tirzepatide: A Meta-Analysis Of Phase 3 Randomized Controlled Trials
Abstract: Obesity is a major public health concern worldwide and is associated with increased risks of cardiovascular disease and diabetes. Recently, incretin-based medications such as semaglutide and tirzepatide have shown substantial weight-loss effects in several Phase 3 randomized controlled trials. However, the magnitude and consistency of these effects across trials have not been fully synthesized. In this project, we conducted a meta-analysis of Phase 3 randomized controlled trials from the STEP program for semaglutide and the SURMOUNT program for tirzepatide. The primary outcome was percent change in body weight compared with placebo. Random-effects models were used to estimate pooled treatment effects. In the primary analysis including STEP-1, STEP-3, and STEP-5, semaglutide reduced body weight by an average of 11.7 percentage points compared with placebo (95% CI −13.2 to −10.3). Sensitivity analyses including STEP-4 showed similar results. Exploratory analyses of tirzepatide trials suggested even greater weight reductions. These findings support the strong efficacy of incretin-based therapies for obesity treatment.
Social Behavior Research - HSC LL 109B
Qingpeng Liu
Emergency Department Patient Education and Social Needs Screening: Early Findings from the ENGAGE Program
Abstract: The ENGAGE program at NewYork-Presbyterian/Columbia University Medical Center aims to improve health literacy among emergency department (ED) patients through brief educational interventions and screening for social determinants of health (SDOH). As part of my APEx project, I helped develop the data collection workflow and analyze program outcomes. Follow-up phone calls conducted 7–14 days after the visit evaluated retention of knowledge and potential behavior change. De-identified survey data were processed and analyzed using R to generate descriptive statistics and exploratory regression analyses examining relationships. Between July and August 2025, interns engaged 127 patients, with 73% reporting confidence completing medical forms but only 14% initially familiar with the health topics discussed. Among patients completing follow-up calls, approximately half reported improved familiarity and several indicated consideration of behavior changes. These preliminary findings demonstrate the feasibility of integrating brief health education and SDOH screening into ED workflows and the potential for ED-based interventions to support patient knowledge and connection to community resources.
Yiwen Zhang
Returning Aggregate Research Results: A Scoping Review of Practices, Preferences, and Barriers
Abstract: Returning research results to participants is increasingly seen as an ethical responsibility that may build public trust, improve transparency, and support future research participation. However, dissemination of aggregate results remains inconsistent and under-resourced in clinical research. This scoping review synthesized evidence on current practices, participant preferences, and challenges in returning aggregate results, with implications for public health, health equity, and policy. The review followed PRISMA-ScR guidelines. Four databases (PubMed, EMBASE, CINAHL, and Cochrane Library) were searched for English-language articles published through February 2025 on dissemination of aggregate research results to participants. Two reviewers independently screened studies and extracted data. Study quality was assessed using a modified Oxford Centre for Evidence-Based Medicine scale. Thematic synthesis identified patterns in dissemination methods, participant preferences, and implementation barriers.
Jicong Zhang
Evaluation of Art-Based Health Literacy Brochures
Abstract: This APEx project focused on evaluating art-based health literacy materials designed to improve public understanding of ADHD and asthma. As a biostatistics student, I contributed to the review and refinement of evaluation tools used to assess the effectiveness of these materials. I designed questionnaires for ADHD and asthma brochures based on the Patient Education Materials Assessment Tool (PEMAT) to measure factors such as understandability, actionability, and user engagement. Simulated datasets were generated and analyzed using R to demonstrate how quantitative and qualitative feedback could be evaluated using descriptive statistics and basic inferential methods. These analyses illustrate how survey-based evaluation data can inform improvements in health communication materials and support evidence-based public health practice.
Statistical Genetics - HSC LL 207
My An Huynh
Multimodal Integration of Single-cell RNA-sequencing and Spatial Transcriptomics to Recover Cell Location in Single-cell Data
Abstract: Single-cell RNA sequencing (scRNA-seq) measures gene expression at cell resolution but loses spatial context due to tissue dissociation. Spatial transcriptomics (ST) preserves spatial information in intact tissue but typically at lower resolution. I aim to integrate these complementary modalities to infer the location of individual cells to answer downstream biological questions. My two-stage framework first augments ST training data using a variational autoencoder to generate additional ST samples, then learns a mapping between gene expression and spatial location using a neural network. I used MERFISH data, which provides both gene expression and single-cell spatial coordinates, to evaluate model performance. I assessed the framework’s sensitivity to training resolution and data quantity by generating pseudo-Visium spots from MERFISH data. To test generalization across disease states, I applied the model to a dataset on Alzheimer’s mouse brains containing wild-type (WT) and AD conditions. Models were trained on pseudo-Visium spots from WT tissue sections and evaluated on held-out cells across different genotypes to assess within- and cross-genotype prediction performance.
Hyun Kim
Genetic Variant Selection Using GIT for Risk Prediction
Abstract: The effects of genetic variants may vary across a phenotype’s distribution, which may not be fully captured by standard variable selection methods in genetic studies. In this study, we evaluate GIT, a method for screening features with varying effects across quantiles of a response variable, for significant genetic variant selection stratified by population. GIT was applied to high-density lipoprotein (HDL) and height, and the screened variants were compared with variants reported in genetic databases and prior studies to assess whether GIT can prioritize established, biologically relevant variants. To evaluate prediction performance, we compared LASSO models built using variants screened by GIT and variants selected by clumping + thresholding (C+T). Genome-wide association studies (GWAS) were conducted to obtain the summary statistics required for C+T. In addition, we assessed whether incorporating GIT-screened variant information can improve prediction based on polygenic risk scores (PRS) by comparing models based on PRS derived from score files in the Polygenic Score (PGS) Catalog with models that combine those PRS with GIT-screened variant information.
Jeffrey Lin
Predicting Protein Levels From Cis- And Trans-Variants While Accounting For Environmental Confounding
Abstract: Complex traits are often influenced by genetic variants through their effects on proteins, motivating growing interest in proteome-wide association studies (PWAS). However, PWAS is often limited by the lack of proteomic measurements in large cohorts. In addition, confounding remains a concern, as environmental factors may correlate with both genotype structure and protein levels, potentially producing spurious associations. To address this, we build on the TransCisPredict framework to predict protein levels using both cis- and trans-variants while adjusting for environmental confounding. We analyzed genotype and proteomic data for 2,920 proteins from 42,454 individuals in the UK Biobank Pharma Proteomics Project. Principal Component Analysis (PCA) was applied to the protein data to capture shared correlation structure and latent variation. Variant–protein associations, together with LD block information, were then used to define prediction models for out-of-sample protein prediction. The ultimate goal is to estimate protein levels in samples lacking proteomic data and perform PWAS for a complex trait of interest.
Yunjia Liu
Evolutionary Constraint–Informed Modeling of Rare Variant Effects in Autism Spectrum Disorder
Abstract: Autism spectrum disorder (ASD) is a neurodevelopmental disorder that affects social communication and behavior. Rare genetic variants, including both de novo mutations and inherited rare variants, play an important role in shaping the genetic architecture of ASD. Given the early onset of ASD and the strong selective pressures acting on deleterious variants, we investigate whether evolutionary constraint can provide a more biologically informed framework for modeling rare variant effects. In this study, we analyze de novo loss-of-function mutations using a Bayesian hierarchical model in trio families and inherited rare variants via Burden Heritability Regression (BHR) in case–control cohorts. Rather than standard gene-level aggregation, we incorporate evolutionary constraint metrics, s_het (from GeneBayes) for loss-of-function variants and the MisFit score for missense variants, to stratify variants according to their levels of evolutionary constraint. Our results show that incorporating selection-based stratification improves the ability to distinguish effect sizes and reveals systematic differences in effect sizes across constraint levels.
Survey Statistics - HSC LL 209A
Zhenkun Fang
Enhancing Resident Self-Efficacy in Pediatric Mental Health: A Preliminary Outcome Evaluation of a Multimodal Curriculum
Abstract: Background: Pediatric mental health (MH) concerns are rising, yet primary care resources and trainee confidence remain low. Objectives: To increase residents’ self-efficacy (SE) in MH screening, assessment, and management via a tiered, interdisciplinary curriculum. Methods: Using the Kern Model, we implemented a longitudinal curriculum featuring three modalities: didactics (pocket guides), service learning (community partnerships), and experiential learning (simulations). Outcomes were evaluated via pre- and post-intervention surveys (N=18) following 15 months of engagement. Results: Baseline assessments (N=47) showed 43.6% of residents felt "ineffective" in MH care. Post-intervention Wilcoxon signed-rank tests demonstrated a statistically significant increase in SE (Median 3.56 to 3.94; p=0.010) with a large effect size. While specific clinical skills improved significantly, there was no significant change in the frequency of self-reported MH care practices. Conclusions: This multimodal curriculum is feasible and effectively enhances resident confidence. Future research will utilize patient chart reviews to evaluate the impact on clinical practice.
Chunlei Li
Tackling Campus Nutrition Inequity Amidst Urban Food Deserts
Abstract: This practicum project investigates nutrition inequity among students at Columbia University Medical Center (CUMC) and examines the behavioral, economic, and environmental factors influencing dietary quality. A structured survey was conducted among students to assess food access, affordability, cooking behaviors, cultural food preferences, and nutrition awareness. Descriptive statistics were used to summarize dietary behaviors, while logistic and linear regression models were applied to evaluate associations between food access, affordability, cooking skills, and dietary quality indicators such as fruit and vegetable intake and processed food consumption. Results indicated that many students experienced financial barriers to accessing nutritious foods. Despite high levels of nutrition awareness and cooking skills, structural barriers such as affordability and food availability remained significant predictors of poorer dietary behaviors. These findings highlight the importance of addressing structural determinants of nutrition inequity in campus environments.
Tingyu Qian
HIV in Tanzania (2016–2022): Investigating National Prevalence Shifts and the Determinants of Testing Uptake
Abstract: This study uses the 2016 and 2022 PHIA datasets to examine HIV prevalence trends in Tanzania and the socio-demographic factors that influence HIV testing. Survey-weighted descriptive analysis and multivariable logistic regression were used to describe changes in prevalence and identify independent predictors of HIV testing. The weighted national HIV prevalence in Tanzania decreased slightly from 4.88% in 2016 to 4.46% in 2022, a difference that was not statistically significant (P-value = 0.073). Survey-weighted multivariable logistic regression using 2022 data showed that education was the strongest predictor: people with higher levels of education had 3.71 times the odds of being tested for HIV compared with those with lower levels of education (adjusted OR: 3.71, P-value <0.01). Individuals who disagreed with the statement “I would be ashamed if someone in my family had HIV” were more likely to take an HIV test than those who agreed with the statement (adjusted OR: 1.33, P-value <0.01). Interventions should prioritize male participation, widespread education, and community-led efforts to eliminate discrimination in order to achieve universal testing in Tanzania.
Mari Sanders
Analyzing the Impact of Professional Support Networks on Job Satisfaction and Intent to Leave Among Home-Based Primary Care Nurse Practitioners
Abstract: Burnout is characterized by exhaustion and reduced work performance. Nurse practitioners in home-based primary care for patients with dementia face stressors that may influence job satisfaction and intent to leave their jobs. Data were collected from nurse practitioners in the northeast. Team connectivity was measured through social network analysis and ego-centric networks. Dimensionality reduction via principal component analysis was applied to support features, identifying a primary component (PC1), and k-means clustering classified participants into three support groups. Job satisfaction was modeled by ordinal regression, and generalized linear regression modeled binary outcomes. The first principal component explained 43.61% of variance in team support. Likelihood ratio tests showed that complex models with demographics and network measures did not significantly outperform a simplified model using PC1. Support clusters were not significantly associated with intent to leave or job satisfaction. Findings suggest that professional advice-sharing is a core element of teams, but further research must be done to identify factors that influence burnout.
Survival Analysis - HSC LL 208B
Riyadh Baksh
Quantifying Lifetime Risk of Lyme Disease: A Population-Based Modelling Study in the United States
Abstract: Lyme disease is the most commonly reported vector-borne disease in the United States, but annual incidence estimates do not fully capture its long-term population burden. This study estimated the lifetime risk of Lyme disease across regions with differing levels of endemicity using a population-based modeling approach. Age-specific Lyme disease incidence rates from U.S. administrative claims data were combined with all-cause mortality data from the CDC. A multiple-decrement life table model simulated a cohort of individuals followed from birth through age 80–84 years, accounting for the competing risks of Lyme disease and death. Lifetime risk increased with age across all regions but varied substantially by geography. By age 80–84 years, the estimated lifetime risk was 13.7% in high-incidence states, compared with 2.0% in neighboring states and 1.3% in low-incidence states. These findings demonstrate the substantial long-term burden of Lyme disease in endemic regions and highlight the pronounced geographic heterogeneity of risk. A lifetime risk provides an intuitive summary measure that may support public health communication, prevention planning, and resource allocation.
Jianing Chen
Breast Cancer Survival Analysis and Patient-Level Response Visualization
Abstract: This practicum project focused on developing a reproducible workflow for breast cancer survival analysis, reporting, and interactive visualization using SAS, R, and Shiny. Using the METABRIC dataset, I imported and cleaned raw clinical data, constructed an analysis cohort restricted to patients receiving either mastectomy or breast-conserving surgery, and generated structured outputs including a baseline demographic table, survival summaries, Kaplan–Meier plots, and a Cox-model-based hazard ratio analysis in the ER-positive subgroup. In addition, I developed interactive Shiny components to improve the presentation and accessibility of analysis results through dynamic tables, plots, and downloadable outputs. This project strengthened my skills in clinical data preparation, time-to-event analysis, reproducible programming, and communication of statistical findings. More broadly, it demonstrated how analytical workflows can translate complex oncology data into interpretable evidence for research reporting and decision support.
Kai Tan
Automated Survival Analysis Pipeline for Time-To-Event Clinical Trial Data
Abstract: Survival analysis is widely used in clinical trials to evaluate treatment efficacy. However, manually generating reproducible survival analysis results from raw clinical datasets can be time-consuming and prone to inconsistencies. A Python-based automated analysis pipeline was developed to standardize pre-analysis exploratory data analysis, data cleaning, and survival analysis procedures. The automated pipeline was tested using an ADTTE clinical trial dataset. Kaplan–Meier (KM) curves showed consistently higher survival probabilities for patients in the treatment group compared with those in the control group. However, the log-rank test did not show a statistically significant difference between the two groups (test statistic = 2.48, p = 0.116). Overall, this project shows how automated statistical programming pipelines can improve the programming efficiency, reproducibility, and interpretability of survival analyses in clinical trial data analyses. Such standardized analytical workflows can support data-driven decision making in early phase clinical trials and enable generation of survival analysis outputs with consistent format for regulatory and research reporting.
Xiaoni Xu
Causal Mediation Analysis With Longitudinal Mediators and Competing Risks: Implementation and Validation in the CMAverse R Package
Abstract: This practicum project focuses on the development and implementation of a new R command within the CMAverse package designed to automate causal mediation analysis for longitudinal data, specifically facilitating the application of the med_longitudinal framework to complex problems involving time-varying and time-to-event mediators in the presence of competing risks. The primary aim of this project is to develop, validate, and maintain a user-friendly function that automates this process using multistate models to estimate path-specific effects. This tool enables the causal interpretation of indirect effects when mediators are measured at arbitrary time points or treated as time-to-event processes. Secondary aims include validating estimator performance through simulations, applying the command to simulated datasets, and publishing a comprehensive manual and tutorial on GitHub to support public use. Ultimately, this project provides a robust biostatistical tool for investigating complex mediation pathways in survival-related data, supporting researchers who study time-dependent health processes.
Break (3:30pm - 3:45pm)
Session 3 (3:45pm - 4:45pm)
Clinical Trials - HSC LL 109B
Wenjie Wu
A Bidirectional Crosswalk Between COWS and SOWS
Abstract: Withdrawal severity is a key outcome in opioid use disorder (OUD) research, but trials frequently measure withdrawal using different instruments. Two widely used scales are the Clinical Opiate Withdrawal Scale (COWS) and the Subjective Opiate Withdrawal Scale (SOWS), creating challenges for harmonizing data across studies. We develop an interpretable, bidirectional crosswalk to predict COWS from SOWS and SOWS from COWS using paired measurements collected within a short time window during an inpatient phase of a clinical trial. Our approach fits linear models in each direction, incorporating baseline and treatment-related covariates, and allows effect modification through selected interactions. To obtain a parsimonious set of predictors and interactions, we use patient-wise cross-validated lasso screening followed by weighted post-lasso refitting, with weights and cross-validation folds defined at the patient level to account for within-patient correlation and unequal numbers of observations. This crosswalk enables harmonization of withdrawal severity as a continuous measure across trials and improves comparability of treatment effects.
Puyuan Zhang
Statistical Evaluation of Treatment Efficacy in a Phase III Randomized Clinical Trial of Jiao Qi She Gel Patch for Knee Osteoarthritis
Abstract: Knee osteoarthritis is a common chronic condition associated with persistent pain and functional limitations. This practicum project evaluates treatment efficacy in a multicenter, randomized, double-blind, placebo-controlled Phase III clinical trial investigating Jiao Qi She Gel Patch for knee osteoarthritis. The primary endpoint was the change from baseline in knee pain measured by the Visual Analogue Scale (VAS) after two weeks of treatment. Treatment effects were estimated using an ANCOVA model with treatment group as a fixed effect and baseline VAS as a covariate. Missing data were addressed using multiple imputation under a Jump-to-Reference assumption. Secondary responder analyses summarized treatment effects using risk differences and corresponding 95% confidence intervals calculated with the Newcombe method. Dry-run analyses confirmed consistent implementation of the statistical analysis plan and produced reproducible outputs for key efficacy tables and figures. This project demonstrates practical statistical methods used in Phase III clinical trials and highlights considerations when implementing SAP-defined analyses in clinical research.
Data Analysis - HSC LL 205
Zhehao (Malcolm) Chen
Diabetes Risk Prediction Using Machine Learning On NHANES 2009–2019 Data
Abstract: Using NHANES 2009–2019 data, this project developed supervised machine learning models to predict diabetes status in a nationally representative observational dataset. Predictors included demographics/social determinants, laboratory measures, medical conditions, and lifestyle factors. Data were cleaned and harmonized with SQL (BigQuery) and modeled in Python using a reproducible pipeline: standardizing missing values, dropping variables with high missingness, median/mode imputation with missingness indicators, and encoding categorical predictors. We trained and evaluated logistic regression and a gradient boosting decision tree model using a train/test split and classification metrics. Both models achieved high overall accuracy (~0.92), while sensitivity for diabetes was lower, underscoring the impact of class imbalance and the need to emphasize recall-oriented evaluation. Tree-based feature importance was used to identify key predictors and support model simplification. This work demonstrates an end-to-end risk prediction workflow on large-scale public health survey data and motivates future improvements via imbalance-aware training and calibration.
Chuyuan Xu
Principal Component Analysis on Co-occurring Climate Threats across ZCTAs in New York City from 2000 to 2016
Abstract: Climate constitutes an ongoing and enduring threat to global population health. Prior research has documented that climate change intensifies existing diseases and leads to the emergence of new health threats. In the United States, the public health system can be affected by disturbances in ecosystems that frequently coincide with major climate events, including snowstorms, wildfires, and droughts. The simultaneous occurrence of extreme climate events complicates the analysis of their correlation with human health outcomes. This project aims to characterize spatio-temporal patterns in the co-occurrence of climate threats across the US. We coupled high spatiotemporal-resolution estimates of climate-relevant exposures and developed exposure indices for multiple co-occurring climate threats in NYC, including droughts, meteorological statistics, wildfires, PM2.5, and cyclones, using principal component analysis.
Xiner Zhu
Adolescent and Young Adult Mental Health In Afghanistan
Abstract: Adolescents living in conflict-affected regions face substantial mental health challenges due to chronic exposure to violence, instability, and socioeconomic hardship. Afghanistan is one of the highest-risk environments for youth mental health, but comprehensive evidence remains limited. This study examines the prevalence of mental health conditions and their associations with environmental risk, socioeconomic factors, and past traumatic experiences among Afghan youth using cross-sectional survey data. The analysis included 1,103 participants, consisting of 518 adolescents aged 15–19 and 585 young adults aged 20–24. Mental health outcomes such as MDD, GAD, PTSD, substance use risk, suicidal behaviors, psychosis, and manic episodes were assessed, and multivariate logistic regression models were used to evaluate associations between these outcomes and key sociodemographic and contextual factors. By providing nationally representative evidence across a broad range of psychiatric outcomes, this study aims to contribute to the limited literature on youth mental health in Afghanistan and highlights the mental health consequences of prolonged conflict for young populations.
Data Analysis and Visualizations - HSC LL 210
Jiayi Ge
Dietary Patterns And Urinary Metal Exposure In The Multi-Ethnic Study Of Atherosclerosis (MESA)
Abstract: Diet is an important determinant of environmental exposure through food sources. This practicum examines the association between dietary patterns and urinary metal exposure in participants from Exam 1 of the Multi-Ethnic Study of Atherosclerosis (MESA). The exposure of interest is dietary protein composition, characterized by the relative contribution of animal versus plant protein intake. Outcomes include specific gravity-corrected urinary arsenic, copper, zinc, and selenium concentrations. Multivariable linear regression models are used to evaluate associations with adjustment for demographic and clinical covariates. This project aims to clarify whether dietary protein sources are differentially related to biomarkers of metal exposure in a multiethnic cohort.
Mufan Liu
Screen-Media Use Subtypes in Youth and Their Associations with Cognition and Brain Connectivity
Abstract: This practicum project examined whether distinct patterns of digital media use can be identified and whether these patterns are associated with cognitive functioning and brain connectivity. Data from 624 participants aged 8–21 years from the Healthy Brain and Child Development (HCPD) study were analyzed. Media-use indicators included weekday and weekend engagement across multiple domains such as video games, messaging, social networking, streaming, television, and exposure to mature content. Three unsupervised learning approaches (K-means clustering as the primary method, hierarchical clustering, and model-based clustering) were applied to identify behavioral profiles. A four-cluster K-means solution characterized groups with low engagement, social/streaming use, broad high use, and gaming-focused behavior. Cluster membership was then evaluated using multivariable linear regression models predicting four cognitive outcomes (fluid, crystallized, early childhood, and total cognition) and brain connectivity measures, adjusting for demographic covariates.
Xikun Wang
Agentic AI System For Automated Clinical Data Updates
Abstract: This project develops an agentic AI system designed to automatically interpret user instructions and update structured clinical data. In healthcare settings, managing structured records often requires manual editing, which is time-consuming and prone to error. The goal of this project is to build an AI-driven workflow that reads natural language instructions, determines the required modifications, and applies updates to a predefined clinical data schema. The system uses a large language model as the reasoning component and a validation layer to ensure all updates follow predefined rules. Given a set of instructions, the agent generates structured outputs to modify, add, or reorganize clinical variables while maintaining data consistency. The updated results are then displayed through an interactive interface, allowing users to review the changes in real time. The project aims to demonstrate how agentic AI can support efficient clinical data management and reduce manual workload while maintaining accuracy, transparency, and reproducibility in healthcare data processing.
Yifei Yu
Life Expectancy Disparities In The Philippines Using Bayesian Spatiotemporal Modeling
Abstract: This project looks at differences in life expectancy across municipalities in the Philippines using annual mortality and population data from 2006 to 2023. Since some municipalities have small populations and unstable mortality rates, Bayesian spatiotemporal modeling is used to make the estimates more reliable by considering both geographic and time patterns. The model is fitted in R using INLA, and age-specific mortality estimates are used to build life tables and calculate life expectancy at birth by municipality, age group, sex, and year. The analysis also considers some important time periods such as presidential election years and the COVID-19 pandemic to better understand changes in mortality over time. This project helps show geographic differences in life expectancy and better understand how these patterns vary across the Philippines.
Deep Learning - HSC LL 109A
Joe LaRocca
Toward Metadata-Free Style Transfer: Hierarchical Clustering For Rare Neurological Disease Phenotyping And Protein Localization
Abstract: Rare neurological diseases often have an unknown genetic basis. Computer vision models can uncover latent, disease-associated phenotypes in single-cell microscopy images but are confounded by batch effects; advances such as Interventional Style Transfer (IST) have used imputation to mitigate them. However, current IST formulations require metadata about observational environments which are often biased or unavailable. Alternative specification of IST through hierarchical clustering may improve its effectiveness on out-of-distribution (OOD) data. Features were generated using the self-supervised DINO-v2 model on two datasets (GRID and HPA). We generated clusters and compared cluster-batch and cluster-class overlap before and after using Symphony, a post-hoc dimension reduction procedure, on both datasets. Symphony improved cluster-class association on HPA but reduced association on GRID; however, cluster-class association remained stronger post-Symphony on GRID than on HPA. Cluster-batch association was noticeable pre-Symphony but negligible post-Symphony on both datasets. Future work includes testing other clustering procedures and running IST using clusters instead of metadata.
Xueting Li
Brain Age Prediction from ROI-Based Structural MRI: A Comparison of CNN and BrainAgeR
Abstract: Estimating brain age from structural MRI is widely used to study brain aging. In this study, we compare two approaches for predicting chronological age using ROI-based MRI features. One approach uses a convolutional neural network (CNN) with hyperparameter tuning, while the other uses a traditional regression pipeline implemented through the brainageR framework. Model performance is evaluated using standard metrics, including mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²). We also examine the brain age gap (BAG), defined as the difference between predicted age and chronological age, which is often interpreted as reflecting accelerated or delayed brain aging. Finally, we discuss whether highly accurate models should be viewed purely as predictive tools or whether they point to a meaningful biological concept of brain age.
Zitao Zhang
Deep Generative Knockoffs For Feature Selection In High-Dimensional Microbiome Data
Abstract: This project develops a computational framework for feature selection in high-dimensional microbiome data using deep generative models and knockoff-based inference. Microbiome datasets often contain a large number of correlated microbial features, making it challenging to identify meaningful signals while controlling false discoveries. To address this problem, we implement a variational autoencoder (VAE) to generate knockoff variables that preserve the dependence structure of the original microbial features. These synthetic knockoffs are incorporated into a knockoff-based statistical inference framework to perform feature selection with false discovery rate (FDR) control. The project focuses on developing and implementing the computational pipeline, evaluating the quality of the generated knockoffs, and assessing the performance of the method through simulation studies and microbiome datasets. This framework aims to provide a scalable and statistically principled approach for identifying important microbial features in high-dimensional biological data.
Epidemiology - HSC LL 209A
Wen Li
Predicting Severe Hypertriglyceridemia Using A Triglyceride Polygenic Risk Score
Abstract: This project evaluates the predictive utility of a triglyceride polygenic risk score (TG-PRS) for severe hypertriglyceridemia (shTG). Severe hypertriglyceridemia is an important risk factor for pancreatitis and cardiovascular disease, and early identification of high-risk individuals may improve prevention strategies. Using genetic and clinical data, this study examines whether incorporating TG-PRS into prediction models improves risk stratification beyond traditional clinical variables. Polygenic risk scores will be calculated using genome-wide association study summary statistics, and regression models will be used to assess the association between TG-PRS and triglyceride phenotypes. Model performance will be evaluated using ROC curves and the area under the curve (AUC). The findings may help evaluate the potential value of integrating genetic risk prediction into public health screening strategies.
Wenyu Lu
Strengthening Data Infrastructure And Strategic Decision-Making For A Community-Based Aging In Place Program
Abstract: This practicum focused on strengthening the data infrastructure and strategic capacity of a community-based Aging in Place program serving older adults in Northern Manhattan. The project involved cleaning and restructuring an existing Airtable client database, redesigning intake forms to improve data quality and completeness, and developing analytic tools to support program monitoring and reporting. Using descriptive analysis and geographic gap mapping, I identified service utilization patterns and areas of unmet need. In addition, I designed a satisfaction-by-need matrix to help prioritize resource allocation and outreach efforts. The enhanced system improves data accuracy, reporting efficiency, and the program's ability to communicate impact to funders and stakeholders. This project demonstrates how applied biostatistical and data management skills can strengthen community health programs and support evidence-informed decision-making in aging services.
Jiayi Wang
Mental Health Symptoms As Predictors Of Arrest Type Among Justice-Involved Young Adults
Abstract: Introduction: Justice-involved young adults experience high levels of psychiatric distress, yet the role of internalizing symptoms in predicting future offense type remains unclear. This study examined whether baseline internalizing symptoms predicted arrest outcomes at 12-month follow-up among participants enrolled in a diversion program. Methods: Data were drawn from a randomized intervention trial of justice-involved young adults. Multinomial logistic regression models assessed whether somatization, depression, anxiety, and global psychological distress predicted violent or nonviolent arrest, adjusting for prior violent arrest and demographic factors. Results: Internalizing symptoms were not associated with nonviolent arrest. However, higher baseline somatization, depression, and global severity were associated with increased odds of violent arrest, while anxiety showed a similar but nonsignificant trend. Conclusion: Internalizing psychiatric distress may represent an early marker of risk for violent justice involvement and may help inform screening and prevention strategies in diversion programs.
Chenghao Yan
COVID-19 and RSV
Abstract: This practicum project involved statistical analysis of epidemiological datasets related to COVID-19 and respiratory syncytial virus (RSV) in China. For the COVID-19 component, de-identified case data from Nanjing Medical University Affiliated Hospital were analyzed to describe demographic characteristics, comorbidities, and hospitalization patterns during a surge in 2025. In addition, a separate dataset of over 2,000 RSV records from Tianjin was organized and analyzed to explore infection patterns among adults.
Internship Projects - HSC LL 208A
Cameron Chesbrough
Evaluating Clinical Performance With A Power BI Dashboard
Abstract: The Division of Healthcare Management and Occupational Safety and Health (DHMOSH), a branch of the United Nations, is responsible for overseeing UN administered field clinics across multiple continents, countries, and regions. DHMOSH evaluates clinic performance and provides technical assistance. Clinics send quarterly reports detailing the programs they are running and their requests for assistance, as well as data concerning number of patients seen, incidence of diseases, and other operational metrics. The aims of the project are twofold: create a unified dataset from the collection of quarterly reports and create a dashboard that can effectively communicate the relevant information for the DHMOSH team. Six primary groupings of data were established from the reports and information from over 50 clinics was aggregated. The dashboard was constructed using Power BI to visualize continuous, categorical, and textual data. A collection of graphs, visualizations, and tables were created to compare clinics over time or evaluate singular clinics during a desired quarter.
Joie Li
Statistical Analysis and Visualization of Longitudinal Symptom Patterns in Endometriosis
Abstract: Endometriosis is a chronic inflammatory condition characterized by the presence of endometrial-like tissue outside the uterus. This condition affects approximately 10% of individuals assigned female at birth globally from menarche to menopause and currently has no definitive cure. One of the major challenges in managing endometriosis is the heterogeneity of symptoms, which often leads to delayed diagnosis that can extend up to 12 years. This project presents part of an internship conducted with SymptoLab, a FemTech startup developing a digital health platform to help manage female hormone-related diseases and support physician decision-making, beginning with endometriosis. In this project, longitudinal symptom data collected from endometriosis patients who used SymptoLab’s tracking platform were analyzed. Using statistical analysis and modeling approaches, we investigated recurring symptom trajectories and examined their associations with different phases of the hormonal cycle. In addition, we designed the monthly patient reporting workflows with visualizations to improve patient-physician communication and support personalized care planning for individuals with endometriosis.
Yucheng Zhao
Interactive Clinical Trial Visualization: Developing R Shiny Dashboards for Real-Time Safety and Efficacy Monitoring in Oncology
Abstract: Real-time data monitoring is essential for safety and efficacy evaluation in Phase I/II oncology trials. This report details a practicum at RemeGen focused on developing interactive clinical dashboards to enhance data oversight. Using R Shiny, a suite of modular visualization tools was engineered to standardize and process clinical datasets across study domains. Technical implementations included Adverse Event (AE) swimmer plots for longitudinal safety tracking, waterfall plots and Empirical Cumulative Distribution Function (ECDF) curves for tumor response, and visit-based visualization of laboratory results. These dashboards integrated data-cleaning functionalities to monitor patient enrollment, treatment modifications, and trial milestones such as End of Treatment and End of Study. The resulting platform provided biostatistics and clinical teams with a centralized interface for real-time monitoring. These tools facilitated the identification of dose-limiting toxicities and tumor size changes, supporting informed decision-making during interim analyses. The modular design ensures these components are scalable across internal servers for multiple oncology programs.
Longitudinal Data Analysis - HSC LL 203
Yuechu Hu
Within-Person Longitudinal Associations between Restrictive Eating, Intuitive Eating and BMI across Three Time Points: 2010-2023
Abstract: Body mass index (BMI) is influenced not only by biological factors but also by eating behaviors that evolve over time. Restrictive eating and unhealthy weight control behaviors (UWCB) have been associated with adverse weight outcomes, while intuitive eating has been proposed as a protective factor. However, the longitudinal relationships between these eating behaviors and BMI across adulthood remain unclear. This study aims to examine how long-term patterns of eating regulation, ranging from dieting and restrictive eating to intuitive, body-attuned eating, relate to changes in BMI within the same individuals over time. The data are from a population-based survey cohort, which includes repeated measures of eating behaviors and BMI collected over multiple waves (2010 - 2023). Data will be restructured into long format to account for repeated observations within individuals. Descriptive statistics will be used to summarize the distribution of key variables. BMI over time will be visualized using spaghetti plots. Longitudinal associations between eating behaviors and BMI will be evaluated using mixed-effects models to account for within-person correlation across time.
Leyang Rui
Linking Neural Signatures To Behavioral And Cognitive Developments: A Longitudinal EEG Study In Early Childhood
Abstract: Early brain development is fundamental to later behavioral, cognitive, and perceptual functioning. Using longitudinal EEG data from the HBCD study, a nationwide cohort of more than 7,000 children across 27 research sites, this project examines whether directed neural connectivity predicts early developmental outcomes. We focus on EEG data from six brain regions, together with demographic variables and measures including the SPM-2, Vineland Adaptive Behavior Scales, Bayley Scales, MacArthur-Bates Inventories, and multilingual language scores. EEG signals were cleaned with Independent Component Analysis, harmonized across sites using ComBat, and modeled with a mixed-effects vector autoregression framework to estimate subject-specific directed functional connectivity. Global connectivity strength, summarized by the Frobenius norm, was then used to predict behavioral and cognitive phenotypes through regression models. Preliminary findings show significant associations with sensory processing, planning and ideas, balance and motion, and social participation, suggesting that early directed neural connectivity may serve as a meaningful marker of developmental functioning.
Qianyu Wu
Interactive Effects of Low-level PM2.5 Exposure And Metabolic Syndrome On Brain Structure: A Study Based On UK Biobank
Abstract: Both long-term exposure to fine particulate matter (PM2.5) and metabolic syndrome (MetS) are known to significantly affect brain structures, yet their combined effects remain unclear. This study investigated the interaction between long-term low-level PM2.5 exposure (≤15 μg/m³) and MetS on brain structural changes, utilizing data from the UK Biobank with 31,492 participants. A total of 40 brain regions were analyzed, including volume, cortical thickness, and surface area. Generalized linear models were employed, adjusted for key covariates. Results indicate that the interaction between MetS and low-level PM2.5 exposure exacerbated these effects, with an interquartile range (1.34 μg/m³) increase associated with volume reduction in 9 regions, extending to 26 regions under PM2.5 levels of 10-15 μg/m³. Females with MetS were particularly vulnerable, exhibiting significant reductions in brain volume and cortical thickness. These findings highlight the synergistic effects of MetS and low-level environmental PM2.5 exposure on brain health, providing critical insights for public health interventions.
Yiran Xu
A Deep Learning Model for Early Detection of Stiff-Person Syndrome Using Administrative Claims Data
Abstract: Stiff-Person Syndrome (SPS) is a rare neurologic disorder characterized by rigidity, spasms, and overlapping features with conditions such as multiple sclerosis, Parkinson’s disease, and anxiety disorders. These similarities create diagnostic complexity, leading to frequent misdiagnosis and delayed recognition. Research applying machine learning to SPS diagnosis has been minimal, with only one prior study testing a machine learning model on data from fewer than one hundred patients. This study seeks to advance that work by developing and testing a new Transformer-based algorithm on large-scale claims data and identifying the key clinical factors that differentiate SPS from related conditions along the patient journey.
Machine Learning - HSC LL 201
Ravi Brenner
Assessing Bias In Injectable HIV Treatment Access Using Machine Learning
Abstract: The Accelerating Implementation of Multilevel Strategies to Advance Long Acting Injectables for Underserved Populations (ALAI UP) project helps clinics in the US develop injectable HIV treatment programs that address inequity in health outcomes using a variety of strategies. To be prescribed injectable cabotegravir, patients must be educated about the treatment to determine their interest. Education practices vary widely by site, meaning that bias in who is offered the treatment could occur. I used tree-based machine learning methods to predict the probability of education for patients at 4 different clinic sites, using demographic variables and social determinants of health as predictors, and accounting for interactions and site-level effects. Model performance was modest in aggregate, and no better than chance at the site level (based on AUC). This indicates that there was minimal detectable bias in who was educated, a major success for the program. Site was the strongest predictor of education, indicating that clinic-level practices around patient education play a substantial role in patient access to treatment and act as the primary social determinant in this area.
Tong Su
Patient-Journey-Based Temporal Feature Engineering and Metastasis Prediction Using the MIMIC-III Database
Abstract: Early identification of patients at risk for metastatic progression is important for improving cancer management and clinical decision-making. I developed a patient-journey–based prediction framework using the MIMIC-III clinical database. A retrospective cohort of 334 patients was constructed using eligibility criteria and temporal filters to define index and snapshot dates with adequate look-back and follow-up periods. Metastasis within 180 days after the snapshot date was defined as the primary outcome using ICD-9 secondary malignant neoplasm codes. Predictors were derived from demographics, diagnoses, procedures, medications, and laboratory testing patterns using temporal roll-up and raw feature engineering. A total of 225 candidate predictors were evaluated. An XGBoost model achieved a test ROC-AUC of 0.83. The top 30 predictors selected based on mean absolute SHAP values achieved similar performance (ROC-AUC 0.81). Key predictors included time since primary diagnosis, laboratory testing patterns, treatment exposure, and patient age. This framework provides a reproducible approach for metastasis prediction using temporally aligned electronic health record data.
Minghe Wang
Enhancing Dementia Classification in Small Clinical Cohorts via Density-Ratio Weighted Transfer Learning
Abstract: Machine learning for dementia diagnosis requires large datasets, yet gold-standard clinical cohorts are small. Larger modern cohorts exist but present significant covariate shifts. This study develops a Transfer Learning (TL) framework to address this shift, leveraging HCAP (Source) to enhance dementia classification in ADAMS (Target). We harmonized sociodemographic, functional, and cognitive variables between ADAMS and HCAP. Replicating the feature engineering of a domain-expert baseline logistic regression, we applied a TL algorithm using density ratio estimation via convex optimization. This assigns instance weights to the source domain to mitigate distribution mismatch, followed by a dual-layer phenotyping optimization. Performance was evaluated using 5-fold cross-validation. The TL framework outperformed the expert baseline using identical data splits. Our TL model achieved a higher mean AUC of 0.9113 (SD: 0.0228) compared to the baseline's 0.8987 (SD: 0.0331), demonstrating improved accuracy and stability. TL successfully bridges disparate aging cohorts. Future work will introduce simultaneous self- and proxy-reported features to capture complex patient-informant interactions.
Zebang Zhang
Interpretable Machine Learning for Diabetes Diagnosis Prediction
Abstract: This practicum project develops and evaluates supervised machine learning models to predict diabetes diagnosis at a single visit using routinely available demographic, lifestyle, family history, and clinical variables. Using an existing dataset of 100,000 adult patient profiles, the study compares interpretable and flexible approaches, including logistic regression/Elastic Net, generalized additive models, support vector machines, random forests, and XGBoost. Preprocessing includes exploratory analysis, categorical encoding, standardization of continuous features, and principled feature selection. Model performance will be assessed through stratified cross-validation and a held-out test set using discrimination, calibration, and clinically relevant metrics such as ROC-AUC, PR-AUC, sensitivity, specificity, F1-score, and Brier score. The project will also identify and rank key determinants of diabetes risk through standardized effects, partial effects, and feature importance measures. The goal is to produce a parsimonious and interpretable risk prediction tool to support early screening and targeted counseling.
Machine Learning and Dimension Reduction - HSC LL 202
Riya Kalra
Sex-Stratified Brain Age Gap Estimation in Alzheimer's Disease: A Machine Learning Approach with High-Dimensional Structural Neuroimaging Features
Abstract: Alzheimer's disease is not sex-neutral. Two-thirds of patients are women, yet standard brain age models impose a single aging reference across sexes, embedding systematic bias into every prediction. Using a large baseline ADNI cohort, I developed sex-stratified Ridge Regression models on neuroimaging features with automated alpha tuning and leakage-free cross-validation. Direct comparison against a pooled model demonstrates that stratification reduces prediction error and eliminates between-sex BAG divergence. Male and female brains age through distinct mechanisms: male aging is driven by localized hippocampal atrophy while female aging reflects distributed cortical network changes. Corrected Brain Age Gap followed a stepwise disease gradient, increasing from near zero in cognitively normal subjects through mild impairment to its peak in Alzheimer's disease, a statistically significant progression. After stratification, sex itself was non-significant, confirming systematic bias was eliminated and diagnosis alone drives accelerated brain aging. Sex-stratified normative modeling is both statistically superior and biologically essential for equitable Alzheimer's precision medicine.
Soo Min You
Evaluating Representation Embeddings from LLMs and Time-Series Foundation Models for Wearable Accelerometer-Based Health Prediction
Abstract: Wearable accelerometers capture rich behavioral signals relevant to health monitoring, yet the comparative evidence on modern representation-learning approaches remains limited. Using accelerometer data from the National Health and Nutrition Examination Survey (NHANES), we evaluated three representation families for predicting multiple clinical outcomes: simple entropy-based features, pretrained large-language-model (LLM) embeddings, and time-series foundation model embeddings. Outcomes included overweight status, lipid biomarkers, glucose, arthritis, and cancers. Across outcomes, entropy-based features consistently performed comparably to, and often slightly better than, embedding approaches. LLM-derived embeddings offered only marginal improvements (ΔAUC≈0.01-0.05), while time-series foundation model embeddings added little predictive value. Prompt-based LLM reasoning performed worst (AUC≈0.56-0.65), demonstrating limited ability to infer physiological states from structured text. These results highlight the strength of simple variability features and underscore that domain-aligned pretraining is needed for time-series foundation models in wearable health applications.
Statistical Genetics - HSC LL 207
Mengyuan Chen
Statistical Analysis of MPRA Data to Identify Regulatory Elements in 3′ UTRs of Haploinsufficient Genes
Abstract: Haploinsufficiency, in which a single functional copy of a gene is insufficient to maintain normal physiological function, contributes to more than 600 genetic disorders, many affecting neurological development. This study investigates regulatory mechanisms within the 3′ untranslated regions (3′ UTRs) of haploinsufficient genes that influence gene expression. Data were obtained from a massively parallel reporter assay (MPRA) containing about 12,000 reporter constructs representing 270-nt fragments from the 3′ UTRs of over 600 genes, with expression measured using RNA/DNA ratios from sequencing in two human cell lines. Statistical analyses included correlation analysis to assess reproducibility across replicates and associations between sequence features and transcript abundance. Results showed strong agreement between replicates and across cell types. GC-rich sequences and more stable RNA structures were associated with increased mRNA abundance, while AU-rich motifs were linked to decreased expression. These findings provide insight into post-transcriptional regulation in haploinsufficient genes and identify candidate regions for antisense oligonucleotide therapeutic strategies.
Zhaokun Lin
Extending the Lamian Framework for Differential Pseudotime Analysis of Single-Cell Multi-Sampled DNA Methylation Data
Abstract: Pseudotime analysis is widely used to study dynamic biological processes such as neural stem cell differentiation. The Lamian framework uses spline-based modeling and mixed-effects regression to perform differential pseudotime analysis across multiple scRNA-seq samples. However, its statistical model assumes normally distributed data. Single-cell DNA methylation (scDNAm) data are proportions or counts, so this assumption does not hold. This limitation prevents direct application of Lamian to epigenomic data types. To address this, the Lamian framework is extended to single-cell multiomics data with a focus on scDNAm. CpG sites are aggregated to build stable methylation features, and pseudotime trajectories are constructed from the neural stem cell neurogenic lineage. This extension is then applied to scNMT-seq data from the adult mouse ventricular-subventricular zone.
Wayne Monical
Genome-Wide Association Study for Chronic Kidney Disease
Abstract: Chronic kidney disease (CKD) has an estimated global prevalence of 700 million cases, driven in large part by diabetes and hypertension. To understand the genetic component of several CKD types, including Focal Segmental Glomerulosclerosis (FSGS) and Membranous Nephropathy (MN), we conducted a genome-wide association study (GWAS) utilizing the MEGAEx array. A rigorous quality control pipeline was implemented, including filters for minor allele frequency (MAF), Hardy-Weinberg Equilibrium, and sample-level checks for missingness, sex discrepancies, and relatedness. The final analytical cohort consisted of 5,636 individuals and 513,581 variants. Population stratification was negligible, and the genomic inflation factor was well controlled at 1.0028. We identified 8 genome-wide significant variants with p-values less than 5e-8. The most significant associations were located at the CAMKK1 locus and the RNF150 locus, both of which exhibited protective effects.
Flora Pang
A Statistical Genetics Framework for Investigating Noise-Related Tinnitus in the UK Biobank
Abstract: Tinnitus is a heterogeneous auditory condition influenced by environmental exposures and age-related hearing loss (ARHL), complicating the identification of genetic risk factors. Using UK Biobank data, we implemented a scalable statistical genetics pipeline to investigate genetic associations with noise-related tinnitus. Population structure was evaluated using principal components derived from FlashPCA, and genome-wide association analyses were performed with REGENIE to account for relatedness and relevant covariates in a large biobank cohort. Three complementary phenotype and covariate specifications were analyzed to assess the robustness of association signals and distinguish variants linked to tinnitus from those driven by hearing loss or exposure-related confounding. Genome-wide significant loci were identified across analyses, including regions near MIR4790/GRM7-AS3, CRIP3/ZNF318, APOE, and APOC1. These findings establish candidate regions for downstream statistical fine-mapping using SuSiE and highlight the importance of rigorous modeling strategies when analyzing complex auditory phenotypes in large-scale genomic datasets.
Chenhui Yan
LamianOmni: Multi-Sample Pseudotime Inference for Single-Cell DNA Methylation Data
Abstract: Pseudotime analysis is a powerful approach for studying how gene regulation changes along continuous biological processes such as cell differentiation. While Lamian provides a robust framework for comparing temporal patterns across multiple samples or conditions, it has so far only been applied to single-cell RNA-seq data. In this study, we propose LamianOmni, which extends Lamian to single-cell DNA methylation (DNAm) data. A major challenge is extreme sparsity: over 99% of CpG sites are unmeasured per cell. We address this with a promoter-level aggregation strategy that produces stable methylation signals for pseudotime modeling. We apply LamianOmni to scNMT-seq data from mouse erythroid differentiation, comparing wild-type and TET triple knockout conditions, and further integrate DNAm with gene expression to explore coordinated regulatory dynamics. LamianOmni offers a generalizable framework for multi-sample pseudotime analysis of single-cell epigenomes.
Survival Analysis - HSC LL 208B
Lingyuan Huang
Diet–Exercise Patterns And Liver Fibrosis Risk Among U.S. Adults: A PCA–Cluster Analysis
Abstract: This study examined the association between integrated diet–exercise patterns and liver fibrosis among U.S. adults using NHANES 2017–2020 data. Adults aged ≥18 years with valid vibration-controlled transient elastography measurements were included (n=7,587). Ten energy-adjusted dietary variables and three physical activity indicators were combined using principal component analysis (PCA), followed by K-means clustering to identify lifestyle patterns. Two distinct patterns emerged: a high-fat, high-protein, sedentary pattern and a higher-carbohydrate, more physically active pattern. Survey-weighted logistic regression models assessed associations with liver stiffness measurements (LSM) and controlled attenuation parameter (CAP). In unadjusted models, the more active pattern was associated with lower odds of cirrhosis (OR=0.73, 95% CI: 0.51–0.94), but the association attenuated after adjustment for demographic and metabolic factors. Hypertension partially mediated the relationship. No independent association was observed with hepatic steatosis. Findings suggest lifestyle patterns influence liver fibrosis primarily through cardiometabolic pathways.
Leah Li
Evaluating The Effectiveness Of Tafamidis In Patients with Transthyretin Amyloid Cardiomyopathy (ATTR-CM)
Abstract: Transthyretin Amyloid Cardiomyopathy (ATTR-CM) is a progressive and often underdiagnosed condition characterized by deposition of transthyretin amyloid in the myocardium, leading to heart failure and reduced survival. Recent therapeutic advances have introduced disease-modifying treatments such as tafamidis, a transthyretin stabilizer that slows disease progression. This project evaluates the effectiveness of tafamidis therapy in patients with ATTR-CM, focusing on survival outcomes, cardiovascular mortality, and hospitalization risk. Using clinical trial evidence and long-term follow-up data from patients across different disease stages, outcomes among patients receiving continuous tafamidis treatment were compared with those initially receiving placebo before transitioning to tafamidis. Results suggest that tafamidis treatment is associated with reduced all-cause and cardiovascular mortality and lower rates of cardiovascular-related hospitalization, particularly among patients diagnosed at earlier disease stages. These findings highlight the importance of early diagnosis and timely initiation of disease-modifying therapy in improving clinical outcomes for patients with ATTR-CM.
Yuanyuan Zhang
Association Between Low-Density Lipoprotein Cholesterol Levels And Risks Of All-Cause And Cardiovascular Mortality In Patients With Hypertension And Comorbidities
Abstract: This study examined the association between low-density lipoprotein cholesterol (LDL-C) levels and the risks of all-cause and cardiovascular mortality among patients with hypertension and comorbidities. Data came from a population-based cohort in Guizhou Province, China, established in 2010 with follow-up in 2016–2020 and 2023. After exclusions, 2,426 hypertensive participants were included. Cox proportional hazards regression was used to estimate hazard ratios, and restricted cubic spline models were applied to assess dose-response relationships. Over a median follow-up of 11.69 years, higher LDL-C levels were associated with increased risks of all-cause and cardiovascular mortality. Among patients with comorbidities, mortality risks were substantially higher in the upper LDL-C quartiles than in the lowest quartile. A linear dose-response relationship was observed. These findings suggest that LDL-C is an independent risk factor for mortality in hypertensive patients with comorbidities and support stricter lipid management in this high-risk population.