2026 Biostatistics Practicum/APEx Symposium

The Biostatistics MS and MPH Programs require students to complete one semester of hands-on, real-world experience through a Practicum (for MS students) or Applied Practice Experience, or APEx (for MPH students).

This requirement gives students the chance to apply their classroom knowledge in a real-world public health setting—translating theory into practice. Each project is tailored to the student’s interests, skills, and career goals, and reflects a partnership with a public health organization or research initiative. It is a culmination of their academic journey at Columbia Mailman School of Public Health.

The Practicum/APEx Symposium is a showcase of this work. On Friday, April 24th, students will present their projects to faculty, peers, and the broader Columbia community. It’s a celebration of their achievements and a day of learning that spans the breadth and depth of Biostatistics in action.

We invite all CUIMC and Mailman School students, faculty, staff, and affiliates to join us in supporting and learning from the graduating Class of 2026.

You can view the titles and abstracts for each session below, organized by topic and room number, or in the digital program booklet.

Session 1 (1:00pm - 2:00pm)

Applications in Mental Health - HSC LL 208A

Yujing Fu
Neuroimaging Evidence for Sundowning: Altered Time-of-Day Dynamics of Salience Network Connectivity in Alzheimer's Disease

Abstract: "Sundowning" describes worsening agitation and confusion in the late afternoon or evening in Alzheimer's disease (AD). Preclinical studies link this phenomenon to abnormal neural connectivity during the evening period. We examined whether resting-state functional connectivity (rsFC) of the Salience Network (SN) shows time-of-day alterations in humans using 1,311 rs-fMRI scans from 736 ADNI participants (397 cognitively normal, 248 mild cognitive impairment, 91 AD). SN rsFC was derived using the Power 264 atlas. Linear mixed-effects models tested diagnosis-by-scan-time interactions while adjusting for age, sex, head motion, hypertension, and sedative use. SN connectivity declined later in the day in cognitively normal and MCI participants (slopes −0.003 and −0.004), but increased in AD (slope 0.004). Edge analyses identified specific SN connections contributing to this interaction, and sex-stratified models suggested stronger effects in male AD. These findings translate preclinical signatures of sundowning to the human brain and suggest a network-level substrate for time-dependent neuropsychiatric symptoms in AD.

Su Hyun Kim

Baseline Symptom Dimensions Predict Antipsychotic Discontinuation and Side-Effect Risk in Schizophrenia

Abstract: Antipsychotic effectiveness varies widely in schizophrenia, with intolerable side effects and lack of efficacy driving treatment discontinuation. While symptoms are linked to functional impairment, their role in treatment persistence is unclear. Using Clinical Antipsychotic Trials of Intervention Effectiveness Phase 1 data (n=1460), we examined whether latent symptom dimensions predict discontinuation and side effect risk. Latent factors were derived from baseline clinical, sociodemographic, and neurocognitive measures via factor analysis. Side effect onset and time to discontinuation were modeled using regularized Cox proportional hazards models, and symptom trajectories via mixed-effects models. Higher baseline depressive (HR=1.19) and positive symptom (HR=1.20) factors predicted earlier discontinuation. The baseline depressive symptom factor was associated with increased hazards of autonomic (HR=1.13), extrapyramidal (HR=1.12), and sexual (HR=1.18) side effects. Among discontinuers, those with higher baseline depressive symptoms improved symptomatically but discontinued due to side effects. Baseline symptom factor structure may help personalize treatment planning.

Alice Zhou

Understanding Cognitive Control in the ABCD Stop-Signal Task Using Diffusion-Based Models

Abstract: Response inhibition deficits are frequently observed in several psychiatric conditions, including attention-deficit/hyperactivity disorder (ADHD) and obsessive-compulsive disorder (OCD). Cognitive process models provide a framework for explaining these behavioral differences by decomposing task performance into latent decision-making components. In the present study, we fit hierarchical cognitive models to Stop Signal Task (SST) data from 1,299 children aged 9-10 years in the Adolescent Brain Cognitive Development (ABCD) Study. Two model parameterizations were compared: a traditional race-diffusion ex-Gaussian (RDEX) model assuming constant evidence accumulation, and an ABCD-specific extension that relaxes the context-independence assumption by decomposing evidence accumulation into a baseline component (v0) and a growth component (g) that varies with stop-signal delay. We first examined whether relaxing context independence improved model fit in this larger and clinically diverse sample. Consistent with previous work, the ABCD-specific parameterization showed improved DIC and BPIC relative to the traditional RDEX model. We then examined associations between model parameters and external behavioral, clinical, and neuroanatomical measures. The baseline evidence accumulation parameter (v0) was associated with performance on multiple cognitive tasks, including the N-back working memory task and the Flanker interference task, and was negatively associated with CBCL ADHD and stress symptom scores. In contrast, none of the model parameters differed between participants with and without OCD, despite longer stop-signal reaction times in the OCD group. These findings suggest that baseline evidence accumulation reflects a general aspect of cognitive processing efficiency.

Cancer Research - HSC LL 204

Yifei Chen
Forecasting Immuno-Oncology Utilization In China: Integrating Policy, Safety, And Market Signals

Abstract: The use of PD-1/PD-L1 inhibitors has expanded rapidly across multiple cancer indications in China, yet real-world utilization cannot be predicted based solely on clinical efficacy. Adoption is influenced by safety management, policy reforms, and market competition. This practicum integrates clinical evidence, safety signals (immune-related adverse events, irAEs), policy timelines, and competitive intelligence into a reproducible workflow to support market access assessment and utilization forecasting. Corporate pipeline indicators were quantified using publicly available reports, and major irAEs were prioritized by harmonizing terminology from PubMed and pharmacovigilance sources. A PRISMA-guided meta-analysis framework was developed for PD-L1–high first-line NSCLC trials including KEYNOTE-024 and EMPOWER-Lung 1. Results suggest that Junshi's asset counts follow a scale-up-then-plateau pattern. Key irAEs include thyroid dysfunction, pneumonitis, colitis, hepatitis, rash, and fatigue. These outputs define covariates that can inform time-series forecasting of IO utilization.

Alice Mao
Dairy Product And Calcium Intake And Colorectal Cancer Risk In A Pooled Analysis Of Prospective Cohort Studies

Abstract: Dairy products and calcium intake have been associated with a reduced risk of colorectal cancer (CRC) in several epidemiologic studies, but evidence across large international populations remains limited. This study will examine the associations of dairy intake and of dietary and supplemental calcium and vitamin D intake with CRC risk using data from the Pooling Project of Prospective Studies of Diet and Cancer (DCPP). The analysis will include 21 prospective cohort studies with 1,794,810 participants and 19,835 incident CRC cases. Diet was assessed using validated food frequency questionnaires, and energy-adjusted calcium and vitamin D intake was calculated using the residual method. Study-specific multivariable Cox proportional hazards models will estimate hazard ratios and 95% confidence intervals for CRC across levels of dairy, calcium, and vitamin D intake, with estimates pooled using random-effects models. Models will adjust for demographic, lifestyle, medical, and dietary factors including BMI, smoking, alcohol intake, physical activity, diabetes, and key dietary variables. Associations will also be evaluated across CRC subtypes and among individuals with higher familial risk.

Fikrianti Surachman

Budget Impact And Cost-Effectiveness Analysis Of HPV-DNA Test For Workplace Cervical Cancer Screening In Indonesia

Abstract: Cervical cancer is the second leading cause of cancer mortality among women in Indonesia. HPV-DNA testing offers higher sensitivity and longer screening intervals than the standard Pap smear. A budget impact and cost-effectiveness analysis was conducted for HPV-DNA screening strategies from an employer's perspective. Adapting the R code and framework from the DARTH workgroup, a hybrid decision tree and cohort state-transition model was constructed to simulate 1,000 female employees over their working lifespan. Deterministic one-way sensitivity analysis was performed to evaluate parameter uncertainty. Results indicate that both HPV-DNA urine and vaginal tests are cost-saving compared to Pap smear over a 5-year budget horizon. Over 29 years, Pap smear was strictly dominated by both HPV-DNA strategies. The vaginal test provided greater clinical benefit than the urine test at a higher incremental cost, and the conclusions remained robust across key parameter bounds. To facilitate employer procurement decisions, the model is being translated into a Shiny application, allowing dynamic adjustment of input values.

Tianci Zhu

Epidemiology and Prognosis of Synchronous Brain Metastases in Lung and Breast Cancer: A SEER Population-Based Study

Abstract: Brain metastases (BM) are a serious complication in cancer patients and are particularly common in lung and breast cancers. When BM are present at the time of diagnosis, they may substantially affect treatment strategies and survival outcomes. However, population-level evidence describing the incidence and prognosis of synchronous BM remains limited. In this study, we use data from the SEER Research Plus database to examine the epidemiology and survival outcomes of synchronous BM in patients with newly diagnosed lung and breast cancers. Adult patients with lung or breast cancer will be identified, and the presence of BM at diagnosis will be determined using SEER metastatic site variables. We will estimate the proportion of patients presenting with synchronous BM and compare demographic characteristics between patients with and without BM. Survival outcomes will be evaluated using Kaplan-Meier methods and log-rank tests, and multivariable Cox proportional hazards models will identify independent prognostic factors among BM patients. This study provides population-based evidence on the burden and prognosis of synchronous BM and highlights potential demographic disparities in survival.

Causal Inference and Hypothesis Testing - HSC LL 205

Fangchi Lu

Causal Inference for Exposure-Adjusted Crash Risk from Real and Synthetic Data

Abstract: This project investigates how short-term external shocks may influence traffic crash risk after accounting for exposure and examines causal inference methods for estimating such effects in observational data. The study combines empirical analysis with simulation-based methodological evaluation. We construct a borough-day panel dataset for New York City in 2025 Q1 using publicly available crash records and contextual variables. The real dataset is used to define key variables, characterize treatment variation (e.g., extreme weather events), and estimate baseline relationships between shocks and crash risk using panel models with spatial and temporal controls. Building on this structure, we develop a synthetic data-generating process reflecting key features of the observed data, including exposure levels, treatment assignment, and potential confounding. The framework enables systematic evaluation of causal estimators across repeated simulations and aims to illustrate how simulation calibrated to real data can support evaluation of causal inference methods in applied risk analysis.

Jianming Wang

Patient Photograph Characteristics of Landmarks and Retract-and-Reorder Events in Electronic Health Records

Abstract: Wrong-patient orders can be identified using the retract-and-reorder (RAR) algorithm, in which an order placed for one patient is retracted and reordered for another; prior studies suggest RAR events account for a substantial proportion of such errors. Objective: To examine whether characteristics of patient photographs are associated with wrong-patient order events. Materials and Methods: We analyzed EHR orders placed at NewYork-Presbyterian (NYP) from 2014 to 2024, including patient photograph features displayed at the time of order placement. Facial landmarks were used to derive geometric features describing facial structure and positioning. Exploratory data analysis and statistical hypothesis testing compared photo-related features between regular orders and retract-and-reorder (RAR) events. Results: Facial landmark-derived features differed significantly between regular orders and RAR events. These features were significantly associated with wrong-patient order events and showed notable effect sizes. Outcome: Facial landmark-derived features showed meaningful associations with wrong-patient order events.

Mingyin Wang

Has Congestion Pricing Improved Road Safety? A Case Study in New York City

Abstract: In January 2025, NYC implemented the Central Business District Tolling Program, a cordon-based congestion pricing policy. While its effects on traffic volume are documented, road safety impacts remain underexplored. This study evaluates the program's short-term effects on traffic collisions and injury rates using a monthly panel dataset of zip code-level crash data from January 2024 to December 2025. We employ difference-in-differences, matched difference-in-differences, generalized synthetic control, generalized additive models, and event study designs to estimate changes in injury rates per 10,000 residents and total crash counts. Across all specifications, we find no statistically significant or sustained reduction in traffic injuries or collisions following implementation. While the event study indicates a transient decline in crashes immediately upon implementation, this effect dissipated in subsequent months. These findings suggest congestion pricing alone may not yield immediate safety benefits in complex urban environments and likely requires complementary interventions, such as infrastructure upgrades, to achieve sustained reductions in traffic harm.

Shiyu Zhang

AI-Inferred Photo Quality and Patient Characteristics Associated With Retract-and-Reorder Events in Electronic Health Records

Abstract: Objective: To examine whether AI-inferred photo quality and patient characteristics available during EHR order entry are associated with retract-and-reorder (RAR) events, an established proxy for wrong-patient electronic ordering. Materials and Methods: Using the Photo AI RAR dataset, we analyzed EHR orders placed at NewYork-Presbyterian (NYP) from 2014 to 2024 and linked order-entry workflow records to automated, AI-extracted attributes from patient photographs displayed at the time of ordering (e.g., inferred patient characteristics, eyewear, masking/occlusion, pose, and face-geometry features). We collapsed multiple order records within each document_id to a single analytic record, labeling it positive if any contained order had is_rar_event=1, and excluding observations with missing RAR (no eligible orders in window). We performed exploratory data analysis, hypothesis testing, and logistic regression to assess associations between AI-derived photo attributes and RAR events. Conclusions: Integrating EHR workflow data with AI-inferred photograph attributes may enable scalable, hypothesis-generating assessment of factors associated with wrong-patient ordering risk.

Clinical Trials - HSC LL 208B

Shumei Liu

Temporal and Structural Analysis of Clinical Pharmacology Research Patterns in Chagas Disease and Non-Chagas Drugs

Abstract: Clinical pharmacology studies are a key component of drug development and regulatory science, but their distribution across disease areas and approval stages remains incompletely described. Regulatory and clinical trial records from multiple evidence sources, including FDA databases and trial registries, were integrated and standardized in SAS to characterize clinical pharmacology study patterns related to Chagas disease and to compare them with those of non-Chagas drugs. Analyses examined study number, study type, trial phase, study duration, site distribution, sponsor structure, and patterns before and after FDA approval. Distinct differences were observed across disease groups and regulatory stages, indicating variation in the scale, timing, and organization of clinical pharmacology research. These findings support the value of structured multi-source data integration for regulatory landscape analysis and suggest that disease focus and approval timing may influence clinical research strategy.

Kimberly Palaguachi-Lopez

Statistical Modeling of Longitudinal Kidney Function Outcomes in Diabetes: A Mixed-Effects Model Comparison

Abstract: Recent advances in cardiometabolic therapies have expanded treatment options to reduce cardiovascular and kidney complications among individuals with diabetes. However, some diabetes populations remain at elevated risk of cardiorenal disease progression. This project analyzed data from a randomized clinical study evaluating the effects of a metabolic therapeutic agent on longitudinal markers of kidney function in adults with diabetes. The primary analysis followed an intention-to-treat framework and assessed treatment effects on repeated kidney function measurements over a six-month follow-up period. Traditional longitudinal approaches, including ANCOVA and linear mixed models (LMM), were used to account for correlated repeated measures and estimate treatment effects over time. As an ancillary methodological analysis, generalized additive mixed models (GAMM) were explored as a flexible alternative to account for potential nonlinear physiological hemodynamic responses to therapy. This comparison highlights the potential value of GAMM as a complementary approach to conventional LMM for understanding treatment-related changes over time in longitudinal clinical trial data.

Shiying Wu

Development and Integration of an Exploratory R Shiny App and RENOVATE Data Processing for SIRP-α

Abstract: This project focused on developing and enhancing the SIRP-α exploratory R Shiny application to support interactive review of multiple clinical trials. The work involved integrating and validating modules for safety, demographics, biomarkers, and clinical timelines using CDISC SDTM datasets. New trials (1501_0001 and 1501_0002) were added and validated within the application, which also incorporates existing 1443 program trials. During development, I identified workflow and functionality gaps and provided recommendations to the DaVinci development team for future improvements. In addition, a pooled dataset (1501_P02) was created by combining data from study 1501_0001 with selected subjects from 1443_0003 who received the same target drug, enabling exploratory cross-study analysis within the application. Furthermore, response datasets for studies 1501_0001 and 1501_0002 were prepared and formatted to be compatible with RENOVATE, an existing efficacy visualization tool, ensuring consistent data structures and enabling downstream efficacy analyses across trials.

Jingyan Yu

Statistical Analysis Of A Phase 3 Clinical Trial Evaluating Non-Inferiority Of A Semaglutide Injection In Type 2 Diabetes

Abstract: This practicum project analyzes data from a multi-center Phase 3 randomized clinical trial evaluating the efficacy, safety, and pharmacokinetic characteristics of a domestically developed semaglutide injection (CA505) compared with the originator product Ozempic in adults with type 2 diabetes mellitus. The primary objective is to assess the non-inferiority of CA505 relative to Ozempic with respect to change in glycated hemoglobin (HbA1c) from baseline to week 32. Secondary objectives include evaluating safety and tolerability through treatment-emergent adverse events and comparing pharmacokinetic parameters across treatment groups. Statistical analyses follow the trial's Statistical Analysis Plan and the ICH E9(R1) estimand framework, generating tables, figures, and listings to support regulatory decision-making in pharmaceutical clinical trials. Methods include analysis of covariance and mixed models for repeated measures for continuous outcomes, logistic regression for categorical endpoints, and sensitivity analyses for missing data using multiple imputation and tipping point approaches.

Data Analysis - HSC LL 203

Zhengyong Chen

Association Between Modified Sepsis Bundle Adherence and Clinical Outcomes in Pediatric Patients with Phoenix Sepsis

Abstract: Pediatric sepsis is associated with substantial morbidity and mortality. Although adherence to sepsis care bundles has been linked to improved outcomes, the impact of modified bundled care remains unclear in the newly defined Phoenix sepsis population. This practicum focused on the statistical analysis of a retrospective cohort study evaluating the association between modified bundle adherence and clinical outcomes among pediatric patients meeting Phoenix sepsis criteria. Co-primary outcomes were hospital-free days and in-hospital mortality. Multivariable linear and logistic regression models were used to estimate the associations adjusting for prespecified covariates. In adjusted analyses, modified bundle adherence was associated with more hospital-free days and lower odds of in-hospital mortality, although neither association reached statistical significance.

Adeena Moghni

Building An Interactive R-Shiny Platform To Compare Pharmacokinetic Data Of Antibody Drug Conjugates In Preclinical Models

Abstract: Antibody Drug Conjugates (ADCs) combine the specificity of antibodies with the potency of small-molecule drugs to deliver cytotoxic chemotherapeutics to tumor cells. ADCs have complex design components, including linker chemistry, linker sites, and drug-to-antibody ratio, that make their pharmacokinetic (PK) behavior hard to predict. The goal of this project was to analyze ADC PK data to better understand the relationship between design components and PK, and to develop an interactive visualization platform in R to aid with this analysis. A generalized linear mixed-effects model was used to test the associations. Other analyses included correlation analysis, ANOVA testing, cross-species comparisons, and comparisons at different linker site combinations. The interactive dashboard allowed users to dynamically filter the data and select the variables of interest, generating plots, heat maps, and statistical summary tables in real time. This platform thus supported exploratory analysis of ADC PK data, which helped in the experimental design for future studies. A future goal is to fully automate data import so that the dashboard stays continuously updated.

Boxiang Tang

Adaptive Closed-Loop taVNS System for Physiology-Timed Neuromodulation

Abstract: This practicum project investigated a multimodal closed-loop transcutaneous auricular vagus nerve stimulation (taVNS) system designed to study physiology-timed neuromodulation. The system integrated synchronized EEG, ECG, respiration, and pupillometry recordings to evaluate how stimulation timing relative to internal physiological rhythms influences neural and autonomic responses. An initial pilot study compared fixed-parameter stimulation, sham stimulation, and respiration-triggered closed-loop stimulation in a small healthy-participant cohort. While EEG responses were heterogeneous, heart rate variability (HRV) measures suggested stronger autonomic modulation under respiration-timed stimulation. These observations motivated a redesigned experimental framework that decomposed respiration–stimulation coupling rules into multiple paradigms while keeping the base stimulation waveform constant. Multimodal analyses indicated that different coupling rules produced distinct physiological response patterns across EEG, HRV, and pupil measurements.

Jiawa Zhang

Mixed-Effects Models Enhance External Validation in Multi-Institutional Prediction Studies

Abstract: The external validity of clinical and epidemiological prediction models developed on multi-institutional data is often evaluated on unobserved clusters. While linear mixed models (LMMs) are commonly used to handle clustered data, their theoretical and empirical advantage over simpler linear models (LMs) for prediction in new clusters remains underexplored. Here we formally justify the superiority of LMMs in this setting. We derive risk bounds comparing the mean squared prediction error (MSPE) of LMMs and LMs for external cluster prediction. Our results show that the LMM fixed-effects estimator is inherently more efficient (lower variance) than the LM estimator when cluster-level heterogeneity is properly modeled. This efficiency gain ensures a lower MSPE for LMM predictors in any new cluster. Simulations further demonstrate that the LMM advantage is maximized in studies with high between-cluster variance and unbalanced cluster sizes. These findings clarify the fundamental basis for using LMMs in generalizability tasks and provide practical guidance for designing robust, cross-institutional prediction models. The approach is further illustrated using a real data application.

Data Visualizations - HSC LL 207

Bruce Liu

Enhancing Inclusive Digital Communication Through AI-Powered Video Generation: A Product Management Practicum At Mininglamp Technology

Abstract: During my practicum as a Product Manager at Mininglamp Technology in Beijing, China, I contributed to the development of a cutting-edge AI-powered video editing platform. The platform's core objective is to democratize content creation by leveraging machine learning algorithms to auto-generate videos from text, audio, and image inputs. To ensure the product effectively met user needs, I applied robust data analysis techniques to evaluate user behavior and platform engagement metrics. By employing Python and SQL, I queried product data and analyzed results from A/B testing to guide iterative feature improvements. I also designed data visualizations, such as user engagement dashboards and heatmaps, to present usability insights to cross-functional engineering and design teams. This data-driven approach directly informed the development of accessibility features, such as multilingual subtitle generation and voice synthesis. Ultimately, this work demonstrates how machine learning and advanced data analysis can cultivate inclusive communication tools, presenting significant potential for broader applications in public health education, awareness campaigns, and community storytelling.

Kunlun Liu

Development of a novel nucleic acid aptamer–drug conjugate (ApDC) based on MMAE

Abstract: Aptamer-drug conjugates (ApDCs) are targeted cancer immunotherapies designed to deliver cytotoxic agents, or payloads, selectively to tumor cells. The "guidance system" in ApDCs is a nucleic acid aptamer. Monomethyl auristatin E (MMAE) is an anti-tubulin agent that is frequently used as a cytotoxic payload in ADCs for cancer therapy. A novel ApDC termed "Multi" was generated by conjugating MMAE to an aptamer. Its transcriptional effects were investigated using RNA-sequencing data from treated tumor cells. The data were analyzed to identify transcriptional differences between MMAE- and Multi-treated tumor cells, particularly in the corresponding biological functions and the signaling pathways affected. Principal component analysis showed clear separation between the two groups. Differential expression analysis identified 270 differentially expressed genes. Gene Ontology (GO) enrichment analysis revealed significant enrichment in signaling-related pathways, cytokine activity, and lipid metabolic processes. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis further indicated enrichment of genes associated with cytokine signaling and metabolic regulation.

Chloe Nguyen

GLP-1 Medications and Liver Health Outcomes in iCHANGE Patients

Abstract: Glucagon-like peptide-1 receptor agonists (GLP-1s) are effective medications for managing type 2 diabetes and promoting weight loss. Two popular active medications are used as treatment: semaglutide (popularly known as Ozempic) and tirzepatide. My APEx involved working with iCHANGE at Weill Cornell Medicine. I collected data on PEth values of iCHANGE patients to help analyze whether GLP-1s are associated with a decrease in alcohol consumption. Additionally, I collected data and performed linear regression on liver stiffness across different medication groups (semaglutide, tirzepatide, and semaglutide or tirzepatide with resmetirom). The goal was to test whether GLP-1s alone or GLP-1s combined with resmetirom were the better treatment for improving liver health.

Zexuan Yan

A Biostatistical Analysis of the Little Free Library Program

Abstract: My Applied Practice Experience with H.E.A.L.T.H for Youths focused on evaluating the Little Free Library program, a community-based intervention aimed at improving book access and promoting literacy among residents in New York City. Using pre- and post-installation survey data, I conducted a quantitative analysis to assess the program's impact on reading habits and community engagement. My work involved cleaning and managing the dataset, applying descriptive statistical methods to calculate key metrics such as usage frequency and self-reported changes in reading behavior, and interpreting the results to draw meaningful conclusions about program effectiveness.

Environmental Health Research - HSC LL 109A

Yuhao Chang

Bayesian Kernel Machine Regression Analysis of Prenatal Metal Mixtures and Child Neurodevelopment

Abstract: Evaluating the health effects of environmental mixtures is challenging because individuals are exposed to multiple correlated chemicals simultaneously. This practicum project applied Bayesian Kernel Machine Regression (BKMR) to assess the joint effects of prenatal exposure to four metals (lead, mercury, cadmium, and manganese) on early childhood neurodevelopment in a birth cohort study conducted in Suriname. Neurodevelopment was measured using the Bayley Scales of Infant and Toddler Development, Third Edition (BSID-III). BKMR models were used to estimate overall mixture effects based on interquartile range (IQR) contrasts, evaluate univariate exposure–response functions, and explore potential interactions using bivariate exposure–response surfaces. The analytic dataset included 708 mother–child pairs. Results indicated a positive association between prenatal mercury exposure and gross motor scores, while estimated effects for other metals were small and not statistically significant. This project demonstrates the use of BKMR for evaluating nonlinear and joint effects of environmental mixtures on child neurodevelopment.

Ila Kanneboyina

Evaluating Prenatal Metal Exposures And Birth Length Using Negative Control Analysis And Mixture Modeling

Abstract: Prenatal exposure to environmental metals has been associated with adverse birth outcomes, yet causal inference in observational studies is complicated by correlated exposures and potential unmeasured confounding. This study examined associations between prenatal arsenic, manganese, and lead exposure and infant birth length in a cohort of pregnant women in Bangladesh. Multivariable linear regression models adjusting for demographic and maternal health factors were used to estimate associations. To assess potential residual confounding, we conducted a negative control exposure analysis using metal concentrations measured in children after birth, which should not causally influence fetal growth. Bayesian Kernel Machine Regression (BKMR) was also applied to evaluate mixture effects and potential nonlinear exposure–response relationships. Regression models indicated that higher prenatal manganese exposure was associated with shorter birth length. However, the negative control analysis revealed an association between post-pregnancy arsenic levels and birth length, suggesting possible residual confounding.

Miriam Lachs

Dietary nutrients associated with increased or reduced risk of amyotrophic lateral sclerosis

Abstract: Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease with limited treatment options. This study aimed to identify dietary risk factors to inform people with ALS (PALS) and individuals at high risk. Data from the ARREST and COSMOS studies were analyzed. Dietary intake was assessed using the Modified Block Food Frequency Questionnaire (FFQ). Nutrient and food intake were compared between healthy controls in ARREST and PALS in ARREST and COSMOS cohorts, adjusting for age, sex, body mass index (BMI), and total caloric intake using linear and logistic regression. Associations between nutrient intake and change in the revised ALS Functional Rating Scale (ALSFRS-R) from baseline to 6-month follow-up were examined with linear regression among PALS. In ARREST, higher intake of riboflavin, retinol, arginine, cystine and dairy was associated with greater ALS odds, while lutein/zeaxanthin and vegetable intake were linked to lower odds. Among PALS in ARREST and COSMOS, greater intake of retinol, vitamin A, arginine, methionine, lactose, total choline, and dairy was associated with faster ALSFRS-R decline.

Thomas Tang

Integrating High-Resolution SO₂ Exposure Data with Pediatric Cohort Data for Environmental Health Research

Abstract: This internship focused on processing and analyzing large-scale environmental exposure data for epidemiological research. Geocoded pediatric health data were linked with daily sulfur dioxide (SO₂) concentration data from the CHAP (China High Air Pollutants) dataset based on geographic coordinates. Using R packages such as sp, raster, and dplyr, raster-based SO₂ data from 2014–2019 were extracted to obtain daily exposure levels for each participant location. For infants younger than one year, average SO₂ exposure during the gestational period was estimated using conception dates derived from gestational age and birth dates. For older children, mean SO₂ exposure was calculated over retrospective windows of 3, 6, and 12 months prior to cognitive testing. The workflow included preprocessing large raster datasets, handling missing spatial coverage, and optimizing computational efficiency. The final dataset provides temporally aligned SO₂ exposure metrics for epidemiological analysis of environmental impacts on early childhood health outcomes, demonstrating a reproducible framework for integrating environmental data with individual-level health records.
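The windowing step described in this abstract can be illustrated with a minimal Python sketch. It is not the project's code (the abstract reports an R workflow): the function names are hypothetical, gestational age is assumed to be recorded in weeks, and the 3-, 6-, and 12-month windows are approximated as 91, 182, and 365 days. Days with missing spatial coverage are assumed to be stored as None and are skipped.

```python
from datetime import date, timedelta

# Assumed day counts for the 3-, 6-, and 12-month retrospective windows.
WINDOW_DAYS = {"3m": 91, "6m": 182, "12m": 365}


def mean_between(daily, start, end):
    """Mean of daily values with start <= day < end; None if nothing is observed."""
    values = [v for d, v in daily.items() if start <= d < end and v is not None]
    return sum(values) / len(values) if values else None


def window_mean(daily, end, days):
    """Mean exposure over the `days` days immediately before `end`."""
    return mean_between(daily, end - timedelta(days=days), end)


def conception_date(birth, gestational_age_weeks):
    """Estimated conception date from birth date and gestational age in weeks."""
    return birth - timedelta(weeks=gestational_age_weeks)


def gestational_mean(daily, birth, gestational_age_weeks):
    """Mean exposure from estimated conception up to (not including) birth."""
    start = conception_date(birth, gestational_age_weeks)
    return mean_between(daily, start, birth)
```

For an older child, `window_mean(daily, test_date, WINDOW_DAYS["6m"])` would give the mean over the assumed 182 days before testing, where `daily` maps each date to the SO₂ value extracted at that child's location.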

Epidemiology - HSC LL 209A

Malika Top

Associations between Adverse Childhood Experiences and Late-Life Brain Volumes

Abstract: Adverse childhood experiences (ACEs) have been linked to poorer health outcomes across adulthood, but whether they produce measurable structural brain differences in late life remains unclear. Neurodegenerative diseases can progress silently for years before clinical symptoms emerge, so early-life exposures may be important risk factors. Data from the Adult Changes in Thought (ACT) Study, a prospective cohort of adults aged 65 and older, are used to estimate associations between ACEs and MRI-derived volumetric outcomes. The analytic sample includes 1,342 participants with ACEs survey data, 438 of whom have linked MRI data. Selection into MRI scanning is nonrandom, requiring inverse probability weights to improve generalizability to the full ACEs-eligible cohort, with separate models fitted to predict the probability of scan receipt and scan type. Associations are estimated from weighted linear regression models, adjusting for sex, age at scan, race/ethnicity, and APOE4 genotype. Findings will further characterize the complex relationship between early adversity and later-life brain health in a cohort that has not been previously studied.

Wanlin Wu

Dietary Strategies Against Alzheimer’s Disease Risk Based On Machine Learning

Abstract: Alzheimer’s disease (AD) is a major cause of death among older adults, and diet may influence the risk of AD. However, the specific food groups related to AD mortality remain unclear. This study aims to identify dietary factors associated with AD mortality and translate them into dietary guidance. Data are obtained from the National Health and Nutrition Examination Survey 2011–2018 linked with National Death Index mortality records. The study population includes U.S. adults aged 40 years and older with available dietary recall and mortality data. Dietary intake from 24-hour recalls is aggregated into food groups based on the What We Eat in America classification system. Food-wide association analysis is conducted using survey-weighted Cox proportional hazards models to estimate associations between food group intake and AD mortality. False discovery rate correction is applied for multiple testing. Machine learning is then used to rank influential food groups. Based on these results, a diet score is constructed to classify foods as adequacy, restriction, or moderation components. This provides an interpretable approach for generating practical dietary suggestions.

Meitong Zhou

Language Characteristics and Cognitive Performance in a Multi-Ethnic Cohort: Evidence from MESA

Abstract: Language experience has been proposed as a predictor of later-life cognition, but prior studies may not have fully accounted for education and related social factors. Using data from the Multi-Ethnic Study of Atherosclerosis (MESA), we examined associations between language characteristics and cognitive performance, with CASI as the primary outcome and Digit Symbol Coding and Digit Span as secondary outcomes. Linear regression models were used to assess current cross-sectional associations, adjusting for age, sex, race/ethnicity, nativity, and education. Language characteristics were not significantly associated with CASI after adjustment. Some associations were observed for Digit Span outcomes, suggesting that language experience may relate to specific cognitive domains rather than global cognition. These associations weakened after adjustment for education, while models without education showed stronger associations, suggesting confounding by education. Because language variables had substantial missingness that varied by age and education, future analyses will use longitudinal models and inverse probability weighting to address potential bias.

Internship Projects - HSC LL 201

Jingtong He

SAS Statistical Software Development And Cross-Validation

Abstract: This practicum was conducted at the Peking University Institute of Information Technology, focusing on SAS statistical software development and cross-validation for survival analysis methods. The primary responsibility was to derive statistical formulas for closed-source SAS procedures, verify results manually, and explain findings to development engineers. Core work centered on survival analysis using the Cox proportional hazards model, including deriving partial likelihood functions, score and Hessian matrices, and researching hazard ratio estimation and Type 3 hypothesis testing. Newton-Raphson iteration and diagnostic methods were implemented in R to replicate key SAS procedures such as PROC PHREG, NLIN, and LIFETEST, with all results cross-validated between SAS and R. The practicum also involved researching IQ/OQ/PQ validation frameworks and customizing the SAS Clinical Standards Toolkit for CDISC SDTM 3.2 validation.

Yizhe Li

Operational Strategy and Industry Research in China’s Pharmaceutical Sector: Internship at China Resources Pharmaceutical Group

Abstract: During my internship in the Operations Management Department of China Resources Pharmaceutical Group, I participated in a range of strategic research, operational analysis, and public health-related initiatives that provided a comprehensive perspective on how large healthcare enterprises coordinate investment decisions, industry research, and social responsibility programs. As an important part of my work, I conducted industry research and situational analyses of pharmaceutical and healthcare companies in China. I organized and summarized industry reports and benchmarked pharmaceutical companies in China, analyzing their financial metrics, R&D investment, and overseas strategic advancement plans. Additionally, I participated in internal operation and investment projects, carrying out desk research and collating information from multiple departments. I also participated in the China Resources Rural Health Move Actions Program, a public welfare program that actively creates equal access to health care in rural areas of underdeveloped regions.

Ben Nguyen

Tracking Physical And Psychiatric Comorbidities In Opioid Use Disorder Patients

Abstract: Opioid use disorder (OUD), the chronic use of opioids that causes a person significant distress or impairment, is a major public health crisis and epidemic in the United States, affecting roughly 2.7 million Americans. Oftentimes, OUD is exacerbated by coexisting physical and psychiatric comorbidities. Using pharmacy and medical claims, my team and I wanted to gain insight into the most common physical and psychiatric comorbidities and their prevalence. We also looked at patients' adherence rates to medications for opioid use disorder (MOUD) and the types of medications patients were prescribed, with a specific emphasis on methadone treatments. We conducted diagnosis- and prescription-based analyses by segmenting patients by various demographic, clinical, and treatment characteristics of interest and observed patients' trends and patterns using SQL. Afterwards, we built a propensity model using Python and SQL to predict the probability of treatment for patients diagnosed with OUD for whom we do not see MOUD claims but who may still have activity not captured by open claims.

Chenxiang Yang

Applied Practice Experience in Clinical Laboratory Operations and Biostatistical Validation

Abstract: FORTHCOMING

Yuhao Zhang

PCoA And NMDS Analysis For Microbiome Beta Diversity

Abstract: During my internship at OE Biotech, I assisted in maintaining a bioinformatics platform for microbiome and multi-omics data analysis. My work focused on multivariate statistical modules, particularly Principal Coordinates Analysis (PCoA) and Non-metric Multidimensional Scaling (NMDS), which are widely used to assess beta diversity and visualize differences in microbial community composition across experimental groups. Using R-based analytical pipelines, I helped process abundance tables, distance matrices, and sample metadata. The platform supported distance-based analyses using Bray–Curtis, Jaccard, and Euclidean metrics and generated ordination plots to explore similarity patterns among samples. I also processed group comparison outputs based on the ANOSIM algorithm and ensured that statistical results were correctly integrated into visualization reports. In addition, I organized over 20 technical issues during platform operation and collaborated with developers to improve system performance, contributing to a 1.5% increase in 7-day user retention.

Longitudinal Data Analysis - HSC LL 202

Hanchuan Chen

Association Between Sleep Disorder and Cognitive Impairment

Abstract: Sleep disorders frequently co-occur with cognitive impairment, forming a complex cluster of chronic comorbidities that substantially affects patient health and quality of life. In clinical settings, disturbances in sleep are commonly observed among individuals with cognitive decline, suggesting potential shared mechanisms and interrelated disease pathways. However, the relationship between sleep disorders and cognitive impairment remains insufficiently understood. This project aims to investigate the association between sleep disorders and cognitive impairment and to clarify the interactions between sleep and cognitive function. Using real-world clinical data, we will analyze patterns of sleep disorders across individuals with varying cognitive and mental health profiles. Advanced analytical approaches will be applied to identify high-risk phenotypes of cognitive impairment associated with sleep disturbances. Multimodal data fusion methods will further integrate heterogeneous clinical information to capture complex relationships among these conditions. The findings may improve risk stratification and inform early detection and intervention strategies for cognitive decline.

Yuting Gu

The Longitudinal Changes in Psychiatric Symptoms in Alzheimer’s Disease Population

Abstract: Neurodegenerative disorders such as Alzheimer’s disease (AD) involve progressive structural and functional brain changes. Cognitive reserve refers to the brain’s ability to maintain function despite aging, damage, or disease by using alternative neural pathways or strategies. Functional brain network organization may reflect cognitive reserve and help explain variability in cognitive decline. This study examines whether functional brain network segregation moderates the relationship between gray matter integrity and cognitive performance in older adults. Using neuroimaging and cognitive data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and the Human Connectome Project in Aging (HCP-A), we evaluate associations between cognitive outcomes and the interaction between gray matter volume ratio and global network segregation. Outcomes include memory, executive function, language, visuospatial ability, and MMSE. Linear regression and mixed-effects models adjust for age, sex, and education. We hypothesize higher network segregation strengthens the association between brain structure and cognitive performance across normal aging, mild cognitive impairment, and AD populations.

Weiqi Liang

Separating Treatment Effects From Biological Aging In Lipid Trajectories: A Longitudinal Mixed-Effects Analysis Of The Framingham Heart Study Offspring Cohort

Abstract: Blood lipid biomarkers are commonly used indicators of cardiovascular risk, but their longitudinal trajectories may be influenced by medication use, which can confound aging-related changes. This study evaluates the association between lipid-lowering medication use and age-related trajectories of four lipid biomarkers (total cholesterol, low-density lipoprotein cholesterol, high-density lipoprotein cholesterol, and triglycerides) using repeated observations from the Framingham Heart Study offspring cohort. Analyses included 26,931 observations from 4,985 participants. Sex-stratified longitudinal mixed-effects models were fitted for each lipid biomarker with fixed effects for time, quadratic time, baseline age, medication use, and their interactions. Three alternative random-effects structures were compared, with the random quadratic slope model providing the best fit across biomarkers. Future analyses incorporating dose information or causal inference approaches may further clarify medication effects on lipid dynamics.

Xun Sun

Multilayer Sentiment Modeling with Large-Language-Model-Assisted Labels and Mixed-Effects Models

Abstract: Understanding public sentiment on large-scale social media platforms is increasingly critical for capturing collective psychological states, tracking societal responses to disruptive events, and informing digital health and policy interventions. Yet current sentiment analysis approaches are often limited to binary or coarse-grained classifications, overlooking the multi-dimensional and context-specific nature of online emotions. We present a multilayer sentiment modeling framework that couples large language model (LLM)-assisted labeling with mixed-effects models to quantify temporal and contextual patterns across three layers: core affect (valence, arousal, dominance), basic emotions (joy, anger, sadness, fear, disgust, surprise), and social-media signals (sarcasm, empathy, toxicity, stance). Using a large Bilibili corpus spanning diverse video categories and time periods, domain-adapted LLMs generate labels that are analyzed with mixed-effects models and hypothesis tests. We found that the GPT-based models achieved the strongest predictive performance, and that statistical modeling provides interpretable insights into emotion trends across video categories and pandemic-related periods.

Machine Learning and Dimension Reduction - HSC LL 210

Yining Cao

Scalar-on-function Regression with Measurement Error in the Functional Regressors

Abstract: Functional data are increasingly used in biomedical and public health research, where predictors are observed as trajectories over time rather than as single measurements. In practice, these functional covariates are often contaminated by measurement error, which can bias estimation in scalar-on-function regression. We use the Simulation–Extrapolation (SIMEX) method to reduce bias from noisy functional predictors. Using minute-level physical activity curves from the NHANES accelerometer dataset, I construct a plasmode simulation framework in which outcomes are generated from a known coefficient function. This allows evaluation of estimation accuracy while preserving realistic variation in the predictors. I fit spline-based functional regression models and compare results with and without SIMEX correction under different levels of measurement error. Performance is evaluated using integrated squared error and prediction error. The results show that measurement error can substantially distort the estimated functional effect, while SIMEX improves recovery of the coefficient function and reduces bias, which illustrates the value of SIMEX for analyzing noisy functional data.

Shike Zhang

A Multimodal Fusion Algorithm Enhanced with a Cross-Modal Self-attention Mechanism for Detecting Coronary Heart Disease in Patients

Abstract: Heart disease remains a leading cause of morbidity and mortality worldwide, necessitating accurate and early diagnostic tools. This study proposes a novel multimodal fusion approach that integrates structured clinical numerical indicators with unstructured patient self-reported text. Leveraging DistilBERT for deep semantic encoding and a Transformer-based cross-modal self-attention mechanism, the model effectively captures complex interactions between heterogeneous data modalities. Experiments conducted on a publicly available cardiac dataset demonstrate that the proposed model outperforms traditional machine learning methods such as Logistic Regression, Random Forest, and XGBoost, as well as multimodal baselines using TF-IDF and BERT without attention. The best-performing model, combining DistilBERT embeddings with cross-modal attention, achieves an accuracy of 98.3% and a macro-averaged F1-score of 0.981. These results underscore the potential of attention-enhanced multimodal deep learning frameworks in advancing intelligent cardiac disease diagnosis and support further exploration into multimodal medical data fusion for clinical applications.

Survival Analysis - HSC LL 109B

Stella Koo

How Does Distance Of Covariance (DISCO) Compare To Homeostatic Dysregulation, Klemera-Doubal Biological Age, And Phenotypic Age In Its Association With Health Outcomes Across Diverse Populations?

Abstract: Several biological age metrics may capture distinct dimensions of physiological aging. Distance of Covariance (DISCO), a novel measure, quantifies how much an individual’s biomarker covariance structure differs from a reference population. We evaluated DISCO against three established metrics, namely Homeostatic Dysregulation (DM), Klemera–Doubal Biological Age (KDM BA), and Phenotypic Age (PhenoAge), across four cohorts: HRS, InCHIANTI, SLAS, and UKB, using cohort-specific and common biomarker panels. DISCO correlated strongly with DM (0.67–0.85) but weakly with chronological age, KDM BA, and PhenoAge, while the latter metrics were strongly intercorrelated. In random-effects meta-analysis, higher DISCO was associated with greater risk for all six health outcomes. Associations were moderate for cardiovascular disease, hypertension, and lung disease, smaller for cancer (OR 1.09–1.10), and largest for diabetes (OR 1.77–2.50). Mortality hazard ratios were positive (HR 1.41–1.56). Wide prediction intervals and high I² values indicated substantial between-cohort heterogeneity. Overall, DISCO captured a dysregulation-related signal aligned with DM but did not consistently outperform established metrics.

Chufeng Li

Prognostic Assessment of IPSS-M and IPSS-R in Myelodysplastic Syndrome Patients Undergoing Allogeneic Transplantation

Abstract: Accurate risk stratification is essential for guiding treatment decisions in patients with myelodysplastic syndromes (MDS), particularly those undergoing allogeneic hematopoietic stem cell transplantation (allo-HSCT). The Molecular International Prognostic Scoring System (IPSS-M), which incorporates genomic mutations, may improve prognostic prediction compared with the traditional Revised International Prognostic Scoring System (IPSS-R). This practicum project retrospectively analyzed two transplant cohorts to evaluate the prognostic performance of IPSS-M and IPSS-R. Survival outcomes were examined using survival analysis and competing-risk models. Additional prognostic factors were assessed using Cox proportional hazards models, and model discrimination was evaluated with Harrell’s C-index. Results suggest that IPSS-M provides improved prognostic stratification for overall survival in transplant patients, while IPSS-R remains informative for certain competing outcomes. Conditioning intensity and patient comorbidity were also important predictors of transplant outcomes.

Dang Lin

Determinants and Functional Consequences of Multimorbidity Among Middle-Aged and Older Adults in China: Evidence from the China Health and Retirement Longitudinal Study (CHARLS)

Abstract: Population aging in China has been accompanied by a rising prevalence of multimorbidity, defined as the coexistence of two or more chronic conditions. Multimorbidity is associated with disability, increased healthcare utilization, and reduced quality of life among older adults. This practicum project examines the determinants and functional consequences of multimorbidity among adults aged 45 years and older using data from the China Health and Retirement Longitudinal Study (CHARLS). The first objective is to identify socioeconomic, behavioral, and psychosocial factors associated with multimorbidity using multivariable logistic regression models. The second objective evaluates the longitudinal relationship between multimorbidity and functional decline measured by activities of daily living (ADL) and instrumental activities of daily living (IADL) using mixed-effects models. Cox proportional hazards models will also assess the association between multimorbidity burden and mortality risk.

Yanhao Shen

Limitations of Applying the Hematopoietic Cell Transplantation Comorbidity Index in Pediatric Patients Receiving Allogeneic Hematopoietic Cell Transplantation

Abstract: Pretransplant risk stratification is critical for counseling families and optimizing allogeneic hematopoietic cell transplantation (alloHCT) care. The hematopoietic cell transplantation–comorbidity index (HCT-CI) was developed in adults, and its applicability in pediatric recipients remains uncertain. I performed a single-center retrospective cohort study of 188 children, adolescents, and young adults (<22 years) who underwent alloHCT between January 2008 and October 2016. Pretransplant comorbidities were abstracted and scored using standard HCT-CI definitions and weights, and patients were categorized as HCT-CI 0, 1–2, or ≥3. The primary endpoint was overall survival (OS), defined as death from any cause. The secondary endpoint was nonrelapse mortality (NRM), defined as death without relapse of the underlying disease. In this pediatric alloHCT cohort, HCT-CI did not discriminate OS, and several index components were challenging to operationalize in children. Pediatric-specific refinement of comorbidity definitions may be necessary to improve pretransplant risk assessment.

Break (2:00pm - 2:30pm)

Session 2 (2:30pm - 3:30pm)

Cancer Research - HSC LL 208A

Lahari Koganti

Tumor-Only RNA-Seq Framework for Differential Gene Expression Analysis Without Matched Normal Controls

Abstract: In oncology research, RNA-sequencing (RNA-seq) studies typically compare tumor samples with matched normal control samples to identify differentially expressed genes. However, in clinical practice, matched normal tissue is often unavailable due to logistical or cost-related constraints, limiting use of conventional statistical approaches for differential expression analysis. We developed a tumor-only framework for identifying differentially expressed genes from RNA-seq data without matched normal controls. The workflow incorporates preprocessing, quality control, and normalization steps. Differential expression is inferred using a percentile-ranking strategy, where each gene in a new tumor case is evaluated against a tumor-only reference database using the 10th and 90th percentile thresholds, along with the gene's percentile rank relative to the database distribution. The method is implemented as a reproducible computational pipeline using Bash, R and Python for use in precision genomic laboratories. Preliminary evaluation shows the approach can identify clinically actionable genes, enabling tumor-only RNA-seq analysis in precision oncology.
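A percentile-ranking call of the kind this abstract describes might look like the following Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the "over"/"under"/"within" labels, and the half-weight handling of ties are all hypothetical choices, and the reference is assumed to be a plain list of normalized expression values for one gene across the tumor-only database.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(reference, value):
    """Percent of reference values below `value`, counting ties as half."""
    ordered = sorted(reference)
    below = bisect_left(ordered, value)
    ties = bisect_right(ordered, value) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


def classify_gene(reference, value, low=10.0, high=90.0):
    """Label one gene in a new tumor case against the reference distribution.

    Returns (label, rank): "under" below the 10th percentile, "over" above
    the 90th percentile, and "within" otherwise.
    """
    rank = percentile_rank(reference, value)
    if rank < low:
        return "under", rank
    if rank > high:
        return "over", rank
    return "within", rank
```

Returning the rank alongside the label keeps the continuous information available for downstream review rather than reducing each gene to a three-way call.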

Zihan Lin

Multi-Omics Analysis Reveals Histone Modification–Associated Tumor Microenvironment Regulation in Intrahepatic Cholangiocarcinoma

Abstract: Intrahepatic cholangiocarcinoma (ICC) is an aggressive liver malignancy with increasing incidence and poor prognosis. The tumor microenvironment strongly influences tumor progression, but the epigenetic mechanisms regulating this process are not completely understood. This study performed a multi-omics analysis to explore histone modification–associated regulatory patterns in ICC. Publicly available single-cell RNA sequencing data (GSE138709) and proteomics data (PDC000356) were analyzed using a computational pipeline in R. Single-cell analysis revealed substantial cellular heterogeneity in the ICC microenvironment and identified epithelial cells as important regulators of immune interactions. Proteomic and survival analyses highlighted KPNA4 as a key prognostic factor. Results of integrative epigenomic analysis further suggested that enhancer-associated histone modifications may regulate KPNA4 expression. These findings provide new insights into the epigenetic regulation of the ICC tumor microenvironment and suggest potential targets for future therapeutic investigation.

Linshan Xie

Dietary Fiber and Grain Intake and Risk of Colorectal Cancer: A Pooled Analysis of Prospective Cohort Studies

Abstract: Colorectal cancer (CRC) is the third most common cancer and the second leading cause of cancer-related deaths worldwide. Dietary fiber and grain intake have been suggested to be protective against colorectal cancer; however, evidence regarding specific fiber sources, grain types and colorectal cancer subtypes remains limited. This study investigates the associations of total and source-specific dietary fiber as well as whole and refined grain intake with colorectal cancer risk. A pooled analysis is conducted using data from 2,241,549 participants and 33,257 incident colorectal cancer cases across 27 cohort studies in the Pooling Project of Prospective Studies of Diet and Cancer (DCPP). Cox proportional hazards models are used to estimate study-specific hazard ratios (HRs) and 95% confidence intervals (CIs), which are then combined using a random-effects meta-analysis, adjusting for potential confounders. In age-adjusted models, higher fiber and grain intake were associated with a lower colorectal cancer risk; however, the associations were no longer statistically significant after multivariable adjustment. Interaction analyses were also conducted to evaluate potential effect modification.

Data Analysis - HSC LL 204

Hanrui Li

Statistical Genetics Analysis Of Lifetime Cannabis Use And CD4+ T Cell Dynamics In PC-Stabilized Epigenetic Aging

Abstract: Epigenetic clocks capture the divergence between chronological and biological age. This study investigates the impact of lifetime cannabis use and immune integrity on biological age acceleration, accounting for the influence of leukocyte proportions on the DNA methylation (DNAm) signal. DNAm data (450k) was preprocessed using Noob normalization for scale consistency. Biological age was estimated via the PC-stabilized PhenoAge algorithm. To control for confounding, population stratification and batch effects were addressed by adjusting for physical slide/plate IDs and implementing cellular stratification (Houseman 2012) for CD4T, CD8T, NK, Bcell, and Mono proportions. The adjusted model revealed that cannabis use significantly accelerates biological aging by ~3.27 years (P < 0.05). CD4+ T cell proportions emerged as the strongest predictor of age deceleration (P < 0.001). Non-significant slide/plate factors (P > 0.05) confirmed that Noob preprocessing effectively neutralized experimental noise. Lifestyle behaviors are biologically embedded through immune remodeling. CD4+ T cell dynamics serve as critical indicators of aging health, independent of ancestral or technical variation.

Gokul Pareek

Data-Driven Optimization of Community-Based Cancer Screening Reporting Using Scalable Analytics

Abstract: Community-based cancer screening programs are essential for reducing disparities in early detection among immigrant populations, yet operational data are often not structured for systematic reporting and evaluation. This practicum project developed scalable data-driven reporting tools for the Immigrant Health and Cancer Disparities Service at Memorial Sloan Kettering Cancer Center. Using patient-level REDCap data, reproducible pipelines were built in SQL and R to generate analytic datasets. Descriptive and stratified analyses quantified screening participation and demographic variation. These analyses were operationalized through integrated Tableau dashboards supporting real-time cancer screening reporting and program monitoring. Automated REDCap API integration and weekly refresh cycles enabled continuous synchronization between operational data and dashboard outputs. As a secondary component, standardized CONSORT diagrams modeled patient flow and identified attrition points across screening pathways. This work demonstrates the application of biostatistical methods and scalable analytic infrastructure to strengthen cancer screening reporting in community programs.

Congyu Yang

Exploratory Analysis Of Image-Derived Features In Automated iPSC Culture To Identify Indicators Of Colony Expansion

Abstract: Induced pluripotent stem cells (iPSCs) are an important component of emerging personalized regenerative therapies. In automated cell manufacturing workflows, recently reprogrammed iPSCs must reach a stable state before being transferred for further expansion, yet identifying early indicators of successful colony expansion remains challenging. This practicum project analyzes image-derived features from automated iPSC culture experiments conducted at Cellino Biotech. Images collected during laboratory cell culture experiments were processed to extract features such as cluster counts, colony morphology, and confluence measurements. Using SQL-based data extraction pipelines and Python-based analysis tools, these features were cleaned and organized for analysis across experiments, donors, and timepoints. Exploratory analyses identified trends in cluster survival, phases in colony growth dynamics, and potential relationships between seeding density and reprogramming efficiency. These findings demonstrate how systematic analysis of image-derived features can support deeper investigation of automated iPSC culture systems.

Yixin Zheng

Functional Connectivity and Apathy in Alzheimer's Disease: Mediation and Interaction Effects of Amyloid Burden

Abstract: Apathy is a common and functionally disruptive neuropsychiatric symptom in Alzheimer's disease, but its neural link to amyloid pathology remains unclear. Using resting-state fMRI from the Alzheimer's Disease Neuroimaging Initiative (ADNI; n = 745 participants, 1,513 longitudinal scans), this study tested three hypotheses: (H1) within-network functional connectivity (FC) of the Default Mode (DMN), Frontoparietal (FPN), and Salience (SN) networks mediates the amyloid–apathy association; (H2) amyloid burden modifies the FC–apathy relationship; (H3) these pathways extend to other neuropsychiatric symptom (NPS) domains. Apathy was analyzed as binary (NPI-G) and continuous (NPIGTOT_z) outcomes. Baseline effects were estimated with linear and logistic regression; longitudinal effects used linear and generalized linear mixed-effects models with participant-level random intercepts and the bobyqa optimizer. Mediation was estimated by quasi-Bayesian simulation with 5,000 draws (no bootstrap) at baseline and via posterior simulation for mixed-effects models, with FDR correction within each model family. Models adjusted for age, sex, education, diagnosis, sedative/hypnotic use, total NPI burden, and hypertension. H1 was null across all three networks (ACME ≈ 0). H2 revealed a significant DMN × Amyloid interaction on continuous apathy severity (FDR p ≈ 0.044); an SN × Amyloid effect on binary apathy is interpreted cautiously due to convergence warnings. H3 was null across 33 FC-by-NPS combinations. Findings suggest amyloid may modify — rather than be mediated by — DMN connectivity in shaping apathy severity.

Epidemiology - HSC LL 205

Maya Krishnamoorthy

Sandwich Generation Caregiving and Cognitive Function in Midlife: Evidence from the National Longitudinal Study of Adolescent to Adult Health

Abstract: Midlife adults increasingly care for both aging parents and children, yet little is known about dual (“sandwiched”) caregivers and their cognition. Using Wave VI of the National Longitudinal Study of Adolescent to Adult Health, we examined associations between caregiving duties and cognition among adults aged 39–51. Analyses were conducted for two samples: an overall sample using TestMyBrain tests to measure cognition (n=9,777), and a subsample who completed additional tests from which a general cognition score was derived (n=2,524). Survey-weighted linear regression models adjusted for age, sex, race/ethnicity, and education. Additional models for mediation of health behaviors and depression were run. Across samples, caregiving was associated with better cognitive performance relative to non-caregiving, and sandwiched caregivers showed the strongest advantages. These findings support the Healthy Caregiver Model, suggesting that intergenerational caregiving in midlife may confer cognitive benefits. Further research is needed to assess whether these patterns persist with aging.

Sean Sorek

Resistance Calibration for Landscape Connectivity: Using Bayesian Optimization and Point Processes to Build Data-Driven Functional Connectivity Surfaces

Abstract: Functional connectivity models are increasingly being used to predict wildlife distributions and vector-borne disease risk, yet resistance surfaces are typically assigned either by expert opinion with no quantifiable uncertainty or by selection functions that do not truly optimize connectivity. We present a framework that calibrates resistance surfaces by optimizing downstream connectivity performance against species occurrence data. A log-linear resistance surface feeds Omniscape to produce cumulative current flow, which is used as a feature in a negative binomial GAM. Because Omniscape does not calculate gradients, resistance parameters are tuned via Bayesian optimization using Thompson sampling. Applied to 156,495 GPS readings from 57 white-tailed deer on Staten Island, New York, the framework outperforms a literature baseline (holdout AUC = 0.922 vs. 0.859). Optimized resistance parameters are consistent with ecological expectations for white-tailed deer, while the approach generalizes to any species with occurrence data and environmental rasters.

Yi Su

Physical Activity, Sedentary Behavior, And Risk Of Insulin Resistance And Incident Diabetes In Chinese Adults: A Longitudinal Cohort Study

Abstract: This practicum investigated the relationship between moderate-to-vigorous physical activity (MVPA), sedentary behavior, and the risk of insulin resistance and incident diabetes using longitudinal data from the China Health and Nutrition Survey. The analysis included over 10,000 participants followed across multiple survey waves. Cox proportional hazards models were applied to estimate the associations between activity levels and disease incidence, and mediation analysis was conducted to evaluate the roles of body mass index and insulin resistance. Results indicated that higher levels of MVPA were associated with a lower risk of insulin resistance and diabetes, whereas greater sedentary behavior increased insulin resistance risk but showed no significant association with diabetes incidence. The findings suggest that increasing physical activity and reducing sedentary time may play an important role in preventing metabolic disorders.

Functional Data Analysis - HSC LL 208B

Fengwei Lei

Reliability of Graphical Metrics from Resting-State fMRI in Aging

Abstract: Graph-based summaries of brain functional connectivity (FC), including centrality- and modularity-related measures, are increasingly used to characterize brain network organization and individual differences. Yet, the reliability of these ROI-level network properties remains insufficiently characterized, limiting their interpretability and potential use as biomarkers. Using resting-state fMRI data from the Aging Adult Brain Connectome (AABC), this study evaluates the test–retest reliability of node-level network metrics derived from FC matrices, including node strength, clustering coefficient, eigenvector centrality, and betweenness centrality. Reliability at the ROI level is quantified using intraclass correlation coefficients (ICC). To complement ICC-based estimates, the image intraclass correlation coefficient (I2C2) is also computed on whole-brain node-metric maps to examine whether reliability conclusions differ when these measures are treated as multivariate images. This analysis identifies network properties with higher reproducibility across regions and examines whether these patterns remain consistent across follow-up sessions.

Tai Yue

Reliability Of Functional Connectivity Measures

Abstract: Assessing the reliability of functional connectivity (FC) is essential for reproducible neuroimaging research. This project reviews commonly used reliability measures, including the Intraclass Correlation Coefficient (ICC), variance components models, Generalizability Theory, and discriminability, comparing them in terms of interpretability, statistical rigor, and suitability for high-dimensional data. To illustrate these methods, a proof-of-concept analysis is conducted using simulated Harvard–Oxford (HO)-based ROI time series under a pseudo test–retest design. Functional connectivity matrices are constructed, and edge-wise ICC values are computed. Results show consistently high reliability (mean ICC ≈ 0.96) under controlled conditions, demonstrating the analysis pipeline. Although based on simulated data, this work provides a clear framework for evaluating FC reliability and highlights the importance of method selection. Future work should apply these approaches to real test–retest datasets.

Kangyu Xu

Estimating Biological Aging from DNA Methylation Using PC-Based Epigenetic Clocks

Abstract: DNA methylation (DNAm) has emerged as an important biomarker of biological aging, and epigenetic clocks have been developed to estimate biological age from genome-wide methylation profiles. This project aims to compute multiple DNAm-based epigenetic clocks and derive age acceleration metrics that quantify deviations between biological age and chronological age. DNAm array data will undergo standard quality control and preprocessing procedures, including probe filtering, sample filtering, normalization, and batch correction. Principal component (PC)–based versions of major epigenetic clocks (Horvath, Hannum, PhenoAge, GrimAge, and DunedinPACE) will be computed using the MethylCIPHER package. The analysis pipeline will generate DNAm age estimates and standardized age acceleration (AgeAccel) scores derived from regression residuals of DNAm age on chronological age. These AgeAccel metrics provide quantitative measures of biological aging across different epigenetic clock models. This project establishes a reproducible analytical framework for calculating DNAm-based aging biomarkers and contributes to the study of biological aging processes using epigenetic clock measures.
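The residual-based AgeAccel calculation described above can be illustrated with a minimal sketch. It assumes a simple one-predictor least-squares fit of DNAm age on chronological age; the function name `age_acceleration` and the toy ages are hypothetical and are not taken from the project or from any epigenetic clock package.

```python
from statistics import mean, stdev


def age_acceleration(dnam_age, chron_age):
    """Return (raw, standardized) residuals of DNAm age regressed on chronological age.

    Hypothetical helper for illustration only: ordinary least squares with a
    single predictor, residuals divided by their sample standard deviation.
    """
    x_bar, y_bar = mean(chron_age), mean(dnam_age)
    sxx = sum((x - x_bar) ** 2 for x in chron_age)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(chron_age, dnam_age))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Positive residual: DNAm age is higher than expected for chronological age.
    raw = [y - (intercept + slope * x) for x, y in zip(chron_age, dnam_age)]
    sd = stdev(raw)
    # OLS residuals already have mean zero, so dividing by sd standardizes them.
    return raw, [r / sd for r in raw]


# Toy values (made up) purely to show the call pattern.
chron = [40.0, 50.0, 60.0, 70.0]
dnam = [42.0, 49.0, 63.0, 70.0]
raw_accel, std_accel = age_acceleration(dnam, chron)
```

By construction the residuals are uncorrelated with chronological age, which is the usual reason for using them instead of the simple difference between DNAm age and chronological age.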

Liqi Zhou

Reliability of Functional Connectivity in the Aging Population

Abstract: Resting-state fMRI functional connectivity (FC) is widely used to characterize brain-behavior relationships, yet the reliability of FC estimates in aging populations remains poorly understood. We evaluated edge-wise test-retest reliability of FC using resting-state fMRI from the AABC/HCP Lifespan Aging dataset (N = 1390; age range 36-90 years; 793 females/597 males). Region-based time series were extracted from 379 brain regions. For each run, FC matrices were computed using Pearson correlation, yielding 71,631 unique edges. Reliability was assessed using ICC(3,1) and generalized ICC (GICC). We aimed to characterize the distribution of reliability across edges, examine agreement between ICC and GICC, and evaluate whether reliability varies by within- versus between-network edges, scan duration, and motion measures. This study is intended to provide a systematic assessment of edge-wise FC reliability in an aging sample and to inform the identification of reproducible connectivity features for future neuroimaging research.
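For readers unfamiliar with ICC(3,1), the following minimal sketch computes it for a single edge from a subjects-by-sessions table using the standard two-way mean-squares form, (MSR - MSE) / (MSR + (k - 1) MSE). The function names and toy values are hypothetical and do not reproduce the study's pipeline; the edge count simply checks the 379-region figure quoted above.

```python
def icc_3_1(scores):
    """ICC(3,1): two-way mixed effects, consistency, single measurement.

    scores: list of rows, one row per subject, one value per session,
    e.g. the FC value of one edge measured in each run.
    """
    n = len(scores)        # number of subjects
    k = len(scores[0])     # number of sessions
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)                # between-subject mean square
    ms_err = ss_err / ((n - 1) * (k - 1))      # residual mean square
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)


def n_unique_edges(n_regions):
    """Number of unique off-diagonal entries in a symmetric FC matrix."""
    return n_regions * (n_regions - 1) // 2


# Toy edge (made-up values): session 2 is session 1 shifted by a constant,
# so subject ordering is perfectly consistent across sessions.
toy_edge = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
toy_icc = icc_3_1(toy_edge)
edges_379 = n_unique_edges(379)
```

Because ICC(3,1) is a consistency measure, a constant offset between sessions does not lower it; an absolute-agreement ICC would penalize that offset.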

Longitudinal Data Analysis in Mental Health - HSC LL 203

Jeong Yun Choi

Burnout Pilot Study

Abstract: The study tracked burnout among PGY1 and PGY2 psychiatry residents using quarterly MBI-HSS scores to examine whether burnout varies by PGY level, whether patterns hold longitudinally across cohorts, and whether PA is inversely associated with OE and DP. Mean subscale scores were stratified by PGY level, quarter, and academic year; bar plots and linear regression assessed within/between-year differences and PA's association with OE and DP. PGY1s showed higher OE and DP than PGY2s, particularly mid-year, while PA remained low in both groups. PGY1-Year 1 had more severe OE (p=0.0036) and DP (p=0.026) in Q2 than PGY1-Year 2, indicating cohort effects. Longitudinally, OE and DP peaked in late PGY1 before declining in PGY2, and linear regression confirmed a significant inverse relationship between PA and both OE (β=−0.80, p<0.001) and DP (β=−0.30, p<0.001). These findings suggest early psychiatry residency is a high-risk period shaped by PGY level and cohort-specific experiences; PA's consistent inverse association with burnout underscores its potential as a protective factor, and its stagnation despite improvements in OE and DP highlights the need for targeted interventions promoting PA.

Maggie Hsu

Identifying And Analyzing Longitudinal Relationships in Depressive and Suicidal Behavior Among Veterans

Abstract: This project investigates whether clusters of related behaviors exist among veterans with depressive symptoms and suicidal ideation, and how these behavioral trajectories change over time. Clinical mental health measures collected at the James J. Peters VA Medical Center at baseline (1 month), 3 months, and 6 months were analyzed using a latent profile model to identify groups of related observations, and a severity-based model was selected. Using mixed-effects models, the high- and low-severity classes were then compared longitudinally on depressive, suicidal, and sleep inventory responses. To test for significant differences in baseline indices by class, a chi-squared test was conducted on the baseline data between the classes. The models showed that membership in the higher-severity class was significantly associated with higher depression, suicidal ideation, and sleep disturbance ratings. The change in depression severity, ideation, and sleep disturbance ratings across timepoints was not significantly different between the two groups, indicating that the group differences observed at baseline may be stable over at least the medium term.

Zhengkun Ou

Comorbid Pattern Analysis of Cognitive, Functional Impairment and Neuropsychiatric Symptoms Among Alzheimer's Disease: Multiple Cross-Sectional and Longitudinal Cohorts

Abstract: Alzheimer's disease involves heterogeneous co-occurrence of cognitive decline, functional impairment, and neuropsychiatric symptoms, yet these domains are typically studied in isolation. Objective: To identify latent comorbidity profiles capturing the joint patterning of cognitive, functional, and neuropsychiatric impairment in AD and characterize transitions over time. Methods: Longitudinal data from ADNI and OASIS-3 were analyzed. Fourteen binary indicators (cognitive impairment, functional impairment, and 12 NPI-Q domains) were assessed at baseline, 12, 24, and 36 months. Latent Transition Analysis estimated latent states, item-response probabilities, and transition dynamics, with model selection via BIC. Results: A four-state model was selected, ranging from largely asymptomatic to high-burden across all domains. Intermediate states were characterized by affective/behavioral NPS or cognitive-functional impairment. Most individuals remained stable, though progressive transitions toward higher-burden states were observed. Conclusions: Four comorbidity profiles capture multidimensional AD heterogeneity beyond single-domain staging, informing targeted interventions and prognostic precision.

Jingxi Wang

Longitudinal Trajectories Of CBCL Symptoms In The ABCD Study: A Latent Transition Analysis

Abstract: This project examines longitudinal trajectories of child behavioral symptoms in the Adolescent Brain Cognitive Development (ABCD) Study using Child Behavior Checklist (CBCL) data. Eight CBCL syndrome scales were used to represent multiple domains of emotional and behavioral symptoms. Indicators were dichotomized using the clinical threshold of T ≥ 65 to identify elevated symptoms. Latent transition analysis was applied to characterize symptom profiles and examine transitions across six study waves from baseline through the five-year follow-up. Three latent classes were identified representing low, moderate, and high symptom groups. Results indicate that symptom profiles are largely stable across childhood, particularly for individuals in the low symptom class, while a subset of participants transition between adjacent symptom levels over time. Trajectory classification further showed that most children follow stable symptom patterns, with smaller proportions showing improving, worsening, or fluctuating trajectories. These findings provide insight into developmental patterns of behavioral symptoms and may help identify children at risk for persistent behavioral difficulties.

Machine Learning - HSC LL 207

Xuanyu Guo

Standardizing Clinical Trial Eligibility Criteria from Protocol with NLP and Large Language Models

Abstract: This practicum project develops and evaluates an NLP pipeline to standardize free-text eligibility criteria from clinical trial protocol PDFs into the structured style used on ClinicalTrials.gov. Using a set of protocol–registry pairs, Gemini is applied to extract inclusion and exclusion sections from PDFs, which are then aligned with the registry's formatted criteria to build supervised training and evaluation data. A fine-tuned model trained on trials is compared against zero-shot and few-shot prompting baselines using multiple metrics, including structural completeness, coverage, hallucination rate, and numeric and negation consistency. Across the evaluated cases, the fine-tuned model achieves the best quantitative scores, while the few-shot approach also performs competitively and provides a practical, training-free option for normalizing eligibility criteria in support of ClinicalTrials.gov data entry.

Jiyoon Lee

Identification of Colorectal Cancer Risk from Longitudinal Electronic Health Records Using Machine Learning

Abstract: Early identification of individuals at elevated risk for colorectal cancer (CRC) may enable targeted screening and earlier detection. We developed a machine learning approach using longitudinal electronic health record (EHR) data to identify individuals at increased CRC risk prior to diagnosis. Using the Columbia University Irving Medical Center (CUIMC) EHR dataset, we constructed patient-level representations from multiple clinical domains, including conditions, procedures, medications, measurements, and observations. Clinical events occurring before a defined prediction date were aggregated to generate non-temporal feature representations for modeling. Models were evaluated in annual cohorts to simulate real-world deployment and compared with an age-based screening strategy selecting the same proportion of individuals as those aged ≥45. Across multiple years, the model consistently identified cases occurring before typical screening thresholds. To assess generalizability, the analysis was replicated using the All of Us Research Program dataset. These results suggest EHR-based risk models may help identify high-risk individuals earlier than standard age-based screening.

Tong Su

Patient-Journey-Based Temporal Feature Engineering and Metastasis Prediction Using the MIMIC-III Database

Abstract: Early identification of patients at risk for metastatic progression is important for improving cancer management and clinical decision-making. I developed a patient-journey–based prediction framework using the MIMIC-III clinical database. A retrospective cohort of 334 patients was constructed using eligibility criteria and temporal filters to define index and snapshot dates with adequate look-back and follow-up periods. Metastasis within 180 days after the snapshot date was defined as the primary outcome using ICD-9 secondary malignant neoplasm codes. Predictors were derived from demographics, diagnoses, procedures, medications, and laboratory testing patterns using temporal roll-up and raw feature engineering. A total of 225 candidate predictors were evaluated. An XGBoost model achieved a test ROC-AUC of 0.83. The top 30 predictors selected based on mean absolute SHAP values achieved similar performance (ROC-AUC 0.81). Key predictors included time since primary diagnosis, laboratory testing patterns, treatment exposure, and patient age. This framework provides a reproducible approach for metastasis prediction using temporally aligned electronic health record data.

Jie Zhu

Modeling Hospital-Level Treatment Utilization Using Real-World Healthcare Data

Abstract: This practicum project investigates how statistical and machine learning approaches can be applied to real-world healthcare data to model and predict hospital-level treatment utilization and potential demand constraints. Using de-identified longitudinal datasets that integrate hospital utilization, market indicators, and institutional information, the project aims to forecast short-term treatment demand and identify hospitals at risk of utilization restrictions. Linear regression models are used to characterize baseline relationships between historical utilization trends and institutional characteristics, while gradient boosting algorithms including XGBoost and LightGBM are implemented to improve prediction accuracy for hospital-level treatment volume over a three-month horizon. Similarity-based analyses are also conducted to identify hospitals with comparable utilization patterns and estimate potential demand ceilings across institutions. The project demonstrates how predictive modeling applied to real-world healthcare data can generate insights into institutional variation in treatment utilization and support data-driven healthcare decision making.

Machine Learning and Dimension Reduction - HSC LL 109A

Yulin Liu

Probabilistic Unveiling of Tumor Heterogeneity via GMM of scRNA-seq Data

Abstract: scRNA-seq provides critical insights into breast cancer microenvironments, but deciphering functional subtypes is computationally hindered by high-dimensional, zero-inflated data. We present a probabilistic framework to map latent cellular substructures within a sparse transcriptomic dataset (716 cells, 558 genes). To isolate biological variance, Principal Component Analysis (PCA) is applied for optimal dimensionality reduction. Using this latent space, a custom Expectation-Maximization (EM) algorithm is derived from scratch to fit a Multivariate Gaussian Mixture Model (GMM), yielding continuous subtype assignment probabilities. Current efforts quantify optimal distinct subpopulations and identify cluster-specific driver genes. Finally, we evaluate Gaussian assumptions on zero-inflated counts, benchmarking the PCA-GMM pipeline against non-parametric density-based clustering to rigorously validate methodological robustness.

Cheng Rao

Data Analysis And Machine Learning For Evaluating Healthcare Programmatic Advertising Performance

Abstract: Programmatic advertising has become an important strategy for healthcare and pharmaceutical companies to deliver digital marketing campaigns to targeted audiences. Evaluating advertising performance and identifying key factors influencing campaign effectiveness are essential for optimizing advertising strategies and improving operational efficiency. This practicum project analyzes programmatic healthcare advertising data collected from May 31 to August 31, 2025, consisting of 83,751 observations and performance indicators including ad format, country, impressions, clicks, fill rate, win rate, click rate, and advertiser revenue. Using the R programming language, exploratory data analysis and visualization are conducted to examine performance patterns across markets and ad formats. Statistical regression models are applied to assess relationships between operational metrics and campaign outcomes, and machine learning methods are explored to predict advertising performance and identify key drivers of effectiveness. The findings provide data-driven insights for optimizing healthcare digital advertising strategies and improving operational efficiency.

Yi Xu

Assessing Physiological Stability Using Internal And External Mahalanobis Distance Frameworks

Abstract: This project assesses multivariate physiological stability under different break interventions using Mahalanobis Distance (MD) as a composite biomarker metric. In a repeated-measures study, twelve participants completed three sitting-break conditions (B0, B30, B60), with biomarkers collected at nine timepoints per day. MD was calculated using both an internal cohort-based reference and an external NHANES reference to quantify multivariate physiological deviation. Mixed-effects models were applied to evaluate treatment effects while accounting for within-subject dependence. The results characterize differences in multivariate physiological patterns across intervention conditions and compare stability estimates derived from internal and population-based reference frameworks.
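As a rough illustration of the distance used above, the sketch below computes the Mahalanobis distance of one biomarker vector from a reference mean and covariance. Everything here is an assumption for illustration: the helper names are hypothetical, the numbers are made up, and whether the reference mean and covariance come from the study cohort (internal) or from a population survey such as NHANES (external) only changes the inputs, not the formula.

```python
from math import sqrt


def mean_and_cov(rows):
    """Sample mean vector and covariance matrix (n - 1 denominator)."""
    n, p = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            for j in range(p)] for i in range(p)]
    return mu, cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis(x, mu, cov):
    """sqrt((x - mu)^T cov^{-1} (x - mu)) without forming the inverse."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    z = solve(cov, d)
    return sqrt(sum(di * zi for di, zi in zip(d, z)))


# Made-up two-biomarker reference sample, purely to show the call pattern.
reference = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
ref_mu, ref_cov = mean_and_cov(reference)
distance = mahalanobis([2.0, 1.0], ref_mu, ref_cov)
```

Solving the linear system instead of inverting the covariance matrix is a common numerical choice; it assumes the reference covariance is non-singular.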

Yonghao Yu

Risk Modeling of Post-Fusion Adverse Outcomes in Patients with Spinal Deformity

Abstract: This practicum project examines postoperative adverse outcomes among patients with spinal deformities who underwent definitive fusion surgery. We will evaluate two binary outcomes, PostFusionComplication and PostFusionUPROAR, where UPROAR denotes an unexpected return to the emergency room. Predictors of interest include gender, prefusion age, prefusion BMI category, and whether the patient received scoliosis treatment before definitive fusion. Building on prior work showing associations between BMI, postoperative complications, and adverse postoperative events in a retrospective cohort of 102 patients, this study will use logistic regression and relative risk regression to quantify associations and compare effect estimates. The analysis aims to identify clinically meaningful risk factors for poor postoperative outcomes and provide an interpretable framework for perioperative risk stratification in this population.

Research in Observational Studies - HSC LL 209A

Savannah Flanagan

Associations Between Built Environment Features And Physical Activity Within The Activity Spaces Of Mid-Life Adults

Abstract: Physical activity (PA) is crucial for protecting against chronic diseases that emerge in mid-life. The Everyday Environmental Exposures (E3) Study tracked the PA and location of 400 mid-life adults for 21 days to investigate how PA was associated with characteristics of each person's activity space. This analysis explored how built environment (BE) features of the activity space, including bike paths, parks, pedestrian areas, street connectivity, and PA centers, were associated with the PA rates of E3 participants. Linear models were fit to explore these relationships and potential confounding by environmental features and their principal components. Linear splines with cutpoints at the quintiles of each BE feature were used to explore more complex relationships. Results suggest that variables related to the urbanness of an activity space confound the relationship between BE features and PA, with most adjusted relationships being negative. Adjusted splines revealed that BE features are only positively associated with PA at their highest quintile, implying that heavy investment in BE features is needed to promote PA in urban areas.

Vidit Tripathi

Quantification of the Pace of Biological Aging Among Participants in the US Health and Retirement Study

Abstract: Biological age, estimated using diverse biological and functional measures, can better indicate lifespan and mortality risk than chronological age. We developed a new measure of biological aging using the Health and Retirement Study cohort (13,358 participants aged 50 and older at baseline) and a hierarchical model capturing shared slopes across nine standardized aging biomarkers spanning blood, physical, and functional measures. Our mortality-calibrated measure correlated moderately with age and strongly with an existing pace-of-aging measure. Higher values reflected faster biological aging and lower survival for the fastest tertile, consistent with stratifying mortality risk. The measure performed as well as or better than an existing pace-of-aging measure at predicting aging-related outcomes, except for dementia, and its DNA methylation (DNAm) version generally outperformed or matched the DNAm version of the existing measure for predicting outcomes. Overall, our findings support the validity of the measure and suggest it may detect aging-related risk earlier in the life course. Further analyses with external validation are planned for future work.

Jingyi Wang

Weight Loss Effects Of Semaglutide And Tirzepatide: A Meta-Analysis Of Phase 3 Randomized Controlled Trials

Abstract: Obesity is a major public health concern worldwide and is associated with increased risks of cardiovascular disease and diabetes. Recently, incretin-based medications such as semaglutide and tirzepatide have shown substantial weight-loss effects in several Phase 3 randomized controlled trials. However, the magnitude and consistency of these effects across trials have not been fully synthesized. In this project, we conducted a meta-analysis of Phase 3 randomized controlled trials from the STEP program for semaglutide and the SURMOUNT program for tirzepatide. The primary outcome was percent change in body weight compared with placebo. Random-effects models were used to estimate pooled treatment effects. In the primary analysis including STEP-1, STEP-3, and STEP-5, semaglutide reduced body weight by an average of 11.7 percentage points compared with placebo (95% CI −13.2 to −10.3). Sensitivity analyses including STEP-4 showed similar results. Exploratory analyses of tirzepatide trials suggested even greater weight reductions. These findings support the strong efficacy of incretin-based therapies for obesity treatment.

Social Behavior Research - HSC LL 201

Qingpeng Liu

Emergency Department Patient Education and Social Needs Screening: Early Findings from the ENGAGE Program

Abstract: The ENGAGE program at NewYork-Presbyterian/Columbia University Medical Center aims to improve health literacy among emergency department (ED) patients through brief educational interventions and screening for social determinants of health (SDOH). As part of my APEx project, I helped develop the data collection workflow and analyze program outcomes. Follow-up phone calls conducted 7–14 days after the visit evaluated retention of knowledge and potential behavior change. De-identified survey data were processed and analyzed using R to generate descriptive statistics and exploratory regression analyses examining relationships. Between July and August 2025, interns engaged 127 patients, with 73% reporting confidence completing medical forms but only 14% initially familiar with the health topics discussed. Among patients completing follow-up calls, approximately half reported improved familiarity and several indicated consideration of behavior changes. These preliminary findings demonstrate the feasibility of integrating brief health education and SDOH screening into ED workflows and potential for ED-based interventions to support patient knowledge and connection to community resources.

Yiyang Jiao

Using Healthcare Utilization Data and AI-Driven Digital Platforms to Improve Patient Access and Care Delivery in China

Abstract: China’s healthcare system faces challenges related to uneven resource distribution, long wait times, fragmented medical records, and unequal access to care. This project examines how Tencent uses healthcare utilization data, operational metrics, and AI-enabled digital platforms to address these issues through tools such as WeChat, Tencent Cloud, healthcare mini-programs, medical AI, and personal health records. It highlights how Tencent leverages large-scale user data and AI to improve patient access, streamline care delivery, support chronic disease management, and enhance operational efficiency across the healthcare system. Overall, this project demonstrates how healthcare analytics and digital innovation can improve accessibility, efficiency, and patient-centered care at scale.

Fiona Wang

Program Evaluation of Mental Health Programming

Abstract: I worked with NewYork-Presbyterian Hospital as a Program Evaluation Analyst Assistant to support assessment of the Integrated Mental Health in Primary Care (IMP) Program within the Ambulatory Care Network. The goal of my placement was to assess whether the program was meeting its stated objectives, addressing patient needs, and identifying opportunities for improvement in care delivery. I analyzed over 10,000 patient encounters pulled from Epic, using Excel and Epic’s SlicerDicer tool to generate monthly productivity reports tracking patient volume, provider caseloads, and quality of care metrics. These reports were used by leadership to inform decisions on staffing, service delivery, and resource allocation. My contribution supported the development of a broader program evaluation framework by integrating quantitative findings with emerging opportunities to strengthen quality, access, and service availability within the IMP team. The experience helped bridge operational analytics with program improvement and evidence-based decision-making in a real-world healthcare setting.

Yiwen Zhang

Returning Aggregate Research Results: A Scoping Review of Practices, Preferences, and Barriers

Abstract: Returning research results to participants is increasingly seen as an ethical responsibility that may build public trust, improve transparency, and support future research participation. However, dissemination of aggregate results remains inconsistent and under-resourced in clinical research. This scoping review synthesized evidence on current practices, participant preferences, and challenges in returning aggregate results, with implications for public health, health equity, and policy. The review followed PRISMA-ScR guidelines. Four databases (PubMed, EMBASE, CINAHL, and Cochrane Library) were searched for English-language articles published through February 2025 on dissemination of aggregate research results to participants. Two reviewers independently screened studies and extracted data. Study quality was assessed using a modified Oxford Centre for Evidence-Based Medicine scale. Thematic synthesis identified patterns in dissemination methods, participant preferences, and implementation barriers.

Jicong Zhang

Evaluation of Art-Based Health Literacy Brochures

Abstract: This APEx project focused on evaluating art-based health literacy materials designed to improve public understanding of ADHD and asthma. As a biostatistics student, I contributed to the review and refinement of evaluation tools used to assess the effectiveness of these materials. I designed questionnaires for ADHD and asthma brochures based on the Patient Education Materials Assessment Tool (PEMAT) to measure factors such as understandability, actionability, and user engagement. Simulated datasets were generated and analyzed using R to demonstrate how quantitative and qualitative feedback could be evaluated using descriptive statistics and basic inferential methods. These analyses illustrate how survey-based evaluation data can inform improvements in health communication materials and support evidence-based public health practice.

Statistical Genetics - HSC LL 202

My An Huynh

Multimodal Integration of Single-cell RNA-sequencing and Spatial Transcriptomics to Recover Cell Location in Single-cell Data

Abstract: Single-cell RNA sequencing (scRNA-seq) measures gene expression at cell resolution but loses spatial context due to tissue dissociation. Spatial transcriptomics (ST) preserves spatial information in intact tissue but typically at lower resolution. I aim to integrate these complementary modalities to infer the location of individual cells to answer downstream biological questions. My two-stage framework first augments ST training data using a variational autoencoder to generate additional ST samples, then learns a mapping between gene expression and spatial location using a neural network. I used MERFISH data, which provides both gene expression and single-cell spatial coordinates, to evaluate model performance. I assessed the framework’s sensitivity to training resolution and data quantity by generating pseudo-Visium spots from MERFISH data. To test generalization across disease states, I applied the model to a dataset on Alzheimer’s mouse brains containing wild-type (WT) and AD conditions. Models were trained on pseudo-Visium spots from WT tissue sections and evaluated on held-out cells across different genotypes to assess within- and cross-genotype prediction performance.

Hyun Kim

Genetic Variant Selection Using GIT for Risk Prediction

Abstract: The effects of genetic variants may vary across a phenotype’s distribution, which may not be fully captured by standard variable selection methods in genetic studies. In this study, we evaluate GIT, a method for screening features with varying effects across quantiles of a response variable, for significant genetic variant selection stratified by population. GIT was applied to high-density lipoprotein (HDL) and height, and the screened variants were compared with variants reported in genetic databases and prior studies to assess if GIT can prioritize established, biologically relevant variants. To evaluate prediction performance, we compared LASSO models built using variants screened by GIT and variants selected by clumping + thresholding (C+T). Genome-wide association studies (GWAS) were conducted to obtain the summary statistics required for C+T. In addition, we assessed if incorporating GIT-screened variant information can improve prediction based on polygenic risk scores (PRS) by comparing models based on PRS derived from score files in the Polygenic Score (PGS) Catalog with models that combine those PRS with GIT-screened variant information.

Jeffrey Lin

Predicting Protein Levels From Cis- And Trans-Variants While Accounting For Environmental Confounding

Abstract: Complex traits are often influenced by genetic variants through their effects on proteins, motivating growing interest in proteome-wide association studies (PWAS). However, PWAS is often limited by the lack of proteomic measurements in large cohorts. In addition, confounding remains a concern, as environmental factors may correlate with both genotype structure and protein levels, potentially producing spurious associations. To address this, we build on the TransCisPredict framework to predict protein levels using both cis- and trans-variants while adjusting for environmental confounding. We analyzed genotype and proteomic data for 2,920 proteins from 42,454 individuals in the UK Biobank Pharma Proteomics Project. Principal Component Analysis (PCA) was applied to the protein data to capture shared correlation structure and latent variation. Variant-protein associations, together with LD block information, were then used to define prediction models for out-of-sample protein prediction. The ultimate goal is to estimate protein levels in samples lacking proteomic data and perform PWAS for a complex trait of interest.

Yunjia Liu

Evolutionary Constraint–Informed Modeling of Rare Variant Effects in Autism Spectrum Disorder

Abstract: Autism spectrum disorder (ASD) is a neurodevelopmental disorder that affects social communication and behavior. Rare genetic variants, including both de novo mutations and inherited rare variants, play an important role in shaping the genetic architecture of ASD. Given the early onset of ASD and the strong selective pressures acting on deleterious variants, we investigate whether evolutionary constraint can provide a more biologically informed framework for modeling rare variant effects. In this study, we analyze de novo loss-of-function mutations using a Bayesian hierarchical model in trio families and inherited rare variants via Burden Heritability Regression (BHR) in case–control cohorts. Rather than standard gene-level aggregation, we incorporate evolutionary constraint metrics, s_het (from GeneBayes) for loss-of-function variants and the MisFit score for missense, to stratify variants according to their levels of evolutionary constraint. Our results show that incorporating selection-based stratification improves the ability to distinguish effect sizes and reveals systematic differences in effect sizes across constraint levels.

Survey Statistics - HSC LL 210

Zhenkun Fang

Enhancing Resident Self-Efficacy in Pediatric Mental Health: A Preliminary Outcome Evaluation of a Multimodal Curriculum

Abstract: Background: Pediatric mental health (MH) concerns are rising, yet primary care resources and trainee confidence remain low. Objectives: To increase residents’ self-efficacy (SE) in MH screening, assessment, and management via a tiered, interdisciplinary curriculum. Methods: Using the Kern Model, we implemented a longitudinal curriculum featuring three modalities: didactics (pocket guides), service learning (community partnerships), and experiential learning (simulations). Outcomes were evaluated via pre- and post-intervention surveys (N=18) following 15 months of engagement. Results: Baseline assessments (N=47) showed 43.6% of residents felt "ineffective" in MH care. Post-intervention Wilcoxon signed-rank tests demonstrated a statistically significant increase in SE (Median 3.56 to 3.94; p=0.010) with a large effect size. While specific clinical skills improved significantly, there was no significant change in the frequency of self-reported MH care practices. Conclusions: This multimodal curriculum is feasible and effectively enhances resident confidence. Future research will utilize patient chart reviews to evaluate the impact on clinical practice.

Chunlei Li

Tackling Campus Nutrition Inequity Amidst Urban Food Deserts

Abstract: This practicum project investigates nutrition inequity among students at Columbia University Medical Center (CUMC) and examines the behavioral, economic, and environmental factors influencing dietary quality. A structured survey was conducted among students to assess food access, affordability, cooking behaviors, cultural food preferences, and nutrition awareness. Descriptive statistics were used to summarize dietary behaviors, while logistic and linear regression models were applied to evaluate associations between food access, affordability, cooking skills, and dietary quality indicators such as fruit and vegetable intake and processed food consumption. Results indicated that many students experienced financial barriers to accessing nutritious foods. Despite high levels of nutrition awareness and cooking skills, structural barriers such as affordability and food availability remained significant predictors of poorer dietary behaviors. These findings highlight the importance of addressing structural determinants of nutrition inequity in campus environments.

Tingyu Qian

HIV in Tanzania (2016–2022): Investigating National Prevalence Shifts and the Determinants of Testing Uptake

Abstract: This study uses the 2016 and 2022 PHIA datasets to examine HIV prevalence trends in Tanzania and the socio-demographic factors influencing HIV testing. Survey-weighted descriptive analysis and multivariable logistic regression were used to show changes in prevalence and identify independent predictors of HIV testing. The weighted national HIV prevalence in Tanzania decreased slightly from 4.88% in 2016 to 4.46% in 2022, with no significant difference (P-value = 0.073). Survey-weighted multivariable logistic regression using 2022 data showed that education was the strongest predictor. People with higher levels of education had 3.71 times the odds of being tested for HIV compared with those with lower levels of education (adjusted OR: 3.71, P-value <0.01). Individuals who disagreed with the statement “I would be ashamed if someone in my family had HIV” were more likely to take an HIV test than those who agreed with the statement (adjusted OR: 1.33, P-value <0.01). Interventions should prioritize male participation, widespread education, and community-led efforts to eliminate discrimination in order to achieve universal testing in Tanzania.

Mari Sanders

Analyzing the Impact of Professional Support Networks on Job Satisfaction and Intent to Leave Among Home-Based Primary Care Nurse Practitioners

Abstract: Burnout is characterized by exhaustion and reduced work performance. Nurse practitioners in home-based primary care for patients with dementia face stressors that may influence job satisfaction and intent to leave their jobs. Data were collected from nurse practitioners in the Northeast. Team connectivity was measured through social network analysis and ego-centric networks. Dimensionality reduction via principal component analysis was applied to support features, identifying a primary component (PC1), and k-means clustering classified participants into three support groups. Job satisfaction was modeled by ordinal regression, and binary outcomes were modeled by generalized linear regression. The first principal component explained 43.61% of variance in team support. Likelihood ratio tests showed that complex models with demographics and network measures did not significantly outperform a simplified model using PC1. Support clusters did not significantly influence intent to leave or job satisfaction, and relationships were non-significant. Findings suggest that professional advice-sharing is a core element of teams, but further research must be done to identify factors that influence burnout.

Survival Analysis - HSC LL 109B

Riyadh Baksh

Quantifying Lifetime Risk of Lyme Disease: A Population-Based Modelling Study in the United States

Abstract: Lyme disease is the most commonly reported vector-borne disease in the United States, but annual incidence estimates do not fully capture its long-term population burden. This study estimated the lifetime risk of Lyme disease across regions with differing levels of endemicity using a population-based modeling approach. Age-specific Lyme disease incidence rates from U.S. administrative claims data were combined with all-cause mortality data from the CDC. A multiple-decrement life table model simulated a cohort of individuals followed from birth through age 80–84 years, accounting for the competing risks of Lyme disease and death. Lifetime risk increased with age across all regions but varied substantially by geography. By age 80–84 years, the estimated lifetime risk was 13.7% in high-incidence states, compared with 2.0% in neighboring states and 1.3% in low-incidence states. These findings demonstrate the substantial long-term burden of Lyme disease in endemic regions and highlight the pronounced geographic heterogeneity of risk. Lifetime risk provides an intuitive summary measure that may support public health communication, prevention planning, and resource allocation.

Jianing Chen

Breast Cancer Survival Analysis and Patient-Level Response Visualization

Abstract: This practicum project focused on developing a reproducible workflow for breast cancer survival analysis, reporting, and interactive visualization using SAS, R, and Shiny. Using the METABRIC dataset, I imported and cleaned raw clinical data, constructed an analysis cohort restricted to patients receiving either mastectomy or breast-conserving surgery, and generated structured outputs including a baseline demographic table, survival summaries, Kaplan–Meier plots, and a Cox-model-based hazard ratio analysis in the ER-positive subgroup. In addition, I developed interactive Shiny components to improve the presentation and accessibility of analysis results through dynamic tables, plots, and downloadable outputs. This project strengthened my skills in clinical data preparation, time-to-event analysis, reproducible programming, and communication of statistical findings. More broadly, it demonstrated how analytical workflows can translate complex oncology data into interpretable evidence for research reporting and decision support.

Kai Tan

Automated Survival Analysis Pipeline for Time-To-Event Clinical Trial Data

Abstract: Survival analysis is widely used in clinical trials to evaluate treatment efficacy. However, manually generating reproducible survival analysis results from raw clinical datasets can be time-consuming and prone to inconsistencies. A Python-based automated analysis pipeline was developed to standardize pre-analysis exploratory data analysis, data cleaning, and survival analysis procedures. The automated pipeline was tested using an ADTTE clinical trial dataset. Kaplan–Meier (KM) curves showed consistently higher survival probabilities for patients in the treatment group compared with those in the control group. However, the log-rank test did not show a statistically significant difference between the two groups (test statistic = 2.48, p = 0.116). Overall, this project shows how automated statistical programming pipelines can improve the programming efficiency, reproducibility, and interpretability of survival analyses in clinical trial data analyses. Such standardized analytical workflows can support data-driven decision making in early phase clinical trials and enable generation of survival analysis outputs with consistent format for regulatory and research reporting.

Xiaoni Xu

Causal Mediation Analysis With Longitudinal Mediators and Competing Risks: Implementation and Validation in the CMAverse R Package

Abstract: This practicum project focuses on the development and implementation of a new R command within the CMAverse package designed to automate causal mediation analysis for longitudinal data, specifically facilitating the application of the med_longitudinal framework to complex problems involving time-varying and time-to-event mediators in the presence of competing risks. The primary aim of this project is to develop, validate, and maintain a user-friendly function that automates this process using multistate models to estimate path-specific effects. This tool enables the causal interpretation of indirect effects when mediators are measured at arbitrary time points or treated as time-to-event processes. Secondary aims include validating estimator performance through simulations, applying the command to simulated datasets, and publishing a comprehensive manual and tutorial on GitHub to support public use. Ultimately, this project provides a robust biostatistical tool for investigating complex mediation pathways in survival-related data, supporting researchers who study time-dependent health processes.

Break (3:30pm - 3:45pm)

Session 3 (3:45pm - 4:45pm)

Clinical Trials - HSC LL 208A

Wenjie Wu

A Bidirectional Crosswalk Between COWS and SOWS

Abstract: Withdrawal severity is a key outcome in opioid use disorder (OUD) research, but trials frequently measure withdrawal using different instruments. Two widely used scales are the Clinical Opiate Withdrawal Scale (COWS) and the Subjective Opiate Withdrawal Scale (SOWS), creating challenges for harmonizing data across studies. We develop an interpretable, bidirectional crosswalk to predict COWS from SOWS and SOWS from COWS using paired measurements collected within a short time window during an inpatient phase of a clinical trial. Our approach fits linear models in each direction, incorporating baseline and treatment-related covariates, and allows effect modification through selected interactions. To obtain a parsimonious set of predictors and interactions, we use patient-wise cross-validated lasso screening followed by weighted post-lasso refitting, with weights and cross-validation folds defined at the patient level to account for within-patient correlation and unequal numbers of observations. This crosswalk enables harmonization of withdrawal severity as a continuous measure across trials and improves comparability of treatment effects.

Puyuan Zhang

Statistical Evaluation of Treatment Efficacy in a Phase III Randomized Clinical Trial of Jiao Qi She Gel Patch for Knee Osteoarthritis

Abstract: Knee osteoarthritis is a common chronic condition associated with persistent pain and functional limitations. This practicum project evaluates treatment efficacy in a multicenter, randomized, double-blind, placebo-controlled Phase III clinical trial investigating Jiao Qi She Gel Patch for knee osteoarthritis. The primary endpoint was the change from baseline in knee pain measured by the Visual Analogue Scale (VAS) after two weeks of treatment. Treatment effects were estimated using an ANCOVA model with treatment group as a fixed effect and baseline VAS as a covariate. Missing data were addressed using multiple imputation under a Jump-to-Reference assumption. Secondary responder analyses summarized treatment effects using risk differences and corresponding 95% confidence intervals calculated with the Newcombe method. Dry-run analyses confirmed consistent implementation of the statistical analysis plan and produced reproducible outputs for key efficacy tables and figures. This project demonstrates practical statistical methods used in Phase III clinical trials and highlights considerations when implementing SAP-defined analyses in clinical research.

Data Analysis - HSC LL 204

Zhezhao (Malcolm) Chen

Diabetes Risk Prediction Using Machine Learning On NHANES 2009–2019 Data

Abstract: Using NHANES 2009–2019 data, this project developed supervised machine learning models to predict diabetes status in a nationally representative observational dataset. Predictors included demographics/social determinants, laboratory measures, medical conditions, and lifestyle factors. Data were cleaned and harmonized with SQL (BigQuery) and modeled in Python using a reproducible pipeline: standardizing missing values, dropping variables with high missingness, median/mode imputation with missingness indicators, and encoding categorical predictors. We trained and evaluated logistic regression and a gradient boosting decision tree model using a train/test split and classification metrics. Both models achieved high overall accuracy (~0.92), while sensitivity for diabetes was lower, underscoring the impact of class imbalance and the need to emphasize recall-oriented evaluation. Tree-based feature importance was used to identify key predictors and support model simplification. This work demonstrates an end-to-end risk prediction workflow on large-scale public health survey data and motivates future improvements via imbalance-aware training and calibration.

Xiwei Fu

Perioperative Outcomes in Pediatric Surgical Patients With Preoperative DNR Orders

Abstract:

Ada Guo

Evaluating The Impact Of DAX Copilot On Clinical Documentation Quality

Abstract: This practicum project evaluates whether the Dragon Ambient eXperience (DAX) Copilot system improves or alters the quality of clinical documentation in primary care. The project is based on a retrospective, non-interventional observational study of coded clinical notes from primary care providers at Columbia University Irving Medical Center. The approved protocol specifies a study period from July 3, 2023 to June 24, 2025, with a one-year pre-period, a two-week washout period, and a post-period beginning at least one month after DAX initiation. Documentation quality will be compared both within providers before and after DAX adoption and between DAX users and matched non-DAX users. Quality will be assessed using QNOTE, PDQI-9, and IDEA, which capture accuracy, completeness, clarity, and clinical fidelity. Planned analyses include descriptive summaries, paired comparisons, mixed-effects linear regression, and inter-rater reliability assessment using intraclass correlation coefficients.

Xuange Liang

Foundation Models for Heterogeneous Treatment Effect Estimation

Abstract: This project investigates whether pretrained electronic health record (EHR) foundation model representations can improve heterogeneous treatment effect (HTE) estimation in small observational cohorts. Using Stanford Medicine EHR data for type 2 diabetes, we compared GLP-1 receptor agonists versus insulin/sulfonylureas across four outcomes (hypertension, hyperlipidemia, systolic blood pressure, and heart rate; outcome-specific sample sizes n=46-151). We implemented an R-learner pipeline with complete cross-fitting, fold-specific dimensionality reduction, and strong regularization tuned through large-scale hyperparameter search. Adapted CLMBR embeddings achieved the best empirical R-loss across all four outcomes, improving over the strongest baseline by 4.3%, 4.6%, 10.8%, and 12.0% (average 7.9%). We also developed an event-level attribution analysis to map learned CATE signals back to clinically interpretable diagnoses, medications, and laboratory events. These findings suggest that pretrained EHR representations can improve small-sample causal estimation when leakage control, dimensionality reduction, and aggressive regularization are carefully applied, while external validation is still needed due to observational confounding and single-site data limitations.

Chuyuan Xu

Principal Component Analysis on Co-occurring Climate Threats across ZCTAs in New York City from 2000 to 2016

Abstract: Climate change constitutes an ongoing and enduring threat to global population health. Prior research has documented that climate change intensifies existing diseases and leads to the emergence of new health threats. In the United States, the public health system can be affected by disturbances in ecosystems that frequently coincide with major climate events, including snowstorms, wildfires, and droughts. The simultaneous occurrence of extreme climate events complicates the analysis of their correlation with human health outcomes. This project aims to characterize spatio-temporal patterns in the co-occurrence of climate threats across the US. We coupled high spatiotemporal-resolution estimates of climate-relevant exposures and developed exposure indices for multiple co-occurring climate threats in NYC, including droughts, meteorological statistics, wildfires, PM2.5, and cyclones, using principal component analysis.

Xiner Zhu

Adolescence and Young Adult Mental Health In Afghanistan

Abstract: Adolescents living in conflict-affected regions face substantial mental health challenges due to chronic exposure to violence, instability, and socioeconomic hardship. Afghanistan is one of the highest-risk environments for youth mental health, but comprehensive evidence remains limited. This study examines the prevalence of mental health conditions and their associations with environmental risk, socioeconomic factors, and past traumatic experiences among Afghan youth using cross-sectional survey data. The analysis included 1,103 participants, consisting of 518 adolescents aged 15–19 and 585 young adults aged 20–24. Mental health outcomes such as MDD, GAD, PTSD, substance use risk, suicidal behaviors, psychosis, and manic episodes were assessed, and multivariate logistic regression models were used to evaluate associations between mental health outcomes and key sociodemographic and contextual factors. By providing nationally representative evidence across a broad range of psychiatric outcomes, this study aims to contribute to the limited literature on youth mental health in Afghanistan and to highlight the mental health consequences of prolonged conflict for young populations.

Data Analysis and Visualizations - HSC LL 205

Jiayi Ge

Dietary Patterns And Urinary Metal Exposure In The Multi-Ethnic Study Of Atherosclerosis (MESA)

Abstract: Diet is an important determinant of environmental exposure through food sources. This practicum examines the association between dietary patterns and urinary metal exposure in participants from Exam 1 of the Multi-Ethnic Study of Atherosclerosis (MESA). The exposure of interest is dietary protein composition, characterized by the relative contribution of animal versus plant protein intake. Outcomes include specific gravity-corrected urinary arsenic, copper, zinc, and selenium concentrations. Multivariable linear regression models are used to evaluate associations with adjustment for demographic and clinical covariates. This project aims to clarify whether dietary protein sources are differentially related to biomarkers of metal exposure in a multiethnic cohort.

Mufan Liu

Screen-Media Use Subtypes in Youth and Their Associations with Cognition and Brain Connectivity

Abstract: This practicum project examined whether distinct patterns of digital media use can be identified and whether these patterns are associated with cognitive functioning and brain connectivity. Data from 624 participants aged 8-21 years from the Healthy Brain and Child Development (HCPD) study were analyzed. Media-use indicators included weekday and weekend engagement across multiple domains such as video games, messaging, social networking, streaming, television, and exposure to mature content. Three unsupervised learning approaches (K-means clustering as the primary method, hierarchical clustering, and model-based clustering) were applied to identify behavioral profiles. A four-cluster K-means solution characterized groups with low engagement, social/streaming use, broad high use, and gaming-focused behavior. Cluster membership was then evaluated using multivariable linear regression models predicting four cognitive outcomes (fluid, crystallized, early childhood, and total cognition) and brain connectivity measures, adjusting for demographic covariates.

Xikun Wang

Agentic AI System For Automated Clinical Data Updates

Abstract: This project develops an agentic AI system designed to automatically interpret user instructions and update structured clinical data. In healthcare settings, managing structured records often requires manual editing, which is time-consuming and prone to error. The goal of this project is to build an AI-driven workflow that reads natural language instructions, determines the required modifications, and applies updates to a predefined clinical data schema. The system uses a large language model as the reasoning component and a validation layer to ensure all updates follow predefined rules. Given a set of instructions, the agent generates structured outputs to modify, add, or reorganize clinical variables while maintaining data consistency. The updated results are then displayed through an interactive interface, allowing users to review the changes in real time. The project aims to demonstrate how agentic AI can support efficient clinical data management and reduce manual workload while maintaining accuracy, transparency, and reproducibility in healthcare data processing.

Yifei Yu

Life Expectancy Disparities In The Philippines Using Bayesian Spatiotemporal Modeling

Abstract: This project looks at differences in life expectancy across municipalities in the Philippines using annual mortality and population data from 2006 to 2023. Since some municipalities have small populations and unstable mortality rates, Bayesian spatiotemporal modeling is used to make the estimates more reliable by considering both geographic and time patterns. The model is fitted in R using INLA, and age-specific mortality estimates are used to build life tables and calculate life expectancy at birth by municipality, age group, sex, and year. The analysis also considers some important time periods such as presidential election years and the COVID-19 pandemic to better understand changes in mortality over time. This project helps show geographic differences in life expectancy and better understand how these patterns vary across the Philippines.

Deep Learning - HSC LL 208B

Joe LaRocca

Toward Metadata-Free Style Transfer: Hierarchical Clustering For Rare Neurological Disease Phenotyping And Protein Localization

Abstract: Rare neurological diseases often have an unknown genetic basis. Computer vision models can uncover latent, disease-associated phenotypes in single-cell microscopy images but are confounded by batch effects; advances such as Interventional Style Transfer (IST) have used imputation to mitigate them. However, current IST formulations require metadata about observational environments which are often biased or unavailable. Alternative specification of IST through hierarchical clustering may improve its effectiveness on out-of-distribution (OOD) data. Features were generated using the self-supervised DINO-v2 model on two datasets (GRID and HPA). We generated clusters and compared cluster-batch and cluster-class overlap before and after using Symphony, a post-hoc dimension reduction procedure, on both datasets. Symphony improved cluster-class association on HPA but reduced association on GRID; however, cluster-class association remained stronger post-Symphony on GRID than on HPA. Cluster-batch association was noticeable pre-Symphony but negligible post-Symphony on both datasets. Future work includes testing other clustering procedures and running IST using clusters instead of metadata.

Xueting Li

Brain Age Prediction from ROI-Based Structural MRI: A Comparison of CNN and BrainAgeR

Abstract: Estimating brain age from structural MRI is widely used to study brain aging. In this study, we compare two approaches for predicting chronological age using ROI-based MRI features. One approach uses a convolutional neural network (CNN) with hyperparameter tuning, while the other uses a traditional regression pipeline implemented through the brainageR framework. Model performance is evaluated using standard metrics, including mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²). We also examine the brain age gap (BAG), defined as the difference between predicted age and chronological age, which is often interpreted as reflecting accelerated or delayed brain aging. Finally, we discuss whether highly accurate models should be viewed purely as predictive tools or whether they point to a meaningful biological concept of brain age.

Zitao Zhang

Deep Generative Knockoffs For Feature Selection In High-Dimensional Microbiome Data

Abstract: This project develops a computational framework for feature selection in high-dimensional microbiome data using deep generative models and knockoff-based inference. Microbiome datasets often contain a large number of correlated microbial features, making it challenging to identify meaningful signals while controlling false discoveries. To address this problem, we implement a variational autoencoder (VAE) to generate knockoff variables that preserve the dependence structure of the original microbial features. These synthetic knockoffs are incorporated into a knockoff-based statistical inference framework to perform feature selection with false discovery rate (FDR) control. The project focuses on developing and implementing the computational pipeline, evaluating the quality of the generated knockoffs, and assessing the performance of the method through simulation studies and microbiome datasets. This framework aims to provide a scalable and statistically principled approach for identifying important microbial features in high-dimensional biological data.

Epidemiology - HSC LL 203

Junming Chen

Sex-Specific Gray Matter MRI Trajectories in RRMS

Abstract: Sex differences in multiple sclerosis may reflect distinct neurodegenerative mechanisms, but longitudinal MRI evidence remains limited. In this study, we used a staged, interval-based longitudinal design to examine sex-specific brain volume trajectories in relapsing-remitting multiple sclerosis (RRMS). After addressing phenotype heterogeneity, follow-up imbalance, and treatment-related noise, the main analysis identified a consistent male-unfavorable gray matter pattern, with the strongest and most stable effects observed for gray matter fraction (GMF) and cortical gray matter fraction (CGMF). Deep gray matter fraction (DGMF) showed a weaker overall effect but appeared more age-dependent, becoming increasingly male-unfavorable at older ages. Supportive sensitivity analyses were directionally consistent, while absolute-volume analyses suggested caution in over-interpreting the fraction-based findings as direct evidence of tissue loss. Overall, these results highlight sex-specific gray matter trajectories in RRMS and support the value of bias-aware longitudinal MRI modeling for understanding disease progression.

Wen Li

Predicting Severe Hypertriglyceridemia Using A Triglyceride Polygenic Risk Score

Abstract: This project evaluates the predictive utility of a triglyceride polygenic risk score (TG-PRS) for severe hypertriglyceridemia (shTG). Severe hypertriglyceridemia is an important risk factor for pancreatitis and cardiovascular disease, and early identification of high-risk individuals may improve prevention strategies. Using genetic and clinical data, this study examines whether incorporating TG-PRS into prediction models improves risk stratification beyond traditional clinical variables. Polygenic risk scores will be calculated using genome-wide association study summary statistics, and regression models will be used to assess the association between TG-PRS and triglyceride phenotypes. Model performance will be evaluated using ROC curves and the area under the curve (AUC). The findings may help evaluate the potential value of integrating genetic risk prediction into public health screening strategies.

Wenyu Lu

Strengthening Data Infrastructure And Strategic Decision-Making For A Community-Based Aging In Place Program

Abstract: This practicum focused on strengthening the data infrastructure and strategic capacity of a community-based Aging in Place program serving older adults in Northern Manhattan. The project involved cleaning and restructuring an existing Airtable client database, redesigning intake forms to improve data quality and completeness, and developing analytic tools to support program monitoring and reporting. Using descriptive analysis and geographic gap mapping, I identified service utilization patterns and areas of unmet need. In addition, I designed a satisfaction-by-need matrix to help prioritize resource allocation and outreach efforts. The enhanced system improves data accuracy, reporting efficiency, and the program's ability to communicate impact to funders and stakeholders. This project demonstrates how applied biostatistical and data management skills can strengthen community health programs and support evidence-informed decision-making in aging services.

Jiayi Wang

Mental Health Symptoms As Predictors Of Arrest Type Among Justice-Involved Young Adults

Abstract: Introduction: Justice-involved young adults experience high levels of psychiatric distress, yet the role of internalizing symptoms in predicting future offense type remains unclear. This study examined whether baseline internalizing symptoms predicted arrest outcomes at 12-month follow-up among participants enrolled in a diversion program. Methods: Data were drawn from a randomized intervention trial of justice-involved young adults. Multinomial logistic regression models assessed whether somatization, depression, anxiety, and global psychological distress predicted violent or nonviolent arrest, adjusting for prior violent arrest and demographic factors. Results: Internalizing symptoms were not associated with nonviolent arrest. However, higher baseline somatization, depression, and global severity were associated with increased odds of violent arrest, while anxiety showed a similar but nonsignificant trend. Conclusion: Internalizing psychiatric distress may represent an early marker of risk for violent justice involvement and may help inform screening and prevention strategies in diversion programs.

Chenghao Yan

Covid and RSV

Abstract: This practicum project involved statistical analysis of epidemiological datasets related to COVID-19 and respiratory syncytial virus (RSV) in China. For the COVID-19 component, de-identified case data from Nanjing Medical University Affiliated Hospital were analyzed to describe demographic characteristics, comorbidities, and hospitalization patterns during a surge in 2025. In addition, a separate dataset of over 2,000 RSV records from Tianjin was organized and analyzed to explore infection patterns among adults.

Internship Projects - HSC LL 207

Cameron Chesbrough

Evaluating Clinical Performance With A Power BI Dashboard

Abstract: The Division of Healthcare Management and Occupational Safety and Health (DHMOSH), a branch of the United Nations, is responsible for overseeing UN-administered field clinics across multiple continents, countries, and regions. DHMOSH evaluates clinic performance and provides technical assistance. Clinics send quarterly reports detailing the programs they are running and their requests for assistance, as well as data on the number of patients seen, incidence of diseases, and other operational metrics. The aims of the project are twofold: create a unified dataset from the collection of quarterly reports and create a dashboard that can effectively communicate the relevant information for the DHMOSH team. Six primary groupings of data were established from the reports and information from over 50 clinics was aggregated. The dashboard was constructed using Power BI to visualize continuous, categorical, and textual data. A collection of graphs, visualizations, and tables was created to compare clinics over time or evaluate individual clinics during a desired quarter.

Joie Li

Statistical Analysis and Visualization of Longitudinal Symptom Patterns in Endometriosis

Abstract: Endometriosis is a chronic inflammatory condition characterized by the presence of endometrial-like tissue outside the uterus. This condition affects approximately 10% of individuals assigned female at birth globally from menarche to menopause and currently has no definitive cure. One of the major challenges in managing endometriosis is the heterogeneity of symptoms, which often leads to delayed diagnosis that can extend up to 12 years. This project presents part of an internship conducted with SymptoLab, a FemTech startup developing a digital health platform to help manage female hormone-related diseases and support physician decision-making, beginning with endometriosis. In this project, longitudinal symptom data collected from endometriosis patients who used SymptoLab’s tracking platform were analyzed. Using statistical analysis and modeling approaches, we investigated recurring symptom trajectories and examined their associations with different phases of the hormonal cycle. In addition, we designed the monthly patient reporting workflows with visualizations to improve patient-physician communication and support personalized care planning for individuals with endometriosis.

Yucheng Zhao

Interactive Clinical Trial Visualization: Developing R Shiny Dashboards for Real-Time Safety and Efficacy Monitoring in Oncology

Abstract: Real-time data monitoring is essential for safety and efficacy evaluation in Phase I/II oncology trials. This report details a practicum at RemeGen focused on developing interactive clinical dashboards to enhance data oversight. Using R Shiny, a suite of modular visualization tools was engineered to standardize and process clinical datasets across study domains. Technical implementations included Adverse Event (AE) swimmer plots for longitudinal safety tracking, waterfall plots and Empirical Cumulative Distribution Function (ECDF) curves for tumor response, and visit-based visualization of laboratory results. These dashboards integrated data-cleaning functionalities to monitor patient enrollment, treatment modifications, and trial milestones such as End of Treatment and End of Study. The resulting platform provided biostatistics and clinical teams with a centralized interface for real-time monitoring. These tools facilitated the identification of dose-limiting toxicities and tumor size changes, supporting informed decision-making during interim analyses. The modular design ensures these components are scalable across internal servers for multiple oncology programs.

Longitudinal Data Analysis - HSC LL 109A

Yuechu Hu

Within-Person Longitudinal Associations between Restrictive Eating, Intuitive Eating and BMI across Three Time Points: 2010-2023

Abstract: Body mass index (BMI) is influenced not only by biological factors but also by eating behaviors that evolve over time. Restrictive eating and unhealthy weight control behaviors (UWCB) have been associated with adverse weight outcomes, while intuitive eating has been proposed as a protective factor. However, the longitudinal relationships between these eating behaviors and BMI across adulthood remain unclear. This study aims to examine how long-term patterns of eating regulation, ranging from dieting and restrictive eating to intuitive, body-attuned eating, relate to changes in BMI within the same individuals over time. The data are from a population-based survey cohort, which includes repeated measures of eating behaviors and BMI collected over multiple waves (2010 - 2023). Data will be restructured into long format to account for repeated observations within individuals. Descriptive statistics will be used to summarize the distribution of key variables. BMI over time will be visualized using spaghetti plots. Longitudinal associations between eating behaviors and BMI will be evaluated using mixed-effects models to account for within-person correlation across time.

Leyang Rui

Linking Neural Signatures To Behavioral And Cognitive Developments: A Longitudinal EEG Study In Early Childhood

Abstract: Early brain development is fundamental to later behavioral, cognitive, and perceptual functioning. Using longitudinal EEG data from the HBCD study, a nationwide cohort of more than 7,000 children across 27 research sites, this project examines whether directed neural connectivity predicts early developmental outcomes. We focus on EEG data from six brain regions, together with demographic variables and measures including the SPM-2, Vineland Adaptive Behavior Scales, Bayley Scales, MacArthur-Bates Inventories, and multilingual language scores. EEG signals were cleaned with Independent Component Analysis, harmonized across sites using ComBat, and modeled with a mixed-effects vector autoregression framework to estimate subject-specific directed functional connectivity. Global connectivity strength, summarized by the Frobenius norm, was then used to predict behavioral and cognitive phenotypes through regression models. Preliminary findings show significant associations with sensory processing, planning and ideas, balance and motion, and social participation, suggesting that early directed neural connectivity may serve as a meaningful marker of developmental functioning.

Qianyu Wu

Interactive Effects of Low-level PM2.5 Exposure And Metabolic Syndrome On Brain Structure: A Study Based On UK Biobank

Abstract: Both long-term exposure to fine particulate matter (PM2.5) and metabolic syndrome (MetS) are known to significantly affect brain structures, yet their combined effects remain unclear. This study investigated the interaction between long-term low-level PM2.5 exposure (≤15μg/m³) and MetS on brain structural changes, utilizing data from the UK Biobank with 31,492 participants. A total of 40 brain regions were analyzed, including volume, cortical thickness, and surface area. Generalized linear models were employed, adjusted for key covariates. Results indicate that the interaction between MetS and low-level PM2.5 exposure exacerbated these effects, with an interquartile range (1.34 μg/m³) increase associated with volume reduction in 9 regions, extending to 26 regions under PM2.5 levels of 10-15 μg/m³. Females with MetS were particularly vulnerable, exhibiting significant reductions in brain volume and cortical thickness. These findings highlight the synergistic effects of MetS and low-level environmental PM2.5 exposure on brain health, providing critical insights for public health interventions.

Yiran Xu

A Deep Learning Model for Early Detection of Stiff-Person Syndrome Using Administrative Claims Data

Abstract: Stiff-Person Syndrome (SPS) is a rare neurologic disorder characterized by rigidity, spasms, and overlapping features with conditions such as multiple sclerosis, Parkinson's disease, and anxiety disorders. These similarities create diagnostic complexity, leading to frequent misdiagnosis and delayed recognition. Research applying machine learning to SPS diagnosis has been minimal, with only one prior study testing a machine learning model on data from fewer than one hundred patients. This study seeks to advance that work by developing and testing a new Transformer-based algorithm on large-scale claims data and identifying the key clinical factors that differentiate SPS from related conditions along the patient journey.

Machine Learning - HSC LL 209A

Ravi Brenner

Assessing Bias In Injectable HIV Treatment Access Using Machine Learning

Abstract: The Accelerating Implementation of Multilevel Strategies to Advance Long Acting Injectables for Underserved Populations (ALAI UP) project helps clinics in the US develop injectable HIV treatment programs that address inequity in health outcomes using a variety of strategies. To be prescribed injectable cabotegravir, patients must be educated about the treatment to determine their interest. Education practices vary widely by site, meaning that bias in who is offered the treatment could occur. I used tree-based machine learning methods to predict the probability of education for patients at 4 different clinic sites, using demographic variables and social determinants of health as predictors, and accounting for interactions and site-level effects. Model performance was modest in aggregate, and no better than chance at the site level (based on AUC). This indicates that there was minimal detectable bias in who was educated, a major success for the program. Site was the strongest predictor of education, indicating that clinic-level practices around patient education play a substantial role in patient access to treatment and act as the primary social determinant in this area.

Minghe Wang

Enhancing Dementia Classification in Small Clinical Cohorts via Density-Ratio Weighted Transfer Learning

Abstract: Machine learning for dementia diagnosis requires large datasets, yet gold-standard clinical cohorts are small. Larger modern cohorts exist but present significant covariate shifts. This study develops a Transfer Learning (TL) framework to address this shift, leveraging HCAP (Source) to enhance dementia classification in ADAMS (Target). We harmonized sociodemographic, functional, and cognitive variables between ADAMS and HCAP. Replicating the feature engineering of a domain-expert baseline logistic regression, we applied a TL algorithm using density ratio estimation via convex optimization. This assigns instance weights to the source domain to mitigate distribution mismatch, followed by a dual-layer phenotyping optimization. Performance was evaluated using 5-fold cross-validation. The TL framework outperformed the expert baseline using identical data splits. Our TL model achieved a higher mean AUC of 0.9113 (SD: 0.0228) compared to the baseline's 0.8987 (SD: 0.0331), demonstrating improved accuracy and stability. TL successfully bridges disparate aging cohorts. Future work will introduce simultaneous self- and proxy-reported features to capture complex patient-informant interactions.

Zebang Zhang

Interpretable Machine Learning for Diabetes Diagnosis Prediction

Abstract: This practicum project develops and evaluates supervised machine learning models to predict diabetes diagnosis at a single visit using routinely available demographic, lifestyle, family history, and clinical variables. Using an existing dataset of 100,000 adult patient profiles, the study compares interpretable and flexible approaches, including logistic regression/Elastic Net, generalized additive models, support vector machines, random forests, and XGBoost. Preprocessing includes exploratory analysis, categorical encoding, standardization of continuous features, and principled feature selection. Model performance will be assessed through stratified cross-validation and a held-out test set using discrimination, calibration, and clinically relevant metrics such as ROC-AUC, PR-AUC, sensitivity, specificity, F1-score, and Brier score. The project will also identify and rank key determinants of diabetes risk through standardized effects, partial effects, and feature importance measures. The goal is to produce a parsimonious and interpretable risk prediction tool to support early screening and targeted counseling.

Machine Learning and Dimension Reduction - HSC LL 201

Riya Kalra

Sex-Stratified Brain Age Gap Estimation in Alzheimer's Disease: A Machine Learning Approach with High-Dimensional Structural Neuroimaging Features

Abstract: Alzheimer's disease is not sex-neutral. Two-thirds of patients are women, yet standard brain age models impose a single aging reference across sexes, embedding systematic bias into every prediction. Using a large baseline ADNI cohort, I developed sex-stratified Ridge Regression models on neuroimaging features with automated alpha tuning and leakage-free cross-validation. Direct comparison against a pooled model demonstrates that stratification reduces prediction error and eliminates between-sex BAG divergence. Male and female brains age through distinct mechanisms: male aging is driven by localized hippocampal atrophy while female aging reflects distributed cortical network changes. Corrected Brain Age Gap followed a stepwise disease gradient, increasing from near zero in cognitively normal subjects through mild impairment to its peak in Alzheimer's disease, a statistically significant progression. After stratification, sex itself was non-significant, confirming systematic bias was eliminated and diagnosis alone drives accelerated brain aging. Sex-stratified normative modeling is both statistically superior and biologically essential for equitable Alzheimer's precision medicine.

Soo Min You

Evaluating Representation Embeddings from LLMs and Time-Series Foundation Models for Wearable Accelerometer-Based Health Prediction

Abstract: Wearable accelerometers capture rich behavioral signals relevant to health monitoring, yet the comparative evidence on modern representation-learning approaches remains limited. Using accelerometer data from the National Health and Nutrition Examination Survey (NHANES), we evaluated three representation families for predicting multiple clinical outcomes: simple entropy-based features, pretrained large-language-model (LLM) embeddings, and time-series foundation model embeddings. Outcomes included overweight status, lipid biomarkers, glucose, arthritis, and cancers. Across outcomes, entropy-based features consistently performed comparably to, and often slightly better than, embedding approaches. LLM-derived embeddings offered only marginal improvements (ΔAUC≈0.01-0.05), while time-series foundation model embeddings added little predictive value. Prompt-based LLM reasoning performed worst (AUC≈0.56-0.65), demonstrating limited ability to infer physiological states from structured text. These results highlight the strength of simple variability features and underscore that domain-aligned pretraining is needed for time-series foundation models in wearable health applications.

Statistical Genetics - HSC LL 202

Mengyuan Chen

Statistical Analysis of MPRA Data to Identify Regulatory Elements in 3′ UTRs of Haploinsufficient Genes

Abstract: Haploinsufficiency, in which a single functional copy of a gene is insufficient to maintain normal physiological function, contributes to more than 600 genetic disorders, many affecting neurological development. This study investigates regulatory mechanisms within the 3′ untranslated regions (3′ UTRs) of haploinsufficient genes that influence gene expression. Data were obtained from a massively parallel reporter assay (MPRA) containing about 12,000 reporter constructs representing 270-nt fragments from the 3′ UTRs of over 600 genes, with expression measured using RNA/DNA ratios from sequencing in two human cell lines. Statistical analyses included correlation analysis to assess reproducibility across replicates and associations between sequence features and transcript abundance. Results showed strong agreement between replicates and across cell types. GC-rich sequences and more stable RNA structures were associated with increased mRNA abundance, while AU-rich motifs were linked to decreased expression. These findings provide insight into post-transcriptional regulation in haploinsufficient genes and identify candidate regions for antisense oligonucleotide therapeutic strategies.

Zhaokun Lin

Extending the Lamian Framework for Differential Pseudotime Analysis of Single-Cell multi-sampled DNA Methylation Data

Abstract: Pseudotime analysis is widely used to study dynamic biological processes such as neural stem cell differentiation. The Lamian framework uses spline-based modeling and mixed-effects regression to perform differential pseudotime analysis across multiple scRNA-seq samples. However, its statistical model assumes normally distributed data. Single-cell DNA methylation (scDNAm) data are proportions or counts, so this assumption does not hold. This limitation prevents direct application of Lamian to epigenomic data types. To address this, the Lamian framework is extended to single-cell multiomics data with a focus on scDNAm. CpG sites are aggregated to build stable methylation features, and pseudotime trajectories are constructed from the neural stem cell neurogenic lineage. This extension is then applied to scNMT-seq data from the adult mouse ventricular-subventricular zone.

Wayne Monical

Genome-Wide Association Study for Chronic Kidney Disease

Abstract: Chronic kidney disease (CKD) has an estimated global prevalence of 700 million cases, driven in large part by diabetes and hypertension. To understand the genetic component of several CKD types, including Focal Segmental Glomerulosclerosis (FSGS) and Membranous Nephropathy (MN), we conducted a genome-wide association study (GWAS) utilizing the MEGAEx array. A rigorous quality control pipeline was implemented, including filters for minor allele frequency (MAF), Hardy-Weinberg Equilibrium, and sample-level checks for missingness, sex discrepancies, and relatedness. The final analytical cohort consisted of 5,636 individuals and 513,581 variants. Population stratification was negligible, and the genomic inflation factor was well controlled at 1.0028. We identified 8 genome-wide significant variants with p-values less than 5e-8. The most significant associations were located at the CAMKK1 locus and the RNF150 locus, both of which exhibited protective effects.

Flora Pang

A Statistical Genetics Framework for Investigating Noise-Related Tinnitus in the UK Biobank

Abstract: Tinnitus is a heterogeneous auditory condition influenced by environmental exposures and age-related hearing loss (ARHL), complicating the identification of genetic risk factors. Using UK Biobank data, we implemented a scalable statistical genetics pipeline to investigate genetic associations with noise-related tinnitus. Population structure was evaluated using principal components derived from FlashPCA, and genome-wide association analyses were performed with REGENIE to account for relatedness and relevant covariates in a large biobank cohort. Three complementary phenotype and covariate specifications were analyzed to assess the robustness of association signals and distinguish variants linked to tinnitus from those driven by hearing loss or exposure-related confounding. Genome-wide significant loci were identified across analyses, including regions near MIR4790/GRM7-AS3, CRIP3/ZNF318, APOE, and APOC1. These findings establish candidate regions for downstream statistical fine-mapping using SuSiE and highlight the importance of rigorous modeling strategies when analyzing complex auditory phenotypes in large-scale genomic datasets.

Chenhui Yan

LamianOmni: Multi-Sample Pseudotime Inference for Single-Cell DNA Methylation Data

Abstract: Pseudotime analysis is a powerful approach for studying how gene regulation changes along continuous biological processes such as cell differentiation. While Lamian provides a robust framework for comparing temporal patterns across multiple samples or conditions, it has so far only been applied to single-cell RNA-seq data. In this study, we propose LamianOmni, which extends Lamian to single-cell DNA methylation (DNAm) data. A major challenge is extreme sparsity: over 99% of CpG sites are unmeasured per cell. We address this with a promoter-level aggregation strategy that produces stable methylation signals for pseudotime modeling. We apply LamianOmni to scNMT-seq data from mouse erythroid differentiation, comparing wild-type and TET triple knockout conditions, and further integrate DNAm with gene expression to explore coordinated regulatory dynamics. LamianOmni offers a generalizable framework for multi-sample pseudotime analysis of single-cell epigenomes.

Survival Analysis - HSC LL 210

Dia Deng

Integrating Clinical and Gene Expression Data for Survival Prediction in Clear Cell Renal Cell Carcinoma

Abstract: This study aimed to develop and validate a prognostic survival model for patients with clear cell renal cell carcinoma (ccRCC) by integrating clinical variables and high-dimensional gene expression data. Using data from The Cancer Genome Atlas (TCGA), we constructed Cox proportional hazards models and applied LASSO-Cox regression for feature selection to address the high dimensionality of genomic features. Model performance was evaluated using cross-validation and the concordance index (C-index). To assess generalizability, the model was further validated using an independent real-world cohort from a collaborating hospital. The model demonstrated stable performance and maintained predictive accuracy in the external dataset, supporting its robustness across different populations. Several biologically relevant gene features were identified, providing potential insights into prognostic biomarkers. This study highlights the value of integrating genomic and clinical data for survival prediction and demonstrates the feasibility of translating such models into real-world clinical settings.

Lingyuan Huang

Diet–Exercise Patterns And Liver Fibrosis Risk Among U.S. Adults: A PCA–Cluster Analysis

Abstract: This study examined the association between integrated diet–exercise patterns and liver fibrosis among U.S. adults using NHANES 2017–2020 data. Adults aged ≥18 years with valid vibration-controlled transient elastography measurements were included (n=7,587). Ten energy-adjusted dietary variables and three physical activity indicators were combined using principal component analysis (PCA), followed by K-means clustering to identify lifestyle patterns. Two distinct patterns emerged: a high-fat, high-protein, sedentary pattern and a higher-carbohydrate, more physically active pattern. Survey-weighted logistic regression models assessed associations with liver stiffness measurements (LSM) and controlled attenuation parameter (CAP). In unadjusted models, the more active pattern was associated with lower odds of cirrhosis (OR=0.73, 95% CI: 0.51–0.94), but the association attenuated after adjustment for demographic and metabolic factors. Hypertension partially mediated the relationship. No independent association was observed with hepatic steatosis. Findings suggest lifestyle patterns influence liver fibrosis primarily through cardiometabolic pathways.

Leah Li

Evaluating The Effectiveness Of Tafamidis In Patients with Transthyretin Amyloid Cardiomyopathy (ATTR-CM)

Abstract: Transthyretin Amyloid Cardiomyopathy (ATTR-CM) is a progressive and often underdiagnosed condition characterized by deposition of transthyretin amyloid in the myocardium, leading to heart failure and reduced survival. Recent therapeutic advances have introduced disease-modifying treatments such as tafamidis, a transthyretin stabilizer that slows disease progression. This project evaluates the effectiveness of tafamidis therapy in patients with ATTR-CM, focusing on survival outcomes, cardiovascular mortality, and hospitalization risk. Using clinical trial evidence and long-term follow-up data from patients across different disease stages, outcomes among patients receiving continuous tafamidis treatment were compared with those initially receiving placebo before transitioning to tafamidis. Results suggest that tafamidis treatment is associated with reduced all-cause and cardiovascular mortality and lower rates of cardiovascular-related hospitalization, particularly among patients diagnosed at earlier disease stages. These findings highlight the importance of early diagnosis and timely initiation of disease-modifying therapy in improving clinical outcomes for patients with ATTR-CM.

Yuanyuan Zhang

Association Between Low-Density Lipoprotein Cholesterol Levels And Risks Of All-Cause And Cardiovascular Mortality In Patients With Hypertension And Comorbidities

Abstract: This study examined the association between low-density lipoprotein cholesterol (LDL-C) levels and the risks of all-cause and cardiovascular mortality among patients with hypertension and comorbidities. Data came from a population-based cohort in Guizhou Province, China, established in 2010 with follow-up in 2016–2020 and 2023. After exclusions, 2,426 hypertensive participants were included. Cox proportional hazards regression was used to estimate hazard ratios, and restricted cubic spline models were applied to assess dose-response relationships. Over a median follow-up of 11.69 years, higher LDL-C levels were associated with increased risks of all-cause and cardiovascular mortality. Among patients with comorbidities, mortality risks were substantially higher in the upper LDL-C quartiles than in the lowest quartile. A linear dose-response relationship was observed. These findings suggest that LDL-C is an independent risk factor for mortality in hypertensive patients with comorbidities and support stricter lipid management in this high-risk population.