Levin Lecture Series Colloquium Seminars

For all Zoom inquiries to virtually attend seminars, please send an email to Erin Elliott, Programs Coordinator (ee2548@cumc.columbia.edu).

During the Fall and Spring semesters, the Department of Biostatistics holds seminars, called the Levin Lecture Series, on a wide variety of topics which are of interest to both students and faculty. The speakers are occasionally departmental faculty members themselves but very often are invited guests who spend the day of their seminar discussing their research with Biostatistics faculty and students. 

Fall 2023 Levin Lectures

September 7, 2023 (8th Floor Auditorium), 11:45am

Shuo Chen, PhD
Professor of Epidemiology & Public Health, Director of Biostatistics and Data Science, Maryland Psychiatric Research Center
University of Maryland School of Medicine

Evaluating the effects of high-throughput structural neuroimaging predictors on whole-brain connectome outcomes via network-based vector-on-matrix regression 

Abstract: The joint analysis of multimodal neuroimaging data is critical in brain research because it reveals complex interactive relationships between neurobiological structures and functions. In this study, we focus on investigating the effects of structural neuroimaging features, including white matter microstructure integrity and cortical thickness, on the whole-brain functional connectome network. To achieve this goal, we propose a network-based vector-on-matrix regression model to characterize the systematic association patterns between connectome networks and structural imaging variables. We develop a novel multi-level dense bipartite and clique subgraph extraction method to identify which subsets of spatially specific structural features can intensively influence organized functional connectome sub-networks. We demonstrate that our proposed network-based vector-on-matrix regression model can simultaneously identify highly correlated structural-connectomic association patterns and suppress false positive findings while handling millions of potential interactions. We apply our method to a multimodal neuroimaging dataset of 4,242 participants from the UK Biobank to evaluate the effects of whole-brain white matter microstructure integrity and cortical thickness on the resting-state functional connectome. The results reveal that the white matter microstructure integrity measures on the corticospinal tracts and inferior cerebellar peduncle significantly affect functional connections of sensorimotor, salience, and executive sub-networks.

September 14, 2023 (8th Floor Auditorium)

Tianchen Qian, PhD
Assistant Professor of Statistics
University of California, Irvine

Modeling Time-Varying Effects of Mobile Health Interventions Using Longitudinal Functional Data from HeartSteps Study

Abstract: The micro-randomized trial (MRT) is a novel experimental design for developing and optimizing mobile health interventions. In an MRT, each individual is repeatedly randomized among treatment options hundreds or thousands of times. We consider MRTs where the proximal outcome measured after each randomization is a function. An example is the HeartSteps MRT, where each participant is in the study for 210 decision points and the proximal outcome is the minute-level step count for 90 minutes following each decision point. This gives rise to a longitudinal functional data structure. To assess how the effect of the mobile health intervention varies over the 90 minutes and over the 210 decision points, we propose a varying-coefficient model for the causal excursion effect. We establish robustness, consistency, and asymptotic normality of the proposed causal effect estimator. The method is applied to the HeartSteps MRT data to understand how an individual responds to a physical activity suggestion.

September 21, 2023 (8th Floor Auditorium)

Will Wei Sun, PhD
Associate Professor of Quantitative Methods and Statistics
Daniels School of Business, Purdue University

Online Statistical Inference for Matrix Contextual Bandit

Contextual bandits have been widely used for sequential decision-making based on the current contextual information and historical feedback data. In modern applications, the context can be rich and can often be formulated as a matrix. Moreover, while existing bandit algorithms have mainly focused on reward maximization, less attention has been paid to statistical inference. To fill these gaps, in this work we consider a matrix contextual bandit framework where the true model parameter is a low-rank matrix, and propose a fully online procedure to simultaneously make sequential decisions and conduct statistical inference. The low-rank structure of the model parameter and the adaptive nature of the data collection process make this difficult: standard low-rank estimators are not fully online and are biased, while existing inference approaches in bandit algorithms fail to account for the low-rankness and are also biased. To address these issues, we introduce a new online doubly-debiasing inference procedure to simultaneously handle both sources of bias. In theory, we establish the asymptotic normality of the proposed online doubly-debiased estimator and prove the validity of the constructed confidence interval. Our inference results are built upon a newly developed low-rank stochastic gradient descent estimator and its non-asymptotic convergence result, which is also of independent interest. This is joint work with Qiyu Han and Yichen Zhang.

September 28, 2023 (Hess Commons)

Stefan Wager, PhD
Associate Professor of Operations, Information and Technology
Stanford University, Graduate School of Business

Learning from a Biased Sample

The empirical risk minimization approach to data-driven decision making assumes that we can learn a decision rule from training data drawn under the same conditions as the ones we want to deploy it in. However, in a number of settings, we may be concerned that our training sample is biased, and that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population; and in this setting empirical risk minimization over the training set may fail to yield rules that perform well at deployment. We propose a model of sampling bias called Γ-biased sampling, where observed covariates can affect the probability of sample selection arbitrarily much but the amount of unexplained variation in the probability of sample selection is bounded by a constant factor. Applying the distributionally robust optimization framework, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions that can generate the training distribution under Γ-biased sampling. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees for learning a model that is robust to sampling bias via the method of sieves, and propose a deep learning algorithm whose loss function captures our robust learning target. We empirically validate our proposed method in simulations and a case study on ICU length of stay prediction.

October 3, 2023 (8th Floor Auditorium)

*Special Time 1pm-2pm

Valentin Patilea 
Professor of Statistics, Head of PhD Program
ENSAI (École Nationale de la Statistique et de l'Analyse de l'Information)

Adaptive Functional Data Analysis

Functional Data Analysis (FDA) depends critically on the regularity of the observed curves or surfaces. Estimating this regularity is a difficult problem in nonparametric statistics. In FDA, however, it is much easier due to the replication nature of the data. After introducing the concept of local regularity for functional data, we provide user-friendly nonparametric methods for investigating it, for which we derive non-asymptotic concentration results. As an application of the local regularity estimation, the implications for functional PCA are shown. Flexible and computationally tractable estimators for the eigenelements of noisy, discretely observed functional data are proposed. These estimators adapt to the local smoothness of the sample paths, which may be non-differentiable and have time-varying regularity. In the course of constructing our estimator, we derive upper bounds on the quadratic risk and obtain the optimal smoothing bandwidth that minimizes these risk bounds. The optimal bandwidth can be different for each of the eigenelements. Simulation results justify our methodological contribution, which is available for use in the R package FDAdapt. Extensions of the adaptive FDA approach to functional time series and multivariate functional data are also mentioned.

October 5, 2023 (8th Floor Auditorium)

Hongyu Zhao, PhD
Ira V. Hiscock Professor of Biostatistics, Professor of Genetics and Professor of Statistics and Data Science
Yale University, School of Public Health

Beyond Global Correlations

Correlations are among the most commonly used statistics to quantify the dependence between two variables. Although global correlations can be calculated across all the data points collected in a study, some distinct and important local and context-dependent patterns may be masked by global measures. In this presentation, we will discuss the need for, benefits of, and challenges in inferring local and context-dependent correlations in genetics and genomics studies. We will focus on examples of local genetic correlations to identify genomic regions with shared effects on different complex traits, and of cell-type-specific correlations to identify co-regulated genes in disease-relevant cell types.



October 12, 2023 (8th Floor Auditorium)

Nilanjan Chatterjee, PhD
Bloomberg Distinguished Professor
Johns Hopkins University, Bloomberg School of Public Health

A Transfer Learning Approach to Build High-Dimensional Models Integrating Datasets with Disparate Features 

Development of comprehensive prediction models is often of great interest in many disciplines of science, but datasets that have information on all desired features are typically small. In this talk, I will describe a transfer learning approach for building high-dimensional generalized linear models using data from a main study that has detailed information on all predictors, and from one or more external, possibly much larger, studies which have ascertained a more limited set of predictors. We propose using the external dataset(s) to build reduced model(s) and then “transfer” the information on underlying parameters for the analysis of the main study through a set of calibration equations that can account for study-specific effects of certain features, like design variables. We then use a penalized generalized method of moments (GMM) approach for parameter estimation and develop highly scalable algorithms for fitting models with lasso and adaptive lasso penalties, taking advantage of the popular glmnet package. We conduct extensive simulation studies to investigate both predictive performance and post-selection inference properties of the proposed methods compared to traditional analysis of only the internal study. Finally, we illustrate a timely application of the proposed method for the development of risk prediction models for nine common diseases using the UK Biobank study, combining baseline information from all study participants (N=500K) and recently released high-throughput proteomic data (P=1500) on a subset (N=50K) of the participants.

October 19, 2023 (8th Floor Auditorium)

Judith Lok, PhD
Associate Professor of Mathematics and Statistics
Boston University

Causal organic indirect and direct effects: closer to Baron and Kenny, and enabling estimation of indirect effects without on-treatment outcome data

Mediation analysis, which was popularized by Baron and Kenny (1986), is used extensively by applied researchers. Indirect and direct effects are the part of a treatment effect that is mediated by a covariate and the part that is not. Subsequent work on natural and pure indirect and direct effects provides a formal causal interpretation, based on cross-worlds counterfactuals: outcomes under treatment with the mediator set to its value without treatment. Organic indirect and direct effects avoid cross-worlds counterfactuals, using so-called organic interventions on the mediator while keeping the initial treatment fixed, and apply also to settings where the mediator cannot be set. Combining organic interventions with "no treatment" instead of with "treatment" relates to pure indirect effects as opposed to natural indirect effects. The resulting organic indirect effect relative to no treatment represents the impact of an intervention on a mediator that mirrors the effect of the treatment. Subsequently, the system proceeds to follow its natural course as if the mediator had attained its value without any intervention or treatment. This organic indirect effect relative to no treatment can be estimated using information on, or an estimate of, the treatment's effect on the mediator, along with outcome and mediator data collected under "no treatment." In other words, it is not necessary to have outcome data under treated conditions to estimate this effect. I will illustrate this by estimating the organic indirect effect relative to no treatment for a potential COVID medication developed in the Intensive Care Unit at Boston University. This medication targets what are known as Despr-positive neutrophil nets. In laboratory tests, this COVID medication has been shown to eliminate Despr-positive neutrophil nets, but it has not yet been administered to humans.

October 26, 2023 (8th Floor Auditorium)

Jessica G. Young, PhD
Associate Professor, Department of Population Medicine
Harvard Medical School & Harvard Pilgrim Health Care Institute

Causal inference with competing events

A competing (risk) event is any event that makes it impossible for the event of interest in a study to occur. As one example, cardiovascular disease death is a competing event for prostate cancer death because an individual cannot die of prostate cancer once he has died of cardiovascular disease. As another example, all-cause mortality is a competing event for onset of dementia because once an individual dies without dementia, they will never subsequently develop it. Various estimands have been posed in the classical competing risks literature, most prominently the cause-specific cumulative incidence, the marginal cumulative incidence, the cause-specific hazard, and the subdistribution hazard. Here we will consider the interpretation of counterfactual contrasts in each of these estimands under different treatments. In turn, we argue that choosing a target causal effect in this setting boils down to whether investigators are satisfied with estimating total effects, which capture all mechanisms by which treatment affects the event of interest, including via effects on competing events. When investigators deem the total effect insufficient to answer the underlying question, we consider alternative targets of inference that capture treatment mechanisms in competing event settings, with emphasis on the recently proposed separable effects. Our formalization of this problem is importantly agnostic to whether the event of interest is "terminal" (prostate cancer death) or "nonterminal" (incident dementia), which calls into question the value of the historical separate designations of "competing" and "semi-competing" risk problems when we transparently motivate statistical analysis via the question of substantive interest.

November 2, 2023 (8th Floor Auditorium)

Chen Hu, PhD
Associate Professor of Oncology and Biostatistics
Johns Hopkins University

On statistical inference of multiple competing risks in comparative clinical trials

Competing risks data are commonly encountered in comparative clinical trials. Ignoring competing risks in survival analysis leads to biased risk estimates and improper conclusions. Often, one of the competing events is of primary interest and the remaining competing events are handled as nuisances. These approaches can be inadequate when multiple competing events have important clinical interpretations and are thus of equal interest. For example, in hospitalized critical care treatment trials, the outcomes are either death or discharge from hospital, which have completely different clinical implications and are of equal interest. In oncology trials, while composite endpoints, such as disease-free survival, are used frequently, there is often concern that novel interventions do not necessarily impact all components of a composite endpoint equally. We develop nonparametric estimation and simultaneous inferential methods for multiple cumulative incidence functions (CIFs) and corresponding restricted mean times. Based on Monte Carlo simulations and a data analysis of a completed clinical trial, we demonstrate that the proposed method provides global insights into the treatment effects across multiple endpoints. This is joint work with Dr. Mei-Cheng Wang and Dr. Jiyang Wen.

November 9, 2023 (8th Floor Auditorium)

Lily Wang, PhD
Professor of Statistics
George Mason University

Next-generation Functional Data Analysis with Applications in Neuroimaging Studies

Recent advances in technology have catalyzed the rapid growth of medical imaging data, which carry significant amounts of information for disease diagnosis and treatment outcomes. In the domain of neuroimaging, the use of complex three-dimensional (3D) objects is becoming more prevalent, necessitating the development of advanced data analysis techniques. This presentation focuses on an innovative nonparametric approach designed to learn and infer from intricate objects embedded in 3D space, including surface-based and voxel-based functional data. Our method enables accurate signal estimation and efficient detection of significant effects within complex 3D structures. We introduce a framework that leverages advanced statistical modeling approaches, including multivariate splines on triangulations and next-generation functional data analysis, to address the challenges associated with 3D imaging objects, such as complex geometry and spatial dependencies. In addition, we propose a novel approach for constructing simultaneous confidence corridors to effectively quantify estimation uncertainty and enhance the reliability of inference. Furthermore, we present a novel distributed learning framework that features a straightforward, scalable, and communication-efficient implementation scheme that can achieve near-linear speedup. We evaluate the finite-sample performance through numerical experiments and demonstrate the methods’ applicability in real-data analysis in neuroimaging studies.

November 16, 2023 (ARB 5th floor, Rooms 532A and 532B)

Sudipto Banerjee, PhD
Sr. Associate Dean for Academic Programs, UCLA Fielding School of Public Health; Professor of Biostatistics; Professor of Statistics & Data Science; Affiliate Professor, UCLA Institute of the Environment & Sustainability
The University of California, Los Angeles

Process-based Inference for Actigraph Data from Wearable Devices

Rapid developments in streaming data technologies have enabled real-time monitoring of human activity. Wearable devices, such as wrist-worn sensors that monitor gross motor activity (actigraphy), have become prevalent. An actigraph unit continually records the activity level of an individual, producing large amounts of high-resolution measurements that can be immediately downloaded and analyzed. While this type of big data includes both spatial and temporal information, we argue that the underlying process is more appropriately modeled as a stochastic evolution through time, while accounting for spatial information separately. A key challenge is the construction of valid stochastic processes over paths. We devise a spatial-temporal modeling framework for massive amounts of actigraphy data, while delivering fully model-based inference and uncertainty quantification. Building upon recent developments, we discuss traditional Bayesian inference using Markov chain Monte Carlo algorithms as well as faster alternatives such as Bayesian predictive stacking. We test and validate our methods on simulated data and subsequently evaluate their predictive ability on an original dataset from the Physical Activity through Sustainable Transport Approaches (PASTA-LA) study conducted by UCLA's Fielding School of Public Health.

November 30, 2023 (8th Floor Auditorium)

MaryLena Bleile, PhD
Postdoctoral Research Fellow
Memorial Sloan Kettering Cancer Center

A Domain-Oriented Analysis Pipeline for Spatial Transcriptomics With Application to Melanoma Data

The spatial arrangement of cells within the tumor microenvironment plays a crucial role in the progression of melanoma. Recent advances in spatial sequencing technology, such as the 10X Visium platform, have enabled the creation of genomic datasets that capture spatial relationships. Analyses of these datasets have demonstrated that mapping the immune landscape within the tumor can aid in the identification of potential candidates for immune checkpoint inhibitors (ICI), such as anti-PD1 immunotherapy. Specifically, exploring gene-cell interactions at the boundary between the tumor and its stroma has provided valuable insights. Therefore, it is essential to divide a tissue slide into regions of interest for this analysis. While various methods support this effort, standardized pipelines have not yet been established. Our novel approach for segmenting and analyzing the tumor-stroma interface in a 10X Visium melanoma dataset begins with gene expression cluster analysis and spotwise cell type deconvolution using Latent Dirichlet allocation, followed by region identification using percentages of the deconvolved "melanoma" cell type per cluster. One can then perform region-specific analysis of gene module prevalence and interaction.

December 7, 2023 (8th Floor Auditorium)

Jianqing Fan, PhD
Frederick L. Moore '18 Professor of Finance, Professor of Statistics, and Professor of Operations Research and Financial Engineering
Princeton University

Communication-Efficient Distributed Estimation and Inference for Cox’s Model


Motivated by multi-center biomedical studies that cannot share individual data due to privacy and ownership concerns, we develop communication-efficient iterative distributed algorithms for estimation and inference in the high-dimensional sparse Cox proportional hazards model. We demonstrate that our estimator, with a relatively small number of iterations, achieves the same convergence rate as the ideal full-sample estimator under very mild conditions. To construct confidence intervals for linear combinations of high-dimensional hazard regression coefficients, we introduce a novel debiased method, establish central limit theorems, and provide consistent variance estimators that yield asymptotically valid distributed confidence intervals. In addition, we provide valid and powerful distributed hypothesis tests for any coordinate of the coefficient vector based on a decorrelated score test. We allow time-dependent covariates as well as censored survival times. Extensive numerical experiments on both simulated and real data lend further support to our theory and demonstrate that our communication-efficient distributed estimators, confidence intervals, and hypothesis tests improve upon alternative methods.