Levin Lecture Series: Spring 2021 Colloquium Seminars

***Due to the pandemic all Levin Lecture Talks will be conducted via Zoom for the Spring 2021 semester*** 

For all Zoom inquiries to virtually attend seminars, please send an email to Corey Adams, Administrative/Practicum Coordinator (ca2594@cumc.columbia.edu), or Dr. Caleb Miles, Professor (cm3825@cumc.columbia.edu).


January 21, 2021

Topic: "Flexible methods for mediation analysis with discrete and continuous exposures"

11:30am-12:30pm
via Zoom
Hosted by: Department of Biostatistics

Speaker: Dr. Iván Díaz, Assistant Professor of Biostatistics at Weill Cornell Medicine

Abstract: Mediation analysis in causal inference has traditionally focused on binary exposures and deterministic interventions, and a decomposition of the average treatment effect in terms of direct and indirect effects. We present an analogous decomposition of the population intervention effect, defined through stochastic interventions on the exposure. Population intervention effects provide a generalized framework in which a variety of interesting causal contrasts can be defined, including effects for continuous and categorical exposures. We show that the identification of direct and indirect effects for the population intervention effect requires weaker assumptions than its average treatment effect counterpart, under the assumption of no mediator-outcome confounders affected by exposure. In particular, identification of direct effects is guaranteed in experiments that randomize the exposure and the mediator. We propose various estimators of the direct and indirect effects, including substitution, re-weighted, and efficient estimators based on flexible regression techniques, allowing for multivariate mediators. Our efficient estimator is asymptotically linear under a condition requiring n^{1/4}-consistency of certain regression functions. We perform a simulation study in which we assess the finite-sample properties of our proposed estimators. We present the results of an illustrative study where we assess the effect of participation in a sports team on BMI among children, using mediators such as exercise habits, daily consumption of snacks, and overweight status.


January 28, 2021

Topic: "Statistical Methods for Combining Multiple Data Sources to Estimate Environmental Exposures"

11:30am-12:30pm
via Zoom
Hosted by: Department of Biostatistics

Speaker: Dr. Howard Chang, Associate Professor of Biostatistics and Bioinformatics, Emory University Rollins School of Public Health

Abstract: Accurate exposure estimates are crucial to the success of any environmental health study. However, monitoring measurements are often only available sparsely in space and time. One approach to improving exposure assessment is to supplement measurements with additional data sources, such as computer model simulations, satellite imagery, and low-cost sensors. These data products are increasingly being used to support spatially-resolved health effect analyses and perform impact assessments in low- and middle-income countries. This presentation will discuss the use of spatial statistical downscaling, ensemble averaging, and quantile mapping to combine different data sources for modeling environmental exposures. These methods are applied to estimate daily fine particulate matter concentration and to bias-correct climate model projections. Several studies on the health effects of ambient air pollution and extreme temperature will also be presented to highlight the advantages and challenges associated with the use of these data products.


February 4, 2021

Topic: "Efficient SNP-based Heritability Estimation using Gaussian Predictive Process in Large-scale Cohort Studies"

11:30am-12:30pm
via Zoom
Hosted by: Department of Biostatistics

Speaker: Dr. Saonli Basu, Professor, Division of Biostatistics, University of Minnesota

Abstract: For decades, linear mixed models (LMM) have been widely used to estimate heritability in twin and family studies. Recently, with the advent of high throughput genetic data, there have been attempts to estimate heritability from genome-wide SNP data on a cohort of distantly related individuals. Fitting such an LMM in large scale cohort studies, however, is tremendously challenging due to high dimensional linear algebraic operations. In this paper, we simplify the LMM by unifying the concepts of Genetic Coalescence and Gaussian Predictive Process, thereby greatly alleviating the computational burden. Our proposed approach, PredLMM, has much better computational complexity than most of the existing packages and thus provides an efficient alternative for estimating heritability in large scale cohort studies. We illustrate our approach with extensive simulation studies and use it to estimate the heritability of multiple quantitative traits from the UK Biobank cohort.

This is joint work with Souvik Seal, Colorado School of Public Health, and Abhirup Datta, Johns Hopkins University.


February 11, 2021

Topic: "Individualized Multi-directional Variable Selection"

11:30am-12:30pm
via Zoom
Hosted by: Department of Biostatistics

Speaker: Dr. Annie Qu, Chancellor's Professor, University of California-Irvine

Abstract: In this talk, we propose a heterogeneous modeling framework that achieves individual-wise feature selection and individualized covariates’ effects subgrouping simultaneously. In contrast to conventional model selection approaches, the new approach constructs a separation penalty with multi-directional shrinkages, which facilitates individualized modeling to distinguish strong signals from noisy ones and selects different relevant variables for different individuals. Meanwhile, the proposed model identifies subgroups among which individuals share similar covariates’ effects, and thus improves individualized estimation efficiency and feature selection accuracy. Moreover, the proposed model also incorporates within-individual correlation for longitudinal data to gain extra efficiency. We provide a general theoretical foundation under a double-divergence modeling framework where the number of individuals and the number of individual-wise measurements can both diverge, which enables inference on both an individual level and a population level. In particular, we establish strong oracle property for the individualized estimator to ensure its optimal large sample property under various conditions.


February 18, 2021

Topic: "Data squashing: What is it? How can we improve it?"

11:30am-12:30pm
via Zoom
Hosted by: Department of Biostatistics

Speaker: Dr. Matthew Shotwell, Associate Professor, Department of Biostatistics, Vanderbilt University

Abstract: Data squashing aims to reduce the size of a data set, but preserve the information contained in it for the purposes of fitting a model. There are computational and other practical benefits of data squashing. Although several methods were developed more than 20 years ago, they have not been widely adopted. This may be due to the practical limitations of these methods, which might be overcome with additional research. This seminar introduces the original method and some opportunities for improving the method to make it more widely applicable.


February 25, 2021

Topic: "Accounting for Data Provenance in EHR-based Phenotyping"

11:30am-12:30pm
via Zoom
Hosted by: Department of Biostatistics

Speaker: Dr. Rebecca Hubbard, Professor of Biostatistics, University of Pennsylvania

Abstract: Opportunities to use “real-world data,” data generated as a by-product of digital transactions, have exploded over the past decade. Such data sources facilitate research in a naturalistic setting and with greater speed than is possible for research that relies on primary data collection. However, using data sources that were not collected for research purposes comes at a cost, and naïve use of such data without considering the complex data generating mechanisms they arise from can lead to biased inference. In this talk, I will use my research on electronic health records (EHR)-based phenotyping to illustrate how statistical approaches to missing data and complex study designs can be harnessed to improve the validity of analyses using real-world data. EHR-based phenotyping is hampered by complex missing data patterns and heterogeneity across patients and healthcare systems, features that have been largely ignored by existing phenotyping methods. As a result, not only are EHR-derived phenotypes expected to be imperfect, but they often feature exposure-dependent differential misclassification, which can bias results towards or away from the null. I will review novel and existing approaches to EHR-based phenotyping, highlighting the impact of missing data on phenotype estimation. Finally, I will discuss approaches to minimize bias when incorporating error-prone phenotypes into subsequent analyses. The overarching goal of this presentation is to use the example of phenotyping to illustrate the unique contribution of statistics to the process of generating evidence from real-world data.


March 4, 2021

Topic: "Computational methods for studying genetics of complex human traits"

11:30am-12:30pm
via Zoom
Hosted by: Department of Biostatistics

Speaker: Dr. Xin He, Assistant Professor of Human Genetics and Statistics at the University of Chicago

Abstract: I will talk about two topics in the statistical genetics of human traits. In the first part, I will describe a recently developed method to identify putative risk factors of complex traits, using Mendelian Randomization (MR). MR is a framework to assess the causal relationship between two traits, using genetic variants of an exposure trait as "natural randomization" to estimate its effect on an outcome. However, current MR methods make strong assumptions that are violated when SNPs act on the outcome not through the exposure, known as pleiotropic effects. We propose a method, CAUSE, to deal with pleiotropy by explicitly modeling hidden factors that would confound the relationship between exposure and outcome. We show in simulations and GWAS that CAUSE significantly reduces false discoveries while maintaining power. In the second part, I will talk about a method for using rare variants to study complex trait genetics. Compared with common variants, the focus of many genetic studies, rare variants are usually not in linkage disequilibrium, making it easier to detect causal variants and genes. However, the power of identifying rare variants is usually low. We describe our approach to addressing this challenge in the context of de novo mutations. Our method combines variants at the level of genes and leverages functional information of variants. This method enabled the discovery of a number of risk genes of autism.


March 11, 2021

Topic: "Stein shrinkage, random matrix and imaginary direction smoothing"

11:30am-12:30pm
via Zoom
Hosted by: Department of Biostatistics

Speaker: Dr. Gary Chan, Professor of Biostatistics, School of Public Health, University of Washington

Abstract: The James-Stein estimator of a multivariate mean vector and the Stein estimator of a covariance matrix are classical examples of shrinkage estimators. Conceptually very different from the majority of contemporary estimators formulated using penalization with sparsity and/or low-rank assumptions, the Stein covariance estimator targets non-sparse and full-rank covariance matrices. We review recent literature in random matrix theory that is relevant to studying Stein-type shrinkage estimators and extensions, and present certain asymptotic optimality results under various loss functions. A recurring unknown quantity to be estimated is a real-direction limit of a Stieltjes transform of the limiting spectral distribution, defined on the upper complex half-plane. Moment-based estimators of such quantities are quite unstable but were used in Stein's original paper. We consider an alternative estimator by defining a smoothed loss function based on a smoothed empirical spectral distribution, with an optimum in terms of a smoothed Stieltjes transform, where smoothing is along the imaginary direction via a Cauchy kernel. A data-adaptive choice of the smoothing parameter is proposed due to a relationship between the imaginary part of a complex Stieltjes transform and the sample eigenvalue distribution. Some theoretical properties and numerical illustrations will be discussed. This is joint work with Sheung Chi Phillip Yam, Xiaolong Li and Yifan Shi.


March 25, 2021

Topic: "A modern take on Huber regression"

11:30am-12:30pm
via Zoom
Hosted by: Department of Biostatistics

Speaker: Dr. Po-Ling Loh, Associate Professor, Statistical Laboratory in the Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge

Abstract: In the first part of the talk, we discuss the use of a penalized Huber M-estimator for high-dimensional linear regression. We explain how a fairly straightforward analysis yields high-probability error bounds that hold even when the additive errors are heavy-tailed. However, the parameter governing the shape of the Huber loss must be chosen in relation to the scale of the error distribution. We discuss how to use an adaptive technique, based on Lepski's method, to overcome the difficulties traditionally faced by applying Huber M-estimation in a context where both location and scale are unknown.

In the second part of the talk, we turn to a more complicated setting where both the covariates and responses may be heavy-tailed and/or adversarially contaminated. We show how to modify the Huber regression estimator by first applying an appropriate "filtering" procedure to the data based on the covariates. We prove that in low-dimensional settings, this filtered Huber regression estimator achieves near-optimal error rates. We further show that the commonly used least trimmed squares and least absolute deviation estimators may similarly be made robust to contaminated covariates via the same covariate filtering step. This is based on joint work with Ankit Pensia and Varun Jog.


April 1, 2021

Topic: "Statistical Approaches for the Genetic Epidemiology of Prostate Cancer: From Trans-ancestry GWAS to Multi-omic Analysis"

11:30am-12:30pm
via Zoom
Hosted by: Department of Biostatistics

Speaker: Dr. David Conti, Professor of Preventive Medicine and Associate Director for Data Science Integration at the Keck School of Medicine at the University of Southern California

Abstract: Motivated by current research in the genetic epidemiology of prostate cancer, I will discuss several different statistical approaches we have recently developed for analysis. This includes the joint modeling of summary statistics and the investigation of multi-omics data with integrated analysis. Prostate cancer is a highly heritable disease with large disparities in incidence rates across ancestry populations. While genome-wide association studies (GWAS) have identified numerous risk loci, a majority of these discoveries have occurred with studies containing individuals of European ancestry. However, it is critical that future studies perform GWAS in non-European populations to provide insight into both ancestry-specific variation and common risk variation across populations. We recently conducted a trans-ancestry meta-analysis of prostate cancer GWAS (107,247 cases and 127,006 controls) and identified 86 new genetic risk variants, bringing the total to 269 known risk variants. Importantly, a genetic risk score (GRS) constructed with these variants was consistently associated across all four ancestry populations. This investigation leveraged several novel statistical techniques. For fine-mapping in a single population, we have previously developed an approach (JAM) that uses marginal summary statistics and estimated correlation structure to fit joint multi-SNP models. More recently, we have extended this approach within a hierarchical modeling framework to greatly facilitate additional applications. For example, this framework can be used for Mendelian randomization (MR) and transcriptome-wide association studies (TWAS) to estimate the causal effects of risk factors and genes using summary statistics. It also allows for multi-population fine-mapping by explicitly accounting for the difference in covariance between SNPs across each population. 
Moreover, a scalable version can be applied to a large number of intermediates and a large number of correlated genetic variants – situations often encountered in modern experiments leveraging omic technologies. Finally, in complementary work, we have been investigating the metabolic mechanisms of the prostate cancer GRS using an integrated model for individual-level multi-omic data. Overall, better characterization of risk loci will facilitate understanding the underlying biological mechanisms and application of GRS for personalized risk prediction for all populations.


April 8, 2021

Topic: "Bayesian Spatial Blind Source Separation via the Thresholded Gaussian Process"

11:30am-12:30pm
via Zoom
Hosted by: Department of Biostatistics

Speaker: Dr. Jian Kang, Professor of Biostatistics, School of Public Health, University of Michigan

Abstract: Blind source separation (BSS) aims to separate latent source signals from their mixtures. For spatially dependent signals in high dimensional and large-scale data, such as neuroimaging, most existing BSS methods do not take into account the spatial dependence and the sparsity of the latent source signals. To address these major limitations, we propose a Bayesian spatial blind source separation (BSP-BSS) approach for neuroimaging data analysis. We model the expectation of the observed images as a linear mixture of multiple sparse and piecewise smooth latent source signals, for which we construct a new class of Bayesian nonparametric prior models by thresholding Gaussian processes. We assign von Mises-Fisher priors to the mixing coefficients in the model. Under some regularity conditions, we show that the proposed method has several desirable theoretical properties, including large support for the priors, consistency of the joint posterior distribution of the latent source intensity functions and the mixing coefficients, and selection consistency on the number of latent sources. We use extensive simulation studies and an analysis of the resting-state fMRI data in the Autism Brain Imaging Data Exchange (ABIDE) study to demonstrate that BSP-BSS outperforms the existing alternatives for separating latent brain networks and detecting brain activation in the latent sources. This is joint work with Ben Wu and Ying Guo.


April 15, 2021

Topic: "Clinical Trials and Tribulations in a Global Pandemic"

11:30am-12:30pm
via Zoom
Hosted by: Department of Biostatistics

Speaker: Dr. Manisha Desai, Professor of Medicine and Biomedical Data Science, Section Chief of Biostatistics, and Director of the Quantitative Sciences Unit at Stanford University

Abstract: Data scientists have been at the forefront of helping to resolve the COVID pandemic. Their roles to address numerous critical questions have been integral to finding solutions to the pandemic. Our team has responded on numerous fronts but has primarily focused efforts on the design and analysis of clinical trials to establish optimal treatments for COVID-19 in the outpatient setting. We have proposed a shared infrastructure for the design and analysis of outpatient COVID-19 trials at our institution that includes a pragmatic platform protocol. The COVID-19 Outpatient Pragmatic Platform Study (COPPS) was designed to be flexible and efficient. I will discuss the journey to establish this trial including the challenges, lessons learned, and some solutions.