Past Lectures

January 15 

Meta-Analysis Based Gene and Pathway Identification for High-Throughput Genomic Data 
Quefeng Li, PhD
Postdoctoral Research Associate, Department of Operations Research and Financial Engineering
Princeton University
Hosted by: Ying Wei, PhD

Recent advances in biotechnology and their wide application have generated high-dimensional gene expression data that can be used to address similar biological questions. Meta-analysis plays an important role in summarizing and synthesizing scientific evidence from multiple studies. Unlike classical meta-analysis methods, which are often applied to each gene individually, Li proposes a joint modeling method that analyzes all genes simultaneously. This method accounts for heterogeneity among studies by allowing some important genes to have a null effect in certain studies, making it more flexible than existing methods, which do not take such heterogeneity into account. Li shows that his method possesses gene selection consistency even when the number of genes is much larger than the sample size. In this talk, Li introduces this method, outlines its extension to pathway identification, and describes how to infer the individual and overall effects of genes and pathways for high-dimensional genomic data.

January 22 

Understanding Racial/Ethnic Disparities in Cancer Survival – A Counterfactual Approach for Conceptual and Missing Data Issues 
Linda Valeri, PhD
Research Fellow, Department of Biostatistics
Harvard University
Hosted by: Ying Wei, PhD

The National Cancer Institute has identified the elimination of cancer disparities as one of the most urgent goals for reducing disease burden in the United States. Recent research documents that disparities across racial/ethnic groups involve cancer etiology, incidence, screening, diagnosis, treatment, and survival. Quantifying the interplay of mediating factors across this continuum and informing targeted interventions is therefore a priority. Stage at diagnosis is considered a major factor in explaining the observed differences in survival between black and white colorectal cancer patients. Valeri proposes to quantify the extent to which racial/ethnic survival disparities would be reduced if disparities in stage at diagnosis were eliminated. In particular, she develops a causal inference approach to assess the impact of a hypothetical shift in the distribution of stage at diagnosis in the black population to match that of the white population. Inferences are valid under the assumption of no unmeasured confounders of the mediator-survival relationship. Valeri then develops sensitivity analysis techniques to assess the robustness of the results to violation of the no-unmeasured-confounding assumption and to selection bias due to stage at diagnosis missing not at random. The results support the hypothesis that eliminating disparities in stage at diagnosis would contribute to reducing racial survival disparities in colorectal cancer. Important heterogeneities across patients' characteristics were observed, and Valeri's approach easily accommodates these features. This work illustrates how the counterfactual framework aids in identifying and formalizing relevant hypotheses in health disparities research that can inform policy decisions.

January 29 

Robust Covariance Functional Inference 
Fang Han
PhD Candidate, Department of Biostatistics
Johns Hopkins University
Hosted by: Ying Wei, PhD

Covariance functional inference plays a key role in high-dimensional statistics. A wide variety of statistical methods, including principal component analysis, Gaussian graphical model estimation, and multiple linear regression, are intrinsically inferring covariance functionals. In this talk, Han presents a unified framework for the analysis of complex (non-Gaussian, heavy-tailed, dependent) high-dimensional data, connecting covariance functional inference to robust statistics. Within this unified framework, Han introduces three new methods: elliptical component analysis, a robust homogeneity test, and rank-based estimation of a latent VAR model. They cover both estimation and testing problems in high dimensions and are applicable to independent or time series data. Although the generative models are complex, Han shows the rather surprising result that all proposed methods are minimax optimal and that their performance is comparable to that of Gaussian-based counterparts under the Gaussian assumption. Han further illustrates the strength of the proposed unified framework on real financial data.

February 5 

Treatment Effect Heterogeneity 
Peng Ding
PhD Candidate, Department of Statistics
Harvard University
Hosted by: Ying Wei, PhD

Applied researchers are increasingly interested in whether and how treatment effects vary in randomized evaluations, especially variation not explained by observed covariates. Ding proposes a model-free approach for testing for the presence of such unexplained variation. To use this randomization-based approach, Ding must address the fact that the average treatment effect, generally the object of interest in randomized experiments, actually acts as a nuisance parameter in this setting. Ding explores potential solutions and advocates for a method that guarantees valid tests in finite samples despite this nuisance. Ding also shows how this method readily extends to testing for heterogeneity beyond a given model, which can be useful for assessing the sufficiency of a given scientific theory. Ding finally applies this method to the National Head Start Impact Study, a large-scale randomized evaluation of a federal preschool program, finding that there is indeed significant unexplained treatment effect variation.
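As a rough illustration of the general idea — a randomization-style test for effect heterogeneity with the average treatment effect plugged in as a nuisance — consider the following toy sketch. This is a generic permutation test with a simple Kolmogorov-Smirnov statistic, not Ding's actual procedure (which guarantees finite-sample validity despite the nuisance); all names are invented.

```python
import random

def heterogeneity_test(y_t, y_c, n_perm=2000, rng=None):
    """Toy permutation test for treatment effect heterogeneity.

    Under the null of a constant additive effect tau, treated outcomes
    shifted by -tau and control outcomes share one distribution.  We
    plug in the estimated ATE for the nuisance tau (a simplification;
    Ding's method handles the nuisance more carefully) and permute.
    """
    tau_hat = sum(y_t) / len(y_t) - sum(y_c) / len(y_c)
    shifted = [y - tau_hat for y in y_t] + list(y_c)

    def ks_stat(a, b):
        # two-sample Kolmogorov-Smirnov distance between empirical CDFs
        pooled = sorted(set(a) | set(b))
        return max(abs(sum(x <= v for x in a) / len(a)
                       - sum(x <= v for x in b) / len(b)) for v in pooled)

    n_t = len(y_t)
    obs = ks_stat(shifted[:n_t], shifted[n_t:])
    rng = rng or random.Random(0)
    count = 0
    for _ in range(n_perm):
        perm = shifted[:]
        rng.shuffle(perm)
        if ks_stat(perm[:n_t], perm[n_t:]) >= obs:
            count += 1
    return (count + 1) / (n_perm + 1)  # permutation p-value
```

A constant shift between arms yields a large p-value, while arms that differ in spread (heterogeneous effects) yield a small one.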

February 12

Designing Clinical Studies of Targeted Drugs Across Cancer Modalities
Lorenzo Trippa, PhD
Assistant Professor, Department of Biostatistics
Harvard University
Hosted by: Ying Wei, PhD

Advances in cancer research have shown that tumors have heterogeneous genetic events, many of which are targetable by anticancer agents. Tests of treatment-biomarker interactions tend to be underpowered when done as secondary objectives in trials of drug efficacy. Clinical trials that make use of adaptive randomization and Bayesian prediction have been proposed as more efficient for investigating multiple agents and predictive biomarkers. Trippa proposes that clinical studies enrolling patients with multiple cancer modalities (CSMCM) would greatly contribute to statistical learning and accelerate the pace at which new drugs are studied. In silico simulations to evaluate the benefit of CSMCMs in a research portfolio require accurate parameters of (1) accession of patients by disease modality, and (2) joint prevalence of target gene mutations. At the Dana-Farber Cancer Institute (DFCI), a research study was initiated in 2011 to parallelize molecular profiling with routine histopathology at diagnosis or disease progression, and to date has assayed >5000 patients across 11 disease centers. Trippa presents the design and characteristics of in silico studies parameterized from this cohort.

February 26

Bayesian Nonparametric Functional Models for High-Dimensional Genomics and Imaging Data
Veera Baladandayuthapani, PhD
Associate Professor, Department of Biostatistics, Division of Quantitative Sciences
MD Anderson Cancer Center, University of Texas
Hosted by: Jeff Goldsmith, PhD

Due to rapid technological advances, various types of genomic, epigenomic, transcriptomic, and proteomic data with different sizes, formats, and structures have become available. These experiments typically yield data consisting of high-resolution genetic changes of hundreds or thousands of markers across the whole chromosomal map. Modeling and inference in such studies are challenging, not only because of the high dimensionality but also because of the presence of structured dependencies (e.g., serial and spatial correlations). Using genome continuum models as a general principle, Baladandayuthapani presents a class of Bayesian methods to model these genomic profiles using functional data analysis approaches. Baladandayuthapani's methods allow for simultaneous characterization of these high-dimensional functions using nonparametric basis functions, joint modeling of spatially correlated functional data, and detection of local features in spatially heterogeneous functional data to answer several important biological questions. Baladandayuthapani illustrates his methodology using several real and simulated datasets and proposes methods to integrate various types of genomics and imaging data as well.

March 5 

Statistical Analysis of Epigenomic Data 
Kasper Daniel Hansen, PhD
Assistant Professor, McKusick-Nathans Institute of Genetic Medicine and Department of Biostatistics
Johns Hopkins University
Hosted by: Shuang Wang, PhD

Currently, biologists are very interested in analyzing high-throughput epigenetic data. Hansen starts with a brief introduction to epigenetics, illustrated with an example from clinical epigenetics, then describes methodology for the analysis of DNA methylation data and discusses population-level variation in histone modifications.

March 12 

Some Experience with Biomarker Driven Cancer Clinical Trials 
Full Member, Fred Hutchinson Cancer Research Center
Hosted by: Bin Cheng, PhD

The incorporation of genomic profiling into cancer clinical trial designs has been realized in several disease settings. Designs include a wide variety of randomized Phase II, Phase III, and Phase II/III strategies and typically focus on inferences involving one or more genomically defined subgroups of patients. The speaker describes experience in designing and implementing such biomarker-driven trials in several disease settings, including gastrointestinal and lung cancer, as part of the United States NCI National Clinical Trial Network. This discussion includes the recently activated biomarker-driven clinical trial for patients with advanced squamous cell lung cancer (Lung-MAP), which utilizes genomic profiling to assign patients to multiple randomized sub-studies. The relative efficiency of a multi-sub-study targeted protocol versus a single targeted design is also investigated.

March 26 

A Two-Part Mixed-Effects Modeling Framework for Analyzing Whole-Brain Network Data 
Sean L. Simpson, PhD
Assistant Professor, Department of Biostatistical Sciences
Wake Forest School of Medicine
Hosted by: DuBois Bowman, PhD

Whole-brain network analyses remain the vanguard in neuroimaging research, having come to prominence within the last decade. Network science approaches have facilitated these analyses and allowed the brain to be examined as an integrated system. However, statistical methods for modeling and comparing groups of networks have lagged behind. Fusing multivariate statistical approaches with network science presents the best path toward developing these methods. Toward this end, Simpson proposes a two-part mixed-effects modeling framework that models both the probability of a connection (the presence or absence of an edge) and the strength of a connection if it exists. Models within this framework enable quantifying the relationship between an outcome (e.g., disease status) and connectivity patterns in the brain while reducing spurious correlations through the inclusion of confounding covariates. They also enable predicting an outcome from connectivity structure and vice versa, simulating networks to gain a better understanding of normal ranges of topological variability, and thresholding networks by leveraging group information. Thus, they provide a comprehensive approach to studying system-level brain properties to further our understanding of normal and abnormal brain function.
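The two-part idea can be caricatured for a single edge: the likelihood combines a presence/absence term and a conditional strength term. The toy function below is a deliberately simplified sketch (no mixed effects or covariates, unlike Simpson's framework), with all names invented.

```python
import math

def two_part_loglik(w, p_edge, mu, sigma):
    """Toy two-part log-likelihood for one network edge weight w.

    p_edge    : probability the edge is present (binary part)
    mu, sigma : mean/sd of edge strength given presence (continuous part)
    """
    if w == 0:
        # Edge absent: contributes only through the presence/absence part
        return math.log(1.0 - p_edge)
    # Edge present: presence probability times a Gaussian density
    # for the observed strength
    dens = math.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return math.log(p_edge) + math.log(dens)
```

In the full framework, `p_edge` and `mu` would each be modeled with fixed and random (subject-level) effects, which is what lets the model borrow strength across subjects and adjust for confounders.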

April 2 

Identifying and Correcting for DNA Sample Contamination in Large-Scale Genetic Studies 
Michael Boehnke, PhD
Director, Center for Statistical Genetics and Genome Science Training Program
University of Michigan School of Public Health
Hosted by: Iuliana Ionita-Laza, PhD

DNA sample contamination is a frequent problem in DNA sequencing studies, and may result in genotyping errors and reduced power for association testing. In this talk, Boehnke first describes methods to identify DNA sample contamination based on sequencing read data and shows that the methods reliably detect and estimate contamination levels as low as one percent. He then describes methods to model contamination during genotype calling as an alternative to removal of contaminated samples and demonstrates that these methods can recover much of the information lost due to contamination. This is joint work with Matthew Flickinger, Goo Jun, Goncalo Abecasis, and Hyun Min Kang.
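The flavor of contamination estimation from read data can be conveyed with a deliberately simplified toy model (not Boehnke's actual method): suppose that at a set of sites the main sample is homozygous reference while the contaminating sample is homozygous alternate, so the expected fraction of alternate-allele reads equals the contamination level. A grid-search maximum likelihood estimate is then a few lines; all names here are invented.

```python
import math

def estimate_contamination(alt_counts, depths, grid=None):
    """Toy MLE of a contamination fraction from read counts.

    At each site, alt_counts[i] alternate reads out of depths[i] total
    are modeled as Binomial(depths[i], a), where a is the contamination
    level.  We maximize the binomial log-likelihood over a grid.
    """
    if grid is None:
        grid = [i / 1000 for i in range(1, 500)]  # 0.1% .. 49.9%

    def loglik(a):
        ll = 0.0
        for k, n in zip(alt_counts, depths):
            # binomial log-likelihood, dropping the constant C(n, k) term
            ll += k * math.log(a) + (n - k) * math.log(1 - a)
        return ll

    return max(grid, key=loglik)
```

With deep sequencing, even a 1-2% contamination level shifts these read fractions enough to be detectable, which is consistent with the sensitivity described in the talk.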

April 9 

Physical Activity Evaluation: Importance of the Raw Accelerometry Signal Collection 
Jarek Harezlak, PhD
Assistant Professor, Department of Biostatistics
Fairbanks School of Public Health, Indiana University
Hosted by: Todd Ogden, PhD

Accelerometers are widely used for activity measurement in large observational studies. Such devices capture aspects of movement that are impossible to collect from self-report questionnaires. Using the collected raw accelerometry data, Harezlak develops methods for identifying and characterizing common activities, including walking and car driving, in the free-living environment. Although existing methods can identify a number of activities in laboratory settings, the free-living environment presents a major methodological challenge. Harezlak focuses on activities sustained for at least a few seconds. His methods are robust to within- and between-subject variability, measurement configuration, and device rotation. The proposed methods are computationally efficient, allowing a week of raw accelerometry data (~150M measurements) to be processed in half an hour on an average laptop computer. The methods are motivated by and applied to free-living data from the Developmental Cohort Study (DECOS), in which 49 elderly subjects were monitored for one week each.

April 16 

BMI and Death 
Dean Foster, PhD
Yahoo Labs, NYC
Department of Statistics, University of Pennsylvania (on leave)

Foster discusses how body-mass index (BMI = mass/height squared) can be used to predict length of life from a data set of half a million people, and argues that we can get a great deal of mileage out of simply treating BMI as a continuous variable rather than putting people into bins. From this, Foster posits that we can estimate interaction terms and end up with a personalized optimal BMI.
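The BMI formula, and the way a continuous treatment of it yields a personalized optimum, can be sketched as follows. The quadratic log-risk form and its coefficients are invented for illustration; Foster's actual model is not specified in the abstract.

```python
def bmi(mass_kg, height_m):
    """Body-mass index: mass (kg) divided by height (m) squared."""
    return mass_kg / height_m ** 2

def optimal_bmi(beta1, beta2, gamma, x):
    """Toy personalized optimum (hypothetical model).

    If log mortality risk is modeled as beta1*b + beta2*b**2 + gamma*x*b
    for BMI b and a personal covariate x (with beta2 > 0 so risk is
    U-shaped), setting the derivative to zero gives the minimizing BMI:
        b* = -(beta1 + gamma * x) / (2 * beta2)
    The interaction term gamma*x is what shifts the optimum per person.
    """
    return -(beta1 + gamma * x) / (2.0 * beta2)
```

Binning BMI would flatten the risk curve within each bin; the continuous model keeps the curvature that makes an optimum estimable at all.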

April 23 

Overcoming Bias, Systematic Error and Unwanted Variability in Next Generation Sequencing 
Rafael Irizarry, PhD
Professor of Biostatistics and Computational Biology, Dana Farber Cancer Center
Professor of Biostatistics, Harvard School of Public Health
Hosted by: Jeff Goldsmith, PhD

In this talk, Irizarry demonstrates the presence of bias, systematic error, and unwanted variability in next generation sequencing. He does this using data from public repositories as well as his own, and then describes some preliminary solutions to these problems. These include 1) methods for RNA transcript estimation that are robust to sequencing artifacts, 2) statistical methods that estimate heterogeneous cell composition in DNA methylation data, and 3) methods that account for protocol-induced bias in genome-wide enrichment scans (e.g., ChIP-seq and DNase I-seq).


April 30 

Prediction of Individual Long-Term Smoking Outcomes Based on a Multivariate Cure-Mixture Frailty Model 
Yimei Li, PhD
Assistant Professor of Biostatistics
Perelman School of Medicine, University of Pennsylvania
Hosted by: Ken Cheung, PhD

In smoking cessation trials, subjects commonly receive treatment and report daily cigarette consumption over a period of several weeks. Although the outcome at the end of this period is an important indicator of treatment success, substantial uncertainty remains about how an individual's smoking behavior will evolve over time. It is therefore of interest to predict long-term smoking cessation success from short-term clinical observations. Li develops a Bayesian method for prediction based on a novel cure-mixture frailty model. The proposed frailty model describes the complex process of transition between abstinence and smoking, allows for probabilities of permanent quits and lapses ('cure' probabilities), and accounts for heterogeneity in the data (frailty). Based on this model, a two-stage prediction algorithm is developed that first uses importance sampling to generate subject-specific frailties from their posterior distributions conditional on the observed data, and then samples predicted future smoking behavior trajectories using the estimated model parameters and the sampled frailties. Li applies the method to data from a randomized smoking cessation trial comparing bupropion to placebo. Comparisons of actual smoking status at one year with predictions from this model and from a variety of empirical methods suggest that Li's method gives excellent predictions.
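The two-stage structure — importance sampling of subject-specific frailties, then simulation of future trajectories — can be caricatured with a generic importance-resampling sketch. Everything here is invented for illustration (lognormal frailty prior, user-supplied likelihood and outcome simulator); it is not Li's implementation.

```python
import math
import random

def two_stage_predict(obs_loglik, simulate_future, n_draws=1000, rng=None):
    """Toy two-stage predictive algorithm (hypothetical sketch).

    Stage 1: draw frailties from a lognormal prior and weight each by
    the likelihood of the subject's observed short-term data.
    Stage 2: resample frailties proportional to weight, then push each
    through the outcome model to get predicted future trajectories.
    """
    rng = rng or random.Random(0)
    # Stage 1: importance sampling of the frailty posterior
    frailties = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(n_draws)]
    logw = [obs_loglik(z) for z in frailties]
    m = max(logw)
    weights = [math.exp(lw - m) for lw in logw]  # stabilized weights
    total = sum(weights)
    probs = [w / total for w in weights]
    # Stage 2: resample and simulate a future outcome per frailty
    resampled = rng.choices(frailties, weights=probs, k=n_draws)
    return [simulate_future(z, rng) for z in resampled]
```

For example, if the observed data favor low frailty (`obs_loglik = lambda z: -z`) and low-frailty subjects tend to stay abstinent, the predicted abstinence fraction exceeds the prior one.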