Let the Data Do the Talking
In recent years, Columbia Mailman School faculty have employed new techniques to harness big data to evaluate health challenges, ranging from multiple chemical exposures to motor control impairment in stroke patients to the spread of COVID-19. In January 2020, the School hosted the Data Science for Public Health Summit, convening experts from more than 60 schools of public health. And over the past year, the University’s Data Science Institute, the hub for collaborations between hundreds of faculty from every corner of the university, announced that a third of its seed grants will fund studies by Columbia Mailman School researchers. Most centrally, Columbia Mailman School and the Data Science Institute have launched a joint program in Data Science for Health, providing opportunities for public health students and postdocs to hone their skills. We gathered three faculty members to consider the challenges and opportunities of using data and teaching in this emerging field.
Let’s start by defining our terms. What is data science?
Kiros Berhane, PhD, chair of the Department of Biostatistics: Everyone has their own answer to this question. It’s a bit like the adage of three blind men describing an elephant—it all depends on where you are situated. There are different flavors of data science, from the purely algorithmic predictive approaches that computer scientists use, to methods favored by public health scientists, which align data science’s capabilities with understanding of biological mechanisms and employ statistical reasoning. In public health research, we work with a hypothesis and try to learn about what causes a certain disease in a way that makes sense in terms of human biology.
Jeff Goldsmith, PhD, associate professor of biostatistics: Data science includes a lot of things that we’ve always done in public health research—thinking about the questions we’d like to answer, understanding the context in which data came about, and using the most appropriate tool to answer these questions—and doing all of these things in a way that is transparent and reproducible. What data science adds to this is a new perspective, one that recognizes modern data sources, exploratory methods, and computational approaches, and that raises the floor on the expected proficiency in tools for working with data.
Jeanette Stingone, PhD, assistant professor of epidemiology: Rather than thinking about data science as a departure from biostatistics, we should think of it as an integration of traditional biostatistics and computational skills, grounded within substantive knowledge. That intersection is where we can find new ways to gain insights based on the massive and still growing information now available to us.
Berhane: As the ways we acquire and use data change, increasingly we’re going to be infusing artificial intelligence into the way we do things. We’re in a continuous state of evolution. The data and capabilities are always changing.
How are you each using data science to answer questions you wouldn’t have been able to answer otherwise?
Stingone: I’m an environmental epidemiologist focused on how communities are affected by multiple chemical exposures. Traditionally, we could look at one, two, maybe three chemicals at a time. Now, using machine learning, I can look at the health effects of many chemicals simultaneously. These methods are a starting point that allows me to see the data in a holistic way so I can ask better questions. They may trigger a hypothesis that can be tested using a method more closely tied to human biology—for example, that specific chemicals may interact with each other to disrupt the endocrine system and bring about disease.
Goldsmith: I study factors that are associated with physical activity—age, sex, season, measures of sociodemographic status, comorbidities, and others—often with a focus on understanding the determinants of activity in children. I’ve conducted similar analyses in older adults, in patients following surgery, and in healthy workers. These projects utilize vast amounts of data and seek to identify possible interventions that promote activity, or to evaluate the impact of such interventions. The methods I’ve developed for these analyses are grounded in functional data analysis, a branch of statistics that focuses on high-dimensional data with complex structures.
Berhane: I’m in a research hub for Eastern Africa dedicated to environmental health and climate change. Much of the data have inconsistencies in spelling, formatting, etc. One of our data managers had the idea of using machine learning techniques to clean the data, which has saved enormous effort. In my primary area of research, increasing complexity and volume of environmental exposures are moving me to employ machine learning techniques.
What do you make of so-called black box data science methods where data is fed into an algorithm, which then supplies an answer in a way that is opaque even to researchers? Are these methods at odds with public health science?
Berhane: Early on, computer programming techniques like neural networks and artificial intelligence were like a black box. Data science has come a long way. Today we’re in a middle area with greater transparency as dialogue has increased between health experts, biostatisticians, and computer scientists.
Stingone: Even if you’re using a really complex algorithm, you should have some rough idea of how it works. It’s important that we know where the data are coming from; we need to be thoughtful about our analysis, and the conclusions we draw from the results. Without this kind of grounding, you’re more likely to have ethical issues.
Can you say more about the ethical issues that may arise when using some of these data science methods?
Stingone: I’m currently working with the New York City health department as it uses machine learning tools to link various public health registries and create integrated health datasets. The problem is, data quality isn’t equal across populations and can affect whose data may be excluded from those datasets. Without careful attention to underlying differences in data quality by race or neighborhood, the information they get would be biased and would only benefit certain populations. These ethical considerations are very real. It’s an opportunity for public health to bring our expertise to the table.
Berhane: A high volume of data can give people a false sense of truth. When you have a lot of information, people think that whatever answer comes out of it has to be true. As Jeanette points out, these data can be imbalanced. Your answer is supposed to apply to everyone, but it was not driven by data from everyone. If you ignore the core principles of public health science, including how you designed your study and where the data come from, you’re in a garbage-in, garbage-out situation that reinforces inherent biases.
Goldsmith: Ethics is an area where public health researchers are really optimally positioned to make contributions. The training you get in public health includes quite a bit that has to do with ethics, with the impact of studies on populations, with being concerned about justice and fairness and consent and data use.
Today’s job market requires students to be familiar with data science. How do you teach data science to public health students?
Berhane: Training our students in data science for public health is different from training in data science in general. We have the responsibility to train them not to just be proficient with the techniques but to use them in a responsible way, to pay attention to biological principles and public health principles.
Goldsmith: I teach a class in data science that was originally intended for master’s students in Biostatistics, and is now made up of roughly half master’s students in Biostatistics and half students from other departments, including from across the University. Everywhere, students want this kind of training, and the training itself is evolving quickly. What counts as useful data skills now is completely different from what would have counted five or ten years ago.
Stingone: Our students want to study techniques such as machine learning. In Epidemiology, we focus on giving students a baseline knowledge of what they need to know so they can collaborate with others. Not every student needs to understand the linear algebra that underlies machine learning, but they should all be able to understand a scientific paper that describes how these techniques are used in public health.
Let’s talk about the future. Where do you see data science for public health going?
Goldsmith: We’re going to see a demystification of data science where you don’t have to be a computer engineer and know how to train a neural network to be a data scientist. I also think we’re going to see greater emphasis on ethics and reproducibility in data science outside of public health. Across the board, we’re going to see a lot more data scientists.
Stingone: Technology is making it so we will get more and more refined data on population health and new tools to make sense of that data. That’s going to be really exciting. We’re also going to see more collaborations. Data scientists don’t work alone. It’s a group activity.
Berhane: Data science is in an evolutionary state and will continue to change, hopefully for the better, as we have dialogue across disciplines. In public health, we will continue to emphasize careful study design and selection of data sources, inclusivity and representativeness of data, and careful interpretation of findings. I think it’s going to have a very bright future.
Tim Paul is the editorial director of communications and editor of the Transmission newsletter. He has written articles on forced migration and health, environmental health, and other subjects.