The Future Is: Data Science for Health

An unprecedented volume of health data demands a new generation of scientists equipped to translate data into improved health outcomes—and do so ethically.

November 6, 2023

Long before the advent of machine learning, interactive data visualizations, or the flurry of concern and wonder surrounding ChatGPT, there were public health officials manually collecting and cataloging health data, then analyzing those data by hand with the aim of improving the health of their communities. “Public health has always been a very data-oriented discipline,” says Jeff Goldsmith, PhD, associate dean of data science and associate professor of Biostatistics. “Today, we are seeing a natural progression and growing sophistication of analytic techniques that public health researchers are using to address the same fundamental questions that we always have.”

Goldsmith sees the arrival of artificial intelligence (AI), augmented intelligence, and machine learning as a natural evolution in public health. (Augmented intelligence itself evolved out of AI; it involves applying AI to enhance, rather than replace, human tasks and decision-making.) These tools are becoming increasingly essential to translate an unprecedented volume of data into population-wide health improvements. “We’ve moved from a world with a paucity of data to one with an overabundance of it,” says Moise Desvarieux, MD, PhD, MPH ’91, associate professor of Epidemiology. According to Nature Genetics, there were an estimated 2,314 exabytes of health data produced worldwide in 2020, up from 153 exabytes in 2013. (Five exabytes is thought to be equal to all the words ever spoken by humanity.) With this explosion of data comes tremendous potential to improve public health, but also the dangerous possibility that technology—or those who wield it—will exacerbate disparities.  
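To put those two figures in perspective, the short Python sketch below works out the growth they imply. The exabyte totals and years are the ones quoted above; treating the growth as a steady compound rate is a simplifying assumption made here for illustration, not a claim from the source.

```python
# Figures quoted in the article: exabytes of health data produced worldwide.
START_YEAR, START_EXABYTES = 2013, 153
END_YEAR, END_EXABYTES = 2020, 2314


def compound_annual_growth(start: float, end: float, years: int) -> float:
    """Return the constant yearly growth rate that turns `start` into `end`."""
    return (end / start) ** (1 / years) - 1


fold_increase = END_EXABYTES / START_EXABYTES
annual_rate = compound_annual_growth(
    START_EXABYTES, END_EXABYTES, END_YEAR - START_YEAR
)

print(f"Fold increase: {fold_increase:.1f}x")
print(f"Implied annual growth: {annual_rate:.0%}")
```

Under that assumption, the quoted totals amount to roughly a 15-fold increase over seven years, or growth of about 47 percent a year.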

Big Data, Getting Bigger

Health data now extend far beyond information that has traditionally been collected—demographics, environmental exposures, medical history, family history—to new sources such as continuously collected activity levels. Desvarieux offers the example of renting a Citi Bike in New York. “We know when and where the person got on and off the bike, the distance they rode, whether there was a hill, and the amount of time they spent riding.” Data from sources such as Citi Bikes, smartphones, and wearable devices present a rich opportunity for public health. “We have not only personal data, but also data on our environment, the quality of the air we breathe, the soil quality,” Desvarieux says. 

In research at Columbia Mailman School, Desvarieux and colleagues are using personal and environmental data, as well as genetic sequencing data, to estimate an individual’s personalized risk of developing a given chronic condition. Genetic sequencing technology can now paint deep and comprehensive pictures of individual genomes. Taken together, the data on behaviors, biology, risks, environment, genomics, and more can help public health researchers determine who may be at greater risk for adverse health outcomes, and the best ways to mitigate those risks. This quantity and diversity of information mark what Desvarieux calls “the new world” in public health data science. However it’s characterized, this abundance of data requires new skills from public health professionals.

Equipping a New Generation

The School recognizes this demand, and just graduated its first cohort of students from the MS Public Health Data Science track. Introduced three years ago, it has quickly become the most popular MS degree program track, with 54 new students this fall. “I don’t see demand slowing down any time soon,” says Kiros Berhane, PhD, the chair of Biostatistics. “All signs point to the need for more computationally heavy techniques.”

Berhane describes data science as an umbrella term encompassing a fusion of rigorous statistical principles (vitally important where health is concerned) and quickly evolving computer science–driven machine learning and AI techniques. “The discipline is about the ability to arrive at conclusions based on evidence you get from the data, coupled with machine learning and artificial intelligence techniques able to handle huge quantities of data,” he says. Students in the MS Public Health Data Science track learn skills including data reproducibility, management, and manipulation; how to use graphics effectively; dissemination and visualization of data; and web scraping, which involves gathering data from many different web sources when those data aren’t formatted or structured the same way. These students also focus on data science methods, including supervised learning (machine learning on labeled data) and unsupervised learning (machine learning on unlabeled data). With unsupervised methods, students learn how machines can identify hidden patterns in data that humans may not have observed.
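The difference between supervised and unsupervised learning can be made concrete with a minimal Python sketch. It is an illustration only, not material from the program's coursework: the measurements, the labels, and the function names (`fit_centroids`, `predict`, `two_means`) are all hypothetical. The supervised half learns from labeled examples using a nearest-mean rule; the unsupervised half is a bare-bones two-cluster k-means that never sees the labels.

```python
from statistics import mean

# Hypothetical toy measurements (one number per person); not real health data.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
# Labels are available only in the supervised case.
labels = ["low", "low", "low", "high", "high", "high"]


def fit_centroids(xs, ys):
    """Supervised: use the labels to compute the mean value of each class."""
    return {
        label: mean(x for x, y in zip(xs, ys) if y == label)
        for label in set(ys)
    }


def predict(centroids, x):
    """Assign x to the class whose mean is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def two_means(xs, iterations=10):
    """Unsupervised: split values into two groups without ever seeing labels."""
    centers = [min(xs), max(xs)]  # simple deterministic starting points
    for _ in range(iterations):
        groups = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


centroids = fit_centroids(values, labels)
print(predict(centroids, 4.9))  # classification learned from the labels

centers, groups = two_means(values)
print(groups)  # groups found from the values alone
```

In this toy case both approaches recover the same two groups. The point of the contrast is that the second one had to discover the grouping on its own, which is the sense in which unsupervised methods can surface patterns no one labeled in advance.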

Jeff Goldsmith, PhD Photo credit: Diana Reddy

Although required courses in the Public Health Data Science track are designed to give students a strong grasp of the theories, tools, methods, and terms that make up the vast world of AI, the practical application of these methods can’t be taught in a classroom alone. To that end, each student is required to complete a practicum. Students work on designing and proposing their practicum projects with their faculty advisors, then submit a report and give an oral presentation.

These practical applications tap into a range of tools and methods that are inherently broad—just as AI itself is broad. Goldsmith acknowledges how this breadth can make defining terms challenging. “It’s hard to pin down exactly what we mean by AI,” he says. “Artificial intelligence encompasses traditional statistical approaches but also convolutional neural networks,” which are used to identify patterns, including in images, as is the case with facial recognition.  

When defining AI in a public health context, Goldsmith says, “We’re trying to take relatively big, complicated information on individuals and understand how that changes their health outcomes.” In the case of cancer, for instance, a researcher might use AI to draw on an individual’s biological and genomic sequencing information but would also explore vast troves of data from the broader population. Then the insights could be coupled with environmental or exposure data to calculate an individual’s risk.

Beyond graduate program curricula, the School is also building a pipeline for undergraduates to enter the public health data science field. Several years ago, it received funding from the National Institutes of Health (NIH) to launch the Summer Institute in Biostatistics and Data Science at Columbia. Undergraduate students spend seven weeks learning about data science software, analytic tools, and responsible research conduct. They also tackle data analysis projects using data from the National Heart, Lung, and Blood Institute and the National Institute of Allergy and Infectious Diseases. The data come from clinical studies of chronic disease and infectious disease treatment and prevention. 

In working with practicing biostatisticians and the investigators actively engaged in these studies, students have a chance to see and experience firsthand how the principles of data science shape public health, and to contribute to the field. Students enrolled in this summer program have applied data science to projects ranging from analyses of dementia biomarkers to comparisons of schizophrenia treatments to the effect of expanded access to HIV treatment in Lesotho. The School welcomed its second cohort into the free program last summer.

Columbia Mailman School is also focused on partnering with scientists in the international community to share knowledge, including through a program with Addis Ababa University in Ethiopia and the University of Nairobi in Kenya. “Data science is a global phenomenon,” Berhane says. This program, born out of a $1.7 million award from the NIH, is meant to create new training opportunities in health data science in Eastern Africa. It’s part of a five-year, $74.5 million NIH initiative, and its goal is to support research projects focused on the ethical, legal, and social implications of data science research.

Through this grant, faculty members abroad are paired with Columbia faculty mentors who work with them through research, coursework, boot camps, and training. Then, in the fall, these faculty members visit Columbia Mailman School for a week, after which they bring their knowledge back to their own institutions and serve as peer mentors to subsequent scholars.

AI Tools for Better Health

As Columbia Mailman School researchers mentor and train the next generation of public health data scientists both on Columbia’s campus and abroad, they themselves continue to pave the way for data science’s role in improved public health outcomes. Desvarieux, for instance, is part of a team using AI to develop tools to personalize prevention for chronic conditions including asthma, cardiovascular disease, diabetes, respiratory diseases, and cancer. The project was one of eight to receive a Centennial Grand Challenge research grant from the School in 2022, a signal of its critical importance to the field of public health.

“We’re hoping to develop a semi-automatic algorithm that we could translate into tools that are impactful for any person or health professional at the user level,” Desvarieux says. The algorithm behind the tools learns from clinical research and real-world information spanning medical data, environmental exposure data, behavioral data, and biological data, among other categories.

To date, the health applications for these sophisticated data science tools have mostly involved personalizing treatment decisions for patients who already have a disease or condition. In contrast, Desvarieux’s tools would help determine when to intervene with certain prevention strategies depending on an individual’s risk factors. 

Another data science research project underway is the Interstitial Lung Disease Diagnostics Tool, through which Qixuan Chen, PhD, associate professor of Biostatistics, and her colleagues aim to help radiologists determine diagnoses more accurately and quickly. The tool, which is meant to help radiologists distinguish between chronic hypersensitivity pneumonitis, usual interstitial pneumonia, and nonspecific interstitial pneumonia, is designed to be accessed through a free, user-friendly app. It asks radiologists to specify the presence or absence of four CT scan features that Chen and her team identified as the most important in differentiating between diagnoses. Although the list of potential features could have been longer, Chen and her Biostatistics colleagues used a statistical method called Bayesian additive regression trees to pinpoint four. “The radiologists don’t need to answer 20 questions,” Chen says, noting that radiologists are already spread thin. “When we only talk about four important features, it makes their life easier, and they are more likely to adopt it.”

Chen and her team are still fine-tuning the tool, but the algorithm has already demonstrated a strong predictive ability she believes could help minimize errors and ensure patients get the right intervention at the right time.

Acknowledging and Ending Bias

Becoming a public health data scientist in 2023 means building and applying new algorithms, tools, code, and techniques to vast troves of data to improve health outcomes. Increasingly, it also means learning to recognize biases and to understand that these tools have a dangerous capacity to deepen health disparities. “Public health must take the lead in protecting health while leveraging new technologies to improve human health,” says Gary Miller, PhD, vice dean for research strategy and innovation. “There is extraordinary potential for AI and data science to improve human health, but there is also extraordinary potential for these systems to exacerbate disparities and create new health problems.” Berhane notes that we have seen this risk already in many settings (for example, criminal profiling, in which AI has fueled dangerous biases). During the COVID-19 pandemic, it came to light that the pulse oximeters used to measure blood oxygen levels were more often inaccurate when used on Black patients than on white patients. If complex algorithms meant to predict disease risk are trained using data that don’t adequately represent certain subgroups, those tools cannot be expected to work in those populations. The same is true of drugs and treatments clinically tested on homogeneous patient populations.
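One practical habit that follows from this concern is checking a model's performance separately for each subgroup instead of relying on a single overall number. The Python sketch below shows the idea under loud assumptions: the records, the group names, and the `accuracy_by_group` helper are hypothetical and invented for illustration, not drawn from any study mentioned here.

```python
from collections import defaultdict

# Hypothetical records: (subgroup, true outcome, model prediction). Not real data.
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 1),
    ("group_b", 1, 0), ("group_b", 0, 0),
]


def accuracy_by_group(rows):
    """Return {subgroup: (number of records, share of correct predictions)}."""
    counts = defaultdict(lambda: [0, 0])  # subgroup -> [correct, total]
    for group, truth, predicted in rows:
        counts[group][0] += int(truth == predicted)
        counts[group][1] += 1
    return {g: (total, correct / total) for g, (correct, total) in counts.items()}


overall = sum(truth == predicted for _, truth, predicted in records) / len(records)
per_group = accuracy_by_group(records)

print(f"overall accuracy: {overall:.2f}")
for group, (n, accuracy) in per_group.items():
    print(f"{group}: n={n}, accuracy={accuracy:.2f}")
```

In this made-up example the overall accuracy looks respectable because the larger group dominates the average, while the smaller group, with only two records, fares far worse. Reporting both the sample size and the accuracy for each group makes that kind of gap visible.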

“Big data can give you a false sense of comfort if misused,” Berhane says. “If the millions or billions of data points you have are not coming from the entire population that is being targeted for subsequent actions, then it’s actually dangerous.” Safeguarding the next generation of public health data scientists against these biases and disparities involves continuous learning, ethical discussions, and open discourse. It also involves ensuring that the researchers, faculty, and students engaging in this discourse themselves represent diverse populations. Diversity in the public health data science field, to that end, is crucial both in the U.S. and globally, Berhane says. “There are many sections of the world that don’t have the capacity to collect data, but decisions are being made for them based on data from elsewhere,” he adds. “A seat at the table in data science is powerful.”


Health and science reporter Caroline Hopkins is a 2019 graduate of Columbia Journalism School.

“The Future Is: Data Science for Health” was first published in the 2023-2024 issue of Columbia Public Health Magazine.