Big Data Academy: Public Health Supercomputing

A new course in high performance computing teaches students to work with large datasets for public health.

January 5, 2016

Starting today, Mailman is the first among peer schools of public health to offer a class in supercomputing for large datasets. Through the Department of Public Health IT, Mailman students can register for Fundamentals of High Performance Computing (HPC), an introduction to public health supercomputing accessible to people without a computer science background. The class grew out of demand for researchers with experience working with large datasets.

“We're in a Big Data era,” says Rebecca Yohannes, director of High Performance Computing at Mailman. “There is a lot of information out there, and to make sense of it, you can’t do that on a personal computer.”

The course will focus on the structure and performance of modern computing and serve as an introduction to supercomputing methods. Students will get experience in parallel programming using scripts in Python and C that take full advantage of the 6,384 CPU cores and 23 terabytes of RAM that make up the HPC cluster managed by the Department of Systems Biology’s IT group. (By comparison, a MacBook Pro laptop has only four cores and 16 gigabytes of RAM.)
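For readers curious what parallel programming looks like in practice, here is a minimal sketch in Python. It is an illustration only, not material from the course: the function names (`split_into_chunks`, `partial_sum`, `parallel_mean`), the stand-in dataset, and the choice of four workers are all assumptions made for this example. The idea is to split a dataset into pieces, let each core work on one piece independently, and then combine the partial results.

```python
from multiprocessing import Pool


def split_into_chunks(values, n_chunks):
    """Split a list into n_chunks nearly equal contiguous pieces."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def partial_sum(chunk):
    """Work done independently by each worker: sum and count for one chunk."""
    return sum(chunk), len(chunk)


def parallel_mean(values, n_chunks, map_fn=map):
    """Combine per-chunk results into a single mean.

    map_fn defaults to the serial built-in map; pass pool.map to spread
    the chunks across worker processes instead.
    """
    results = list(map_fn(partial_sum, split_into_chunks(values, n_chunks)))
    total = sum(s for s, _ in results)
    count = sum(n for _, n in results)
    return total / count


if __name__ == "__main__":
    data = list(range(1_000_000))  # hypothetical stand-in for a large dataset
    with Pool(processes=4) as pool:  # assumption: four workers, as on a four-core laptop
        print(parallel_mean(data, n_chunks=4, map_fn=pool.map))
```

On a cluster the same pattern would typically be spread across many more workers, usually with tools suited to multiple machines rather than a single process pool.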

Representatives from Google, Amazon, and other companies have already signed on to give guest lectures during the seven-week course.

The system, once featured on the Top500 list of supercomputers worldwide, has been directed at Mailman by Yohannes since April 2013. So far, researchers have published at least 11 scientific papers using data analyzed on the HPC; another four papers are under journal review, and 10 others are being readied for publication.

Statistical geneticist Iuliana Ionita-Laza, associate professor of Biostatistics, currently uses the cluster to run genetic simulations on over 3 billion DNA base pairs. Andrew Ratanatharathorn, a second-year PhD student in Epidemiology, examines the epigenetics of PTSD through thousands of genetic fragments across a cohort of 800 people.

Associate Research Scientist Wan Yang and Associate Professor Jeffrey Shaman, both in Environmental Health Sciences, have relied on the HPC since 2013 to generate predictions similar to weather forecasts for diseases such as influenza, West Nile, and Ebola. To get an idea of the kind of work they do, imagine the graphical representation of wind on a meteorologist's weather map where, instead of wind, the computer-generated pattern depicts a disease outbreak as it sweeps across the country.

“Whatever the question,” says Ionita-Laza, “if you work with large datasets, you need HPC.”

The new class in supercomputing is the first of its kind to be offered at Columbia University and is run through the recently formed Department of Public Health IT at the Mailman School. Public health students receive preferential enrollment, but registration is open to the entire Columbia community until the cap of 20 attendees is reached. Students can register through January 7. Details are available on the Mailman course directory page.

As Yohannes works with researchers to get them up and running with the HPC, her first lesson is helping them overcome their fears. Demystifying supercomputing is the prerequisite to exploring new avenues in their work.

“In this day and age, these are essential skills, not only for public health,” she says. “If you're a researcher and you have a lot of data, you need to know this.”