May 16, 2013

Gigaflop. Teraflop. Petaflop. Get used to hearing these terms. The era of Big Data has arrived in public health, and for the past few months, Mailman School faculty have had a powerful new tool at their fingertips: the C2B2 supercomputer, one of the fastest computer clusters in the world. Its 6,500 air-cooled silicon chips can perform upwards of 200 trillion calculations a second and sequence a section of genetic code before lunchtime.

Access to the supercomputer is the result of an agreement negotiated by Dean Linda P. Fried and Roger Vaughan, the School’s Vice Dean for Academic Advancement, with Prof. Andrea Califano, director of Columbia’s Center for Computational Biology and Bioinformatics, or C2B2 for short.

“I am always looking for ways to make your science maximally competitive and meet your aspirations for outstanding innovation,” said Dean Fried in announcing the agreement at an April 9 School Assembly. “High performance computing will deliver that; I am very excited to see this happen.”

Getting Started

High performance computing, or HPC for short, is “an essential tool for almost any operationalization of big data,” says Dr. Vaughan, who is also the Interim Department Chair of Biostatistics. Increasingly, grant funding is awarded based on researchers’ ability to analyze oversized datasets and do it fast. “You have to have high performance computing to stay competitive.”

To ensure that all faculty can start using the supercomputer, the Dean’s Office is paying for new user accounts through the first year. The rollout is being coordinated by Rebecca Yohannes (ry111 [at] cumc.columbia.edu), who was recently hired as the technical liaison between the School and C2B2.

A former senior programmer and analyst for the department of Biostatistics, Ms. Yohannes will get faculty signed up, answer questions, resolve any issues, load software, and even help researchers add the capability to their grant applications.

Going forward, faculty will be able to help one another through regular meetings of an anticipated HPC working group, sharing ideas and pushing each other to explore new terrain. “You don’t have to imagine all the possible applications yourself,” said Dr. Vaughan. “You can hear how other people are using it and see whether that will help you, and you can share how you are using it, so your colleagues can benefit from your experience.”

Pushing the Limits

Many faculty at the Mailman School already know their way around oversized data sets. Charles DiMaggio, associate professor of Epidemiology, has been working with a database called Medicaid MAX. The name doesn’t lie: 1.5 terabytes of information, including some 1.2 billion patient encounters. So far he has leaned on his Mac Pro desktop computer to crunch the numbers on prescription fills and psychiatric diagnoses to find the effect of the 9/11 terrorist attacks on mental health. But some more complex analyses have made his machine “seize up or grind to a halt.”

High performance computing “is like hearing the bugle call of the cavalry on the horizon,” says Dr. DiMaggio. The new supercomputer, he says, will influence his willingness to explore new ways to parse the data.

For Sally Findley, a professor in the departments of Population and Family Health and Sociomedical Sciences, having access to the supercomputer will open up new ways to analyze nearly half a million WIC records in pursuit of the question: Are New York City school-based obesity interventions working? As it stands, data analyses are performed by the nonprofit Public Health Solutions with the City Department of Health. Going forward, Dr. Findley says, they could be done in house—and expanded. “With high performance computing we could create and merge the files and do a multi-year analysis that explicitly controls for the change in impact over time.”

Another faculty member using an outside facility is Jeffrey Shaman, assistant professor of Environmental Health Sciences. Currently he is renting time on a Harvard supercomputer to simulate the spread of West Nile virus by creating a virtual wetlands, but he’s looking forward to the convenience and greater muscle of Mailman’s HPC.

“The simulation literally has hundreds of thousands of mosquitos flying around doing their business,” feeding on birds, laying eggs, and dying, he explains. A more powerful computer would help him make this simulation more life-like. “If I want to expand it and make it spatially more explicit with different habitats and whatnot, it’s going to require some additional computational power.”

Tal Gross, assistant professor of Health Policy and Management, has used a high performance computing facility on the Morningside campus to go through boatloads of data to see whether having insurance makes you more or less likely to go to an emergency room. Going forward, he sees even greater need for supercomputing. “It’s important that academics have access to high performance computing,” he says. “More and more hospitals not only have electronic medical records but are allowing research on those medical records.”

A few lucky Mailman School faculty members got an early start using the HPC capability as part of a pilot program. One of them is Jeffrey Goldsmith, assistant professor of Biostatistics. Along with Andrew Rundle, associate professor of Epidemiology, he is looking at the relationship between BMI and physical activity in 10- to 12-year-olds in Northern Manhattan and the Bronx. To do this, they arranged for the children to go about their days wearing accelerometers, devices that record movement. The resulting data were massive: approximately 100,000 observations in what will eventually be more than 100 children.

“When you start trying to do complex analyses, you need a computing cluster,” says Dr. Goldsmith.

On a day in May, one such intensive analysis was midway through being processed by the supercomputer. Dr. Goldsmith opened a window on his desktop computer to check on its progress—the window displays a text-based command-line interface. “It’s not super exciting to look at,” he admits. Yet a five-minute walk away in the basement room at the University’s Irving Cancer Research Institute where the supercomputer lives, billions of calculations are happening at that very moment.

This spring, the supercomputer became even more super with the installation of 532 additional brand-new HP servers—each loaded with multiple state-of-the-art processors—into rows of cooled lockers. The upgrade provided a serious boost to the facility’s computing power—just in time for Mailman faculty to start using it.

Learn more about high performance computing at the webpage of the Mailman School HPC Support Office.