New data science institute to help scholars harness ‘big data’

In a world awash in data, UC Berkeley is meeting the flood head-on by establishing a new institute to support faculty, researchers and students in their efforts to mine this information in areas as diverse as astronomy and economics, genetics and demography.

Berkeley Nobelist Saul Perlmutter describes the new Berkeley Institute for Data Science at a Nov. 12 meeting in Washington, D.C., sponsored by the White House Office of Science and Technology Policy. He was joined by Ed Lazowska, director of the eScience Institute at the University of Washington, and Joshua Greenberg of the Sloan Foundation. (Photo by Sandy Schaeffer Photography for the NSF)

The Berkeley Institute for Data Science, to be housed in the campus’s central library building, is made possible by grants from the Gordon and Betty Moore Foundation and the Sloan Foundation, which together pledged $37.8 million over five years to three universities – UC Berkeley, the University of Washington and New York University – to foster collaboration in the area of “data science.”

The goal is to accelerate the pace of scientific discovery, with implications for our understanding of the universe, climate and biodiversity research, seismology, neuroscience, human behavior and many other areas.

The partnership was announced Nov. 12 at a Washington, D.C., event, “Data to Knowledge to Action,” sponsored by the White House and hosted by John Holdren, assistant to the President for Science and Technology and director of the White House Office of Science and Technology Policy.

“Big data is now a super-big deal,” Holdren said as he unveiled several high-impact collaborations – the Moore and Sloan foundations’ among them – involving federal agencies, private industry, academia, state and local governments, non-profits and foundations. Part of President Obama’s Big Data Initiative, which was announced last year, these collaborations will harness data in all areas to enhance economic growth and job creation, education and health, energy and environmental sustainability, public safety and global development.

“There are fields like physics and astrophysics, or statistics and demography, that have always used large sets of data, but now, more and more, people in all fields are asking questions that require similar big-data analysis techniques,” said Saul Perlmutter, UC Berkeley professor of physics, Lawrence Berkeley National Laboratory senior scientist and director of the campus’s new institute. “We are trying to help researchers and students who previously did not work with the large amounts of data that they do now.”

Perlmutter noted that today, scientists typically spend 80 percent of their data analysis time prepping the data, cleaning it and cueing it up for analysis.

“It’s clear we can do better to focus the attention of scientists on scientific problems and not on the trivial steps required to get data to the point where you understand what you are doing,” he said.

Critical barrier to advancing science

This is a “critical barrier to advancing science” and the reason the two foundations are focusing on data-driven research at universities, said Moore Foundation program officer Chris Mentzel.

An artist’s rendering of the planned data science center in Doe Library.

In Perlmutter’s own field of cosmology, he said that “in most projects I’ve been involved with, the software and analysis tools quickly became the limiting factor.”

“It’s possible to record everything at large scale today, creating huge amounts of data in all fields,” said David Culler, chair of UC Berkeley’s Department of Electrical Engineering and Computer Science and one of the co-principal investigators on the data science grant. “The Moore and Sloan foundations recognize this, and that the future of science and a large portion of the social sciences will be data-driven.”

Already, three out of four undergraduate students take introductory programming courses, and statistics is one of the fastest growing majors. This indicates that “students today absolutely see computing and other analytical skills – some programming knowledge – as an essential part of their education,” Culler said.

Some fields, such as physics and astronomy, have long dealt with reams of data. One of the differences today is that computers are not merely crunching numbers, but making inferences about the data that previously were the province of the researcher. A handful of Higgs bosons were discovered earlier this year by algorithms that recognized their signature among terabytes of data. Since the 1980s, in an effort led by Berkeley scientists, astronomers and physicists have been using computer algorithms that sift through data and find new exploding stars before any human sees the data. UC Berkeley seismologists have created algorithms that analyze ground shaking in the state and send out warnings within seconds of a major earthquake, without human intervention.

“Computing is not just a tool, it has become an integral part of the scientific process,” Culler said.

UC Berkeley researchers are already at the forefront of data science, as evidenced by the recent creation of the Social Sciences Data Laboratory (D-Lab) for data-intensive social science research; the AMPLab (Algorithms Machines People), which focuses on machine learning; the Simons Institute for the Theory of Computing; and a Masters of Data Science program in the School of Information.

The new initiative and data science center, however, will provide a central place for people to meet and collaborate and learn, Perlmutter said, as well as a center for collaboration with like-minded scholars at NYU and the University of Washington.

“This joint project will work to create examples at the three universities that demonstrate how an institution-wide commitment to data scientists can deliver dramatic gains in scientific productivity,” said Josh Greenberg, who directs the Sloan Foundation’s Digital Information Technology program.