Tuesday, April 24, 2012

Dealing with the Data Deluge: A silver lining is in the Cloud

US, Milwaukee, Wisconsin—Seventy-one multidisciplinary attendees from 23 US institutions recently enjoyed Data Symposium 2012, a one-day event co-sponsored by the Clinical and Translational Science Institute of Southeast Wisconsin (CTSI) and SeWHiP (Southeast Wisconsin High Performance Cyberinfrastructure). Eight data-intensive research experts shared storage solutions for research communities, a model for long-term funding, and tools to facilitate data transfer and sharing. US National Institutes of Health (NIH) and National Science Foundation (NSF) representatives explained funding agencies’ expectations regarding the stewardship and sharing of grant-funded research data.

The first keynote speaker was Ian Foster (University of Chicago). Foster is co-architect of the U.S. National Science Foundation’s eXtreme Science and Engineering Discovery Environment (XSEDE), and director of the Computation Institute, a joint endeavor of UChicago and Argonne National Laboratory. His presentation was entitled “Rethinking cyberinfrastructure for massive data, but modest budgets.” Foster described a critical situation that has been developing for more than 15 years: global demand for services increased, but funding did not, and service providers struggled to maintain the status quo as research data mushroomed. The situation led to a paradigm shift in the way resources are provided. UChicago’s Globus Online team is developing cloud-based software-, platform-, and infrastructure-as-a-service solutions that facilitate access to, and sharing and management of, research data. In doing so, they have removed many barriers to entry, making it possible for millions of researchers to access powerful computational tools for the first time. Rachana Ananthakrishnan (UChicago), a member of Foster’s research team, presented an overview of the Globus Online platform and demonstrated how it is used to move data efficiently across multiple, disparate security domains.

Brian Athey (University of Michigan) is an expert in the emerging field of translational bioinformatics and in research cyberinfrastructure. As principal investigator of multiple data-intensive NIH-funded projects, such as the Visible Human Project, Athey is well acquainted with the data dilemma. Many of his concerns mirror Foster’s, with an added emphasis on sharing. Athey said that we need a factory, not a warehouse, and that too many brick walls currently prevent access and sharing. Most research comes from academic institutions, where silos of bureaucracy discourage sharing before publication. In addition, many institutions legally prevent faculty from making their research available to others in an effort to protect intellectual property. This keeps some key findings from being discovered, which is why Athey sees a need for pre-competitive data sharing. Contributing to the situation are laws intended to protect the privacy of individuals, which have induced a “risk-averse culture” in academic medicine.

Panelist Clifford Lynch, Director of the Coalition for Networked Information, reminded attendees that data doesn’t have to be big to be important. A single spreadsheet can represent a tremendous research investment. Access to underlying data and tools is essential if the work is to be replicated or built upon by others without wasting effort and resources.

Serge Goldstein (Princeton) shared Princeton’s financial model for perpetual maintenance, called DataSpace, or Pay Once Store Endlessly (POSE), which is based on the premise that the cost of storage will continue to decline. Goldstein explained that the natural expiration of research data is typically tied to the tenure or life of the principal investigator. With POSE, storage can remain funded long after two- or four-year grant initiatives run out of money. This buys time to determine whether the data is of interest to a broader community.
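The reasoning behind a pay-once model can be illustrated with a little arithmetic. If the yearly cost of keeping a dataset falls by a constant fraction each year, the yearly costs form a geometric series with a finite sum, so a single up-front payment can in principle cover storage indefinitely. The sketch below is only an illustration of that premise: the function name, the constant rate of decline, and the dollar figures are assumptions made for the example and are not Princeton's actual pricing or formula.

```python
def pose_one_time_fee(first_year_cost, annual_decline, years=None):
    """Total storage cost if the yearly cost shrinks by a fixed fraction.

    first_year_cost: cost of storing the data in year one (hypothetical).
    annual_decline:  assumed fractional drop in cost per year, e.g. 0.2.
    years:           number of years to fund; None means "forever".
    """
    if not 0 < annual_decline < 1:
        raise ValueError("annual_decline must be strictly between 0 and 1")
    keep = 1.0 - annual_decline  # fraction of last year's cost paid this year
    if years is None:
        # Infinite geometric series: c + c*keep + c*keep**2 + ... = c / decline
        return first_year_cost / annual_decline
    # Finite geometric series over the requested number of years
    return first_year_cost * (1.0 - keep ** years) / annual_decline


if __name__ == "__main__":
    # Illustrative numbers only: $100 in year one, costs falling 20% per year.
    print("10 years:", round(pose_one_time_fee(100.0, 0.2, years=10), 2))
    print("forever: ", round(pose_one_time_fee(100.0, 0.2), 2))
```

Under these assumed numbers the cost of storing the data forever is only a small multiple of the first-year cost, which is why the model depends so heavily on storage prices continuing to fall.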

Michael Huerta (NIH) explained the 2003 NIH data sharing policy. While sharing all data may not be useful, there are key issues to consider when deciding whether and what to share. Sylvia Spengler (NSF) said that since January 2011, NSF has required a Data Management Plan (DMP) for all grant-funded research data. An effective DMP identifies which data is worth saving for more than a couple of years and how that data will be managed and shared. Plans are subject to peer review, and the review panel determines whether the DMP meets the needs of the research community. Many plans focus on limited terms of under 20 years, with provision for reassessment at the end of that term.

John Cobb (Oak Ridge National Laboratory) presented progress made by NSF’s DataONE/DataNet toward a federated solution, as well as the California Digital Library's efforts to improve data curation. He said that because of the NSF-required DMP, data management is now an ‘allowable cost,’ and therefore an opportunity for universities to make money. He also said that some are giving more thought to developing unified DMPs that allow information to be shared across disciplines. Lynch reinforced this remark, adding that the NSF DMP requirement began to mobilize the community in ways that previous data mandates, such as NIH’s, had not.

More information and presentation slides are available on the SeWHiP web site. For information about CTSI, visit:
