Monday, October 8, 2012

Riding the data deluge on the shoulders of giants

The grid and the cloud are dead! Long live big open data! At least that seems to be the statement coming from a number of recent meetings.

Last week saw an exploratory meeting of the Research Data Alliance in Washington DC. Around 120 people from around the world gathered to try to understand what is blocking researchers from accessing and sharing data between disciplines, between institutions and across countries. The issue here is not just technical interoperability but, more broadly, building a community that can propose and refine its technical work to reduce the barriers to a collaborative global data infrastructure.

The driving force behind this initiative (projects and funding agencies in the USA and Europe) is not just achieving technical interoperability, but ensuring that the big data sets generated by researchers are openly accessible to, and can be exploited by, others outside the generating community. An interesting example was presented by Chris Greer from NIST. He cited how the release of NASA Landsat images in 2008 for unrestricted use had created an estimated value of $935M/year for the environmental management industry. While it is not always necessary for investments made by the public to yield economic returns – it cannot hurt!

However, the challenge in building any infrastructure is to balance the common needs against those that are specific to a particular science domain. Each research community will have developed its own vocabulary, its own metadata descriptions, its own data access services to expose the underlying data models. Where should the line be drawn between common and domain specific?

What are the common mechanisms that are needed to allow different research communities to collaborate and share data? While this is still work in progress, some consensus is emerging. One example is the need for persistent data identifiers that enable individual data sets and data objects to be described, discovered and located. Authentication and authorization are still needed even for open science data, as funders like to know how the generated data is being used, and it is possible that some data will be restricted to members of a collaboration for some of the time.
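To make the idea of persistent identifiers concrete, here is a minimal sketch in Python. It is not how any of the projects mentioned here implement identifiers. The class names, the `pid:example/...` identifier format and the `archive.example.org` URLs are all invented for illustration. It shows only the three roles named above (describe, discover, locate), plus a simple access check for data restricted to a collaboration.

```python
from dataclasses import dataclass, field


@dataclass
class DataRecord:
    """Metadata held against one persistent identifier (hypothetical schema)."""
    location: str      # where the data currently lives; this may change
    description: str   # free-text description used for discovery
    restricted_to: set = field(default_factory=set)  # empty set means open access


class IdentifierRegistry:
    """Toy in-memory registry mapping stable identifiers to movable locations."""

    def __init__(self):
        self._records = {}

    def register(self, pid, record):
        self._records[pid] = record

    def move(self, pid, new_location):
        # The identifier stays fixed while the storage location changes.
        self._records[pid].location = new_location

    def discover(self, keyword):
        # Return the identifiers whose description mentions the keyword.
        return sorted(
            pid for pid, record in self._records.items()
            if keyword.lower() in record.description.lower()
        )

    def resolve(self, pid, user=None):
        # Open data resolves for anyone; restricted data only for listed users.
        record = self._records[pid]
        if record.restricted_to and user not in record.restricted_to:
            raise PermissionError(f"{user!r} may not access {pid}")
        return record.location
```

The point of the sketch is that anyone citing the data set holds only the identifier, so the data can move between archives without breaking references, and the same lookup step is a natural place to apply access rules.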

Leif Laaksonen from CSC described how, within Europe, the EUDAT project is examining some of these technical issues, with the recently started ICORDI project now providing coordination and input into international activities such as the RDA. Andrew Treloar related how the Australian National Data Service is helping scientists transform data (generally unmanaged, disconnected, invisible and single-user) into structured collections (managed, connected, findable and reusable) that can provide more value.

At this week’s Microsoft e-Science Workshop (co-located with IEEE e-Science and the Open Grid Forum 36 in Chicago) the focus on big data continued with sessions dedicated to Open Data for Open Science. With environmental science as the running example, and many cases drawn from NSF’s Earth Cube initiative, the issues of data interoperability and reuse were again prominent.

The environment is an excellent example of how data reuse is needed across different domains in order to maximize knowledge discovery due to the inherent coupling between functions within the natural ecosystem and their impact within society. For instance, how do you ensure that satellite data can be coupled to land/sea/air observations collected independently over many years? How should the data coming out of the many instruments that make up an ocean observatory be integrated given the different manufacturers and their data formats?

The focus in this work is not so much on standard service interfaces but on standard data languages. Data markup languages that describe the tracks of research vessels across the ocean, represent the output from instruments, or capture the semantics of hydrologic observations are examples of the many community-driven initiatives. Organisations such as the Open Geospatial Consortium (OGC) – composed of representatives from academia and industry – play an important role for the environmental community due to the geospatial nature of many of its datasets, and form a basis for much of the work that now takes place. The issue is about opening up your data for access, not opening up your database!

Given the size and number of the environmental data sets generated from instruments and simulations, converting this data to information and knowledge poses many challenges. High Performance Computing can provide the raw simulation output and High Throughput Computing can help support ensemble studies to explore the sensitivity of the simulation. These local resources can be supplemented by capabilities accessed through grids (PRACE, EGI, OSG and XSEDE) and through commercial and publicly funded clouds.

While standards form one aspect of this discussion, they are not the only one, and any standards need to encompass the variety of different use cases and users.

So has the hype around big data grown to the point where it has now swallowed up the cloud, which is still bloated from gobbling up the grid?

One of the challenges of big data is finding the infrastructure to analyse it! Here the cloud’s business model of flexible and rapid provisioning of resources demonstrates its strengths. Creating the storage to handle the intermediate and final results generated from an HPC cluster provisioned on demand demonstrates the need for a flexible model. As the data used in the analysis will need to be retrieved from or placed in persistent data stores, issues such as authenticated and authorized access to these distributed resources become critical – a typical grid scenario.

In moving from one paradigm to another, it’s important not to discard the experience and technology gained previously.
