The grid and the cloud
are dead! Long live big open data! At least that seems to be the statement coming
from a number of recent meetings.
Last week saw an
exploratory meeting of the Research Data Alliance in Washington DC. Around 120 people from around the world had gathered to try
and understand the blocking issues in allowing researchers to access and share
data between disciplines, institutions and across countries. The issue here is
not just one of technical interoperability, but more broadly about building a
community that can propose and refine technical work to reduce the barriers to
a collaborative global data infrastructure.
The driving force
behind this initiative (projects and funding agencies in the USA and Europe) is not
just about achieving technical interoperability, but about ensuring open access to
the big data sets being generated by researchers and their exploitation by others
outside the generating community. An interesting example was presented by Chris
Greer from NIST. He cited how the release of NASA Landsat images in 2008 for
unrestricted use has now created an estimated value of $935M/year for the
environmental management industry. While it is not always necessary for
investments made by the public to yield economic returns, it certainly cannot hurt!
However, the challenge
in building any infrastructure is to balance the common needs against those
that are specific to a particular science domain. Each research community will
have developed its own vocabulary, its own metadata descriptions, and its own data
access services to expose the underlying data models. Where should the line be
drawn between common and domain-specific?
What are the common
mechanisms that are needed to allow different research communities to
collaborate and share data? While this is still a work in progress, some
consensus is emerging: for instance, the need for persistent data identifiers
that enable individual data sets and data objects to be described, discovered and
located. Authentication and authorization are still needed even when discussing
open science data, as funders like to know how the generated data is being used,
and it is possible that some data will be restricted to members of a collaboration
for some of the time.
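To give a concrete flavour of what a persistent identifier buys you, the short Python sketch below resolves a DOI through the public doi.org proxy to find the landing location of the object it names. This is only a minimal illustration; the identifier used is a placeholder, not a real dataset.

# Minimal sketch: resolving a persistent identifier (here a DOI) to the
# location of the data object it describes, via the public DOI proxy.
# The identifier below is a hypothetical placeholder; substitute a real DOI.
import urllib.request

def resolve_doi(doi: str) -> str:
    """Follow the DOI proxy's redirects and return the resolved landing URL."""
    with urllib.request.urlopen(f"https://doi.org/{doi}") as response:
        # urlopen follows HTTP redirects, so geturl() gives the final location.
        return response.geturl()

if __name__ == "__main__":
    print(resolve_doi("10.1234/example-dataset"))  # placeholder identifier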
Leif Laaksonen from
CSC described how, within Europe, the EUDAT (http://www.eudat.eu/)
project is examining some of these technical issues, with the recently started
ICORDI (http://www.icordi.eu/) project now
providing coordination and input into international activities such as the RDA.
Andrew Treloar related how the Australian National Data Service is helping
scientists transform data (typically unmanaged, disconnected, invisible and
single-user) into structured collections (managed, connected, findable and
reusable) that can provide more value.
At this week’s
Microsoft e-Science Workshop (co-located with IEEE e-Science and
Open Grid Forum 36 in Chicago), the focus on big data continued with sessions
dedicated to Open Data for Open Science. Using environmental science as an
example, with many illustrations drawn from NSF’s EarthCube (http://www.nsf.gov/geo/earthcube/)
initiative, the issues of data interoperability and reuse were again prominent.
The environment is an
excellent example of how data reuse is needed across different domains in order
to maximize knowledge discovery, given the inherent coupling between functions
within the natural ecosystem and their impact on society. For instance, how
do you ensure that satellite data can be coupled to land/sea/air observations
collected independently over many years? How should the data coming out of the many
instruments that make up an ocean observatory be integrated given the different
manufacturers and their data formats?
The focus in this work
is not so much on standard service interfaces as on standard data languages. Data
markup languages that describe the tracks of research vessels across the ocean,
the output from instruments, or the semantics of hydrologic observations are
examples of the many community-driven initiatives. Organisations such as the
Open Geospatial Consortium (OGC), composed of representatives from academia and
industry, play an important role for the environmental community due to the
geospatial nature of many of its datasets, and its standards form a basis for
much of the work that now takes place. The issue is about opening up your data
for access, not opening up your database!
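To make the “access, not database” point concrete, the Python sketch below shows roughly what a standards-based request looks like: a client asking an OGC Web Feature Service (WFS 2.0) for features over plain HTTP, never touching the underlying store. The endpoint URL and feature type name are hypothetical placeholders; only the request parameters follow the standard WFS key-value encoding.

# Minimal sketch: querying an OGC Web Feature Service (WFS 2.0) over HTTP.
# The service URL and feature type name are hypothetical placeholders.
import urllib.parse
import urllib.request

WFS_ENDPOINT = "http://data.example.org/wfs"  # hypothetical endpoint

params = {
    "service": "WFS",
    "version": "2.0.0",
    "request": "GetFeature",
    "typeNames": "obs:HydrologicObservation",  # hypothetical feature type
    "count": "10",
}

url = WFS_ENDPOINT + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as response:
    gml = response.read()  # GML/XML payload describing the requested features
print(gml[:200])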
Given the size and number of the environmental data sets generated from
instruments and simulations, converting this data into information and knowledge
presents many challenges. High Performance Computing can provide the raw
simulation output, and High Throughput Computing can help support ensemble
studies that explore the sensitivity of a simulation to its inputs. These local
resources can be supplemented by capabilities accessed through grids (PRACE,
EGI, OSG and XSEDE) and through commercial and publicly funded clouds.
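As a rough illustration of the high-throughput side, an ensemble study is often just the same simulation run many times across a sweep of input parameters, with the independent runs executed in parallel. The Python sketch below uses the standard multiprocessing module; the toy model function is a hypothetical stand-in for a real simulation code.

# Minimal sketch of an ensemble (parameter sweep) study run as independent,
# high-throughput tasks. The "simulate" function is a hypothetical stand-in
# for a real simulation.
from multiprocessing import Pool

def simulate(forcing: float) -> float:
    """Toy model: pretend response of the system to a single forcing value."""
    return 2.0 * forcing + 0.1 * forcing ** 2

if __name__ == "__main__":
    # Sweep the input parameter to explore the simulation's sensitivity.
    forcings = [0.5 * i for i in range(20)]
    with Pool() as pool:
        responses = pool.map(simulate, forcings)
    for f, r in zip(forcings, responses):
        print(f"forcing={f:.2f} -> response={r:.3f}")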
While standards form
one aspect of this discussion, they are not the only one: any standards need to
encompass the variety of different use cases and users.
So has the hype around big
data grown to the point where it has now swallowed up the cloud, which is itself
still bloated from gobbling up the grid?
One of the challenges
of big data is finding the infrastructure to analyse it! Here the cloud’s
business model of flexible and rapid provisioning of resources demonstrates
its strengths: for example, creating the storage needed to handle the
intermediate and final results generated from an HPC cluster provisioned on
demand. As the data used in the analysis will need to be retrieved from, or
placed in, persistent data stores, issues such as authenticated and authorized
access to these distributed resources become critical, which is a typical grid
scenario.
In moving from one
paradigm to another, it is important not to discard the experience and technology
gained previously.