If you aren’t regularly exposed to European Commission-speak
(and I don’t necessarily recommend it as a way of life), then the term
‘research infrastructures’ might be a little opaque. It refers to facilities,
resources and related services used by the scientific community to conduct
top-level research in their respective fields, ranging from social sciences to
astronomy, genomics to nanotechnologies. The term is pretty loose and doesn’t
necessarily just include infrastructures of the bricks and mortar or even
physical variety. According to the EC website,
examples can include “single large-scale research installations, collections,
special habitats, libraries, databases, biological archives, clean rooms,
integrated arrays of small research installations, high-capacity/high speed
communication networks, highly distributed capacity and capability computing
facilities, data infrastructure, research vessels, satellite and aircraft
observation facilities, coastal observatories, telescopes, synchrotrons and
accelerators, networks of computing facilities, (pause for breath) as well as
infrastructural centres of competence providing services.” RIs may be
‘single-sited’ (a single resource at a single location), ‘distributed’ (a
network of distributed resources), or ‘virtual’ (the service is provided
electronically).
Once you take on board this pretty wide definition, you
realise that RIs must run into the hundreds. EUDAT leader Kimmo Koski’s
estimate was about 500, maybe representing an investment of 100 billion euros
or more. As they become increasingly international and diverse, dealing with
data loads trending up to the zettabyte level, how do you support this huge
community in curating, storing and discovering its data?
One approach is the set-up of the EUDAT project itself, and
also the ESFRI cluster projects. These are EC-funded projects designed to
support disciplinary groupings of RIs that are on the roadmap of the European
Strategy Forum on Research Infrastructures (ESFRI). The roadmap is intended to
be a fast track to realisation for the RIs considered best of breed and most
likely to generate excellence in European science.
One year in, EUDAT is establishing a collaborative data
infrastructure (CDI) as a framework for the future, including common data services,
community support services, and a set of data generators and users. Communities
represented within EUDAT itself include life sciences (LifeWatch, VPH), earth
sciences (EPOS and IS-ENES) and humanities (CLARIN). Common requirements across
these communities are data staging, safe replication, simple storage,
authentication and metadata. Clearly you can’t have a different solution in each of
these areas for all 500 RIs. To date, EUDAT has seen promising progress with
service developments and proactive collaboration with user communities. “Our CDI
is evolving and heading in the right direction,” says Koski, “but we still have
a large task to tackle.”
For the ESFRI cluster projects, it’s important to identify
both the needs that are common to them all and those that are specific
to their individual communities, along with the relative priorities. Stephanie Suhr of
BioMedBridges described how the project
brings together 10 RIs in the biomedical area, administered through the
European Bioinformatics Institute (EBI). They already have a highly active and
widely accessed data infrastructure. Around 80% of their sites are generating
data, and many are using common standards. However, they may not then be going
on to share their data due to confidentiality concerns, or lack of maturity of
the data repository itself. Much of their data has ethical, legal or societal
implications and no project data can include personally identifiable
information. One case study for BioMedBridges is linking data on
hyperglycaemia in humans to similar but differently named data in mice. Here
they are adding value to existing data by bringing
previously separate communities together.
Wouter Los from ENVRI stressed the sheer diversity of the
data coming from this environmental sciences cluster, from DNA sequences to
radar interference data, aerial and satellite observations to species data,
marine sensors to plate tectonics. All of them have their own standards and
protocols. Some, like the EISCAT-3D radar, have significant real-time data
challenges: collecting, storing and cataloguing data; processing it into derived
data products that humans can understand; and analysing it to trigger real-time
alerts for follow-up. ENVRI are believers in a common data infrastructure to manage the
growing amount of data and improve interoperability between infrastructures and
across disciplines, as well as to clarify roles and responsibilities. For them,
the question is where EUDAT fits in.
CRISP combines 11 RIs in the area of physics, including experimental facilities
such as CERN, analytical facilities such as the ESRF synchrotron and instruments
such as the Square Kilometre Array telescope. Essentially these all take images
of everything from the very big to the infinitesimally small. Laurence Field concentrated on
the data and IT aspects of CRISP, where they are looking to improve cost
efficiency and data interoperability. CRISP foresees collaborations between the
cluster projects in the areas of usage and operational policies, use cases,
solutions and technology. Specific items for collaboration include identity
management, data management and data policy. According to Field, together they
need to address shared challenges, such as providing federated identity
management, data archiving and preservation (in formats that may become
obsolete as hardware goes out of date), data discovery, data access and policies.
DASISH brings together five ESFRI projects in the area of social sciences and
humanities, and aims for enhanced visibility and reusability of data, tools and
services. They foresee a need to create a common infrastructure, not just
strengthen community-specific ones, and make similar calls for realistic
solutions for data archiving, data access, single sign-on, unique identities
and persistent data identifiers. As with BioMedBridges, linking data to enhance
its value is key. Users want to be able to enrich data, for example by adding
notes and relating different parts of the data to other information. No current
applications can do this, especially for parts of data objects rather than the
whole thing.
So what did we learn from this parade of needs and
requirements? Perhaps rather obviously, we need both a top-down and a bottom-up
approach, to allow the circles of requirements centred on these communities to
overlap and fuse into real, sustainable solutions. Obvious, but not to be
underestimated. As Walter Stewart of Research Data Canada pointed out, we
would soon notice the lack of top-down structures such as EUDAT if we had to go
without them, as they do in Canada. Our top-level funding mechanisms in Europe
may not be ideal, but they’re certainly not to be dismissed lightly.