Wednesday, October 24, 2012

Building better data for best of breed science: collaboration between the ESFRI cluster projects

If you aren’t regularly exposed to European Commission-speak (and I don’t necessarily recommend it as a way of life), then the term ‘research infrastructures’ might be a little opaque. It refers to facilities, resources and related services used by the scientific community to conduct top-level research in their respective fields, ranging from social sciences to astronomy, genomics to nanotechnologies. The term is pretty loose and doesn’t necessarily just include infrastructures of the bricks and mortar or even physical variety. According to the EC website, examples can include “single large-scale research installations, collections, special habitats, libraries, databases, biological archives, clean rooms, integrated arrays of small research installations, high-capacity/high speed communication networks, highly distributed capacity and capability computing facilities, data infrastructure, research vessels, satellite and aircraft observation facilities, coastal observatories, telescopes, synchrotrons and accelerators, networks of computing facilities, (pause for breath) as well as infrastructural centres of competence providing services.” RIs may be ‘single-sited’ (a single resource at a single location), ‘distributed’ (a network of distributed resources), or ‘virtual’ (the service is provided electronically).

Once you take on board this pretty wide definition, you realise that RIs must run into the hundreds. EUDAT leader Kimmo Koski’s estimate was about 500, representing an investment of perhaps 100 billion euros or more. As they become increasingly international and diverse, dealing with data loads trending up to the zettabyte level, how do you support this huge community in curating, storing and discovering its data?

One approach is the set-up of the EUDAT project itself, and also the ESFRI cluster projects. These are EC-funded projects designed to support disciplinary groupings of RIs that are on the roadmap of the European Strategy Forum on Research Infrastructures (ESFRI). The roadmap is intended to be a fast track to realisation for the RIs considered best of breed and most likely to generate excellence in European science.

One year in, EUDAT is establishing a collaborative data infrastructure (CDI) as a framework for the future, including common data services, community support services, and a set of data generators and users. Communities represented within EUDAT itself include life sciences (LifeWatch, VPH), earth sciences (EPOS and IS-ENES) and humanities (CLARIN). Common requirements across these communities are data staging, safe replication, simple storage, authentication and metadata. Clearly you can’t have a different solution in each of these areas for all 500 RIs. To date, EUDAT has seen promising progress with service developments and proactive collaboration with user communities. “Our CDI is evolving and heading in the right direction,” says Koski, “but we still have a large task to tackle.”

For the ESFRI cluster projects, it’s important to identify both the needs that are common to them all and those that are specific to their individual communities, along with the relative priorities. Stephanie Suhr of BioMedBridges described how the project brings together 10 RIs in the biomedical area, administered through the European Bioinformatics Institute (EBI). They already have a highly active and widely accessed data infrastructure. Around 80% of their sites are generating data, and many are using common standards. However, they may not go on to share their data, because of confidentiality concerns or the immaturity of the data repository itself. Much of their data has ethical, legal or societal implications, and no project data can include personally identifiable information. One case study for BioMedBridges is linking data on hyperglycaemia in humans to similar but differently named data in mice. This is an example of how they are adding value to existing data by bringing previously separate communities together.

Wouter Los from ENVRI stressed the sheer diversity of the data coming from this environmental sciences cluster, from DNA sequences to radar interference data, aerial and satellite observations to species data, marine sensors to plate tectonics. All of them have their own standards and protocols. Some, like the EISCAT-3D radar, face significant real-time data challenges: collecting, storing and cataloguing data; processing it into derived data products that humans can understand; and analysing it to trigger real-time alerts for follow-up. ENVRI are believers in a common data infrastructure to manage the growing amount of data and improve interoperability between infrastructures and across disciplines, as well as to clarify roles and responsibilities. For them, the question is: where does EUDAT fit in?

CRISP combines 11 RIs in the area of physics, including experiments such as those at CERN, analytical facilities such as the ESRF synchrotron and instruments such as the Square Kilometre Array telescope. Essentially these all take images of everything from the very big to the infinitesimally small. Laurence Field concentrated on the data and IT aspects of CRISP, where they are looking to improve cost efficiency and data interoperability. CRISP foresees collaborations between the cluster projects in the areas of usage and operational policies, use cases, solutions and technology. Specific items for collaboration include identity management, data management and data policy. According to Field, together they need to address shared challenges, such as providing federated identity management, data archiving and preservation (in formats that may become obsolete as hardware goes out of date), data discovery, data access and policies.

DASISH brings together 5 ESFRI projects in the area of social sciences and humanities, and aims for enhanced visibility and reusability of data, tools and services. They foresee a need to create a common infrastructure, not just strengthen community-specific ones, and have similar calls for realistic solutions for data archiving, data access, single sign-on, unique identities and persistent data identifiers. As with BioMedBridges, linking data to enhance its value is key. Users want to be able to enrich data, for example by adding notes and relating different parts of the data to other information. There are no current applications that can do this, especially for parts of data objects rather than the whole thing.

So what did we learn from this parade of needs and requirements? Perhaps rather obviously, we need both a top-down and a bottom-up approach, to allow the circles of requirements centred on these communities to overlap and fuse into real sustainable solutions. Obvious, but not to be underestimated, as Walter Stewart of Research Data Canada pointed out: we would soon notice the lack of top-down structures such as EUDAT if we had to go without them, as they do in Canada. Our top-level funding mechanisms in Europe may not be ideal but they're certainly not to be dismissed lightly.
