
Wednesday, September 15, 2010

Data Management

Whilst I'm not normally a data management guru, I'm sitting in on the Data Management session - and only partly for the benefit of the GridPP storage group! This session is all about the high level, longer term requirements for data. There are three main speakers today.

Firstly, David Giaretta reporting for the High Level Expert Group on Scientific Data. (Phew, that's a title!). The group's purpose is to present advice to the commission, as part of the Digital Agenda for Europe, looking forward to 2020 - but overachievers that they are, they went to 2030! The general context of this is an almost exponentially increasing volume of data, although not all of equal value (the ESA Earth Observatory data is a good example of high value, irreplaceable data).

The vision that they offer talks about having stakeholders aware of the value of data; researchers having access to data; producers wanting to open up their data; more public funding to enable research; industry benefiting from this; and citizens having greater awareness and confidence in science (along with better, more open governance).

This is, indeed, a grand vision, and whilst I agree that long term visions should be optimistic, I can't see how to get there from here. The initial wish list presented, from the PARADE paper, didn't give me a coherent feeling for how it would move things towards the goals. I'm putting this down as one report that I'll need to read in detail.

Next up is Matti Heikkurinen, on the e-IRG Data Management Task Force. e-IRG is an inter-governmental policy body, and this is a strong counterpart to the previous talk, looking at the present in the first instance, and what to do next.

The 3 key findings concerned meta-data, data quality and interoperability. Meta-data and data quality are key for re-use, but they're not free; open access makes peer review possible, but it's not a magic bullet, and there was a suggestion of some mark of quality. Interoperability is most problematic in cross-disciplinary uses, and resource level and semantic interoperability should be handled differently.

The final speaker, Rosette Vandenbrouke, turns up the specificity again, talking about Digital Cultural Heritage. After some background came an interesting set of results from a survey. It was clear that vocabulary was a problem - infrastructure providers and DCH groups have disjoint vocabularies; but the most startling point was that none of the DCH groups seemed to be even aware of Data infrastructure projects.

The requirements listed, however, were for reliable, long term, always available storage, along with good meta-data. The meta-data is probably solvable, but the combination of reliable and always available storage is expensive, and that might be difficult to provide in large quantities. Identity management is also an issue here - but one unusual aspect is that some things will be publicly available.

Neil Geddes brought a few wrap-up slides, from which the key point was about aiming low: distilling, from a number of very high level, complex use cases, some lower level services that could be used to build such use cases.

An interesting session, but it did, at times, feel rather far removed from what the Grid currently does well. But then, if we only looked at the areas we do well, we'd never get better!
