Tuesday, December 4, 2012

Gifts, gold and data at the e-Infrastructure Reflection Group meeting in Amsterdam

Jan Steen - Het Sint Nicolaasfeest

For Dutch children, this is an exciting time of year. On 5 December, Sinterklaas, a tall, bearded bishop-like figure, rides into town on a big white horse accompanied by his host of colourful helpers. Together they distribute presents to all the good children in the shoes they leave out expectantly before the fireplace (in fact I even saw a row of suspiciously grown-up-sized shoes ranged in the corridor at work).

This week I am in Amsterdam for an e-Infrastructure Reflection Group meeting, talking about all things data: big data, open data, secure data. Patrick Aerts of the Netherlands e-Science Centre emphasised the relevance of the local gift giving season by telling us that the word data itself comes from the Latin verb ‘dare’, to give, and means ‘something given’. Data could represent the discovery of a new genetic mutation leading to an improvement in healthcare, or it could be a bump in a line that means you’ve just confirmed a candidate for the elusive Higgs boson. Some of these gifts disappear quickly, like the Ferrero Rocher chocolates round my house at Christmas; others stay with you for the long term, more like a great book that you will read and re-read. Most data falls into the latter category, which is what gives rise to the problems that the e-IRG have gathered to consider – how to store data, how to access it, how to make the best use of it. And even how to define what it actually is.

Gudmund Høst, the e-IRG Chair, kicked off with a few quotes about data. “Raw data are like sewage, toxic if not handled properly,” wrote the Financial Times in 2012. Phillips Electric said “Data is like oil, it hasn’t much worth until you start processing it.” David McCandless queried this at TEDGlobal: “Data is the new oil? No, data is the new soil.” I guess he means that innovation and new science grow from data. Neelie Kroes, the EU Digital Commissioner, has also announced that ‘data is the new gold.’ At this point I found myself wondering if it was also the new black, the best thing since sliced bread and the next winner of X Factor.

The theme of data analogies continued with Rudiger Klein, of the Royal Netherlands Academy of Arts and Sciences. According to Klein, there is no doubt that the volume of data being generated is growing exponentially, and it has been variously described as a wave you can ride, or a data deluge that you presumably can’t. Peter Wittenburg of the Max Planck Society called for a flexible, open, global data infrastructure able to cope with the influx. Researchers currently spend a vast amount of time finding data, accessing it, checking it, compressing it and downloading tools to work on it. Trust is an issue, as we lack a culture of easy sharing. Ingrid Dillo of the Data Archiving and Networked Services told us that out of 400 researchers they surveyed, 70% said they keep their data on their own or a department computer. Asked why they weren’t sharing their data, they commonly said, “Because the data is mine!” Or because they worry someone might use it to discredit their findings, or they’re still analysing it.

Wittenburg urged the group to get some basics done – for example by setting up a worldwide accessible and usable system for persistent identifier (PID) registration and resolution to help with the trust problem. Chris Greer of the US National Institute of Standards and Technology told us that "PIDs are to data what TCP/IP is to networks," and reminded us that it’s the unpredictable that generates a lot of the value from data. We should rely on existing tools to get underway. “It’s not necessary to get it right on day one, just get it going on day one.”

Data may want to be free, but it also costs money. Wouter Los of Lifewatch at the University of Amsterdam told us that 80% of the cost of data is in discovering and negotiating access to it. About 80% of the remaining 20% is needed for understanding, trusting and manipulating the data set into a useful form. So only roughly 4% of the total cost lies in actually using the data.
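For anyone who wants to check the arithmetic, here is a minimal sketch of the split. The function name and the stage labels are my own, purely for illustration; only the two 80% figures come from the talk.

```python
from fractions import Fraction

def cost_breakdown(access_share=Fraction(80, 100),
                   prep_share_of_rest=Fraction(80, 100)):
    """Split the total cost of data into three shares that sum to 1.

    access_share: share spent discovering and negotiating access.
    prep_share_of_rest: share of what remains that goes on understanding,
    trusting and reshaping the data set.
    (Hypothetical helper; labels are illustrative, not from the talk.)
    """
    access = access_share
    remaining = 1 - access
    preparation = prep_share_of_rest * remaining
    use = remaining - preparation  # whatever is left for actually using the data
    return {"access": access, "preparation": preparation, "use": use}

shares = cost_breakdown()
for stage, share in shares.items():
    print(f"{stage}: {float(share):.0%}")
```

Exact fractions are used rather than floats so that the three shares add up to exactly one.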

Juan Bicarregui of STFC paraphrased the famous reduce, reuse, recycle mantra of the recycling community as repeat, repeal and repurpose. By repeating, you can validate data, while repeal leads to alternative hypotheses based on the data. Repurposing data is harder, but by supplying research data to a new field, new discoveries could result. For Leif Laaksonen, of the Research Data Alliance, the initiative should be led more in a 'bottom up' rather than 'top down' way. The community needs to develop the building blocks itself, and build bridges, both to connect datasets and also to link the people that use the data. And data itself is an e-infrastructure.

In the afternoon we split into working groups on content, services and everyone’s apparent favourite, governance. For the content group, the important factor is to build bridges between the scientists producing the data and those consuming it. Currently there are no tools for automatic checking of data quality. The services group want four sets of services: to find the data in the way the scientist prefers, to assess its quality when almost everyone is a data producer, to access it in the least restrictive way possible and to reuse it. The governance group laid down no hard and fast rules, but urged the community to look at the features that have to be in place to create governance rather than the actual elements themselves. The group noted that the northern hemisphere was very well represented at the meeting but the southern hemisphere much less so. With the Square Kilometre Array and the European Extremely Large Telescope being built in the coming years in the southern hemisphere, this region should also engage and is in fact willing to do so.

So by the end of day one, we had heard data defined as sewage, oil, soil, gold, a gift, a wave, a bridge and an e-infrastructure. As Norbert Meyer of the Poznan Supercomputing and Networking Center put it: "Making it simple is so complicated!"

1 comment:

Stefan Janusz said...

The new e-ScienceBriefing on Big Data came out on Friday. If Sinterklaas didn't put a copy in your shoe, you can download the PDF from here: