Tuesday, October 30, 2012

One significant step for standards, a massive move forward for metadata

Metadata is data about data. At the EUDAT 1st conference in Barcelona, it was perhaps the most fêted buzz-word around, and for a very good reason. From a technological standpoint, the data tsunami can only be prepared for by being able to organize those vast quantities of data into manageable chunks, which means ‘tagging’ it so it can be referred to, searched for and easily accessed in the future.

When Delicious (or, to give it its ‘proper name’ for stubborn stalwarts like me, del.icio.us[1]) arrived in 2003, the Web was already growing at an unbelievable rate. In 2000 there were a billion pages; by 2008, five years after Delicious’s inception, there were a trillion. Toolmakers, eager to solve problems in this age as in any other, were quick to provide solutions. Alongside improved search algorithms provided by Google and others, individual users of Delicious could curate their own ‘Web travel guide’, saving and signposting points and pages of interest by ‘tagging’ them however they liked – perhaps in categories relating to their hobbies, what they found funny, or what they were passionate about. The more socially-minded would carefully choose the tag words so that others could find them, and in this manner ‘social bookmarking’ was a major leap forward, building some of the foundations of the Web 2.0 era.
[1] For the record, I also believe del.icio.us was better than Delicious, despite it actually being the same thing.

In the same way, scientists, data scientists and e-infrastructure engineers have been thinking hard about how to add value to data by making it useful to others in the future. Data should be tagged to make it findable. But exactly how should it be tagged? Tags have to be flexible and dynamic to reflect the unpredictable nature of scientific research, but they have to follow standards, otherwise they’re of little use to anyone else. How many of us, at our first attempt at implementing a filing system, end up with lots of similarly-named folders, each containing a single item, perhaps accompanied by a bulging folder called ‘misc’? Without standards developed with the experience of people who work with information and its management – librarians who have embraced the digital age – big data could end up being an incoherent, unwieldy mess, just like those first forays into filing. It’s perhaps not so much ‘catch-the-wave’ as ‘avoid the sea spray’ – all while a tsunami looms on the horizon.

“Without supporting tools, data isn’t data,” said Ross Wilkinson of the Australian National Data Service in the afternoon session on metadata. “It rots! It needs to be made available [through e-infrastructures and computing resources] and it needs to be enhanced by making it available in alignment with other datasets.”
Metadata marsupial, the possum. (CC-BY Wolombi, Flickr)
Expanding on this point, Wilkinson explained that enhancing data through metadata allows curation of datasets with real cross-disciplinary benefit, “not just to answer questions, but to explore the data to find new patterns”. By placing data in a rich (and that means metadata-loaded) context, scientists in Australia have been using habitat maps of where possums (which, we were told, don’t sweat) live to predict the likelihood of bush fires. Without sweat glands, the possums would not abide in areas likely to burn when the dry season comes. Finding this connection has profound economic and social implications for human habitations, construction and related policies in Australia, but before the data was curated and made open, that link might never have been found.

One area that definitely needs a robust approach to metadata is medicine, not only because of the rich terminology of biology, but because clinicians often like to see information in a diagrammatical format. This has presented problems for clinical metadata, because it’s harder to grep in a graphic than in a text file. Bernard de Bono of the European Bioinformatics Institute presented one solution, ApiNATOMY, which automates the creation of standard anatomy schematics and metabolic maps and allows the inclusion of metadata. It’s the standardization of the approach that, those behind the project say, makes it suitable for multiscale anatomy analytics.

But what language will metadata be in? EUDAT is a project concerned with European data infrastructure, so it could be any one of the 23 official languages recognised by the EU. Speaking in the plenary, Kimmo Koski, Director of the Finnish IT Centre for Science, revealed that the standard language agreed on would be English – an important step towards European data standards.
