Tuesday, March 5, 2013

Nurturing an open data culture: CloudscapeV

Delegates at CloudscapeV

As more and more scientific disciplines become data hungry, and a rising number of social scientists (economists, e-humanities researchers) follow suit, governments and international bodies are starting to commit to developing a framework that allows convenient, secure and intelligent access to and exchange of data. During the "Open Collaborative Models, Open Data, Big Data" session at CloudscapeV, a number of pioneering initiatives showed how cloud computing is opening up big data for research, society and business.

During the keynote speeches, Carlos Morais Pires, Scientific Officer for “Scientific Data e-Infrastructures” at the EC's DG CONNECT, said the vision for “global research data infrastructures” involves overcoming barriers to realising the importance of data sharing for next-century science. But the vision is also hindered by those waiting for standards to be established that enable data sharing and interoperability across the entire data life cycle. Pires commended the efforts of the Research Data Alliance (RDA), which is steering the international research community to gather user recommendations for infrastructures, policy, practice and standards. The RDA is holding its first meeting towards the end of this month.

A burgeoning partnership between data providers and scientific users is being encouraged by another international organization. Over the last eighteen months, EUDAT has been gathering user requirements to assemble the 'building blocks' for a pan-European data infrastructure that complements EGI's grid infrastructure and GÉANT's network infrastructure.

Rob Baxter, Software Development Group manager from the Edinburgh Parallel Computing Centre, explained some of the current challenges for EUDAT. Research councils are asking for data management strategies in proposals, but Baxter says that in order to foster a culture of data sharing you have to balance incentives and rewards, and demonstrate value. EUDAT is working towards an integrated "compelling" system whereby sharing your data will give you access to everyone else's data. “If you could bring in data from a slightly related discipline and combine your data with theirs, you could answer a different research question, and discover something new,” says Baxter. This could also aid reproducibility, which is still an issue in some disciplines. Five core communities are working with EUDAT, from the worlds of linguistics (CLARIN), biodiversity (LifeWatch), plate tectonics (EPOS), climate science (ENES), and medicine/physiology (Virtual Physiological Human). The next step for EUDAT is to integrate the process of ‘submitting data in/Persistent Identifiers out’ with some level of guarantee and user control. This would include users having some knowledge of, or say in, factors such as whether replication is automated, the number of copies that can be made, and the storage location. EUDAT's second user forum took place this week.

The long-term view for EUDAT is that the resource provider could be any cloud storage provider, regardless of location. However, automated replication across boundaries could be problematic, says Baxter. This is especially pertinent for medical data, copyrighted data and unique data (digital art), which can all be affected by different EU data protection laws. "Building collaborative data infrastructures and storage clouds for research data is not just [a] requirements/technical problem, but increasingly about managing policy restrictions automatically and harmonizing legal frameworks", concluded Baxter.

The question remains: how do you nurture an open data culture that drives innovation and economic growth? This is one of the goals of the independent, not-for-profit Open Data Institute (ODI), which was officially opened last September (one of the founders was Sir Tim Berners-Lee). GridCast interviewed Stuart Coleman, the Commercial Director at the Open Data Institute, at CloudscapeV, to find out more about some of the services offered by the ODI. These include mentoring data-driven start-ups, as well as training researchers and journalists in data literacy. Coleman says that "often the opportunity for people to innovate with data is constrained by the fact that lots of documents are released which could deliver more value as data".

Stuart Coleman at CloudscapeV
The ODI aims to unlock the supply of data and generate demand and value for business, research and society. Innovation will occur if people can gain access to that data and discern new insights from it. Companies that have embraced an open data environment stand to benefit considerably. For example, corporate giant Nike opened up its supply chain, which has had a positive impact on customer relations and sustainability.

OpenCorporates, one case study on the ODI website, has managed with a small staff of two to collate data on 51 million companies. Their aim was to create an open database for the corporate world (similar to OpenStreetMap), connecting and adding clarity to corporate data. "Many people buy data from other sources with a warranty that gives them a certain level of accuracy. In an open environment, people are constantly motivated to contribute to that data and to share-alike," says Coleman. "If you can identify where the data comes from, you can trust that data as being more authoritative than a closed source." The healthcare industry has also benefited from this transparency through the opening up of prescription data. The site Prescribing Analytics provides an insight into GP prescribing of cheaper generic versus more expensive branded drugs, indicating where cost efficiencies could be made (e.g. over £200 million).

Coleman says understanding is the main barrier, and this is where case studies and education are vital. What might be useful is a data roadmap to define what the most valuable data is and why. He points out that some data, known as core-reference data, can be particularly useful again and again when combined with other sources.

