Thursday, May 23, 2013

Donald Knuth said software is hard. So is open data.

Software is data. Data is infrastructure. But is data research, or is it development? Is it the foundation for an academic career, or is it a raw material, ready to be processed and commercialised by the entrepreneurs of the digital age?

If data is the new oil, then we need to know who can lay a credible claim to it. Taxpayer-funded public research, it is widely said, should be made openly accessible. But researchers want credit for their hard work and, concerned about their data-well being plundered by unscrupulous others, want to control access to it.

This database might be copyrightable. And quite large.
Raw data cannot be copyrighted: it fails to meet the criteria of being creative and (in its representation of an aspect of the physical laws of the Universe) original. Databases, in their curation and design, are protected by thin copyright – and so the intellectual property owner can decide how the database can be reused. That means they hold the copyright, but can choose to license the database under an open licence (copyleft or otherwise), e.g. Creative Commons Attribution or similar.

I’ve deliberately omitted specific pieces of information in the last paragraph. First: who is the intellectual property owner? (Back to this point shortly.) Second: why bother licensing data that, under the true gold standard for open data, should receive a public domain dedication or, in the Creative Commons world, a CC0 licence? That’s what the Panton Principles recommend: making data truly public domain avoids the uncertain nature of thin copyright (by simply renouncing ownership) and also avoids the eternal devil of Creative Commons Attribution works: attribution stacking. If the data is produced by seven authors and is then recycled six times by author-groups of seven, all different each time, it will effectively have 49 authors (the original 7 plus 6 × 7), and this can grow.
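The attribution-stacking arithmetic is easy to get wrong, so here is a minimal sketch of it. The function name and parameters are my own, purely for illustration, and it assumes (as in the example above) that no author appears in more than one group.

```python
def stacked_authors(original_authors: int, reuses: int, authors_per_reuse: int) -> int:
    """Count everyone who must be credited after a chain of reuses.

    Assumption: every reusing group is entirely new, so no author
    is counted twice.
    """
    # The original authors, plus one fresh group of authors per reuse.
    return original_authors + reuses * authors_per_reuse


# Seven original authors, recycled six times by groups of seven.
total = stacked_authors(original_authors=7, reuses=6, authors_per_reuse=7)
print(total)
```

The count grows linearly with every reuse, which is exactly the burden a public domain dedication sidesteps.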

Depending on the country and institution, it may be that the University (or research institute) is the real copyright holder (if the database is curated and copyrightable) rather than the researcher. That’s surprising even to some researchers who, in signing a contract, relinquished intellectual property rights when they took a job at University X. But clearly, researchers can be copyright holders and therefore (and especially if they are funded by taxpayers) the common good requires them to make their data open, i.e., PDDL or CC0. After all, their hard work and the credit they deserve from it both derive from and are protected by cultural norms: citations, election to learned societies and so on. But, as I learned at the e-IRG workshop in Dublin, not all publicly funded scientists do this: they are worried that their peers will plunder their data-well and not cite the source, either absent-mindedly or (in the worst case) maliciously. So they release their original, creative database under CC-BY, or something more restrictive, in direct contradiction to the Panton Principles. Or they don’t, and presumably assert copyright (whether their work is copyrightable or not).

Research institutes can be just as confused in their approach to licensing, and many legal experts in the field recount examples of institutions worrying about licensing after the fact (when it is often too late). But, for simplicity’s sake, those advocating open data would rather the intellectual property resulting from publicly funded research be in the hands of institutes rather than individual scientists, where it is easier to manage.

In the open data legalities track of the e-IRG workshop, it was suggested that a lot of confusion can be avoided by better training for researchers in such matters as copyright and copyleft, and by default licensing positions for publicly funded data. Having a default position recognises that many researchers shouldn’t have to care about legalities if they don’t want to; better education about copyright and copyleft, I believe, would make everyone better web citizens. Researchers whose databases are used deserve credit, but proper citation of source also deserves credit if we are ever to move towards public domain dedication. Using databases without proper citation in research to further a career, when done duplicitously, is one thing only: plagiarism. That alone should dissuade unscrupulous researchers, if they really exist.

The onus is also on the intellectual property owner to flag up instances of data misattribution when they occur, but we should work towards a persistent identifier system, so you can track the data back to source.

So: that’s publicly funded research. Now, what about the private sector? Or, indeed, the start-up that the researcher runs in their spare time: who should own that? The individual? The University, if the individual thinks about the start-up between 9 and 5 on a weekday? It gets tricky. But better education in these matters, and actually having a default position, is a step towards a more sensible future with regard to open data.

Oh, and regarding software: it is copyrightable in most jurisdictions, which is what makes it licensable. But that's a whole other argument.

