There’s an assumption in data citation that all the datasets in a repository are fixed and final, just as a research paper, once published, is in its fixed and final form.
In practice this isn’t quite true. There are a lot of use cases where the data in a data store is subject to change – the obvious cases being longitudinal or observational studies, which these days can be extended to cover datasets derived from automated sensors, with consequent errors due to technical failures, and subsequent fixups.
Most researchers actually only work on an extract of the data, and current best practice is to redeposit the extract, along with a description of how it was created, as if it were a new dataset.
This is fine for small datasets, but does not scale well for big data, be it numerical, sensor based, or even worse, multimedia – video recordings of language and accompanying transcripts being one use case.
I have previously thought about using Bazaar, Subversion or git as a version control system, creating a new object each time the dataset updates. That also suffers from scaling problems, but at least has the benefit of being able to recreate the extract against the dataset at a known point in time.
Consequently I was interested to hear Andreas Rauber of the University of Vienna speak on the Research Data Alliance approach to dynamic dataset citation [PDF of my webinar notes].
Essentially their approach is to use standard journaling techniques to track changes to the database, and to store timestamped queries so that a query can be rerun against the dataset in a known state to recreate the extract. This approach is highly scalable as regards storage, and allows the flexibility of rerunning queries either against the dataset as it was at a prior point in time or against the current dataset.
As a solution it seems quite elegant. My only worry would be the computational power required to handle and rerun changes where the dataset has been very chatty, with many updates, as would be the case with some automated sensor derived datasets.
I also wonder about its applicability to datasets which are essentially executable binary elements, such as IPython notebooks and other forms of mathematical models. But undoubtedly this approach solves the problem of continually redepositing dataset extracts, with their consequent impact on storage – but then, this week at least, storage is cheap …