Having written last week about what (should) happen when an online journal dies, by pure coincidence I had a discussion on Twitter earlier this week with two former colleagues about capacity and data retention, and how you come up with estimates of capacity and usage.
It’s been my experience that there are no real retention guidelines for archival systems – there are various state and federal, not to mention contractual, restrictions on how long you have to keep data.
And at that point it all descends into a handwaving mess.
So what data do you keep and for how long?
I don’t know, and I’m sure you don’t either.
So let’s look at the case of data that’s been put in some sort of repository to support the publication of a research paper.
Journals that require you to make available the data you based your conclusions on usually mean the data that you used for the underlying analysis.
That, of course, may be derived from a larger data set, and computation may have been involved. This larger data set may already have been archived, or it may not. If it’s not archived you can’t cite it – so should you archive it as well?
Probably you should, and the same goes for any scripts or code you used to carry out your analysis.
At this point your archival system starts to look like what I once called a long term curated content management system – or what the ninth century monk Nennius described as a heap of all I have found.
Storage is of course finite. It may be cheap, but it is finite, and there are costs involved in keeping it going. No one actually has any working rules about this so it’s probably best to make some up so we all know where we stand.
The nice thing about rules is that we can change them if they don’t work.
So the first rule is: data must be classified as to its retention status.
By this I mean that when you put data into a repository you need to say whether it directly supports a publication, in which case it has to be kept for a very long time, or if it doesn’t.
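As a sketch of what that classification might look like in practice (the field and type names here are my own invention, not any existing repository schema), the rule amounts to recording a retention flag alongside every deposit:

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class Retention(Enum):
    """Retention status assigned when data enters the repository."""
    PERMANENT = "permanent"          # directly supports a publication
    NON_PERMANENT = "non-permanent"  # everything else

@dataclass
class Deposit:
    identifier: str          # repository identifier for the deposit
    retention: Retention     # the classification required by rule one
    deposited: datetime      # when the data entered the repository
    last_accessed: datetime  # updated on every retrieval
```

The point is simply that the classification is made once, at deposit time, and travels with the data from then on.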
On the whole the first lot of data – the permanent data – will be a lot smaller than the second lot – the non-permanent data.
The second rule is that non-permanent data will be deleted after three years if it has not been accessed in the previous six months.
This means that the non-permanent stuff will not clog up the disk store. While one of my colleagues didn’t like the suggestion, rather than simply deleting the data, the owners should be offered it back to keep on their personal cloud storage – Microsoft, for example, bundles 1TB of storage with an annual Office 365 subscription, making this a realistic option. I’ve got a backup of all my Dow’s project work on my personal OneDrive account, plus a lot of my supporting material in OneNote.
The third rule is that after three years non-permanent data will be subject to six-monthly review and deleted if it has not been accessed in the previous six months.
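Taken together, rules two and three boil down to a single test: non-permanent data becomes a deletion candidate once it is at least three years old and has not been accessed in the previous six months. A minimal sketch of that test (the function name and the day-count approximations are my own, purely for illustration):

```python
from datetime import datetime, timedelta

# Approximate thresholds from rules two and three
THREE_YEARS = timedelta(days=3 * 365)
SIX_MONTHS = timedelta(days=183)

def eligible_for_deletion(deposited: datetime, last_accessed: datetime,
                          permanent: bool, now: datetime) -> bool:
    """Apply rules two and three: old enough AND stale, never permanent."""
    if permanent:
        return False  # rule one: publication-supporting data is kept
    old_enough = now - deposited >= THREE_YEARS
    stale = now - last_accessed >= SIX_MONTHS
    return old_enough and stale

# Deposited four years ago, last touched a year ago: a deletion candidate.
now = datetime(2015, 1, 1)
print(eligible_for_deletion(datetime(2011, 1, 1), datetime(2014, 1, 1),
                            permanent=False, now=now))  # True
```

A six-monthly review is then just this check run over every non-permanent deposit, with the candidates offered back to their owners before removal.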
And the reason for this is that people will try to game the system. (While it’s not quite the same, at one place I worked we gave people a small allocation of permanent storage and a large allocation of scratch storage for work-in-progress data – there was a lot of misplaced ingenuity that went into writing scripts to simulate access for work-in-progress data.)
Having a rule like this makes gaming the system onerous. Obviously people can automate dummy accesses, but then they need to maintain the code they use to simulate access. Sooner or later they’ll give up, leave, or move on, and their data will age out of the system.
I’m not going to pretend this is a definitive answer – it’s a strategy for managing the problem of data retention, given that you can’t keep it all. Some you can keep; some you have to let go …