What do we actually mean by data retention?

Having written last week about what (should) happen when an online journal dies, by pure coincidence I had a discussion on Twitter earlier this week with two former colleagues about capacity and data retention, and how you come up with estimates of capacity and usage.

It’s been my experience that there are no real guidelines for archival systems on retention – there are various state and federal, not to mention contractual, restrictions on how long you have to keep data.

And at that point it all descends into a handwaving mess.

So what data do you keep and for how long?

I don’t know, and I’m sure you don’t either.

So let’s look at the case of data that’s been put in some sort of repository to support the publication of a research paper.

Journals that require you to make available the data you based your conclusions on usually mean the data that you used for the underlying analysis.

That of course may be derived from a larger data set, and computation may have been involved. Now this big data set may already have been archived, or it may not. If it’s not archived you can’t cite it – should you archive it as well?

Probably you should, and the same goes for any scripts or code you used to carry out your analysis.

At this point your archival system starts to look like what I once called a long term curated content management system, or what the ninth century monk Nennius described as ‘a heap of all I have found’.

Storage is of course finite. It may be cheap, but it is finite, and there are costs involved in keeping it going. No one actually has any working rules about this so it’s probably best to make some up so we all know where we stand.

The nice thing about rules is that we can change them if they don’t work.

So the first rule is that data must be classified as to its retention status.

By this I mean that when you put data into a repository you need to say whether it directly supports a publication, in which case it has to be kept for a very long time, or if it doesn’t.

On the whole the first lot of data – the permanent data – will be a lot smaller than the second lot – the non-permanent data.

The second rule is that non-permanent data will be deleted after three years if it has not been accessed in the previous six months.

This means that the non-permanent material will not clog up the disk store. While one of my colleagues didn’t like the suggestion, rather than simply deleting the data, the owners should be offered it back to keep on their personal cloud storage – Microsoft, for example, bundles 1TB with an annual Office 365 subscription, making it a realistic option. I’ve got a backup of all my Dow’s project work on my personal OneDrive account, plus a lot of my supporting material in OneNote.

The third rule is that after three years non-permanent data will be subject to a six-monthly review and deleted if it has not been accessed in the previous six months.

And the reason for this is that people will try to game the system. (While it’s not quite the same, at one place I worked we gave people a small allocation of permanent storage and a large allocation of scratch storage for work-in-progress data – there was a lot of misplaced ingenuity that went into writing scripts to simulate access to that work-in-progress data.)

Having a rule like this makes gaming the system onerous. Obviously people can automate dummy accesses, but they then need to maintain the code they are using to simulate access. Sooner or later they’ll give up, leave, or move on, and their data will age out of the system.
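To make the two age-out rules concrete, here is a minimal sketch of what the retention sweep might look like, assuming the repository can expose each item’s retention flag, deposit date and last-access timestamp (all the names below are illustrative, not a real repository API):

from datetime import datetime, timedelta

def retention_sweep(items, now=None):
    """Yield the ids of non-permanent items that are more than three years
    old and have not been accessed in the previous six months."""
    now = now or datetime.now()
    three_years = timedelta(days=3 * 365)
    six_months = timedelta(days=182)
    for item in items:
        if item["permanent"]:
            continue  # rule one: data directly supporting a publication is kept
        old_enough = now - item["deposited"] > three_years
        idle = now - item["last_accessed"] > six_months
        if old_enough and idle:
            yield item["id"]  # candidate to offer back to its owner, then delete

# illustrative records only
items = [
    {"id": "A1", "permanent": False,
     "deposited": datetime(2013, 1, 10), "last_accessed": datetime(2015, 12, 1)},
    {"id": "A2", "permanent": True,
     "deposited": datetime(2012, 5, 2), "last_accessed": datetime(2013, 1, 1)},
]

for candidate in retention_sweep(items, now=datetime(2016, 7, 1)):
    print("candidate for deletion:", candidate)

Run as the six-monthly review, with an offer-back email to the owners before anything is actually removed, this is really all rules two and three amount to.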

I’m not going to pretend this is a definitive answer – it is a strategy for managing the problem of data retention given that you can’t keep it all – some you can, some you have to let go …


Using BibTeX for artefact description

One thing that my work on the documentation project has shown me is the need to build up a list of type artefacts that are documented in a fairly unambiguous way.

To be clear, this isn’t using BibTeX per se as a catalogue format, but coercing the BibTeX format for artefact description.

BibTeX is designed for documents, and while it can be bent to cover online documents and other electronic resources that are document-like, such as datasets, it is not designed to handle physical artefacts. However, what we are doing here is not building a database of artefacts but a collection of exemplars from online electronic resources.

While there are a number of perfectly good museum cataloguing formats, BibTeX has the advantages of (a) being simple and (b) having a number of open source tools and libraries to manipulate records.

So what can we do?

Taking the Jacob Hulle poison bottle from the Museum of Applied Arts and Sciences as an example, its Wikipedia-style citation is:

{{cite web
|url=https://ma.as/32687
|title=Poison, bottle of strychnine, Jacob Hulle, London, England, 1860-1890s
|author=Museum of Applied Arts & Sciences
|access-date=7 July 2017
|publisher=Museum of Applied Arts & Sciences, Australia}}

which we could coerce to something like

@misc{maas_id:85/1323,
author = {Museum of Applied Arts \& Sciences},
year = {1860-1890},
title = {{Poison, bottle of strychnine, Jacob Hulle, London}},
howpublished = {https://ma.as/32687},
note = {accessed 07 July 2017}
}

which has the merits of both simplicity and legibility, as well as being machine readable if required …
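By way of illustration, here is a minimal sketch of what ‘machine readable’ can mean in practice, using the open source bibtexparser library for Python (one of several tools that could do the job); the record is just the coerced example above:

import bibtexparser

record = r"""
@misc{maas_id:85/1323,
author = {Museum of Applied Arts \& Sciences},
year = {1860-1890},
title = {{Poison, bottle of strychnine, Jacob Hulle, London}},
howpublished = {https://ma.as/32687},
note = {accessed 07 July 2017}
}
"""

# parse the record into plain Python dictionaries
database = bibtexparser.loads(record)
for entry in database.entries:
    print(entry["ID"], "|", entry["title"], "|", entry["howpublished"])

From there it is a short step to generating whatever catalogue listing or export format is needed.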


Open source literature and citizen science

As I’ve said elsewhere, it’s been about a year since I retired, and despite having been around research institutions and universities all my adult life, I must say I don’t miss it, except for one thing – easy access to journals.

A surprisingly large amount of information is freely available online, but every so often, in my role as a dilettante classicist and amateur Victorian historian, I come up against a blank wall where something that has piqued my interest sits securely behind a paywall.

When I was working, this wasn’t a problem – if the institution one worked for had access, one just logged in from work, or via a VPN or a proxy server, even if it was nothing to do with the day job. Of course I was in the lucky position where my day job in digital archiving could have allowed me to plausibly claim ‘just testing’ if anyone ever asked, but no one ever did.

I no longer have a university login – for the first time since 1980 – so I can’t do that. If I still lived in the city, rather than in country Victoria, I could get on the bus to my former employer and use the library, but even then I still wouldn’t be able to access the electronic resources, as all the computers require you to log in – there are no kiosk-mode machines for general use or literature searches.

Now I’m not really a serious classicist or nineteenth century historian, so I can get by with secondary sources quite happily, and I’m well enough off to buy second hand copies of any particularly interesting books.

However, having once been a field ecologist, I can see that if I was an amateur botanist, say, being able to access specialist literature would be rather more important than it is to me.

And if we want citizen science, we want it to be good science. And that means access to the literature, which in turn means open source literature, in that the content is freely available.

Now while it’s not exactly cost free, the costs of hosting an electronic journal are minimal compared to the subscription costs, and the costs of simply hosting content are trivial.

And of course it’s not just for the benefit of a few amateur researchers who live in rich countries – researchers in countries where research is poorly funded and which lack the infrastructure of large research libraries are in an equally difficult position.


Scientific communication in pre-1989 Eastern Europe

A long time ago, the early nineteen eighties in fact, I held a research studentship from the UK Medical Research Council.

And in the course of my research I read widely in physiology and ethology to understand prior work on stress, environmental influences, and learned responses to uncertainty. This was because a lot of the physiological work had been carried out on fit young Anglo-Saxon males who were either members of the armed forces or university students, and who, more importantly, must have expected something unpleasant to happen to them.

The point is that we know the world is stressful and uncertain and that we (mostly) learn to get on with it, some better than others; it is just that some people do not cope very well, which results in a whole range of psychophysiological symptoms.

And because it was thought unethical to do some of the things that result in extreme stress in humans, some people thought it was a little more ethical to use animals. But because of the electromechanical technology of the time, the studies were very much either/or, and consequently less valuable.

Computers controlling things allow for pseudo-random events, where you can go from ‘mostly predictable with the odd bad thing’ to absolutely random and terrifying. Needless to say, humans and animals find the latter scenario very stressful.

As it was important to avoid repeating prior work, because (a) human-based studies were expensive and even then involved complex approval processes and (b) work with non-human subjects was heavily restricted, one had to be sure that no one had tried to do something similar before.

So one read. And because the university where I was studying didn’t have many of the journals I needed I was given a generous inter library loan allowance.

And in they would come, photocopies from the British Library’s document supply centre. Mostly British or North American, but occasionally French or German. And one time, in a German paper, I found a reference to an interesting study carried out at the Charles University in Prague, which of course in those Cold War times was the grim grey capital of the Czechoslovak Socialist Republic, and not a funky place with good music and better pubs.

But for the hell of it I put in a request, expecting either a rejection slip – request disallowed – or possibly a photocopy.

But no, someone at the British Library thought it worthwhile to ask the Czechs for a copy, and they sent me the conference proceedings containing the study I wanted, so I could xerox it myself.

Now at that time, journals in the west were comparatively cheap, produced in the main by learned societies, though there were a few commercial journals owned by Pergamon, Elsevier and Springer, and doubtless a few others I’ve forgotten. And of course there were things like the Science Citation Index, which is arguably the granddaddy of bibliometrics and the various reputational studies that plague us today.

But in the old east there was no such structure. Knowledge was said to be the property of the people as the people had paid for it. There were almost no journals, certainly no commercial journals, yet people still discussed and exchanged ideas and built reputations.

Now one of the concerns among researchers about moving to publication in lesser known open source journals is the loss of impact and reputation, and consequently of the ability to attract funding. So my question is: does the way scientific communication proceeded in pre-1989 Eastern Europe give us a model for a world with diverse methods of publication and dissemination?


What is a repository?

I’ve had a Twitter discussion this morning with Pete Sefton (@ptsefton) about what a repository is.

Pete has argued that repositories can be transient as a repository is about content organisation and presentation. I take a different view – repositories are simply content management systems optimised for the long term curation of data.

We are of course both right.

And the reason why we are both right is that a lot of institutions have used their institutional repositories as a way of showcasing their research outputs. Nothing wrong with that. The institutional repository phenomenon came about because, as publications and preprints increasingly became electronic, institutions needed a way to manage them, and, for a lot of them, DSpace was the answer.

And of course, then we had Google Scholar indexing them, and various sets of metrics, and the role of institutional repositories sort of shifted.

Enter the data repository. It can contain anything. Performances of the Iliad, Aboriginal chants, digitised settler diaries, photographs of old Brisbane, stellar occultation data, maps, etc. I could go on.

The key point is there’s no uniformity of content type – the content is there for reuse, and probably only within a particular knowledge domain. We’re no longer about presenting content, we’re about accessing content.

A subtle distinction, but an important one. Early repositories were oriented towards text based content and that made it easy to conflate presentation with access.

In fact we’re doing different things because access is about reuse and presentation is just that, presentation.

A collection of manuscript images can be presented through a presentation layer such as Omeka, to make them available in a structured manner, while also being stored in a managed store.

In fact the nicest example is the research management system. Data is pulled from the HR system, the institutional repositories, and some other sources to build a picture of research activity, researcher profile pages, and so on – the same data is reused and presented in multiple ways.

So, let’s call what we used to call the repository the long term curated content management system, the LCCMS.

Besides adding another incomprehensible acronym, this has some benefits – it recognises that content can be disposed of and may not be fixed. One of our key lessons from operating a data repository is that researchers need more than data publication – they need a well managed, agnostic, work-in-progress store while they assemble datasets, be it from astronomical instruments or a series of anthropology field trips to PNG – something that goes against the idea that only ‘finished’ content goes into the repository, but is clearly needed.

So, it’s about content, and more importantly what you do with it …


Lodlam 2015

I’ve just spent the last two days at the Lodlam summit in Sydney.

Lodlam – Linked Open Data in Libraries, Archives and Museums – was an invitation-only event loosely linked to the Digital Humanities 2015 conference, also on in Sydney at the same time, and I was lucky enough to be invited.

It was interesting, involving, and certainly provided a lot of food for thought. Rather than repeat myself endlessly, follow the links below for (a) a personal view of the event and (b) my session notes, suitably spell checked and cleaned up. As always the views and interpretation are mine, and not those of any other named individual (although for a linked data event I guess I should write ‘named entity’).


 


Electronic resources and BibTeX

BibTeX is many things to many people, but we principally use it as a bibliographic file format.

This of course produces a whole slew of problems when it comes to online resources, for the simple reason that BibTeX predates online resources.

Traditional paper media have a whole set of conventions we understand about the differences between a book, a book chapter, a paper, and a conference paper, all of which BibTeX handles well by using @article, @book and so on.

Strict BibTeX only really has the @misc format to incorporate URLs and so on, but that works quite nicely, as we can see when using BibTeX for dataset citation; the only problem being that if everything is @misc we lose the distinction between articles, datasets, conference presentations and so on. Using the newer BibLaTeX standard allows UTF-8 characters, but really does not add anything – you end up with a generic @online type instead of simply coercing @misc to do your bidding.

This is not just a BibTeX thing – all the reference managers in common use are still firmly bedded in the paper era – EndNote has similar problems distinguishing between various sorts of electronic resources.

There is another problem with online articles.

In the old days your research paper was published in only one place, and consequently had only one incarnation, and hence one citation.

With open access material you may find the pdf of the journal article in a variety of locations – the journal publisher’s site, your institutional repository, or some specialist collection – in other words you can have multiple instances of the same document, and each instance will have a different URL.

This of course would play havoc with citation counts. The simplest solution is to implement a rule that the copy published in the open access journal is the primary one and that secondary copies are just that; consequently the URL cited is the one derived from the digital object identifier, and not the one generated by the local document server – logically the copy in your local repository is the analogue of the xerox of the journal copy you got from your local library’s document supply service.

So, in the case of a dataset, a conference paper, or something only published locally, the DOI should resolve to the local instance, but where it’s published elsewhere it should resolve to the journal DOI …
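As a minimal sketch of that rule applied when generating a citation, assuming each record carries an optional DOI and a local repository URL (the field names and values here are purely illustrative):

def citation_url(record):
    """Prefer the DOI-derived URL; fall back to the local copy only for
    material, such as a dataset, that is published nowhere else."""
    if record.get("doi"):
        return "https://doi.org/" + record["doi"]
    return record["local_url"]

# a journal article with a DOI, and a locally published dataset
print(citation_url({"doi": "10.1000/example.123",
                    "local_url": "https://repo.example.edu/handle/1234/1"}))
print(citation_url({"local_url": "https://repo.example.edu/handle/1234/2"}))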


Exporting references from DSpace in BibTeX format

Following on from our design decision to use BibTeX as a lowest common denominator reference export format, we have developed a simple BibTeX reference export utility for DSpace 4.3.

Essentially, it takes the Dublin Core object description and translates it into a BibTeX-style reference, with the object type – for example @article for a research paper – being set on the basis of the dc.type metadata field.

As a further refinement we are using the object handle as the label, which gives us an entry that looks something like this:

@article{hdl.handle.net_1234_1234567,
author = {Collins, Wilkie},
title = {Testing Methodologies},
journal = {The Journal of Important Things},
year = {2014}
}

Testing and development is ongoing, but our sparse test entries import successfully into Zotero and JabRef.
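For illustration only, here is a minimal sketch of the kind of mapping involved, assuming the Dublin Core fields have already been pulled out of DSpace into a flat dictionary – the field names, the type table and the helper are hypothetical rather than the actual utility:

# map dc.type values onto BibTeX entry types, falling back to @misc
DC_TYPE_TO_BIBTEX = {
    "Article": "article",
    "Book": "book",
    "Conference Paper": "inproceedings",
    "Dataset": "misc",
}

def to_bibtex(handle, dc):
    """Translate a flat dict of Dublin Core fields into a BibTeX entry,
    using the object handle as the citation label."""
    entry_type = DC_TYPE_TO_BIBTEX.get(dc.get("dc.type"), "misc")
    label = "hdl.handle.net_" + handle.replace("/", "_")
    fields = [
        ("author", dc.get("dc.contributor.author")),
        ("title", dc.get("dc.title")),
        ("journal", dc.get("dc.source")),            # assumed field mapping
        ("year", (dc.get("dc.date.issued") or "")[:4]),
    ]
    body = ",\n".join("%s = {%s}" % (k, v) for k, v in fields if v)
    return "@%s{%s,\n%s\n}" % (entry_type, label, body)

print(to_bibtex("1234/1234567", {
    "dc.type": "Article",
    "dc.contributor.author": "Collins, Wilkie",
    "dc.title": "Testing Methodologies",
    "dc.source": "The Journal of Important Things",
    "dc.date.issued": "2014-06-01",
}))

Run on the sample metadata this produces an entry equivalent to the example above.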


Our candidate programmatic Orcid updater

Back in March 2014 we made our prototype application for programmatic Orcid updates available. This was designed only as a prototype and not intended for general use in a production environment.

As of 01 April 2014 Orcid are going to move to release 1.2 of their schema, which may break our app.

When we say may, we really mean may – we don’t know, as we haven’t tested it against the new version of the Orcid schema. Just now we’re working on some other projects and don’t have the bandwidth to test things properly, and more importantly, to update them if they break.

However, we will be testing and if necessary updating our tool later this year – we just can’t say when …


Citing dynamic datasets

There’s an assumption in data citation that all the datasets in a repository are fixed and final, just as a research paper, when published, is in its fixed and final form.

In practice this isn’t quite true – there are a lot of use cases where the data in a datastore is subject to change, the obvious ones being longitudinal or observational studies, which these days can be extended to cover datasets derived from automated sensors, with consequent errors due to technical failures and subsequent fixups.

Most researchers actually only work on an extract of the data, and current best practice is to redeposit the extract, along with a description of how it was created, as if it were a new dataset.

This is fine for small datasets, but does not scale well for big data, be it numerical, sensor based, or even worse multimedia – video recordings of language and the accompanying transcripts being one use case.

I have previously thought about using Bazaar, Subversion or Git as a version control system, creating a new object each time the dataset updates. That also suffers from scaling problems, but at least has the benefit of being able to recreate the extract against the dataset at a known point in time.

Consequently I was interested to hear Andreas Rauber of the Vienna University of Technology speak on the Research Data Alliance approach to dynamic dataset citation [PDF of my webinar notes].

Essentially their approach is to use standard journaling techniques to track changes to the database, and to store timestamped queries so that a query can be rerun against the dataset in a known state to recreate the extract. This approach is highly scalable as regards storage, and allows the flexibility of rerunning queries against the dataset as it was at a prior point in time, or against the current dataset.
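To make the idea concrete, here is a minimal sketch of the journaling-plus-timestamped-query pattern using SQLite – a toy illustration of the general technique, not the Research Data Alliance reference implementation; the table layout and column names are assumptions:

import sqlite3

db = sqlite3.connect(":memory:")
# each row is journaled with a validity interval; valid_to NULL means 'current'
db.execute("""CREATE TABLE readings (
    sensor TEXT, value REAL, valid_from TEXT, valid_to TEXT)""")

def insert(sensor, value, when):
    db.execute("INSERT INTO readings VALUES (?, ?, ?, NULL)", (sensor, value, when))

def correct(sensor, old_value, new_value, when):
    # a fixup: close off the erroneous row and journal its replacement
    db.execute("""UPDATE readings SET valid_to = ?
                  WHERE sensor = ? AND value = ? AND valid_to IS NULL""",
               (when, sensor, old_value))
    insert(sensor, new_value, when)

def extract_as_at(timestamp):
    # the stored, timestamped query: the rows that were valid at the cited moment
    return db.execute("""SELECT sensor, value FROM readings
                         WHERE valid_from <= ?
                           AND (valid_to IS NULL OR valid_to > ?)""",
                      (timestamp, timestamp)).fetchall()

insert("t1", 21.5, "2015-06-01")
correct("t1", 21.5, 20.9, "2015-06-10")        # a later correction
print(extract_as_at("2015-06-05"))             # the extract as originally cited
print(extract_as_at("2015-06-15"))             # the corrected, current view

Citing a dataset extract then amounts to citing the stored query plus its timestamp, rather than redepositing the extract itself.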

As a solution it seems quite elegant; my only worry would be the computational power required to handle and rerun changes where the dataset has been very chatty, with many updates, as would be the case with some automated sensor derived datasets.

I also wonder about its applicability to datasets which are essentially executable binary elements, such as IPython notebooks and other forms of mathematical models, but undoubtedly this approach solves the problem of continually redepositing dataset extracts, with their consequent impact on storage – but then, this week at least, storage is cheap …
