What is a repository?

I’ve had a Twitter discussion this morning with Pete Sefton (@ptsefton) about what a repository is.

Pete has argued that repositories can be transient, as a repository is about content organisation and presentation. I take a different view – repositories are simply content management systems optimised for the long-term curation of data.

We are of course both right.

And the reason we are both right is that a lot of institutions have used their institutional repositories as a way of showcasing their research outputs. Nothing wrong with that. The institutional repository phenomenon came about because, as publications and preprints increasingly became electronic, institutions needed a way to manage them, and for a lot of them DSpace was the answer.

And of course, then we had Google Scholar indexing them, and various sets of metrics, and the role of institutional repositories sort of shifted.

Enter the data repository. It can contain anything: performances of the Iliad, Aboriginal chants, digitised settler diaries, photographs of old Brisbane, stellar occultation data, maps, etc. I could go on.

The key point is there’s no uniformity of content type – the content is there for reuse, and probably only within a particular knowledge domain. We’re no longer about presenting content, we’re about accessing content.

A subtle distinction, but an important one. Early repositories were oriented towards text-based content, and that made it easy to conflate presentation with access.

In fact we’re doing different things because access is about reuse and presentation is just that, presentation.

A collection of manuscript images can be presented by a presentation layer such as Omeka to make them available in a structured manner; they can also be stored in a managed store.

In fact the nicest example is the research management system. Data is pulled from the HR system, the institutional repositories, and some other sources to build a picture of research activity, researcher profile pages, and so on – the same data is reused and presented in multiple ways.

So, let’s call what we used to call the repository the long term curated content management system, the LCCMS.

Besides another incomprehensible acronym, this has some benefits – it recognises that content can be disposed of and may not be fixed. One of our key learnings from operating a data repository is that researchers need more than data publication – they need a well managed, agnostic work-in-progress store while they assemble datasets, be it from astronomical instruments or a series of anthropology field trips to PNG. That goes against the idea that only ‘finished’ content goes into the repository, but it is clearly needed.

So, it’s about content, and more importantly what you do with it …


Lodlam 2015

I’ve just spent the last two days at the Lodlam summit in Sydney.

Lodlam – Linked Open Data in Libraries, Archives and Museums – was an invitation-only event loosely linked to the Digital Humanities 2015 conference, also on in Sydney at the same time, and I was lucky enough to get an invitation.

It was interesting, involving, and certainly provided a lot of food for thought. Rather than repeat myself endlessly, follow the links below for (a) a personal view of the event and (b) my session notes, suitably spell-checked and cleaned up. As always the views and interpretation are mine, and not those of any other named individual (although for a linked data event I guess I should write ‘named entity’).



Electronic resources and BibTeX

BibTeX is many things to many people, but we principally use it as a bibliographic file format.

This of course produces a whole slew of problems when it comes to online resources, for the simple reason that BibTeX predates them.

Traditional paper media have a whole set of conventions we understand about the differences between a book, a book chapter, a paper, and a conference paper, all of which BibTeX handles well by using @article, @book and so on.

Strict BibTeX only really has the @misc type to incorporate URLs and so on, but that works quite nicely, as we can see with using BibTeX for dataset citation. The only problem is that if everything is @misc we lose the distinction between articles, datasets, conference presentations and so on. Using the newer BibLaTeX standard allows UTF-8 characters, but really does not add anything – you end up with a generic @online type instead of simply coercing @misc to do your bidding.
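Coercing @misc for a dataset looks something like this – every value in this entry, including the key and URL, is invented for illustration:

```bibtex
@misc{example.dataset.2014,
  author       = {Collins, Wilkie},
  title        = {Field Survey Measurements, 2013--2014},
  year         = {2014},
  howpublished = {\url{https://example.org/handle/xxxx/yyyy}},
  note         = {Online dataset}
}
```

The note field is doing the work the missing entry type would otherwise do, which is exactly the loss of distinction mentioned above.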

This is not just a BibTeX thing – all reference managers in common use are still firmly bedded in the paper era; EndNote has similar problems distinguishing between various sorts of electronic resources.

There is another problem with online articles.

In the old days your research paper was published in only one place, and consequently had only one incarnation, and hence one citation.

With open access material you may find the PDF of a journal article in a variety of locations: the journal publisher’s site, your institutional repository, or some specialist collection. In other words you can have multiple instances of the same document, and each instance will have a different URL.

This of course would play havoc with citation counts. The simplest solution is to implement a rule that the copy published in the open access journal is the primary one and that secondary copies are just that; consequently the URL cited is the one derived from the digital object identifier, and not the one generated by the local document server. Logically, the copy in your local repository is the analogue of the xerox of the journal copy you got from your local library’s document supply service.

So, in the case of a dataset, a conference paper, or something only published locally, the DOI should resolve to the local instance, but where it’s published elsewhere the citation should use the journal DOI …
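The rule reduces to a trivial helper – a sketch only, with a hypothetical function name and invented URLs:

```python
def citation_url(doi=None, local_url=None):
    """Pick the URL to cite: prefer the DOI-derived URL, and fall back
    to the local repository copy only when no external DOI exists."""
    if doi:
        return "https://doi.org/" + doi
    return local_url

# A journal article mirrored locally still cites the publisher DOI:
citation_url(doi="10.1000/xyz123",
             local_url="https://repo.example.edu/handle/1")
# A locally published dataset with no external DOI cites the local copy:
citation_url(local_url="https://repo.example.edu/handle/2")
```

The point of the sketch is that the local URL never wins when a DOI exists, which is what keeps the citation counts in one place.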


Exporting references from DSpace in BibTeX format

Following on from our design decision to use BibTeX as a lowest common denominator reference export format, we have developed a simple BibTeX reference export utility for DSpace 4.3.

Essentially, it takes the Dublin Core object description and translates it to a BibTeX-style reference, with the entry type – for example @article for a research paper – being set on the basis of the dc.type metadata field.

As a further refinement we are using the object handle as the label (the citation key), which would give us an entry – shown here with a hypothetical handle – that looks something like this:

@article{1885/12345,
  author  = {Collins, Wilkie},
  title   = {Testing Methodologies},
  journal = {The Journal of Important Things},
  year    = {2014}
}

Testing and development is ongoing, but our sparse test entries import successfully into Zotero and JabRef.
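The translation might be sketched in Python along these lines – the dc.type vocabulary, field names and handle here are illustrative assumptions, not the actual DSpace configuration:

```python
# Map (assumed) dc.type values onto BibTeX entry types,
# falling back to @misc for anything unrecognised.
TYPE_MAP = {
    "Journal article": "article",
    "Book": "book",
    "Book chapter": "incollection",
    "Conference paper": "inproceedings",
    "Dataset": "misc",
}

def dc_to_bibtex(record):
    """Build a BibTeX entry from a dict of Dublin Core values,
    using the object handle as the citation key."""
    entry_type = TYPE_MAP.get(record.get("dc.type"), "misc")
    lines = ["@%s{%s," % (entry_type, record["handle"])]
    for dc_field, bib_field in [("dc.contributor.author", "author"),
                                ("dc.title", "title"),
                                ("dc.date.issued", "year")]:
        if dc_field in record:
            lines.append("  %s = {%s}," % (bib_field, record[dc_field]))
    lines.append("}")
    return "\n".join(lines)

print(dc_to_bibtex({"handle": "1885/12345",
                    "dc.type": "Journal article",
                    "dc.contributor.author": "Collins, Wilkie",
                    "dc.title": "Testing Methodologies",
                    "dc.date.issued": "2014"}))
```

A record with an unmapped or missing dc.type simply comes out as @misc, which matches the lowest common denominator approach.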


Our candidate programmatic Orcid updater

Back in March 2014 we made our prototype application for programmatic Orcid updates available. This was designed only as a prototype and not intended for general use in a production environment.

As of 01 April 2014 Orcid are going to move to release 1.2 of their schema, which may break our app.

When we say may, we really mean may – we don’t know, as we haven’t tested it against the new version of the Orcid schema. Just now we’re working on some other projects and don’t have the bandwidth to test things properly, and more importantly, update them if they break.

However, we will be testing and if necessary updating our tool later this year – we just can’t say when …


Citing dynamic datasets

There’s an assumption in data citation that all the datasets in a repository are fixed and final, just as a research paper when published is in its fixed and final form.

In practice this isn’t quite true: there are a lot of use cases where the data in a datastore is subject to change – the obvious ones being longitudinal or observational studies, which these days can be extended to cover datasets derived from automated sensors, with consequent errors due to technical failures, and subsequent fixups.

Most researchers actually only work on an extract of the data, and current best practice is to redeposit the extract along with a description of how it was created as if it was a new dataset.

This is fine for small datasets, but does not scale well for big data, be it numerical, sensor based, or even worse multimedia – video recordings of language and accompanying transcripts being one use case.

I have previously thought about using Bazaar, Subversion or git as a version control system, creating a new object each time the dataset updates. That also suffers from scaling problems, but at least has the benefit of being able to recreate the extract against the dataset at a known point in time.

Consequently I was interested to hear Andreas Rauber of TU Wien speak on the Research Data Alliance approach to dynamic dataset citation [PDF of my webinar notes].

Essentially their approach is to use standard journaling techniques to track changes to the database, and to store timestamped queries so that a query can be rerun against the dataset in a known state to recreate the extract. This approach is highly scalable as regards storage, and allows the flexibility of rerunning queries against the dataset as it was at a prior point in time, or against the current dataset.
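A toy sketch of the journaling idea – this assumes a simple key-value store with integer timestamps, and is my own illustration rather than the Research Data Alliance recommendation, which targets real databases:

```python
class JournaledStore:
    def __init__(self):
        self.rows = []  # (key, value, valid_from, valid_to)

    def put(self, key, value, now):
        # Close off any currently valid row for this key, then
        # append the new version; nothing is ever overwritten.
        for i, (k, v, t0, t1) in enumerate(self.rows):
            if k == key and t1 is None:
                self.rows[i] = (k, v, t0, now)
        self.rows.append((key, value, now, None))

    def as_of(self, timestamp):
        # Recreate the dataset as it stood at the given time:
        # keep rows whose validity interval covers the timestamp.
        return {k: v for (k, v, t0, t1) in self.rows
                if t0 <= timestamp and (t1 is None or timestamp < t1)}

store = JournaledStore()
store.put("sensor-1", 20.1, 1)
store.put("sensor-1", 19.7, 5)   # a later fixup of a bad reading
store.put("sensor-2", 3.2, 3)

# A citation stores the query plus its timestamp; rerunning it
# reproduces the extract exactly, before or after the fixup:
store.as_of(2)   # sees the original sensor-1 reading only
store.as_of(6)   # sees the corrected reading and sensor-2
```

Storage grows with the number of changes rather than the number of deposited extracts, which is where the scalability claim comes from.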

As a solution it seems quite elegant; my only worry would be the computational power required to handle and rerun changes where the dataset has been very chatty, with many updates, as would be the case with some automated sensor derived datasets.

I also wonder about its applicability to datasets which are essentially executable binary elements, such as iPython notebooks and other forms of mathematical models, but undoubtedly this approach solves the problem of continually redepositing dataset extracts, with their consequent impact on storage – but then, this week at least, storage is cheap …


The Data Commons and GitHub

Over the past few months there’s been a growing interest in archiving software projects, such as can be done via Zenodo. This is part of a more general problem – as researchers increasingly use environments such as iPython notebooks for their research, there’s a growing need to archive these notebooks to allow a replay of the determination of the results.

Inspired by Stuart Lewis’s recent post on GitHub-to-repository deposit, we’ve recently added a mechanism to the Data Commons to allow the import of metadata from GitHub, allowing the creation of an object record for a GitHub project.

Rather than import the code, we create a referential entry for the project, although of course files could be downloaded and added manually if a local copy was required.
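The mapping behind a referential entry might look something like this – the input dict mirrors a few fields of the GitHub repository API response, while the record layout and repository names are invented for illustration:

```python
def github_to_record(repo):
    """Turn GitHub repository metadata into a referential object
    record: we point at GitHub rather than copying the code."""
    return {
        "title": repo["name"],
        "description": repo.get("description", ""),
        "creator": repo["owner"]["login"],
        "identifier": repo["html_url"],  # the record refers back to GitHub
        "type": "software",
    }

record = github_to_record({
    "name": "example-project",
    "description": "An example research software project",
    "owner": {"login": "example-org"},
    "html_url": "https://github.com/example-org/example-project",
})
```

In practice the input would come from a call to the GitHub API’s repository endpoint; the sketch just shows the shape of the translation.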

We would hope to generalise this mechanism in the future to allow the import of dataset records from other repositories and stores, meaning that content need not always be in the same place as the metadata …
