Pete Sefton, over at UWS, recently published an interesting post on their use of the BagIt specification in their data capture applications.
The Data Commons also uses BagIt in its ingest mechanism, but for slightly different reasons. The genesis of using BagIt goes back a long way, in fact to when I was working for another organisation entirely, building what was described as a digital asset management system but was actually a digital archive.
What we had then were recordings of spoken language, and transcriptions and translations of the same, both with annotations.
They were all related, and while they were separate objects they had a very close linkage – the whole was clearly greater than the sum of the parts.
My idea at the time was to zip the lot together and deposit the zip file – which was not perfect, as it meant we lost access to the technical metadata, but at least we kept the related things together.
The zip file was not about compression, it was merely a way of keeping things together.
Fast forward to last year. We had the same problem with data capture: quite often different sets of observational data were separate but related files. I had previously played with file formats like epub, which uses zip to keep all the component parts together, gives each part a defined role, and has the concept of a manifest. This isn’t unique to epub – the Libre/OpenOffice document format does a similar trick. If you want to poke around, copy an odt file to something.zip and open it with your favourite archive manager.
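You can see the same trick from code: both odt and epub files are ordinary zip archives whose first entry, mimetype, declares the format, with a manifest alongside the content. A minimal sketch using only the Python standard library – the toy container built here stands in for a real odt or epub:

```python
import zipfile

def list_container(path):
    """List the member files of a zip-based document container (odt, epub, ...)."""
    with zipfile.ZipFile(path) as z:
        return z.namelist()

# Build a toy epub-like container to demonstrate; a real .odt or .epub
# has the same shape with more entries.
with zipfile.ZipFile("demo.epub", "w") as z:
    z.writestr("mimetype", "application/epub+zip")        # declares the format
    z.writestr("META-INF/container.xml", "<container/>")  # points to the manifest
    z.writestr("content.opf", "<package/>")               # the manifest itself

print(list_container("demo.epub"))
# → ['mimetype', 'META-INF/container.xml', 'content.opf']
```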
We actually thought about inventing our own format based on epub – for all of five minutes, until we discovered BagIt.
BagIt was a good choice as (a) it had the Library of Congress behind it, making it likely to become a de facto standard, and (b) it addressed a problem we hadn’t thought of – BagIt lets you have a holey archive, i.e. have files stored by reference elsewhere.
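In BagIt terms, a holey bag declares its remote payload files in a fetch.txt file at the bag root, one file per line: a URL, a length in bytes (a dash when the size is unknown), and the path the file would occupy in the bag's payload. A small sketch that writes such entries – the URLs and paths are invented for illustration:

```python
# Remote payload files for a "holey" bag: (url, length, payload path).
# Per the BagIt spec a fetch.txt line is "URL LENGTH FILEPATH", with "-"
# when the length is unknown. These example URLs are illustrative only.
remote_files = [
    ("https://example.org/images/folio-01.tiff", "-", "data/images/folio-01.tiff"),
    ("https://example.org/images/folio-02.tiff", "-", "data/images/folio-02.tiff"),
]

def fetch_txt(entries):
    """Render fetch.txt lines for the given remote payload entries."""
    return "".join(f"{url} {length} {path}\n" for url, length, path in entries)

print(fetch_txt(remote_files))
```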
In a world of collaborative research this is potentially important: one archive might hold, say, the transcriptions of some medieval texts, while someone else entirely holds the high-resolution images that the transcriptions were compiled from.
So we chose BagIt. To get round our problem of losing access to the technical metadata, we feed files on ingest through Fido and Tika and store the technical metadata in an associated file inside the bag – the BagIt archive.
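A sketch of that idea: write the tool output as a JSON sidecar next to the payload file and record its checksum in the bag's manifest so the bag stays complete. The sidecar naming and JSON layout here are our own convention, not part of the BagIt spec, and the metadata values are stand-ins for what Fido (format identification) and Tika (content metadata) would actually report:

```python
import hashlib
import json
import pathlib

def add_technical_metadata(bag_dir, payload_name, metadata):
    """Store extracted technical metadata as a JSON sidecar in the bag's
    payload directory and append its checksum to the sha256 manifest."""
    bag = pathlib.Path(bag_dir)
    sidecar = bag / "data" / (payload_name + ".tech.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    digest = hashlib.sha256(sidecar.read_bytes()).hexdigest()
    with open(bag / "manifest-sha256.txt", "a") as manifest:
        manifest.write(f"{digest}  data/{sidecar.name}\n")
    return sidecar

# Toy bag for demonstration; a real bag would be built by a BagIt tool.
pathlib.Path("bag/data").mkdir(parents=True, exist_ok=True)
pathlib.Path("bag/bagit.txt").write_text(
    "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
add_technical_metadata(
    "bag", "recording.wav",
    {"format": "WAVE", "mime": "audio/x-wav"})  # stand-in tool output
```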
We have also looked at adding normalisation to the ingest process, even though we haven’t implemented it yet. Essentially, when we get a document file in a format we recognise, we feed it through a standard conversion tool and save a second copy, in a standard format, to the bag, following the National Archives of Australia’s normalisation guidelines out of project Xena. That way we will always have a version of the document that we can read with software current at the time, and from which we can generate other formats.
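The normalisation step would amount to a table mapping recognised ingest formats to preservation formats, plus a rule for naming the second copy stored in the bag. The table and naming convention below are illustrative assumptions, not the NAA's actual format list:

```python
import os

# Illustrative mapping of recognised formats to preservation formats,
# in the spirit of the NAA / Xena guidelines (not their actual list).
PRESERVATION_FORMAT = {
    ".doc":  ".odt",   # word processing -> OpenDocument Text
    ".docx": ".odt",
    ".xls":  ".ods",   # spreadsheets -> OpenDocument Spreadsheet
    ".bmp":  ".png",   # bitmap images -> PNG
}

def preservation_copy_name(filename):
    """Name for the normalised second copy to store in the bag,
    or None if the incoming format is not one we recognise."""
    stem, ext = os.path.splitext(filename)
    target = PRESERVATION_FORMAT.get(ext.lower())
    return None if target is None else stem + ".normalised" + target

print(preservation_copy_name("fieldnotes.docx"))  # → fieldnotes.normalised.odt
```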
Another idea we have looked at is feeding files through topic modelling software to do automatic keyword extraction – something that could potentially be used to improve the quality of search results.
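A real pipeline would run topic modelling (e.g. LDA) over the whole collection; as a stand-in, a crude term-frequency extractor shows where the keywords would come from and how they could feed a search index. Stopword list and example text are my own:

```python
import re
from collections import Counter

# A tiny stopword list for the sketch; a real one would be far larger.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "that", "we", "it", "for"}

def keywords(text, n=3):
    """Crude automatic keyword extraction: the most frequent non-stopword
    terms. A stand-in for topic modelling over the full collection."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]

doc = ("The transcription of the recording includes annotations, and the "
       "annotations link each transcription segment to the recording.")
print(keywords(doc))
# → ['transcription', 'recording', 'annotations']
```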