compression, workflows and manifests

In a digital archive, handling compound objects is complex, but given that all the objects in a particular data set are related, compressing the set and storing it as a single object makes a lot of sense.

This is exactly the same sort of trick as is pulled by OpenOffice/LibreOffice, or indeed by the epub format (wikipedia on epub). In fact epub is interesting as it has optional components, plus a set of rules about having a manifest file.

We could, of course, just use epub, or make up our own format, but it makes sense to start off at least with something with a bit of history behind it, such as BagIt, a format developed for library exchange. BagIt also has some nice features, such as adding checksums to the manifest file so that when the file is unpacked you can verify it against the original content.
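To make the checksum idea concrete, here is a minimal sketch using only the Python standard library. It assumes a BagIt-style layout with the payload under a data/ directory and a manifest-sha256.txt file holding one "checksum  path" line per file. The function names (sha256_of, write_manifest, verify_manifest) are made up for illustration and it is not a complete BagIt implementation.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir: Path) -> Path:
    """Write manifest-sha256.txt listing every file under bag_dir/data."""
    lines = []
    for path in sorted((bag_dir / "data").rglob("*")):
        if path.is_file():
            rel = path.relative_to(bag_dir).as_posix()
            lines.append(f"{sha256_of(path)}  {rel}")
    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(bag_dir: Path) -> list:
    """Return the payload paths that are missing or no longer match the manifest."""
    failures = []
    manifest = bag_dir / "manifest-sha256.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel = line.split(maxsplit=1)
        target = bag_dir / rel
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel)
    return failures
```

Running write_manifest when the set is assembled and verify_manifest after it is unpacked gives an empty list when everything has survived intact, and the names of any damaged or missing files otherwise.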

Now having a workflow to assemble the archival set is quite easy. Getting the data back is not quite so simple.

One of the dirty secrets of archives is that 80% of the data held in them is rarely if ever accessed, but of course we don’t know which 80%.

When we are talking about data archives for biggish data, say astronomy, earth sciences, phenomics, the data objects are comparatively large, so it makes sense to store them on tape in a nearline archival system using some form of HSM. It doesn’t really matter what form of HSM we use; the key point is that data sets have to be retrieved from tape, and that with large objects this takes time that may be noticeable to the user. Given that the tape robot may be shared with other applications using the HSM store, it’s conceivable that there might be a significant wait for the correct tape cartridge to be retrieved and loaded into the drive for reading.

The scenario we workshopped is something like this:

  • user browses/searches store and finds dataset of interest
  • user authenticates via Shibboleth
  • user requests access to data file
  • {optional request validation process for data subject to cultural, ethical or other restrictions}
  • data file retrieved from nearline storage to a cache store
  • user is sent obfuscated link to download the file
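The last two steps of that scenario can be sketched roughly as follows. This is an illustration only: it assumes authentication has already happened, it stands in for the HSM recall with a plain file copy, and the endpoint URL, the one-week lifetime and the function names (stage_and_link, resolve) are all invented for the example.

```python
import secrets
import shutil
import time
from pathlib import Path

BASE_URL = "https://archive.example.org/fetch/"  # hypothetical download endpoint
LINK_LIFETIME_SECONDS = 7 * 24 * 3600            # assumed one-week validity

# token -> (cached file, requesting user, expiry time).
# An in-memory dict keeps the sketch short; a real service would persist this.
_links = {}


def stage_and_link(archive_path: Path, cache_dir: Path, user_id: str) -> str:
    """Copy a data set into the cache store and return an unguessable URL."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    token = secrets.token_urlsafe(32)
    cached = cache_dir / f"{token}{archive_path.suffix}"
    # Stand-in for the nearline retrieval: with a real HSM this read may
    # block while the tape cartridge is fetched and loaded.
    shutil.copyfile(archive_path, cached)
    _links[token] = (cached, user_id, time.time() + LINK_LIFETIME_SECONDS)
    return BASE_URL + token


def resolve(token: str):
    """Return the cached file for a token, or None if unknown or expired."""
    entry = _links.get(token)
    if entry is None:
        return None
    cached, _user, expires = entry
    if time.time() > expires:
        return None
    return cached
```

Because the token is random rather than derived from the file name, the link gives nothing away about the data set, and recording the user against the token keeps the "we know who asked" property of the workflow.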

We like this scenario for a whole lot of reasons, including that we can use the same workflow for unrestricted and restricted access material, and that we know who is asking for access to a data set. Mandating Shibboleth might seem a restriction, but the Australian Access Federation (the people who run Shibboleth and eduroam here) are talking about providing a user registration service to let government users, unaffiliated researchers, and others register. Unaffiliated researchers include both citizen scientists and people working for commercial consultancies.

To implement this we need something a little like YouSendIt to do the file handling. YouSendIt is of course a commercial product, but we happened across an equivalent from the University of Southampton which is open source and written in PHP. I contacted the guy who wrote it (copy of email attached as pdf), and it looks as if it would not be too difficult to modify his code to make it callable from the command line and to interface it with Shibboleth.

So everything looks good so far.

Except we’ve just sent the user a link to download a BagIt archive, something that, while they could crack it open using zip, is probably not a whole lot of use to them, given that BagIt doesn’t have any rules as to what method is used to serialize the files, i.e. turn them into one very big archive file.

This is where epub rides to the rescue. Epub is similarly a serialised format with files in a specific order, and it uses zip, which has the great advantage of being universal and platform agnostic. While gzip is more efficient in terms of compression, and tar, being compression free, is faster to read and write, the problem is these pesky Windows machines: there’s a lot of them out there and none of them natively use tar or gzip.

So we need to do the following:

  • when we serialize the files into an archive file, use zip as it’s much more platform agnostic than gzip. The same applies to tar
  • make sure we add a boilerplate readme.1st file to the zip file telling the user what files are present in the archive and what they all mean – this can be done during the retrieval process by modifying the zipfile prior to sending out the obfuscated link
  • provide help text and perhaps a second copy of the readme.1st message in the mail message sending out the obfuscated link
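As a rough sketch of the first two points, Python's standard zipfile module can both build the archive and append a readme.1st to an existing one. The function names and the descriptions mapping (member name to a one-line explanation) are assumptions made for the example. Where those descriptions would actually come from is left open.

```python
import zipfile
from pathlib import Path


def zip_dataset(source_dir: Path, zip_path: Path) -> None:
    """Serialize every file under source_dir into a zip archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                archive.write(path, arcname=path.relative_to(source_dir).as_posix())


def add_readme(zip_path: Path, descriptions: dict) -> str:
    """Append a readme.1st listing the archive members and what they mean."""
    # Mode "a" adds to the existing archive without rewriting its members.
    with zipfile.ZipFile(zip_path, "a", compression=zipfile.ZIP_DEFLATED) as archive:
        lines = ["Files in this archive:", ""]
        for name in archive.namelist():
            note = descriptions.get(name, "no description available")
            lines.append(f"{name}: {note}")
        text = "\n".join(lines) + "\n"
        archive.writestr("readme.1st", text)
    return text
```

add_readme returns the text it wrote, so the same message could be pasted into the mail that carries the obfuscated link. It should only be called once per cached copy, since a second call would add a duplicate readme.1st entry.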

About dgm

Former IT professional, previously a digital archiving and repository person, ex research psychologist, blogger, twitterer, and amateur classical medieval and nineteenth century historian ...