Dataset ingest and harvesting

Life should be simple, which means that the process of getting data into a repository should be as simple as possible.

If we want real users to upload data we need to give them a Flickr-type experience. Here’s my dataset, here’s some information about it, here’s the research paper it links to, bang, it’s uploaded.

Hence the interest around the web in the use of SWORD and SWORD-like protocols for data ingest. One can imagine a scenario something like this:

  • User completes an online submission form
  • User uploads the dataset using an HTTP upload mechanism
  • The system validates the file type using DROID, JHOVE, etc.
  • We package the file with some of the user data and context using BagIt or similar

There are two problems with this scenario – one is that the HTTP upload limit may stop the ingest of large datasets, and the second is that your file type validation tool may not know or understand the file type.

Now a lot of research results are presented as spreadsheets. And if all data were presented as such, life would be simple.
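To make the second problem concrete, here is a toy sketch of signature-based identification. It is not how DROID or JHOVE are implemented; the signature table is a tiny hand-picked subset and the function name `identify_format` is made up for illustration. The point is simply that anything the table does not know about comes back as unknown.

```python
from typing import Optional

# Toy signature table. Assumption: a hand-picked subset for illustration only,
# nothing like the full signature registries real identification tools use.
SIGNATURES = [
    (b"PK\x03\x04", "zip"),
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]


def identify_format(header: bytes) -> Optional[str]:
    """Return a format name if the leading bytes match a known signature."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    # An instrument-specific binary format would typically end up here.
    return None
```

In use, one would read the first few bytes of the uploaded file and pass them in; a `None` result means the system has to fall back on what the user says the file is.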

However these spreadsheets usually represent the end point of an analysis, not the input data from the experiment. If we are talking about reuse and substantiation of results we actually need to store the data analysed, not the results of the analysis.

This starts to get messy. For example a number of biological sequencers use a zip format to export the results of analysis runs. Zip is used simply as a means to combine the results of the sequence analysis along with a configuration file describing how the machine was configured during the run, and perhaps a log file to cover any anomalies encountered.

The resulting experiment zip files are typically around 300MB – certainly within the HTTP upload restrictions of modern browsers. Of course the actual dataset may contain several of these, and together they can exceed the HTTP upload limit.

I actually have no idea how much of an issue this is in practice. In fact I’d really like to know, as this has serious design implications.

Going back to the Flickr analogy, the user may wish to upload multiple experiment files, but each could be uploaded as a single HTTP upload, and the set could then be packed with BagIt to maintain the relationship between them and ingested. (BagIt seems preferable to zip as it has a concept of dealing with files stored by reference.)
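As a rough sketch of what that packaging might look like, the following writes a minimal BagIt-style directory using only the Python standard library. The function name `make_bag`, its parameters and the choice of SHA-256 are my assumptions, and it is not a complete or validated implementation of the BagIt specification. Files that were uploaded go under `data/`; files held elsewhere are listed in `fetch.txt` by URL, which is the "stored by reference" idea.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large experiment files are not read into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir, payload_files, fetch_entries=(), info=None) -> Path:
    """Write a minimal BagIt-style bag (hypothetical helper, not a full implementation).

    payload_files: local files to copy into data/
    fetch_entries: (url, size_in_bytes, sha256_hex, filename) for files held by reference
    info: optional dict of label -> value pairs for bag-info.txt
    """
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)

    # Bag declaration file.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )

    manifest, fetch = [], []
    for src in map(Path, payload_files):
        dest = data / src.name
        shutil.copyfile(src, dest)
        manifest.append(f"{sha256_of(dest)}  data/{src.name}")

    # Referenced files are not copied, but still appear in the manifest.
    for url, size, checksum, name in fetch_entries:
        fetch.append(f"{url} {size} data/{name}")
        manifest.append(f"{checksum}  data/{name}")

    (bag / "manifest-sha256.txt").write_text("\n".join(manifest) + "\n", encoding="utf-8")
    if fetch:
        (bag / "fetch.txt").write_text("\n".join(fetch) + "\n", encoding="utf-8")
    if info:
        lines = [f"{label}: {value}" for label, value in info.items()]
        (bag / "bag-info.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return bag
```

The user-supplied context from the submission form would go in through `info`, so the experiment files and the description of them travel together.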

This still leaves us with the validation problem. Feed one of these through DROID and it can prove it’s a valid zip file, which means you could unpack it and inspect the contents. At this point any hope of easily dealing with the resulting device-specific project files starts to fall apart.
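A small sketch of that "unpack and inspect" step, using Python's standard `zipfile` module. The function name `summarise_zip` is made up; all it does is confirm the container is a zip and count the file extensions inside, which is about as far as generic tooling gets before device-specific knowledge is needed.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarise_zip(path):
    """Return a count of file extensions inside a zip, or None if it is not a zip."""
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as zf:
        # Skip directory entries; only real members are of interest.
        names = [item.filename for item in zf.infolist() if not item.is_dir()]
    # Members with no extension are grouped under "(none)".
    return Counter(PurePosixPath(name).suffix.lower() or "(none)" for name in names)
```

Knowing that an archive holds, say, one `.xml` file and a handful of opaque binary files tells you the container is sound, but nothing about whether the contents are a valid run from a given instrument.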

If you know that the data comes from a sequencer of type X and that type X writes its configuration data using XML, you might be able to extract some of the technical metadata. On the other hand, you might not. Unlike in a print and image repository, where you are only dealing with a small number of formats, with data you could be dealing with a range of file types that may or may not be well described and may not be readily parsable.
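Here is a hedged sketch of the "might be able to, might not" case. It assumes a purely hypothetical instrument whose export zip contains a file called `config.xml` with simple leaf elements under the root; real instruments will differ, and each one would need its own extractor. Anything unexpected returns `None` rather than guessing.

```python
import zipfile
import xml.etree.ElementTree as ET


def extract_config_metadata(zip_path, config_name="config.xml"):
    """Try to pull flat technical metadata out of an instrument export.

    config_name is a hypothetical file name; returns None if the archive,
    the member or the XML cannot be read.
    """
    try:
        with zipfile.ZipFile(zip_path) as zf:
            with zf.open(config_name) as fh:
                root = ET.parse(fh).getroot()
    except (KeyError, zipfile.BadZipFile, ET.ParseError):
        # KeyError: no such member in the archive.
        return None
    # Keep only leaf elements directly under the root; nested structure is ignored.
    return {
        child.tag: (child.text or "").strip()
        for child in root
        if len(child) == 0
    }
```

A `None` result is the common case for any device nobody has written an extractor for, which is where the user-supplied description has to take over.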

Which means we have to rely on the user telling us some things about the experiment, such as the instrument used. And people make mistakes. They just do, nothing pejorative about it.

We shouldn’t, of course, panic about this. Any one institution is likely to have only a small number of such devices, but there is a maintenance cost to any such solution, and it has implications for any automated ingest of results files.


About dgm

IT professional, ex research psychologist, blogger, twitterer, and amateur classical and medieval historian ...