The Generic data capture solution

Over the past fifteen months we’ve been working on developing a data capture and data management solution. We started off with five related projects. Early on we took the decision to design a single solution that allows both data capture, i.e. the ingest of data from instruments and automated data sources directly into a data management solution, and the development of the data management solution itself.

Data capture and ingest

Data capture is about two things: automating ingest and managing content. To be generic, the automated ingest mechanism must be capable of being used in a number of ways and needs to be robust.

Equally, the content management must be standard, which essentially means building a standard content repository management solution, one that is sufficiently open to allow non-standard means of ingest.

We chose Fedora Commons because it was:

a) agnostic as regards storage – this meant that we could present storage as an NFS share and offload storage backup and replication to the storage solution, be it Isilon, a Dell object store, or whatever.

b) able to store objects by reference, i.e. by a pointer or URI. This was potentially important to us, as having datasets stored on storage not directly managed by the application, such as cloud-based storage, was in the long-term game plan for this project, even though it wasn’t on the deliverables list for phase one (there is a minimal sketch of this after the list below).

c) tried and tested. Fedora Commons manages a large number of complex collections, meaning that it would scale and cope with a large range of object types.
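
To make point (b) concrete, here is a minimal sketch of storing a datastream by reference against a Fedora 3.x style REST API. It is written in Python with the requests library; the host, credentials and datastream names are placeholders rather than our actual configuration, and the parameter names should be checked against the Fedora documentation for the version in use.

```python
# Minimal sketch: create a Fedora object and attach a datastream "by reference"
# (controlGroup "E" = externally referenced content), so the bytes stay on
# NFS or cloud storage outside the repository and only the pointer is stored.
# Host, credentials and URLs below are illustrative placeholders.
import requests

FEDORA = "http://localhost:8080/fedora"
AUTH = ("fedoraAdmin", "fedoraAdmin")  # placeholder credentials

# 1. Create a new object; Fedora assigns the next available PID and returns it
resp = requests.post(f"{FEDORA}/objects/new",
                     params={"label": "Example dataset"}, auth=AUTH)
resp.raise_for_status()
pid = resp.text.strip()

# 2. Add a datastream whose content lives at an external location
requests.post(f"{FEDORA}/objects/{pid}/datastreams/DATA",
              params={"controlGroup": "E",
                      "dsLabel": "Dataset payload",
                      "mimeType": "application/zip",
                      "dsLocation": "http://storage.example.edu/datasets/example.zip"},
              auth=AUTH).raise_for_status()
print("Created", pid)
```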

Fedora, of course, does not have a user interface out of the box; instead it provides a set of APIs to allow you to develop an interface.

This is both a blessing and a curse. Being able to build an interface that met our requirements, and to skin it differently for different collections, turned out to be important. Clients turned out to feel strongly that their work should be showcased as something individual: anthropologists wanted to see the anthropology collections, astronomers wanted to search in a way that made sense to them, and so on.

The API-based framework also meant that building an automated tool to upload data was considerably easier. If the web interface was a client, so was the ingest tool.

So, in the same way that Google’s Gmail has a web client, an Android client and an iPhone client, all of which use the same set of APIs even though they are different and incompatible applications, the API-focused architecture of Fedora meant that we could develop both a web client and an automated ingest client.

Automated ingest also meant that we could not predict what files we would get in any one upload, or their quality. In a non-generic solution it would be possible to constrain the sources and the types of files expected or handled, but in a generic solution things are considerably more fluid.

To cope with this we changed the ingest workflow so that files were fed through Fido (an implementation of PRONOM for the National Library of the Netherlands) to verify the file format, through Apache Tika to extract as much embedded metadata from the files as possible, and through a virus checker to guard against ingesting infected files.

In the case of virus checking we did not reject content that produced a positive result, as some of the data we were uploading could potentially incorporate binary data that resembled a virus signature.

As a guard against corruption, the MD5 checksum of each file was also computed and stored with the object metadata.
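
As an illustration of that characterisation step, the sketch below shells out to the command-line forms of the tools (Fido, the Tika application jar, clamscan) and computes the MD5 with Python’s standard library. The tool paths, flags and the way results are recorded are assumptions made for the example, not our production workflow.

```python
# Sketch of the ingest characterisation step: format identification, embedded
# metadata extraction, virus scan and MD5 checksum. Paths and flags are assumptions.
import hashlib
import subprocess

def characterise(path):
    report = {}

    # Format identification: Fido prints PRONOM matches as CSV on stdout
    fido = subprocess.run(["fido", path], capture_output=True, text=True)
    report["format"] = fido.stdout.strip()

    # Embedded metadata via the Apache Tika application jar (--metadata)
    tika = subprocess.run(["java", "-jar", "tika-app.jar", "--metadata", path],
                          capture_output=True, text=True)
    report["metadata"] = tika.stdout

    # Virus scan: flag rather than reject, since legitimate binary data can
    # resemble a signature. (Non-zero exit covers both matches and scan errors.)
    clam = subprocess.run(["clamscan", "--no-summary", path])
    report["virus_flagged"] = clam.returncode != 0

    # MD5 checksum, stored with the object metadata as a guard against corruption
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    report["md5"] = md5.hexdigest()
    return report
```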

As it stood, this gave us all the functionality needed to ingest and fully describe the content deposited with us, except for one important file type.

One of the data formats that we were going to store was FITS, a format commonly used for astronomical image data. FITS is a self-describing format, i.e. the information stored in the file headers describes the content of the image file. Besides a number of mandatory header values, a FITS header can contain a large number of other values defined according to an external dictionary file. This makes FITS extremely flexible and able to deal with a vast range of data types, but almost impossible to develop a universal parser for, due to the potentially large number of parameters.

Apache Tika merely tests the file to confirm that it conforms to the FITS definition. Working with our data providers, we developed a partial parser that extracts a number of key values from the FITS files produced by a particular instrument, to allow the generation of a comprehensive metadata description.

The parser could easily be extended to extract other values stored in FITS files to allow the automated generation of appropriate metadata from other instruments. The parser is in the process of being contributed back into the Apache Tika project.
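
For illustration only, the snippet below pulls a handful of commonly used header keywords out of a FITS file using the astropy library. This is not the Tika parser itself (which is written in Java against the Tika API), and the keyword list is an assumption; which keywords actually appear depends on the instrument, which is exactly why a universal parser is impractical.

```python
# Read a few standard/common FITS header keywords from the primary HDU.
# The keyword list is illustrative; real instruments define many more.
from astropy.io import fits

def extract_fits_metadata(path, keywords=("OBJECT", "DATE-OBS", "TELESCOP",
                                          "INSTRUME", "EXPTIME", "RA", "DEC")):
    with fits.open(path) as hdul:
        header = hdul[0].header
        return {key: header[key] for key in keywords if key in header}
```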

With the web interface we also had a comprehensive data management application which could be presented in a number of ways. To demonstrate this we developed a custom view for the Pacific Manuscripts Bureau that allowed them to present their content as if it were hosted in a custom repository rather than in a larger, more generic one.

Using this architecture, automated ingest was comparatively straightforward.

We developed a specialist upload tool that could be run standalone, called from a script, or initiated by a web service request.

In all cases the data provider had to write all the data files out to some temporary storage and generate a parameter file that described the data being uploaded: basically a list of the files to upload and the bare-bones metadata to describe the dataset. Datasets could consist of a single file or a set of related files. On the repository side the dataset and accompanying technical metadata are stored in a bag, a serialised zip file conforming to the Library of Congress BagIt format. This meant that the atomic unit of storage was the dataset and not the individual files, which simplified the maintenance of the relationships between the individual objects in the dataset. The dataset was essentially treated as if it were always a compound object.
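
The sketch below shows the general shape of that packaging step: a simple parameter file listing the files plus bare-bones metadata, and a bag built with the Library of Congress bagit-python library so that the dataset travels as a single compound object. The JSON parameter format and field names here are hypothetical, not the actual format the uploader consumes.

```python
# Package a dataset: write a (hypothetical) parameter file, copy the payload
# into a staging directory, and turn the whole thing into a BagIt bag.
import json
import shutil
import bagit  # pip install bagit (Library of Congress bagit-python)

def package_dataset(files, metadata, staging_dir):
    params = {"title": metadata["title"],
              "creator": metadata["creator"],
              "files": [str(f) for f in files]}
    with open(f"{staging_dir}/params.json", "w") as fh:
        json.dump(params, fh, indent=2)

    for path in files:
        shutil.copy(path, staging_dir)

    # make_bag converts the directory in place: the payload moves under data/
    # and per-file checksums are written, so corruption is detectable later
    return bagit.make_bag(staging_dir, checksums=["md5"])
```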

The uploader could be used in a variety of ways. For example, Earth Sciences wrote out a set of files from a waveform analysis in a self-describing disk structure, i.e. the names of the directories followed a naming convention that allowed them to generate the parameters in a standard format, and then invoked the uploader from a script.
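
A hypothetical version of that convention-driven approach might look like the following; the project/instrument/date layout is invented for illustration, since the real naming convention belonged to Earth Sciences.

```python
# Derive upload parameters from a self-describing directory layout
# (assumed here to be .../<project>/<instrument>/<date>/ - purely illustrative).
from pathlib import Path

def parameters_from_directory(dataset_dir):
    project, instrument, date = Path(dataset_dir).resolve().parts[-3:]
    files = sorted(str(p) for p in Path(dataset_dir).iterdir() if p.is_file())
    return {"title": f"{instrument} waveform data, {date}",
            "project": project,
            "files": files}
```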

Astronomy followed a similar strategy. Astronomical instruments are, in the main, an expensive shared resource. In the case of the WiFeS instrument, access is controlled by the Telescope Access Control (TAC) database, which manages instrument time between a number of parallel projects.

The TAC database identifies the start and end times of a time slot allocated to a particular experimental project, the project itself, and a number of other relevant fields.

Images are acquired in FITS format and are written to local temporary storage at Siding Spring Observatory.

A daily batch job queries the TAC database, assembles the images acquired overnight into datasets, and uploads them to the Data Commons at the ANU main site in Canberra using the general-purpose data uploader tool, with a parameter file generated from the TAC query.
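
In outline, that nightly job looks something like the sketch below. The TAC schema, the uploader’s command line, the staging paths and the use of the DATE-OBS header to assign images to time slots are all assumptions made for the example.

```python
# Sketch of the nightly batch job: find last night's TAC allocations, gather
# the FITS images taken in each slot, and hand them to the uploader.
import datetime
import glob
import json
import sqlite3      # stand-in for whatever engine the TAC database actually uses
import subprocess
from astropy.io import fits

def observation_time(path):
    # DATE-OBS is a standard FITS keyword holding an ISO-format timestamp
    with fits.open(path) as hdul:
        return hdul[0].header.get("DATE-OBS", "")

def nightly_upload(tac_db_path, image_dir):
    last_night = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()
    conn = sqlite3.connect(tac_db_path)
    slots = conn.execute(
        "SELECT project_id, start_time, end_time FROM allocations "
        "WHERE date(start_time) = ?", (last_night,)).fetchall()  # hypothetical schema

    for project_id, start, end in slots:
        # Assign each overnight image to its slot (ISO strings compare lexically)
        images = [f for f in sorted(glob.glob(f"{image_dir}/*.fits"))
                  if start <= observation_time(f) <= end]
        if not images:
            continue
        params_path = f"/tmp/{project_id}_params.json"
        with open(params_path, "w") as fh:
            json.dump({"project": project_id, "files": images}, fh)
        # Invocation of the general-purpose uploader is illustrative only
        subprocess.run(["datacommons-upload", "--params", params_path], check=True)
```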

On ingest, dataset metadata is automatically generated from a combination of the data obtained from the TAC database and the FITS header data from the images themselves, using the Apache Tika parser developed as part of the project.

Phenomics took a different approach. Data is aggregated within the APN experiment management solution and tied together by a common strain identifier that is used throughout APN to identify tissue samples.

Aggregation takes place within the APF Web Portal server, a Liferay-based web portal that supports a range of applets to allow specific views of the data.

An automated query aggregates all data for a particular project or strain ID and initiates a service request to the ANU Data Commons, passing the data and metadata required to create a dataset within the Data Commons and initiate an ingest of the aggregated data.
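
The pattern of that machine-to-machine deposit is sketched below. The endpoint, payload fields and authentication are hypothetical placeholders used to show the shape of the request, not the actual ANU Data Commons service interface.

```python
# Sketch: aggregate records for one strain ID and ask the Data Commons to
# create a dataset and ingest the referenced files. All names are placeholders.
import requests

def deposit_strain_data(strain_id, records, file_urls):
    payload = {
        "title": f"Phenomics data for strain {strain_id}",
        "strain_id": strain_id,
        "records": records,    # aggregated experiment metadata
        "files": file_urls,    # locations of the data to be ingested
    }
    resp = requests.post("https://datacommons.example.edu/api/datasets",  # placeholder URL
                         json=payload,
                         headers={"Authorization": "Bearer <token>"})     # placeholder auth
    resp.raise_for_status()
    return resp.json()["dataset_id"]  # hypothetical response field
```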

On ingest a dataset description is created and the data deposited on storage managed by the ANU Data Commons. Metadata can then be reviewed for quality control purposes before being published to Research Data Australia.

The ANU Data Commons also allows the automated minting of a Digital Object Identifier for each dataset published to Research Data Australia as an aid to citation.
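
For completeness, the general shape of such a minting request is sketched below. The endpoint and fields are placeholders; the real minting goes through the ANDS DOI service behind the Data Commons, whose API is not reproduced here.

```python
# Hypothetical sketch of requesting a DOI for a published dataset.
import requests

def mint_doi(dataset_id, landing_page_url, title, creators):
    resp = requests.post("https://datacommons.example.edu/api/doi",  # placeholder URL
                         json={"dataset": dataset_id,
                               "url": landing_page_url,
                               "title": title,
                               "creators": creators})
    resp.raise_for_status()
    return resp.json()["doi"]
```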

With digital humanities we took an approach similar to that taken with Phenomics. The Digital Humanities Hub has recently developed OCCAMS, an application for the aggregation and annotation of non-text-based humanities research material.

This material is principally in the form of images, but may include recordings and transcriptions of spoken language. Full details of the OCCAMS system are online at http://dhh.anu.edu.au/occams.

The solution here is that once researchers have assembled a collection within OCCAMS to their satisfaction, they are able to select an option to automatically deposit the collection into the ANU Data Commons by means of a web service request, with the appropriate parameters generated programmatically.

As with Phenomics, a dataset description is created on ingest and the data deposited on storage managed by the ANU Data Commons; metadata can then be reviewed for quality control before being published to Research Data Australia.

We set out to build a generic solution, and we have been able to demonstrate that it is one: another project has already reused our infrastructure to push through their own datasets, incidentally saving themselves development time.

Supporting Materials

A short presentation summarising the data management solution is available for download or can be viewed as a PDF. All code has been posted on GitHub and is freely available for download from https://github.com/anu-doi/anudc.

This project is supported by the Australian National Data Service (ANDS). ANDS is supported by the Australian Government through the National Collaborative Research Infrastructure Strategy Program and the Education Investment Fund (EIF) Super Science Initiative.

Rahul Khanna, Genevieve Turner, Lisa Bradley and Martin Hamilton were responsible for the heavy lifting and turning dreams into working code.

 

 
