I remember being told in primary school that you should start at the beginning and work through to the end.
Of course life is not quite so linear, but on the other hand it’s a good enough theory to start with. If you’re not sure where the beginning of a project is, or indeed the end, start at what you think is the beginning and work forward.
This is, in essence, generating a project plan: what you want to do, what you need in order to do the work, who is going to do the work, and in what order.
What we want to do
Build a general data capture and data archiving architecture for scientific data – well any data really.
How we are going to do it
The diagram above gives the basic design. We want to acquire data from some sort of scientific instrument and then ingest it into a digital repository, making the data available for further search and analysis.
Set out as dot points, the logic looks something like this:
- acquire information from the instrument and write it to an initial data store
  - how much data?
  - from what sort of interface?
  - do we need to do anything special along the way?
  - do we push or pull the data?
- ingest the data
  - read the raw data store
  - process the data to convert it to the desired archival formats
  - how much processing power do we need?
- upload the data into a digital repository
  - generate the item-level metadata
  - ingest the item into the long-term data store
- assemble the data into a collection/thematic grouping of some sort
- in parallel with the above
  - generate collection-level metadata, i.e. metadata that describes the grouping
  - publish this metadata to a repository
  - allow harvesting of this metadata by OAI-PMH harvesters/external data registries
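The pipeline above can be sketched in a few lines of code. This is purely illustrative: the function and class names (acquire, ingest, assemble_collection) and the in-memory stores are assumptions standing in for a real instrument interface and a real digital repository.

```python
# Minimal sketch of the acquire -> ingest -> collect pipeline described above.
# Everything here is a stand-in: a real system would pull from an instrument
# interface and push to a digital repository over the network, not use dicts.

raw_store = []        # initial data store (the acquisition cache)
repository = {}       # stands in for the digital repository
collections = {}      # collection-level groupings

def acquire(instrument_reading: bytes) -> None:
    """Pull (or receive a push of) raw data from the instrument."""
    raw_store.append(instrument_reading)

def ingest() -> list:
    """Read the raw store, convert to an archival format, add item-level metadata."""
    items = []
    for i, raw in enumerate(raw_store):
        item = {
            "id": f"item-{i}",
            "data": raw.decode("ascii"),        # stand-in for format conversion
            "metadata": {"size_bytes": len(raw)},  # item-level metadata
        }
        repository[item["id"]] = item
        items.append(item)
    return items

def assemble_collection(name: str, items: list) -> dict:
    """Group items and generate collection-level metadata describing the grouping."""
    collections[name] = {
        "items": [it["id"] for it in items],
        "metadata": {"name": name, "count": len(items)},
    }
    return collections[name]

acquire(b"obs-001")
acquire(b"obs-002")
coll = assemble_collection("night-2024-05-01", ingest())
```

The collection-level metadata is generated independently of the item-level metadata, which is what lets the grouping step run in parallel with ingestion.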
The whole idea is that a set of observations can have more value than the individual observations on their own. A simple example is comet hunting.
A collection could be a set of images of the night sky taken with various instruments on a particular date.
Comparing the equivalent images in collections from different dates could help identify astronomical phenomena by looking for differences – basically checking whether something is there one night that is not there the next.
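The night-to-night comparison can be sketched as a simple set difference. This assumes each collection has already been reduced to a set of detected sky positions, which is a large simplification of real image differencing.

```python
# Illustrative only: represent each night's collection as the set of
# (RA, Dec) positions where an object was detected, then diff two nights
# to find candidates present on one night but not the other.

night1 = {(10.2, 41.3), (83.8, -5.4), (201.7, -11.2)}  # positions in degrees
night2 = {(10.2, 41.3), (83.8, -5.4), (201.9, -11.2)}  # one object has moved

appeared = night2 - night1   # there tonight, not there last night
vanished = night1 - night2   # there last night, gone tonight

# A moving object such as a comet shows up in both differences:
# it vanishes from its old position and appears at a new one.
```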
Nothing in this design precludes us from ingesting pre-existing data – we simply treat the disk it is stored on as a separate acquisition cache.