We have our data set. How we acquired it is no longer our concern; it is sitting, spinning, on a filesystem somewhere. We also understand the format of our data set and how much of it we have.
So our questions are now:
- do we know and understand the format we wish to ingest?
- do we need to transform the raw data?
- are there existing libraries or crosswalks available to do this? (a sketch follows this list)
- do we have all the data we need to generate the required metadata?
- how much disk storage will we need for the pre-ingest store?
- how long does processing take?
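To make the crosswalk question concrete, here is a minimal sketch of a field-level metadata crosswalk. The field names are hypothetical; in practice, published crosswalks (MARC to Dublin Core, for instance) often come with existing library support, which is exactly what this question is probing for.

```python
# A minimal metadata crosswalk sketch: map source fields onto the target
# ingest schema. All field names here are hypothetical examples.
CROSSWALK = {
    "obs_title":  "dc:title",
    "observer":   "dc:creator",
    "obs_date":   "dc:date",
    "instrument": "dc:description",
}

def crosswalk(source_record: dict) -> dict:
    """Translate a source metadata record into the target schema."""
    return {CROSSWALK[k]: v for k, v in source_record.items() if k in CROSSWALK}

record = {"obs_title": "M31 survey", "observer": "J. Smith", "exposure": 120}
print(crosswalk(record))  # note: unmapped fields are silently dropped - flag these!
```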
The last two questions are crucial. The data files for ingest should be smaller than the raw data, but we could conceivably want to generate multiple instantiations at this stage (say, a JPEG thumbnail of a FITS image data set) and we will need to store all of these for ingest. Equally, we need to ensure that we have enough processing power: automatic data acquisition gives us a race condition in which all raw data must be processed before the next data set is ready for processing …
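Generating a derivative like that thumbnail is cheap to prototype. A minimal sketch, assuming astropy and Pillow are available, that the first HDU holds the image, and with illustrative file paths and a simple percentile stretch:

```python
import numpy as np
from astropy.io import fits
from PIL import Image

def fits_to_jpeg_thumbnail(fits_path, jpeg_path, size=(256, 256)):
    """Render the first image HDU of a FITS file as a greyscale JPEG thumbnail."""
    with fits.open(fits_path) as hdul:
        data = np.nan_to_num(hdul[0].data.astype(np.float64))
    # Stretch to 0..255, clipping at the 1st/99th percentiles to tame outliers
    lo, hi = np.percentile(data, (1, 99))
    scaled = np.clip((data - lo) / max(hi - lo, 1e-12), 0.0, 1.0) * 255
    img = Image.fromarray(scaled.astype(np.uint8), mode="L")
    img.thumbnail(size)  # shrinks in place, preserving aspect ratio
    img.save(jpeg_path, "JPEG")

fits_to_jpeg_thumbnail("raw/observation.fits", "pre_ingest/observation_thumb.jpg")
```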
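The race condition and the storage question are both worth checking with simple arithmetic before committing to hardware. A back-of-the-envelope sketch with made-up numbers; the acquisition rate, per-set processing time, derivative size and retention period are all assumptions to be replaced with measured values:

```python
# Back-of-the-envelope check: can processing keep pace with acquisition,
# and how big must the pre-ingest store be? All numbers are assumptions.
raw_sets_per_day = 24                 # one new raw data set per hour
processing_minutes_per_set = 45       # measured on a sample data set
derivative_bytes_per_set = 2 * 10**9  # transformed files plus thumbnails
retention_days = 14                   # how long sets wait in the pre-ingest store

busy_minutes = raw_sets_per_day * processing_minutes_per_set
print(f"processing load: {busy_minutes} min/day of {24 * 60} available")
assert busy_minutes < 24 * 60, "processing cannot keep up with acquisition"

store_bytes = raw_sets_per_day * retention_days * derivative_bytes_per_set
print(f"pre-ingest store: ~{store_bytes / 10**12:.1f} TB")
```

If the assertion fails, we have lost the race before we start: either the processing must be parallelised or the acquisition throttled.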