Over on one of my other blogs I have a slightly meandering post on the lost digital decades.
Meandering because I was trying to work some things out in a think piece. One of the things was how we can never really know the past because we can’t visit it, and the second (more relevantly here) was legacy data.
When we build a digital archive we assume that the content will be available and accessible for a reasonably long time. Since the documents – the digital objects – were created with particular applications, we need to think about curation and how we will provide long-term access.
There are a variety of approaches – the Project Xena approach of normalisation, and the more conservative approach of using PRONOM and/or JHOVE to extract file format information, then storing that information and using it to inform a bulk format migration strategy down the track.
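Tools like PRONOM (via DROID) and JHOVE work from databases of file-format signatures. As a toy illustration of the idea – very much a sketch, not a substitute for those tools – a minimal signature check might look like this:

```python
# Minimal file-format identification by magic bytes.
# A toy sketch only: PRONOM/DROID and JHOVE use far richer signature
# databases plus internal-structure validation.
MAGIC_SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy .doc/.xls)"),
    (b"PK\x03\x04", "ZIP container (possibly .docx/.xlsx/.odt)"),
    (b"\x1f\x8b", "gzip compressed data"),
]

def identify(path):
    """Return a coarse format guess for the file at path."""
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, label in MAGIC_SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"
```

In a real workflow the identification result would be stored alongside the object and used later to drive the batch migration.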
Personally I favour storing the original document plus a PDF, so that if the worst comes to the worst you can always recover the text by feeding the PDF through suitable OCR software, such as Tesseract.
This works fine for textual documents such as text files and files of numeric data. With video and image data, not to mention sound, different considerations apply, but the logic is the same: store the files in a common, well-documented format, and ensure that you have documented the technical metadata.
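Documenting the technical metadata can start as simply as recording a checksum, size and format label for each archived object in a sidecar record. A minimal sketch – the field names here are my own invention, not a metadata standard such as PREMIS:

```python
import hashlib
import os
from datetime import datetime, timezone

def technical_metadata(path, format_label):
    """Build a minimal technical-metadata record for an archived file.

    Field names are illustrative only, not drawn from any standard.
    """
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large media files don't need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            sha256.update(chunk)
    return {
        "filename": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": sha256.hexdigest(),
        "format": format_label,  # e.g. from a PRONOM/JHOVE identification pass
        "recorded": datetime.now(timezone.utc).isoformat(),
    }
```

The record can then be serialised (for instance with `json.dump`) as a sidecar file next to the object, so the fixity and format information travels with it.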
Absolutely fine for digital objects produced in the last decade or so. Microsoft had won the document format wars, so we can assume most documents are doc or docx, and most spreadsheets of numeric data xls or xlsx. PDF generation software was widely available, and LibreOffice/OpenOffice, along with AbiWord, provide a means of normalisation if you have a document in an exotic format (Project Xena uses OpenOffice under the hood).
The same period saw the rise of the USB drive and the CD, and the disappearance of the floppy and of exotic formats like the Iomega Zip drive.
It’s a reasonable assumption that we can read anything produced in the last decade.
Wind back a little further and that’s not the case. WordStar and later WordPerfect dominated word processing. Physicists and mathematicians tended to like AmiPro, as its equation editor used TeX under the hood.
QuattroPro was widely used as a spreadsheet. TeX was used to create drafts of research papers, and specialist dedicated minicomputer word processing systems like Wang and WPS+ still hung on.
There was a vast range of media formats – different tape cartridges, Zip drives and a multiplicity of floppy disk formats.
At the time it seemed normal, but looking back it was pretty chaotic, and attempts to establish institutional standards were uneven.
During that time I managed a format conversion service, which helped solve some of these problems by migrating data and documents while the original software and media were still current.
The real problem is that we now have the first generation of researchers and scientists who used digital technology on a day to day basis coming up to retirement, and there is a desire to recover some of their data, especially longitudinal and observational data, before it is too late.
And we face a set of challenges:
1) inadequate documentation as to what the data means and the tools used
2) inability to read the source media, due either to media corruption or to no longer having suitable equipment
3) inability to convert the source document format to something useful
There’s also a fourth problem – I’m 56. Shocking I know, but not only are the first digital scientists and researchers coming up for retirement, the people who knew about the tools and techniques at the time are also beginning to retire.
At the same time we are beginning to see the first glimmers of the realisation – through, for example, interest in the literary history of word processing – that we risk losing information about what happened in the eighties and nineties through our inability to read the source documents and data of the time.