Dataset and game archiving

Last week, J and I escaped to the south coast for a break.

Even though I took a computer, we didn’t have internet access, and I managed to disconnect almost completely.

But not quite – I checked my email on my phone, and one message I saw go past was about archiving computer games.

Now computer games are kind of an interesting problem. There are those that were written to get the absolute most out of the hardware of the time – games that drive graphics cards and poke the hardware in all sorts of ways, and really need an extremely good emulated environment. Then there are what we can loosely call ‘adventure’ games, in which the user navigates a virtual environment solving puzzles. These games are built on a database that tracks state, and a lot of the data – character attributes such as ‘nasty’, ‘has magical powers’, ‘likes honey’ – is expressed in XML. In fact the files look fairly close to an ontology representing the virtual world of the game.
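To make the ontology point concrete, here is a minimal sketch of the sort of XML character description such a game might carry, parsed with Python’s standard library. The element names, attribute names, and the character itself are purely illustrative assumptions, not taken from any real game.

```python
import xml.etree.ElementTree as ET

# A hypothetical character record of the kind an 'adventure' game database
# might hold; every name here is invented for illustration.
character_xml = """
<character name="Bruin" species="bear">
  <trait>nasty</trait>
  <trait>magical-powers</trait>
  <likes>honey</likes>
</character>
"""

root = ET.fromstring(character_xml)
traits = [t.text for t in root.findall("trait")]
print(root.get("name"), traits, root.find("likes").text)
```

Read this way, the XML is doing the same job an ontology does: naming entities, their classes, and their properties, which is why preserving the game data alone already captures a good deal of its virtual world.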

In my world view there are two sorts of data – experimental and observational. This view is based on personal experience of doing science.

A lot of people working in data archiving either have a library or a computer science background. Or else they’ve specialised in one particular discipline and moved into data management due to problems they’ve met handling their research data.

I’m different: I’ve had a rather different set of experiences.

I started out as an experimental scientist doing experiments in psychophysiology on physiological responses to stress in unpredictable environments.

Essentially, rats were trained to use a Skinner box in a standard manner – see the left-hand light, press a lever, get food; see the right-hand light, press a lever, get a shock. Then we made the environment unpredictable – sometimes pressing the lever got you nothing. Another variant was to randomly give a shock when the food lever was pressed, and vice versa.

Throughout the experiment we measured respired oxygen consumption and heart rate. What we showed was that stress levels went up more when there was a risk of getting a shock than when pressing for food simply delivered nothing, and that the more stressful things were, the higher the stress indicators rose – until it was just too stressful and the animals went on strike, and in one case one managed to escape and hide inside a photocopier.

The rats were all from the same strain. The randomness was controlled by an experimental control program we fed parameters into, and we processed the raw data to produce a set of datafiles which we uploaded to the university mainframe (this was 1980) to be processed using SPSS.

While we produced and uploaded a dataset for analysis, we had munged our raw data. We did keep the raw data, but on Cromemco Z2D 5.25” disks.

If you wanted to reproduce our results you would probably want our raw data, our derived dataset for analysis, our data-processing and experimental-control programs, and our SPSS script. That way you could check our original analyses and transformations and validate our conclusions.

You could also reanalyse our data – to save having to carry out the experiment again, which has obvious ethical advantages – and could validate our programs and scripts.

However, I ran out of time and funding, and instead of carrying on as an experimental scientist I went to work for a field research station, which didn’t do much in the way of experimental science but did a lot of observational science.

Ostensibly I was there to provide computing support but I ended up managing field survey teams looking at hedgerows.

There is a technique to age a hedgerow by looking at the number of species growing in a thirty-metre stretch. Basically, the more species, the older the hedgerow. Obviously it’s statistical and you need multiple samples, and we are assuming that old farmer Giles only used one species when he planted the hedge and that no one has removed intruders at a later date.

Despite all these possible problems, you can distinguish pre-enclosure hedges from post-enclosure hedges when you know the rough date of enclosure in an area from historical records.

If you get four old hedges you can start to assume that the field they enclose is also old. If that field is too small for Victorian ploughing, it may not have been subject to intensive agriculture and chemical fertilisers, and – in a hand-wavy way – may have been used primarily for holding stock, so its species mix may reflect that of earlier, pre-intensive agriculture.

The data produced is observational – it consists of sample locations and species lists. You could reanalyse it, say by plotting it in a GIS to show that the small fields and old hedges are on steeper ground, but all you need is the data. There’s no processing environment to preserve.

So I’m different because I’ve done both observational and experimental science. (There is a third type, where people compile data from sources – say from medieval tax records – but compiled data is really like observational data: once documented, all you need to do is preserve the data and some information about the sources.) I’ve also done a lot of work on file-format conversion, which has made me very aware of the pitfalls in reading older files and in coercing data between different systems.

The long-term preservation of observational data is not a hard problem: columns of numbers, plus some explanation of what the columns mean and, if necessary, the calculations used.
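A minimal sketch of what “columns of numbers plus some explanation” can look like in practice: a plain CSV file whose column descriptions travel with the data as comment lines. The field names, coordinate datum, and sample values are all invented for illustration.

```python
import csv
import io

# Hypothetical hedgerow survey records: (sample id, latitude, longitude,
# woody species count in the 30 m stretch). Entirely made-up values.
records = [
    ("H001", 52.123, -1.456, 5),
    ("H002", 52.130, -1.449, 2),
]

buf = io.StringIO()
# Self-describing header: the 'explanation of what the columns mean'
# is preserved in the same file as the numbers.
buf.write("# hedge_id: survey identifier for the 30 m sample stretch\n")
buf.write("# lat, lon: sample location in decimal degrees\n")
buf.write("# species: count of woody species in the stretch\n")
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["hedge_id", "lat", "lon", "species"])
writer.writerows(records)
print(buf.getvalue())
```

Nothing here needs a particular program, compiler, or operating system to be kept alive – which is exactly why observational data is the easy case.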

Experimental data is different. Although the tables of results look like those produced by observational science, the results derive from analyses performed at the time of collection. What you really need for reanalysis is the raw (or raw-ish) data, plus enough information to validate the published analysis.

For example, thirty years ago we used 8-bit computers, and as a consequence a certain amount of rounding went on in the way floating-point numbers were stored – it is possible, though unlikely, that this introduced an artefact into the data and exaggerated a particular value. Likewise our programs were written in Fortran or BASIC, and the particular compilers used may have had subtle errors in them.
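The storage-rounding effect is easy to demonstrate. The 8-bit era machines used their own floating-point formats rather than today’s standards, so as a stand-in assumption this sketch round-trips a modern double-precision value through 32-bit IEEE 754 storage to show the same class of error: a value that cannot be represented exactly comes back slightly changed.

```python
import struct

def round_trip_32bit(x):
    """Store a Python float (64-bit) in 32-bit IEEE 754 format and read it
    back, mimicking the precision lost when a small machine stores a value
    in a narrower floating-point format."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

x = 0.1
y = round_trip_32bit(x)
print(x == y)      # False: 0.1 is not exactly representable in 32 bits
print(abs(x - y))  # the small rounding error introduced purely by storage
```

Individually the errors are tiny, but they accumulate through derived calculations – which is why validating an old analysis can require knowing how the original environment stored and rounded its numbers, not just having the numbers.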

So, for reanalysis one needs to preserve the raw data. For the purposes of validation or reproducing the results one needs to record the environment in some way, if only to prove that the method used at the time was valid – and suddenly the whole experimental data preservation problem starts looking a lot like the computer game preservation problem.

About dgm

Former IT professional, previously a digital archiving and repository person, ex research psychologist, blogger, twitterer, and amateur classical medieval and nineteenth century historian ...